Wednesday, July 21, 2010

Wikisource for Manuscript Transcription

Of the crowdsourcing projects that have real users doing manuscript transcription, one of the largest is an offshoot of Wikisource. ProofreadPage is an extension to MediaWiki created around 2006 on the French-language Wikisource as a Wikisource/Internet Archive replacement for Project Gutenberg's Distributed Proofreaders. Contributors were taking DjVu files from the Internet Archive and using them as sources (via OCR and correction) for Wikisource pages. This spread to the other Wikisource sites around 2008, radically changing the way Wikisource worked. More recently the German Wikisource has started using ProofreadPage for letters, pamphlets, and broadsheets.


The best example of ProofreadPage for handwriting is Winkler's Remarks on the Russian Campaign 1812-1813. First, the presentation is lovely. They've dealt with a typographically difficult text and are presenting alternate typefaces, illustrations, and even marginalia in a clear way in the transcription. The page numbers link to the images of the pages, and they've come up with transcription conventions which are clearly presented at the top of the text. This is impressive for a volunteer-driven edition!

Technically, the Winkler example illustrates ProofreadPage's solution to a difficult problem: how to organize and display pages, sections, and works in the appropriate context. This is not an issue that I've encountered with FromThePage—the Julia Brumfield Diaries are organized with only one entry per page—but I've worried about it since XML is so poorly suited to represent overlapping markup. When viewing Winkler as a work, paragraphs span multiple manuscript pages but are aggregated seamlessly into the text: search for "sind den Sommer" and you'll find a paragraph with a page-break in the middle of it, indicated by the hyperlink "[23]". Clicking on the page in which that paragraph begins shows the page and page image in isolation, along with footnotes about the page source and page-specific information about the status of the transcription. This is accomplished by programmatically stitching pages together into the work display while excluding page-specific markup via a noinclude tag.
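The stitching described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ProofreadPage's actual code (which is a MediaWiki extension); the function names, the "[n]" marker format, and the sample page text are all assumptions made up for the example.

```python
import re

# Matches <noinclude>...</noinclude> blocks, including ones spanning several lines.
NOINCLUDE = re.compile(r"<noinclude>.*?</noinclude>", re.DOTALL | re.IGNORECASE)


def strip_noinclude(page_text):
    """Remove page-specific markup that should not appear in the work view."""
    return NOINCLUDE.sub("", page_text).strip()


def stitch_pages(pages):
    """Join (page_number, wikitext) pairs into one continuous text.

    A "[n]" marker is placed where each page begins, so a paragraph that
    runs across a page break stays in one piece in the work display.
    """
    parts = []
    for number, text in pages:
        parts.append(f"[{number}] {strip_noinclude(text)}")
    return " ".join(parts)


# Hypothetical sample pages: status notes and footnotes sit inside noinclude.
pages = [
    (22, "<noinclude>Status: proofread</noinclude>The paragraph begins here and"),
    (23, "continues on the next page.<noinclude>Footnotes for page 23</noinclude>"),
]
work_text = stitch_pages(pages)
print(work_text)
```

The page view would show each page's full text, noinclude blocks and all, while the work view shows only the stitched result with a marker at each page break.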

But the transcription of Winkler also highlights some weaknesses I see in ProofreadPage. All annotation is done via footnotes which—although they are embedded within the source—are a far cry from the kind of markup we're used to with TEI or indeed HTML. In fact, aside from the footnotes and page numbers, there are no hyperlinks in the displayed pages at all. The inadequacies of this for someone who wants extensive text markup are highlighted by this personal name index page — it's a hand-compiled index! Had the tool (or its users) relied on in-text markup, such an index could be compiled by mining the markup. Of course, the reason I'm critical here is that FromThePage was inspired by the possibilities offered by using wiki-links within text to annotate, analyze, edit and index, and I've been delighted by the results.
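To make the point about mining markup concrete, here is a rough Python sketch of how an index could be compiled from wiki-links of the form [[Subject]] or [[Subject|display text]]. It is not FromThePage's or ProofreadPage's code; the function name, the sample pages, and the personal names in them are invented for illustration.

```python
import re
from collections import defaultdict

# Matches [[Target]] or [[Target|display text]] and captures the target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def build_index(pages):
    """Map each linked subject to the list of page numbers that mention it."""
    index = defaultdict(list)
    for number, text in pages:
        for match in WIKILINK.finditer(text):
            subject = match.group(1).strip()
            if number not in index[subject]:
                index[subject].append(number)
    # Sort alphabetically by subject, as a printed index would be.
    return dict(sorted(index.items()))


# Hypothetical transcribed pages with in-text wiki-links.
pages = [
    (1, "Visited [[Anna Schmidt|Anna]] and [[Karl Weber|Karl]] today."),
    (2, "[[Anna Schmidt]] came over in the afternoon."),
]
index = build_index(pages)
for subject, page_numbers in index.items():
    print(subject, page_numbers)
```

Because the link target is separate from the display text, "Anna" on page 1 and "Anna Schmidt" on page 2 land under the same index entry without any hand compilation.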

When I originally researched ProofreadPage, one question perplexed me: why aren't more manuscripts being transcribed on Wikisource? A lot has happened since I last participated in the Wikisource community in 2004, especially within the realm of formalized rules. There is now a rule on the English, French, and German Wikisource sites banning unpublished work. Apparently the goal was to discourage self-promoters from using the site for their own novels or crackpot theories, and it's pretty drastic. The English-language version specifies that sources must have been previously published on paper, and the French site has added "Ne publiez que des documents qui ont été déjà publiés ailleurs, sur papier" ("Only publish documents that have already been published elsewhere, on paper") to the edit form itself! It is a rare manuscript indeed that has already been published in a print form which may be OCRed but which is worth transcribing from handwriting anyway. As a result, I suspect that we're not likely to see much attention paid to transcription proper within the ProofreadPage code, short of a successful non-Wikisource MediaWiki/ProofreadPage project.

Aside from FromThePage (which is accepting new transcription projects!), ProofreadPage/MediaWiki is my favorite transcription tool. Its origins outside the English-language community, together with Wikisource community policy, have obscured its utility for transcribing manuscripts, which is why I think it's been overlooked. It's got a lot of momentum behind it, and while it is still centered around OCR, I feel like it will work for many needs. Best of all, it's open-source, so you can start a transcription project by setting up your own private Wikisource-style instance.

Thanks to Klaus Graf at Archivalia for much of the information in this article.