Wednesday, December 7, 2011

Developments in Wikisource/ProofreadPage for Transcription

Last year I reviewed Wikisource as a platform for manuscript transcription projects, concluding that the ProofreadPage plug-in was quite versatile, but that unfortunately the en.wikisource.org policy prohibiting any text not already published on paper ruled out its use for manuscripts.

I'm pleased to report that this policy has been softened. About a month ago, NARA started to partner with the Wikimedia Foundation to host material—including manuscripts—on Wikisource.  While I was at MCN, I discussed this with Katie Filbert, the president of Wikimedia DC, who set me straight.  Wikisource is now very interested in partnering with institutions to host manuscripts of importance, but it is still not a place for ordinary people to upload great-grandpa's journal from World War I.

Once you host a project on Wikisource, what do you do with it?  Andie, Rob and Gaurav over at the blog So You Think You Can Digitize?—and it's worth your time to read at least the last six posts—have been writing on exactly that subject.  Their most recent post describes their experience with Junius Henderson's Field Notes, and although it concentrates on their success flushing out more Henderson material and recounts how they dealt with the Wikisource software, I'd like to focus on one detail:
What we currently want is a no-cost, minimal effort system that will make scans AND transcriptions AND annotations available, and that can facilitate text mining of the transcriptions.  Do we have that in WikiSource?  We will see.  More on annotations to follow in our next post but some father to a sister of some thoughts are already percolating and we have even implemented some rudimentary examples.
This is really exciting stuff.  They're experimenting with wiki mark-up of the transcriptions with the goal of annotation and text-mining.  I tried to do this back in 2005, but abandoned the effort because I never could figure out how to clearly differentiate MediaWiki articles about subjects (i.e. annotations) from articles that presented manuscript pages and their transcribed text.  The lack of wiki-linking was also one of my criticisms most taken to heart by the German Wikisource community last October.

So how is the mark-up working out?  Gaurav and the team have addressed the differentiation issue by using cross-wiki links, a standard way of linking from an article on one Wikimedia project to another.  So the text "English sparrows" in the transcription is annotated [[:w:Passer domesticus|English sparrows]], which is wiki-speak for Link the text "English sparrows" to the Wikipedia article "Passer domesticus". Wikipedia's redirects then send the browser off to the article "House Sparrow".

So far so good.  The only complaint I can make is that—so far as I can tell—cross-wiki links don't appear in the "What links here" tool on Wikipedia, either for Passer domesticus or for House Sparrow.  This means the annotation can't provide an indexing function: users can't see all the pages that reference possums, much less read a selection of those pages.  I'm not sure that the cross-wiki link data isn't tracked at all, however — just that I can't see it in the UI.  Tantalizingly, cross-wiki links are tracked when images or other files are included in multiple locations: see the "Global file usage" section of the sparrow image, for example.  Perhaps there is an API somewhere that the Henderson Field Notes project could use to mine this data, or perhaps they could move their link targets from Wikipedia articles to some intermediary in a different Wikisource namespace.
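
For anyone who wants to experiment, the MediaWiki API already exposes at least some of this data. The sketch below is a minimal example, assuming the standard prop=iwlinks module on Wikisource (which lists the interwiki links leaving a page) and the GlobalUsage extension's prop=globalusage module on Commons (which lists cross-wiki uses of a file); the page title and file name are placeholders rather than the Henderson project's actual pages.

    # Minimal sketch: mine cross-wiki links through the MediaWiki API.
    # The page title and file name below are placeholders, not the
    # Henderson project's actual pages.
    import json
    import urllib.parse
    import urllib.request

    def api_query(endpoint, **params):
        """Run one MediaWiki API query and return the parsed JSON."""
        params.update({"action": "query", "format": "json"})
        url = endpoint + "?" + urllib.parse.urlencode(params)
        request = urllib.request.Request(
            url, headers={"User-Agent": "cross-wiki-link-demo"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    # Interwiki links leaving a Wikisource page (prop=iwlinks).
    data = api_query("https://en.wikisource.org/w/api.php",
                     prop="iwlinks", iwlimit="max",
                     titles="Field Notes of Junius Henderson/Notebook 1")  # placeholder title
    for page in data["query"]["pages"].values():
        for link in page.get("iwlinks", []):
            print(link["prefix"], link["*"])

    # Cross-wiki usage of a Commons file (prop=globalusage,
    # provided by the GlobalUsage extension).
    data = api_query("https://commons.wikimedia.org/w/api.php",
                     prop="globalusage", gulimit="max",
                     titles="File:House Sparrow.jpg")  # placeholder file name
    for page in data["query"]["pages"].values():
        for usage in page.get("globalusage", []):
            print(usage["wiki"], usage["title"])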

Regardless, the direction Wikisource is moving should make it an excellent option for institutions looking to host documentary transcription projects and experiment with crowdsourcing without running their own servers.  I can't wait to see what happens once Andie, Rob, and Gaurav start experimenting with PediaPress!

Friday, November 18, 2011

Crowdsourcing Transcription at MCN 2011

These are links to the papers, websites, and systems mentioned in my presentation at the Museum Computer Network 2011 conference.

Friday, August 5, 2011

Programmers: Wikisource Needs You!

Wikisource is powered by a MediaWiki extension which allows page images to be displayed beside the wiki editing form. This extension also handles editorial workflow by allowing pages, chapters, and books to be marked as unedited, partially edited, in need of review, or finished. It's a fine system, and while the policy of the English language Wikisource community prevents it from being used for manuscript transcription, there are active manuscript projects using the software in other communities.

Yesterday, Mark Hershberger wrote this in a comment: For what its worth the extension used by WikiSource, ProofreadPage, now needs a maintainer. I posted about this here: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/54831

While I'm sorry to hear it, this is an excellent opportunity for someone with MediaWiki skills to do some real good.

Tuesday, July 26, 2011

Can a Closed Crowdsourcing Project Succeed?

Last night, the Zooniverse folks announced their latest venture: Ancient Lives, which invites the public to help analyze the Oxyrhynchus Papyri. The transcription tool meets the high standards we now expect from the team who designed Old Weather, but the project immediately stirred some controversy because of its terms of use:


Sean is referring to this section of the copyright statement (technically not a terms of use), which is also displayed in the tutorial:
Images may not be copied or offloaded, and the images and their texts may not be published. All digital images of the Oxyrhynchus Papyri are © Imaging Papyri Project, University of Oxford. The papyri themselves are owned by the Egypt Exploration Society, London. All rights reserved.
Future use of the transcriptions may be hinted at a bit on the About page:
The papyri belong to the Egypt Exploration Society and their texts will eventually be published and numbered in Society's Greco-Roman Memoirs series in the volumes entitled The Oxyrhynchus Papyri.
It should be noted that the closed nature of the project is likely a side-effect of UK copyright law, not a policy decision by the Zooniverse team. In the US, a scan or transcription of a public domain work is also public domain and not subject to copyright. In the UK, however, scanning an image creates a new copyright in the scan, so upstream providers are automatically able to restrict downstream use of public domain materials. In the case of federated digitization projects, this can create a situation like that of the Old Bailey Online, where different pieces of a seemingly seamless digital database are owned by entirely different institutions.

I will be very interested to see how the Ancient Lives project fares compared to GalaxyZoo's other successes. If the transcriptions are posted and accessible on their own site, users may not care about the legal ownership of the results of their labor. They've already had 100,000 characters transcribed, so perhaps these concerns are irrelevant for most volunteers.

Wednesday, July 20, 2011

Crowdsourcing and Variant Digital Editions

Writing at the JISC Digitization Blog, Alastair Dunning warns of "problems with crowdsourcing having the ability to create multiple editions."

For example, the much-lauded Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) are now beginning to appear on many different digital platforms.

ProQuest currently hold a licence that allows users to search over the entire EEBO corpus, while Gale-Cengage own the rights to ECCO.

Meanwhile, JISC Collections are planning to release a platform entitled JISC Historic Books, which makes licenced versions of EEBO and ECCO available to UK Higher Education users.

And finally, the Universities of Michigan and Oxford are heading the Text Creation Partnership (TCP), which is methodically working its way through releasing full-text versions of EEBO, ECCO and other resources. These versions are available online, and are also being harvested out to sites like 18th Century Connect.

So this gives us four entry points into ECCO – and it’s not inconceivable that there could be more in the future.

What’s more, there have been some initial discussions about introducing crowdsourcing techniques to some of these licensed versions; allowing permitted users to transcribe and interpret the original historical documents. But of course this crowdsourcing would happen on different platforms with different communities, who may interpret and transcribe the documents in different way. This could lead to the tricky problem of different digital versions of the corpus. Rather than there being one EEBO, several EEBOs exist.

Variant editions are indeed a worrisome prospect, but I don't think that it's unique to projects created through crowdsourcing. In fact, I think that the mechanism of producing crowdsourced editions actually reduces the possibility for variants to emerge. Dunning and I corresponded briefly over Twitter, then I wrote this comment to the JISC Digitization blog. Since that blog seems to be choking on the mark-up, I'll post my reply here:
benwbrum Reading @alastairdunning's post connecting crowdsourcing to variant editions: bit.ly/raVuzo Feel like Wikipedia solved this years ago.

benwbrum If you don't publish (i.e. copy) a "final" edition of a crowdsourced transcription, you won't have variant "final" versions.

benwbrum The wiki model allows linking to a particular version of an article. I expanded this to the whole work: link

alastairdunning But does that work with multiple providers offering restricted access to the same corpus sitting on different platforms?

alastairdunning ie, Wikipedia can trace variants cause it's all on the same platform; but there are multiple copies of EEBO in different places

benwbrum I'd argue the problem is the multiple platforms, not the crowdsourcing.

alastairdunning Yes, you're right. Tho crowdsourcing considerably amplifies the problem as the versions are likely to diverge more quickly

benwbrum You're assuming multiple platforms for both reading and editing the text? That could happen, akin to a code fork.

benwbrum Also, why would a crowd sourced edition be restricted? I don't think that model would work.

I'd like to explore this a bit more. I think that variant editions are less likely in a crowdsourced project than in a traditional edition, but efforts to treat crowdsourced editions in a traditional manner can indeed result in the situation you warn against.

When we're talking about crowdsourced editions, we're usually talking about user-generated content that is produced in collaboration with an editor or community manager. Without exception, this requires some significant technical infrastructure -- a wiki platform for transcribing free-form text, or an even more specialized tool for transcribing structured data like census records or menus. For most projects, the resulting edition is hosted on that same platform -- the Bentham wiki that displays the transcriptions for scholars to read and analyze is the same tool that volunteers use to create the transcriptions. This kind of monolithic platform does not lend itself to the kind of divergence you describe: copies of the edition become dated as soon as they are separated from the production platform, and making a full copy of the production platform requires a major rift among the editors and volunteer community. These kinds of rifts can happen--in my world of software development, the equivalent phenomenon is a code fork--but they're very rare.

But what about projects which don't run on a monolithic platform? There are a few transcription projects in which editing is done via a wiki (Scripto) or webform (UIowa) but the transcriptions are posted to a content management system. There is indeed potential for the "published" version on the CMS to drift from the "working" version on the editing platform, but in my opinion the problem lies not in crowdsourcing, but in the attempt to impose a traditional publishing model onto a participatory project by inserting editorial review in the wrong place:

Imagine a correspondence transcription project in which volunteers make their edits on a wiki but the transcriptions are hosted on a CMS. One model I've seen often involves editors taking the transcriptions from the wiki system, reviewing and editing them, then publishing the final versions on the CMS. This is a tempting work-flow -- it makes sense to most of us both because the writer/editor/reader roles are clearly defined and because the act of copying the transcription to the CMS seems analogous to publishing a text. Unfortunately, this model fosters divergence between the "published" edition and the working copy, as volunteers continue to make changes to the transcriptions on the wiki, sometimes ignoring changes made by the reviewer, sometimes correcting text regardless of whether a letter has been pushed to the CMS. The alternative model has reviewers make their edits within the wiki system itself, with content pushed to the CMS automatically. In this model, the wiki is the system-of-record; the working copy is the official version. Since the CMS simply reflects the production platform, it does not diverge from it. The difficulty lies in abandoning the idea of a final version.
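
To make the difference concrete, here is a minimal sketch of the second model, in which the wiki remains the system-of-record and the CMS merely mirrors it. Everything in it is hypothetical -- the wiki URL, the CMS endpoint, the page titles, and the assumption that the CMS accepts updates over HTTP are all placeholders, not a description of any existing Scripto or UIowa workflow.

    # Hypothetical sketch: mirror the newest wiki revision of each
    # transcription into a CMS, so the wiki stays the system-of-record.
    # The wiki URL, CMS endpoint, and page titles are made-up placeholders.
    import json
    import urllib.parse
    import urllib.request

    WIKI_API = "https://transcription.example.org/w/api.php"    # placeholder wiki
    CMS_ENDPOINT = "https://cms.example.org/api/transcriptions"  # placeholder CMS

    def latest_revision(title):
        """Return (revision id, wikitext) for the newest revision of a page."""
        params = urllib.parse.urlencode({
            "action": "query", "format": "json",
            "prop": "revisions", "rvprop": "ids|content",
            "titles": title,
        })
        with urllib.request.urlopen(WIKI_API + "?" + params) as response:
            page = next(iter(json.load(response)["query"]["pages"].values()))
        revision = page["revisions"][0]
        return revision["revid"], revision["*"]

    def push_to_cms(title, revid, text):
        """POST the transcription to the CMS, which only ever mirrors the wiki."""
        body = json.dumps({"title": title, "revid": revid, "text": text}).encode()
        request = urllib.request.Request(
            CMS_ENDPOINT, data=body, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request)

    for title in ["Letter 1861-04-12", "Letter 1861-05-02"]:  # placeholder page titles
        revid, text = latest_revision(title)
        push_to_cms(title, revid, text)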

It's not at all clear to me how EEBO or ECCO are examples of crowdsourcing, rather than traditional restricted-access databases created and distributed through traditional means, so I'm not sure that they're good examples.

Tuesday, February 15, 2011

My Goals for FromThePage

A couple of recent interactions have made me realize that I've never publicly articulated my goals in developing FromThePage. Like anyone else managing a multi-year project, my objectives have shifted over time. However, there are three main themes of my work developing web-based software for transcribing handwritten documents:
  1. Transcribing and publishing family diaries. FromThePage was developed to allow me and my immediate family to collaboratively transcribe the diaries of Julia Craddock Brumfield (fl. 1915-1936), my great-great grandmother. This objective has drifted over time to include the diaries of Jeremiah White Graves (fl. 1823-1878) as well, but despite that addition the effort is on track to achieve this original goal. Since the website was announced to my extended family in early 2009, Linda Tucker—a cousin whom I'd never met—has transcribed every single page I've put online, then located and scanned three more diaries that were presumed lost. The only software development work remaining to complete this goal is the integration of the tool with a publish-on-demand service so that we may distribute the diaries to family members without Internet access.
  2. Creating generally useful transcription software. As I developed FromThePage, I quickly realized that the tool would be useful to anyone transcribing, indexing and annotating handwritten material. It seemed a waste to pour effort into a tool that was only accessible to me, so a new goal arose of converting FromThePage into a viable multi-user software project. This has been a more difficult endeavor, but in 2010 I released FromThePage under the AGPL, and it's been adopted with great enthusiasm by the Balboa Park Online Collaborative for transcription projects at the San Diego Natural History Museum.
  3. Providing access to privately-held manuscripts. The vision behind FromThePage is to generalize my own efforts digitizing family diaries across the broader public. There is what I call an invisible archive--an enormous collection of primary documents distributed among the filing cabinets, cedar chests, and attics of the nation's nostalgic great-aunts, genealogists, and family historians. This archive is inaccessible to all but the most closely connected family and neighbors of the documents' owners — indeed, it is most often not merely inaccessible but entirely unknown. When effort is put into researching this archival material, it's done by amateurs like myself, and more often than not the results are naïve works of historical interpretation: rather than editing and annotating a Civil War diary, the researcher draws from it to create yet another Lost Cause narrative. I would love for FromThePage to transform this situation, channeling amateur efforts into digitizing and sharing irreplaceable primary material with researchers and family members alike. This has proven a far greater challenge than my proximate or intermediate goals: technically speaking, the processing and hosting of page scans has been costly and difficult, while my efforts to recruit from the family history community have met with little success. Nevertheless I remain hopeful that events like this month's RootsTech conference will build the same online network among family researchers that THATCamp has among professional digital humanists.

Wednesday, February 2, 2011

2010: The Year of Crowdsourcing Transcription

2010 was the year that collaborative manuscript transcription finally caught on.

Back when I started work on FromThePage in 2005, I got the same response from most people I told about the project: "Why don't you just use OCR software?" To say that the challenges of digitizing handwritten material were poorly understood might be inaccurate—after all, the TEI standard included an entire chapter on manuscripts—but there were no tools in use designed for the purpose. Five years later, half a dozen web-based transcription projects are in progress and new projects may choose from several existing tools to host their own. Crowdsourced transcription was even written up in the New York Times!

I'm going to review the field as I see it, then make some predictions for 2011.

Ongoing Structured Transcription Projects
By far the most successful transcription project is FamilySearch Indexing. In 2010, 185,900,667 records were transcribed from manuscript census forms, parish registers, tithe lists, and other sources world-wide. This brings the total up to 437,795,000 records double-keyed and reconciled by more than four hundred thousand volunteers — itself an awe-inspiring number with an equally impressive support structure.
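
Double-keying deserves a moment's illustration. The sketch below uses made-up data to show the basic idea: two volunteers key the same record independently, fields that match are accepted automatically, and disagreements go to a third volunteer for arbitration. FamilySearch's actual pipeline is, of course, far more sophisticated than this.

    # Minimal sketch of double-keying with made-up data: two volunteers key
    # the same record independently, matching fields are accepted, and
    # disagreements are flagged for a third volunteer to arbitrate.
    keyer_a = {"surname": "Brumfield", "given": "Julia", "year": "1915"}
    keyer_b = {"surname": "Brumfield", "given": "Julie", "year": "1915"}

    reconciled, disputed = {}, {}
    for field in keyer_a:
        if keyer_a[field] == keyer_b[field]:
            reconciled[field] = keyer_a[field]
        else:
            disputed[field] = (keyer_a[field], keyer_b[field])

    print(reconciled)  # fields the two keyers agree on
    print(disputed)    # {'given': ('Julia', 'Julie')} goes to arbitration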

October saw the launch of OldWeather, a project in which GalaxyZoo applied its crowdsourcing technology to the transcription of Royal Navy ships' logs from WWI. As I write, volunteers have transcribed an astonishing 308,169 pages of logs — many of which include multiple records. I hope to do a more detailed review of the software soon, but for now let me note how elegantly the software uses the data itself to engage volunteers, so that transcribers can see the motion of "their ship" on a map as they enter dates, latitudes, and longitudes. This leverages the immersive nature of transcription as an incentive, drawing users deep into history.

The North American Bird Phenology Program transcribed nearly 160,000 species sighting cards between December 2009 and 2010 and maintained its reputation as a model for crowdsourcing projects by publishing the first user satisfaction survey for a transcription tool. Interestingly, the program seems to have followed a growth pattern a bit similar to Wikipedia's: the total of cards transcribed rose from 203,967 to 362,996 (a 78% increase) while the number of volunteers only grew from 1,666 to 2,204 (a 32% increase) — indicating that a core of passionate volunteers remains the most active contributors.

I've only recently discovered Demogen, a project operated by the Belgian Rijksarchief to enlist the public in indexing handwritten death notices. Although most of the documentation is in Flemish, the Windows-based transcription software will also operate in French. I've had trouble finding statistics on how many of the record sets have been completed (a set comprising a score of pages, with half a dozen personal records per page). By my crude estimate, the 4000-odd sets are approximately 63% indexed — say a total of 300,000 records to date. I'd like to write a more detailed review of Demogen/Visu and would welcome any pointers to project status and community support pages.
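
For what it's worth, the arithmetic behind that guess is simple enough to show; the pages-per-set and records-per-page figures are my own rough assumptions, not published Demogen statistics.

    # Back-of-the-envelope estimate of Demogen records indexed to date.
    # All of the inputs are rough assumptions, not official project figures.
    sets_total = 4000        # roughly 4000 record sets
    pages_per_set = 20       # "a score of pages" per set
    records_per_page = 6     # about half a dozen personal records per page
    fraction_indexed = 0.63  # my crude estimate of overall completion

    records_indexed = sets_total * pages_per_set * records_per_page * fraction_indexed
    print(round(records_indexed, -3))  # => 302000.0, call it 300,000 records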

Ancestry.com's World Archives Project has been operating since 2008, but I've been unable to find any statistics on the total number of records transcribed. The project allows volunteers to index personal information from a fairly heterogeneous assortment of records scanned from microfilm. Each set of records has its own project page with help and statistics. The keying software is a Windows-based application free for download by any Ancestry.com registered user, while support is provided through discussion boards and a wiki.

Ongoing Free-form Transcription Projects
While I've written about Wikisource and its ProofreadPage plug-in before, it remains very much worth following. Of its 1.3 million scanned pages, more than two hundred thousand have been proofread, had problems reconciled, and been reviewed. Only a tiny percentage of those are handwritten, but that's still a few thousand pages, making it the most popular automated free-form transcription tool.

This blog was started to track my own work developing FromThePage to transcribe Julia Brumfield's diaries. As I type, beta.fromthepage.com hosts 1,503 transcribed pages—of which 988 are indexed and annotated—and volunteers are now waiting on me to prepare and upload more page images. Major developments in 2010 included the release of FromThePage on GitHub under a Free software license and installation of the software by the Balboa Park Online Collaborative for transcription projects by their member institutions.

Probably the biggest news this year was TranscribeBentham, a project at University College London to crowdsource the transcription of Jeremy Bentham's papers. This involved the development of Transcription Desk, a MediaWiki-based tool which is slated to be released under an open-source license. The team of volunteers had transcribed 737 pages of very difficult handwriting when I last consulted the Benthamometer. The Bentham team has done more than any other transcription project to publicize the field -- explaining their work on their blog, reaching out through the media (including articles in the Chronicle of Higher Education and the New York Times), and even highlighting other transcription projects on Melissa Terras's blog.

Halted Transcription Projects
The Historic Journals project is a fascinating tool for indexing—and optionally transcribing—privately-held diaries and journals. It's run by Doug Kennard at Brigham Young University, and you can read about his vision in this FHT09 paper. Technically, I found a couple of aspects of the project to be particularly innovative. First, the software integrates with ContentDM to display manuscript page images from that system within its own context. Second, the tool is tightly integrated with FamilySearch, the LDS Church's database of genealogical material. It uses the FamilySearch API to perform searches for personal or place names, and can then use the FamilySearch IDs to uniquely identify subjects mentioned within the texts. Unfortunately, because the FamilySearch API is currently limited to LDS members, development on Historic Journals has been temporarily halted.

Begun as a desktop application in 1998, the uScript Transcription Assistant is the longest-running program in the field. Recently ported to modern web-based technologies, the system is similar to Img2XML and T-PEN in that it links individual transcribed words to the corresponding regions within the scanned page image. Although the system is not in use and the source code is not accessible outside WPI, you can read papers describing it, written by WPI students in 2003 and by Fabio Carrera (the faculty member leading the project) in 2005. Unfortunately, according to Carrera's blog, work on the project has stopped for lack of funding.

According to the New York Times article, there was an attempt to crowdsource the Papers of Abraham Lincoln. The article quotes project director Daniel Stowell explaining that nonacademic transcribers "produced so many errors and gaps in the papers that 'we were spending more time and money correcting them as creating them from scratch.'" The prototype transcription tool (created by NCSA at UIUC) has been abandoned.

Upcoming Transcription Projects
The Center for History and New Media at George Mason University is developing a transcription tool called Scripto, based on MediaWiki and architected around integration with an external CMS for hosting page images. The initial transcription project will be their Papers of the War Department site, but connector scripts for other content management systems are under development. Scripto is being developed in a particularly open manner, with the source code available for immediate inspection and download on GitHub and a project blog covering the tool's progress.
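
To illustrate what being "architected around connectors" might mean in practice, here is a purely hypothetical sketch of a connector interface. It is not Scripto's actual API -- Scripto itself is written in PHP, as I understand it, and I haven't examined its connector code -- just the general shape such an integration tends to take: the CMS owns the images and the finished text, while the transcription tool asks a connector for what it needs.

    # Purely hypothetical sketch of a CMS "connector" interface, meant to
    # show the general shape of the architecture described above; it is
    # not Scripto's actual API.
    from abc import ABC, abstractmethod

    class CmsConnector(ABC):
        """Adapter between a transcription tool and one particular CMS."""

        @abstractmethod
        def list_documents(self):
            """Return identifiers of documents whose images live in the CMS."""

        @abstractmethod
        def image_url(self, document_id, page_number):
            """Return the URL of one page image hosted by the CMS."""

        @abstractmethod
        def save_transcription(self, document_id, page_number, text):
            """Write a finished transcription back into the CMS record."""

    class ImaginaryCmsConnector(CmsConnector):
        """Connector for an imaginary CMS with a simple REST API."""

        def __init__(self, base_url):
            self.base_url = base_url

        def list_documents(self):
            return ["doc-001", "doc-002"]  # placeholder identifiers

        def image_url(self, document_id, page_number):
            return f"{self.base_url}/{document_id}/pages/{page_number}.jpg"

        def save_transcription(self, document_id, page_number, text):
            # A real connector would POST to the CMS here; this one just reports.
            print(f"would save {len(text)} characters to "
                  f"{self.base_url}/{document_id}/pages/{page_number}")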

T-PEN is a tool under development by Saint Louis University to enable line-by-line transcription and paleographic annotation. It's focused on medieval manuscripts, and automatically identifies the lines of text within a scanned page — even if that page is divided into columns. The team integrated crowdsourcing into their development process by challenging the public to test and give feedback on their line-identification algorithm, gathering perhaps a thousand ratings in a two-week period. There's no word on whether T-PEN will be released under a free license. I should also mention that they've got the best logo of any transcription tool.

I covered Militieregisters.nl at length below, but the most recent news is that a vendor has been picked to develop the VeleHanden transcription tool. I would not be at all surprised if 2011 saw the deployment of that system.

The Balboa Park Online Collaborative is going into collaborative transcription in a big way with the Field Notes of Laurence Klauber for the San Diego Natural History Museum. They've picked my own FromThePage to host their transcriptions, and have been driving a lot of the development on that system since October through their enthusiastic feature requests, bug reports, and funding. Future transcription projects are in the early planning stages, but we're trying to complete features suggested by the Klauber material first.

The University of Iowa Libraries plan to crowdsource transcription of their Historic Iowa Children's Diaries. There is no word on the technology they plan to use.

The Getty Research Institute plans to crowdsource transcription of J. Paul Getty's diaries. This project also appears to be in the very early stages of planning, with no technology chosen.

Invisible Australians is a digitization project by Kate Bagnall and Tim Sherratt to explore the lives of Australians subjected to the White Australia policy through the extensive records kept on them. While it's still in the planning stages (with only a set of project blogs and a Zotero library publicly visible), the heterogeneity of the source material makes it one of the most ambitious documentary transcription projects I've seen. Some of the data is traditionally structured (like government forms), some is free-form (like letters), and there are photographs and even hand-prints to present alongside the transcription! Invisible Australians will be a fascinating project to follow in 2011.

Obscure Transcription Projects
Because the field is so fragmented, there are a number of projects I follow that are not entirely automated, not entirely public, not entirely collaborative, moribund or awaiting development. In fact, some projects have so little written about them online that they're almost mysterious.
  • Commenters to a blog post at Rogue Classicism are discussing this APA job posting for a Classicist to help develop a new GalaxyZoo project transcribing the Oxyrhynchus Papyri.
  • Some cryptic comments on blog posts covering TranscribeBentham point to FadedPage, which appears to be a tool similar to Project Gutenberg's Distributed Proofreaders. Further investigation has yielded no instances of it being used for handwritten material.
  • A blog called On the Written, the Digital, and the Transcription tracks development of WrittenRummage, which was apparently a crowdsourced transcription tool that sought to leverage Amazon's Mechanical Turk.
  • Van Papier Naar Digitaal is a project by Hans den Braber and Herman de Wit in which volunteers photograph or scan handwritten material then send the images to Hans. Hans reviews them and puts them on the website as a PDF, where Herman publicizes them to transcription volunteers. Those volunteers download the PDF and use Jacob Boerema's desktop-based Transcript software to transcribe the records, which are then linked from Digitale Bronbewerkinge Nederland en België. With my limited Dutch it is hard for me to evaluate how much has been completed, but in the years that the program has been running its results seem to have been pretty impressive.
  • BYU's Immigrant Ancestors Project was begun in 1996 as a survey of German archival holdings, then was expanded into a crowdsourced indexing project. A 2009 article by Mark Witmer predicts the imminent roll-out of a new version of the indexing software, but the project website looks quite stale and says that it's no longer accepting volunteers.
  • In November, a Google Groups post highlighted the use of Islandora for side-by-side presentation of a page image and a TEI editor for transcription. However I haven't found any examples of its use for manuscript material.
  • Wiktenauer is a MediaWiki installation for fans of western martial arts. It hosts several projects transcribing and translating medieval manuals of fighting and swordsmanship, although I haven't yet figured out whether they're automating the transcription.
  • Melissa Terras' manuscript transcription blog post mentioned a Drupal-based tool called OpenScribe, built by the New Zealand Electronic Text Centre. However, the Google Code site doesn't show any updates since mid-2009, so I'm not sure how active the project is. This project is particularly difficult to research because "OpenScribe" is also the name chosen for an audio transcription tool hosted on SourceForge as well as a commercial scanning station.
I welcome any corrections or updates on these projects.

Predictions for 2011

Emerging Community
Nearly all of the transcription projects I've discussed were begun in isolation, unaware of previous work towards transcription tools. While I expect this fragmented situation to continue--in fact I've seen isolated proposals as recently as Shawn Moore's October 12 HASTAC post--it should lessen a bit as toolmakers and project managers enter into dialogue with each other on comment threads, conference panels or GitHub. Tentative steps were made towards overcoming linguistic division in 2010, with Dutch archivists covering TranscribeBentham and a scattered bit of bloggy conversation between Dutch, German, English and American participants. The publicity given to projects like OldWeather, Scripto, and TranscribeBentham can only help this community form.

No Single Tool
We will not see the development of a single tool that supports transcription of both structured and free-form manuscripts, nor both paleographic and semantic annotation in 2011. The field is too young and fragmented -- most toolmakers have enough work providing the basic functionality required by their own manuscripts.

New Client-side Editors
Although I don't foresee convergence of server-side tools, there is already some exciting work being done on Javascript-based editors for TEI, the mark-up language that informs most manuscript annotation. TEILiteEditor is an open-source WYSIWYG editor for TEI, while RaiseXML is an open-source editor for manipulating TEI tags directly. Both projects have seen a lot of activity over the past few weeks, and it's easy to imagine a future in which many different transcription tools support the same user-facing editor.

External Integration
2010 already saw strides being made towards integration with external CMSs, with BYU's Historic Journals serving page images from ContentDM and FromThePage serving page images from the Internet Archive. Scripto is apparently designed entirely around CMS integration, as it does not host images itself and is architected to support connectors for many different content management systems. I feel that this will be a big theme of transcription tool development in 2011, with new support for feeding transcriptions and text annotations back to external CMSs.

Outreach/Volunteer Motivation
We're learning that a key to success in crowdsourcing projects is recruiting volunteers. I think that 2011 will see a lot of attention paid to identifying and enlisting existing communities interested in the subject matter for a transcription project. In addition to finding volunteers, projects will better understand volunteer motivation and the trade-offs between game-like systems that encourage participation through score cards and points on the one hand, and immersive systems that enhance the volunteers' engagement with the text on the other.

Taxonomy
As the number of transcription projects multiplies, I think we will be able to start generalizing from the unique needs of each collection of manuscript material to form a sort of taxonomy of transcription projects. In the list above, I've separated the projects indexing structured data like militia rolls from those dealing with free-form text like diaries or letters. I think that in 2011 we'll be able to classify projects by their paleographic requirements, the kinds of analysis that will be performed on the transcribed texts, the quantity of non-textual images that must be incorporated into the transcription presentation, and other dimensions. It's possible that the existing tools will specialize in a few of these areas, providing support for needs similar to those of their original projects, so that a sort of decision tree could guide new projects toward the appropriate tool for their manuscript material.

2011 is going to be a great year!

Tuesday, January 4, 2011

Progress Report: GitHub, Archive.org Integration, and General Availability

2010 saw big changes in FromThePage.
  • The Balboa Park Online Collaborative started using FromThePage to transcribe the field notes of herpetologist Laurence Klauber. Perian Sully, Rich Cherry, and all the other folks there have been fantastic to work with: full of enthusiasm and new ideas for the system while patient with the bugs that we've discovered. This is the first institution to install FromThePage, and their needs have driven a lot of development since October, including
  • Internet Archive integration: As you can see on the Klauber site, FromThePage now integrates directly with books hosted on the Internet Archive. This means that FromThePage gets to use the BookReader (in modified form) with its spiffy zoom and pan capabilities while delegating the expensive work of image hosting to Archive.org. It also reduces duplication of data and may enhance findability of the transcriptions. Best of all, the tedious process of uploading, assembling, and titling page images can be skipped, as FromThePage now imports the book structure and even the OCRed page titles from Archive.org derivative files (see the sketch after this list).
  • As you can see from that last link, I've transferred FromThePage over to GitHub, released it under the Affero GPL, and created some extensive documentation on the wiki. So FromThePage is officially Free software, available for immediate use.
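
The sketch below gives a rough idea of what that Archive.org import involves: it pulls an item's file list from the Archive.org metadata endpoint and picks out the derivative files that describe the book's structure and OCR. The identifier is a placeholder, and the exact derivatives FromThePage parses may differ; treat this as an illustration of the idea rather than FromThePage's actual import code.

    # Rough sketch: list the derivative files of an Internet Archive item
    # that describe a book's structure and OCR. The identifier is a
    # placeholder, and this is not FromThePage's actual import code.
    import json
    import urllib.request

    IDENTIFIER = "exampleFieldNotebook1923"  # placeholder Archive.org identifier

    url = f"https://archive.org/metadata/{IDENTIFIER}"
    with urllib.request.urlopen(url) as response:
        item = json.load(response)

    print(item["metadata"].get("title"))
    for derivative in item["files"]:
        name = derivative["name"]
        # scandata.xml describes the leaf/page structure; *_djvu.xml holds the OCR.
        if name.endswith("scandata.xml") or name.endswith("_djvu.xml"):
            print(name, derivative.get("format"))
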
If you're interested in hosting a transcription project on FromThePage, drop me a line at benwbrum@gmail.com and I'll help you get started.