Collaborative Manuscript Transcription: March 2009

Monday, March 23, 2009

Feature: Mechanical Turk Integration

At last week's Austin On Rails SXSW party, my friend and compatriot Steve Odom gave me a really neat feature idea. "Why don't you integrate with Amazon's Mechanical Turk?" he asked. This is an intriguing notion, and while it's not on my own road map, it would be pretty easy to modify FromThePage to support that. Here's what I'd do to use FromThePage on a more traditional transcription project, with an experienced documentary editor at the head and funding for transcription work:

Page Forks: I assume that the editor using Mechanical Turk would want double-keyed transcriptions to maintain quality, so the application needs to present the same untranscribed page to multiple people. In the software world, when a project splits, we call this forking, and I think that the analogy applies here. This feature needs to track an entirely separate edit history for each fork of a page. That means a new attribute on the master page record describing whether more than one fork exists, and a separate edit history for each fork that's created. There's no reason to limit these transcriptions to only two forks, even if that's the most common use case, so I'd want to provide a URL that will automatically create a new fork for a new transcriber to work in. The Amazon HIT (Human Intelligence Task) would have a link to that URL, so the transcriber need never track which fork they're working in, or even be aware of the double keying.
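
A minimal sketch of the fork bookkeeping described above, written in Python purely for illustration (FromThePage itself is a Rails application). The class names, the `new_fork` method, and the sample text are all hypothetical; the only ideas taken from the post are a master page record that knows whether it has more than one fork, and a separate edit history per fork.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fork:
    """One transcriber's independent copy of a page, with its own edit history."""
    number: int
    transcriber: str
    versions: List[str] = field(default_factory=list)

    def save(self, text: str) -> None:
        # Each save appends to this fork's history only.
        self.versions.append(text)


@dataclass
class Page:
    """Master page record; the forks hang off it."""
    title: str
    forks: List[Fork] = field(default_factory=list)

    @property
    def is_forked(self) -> bool:
        # The "more than one fork exists" attribute on the master record.
        return len(self.forks) > 1

    def new_fork(self, transcriber: str) -> Fork:
        # What a fork-creating URL might call: every visit gets a fresh fork.
        fork = Fork(number=len(self.forks) + 1, transcriber=transcriber)
        self.forks.append(fork)
        return fork


# Invented sample data, not text from the diaries.
page = Page("sample page")
first = page.new_fork("worker_a")
second = page.new_fork("worker_b")
first.save("Ben went to town")
second.save("Ben went to towne")
second.save("Ben went to town.")
print(page.is_forked, len(first.versions), len(second.versions))
```

Because `new_fork` never looks up an existing fork, a third or fourth transcriber costs nothing extra, which matches the idea of not limiting a page to two keyings.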

Reconciling Page Forks: After a page has been transcribed more than once, the application needs to allow the editor to reconcile the transcriptions. This would involve a screen displaying the most recent version of two transcriptions alongside the scanned page image. There is likely a decent Rails plugin already for displaying code diffs, so I could leverage that to highlight differences between the two transcriptions. A fourth pane would allow the editor to paste the reconciled transcription into the master page object.
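
To make the diff idea concrete, here is a small sketch that compares two keyings word by word. It uses Python's standard `difflib` as a stand-in for whatever Rails diff plugin might be chosen; the function name `word_diff` and the sample strings are invented for the example.

```python
import difflib
from typing import List, Tuple


def word_diff(fork_a: str, fork_b: str) -> List[Tuple[str, str, str]]:
    """Return (tag, words_from_a, words_from_b) for each region where the forks disagree."""
    a, b = fork_a.split(), fork_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the disagreements the editor must resolve
    ]


differences = word_diff("Ben went to town today", "Ben went to towne today")
print(differences)  # [('replace', 'town', 'towne')]
```

Diffing on words rather than characters keeps each highlighted region large enough for an editor to judge against the page image.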

Publishing MTurk HITs: Since each page is an independent work unit, it should be possible to automatically convert an untranscribed work into MTurk HITs, with a work item for each page. I don't know enough about how MTurk works, but I assume that the editor would need to enter their Amazon account credentials to have the application create and post the HITs. The app also needs to prevent the same user from re-transcribing the same page in multiple forks.
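
The two requirements in this paragraph can be sketched without touching Amazon's API at all, which is deliberate: nothing below calls Mechanical Turk. The URL pattern, `build_work_items`, and `ForkAssignments` are hypothetical names, and the sketch assumes a page counts as untranscribed when its text is empty.

```python
from typing import Dict, List, Set

FORK_URL = "https://example.org/pages/{page_id}/fork"  # hypothetical route


def build_work_items(pages: Dict[int, str], keyings: int = 2) -> List[dict]:
    """One work item per untranscribed page; `pages` maps page id -> current text."""
    items = []
    for page_id, text in sorted(pages.items()):
        if text.strip():
            continue  # already transcribed, so no work item is needed
        items.append({
            "page_id": page_id,
            "url": FORK_URL.format(page_id=page_id),
            "keyings": keyings,  # how many independent transcriptions are wanted
        })
    return items


class ForkAssignments:
    """Remembers who has worked on which page, so nobody keys the same page twice."""

    def __init__(self) -> None:
        self._workers_by_page: Dict[int, Set[str]] = {}

    def claim(self, page_id: int, worker_id: str) -> bool:
        workers = self._workers_by_page.setdefault(page_id, set())
        if worker_id in workers:
            return False  # this worker already has a fork of this page
        workers.add(worker_id)
        return True


items = build_work_items({1: "", 2: "already done", 3: "  "})
assignments = ForkAssignments()
print([item["page_id"] for item in items])
print(assignments.claim(1, "w1"), assignments.claim(1, "w1"))
```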

In all, it doesn't sound like more than a month or two's worth of work, even performed part-time. This isn't a need I have for the Julia Brumfield diaries, so I don't anticipate building this any time soon. Nevertheless, it's fun to speculate. Thanks, Steve!

Wednesday, March 18, 2009

Progress Report: Page Thumbnails and Sensitive Tags

As anyone reading this blog through the Blogspot website knows, visual design is not one of my strengths. One of the challenges that users have with FromThePage is navigation. It's not apparent from the single-page screen that clicking on a work title will show you a list of pages. It's even less obvious from the multi-page work reading screen that the page images are accessible at all on the website.

Last week, I implemented a suggestion I'd received from my friend Dave McClinton. The work reading screen now includes a thumbnail image of each manuscript page beside the transcription of that page. The thumbnail is a clickable link to the full screen view of the page and its transcription. This should certainly improve the site's navigability, and I think it also increases FromThePage's visual appeal.

I tried a different approach to processing the images from the one I'd used before. For transcribable page images, I had modified the images offline through a batch process, then transferred them to the application, which serves them statically. The only dynamic image processing the FromThePage software did for end users was for zoom. This time, I added a hook to the image link code, so that if the browser requests a thumbnail, the application generates it on the fly. This turned out to be no harder to code than a batch process, and the deployment was far easier. I haven't seen a single broken thumbnail image yet, so it looks like it's fairly robust, too.
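
A rough sketch of the generate-on-request pattern, in Python for illustration only. The `_thumb` naming scheme, the `thumbnail_for` function, and the reuse of an existing thumbnail file are assumptions, not a description of the actual hook. The real resizing is left to a caller-supplied function, so the example runs without an imaging library; here a plain file copy stands in for it.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable

Resizer = Callable[[Path, Path], None]  # (source image, destination thumbnail)


def thumbnail_for(image: Path, resize: Resizer) -> Path:
    """Return the thumbnail path for `image`, creating the file on first request."""
    thumb = image.with_name(f"{image.stem}_thumb{image.suffix}")
    if not thumb.exists():
        resize(image, thumb)  # only runs the first time this thumbnail is asked for
    return thumb


calls = []


def fake_resize(src: Path, dst: Path) -> None:
    # Stand-in for real image scaling: records the call and copies the bytes.
    calls.append(src.name)
    shutil.copyfile(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page_0012.jpg"
    page.write_bytes(b"not really a JPEG")
    first = thumbnail_for(page, fake_resize)
    second = thumbnail_for(page, fake_resize)
    generated_once = first == second and len(calls) == 1

print(first.name, generated_once)
```

The deployment advantage mentioned above shows up here too: there is no separate batch step to remember, because the link code itself fills in any thumbnail that is missing.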

The other new feature I added last week was support for sensitive tags. The support is still fairly primitive -- enclose text in sensitive tags and it will only be displayed to users authorized to transcribe the work -- but it gets the job done and solves some issues that had come up with Julia Brumfield's 1919 diary. Happily, this took less than 10 minutes to implement.
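
The filtering itself can be very small, which fits the ten-minute estimate. The sketch below assumes the markup looks like `<sensitive>...</sensitive>`; the exact tag syntax is a guess, and `render` is a hypothetical function name.

```python
import re

# Assumed tag syntax; the real markup may differ.
SENSITIVE = re.compile(r"<sensitive>(.*?)</sensitive>", re.DOTALL)


def render(text: str, authorized: bool) -> str:
    """Show sensitive passages only to users authorized to transcribe the work."""
    if authorized:
        return SENSITIVE.sub(r"\1", text)  # drop the tags, keep the enclosed text
    return SENSITIVE.sub("", text)         # drop the tags and the enclosed text


sample = "Public text. <sensitive>Private remark.</sensitive> More public text."
print(render(sample, authorized=True))
print(render(sample, authorized=False))
```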

Sunday, March 15, 2009

Feature: Editorial Toolkit

I'm pleased to report that my cousin Linda Tucker has finished transcribing the 1919 diary. I've been trying my best to keep up with her speed, but she's able to transcribe two pages in the amount of time it takes me to edit and annotate a single, simple page. If the editing work requires more extensive research, or (worse) reveals the need to re-do several previous pages, there is really no contest. In the course of this intensive editing, I've come up with a few ideas for new features, as well as a few observations on existing features.

Show All Pages Mentioning a Subject

Currently, the article page for each subject shows a list of the pages on which the subject is mentioned. This is pretty useful, but it really doesn't serve the purposes of the reader or editor who wants to read every mention of that subject, in context. In particular, after adding links to 300 diary pages, I realized that "Paul" might be either Paul Bennett, Julia's 20-year-old grandson who is making a crop on the farm, or Paul Smith, Julia's 7-year-old grandson who lives a mile away from her and visits frequently. Determining which Paul was which was pretty easy from the context, but navigating the application to each of those 100-odd pages took several hours.

Based on this experience, I intend to add a new way of filtering the multi-page view, which would display all the transcriptions of all pages that mention a subject. I've already partially developed this as a way to filter the pages within a work, but I really need to 1) see mentions across works, and 2) make this accessible from the subject article page. I am embarrassed to admit that the existing work-filtering feature is so hard to find that I'd forgotten it even existed.
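
As a sketch of the cross-work filter, assuming subjects are marked with the `[[Subject|display text]]` wiki-link syntax and that each work is simply an ordered list of page texts. The function names, the work titles, and the page texts are invented for the example.

```python
import re
from typing import Dict, List, Set, Tuple

# Matches [[Subject]] and [[Subject|display text]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def subjects_on(text: str) -> Set[str]:
    """Titles of every subject linked on a page."""
    return {match.group(1).strip() for match in LINK.finditer(text)}


def pages_mentioning(subject: str, works: Dict[str, List[str]]) -> List[Tuple[str, int, str]]:
    """(work title, page number, page text) for every page that links to `subject`."""
    hits = []
    for work, pages in works.items():
        for number, text in enumerate(pages, start=1):
            if subject in subjects_on(text):
                hits.append((work, number, text))
    return hits


# Invented page texts, not quotations from the diaries.
works = {
    "Work A": ["[[Paul Bennett|Paul]] plowed.", "Rain all day."],
    "Work B": ["[[Paul Smith|Paul]] visited.", "[[Paul Bennett|Paul]] went to town."],
}
hits = pages_mentioning("Paul Bennett", works)
print([(work, number) for work, number, _ in hits])
```

Returning the page text along with its location is what lets every mention be read in context on one screen instead of visiting each page in turn.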

Autolink

The Autolink feature has proven invaluable. I originally developed it to save myself the bother of typing [[Benjamin Franklin Brumfield, Sr.|Ben]] every time Julia mentioned "Ben". However, it's proven especially useful as a way of maintaining editorial consistency. If I decide that "bathing babies" is worth an index entry on one page, I may not remember that decision 100 pages later. However, if Autolink suggests [[bathing babies]] when it sees the string "bathed the baby", I'll be reminded of that. It doesn't catch every instance, but for subjects that tend to cluster (like occurrences of newborns), it really helps out.
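
One way such a suggester could work, offered as a guess rather than a description of the real Autolink: learn the display text of links that already exist, then look for the same strings in a new transcription. The functions `learn` and `suggest` are hypothetical, and the two "already linked" page texts are made up around the examples in the paragraph above.

```python
import re
from typing import Dict, Iterable, List, Tuple

LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def learn(linked_pages: Iterable[str]) -> Dict[str, str]:
    """Map each display text already used in a link (lowercased) to its subject title."""
    known = {}
    for text in linked_pages:
        for match in LINK.finditer(text):
            title = match.group(1).strip()
            display = (match.group(2) or title).strip()
            known[display.lower()] = title
    return known


def suggest(text: str, known: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """(offset, matched text, suggested subject) for each known phrase found in `text`."""
    found = []
    for display, title in known.items():
        # Lookarounds keep "ben" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(display) + r"(?!\w)"
        for match in re.finditer(pattern, text, re.IGNORECASE):
            found.append((match.start(), match.group(0), title))
    return sorted(found)


known = learn([
    "[[Benjamin Franklin Brumfield, Sr.|Ben]] came by.",
    "I [[bathing babies|bathed the baby]].",
])
suggestions = suggest("Ben bathed the baby", known)
for offset, matched, title in suggestions:
    print(f"{offset}: [[{title}|{matched}]]")
```

Matching on exact display strings explains both the strength and the weakness noted above: it reliably recalls an earlier editorial decision, but a differently worded mention slips past it.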

Full Text Search

Currently there is no text search feature. Implementing one would be pretty straightforward, but in addition to that I'd like to hook in the Autolink suggester. In particular, I'd like to scan through pages I've already edited to see if I missed mentions of indexed subjects. This would be especially helpful when I decide that a subject is noteworthy halfway through editing a work.
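
A sketch of the "did I miss any mentions" scan, under the assumption that links use the `[[Subject|display text]]` syntax. Existing links are blanked out first, so that only unlinked occurrences are reported. `missed_mentions`, the phrase table, and the page texts are invented for the example.

```python
import re
from typing import Dict, List, Tuple

LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def missed_mentions(pages: Dict[int, str], phrases: Dict[str, str]) -> List[Tuple[int, str]]:
    """(page number, subject) wherever a known phrase appears outside any link.

    `phrases` maps a lowercase search phrase to the subject it should link to.
    """
    report = []
    for number, text in sorted(pages.items()):
        # Replace each existing link with spaces so its contents cannot match.
        unlinked = LINK.sub(lambda match: " " * len(match.group(0)), text)
        for phrase, title in phrases.items():
            pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
            if re.search(pattern, unlinked, re.IGNORECASE):
                report.append((number, title))
    return report


pages = {
    1: "I [[bathing babies|bathed the baby]] today.",
    2: "Bathed the baby again.",
    3: "Rain all day.",
}
report = missed_mentions(pages, {"bathed the baby": "bathing babies"})
print(report)  # [(2, 'bathing babies')]
```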

Unannotated Page List

This is more a matter of work flow management, but I really don't have a good way to find out which pages have been transcribed but not edited or linked. It's really hard to figure out where to resume my editing.

[Update: While this blog post was in draft, I added a status indicator to the table of contents screen to flag pages with transcriptions but no subject links.]
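
The rule behind that status indicator can be stated in a few lines. This is an illustrative sketch only: the status names and both function names are made up, and a page is treated as annotated as soon as it contains one `[[...]]` link.

```python
import re
from typing import Dict, List

ANY_LINK = re.compile(r"\[\[.+?\]\]")


def page_status(text: str) -> str:
    """Classify a page as 'blank', 'transcribed' (no links yet), or 'annotated'."""
    if not text.strip():
        return "blank"
    return "annotated" if ANY_LINK.search(text) else "transcribed"


def needs_annotation(pages: Dict[int, str]) -> List[int]:
    """Page numbers that have a transcription but no subject links."""
    return [number for number, text in sorted(pages.items())
            if page_status(text) == "transcribed"]


todo = needs_annotation({1: "[[Ben]] came by.", 2: "Rain all day.", 3: "", 4: "Cold."})
print(todo)  # [2, 4]
```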

Dual Subject Graphs/Searches

Identifying names is especially difficult when the only evidence is the text itself. In some cases I've been able to use subject graphs to search for relationships between unknown and identified people. This might be much easier if I could filter either my subject graphs or the page display to see all occurrences of subjects X and Y on the same page.
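
If the application already knows which pages mention each subject, the dual-subject filter reduces to a set intersection. The index layout, the function name, and the page numbers below are assumptions made for the sketch.

```python
from typing import Dict, List, Set


def pages_with_both(index: Dict[str, Set[int]], x: str, y: str) -> List[int]:
    """Pages on which subjects `x` and `y` are both mentioned."""
    return sorted(index.get(x, set()) & index.get(y, set()))


# Hypothetical index: subject title -> page numbers where it is linked.
index = {
    "Paul Bennett": {3, 7, 12, 20},
    "Ben": {1, 7, 20, 31},
}
print(pages_with_both(index, "Paul Bennett", "Ben"))  # [7, 20]
```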

Research Credits

Now that the Julia Brumfield Diaries are public, suggestions, corrections, and research are pouring in. My aunt has telephoned old-timers to ask what "rebulking tobacco" refers to. A great-uncle has emailed with definitions of more terms, and I've had other conversations via email and telephone identifying some of the people mentioned in the text. To my horror, I find that I've got no way to attribute any of this information to those sources. At minimum, I need a large HTML acknowledgments field at the collection level. Ideally, I'd figure out an easy-to-use way to attribute article comments to individual sources.
