Monday, December 21, 2009

Feature: Related Pages

I've been thinking a lot about page-to-subject links lately as I edit and annotate Julia Brumfield's 1921 diary. While I've been able to exploit the links data structure in editing, printing, analyzing and displaying the texts, I really haven't viewed it as a way to navigate from one manuscript page to another. In fact, the linkages I've made between pages have been pretty boring -- next/previous page links and a table of contents are the limit. I'm using the page-to-subject links to connect subjects to each other, so why not pages?

The obvious answer is that the subjects which page A would have most in common with page B are the same ones it would have in common with nearly every other page in the collection. In the corpus I'm working with, the diarist mentions her son and daughter-in-law in 95% of pages, for the simple reason that she lives with them. If I choose two pages at random, I find that March 12, 1921 and August 12, 1919 both contain Ben and Jim doing agricultural work, Josie doing domestic work, and Julia's near-daily visit to Marvin's. The two pages are connected through those four subjects (as well as the similarly disappointing "dinner"), but not in a way that is at all meaningful. So I decided that a page-to-page relatedness tool couldn't be built from the page-to-subject link data.

All that changed two weeks ago, when I was editing the 1921 diary and came across the mention of a "musick box". In trying to figure out whether or not Julia was referring to a phonograph by the term, I discovered that the string "musick box" occurred only two times: when the phonograph was ordered and the first time Julia heard it played. Each of these mentions shed so much light on the other that I was forced to re-evaluate how pages are connected through subjects. In particular, I was reminded of the "you and one other" recommendations that LibraryThing offers. This is a feature that finds other users with whom you share an obscure book. In this case, obscurity is defined as the book occurring only twice in the system: once in your library, once in the other user's.

This would be a relatively easy feature to implement in FromThePage. When displaying a page, perform this algorithm:
  • For each subject link in the page, calculate how many times that subject is referenced within the collection,
  • Sort those subjects by reference count,
  • Take the 3 or 4 subjects with the lowest reference count, and
  • Display the other pages which link to those subjects.
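The steps above can be sketched in a few lines of Python. This is only an illustration, not FromThePage code: the function name related_pages, the page_subjects dictionary and the page ids are all made up for the example, and "reference count" is taken to mean the number of pages that link to a subject. The sketch also assumes that a subject mentioned on no other page should be skipped, since it would have nothing to display.

```python
from collections import defaultdict


def related_pages(page_id, page_subjects, max_subjects=3):
    """Map the rarest subjects on page_id to the other pages that mention them.

    page_subjects is a hypothetical stand-in for the page-to-subject links:
    a dict of page id -> list of subject names linked from that page.
    """
    # Invert the page-to-subject links into subject -> set of pages.
    subject_pages = defaultdict(set)
    for pid, subjects in page_subjects.items():
        for subject in subjects:
            subject_pages[subject].add(pid)

    # Sort this page's subjects by reference count, rarest first
    # (ties are broken alphabetically so the result is stable).
    ranked = sorted(
        set(page_subjects[page_id]),
        key=lambda s: (len(subject_pages[s]), s),
    )

    related = {}
    for subject in ranked:
        others = sorted(subject_pages[subject] - {page_id})
        if others:  # assumption: skip subjects found only on this page
            related[subject] = others
        if len(related) == max_subjects:
            break
    return related


# Made-up sample data: three pages, identified by placeholder ids.
sample = {
    "page_a": ["Ben", "Josie", "musick box"],
    "page_b": ["Ben", "Josie", "dinner"],
    "page_c": ["Ben", "Josie", "musick box"],
}
print(related_pages("page_a", sample, max_subjects=1))
```

With this sample, "musick box" is the rarest subject on page_a, so it is the one that leads to page_c, while the ever-present Ben and Josie sort to the bottom of the list.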
For a really useful experience, I'd want to display keyword-in-context, showing a few words to explain the context in which that other occurrence of "musick box" appears.
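A keyword-in-context display could be approximated along these lines. Again this is a rough sketch under my own assumptions: keyword_in_context is a hypothetical helper, the page text is treated as plain whitespace-separated words, matching ignores case and surrounding punctuation, and the sample sentence is invented rather than quoted from the diary.

```python
def keyword_in_context(text, keyword, width=4):
    """Return one snippet per occurrence of keyword in text,
    with up to `width` words of context on each side."""
    words = text.split()
    key = keyword.lower().split()
    if not key:
        return []
    n = len(key)
    snippets = []
    for i in range(len(words) - n + 1):
        # Compare case-insensitively, ignoring punctuation around each word.
        window = [w.strip(".,;:!?\"'()").lower() for w in words[i:i + n]]
        if window == key:
            start = max(0, i - width)
            end = min(len(words), i + n + width)
            snippets.append(" ".join(words[start:end]))
    return snippets


# Invented sample text, not a transcription.
sample_text = "Ben went to town and ordered a musick box for the house."
for snippet in keyword_in_context(sample_text, "musick box", width=2):
    print(snippet)
```

Each related page would then be listed with its snippet next to the link, so the reader can judge whether the other mention is worth following.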