Saturday, May 17, 2008

Progress Report: De-duping catastrophe and a host change

After a very difficult ten days of coding, I'm almost where I was at the beginning of May. The story:

Early in the month, I got a duplicate-identification feature coded. The UI was based on LibraryThing's, which is the best de-duping interface I've ever seen. Mine still falls short, but it can pair "Ren Worsham" with "Wren Worsham", so it'll probably do for now. With that completed, I built a tool to combine subjects: if you see a possible duplicate of the subject you're viewing, you click Combine, and the tool updates all textual references to the duplicate to point to the main article, then deletes the duplicate. Pretty simple, right?
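
In rough outline, the combine operation amounts to something like the sketch below. This is a hypothetical illustration, not FromThePage's actual code; the Article and Link model names and the article_id column are stand-ins for whatever the real schema uses:

    # Hypothetical sketch of the combine tool (not the actual FromThePage code).
    # Assumes an Article model for subjects and a Link model joining page text
    # to the article it references.
    class Article < ActiveRecord::Base
      has_many :links

      # Fold +duplicate+ into this article: repoint every textual reference
      # at the surviving record, then delete the duplicate.
      def combine!(duplicate)
        duplicate.links.each do |link|
          link.update_attribute(:article_id, self.id)
        end
        duplicate.destroy
      end
    end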

Enter DreamHost's process killer. Now, I love DreamHost, and I want to love them more, but I really don't think their cheap shared hosting plan is appropriate for a computationally intensive web app. To insulate users from other, potentially clueless users, they run a daemon that monitors running processes and kills off any that look "bad". I'm not sure what criteria constitute "bad", but I should have realized the heuristic might be over-aggressive when I couldn't run basic database migrations without running afoul of it. Still, it didn't seem to cause anything worse than the occasional "Rails Application failed to start" message, which a browser reload would clear.

However. Killing a de-duping process in the middle of its reference updates is altogether different from killing a relatedness-graph display. Unfortunately, I didn't become aware of the problem until I'd already tried to de-dup several records, some of them more than once. My app assumes its data is internally consistent, so my attempts to clean up the carnage created hundreds more duplicates.
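
In hindsight, one way to guard against a killed process stranding half-repointed references (not something the original combine tool did) would be to wrap the whole operation in a database transaction, so an interrupted run rolls back instead of leaving the data inconsistent. Continuing the hypothetical sketch above, and assuming the underlying tables actually support transactions:

    # Hedged sketch: the same hypothetical combine! wrapped in a transaction,
    # so that if the process is killed mid-way the reference updates roll back
    # rather than leaving some links pointing at a deleted duplicate.
    def combine!(duplicate)
      Article.transaction do
        duplicate.links.each { |link| link.update_attribute(:article_id, self.id) }
        duplicate.destroy
      end
    end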

So I've moved FromThePage from DreamHost to HostingRails, a migration I finished this morning. There's still a lot of back-end work left to clean up the data, but I'm pretty sure I'll get there before THATCamp.