Transcribing Cursive Handwriting

 

A cache of 365 handwritten cursive letters is not an easy dataset to digitize. Ultimately, I enlisted friends and family with strong typing skills to help me manually transcribe as many letters as we could and then used Amazon Mechanical Turk for another hundred or so. Typing the letters in-house was a last resort and took around 10 to 30 minutes apiece. Here are the other avenues I tried before settling largely on the manual route.

Transcription from Drafts

Attempt 1: Dictation

Bottom line: Not as time-efficient as just typing it myself

My first instinct was to dictate the letters to an app on my phone or computer. Testing both Drafts (free iOS app) and Dragon Anywhere (free trial) revealed that while they were excellent dictation programs, they did not meet my particular needs for this project. I can type somewhere north of 100 words per minute when I'm rocking and rolling. Dictation needed to be faster than that.

Reading the letters aloud took a while. I had to state each punctuation mark (Grandma Jones used a lot of it!) and sometimes paused for a long time while I deciphered a word or name. Then I exported the file to my computer and spent around 15 minutes correcting small errors, which entailed reading the entire letter again and comparing the digital and physical versions. Drafts added an extra layer of challenge: after each minute of dictation, the output seemed to blink. It missed a few words of input, inserted === as a separator, and started a new line.

The proofreading process was crucial because details matter for natural language processing. When "Mutt" (my grandfather's nickname) is transcribed as "Matt", that's a big problem. Ultimately, this approach didn't save me any time, so I ditched it.

Several OCR attempts

Attempt 2: Optical Character Recognition

Bottom line: Cannot parse cursive handwriting

Optical character recognition, or OCR, is the process of converting images of text (hard copy, scans, photographs, etc.) into digital, machine-readable text. Current OCR technology has advanced rapidly but is still designed to read standardized fonts where each letter can be distinguished as a separate item. OCR is not designed for the slanted, swirling, connected mayhem of cursive handwriting; it would rather read license plates.

I wanted to cover my bases, so I tested a few OCR options anyway. They all failed spectacularly. The website free-ocr.com took a full page of text and returned this nonsense. The text recognition feature in Microsoft OneNote, which is accessible by right-clicking an image, returned a small amount of an unknown character-based language for one page of a letter and a large amount of Arabic for another page. 

An emerging branch of OCR known as Intelligent Character Recognition, or ICR, uses neural networks and machine learning to parse handwritten print or cursive. It still does best with form-like input where each character is written in its own box. At the time of this project, there were no publicly available ICR tools which could parse my grandmother's cursive.

Results from MTurk

Attempt 3: Outsourcing to Strangers

Bottom line: Not affordable at scale

My last attempt before resigning myself to doing the transcription in-house was to pay strangers to do it. I opted for Amazon Mechanical Turk, an online marketplace where anyone can post Human Intelligence Tasks ("HITs") for strangers to complete. HITs are tasks that are straightforward for a human but very difficult for a machine, such as identifying text or objects in images, checking entries in a dataset, and conducting quick research online.

I posted one letter as a pilot and was satisfied enough with the quality I got back that I eventually posted another batch of 107 letters of similar length (around 800 to 1,000 words). I offered $2.00 per letter and was charged an exorbitant 20% fee by Amazon, plus an extra 5% to get a "Master" level worker, bringing the cost per letter to $2.50. All letters but one were completed successfully.
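The per-letter figure is just the worker reward plus the platform surcharge. Here is a minimal sketch of that arithmetic, assuming a combined surcharge of 25% (the rate implied by going from $2.00 to $2.50). The function and variable names are illustrative, not anything MTurk provides.

```python
from decimal import Decimal


def cost_per_letter(reward, surcharge_rate):
    """Return the total charge for one task: worker reward plus surcharge."""
    return reward * (Decimal("1") + surcharge_rate)


reward = Decimal("2.00")              # what the worker is offered per letter
combined_surcharge = Decimal("0.25")  # assumption: implied by $2.00 -> $2.50

per_letter = cost_per_letter(reward, combined_surcharge)
print(f"${per_letter:.2f} per letter")
```

Decimal is used instead of float so that currency amounts stay exact when the rates change.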

I still have mixed feelings about the platform; the whole MTurk process was eye-opening and deserves a full post of its own. But 48 hours and $253.50 later, I had tripled the size of my dataset, so I really can't argue with the results.

Final Decision

So what's the best option for a burgeoning data scientist who is both cash-strapped and under a tight deadline? As usual, it was a hybrid. I got as many letters as I could transcribed for free, outsourced a big batch, and am leaving the rest for another day. The tally as of mid-June 2018 is:

Transcriber       # Letters   Cost
Kelly (myself)    57          $0
Friends/Family    12          $0
Amazon MTurk      107         $253.50
Remaining         189         TBD
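
As a quick sanity check on the tally, the sketch below confirms that the rows add up to the full cache of 365 letters and estimates what outsourcing the rest might cost. The $2.50 rate for the remaining letters is a hypothetical assumption carried over from the MTurk batch, not a quoted price.

```python
from decimal import Decimal

# Letter counts copied from the tally above.
tally = {
    "Kelly (myself)": 57,
    "Friends/Family": 12,
    "Amazon MTurk": 107,
    "Remaining": 189,
}

total_letters = sum(tally.values())
transcribed = total_letters - tally["Remaining"]

# Assumption: remaining letters would cost the same all-in rate per letter.
assumed_rate = Decimal("2.50")
projected_cost = tally["Remaining"] * assumed_rate

print(f"{transcribed} of {total_letters} letters transcribed")
print(f"Projected cost to outsource the rest: ${projected_cost:.2f}")
```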