If you are digitising text from an existing text-based document, first decide what you need from the final result. Identify whether you want simply to capture the information digitally, or to represent the document digitally as it was originally created, capturing layout, fonts, paper texture, and other physical aspects. This choice will greatly influence your strategy: the first lends itself to transcription and optical character recognition (OCR), while the second calls for digital image scans.
OCR is most useful for transcription projects, where it provides a base of raw text that can then be corrected and checked by hand. OCR is also used to make image-based text searchable, as with the text behind the Papers Past website.
Most document scanners come with software that supports OCR. This software is of variable quality, and even the best OCR techniques require some degree of correction and re-formatting. The highest success rate is likely with plain pages of typed text from books or business documents. Handwriting is generally still too difficult to decipher, especially older script-based styles. Text in publications with more complex layouts (such as newspapers and magazines) may require the separation of text elements (such as headings, columns, and advertisements) before recognition, or software that can interpret them accurately. Deskewing crooked scans, cleaning up print artefacts and stains, and training the software to recognise uncommon words can help increase accuracy to around 99% in some circumstances. Grayscale and full-colour images tend to scan as accurately as each other, while bitonal images tend to be less accurate.
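An accuracy figure such as "around 99%" is usually measured by comparing the raw OCR output against a hand-corrected version of the same passage. The Python sketch below shows one common way to do this, using character-level edit distance. It is an illustration only: the function names are hypothetical, and OCR software may calculate its own accuracy figures differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, corrected_text: str) -> float:
    """Share of the corrected text that the OCR output reproduced,
    expressed as a value between 0.0 and 1.0."""
    if not corrected_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, corrected_text)
    return max(0.0, 1.0 - errors / len(corrected_text))


if __name__ == "__main__":
    # Hypothetical sample: the OCR misread "h" as "n" in the first word.
    score = character_accuracy("Tne quick brown fox", "The quick brown fox")
    print(f"Character accuracy: {score:.1%}")
```

Checking a sample of pages this way before and after clean-up steps gives a rough sense of whether those steps are worth the effort for a particular collection.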
Full transcription (using OCR as a base) effectively creates a new edition of a document. Transcription can be time-consuming, but it offers the greatest ability to repurpose a digitised document and keep it in a usable format long-term. The New Zealand Electronic Text Centre has taken this approach, encoding their text using the open TEI (Text Encoding Initiative) standard. This allows the creation of structured text that can be reformatted into almost any format and remain usable. It also allows one set of text to be presented in multiple places, such as a webpage, a PDF document or an e-book. Transcription is not a complete replacement for an original text document in most cases, but has high value when the information being transcribed is itself of value.
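To give a sense of what structured text looks like in practice, the Python sketch below wraps corrected paragraphs in a minimal TEI document using only the standard library. It is a simplified illustration, not the New Zealand Electronic Text Centre's actual workflow: the function name and the placeholder header text are assumptions, and a real project would record far richer metadata and markup.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # namespace used by TEI P5 documents


def paragraphs_to_tei(title, paragraphs,
                      source_note="Transcribed from a printed original."):
    """Return a minimal TEI document, as a string, containing the
    given title and one <p> element per paragraph."""
    # Serialise TEI as the default namespace (no prefix on element names).
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        node.text = text
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")

    # The header describes the electronic file and its source.
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    # Placeholder statement (an assumption for this sketch).
    el(el(file_desc, "publicationStmt"), "p", "Unpublished working transcription.")
    el(el(file_desc, "sourceDesc"), "p", source_note)

    # The body holds the transcribed text itself.
    body = el(el(root, "text"), "body")
    for paragraph in paragraphs:
        el(body, "p", paragraph)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(paragraphs_to_tei("Sample letter", ["Dear Sir,", "Yours faithfully,"]))
```

Because the text is held separately from any one presentation, the same file can later be transformed into HTML, PDF or e-book formats.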
The main considerations for digital scanning of text are:
When creating new text-based documents digitally, it's important to consider which format will be best for your purposes. Although it is a long-standing industry standard, the Microsoft Word .DOC format is a proprietary binary format rather than an open standard. Microsoft's newer Office Open XML standard, which is used in their .DOCX format, is an open ISO standard, but is reputedly difficult to implement fully. Two alternatives are the Open Document Format (ODF), a standard set by OASIS (Organization for the Advancement of Structured Information Standards) that is widely supported outside of Microsoft products, and Adobe's PDF, a fixed-layout format that became an open ISO standard in 2008. All have some limitations in terms of longevity, although the Open Document Format is the most flexible in terms of interoperability between different software products. The 2010 version of Microsoft Office and the free OpenOffice.org suite both support ODF natively.
If writing for the web, consider compatibility between different web browsers and operating systems. The most compatible and accessible way to deliver text on the web is as simply formatted HTML, with downloadable DOC, ODF and PDF versions offered alongside.
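As a rough illustration of this approach, the Python sketch below builds a plain HTML page from a list of paragraphs and adds a list of download links. The function name, file names and labels are hypothetical, and most sites would produce the same result through their content management system rather than a script.

```python
from html import escape


def render_page(title, paragraphs, downloads):
    """Build a simply formatted HTML page.

    downloads is a list of (label, href) pairs, one per downloadable
    version of the document.
    """
    # Escape all text so characters such as < and & display correctly.
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    links = "\n".join(
        f'      <li><a href="{escape(href, quote=True)}">{escape(label)}</a></li>'
        for label, href in downloads
    )
    return f"""<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>{escape(title)}</title>
  </head>
  <body>
    <h1>{escape(title)}</h1>
{body}
    <ul>
{links}
    </ul>
  </body>
</html>
"""


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(render_page(
        "Annual report",
        ["This is the text of the report."],
        [("Download as ODF", "report.odt"), ("Download as PDF", "report.pdf")],
    ))
```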