Creating digital text

This page includes guidance about digitisation techniques and formats for creating new text-based documents.

Identifying how text can be used

If you are digitising text from an existing text-based document, consider what you want the final result to be. Identify whether you simply want to capture the information digitally, or to represent the text digitally in the way it was originally created, capturing layout, fonts, paper texture, and other aspects of the document. This will greatly influence your strategy, as the first lends itself to transcription and optical character recognition, while the second suggests digital image scans.

Digitisation techniques

Optical Character Recognition (OCR)

OCR is most useful for transcription projects, where it forms a base of raw text that can then be corrected and checked by hand. OCR is also used to make image-based text searchable, as on the Papers Past website.

Most document scanners come with software that supports OCR. This software is of variable quality, and even the best OCR techniques require some degree of correction and re-formatting. The highest success rate is likely to be with plain pages of typed text from books or business documents. Handwriting is generally still too difficult to decipher, especially older script-based styles. Text in publications with more complex layouts (such as newspapers and magazines) may require the physical separation of text elements (such as headings, columns, and advertisements) or software that can interpret them accurately. De-skewing poorly printed text and eliminating print artefacts and stains, along with training the software to recognise uncommon words, can help increase accuracy to around 99% in some circumstances. Greyscale and full-colour images tend to be recognised as accurately as each other, while bitonal images tend to be less accurate.
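To make an accuracy figure such as 99% concrete, the following is a minimal sketch of one common way to measure it: compare the raw OCR output against a hand-corrected transcription and count the character-level edits needed to turn one into the other. The function names are hypothetical, and this measure is an assumption for illustration; it is not necessarily how any particular OCR product reports accuracy.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, corrected_text: str) -> float:
    """Fraction of characters in the corrected text that the OCR got right,
    estimated as 1 - (edits / length of corrected text)."""
    if not corrected_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, corrected_text)
    return max(0.0, 1.0 - errors / len(corrected_text))


# Illustrative input: one wrong character in a seven-character line.
accuracy = character_accuracy("Tbe cat", "The cat")
print(f"{accuracy:.1%}")
```

By this measure, a page of 2,000 characters recognised at 99% accuracy still contains about 20 errors, which is why hand correction remains part of the process.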

Transcription and Mark-up

Full transcription (using OCR as a base) effectively creates a new edition of a document. Transcription can be time-consuming, but it gives the greatest ability to repurpose a digitised document and keep it in a usable format long-term. The New Zealand Electronic Text Centre has taken this approach, encoding its text using the open TEI standard. This allows the creation of structured text that can be reformatted into almost any format and remain usable. It also allows one set of text to be presented in multiple places, such as a webpage, a PDF document or an e-book. Transcription is not a complete replacement for an original text document in most cases, but has high value when the information being transcribed is itself of value.
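The idea of one structured source feeding several output formats can be sketched as follows. The fragment below borrows a few TEI element names (text, body, div, head, p) but is deliberately simplified: it omits the namespace and header that a real TEI document requires, and the two converter functions are hypothetical illustrations, not part of any TEI toolkit.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified, TEI-like sample. A real TEI file also needs a teiHeader
# and the TEI namespace; both are left out here to keep the sketch short.
SAMPLE = """<text>
  <body>
    <div>
      <head>Chapter One</head>
      <p>It was a dark &amp; stormy night.</p>
      <p>The rain fell in torrents.</p>
    </div>
  </body>
</text>"""


def element_text(element: ET.Element) -> str:
    """Collect all text inside an element, including any inline children."""
    return "".join(element.itertext())


def to_html(root: ET.Element) -> str:
    """Render headings and paragraphs as simple HTML."""
    parts = []
    for element in root.iter():  # document order
        if element.tag == "head":
            parts.append(f"<h1>{escape(element_text(element))}</h1>")
        elif element.tag == "p":
            parts.append(f"<p>{escape(element_text(element))}</p>")
    return "\n".join(parts)


def to_plain_text(root: ET.Element) -> str:
    """Render the same structure as plain text with upper-case headings."""
    parts = []
    for element in root.iter():
        if element.tag == "head":
            parts.append(element_text(element).upper())
        elif element.tag == "p":
            parts.append(element_text(element))
    return "\n\n".join(parts)


root = ET.fromstring(SAMPLE)
print(to_html(root))
print(to_plain_text(root))
```

Because the structure (what is a heading, what is a paragraph) is recorded separately from the presentation, a further output format only needs another small converter; the transcription itself is not touched.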

Digital scanning of text

The main considerations for digital scanning of text are:

The size of the original page. Most flatbed scanners are no bigger than A3, so larger originals may require other techniques, such as a map scanner or an overhead copy stand with a high-megapixel camera.
Capturing the smallest significant level of detail in an image. The resolution setting for scanning or photographing text cannot be determined by the size of the page. The smallest character elements (such as commas and full stops) need to be viewable in detail in order to preserve readability of the overall document.
The angle of the camera or scanner to the page. Some scanning techniques are highly destructive, requiring dis-binding or cutting of pages to achieve an image that is not skewed or distorted.
Colouration of the pages that may make text illegible. Most paper darkens with age due to the acids in the paper. This may result in text of very low contrast, requiring manipulation of the image through software to ensure the text is readable, and a scanner that is capable of resolving low-contrast detail.
Quality checking the pages scanned. Ensuring that the pages captured are not blurred, are in the proper sequence, and that none are missing is an essential part of the text-scanning process.
Having a delivery mechanism for the resulting images. Making hundreds or even thousands of pages of text digital is only useful if the pages can be searched, retrieved, and made sense of. Many digitisation projects have struck trouble when they have had nowhere to host their pages. Sites like the Internet Archive may be of use in these kinds of circumstances.
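The point above about capturing the smallest significant detail can be turned into a rough calculation. The sketch below estimates a scanning resolution from the size of the smallest character element rather than from the size of the page. The function names and the number of pixels wanted across the smallest feature are assumptions chosen for illustration; the guidance above does not prescribe a figure, so choose one that suits your own material.

```python
MM_PER_INCH = 25.4


def required_ppi(smallest_feature_mm: float, pixels_across_feature: int) -> float:
    """Pixels per inch needed so that the smallest feature (for example the
    dot of a full stop) spans the requested number of pixels."""
    if smallest_feature_mm <= 0 or pixels_across_feature <= 0:
        raise ValueError("feature size and pixel count must be positive")
    return pixels_across_feature * MM_PER_INCH / smallest_feature_mm


def image_size_pixels(page_width_mm: float, page_height_mm: float,
                      ppi: float) -> tuple[int, int]:
    """Resulting image dimensions for a page scanned at the given resolution."""
    width = round(page_width_mm / MM_PER_INCH * ppi)
    height = round(page_height_mm / MM_PER_INCH * ppi)
    return width, height


# Hypothetical inputs: a 0.3 mm full stop that should span 4 pixels,
# on an A4 page (210 mm x 297 mm).
ppi = required_ppi(smallest_feature_mm=0.3, pixels_across_feature=4)
print(round(ppi), image_size_pixels(210, 297, ppi))
```

Note that the page size only affects how large the resulting image is. Two pages of the same size can need quite different resolutions if one is set in much smaller type than the other.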

Standards for creating new text-based documents

When creating new text-based documents digitally, it's important to consider which format will be best for your purposes. Although it is a long-standing de facto industry standard, the Microsoft Word .DOC format is a proprietary binary format rather than an open standard. Microsoft's newer Office Open XML standard, which is used in its .DOCX format, is an open ISO standard, but is reputedly difficult to implement fully. Two alternatives are the Open Document Format (ODF), a standard set by OASIS (Organization for the Advancement of Structured Information Standards) that is widely supported outside of Microsoft products, and Adobe's PDF format, which became an open ISO standard in 2008. All have some limitations in terms of longevity, although the Open Document Format is the most flexible in terms of interoperability between different software products. The 2010 version of Microsoft Office and the free OpenOffice suite both natively support ODF.

If writing for the web, consider compatibility between different web browsers and operating systems. The most compatible and accessible way to deliver text on the web is as simply formatted HTML, with downloadable DOC, ODF and PDF versions alongside.