Content / Text Conversion
Your collection becomes richer and more discoverable when information is pulled from your image files. Text can be produced from content in the images, by ascribing descriptive metadata, or by linking images with your analogue or digital finding aids or other databased information. Creating searchable or editable text in this way increases the chances of discovery and assists with sharing information about your images and collections.
There are many ways to undertake content conversion, and NZMS offers content conversion services at various levels. The most basic is Optical Character Recognition (OCR): software recognises the text in an image and creates a simple PDF, and the result is not corrected in any way. The most complex is transcription and “marking up”: documents are completely republished and repurposed using a combination of keyboarding, software, and marking up into a future-proofed eLanguage (such as the New Zealand Official Yearbooks).
Books or manuscripts (or any other objects with text content) are digitised (scanned), and from there we use software or keyboarding to produce output files in the formats you use: Microsoft Word, Microsoft Excel, Adobe PDF, plain text, CSV or a variety of eBook formats. It’s hard to predict future use, so if you seek flexibility and the ability to re-purpose your content we offer a complete customised Extensible Markup Language (XML) conversion service.
If you are interested in displaying your data (or images) online then we can build websites or web pages (that fit seamlessly within your existing website) so that you can make your content available to others.
It can be a complicated area – by all means contact us to discuss what’s possible with your material.
Optical Character Recognition (OCR) Processing
Optical Character Recognition (OCR) software converts scanned images of printed or typewritten pages to searchable and editable text. We use a variety of software tools and have had great results with uncorrected OCR for clients and for the Stones Directories, which we produce for sale.
Our standard OCR services are fully automated: powerful software analyses the digitised images and identifies the text within, with no need for operator intervention.
NZMS also offers customised OCR services for more challenging material. This includes manual zoning of newspaper and journal text (where articles and headlines are not uniformly placed on the page), isolating marginalia, and extracting abstract information from material. We also offer part-OCR, where only specific parts of the material are processed, eliminating extra costs for converting material that does not aid discovery. This is particularly useful for journals and magazines, where advertorial content set in artistic fonts can confuse the output. That said, our OCR systems allow us to “Pattern Train”, whereby we “teach” the system how to recognise text of varying fonts in order to improve the accuracy of our conversion.
Not all documents are suited to OCR, however. The accuracy of OCR on handwritten text is very poor and the output is often not usable. For documents that are not suited to OCR we offer transcription services (also called ‘data entry’ or ‘keyboarding’), in which either two typists and a third arbiter, or a typist and a verifier, type the same information. The results are then compared and any discrepancies are highlighted and rectified.
Transcription is the act of copying textual information from a digitised image into another form, usually a document, spreadsheet or database. Types of digitised images that could be transcribed include handwritten documents, index cards, lists, scripts, typed fonts that do not OCR well, or simple data like names and addresses.
Double data transcription is a data entry quality control method. On the first pass through a set of records, a data entry operator keys the data for each record. On the second pass through the batch, an operator at a separate machine enters the same data again. The two sets of data are then compared, either by a computer verification program or by a person. If there are no discrepancies, the verifier accepts the data. If there are discrepancies, a choice is made as to which version is correct. This can be handled by means of strict vocabulary dictionaries, customer-prescribed “rules” or manually by a data operator. The accuracy for double data transcription should exceed 99.9%.
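As a rough illustration only, the comparison step of double data transcription could be sketched in Python as follows. This is not a description of the software NZMS uses. The function name reconcile, the field names and the vocabulary structure are all hypothetical, and the only rule modelled is a simple vocabulary dictionary. Anything the dictionary cannot settle is passed to a person.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldResult:
    field: str
    value: Optional[str]  # None means the field must go to a human arbiter
    status: str           # "agreed", "resolved" or "needs_review"


def reconcile(first: dict, second: dict, vocabulary: dict) -> list:
    """Compare two independent keyings of one record, field by field.

    `vocabulary` is a hypothetical mapping of field name to the set of
    values allowed for that field (a "strict vocabulary dictionary").
    """
    results = []
    for field in sorted(set(first) | set(second)):
        a = first.get(field, "").strip()
        b = second.get(field, "").strip()
        if a == b:
            # Both operators keyed the same value: accept it.
            results.append(FieldResult(field, a, "agreed"))
            continue
        # The keyings differ: see whether exactly one is in the vocabulary.
        allowed = vocabulary.get(field, set())
        matches = [v for v in (a, b) if v in allowed]
        if len(matches) == 1:
            results.append(FieldResult(field, matches[0], "resolved"))
        else:
            # No rule settles it, so flag the field for manual checking.
            results.append(FieldResult(field, None, "needs_review"))
    return results


# Made-up example record, keyed twice by different operators.
first_pass = {"surname": "Smith", "town": "Dunedin", "year": "1887"}
second_pass = {"surname": "Smyth", "town": "Dunedni", "year": "1887"}
town_vocabulary = {"town": {"Dunedin", "Oamaru"}}

for result in reconcile(first_pass, second_pass, town_vocabulary):
    print(result.field, result.status, result.value)
```

In this sketch the year is accepted because both keyings agree, the town is settled by the vocabulary, and the surname is left for a data operator to check against the image.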
Single-entry transcription is used in the interest of simplicity. It is usually less expensive than double-entry transcription because it does not require data to be entered twice and then compared.
Expected accuracy will vary depending on the transcription method chosen and the quality of the originals and digitised images. We recommend undertaking a pilot on some “typical” data to fine-tune cost estimates and provide evidence of quality expectations.
The real value of NZMS in this process lies in our significant troubleshooting experience in this area, as well as in providing a Quality Assurance interface for you. We welcome your enquiry about improving the discoverability of your material via transcription.