Introduction

What is OCR?

OCR stands for optical character recognition. It is the process by which text characters in an image are converted to machine-readable text. Pixels in the image are segmented into characters, compared to the characters in an alphabet, and assigned the encoded value of the best-matching character. This is how software can take scanned images of documents and translate typed or machine-printed text into "machine-readable" text data: encoded characters that can be processed by other software, such as Grooper.
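The segment, compare, and assign steps described above can be sketched in a few lines of Python. This is a toy illustration only, not how Grooper or any production OCR engine works: the four-letter alphabet, the 3x5 pixel templates, and the function names (split_glyphs, score, recognize) are all assumptions made up for the example.

```python
# Toy 3x5 templates: '#' is an ink pixel, '.' is background.
# The alphabet and its shapes are invented for this illustration.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def split_glyphs(rows):
    """Break the image into characters by cutting at fully blank columns."""
    width = len(rows[0])
    glyphs, start = [], None
    for x in range(width + 1):
        blank = x == width or all(row[x] == "." for row in rows)
        if not blank and start is None:
            start = x  # first ink column of a new glyph
        elif blank and start is not None:
            glyphs.append([row[start:x] for row in rows])
            start = None
    return glyphs


def score(glyph, template):
    """Count pixels that agree; -1 if the two shapes differ in size."""
    if len(glyph) != len(template) or len(glyph[0]) != len(template[0]):
        return -1
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(rows):
    """Assign each glyph the character whose template matches best."""
    return "".join(
        max(TEMPLATES, key=lambda ch: score(glyph, TEMPLATES[ch]))
        for glyph in split_glyphs(rows)
    )


IMAGE = [
    "#...###.###",
    "#...#.#..#.",
    "#...#.#..#.",
    "#...#.#..#.",
    "###.###..#.",
]
print(recognize(IMAGE))  # LOT

# An "I" whose bottom bar has faded is pixel-identical to the "T" template.
FADED_I = ["###", ".#.", ".#.", ".#.", ".#."]
print(recognize(FADED_I))  # T
```

The second call shows why image quality matters: once the bottom bar of the "I" is lost, the best-matching template is "T", and the character is misrecognized.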

Native Text vs OCR Text

In Grooper, we make a distinction between "native text" data and "OCR text" data.

  • Native text data is obtained from natively authored digital content.
  • OCR text data is obtained from image-based content.

Digital document files will already have native text data encoded within the file if they are authored with text editing software. For example, take a PDF file generated in Adobe Acrobat. All the text generated for the document is part of the final PDF file's encoded data. There is no need to OCR these kinds of documents; their text data can be obtained simply by extracting the encoded text native to the file.

However, if a file is simply an image of a document, such as a scanned page, it has no such text data. The document must therefore be processed by OCR software (or "OCR'd" for short) to obtain a usable set of text data.
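The distinction between the two cases can be expressed as a simple check. The sketch below is a hypothetical illustration, not Grooper's implementation: the Page class, its embedded_text field, and the text_source function are assumed names, and the rule "any non-whitespace embedded text means native" is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical field: text already encoded in the file.
    # An image-only page (such as a scan) would carry an empty string.
    embedded_text: str = ""


def text_source(page):
    """Return "native" if the page carries encoded text, else "ocr"."""
    return "native" if page.embedded_text.strip() else "ocr"


print(text_source(Page("Invoice #1001")))  # native
print(text_source(Page()))                 # ocr
```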

In both cases, you will use the Recognize activity to obtain text data for a page and/or document. This course focuses on using the Recognize activity specifically to OCR image-based content, such as scanned pages. Depending on the quality of the image, the condition of the original scanned page, and even the fonts used, there is a lot to consider to obtain the most accurate OCR results from your documents. In this course we will learn how OCR software recognizes text characters from an image, how this can lead to misrecognized characters, how to improve OCR results by altering the image fed to the OCR software, how Grooper's unique suite of OCR Synthesis operations can improve standard OCR results, and more!