==Pre-processing==
OCR software often pre-processes images to improve the chances of successful recognition. Techniques include:
• De-skewing – if the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical.
• Despeckling – removal of positive and negative spots, and smoothing of edges.
• Binarization – conversion of an image from color or greyscale to black-and-white (called a binary image because there are two colors). Binarization is performed as a simple way of separating the text (or any other desired image component) from the background, and it is necessary because most commercial recognition algorithms work only on binary images, which are simpler to process. In addition, the effectiveness of binarization significantly influences the quality of character recognition, so the binarization method must be chosen carefully for a given input image type: the quality of the binary result depends on the type of image (scanned document, scene-text image, degraded historical document, etc.).
• Line removal – cleaning up non-glyph boxes and lines.
• Layout analysis or zoning – identification of columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables.
• Line and word detection – establishment of a baseline for word and character shapes, separating words as necessary.
• Script recognition – in multilingual documents, the script may change at the level of individual words, so the script must be identified before the right OCR engine can be invoked to handle it.
• Character isolation or segmentation – for per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected.
• Normalization of aspect ratio and scale.

Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed, because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character.
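The binarization step described above can be sketched with a simple global threshold. This is only a minimal illustration in plain Python on a nested-list "image"; production systems typically pick an adaptive method (such as Otsu's) suited to the image type, as the text notes.

```python
def binarize(gray, threshold=128):
    """Convert a greyscale image (rows of 0-255 intensities) into a
    binary image: 1 for dark (foreground/text) pixels, 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

# A tiny 3x4 example: two dark vertical strokes on a light background.
image = [
    [250, 30, 30, 245],
    [240, 25, 28, 250],
    [255, 250, 248, 252],
]
binary = binarize(image)
# binary == [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
```

The fixed threshold of 128 is an arbitrary choice for this sketch; the whole point of careful binarization is that this value (or a locally varying one) should be derived from the image itself.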
• Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as pattern matching, pattern recognition, or image correlation. It relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. The technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique that early physical photocell-based OCR implemented, rather directly.
• Feature extraction decomposes glyphs into "features" such as lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. These features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and most modern OCR software.
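The matrix-matching approach above can be sketched as a direct pixel-by-pixel comparison against stored templates. The 3×3 "glyphs" and labels here are invented for illustration; real systems compare at much higher resolution and over full font template sets.

```python
def match_score(glyph, template):
    """Fraction of pixels on which two same-sized binary images agree."""
    total = sum(len(row) for row in glyph)
    agree = sum(g == t
                for grow, trow in zip(glyph, template)
                for g, t in zip(grow, trow))
    return agree / total

def recognize(glyph, templates):
    """Return the label of the stored template that best matches the glyph."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))

# Hypothetical 3x3 binary templates for two characters.
templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "O": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
}
noisy_I = [[0, 1, 0], [1, 1, 0], [0, 1, 0]]  # an "I" with one flipped pixel
# recognize(noisy_I, templates) -> "I"
```

Note how the sketch presumes exactly what the text says matrix matching requires: the glyph is already isolated, and glyph and template share scale and alignment, so that pixel positions correspond.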
Nearest neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match. Software such as Cuneiform and Tesseract use a two-pass approach to character recognition. The second pass, known as adaptive recognition, uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded). Modern OCR software includes Google Docs OCR, ABBYY FineReader, Transym, and open-source engines such as Tesseract 5 (which introduced an LSTM-based recognition engine), PaddleOCR (a multilingual OCR toolkit supporting over 80 languages), and TrOCR (a transformer-based model developed by Microsoft for handwritten and printed text recognition). Others, such as OCRopus and Tesseract, use neural networks trained to recognize whole lines of text instead of focusing on single characters. A technique known as iterative OCR automatically crops a document into sections based on the page layout. OCR is then performed on each section individually, using variable character confidence level thresholds to maximize page-level OCR accuracy. A patent from the United States Patent Office has been issued for this method. The OCR result can be stored in the standardized
ALTO format, a dedicated XML schema maintained by the United States Library of Congress. Other common formats include hOCR and PAGE XML.

For a list of optical character recognition software, see Comparison of optical character recognition software.
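The nearest-neighbour matching of glyph features mentioned above can be sketched as follows. The two-component feature vectors and their labels are invented for illustration; real feature spaces are higher-dimensional.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(features, training, k=3):
    """Label a feature vector by majority vote among its k nearest
    stored glyph feature vectors."""
    nearest = sorted(training, key=lambda item: euclidean(features, item[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Hypothetical stored features, e.g. (closed loops, stroke endpoints),
# with three noisy samples per character.
training = [
    ((1.0, 2.0), "a"), ((1.1, 2.1), "a"), ((0.9, 1.9), "a"),
    ((2.0, 0.0), "b"), ((2.1, 0.1), "b"), ((1.9, 0.2), "b"),
]
# knn_classify((1.0, 2.05), training) -> "a"
```

With k = 1 this reduces to plain nearest-match lookup; larger k trades sensitivity to noisy stored samples for some blurring of class boundaries.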
==Post-processing==
OCR accuracy can be increased if the output is constrained by a lexicon, a list of words that are allowed to occur in a document. For example, "Washington, D.C." is generally far more common in English than "Washington DOC". Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy. The Levenshtein distance algorithm has also been used in OCR post-processing to further optimize results from an OCR API.
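Lexicon-constrained correction with Levenshtein distance can be sketched as follows: each OCR output string is replaced by the allowed word that requires the fewest single-character edits. The small lexicon here is illustrative only.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon):
    """Replace an OCR output word with the closest lexicon entry."""
    return min(lexicon, key=lambda entry: levenshtein(word, entry))

lexicon = ["Washington, D.C.", "Washington State", "Wilmington"]
# correct("Washington DOC", lexicon) -> "Washington, D.C."
```

This illustrates the article's example directly: "Washington DOC" is three edits away from "Washington, D.C." and much further from the other entries, so the lexicon constraint repairs the OCR error.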
==Application-specific optimizations==
In recent years, the major OCR technology providers began to tune OCR systems to deal more efficiently with specific types of input. Beyond an application-specific lexicon, better performance may be obtained by taking into account business rules, standard expressions, or rich information contained in color images. This strategy is called "Application-Oriented OCR" or "Customized OCR", and has been applied to OCR of license plates, invoices, screenshots, ID cards, driver's licenses, and automobile manufacturing.
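As a sketch of the business-rule idea, license-plate OCR can exploit a known plate format to resolve classic letter/digit confusions (O↔0, I↔1, and so on) by position. The plate format and confusion table below are hypothetical, not any jurisdiction's actual rules.

```python
import re

# Hypothetical plate format: three letters followed by three or four digits.
PLATE = re.compile(r"^[A-Z]{3}\d{3,4}$")

# Common OCR confusions, resolved by where the character sits:
# the first three positions expect letters, the rest expect digits.
TO_LETTER = {"0": "O", "1": "I", "5": "S", "8": "B"}
TO_DIGIT = {"O": "0", "I": "1", "S": "5", "B": "8"}

def apply_plate_rules(raw):
    """Coerce a raw OCR reading toward the expected plate format;
    return None if the result still cannot be a valid plate."""
    fixed = "".join(
        TO_LETTER.get(ch, ch) if i < 3 else TO_DIGIT.get(ch, ch)
        for i, ch in enumerate(raw)
    )
    return fixed if PLATE.match(fixed) else None

# apply_plate_rules("A8C1O5") -> "ABC105"
```

The same pattern, validate against a format and repair position-dependent confusions, carries over to invoice numbers, ID-card fields, and other structured inputs named above.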
The New York Times has adapted OCR technology into a proprietary tool it calls Document Helper, which enables its interactive news team to accelerate the processing of documents that need to be reviewed. The Times notes that the tool lets it process as many as 5,400 pages per hour in preparation for reporters to review the contents.

==Workarounds==