Definition

OCR (optical character recognition)

TechTarget Contributor

By

TechTarget Contributor

Published: Nov 15, 2022

What is OCR (optical character recognition)?

OCR (optical character recognition) is the use of technology to distinguish printed or handwritten text characters inside digital images of physical documents, such as a scanned paper document. The basic process of OCR involves examining the text of a document and translating the characters into code that can be used for data processing. OCR is sometimes also referred to as text recognition.

OCR systems are made up of a combination of hardware and software that is used to convert physical documents into machine-readable text. Hardware, such as an optical scanner or specialized circuit board, is used to copy or read text while software typically handles the advanced processing. Software can also take advantage of artificial intelligence (AI) to implement more advanced methods of intelligent character recognition (ICR), like identifying languages or styles of handwriting.

The process of OCR is most commonly used to turn hard copy legal or historic documents into PDFs. Once placed in this soft copy, users can edit, format and search the document as if it was created with a word processor.

How optical character recognition works

The first step of OCR is using a scanner to process the physical form of a document. Once all pages are copied, OCR software converts the document into a two-color, or black and white, version. The scanned-in image or bitmap is analyzed for light and dark areas, where the dark areas are identified as characters that need to be recognized and light areas are identified as background.

The dark areas are then processed further to find alphabetic letters or numeric digits. OCR programs can vary in their techniques, but typically involve targeting one character, word or block of text at a time. Characters are then identified using one of two algorithms:

Pattern recognition. OCR programs are fed examples of text in various fonts and formats which are then used to compare, and recognize, characters in the scanned document.
Feature detection. OCR programs apply rules regarding the features of a specific letter or number to recognize characters in the scanned document. Features could include the number of angled lines, crossed lines or curves in a character for comparison. For example, the capital letter "A" may be stored as two diagonal lines that meet with a horizontal line across the middle.

When a character is identified, it is converted into an ASCII code that can be used by computer systems to handle further manipulations. Users should correct basic errors, proofread and make sure complex layouts were handled properly before saving the document for future use.

Optical character recognition use cases

OCR can be used for a variety of applications, including the following:

Scanning printed documents into versions that can be edited with word processors, like Microsoft Word or Google Docs.
Indexing print material for search engines.
Automating data entry, extraction and processing.
Deciphering documents into text that can be read aloud to visually-impaired or blind users.
Archiving historic information, such as newspapers, magazines or phonebooks, into searchable formats.
Electronically depositing checks without the need for a bank teller.
Placing important, signed legal documents into an electronic database.
Recognizing text, such as license plates, with a camera or software.
Sorting letters for mail delivery.
Translating words within an image into a specified language.

Benefits of optical character recognition

The main advantages of OCR technology are the following:

saves time;
decreases errors;
minimizes effort; and
enables actions that are not possible with physical copies, such as compressing into ZIP files, highlighting keywords, incorporating into a website and attaching to an email.

While taking images of documents enables them to be digitally archived, OCR provides the added functionality of being able to edit and search those documents.

Continue Reading About OCR (optical character recognition)

How automated content tagging improves findability

The history and evolution of the paperless office

4 ways natural language querying in BI tools can benefit users

Dig Deeper on Content management software and services

Search Business Analytics

What makes an effective data science team structure?
Data science team structures vary in strength, and their success depends on how roles and leadership align with business goals to...
Synthetic data vs. real data for predictive analytics
Synthetic data helps simulate rare events and meet privacy compliance, while real data preserves natural variability needed to ...
7 predictive analytics skills to improve simulation modeling
Predictive analytics skills such as statistical analysis, data preprocessing and model evaluation can help data professionals ...

Search Data Management

Hadoop vs. Spark for modern data pipelines
Hadoop and Spark differ in architecture, performance, scalability, cost and deployment. They offer distinct strengths for modern ...
Informatica adds MCP support, spate of AI-fueled features
With Model Context Protocol helping standardize how enterprises develop and deploy agents, support for the open standard is ...
What is data lineage? Techniques, best practices and tools
Organizations can bolster data governance efforts by tracking the lineage of data in their systems. Get advice on how to do so ...

Search ERP

10 software requirements to look for in an MES
In production environments, a manufacturing execution system can help boost efficiency. Learn the required features that an MES ...
Learn benefits and challenges of CRM and ERP integration
Integration can be difficult because of technical challenges and organizational change. Learn the benefits and potential issues ...
6 benefits of using low-code ERP
Using low-code ERP can result in easier user training and more agility, among other benefits. Learn more about how the software ...

Search Oracle

Oracle sets lofty national EHR goal with Cerner acquisition
With its Cerner acquisition, Oracle sets its sights on creating a national, anonymized patient database -- a road filled with ...
With Cerner, Oracle Cloud Infrastructure gets a boost
Oracle plans to acquire Cerner in a deal valued at about $30B. The second-largest EHR vendor in the U.S. could inject new life ...
Supreme Court sides with Google in Oracle API copyright suit
The Supreme Court ruled 6-2 that Java APIs used in Android phones are not subject to American copyright law, ending a ...

Search SAP

SAP agrees to allow Celonis data access until case resolved
SAP agrees to allow Celonis customers to access data from its systems as their legal battle continues, but customers will be best...
Grow with SAP fuels Phoenix Global's digital transition
Phoenix Global implemented S/4HANA Cloud via Grow with SAP to replace outdated systems, digitize manual processes and enable AI ...
SAP Sapphire 2025 news, trends and analysis
SAP showcased new business AI applications and continued to make the case for S/4HANA Cloud as the future of SaaS-based ERP ...

Close