Structured versus unstructured form data capture

Jan 22, 2020

Data is data, so why does it even matter what form it takes? If you have been researching ways to make data entry less painful, that question may be what led you here. Automating document processing through digitization and OCR (optical character recognition) is only the first step. How the data is structured depends on the purpose of the document, and that structure affects how the data can be captured.

In data capture, configured software captures an image of a document so that the information on it can be translated into electronic data without manual input. This is done with recognition technology such as OCR.

Structured Data Capture

What are structured documents? Structured documents are clearly defined forms whose fields are always in the same place. The only thing that changes is the information populated in each field; the document structure stays the same. Documents that are considered structured include questionnaires, tests, surveys, and medical claim forms such as the CMS-1500 and UB-04. Their common patterns make them easy to recognize and search.

OCR works well with structured forms and documents almost by definition. Because the data is in the same place on each page, in many cases it does not take much to configure OCR capture software to process structured forms. You only need to set up a rule that tells the software to look at defined locations on a form or, for the mathematical thinkers out there, at specific X and Y coordinates. A rule can also tell the software to look for a specific data type, such as a date, number, or zip code, by using regular expressions. Regular expressions limit what is allowed in a specific field. For a date field, for example, we can tell the system to accept only digits and slashes in the form XX/XX/XXXX (month/day/year).

In many cases this predictability allows for higher accuracy, although other factors can still hinder capture, such as information typed over the lines of the document. If a 1 is typed over a field border, the character and the line can sit too close together and the OCR engine may not capture the proper number or letter.
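As a rough illustration of the two kinds of rule described above, here is a minimal Python sketch of zone-based capture with regular-expression validation. It is not the configuration format of any particular capture product: the `Word` and `Zone` classes, the coordinates, and the sample OCR output are all hypothetical, and the sketch assumes the OCR engine has already returned each recognized word with the position of its top-left corner.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word and its top-left position (hypothetical OCR output)."""
    text: str
    x: float
    y: float


@dataclass
class Zone:
    """A fixed rectangle on the form plus the pattern its content must match."""
    name: str
    left: float
    top: float
    right: float
    bottom: float
    pattern: str


def capture_zone(words, zone):
    """Join the words inside the zone; return None if the text fails validation."""
    inside = [
        w for w in words
        if zone.left <= w.x <= zone.right and zone.top <= w.y <= zone.bottom
    ]
    inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
    text = " ".join(w.text for w in inside)
    return text if re.fullmatch(zone.pattern, text) else None


# Assumed sample data: positions are made up for illustration.
words = [
    Word("Smith", 120, 52),
    Word("01/22/2020", 410, 52),
    Word("10001", 120, 300),
]
zones = [
    Zone("service_date", 400, 40, 560, 70, r"\d{2}/\d{2}/\d{4}"),
    Zone("zip_code", 100, 290, 200, 320, r"\d{5}(-\d{4})?"),
]

result = {zone.name: capture_zone(words, zone) for zone in zones}
print(result)
```

Because the date pattern accepts only digits and slashes, a misread such as the letter O in place of a zero is rejected instead of being stored, which is one way a field can be flagged for manual review.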

Semi-Structured Data Capture

Life would be much easier if every type of document followed a specific format. This, of course, is a pipe dream. Invoices, for example, can differ from company to company. This poses a problem for capture software, because the defined locations it scans might not hold the correct data; the data might be located somewhere else on the page. This is what is known as semi-structured data: invoices share common data types, such as invoice numbers and dates, but can vary greatly in layout.

The commonalities between different types of invoices include the invoice number, invoice date, net and gross totals, and more. To parse and collect this data, we configure the OCR system with key values that we ask it to search for. Invoices are set up using key-value rules that tell the system to look for words or clues on the page and capture the data that pertains to each keyword or term. Once the rule homes in on the vicinity of a keyword, it looks underneath or around that word and extracts the data relevant to it.
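The key-value approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any specific capture engine implements it: the `FIELD_RULES` table, the label variants, and the two sample invoices are hypothetical, and the OCR result is assumed to be a plain list of text lines. For each field the rule finds a label, looks to the right of it on the same line, and falls back to the line underneath.

```python
import re

# Hypothetical rules: label variants to search for, and the shape of the value.
FIELD_RULES = {
    "invoice_number": (["invoice number", "invoice no", "invoice #"],
                       r"[A-Z0-9][A-Z0-9-]{3,}"),
    "invoice_date": (["invoice date", "date"], r"\d{2}/\d{2}/\d{4}"),
    "total": (["gross total", "total due", "total"], r"\d[\d,]*\.\d{2}"),
}


def extract_field(lines, labels, value_pattern):
    """Find a label, then look right of it and on the next line for the value."""
    for i, line in enumerate(lines):
        lowered = line.lower()
        for label in labels:
            pos = lowered.find(label)
            if pos == -1:
                continue
            # First look to the right of the label on the same line.
            match = re.search(value_pattern, line[pos + len(label):])
            # Otherwise look at the line directly underneath.
            if match is None and i + 1 < len(lines):
                match = re.search(value_pattern, lines[i + 1])
            if match:
                return match.group(0)
    return None


def extract_invoice(lines):
    return {name: extract_field(lines, labels, pattern)
            for name, (labels, pattern) in FIELD_RULES.items()}


# Two made-up invoices with the same fields in different layouts.
invoice_a = [
    "ACME Supplies",
    "Invoice Number: INV-20417",
    "Invoice Date: 01/22/2020",
    "Total Due: 1,250.00",
]
invoice_b = [
    "Globex Ltd",
    "Invoice No    Date",
    "88231         03/15/2020",
    "Gross Total",
    "980.50",
]

fields_a = extract_invoice(invoice_a)
fields_b = extract_invoice(invoice_b)
print(fields_a)
print(fields_b)
```

The same rules handle both layouts because they are anchored to keywords rather than page coordinates. A real system would also use word positions to decide which value sits closest to the label; this sketch only uses line order.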

Unstructured Data Capture

Unstructured data and documents are exactly what they sound like: data that follows a free-form format and therefore has no set structure. You might think unstructured formats need to be sifted through manually, but that is not the case. Unstructured data found in contracts, articles, letters, memos, and more can still be captured with today's advanced OCR capture algorithms.

The key to successfully capturing and processing unstructured data is knowing how to approach it. Rules can be defined that classify document types and extract different metadata fields depending on the type. Complex documents can be classified with rules that look at the text on the page. Lastly, if enough samples are fed into a system, machine learning comes into play: the system can be trained to capture the needed information even when its location changes from document to document. The self-learned classifiers that the software creates improve at categorizing and parsing documents over time.
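The rule-based part of this approach can be sketched as follows. This is an illustrative Python example only: the keyword lists in `CLASS_RULES`, the per-type patterns in `FIELD_RULES_BY_TYPE`, the `min_hits` threshold, and the sample texts are all assumptions made for the sketch. It classifies a document by counting keyword hits in its text, then extracts a different metadata field depending on the type. A trained classifier would replace the `classify` function; it is not shown here.

```python
import re

# Hypothetical classification rules: keywords that hint at each document type.
CLASS_RULES = {
    "contract": ["agreement", "hereinafter", "party"],
    "memo": ["memorandum", "to:", "from:", "subject:"],
    "letter": ["dear", "sincerely", "regards"],
}

# Hypothetical extraction rules: which metadata field to pull for each type.
FIELD_RULES_BY_TYPE = {
    "contract": {"effective_date": r"effective as of (\w+ \d{1,2}, \d{4})"},
    "memo": {"subject": r"^subject:\s*(.+)$"},
    "letter": {"salutation": r"^dear\s+(.+?),?\s*$"},
}


def classify(text, min_hits=2):
    """Return the type with the most keyword hits, or 'unknown' below min_hits."""
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in CLASS_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else "unknown"


def extract_metadata(text):
    """Classify the text, then apply only the rules for that document type."""
    doc_type = classify(text)
    fields = {}
    for name, pattern in FIELD_RULES_BY_TYPE.get(doc_type, {}).items():
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
        fields[name] = match.group(1) if match else None
    return doc_type, fields


memo = (
    "MEMORANDUM\n"
    "To: All staff\n"
    "From: Operations\n"
    "Subject: Scanner maintenance window\n"
    "The scanners will be offline on Friday."
)
contract = (
    "This Agreement is effective as of March 1, 2020 between Acme Corp "
    '(hereinafter "Supplier") and each party named below.'
)

memo_result = extract_metadata(memo)
contract_result = extract_metadata(contract)
print(memo_result)
print(contract_result)
```

Keeping classification separate from extraction means new document types can be added by extending the two rule tables without touching the logic.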

Call us!

When it comes to capturing data, structured or unstructured, there is no one better than the experts at OCR Solutions. With our array of products and our industry experience, we will create a solution that works best for your business. If you have any questions, call us today. We look forward to hearing from you!