Analysis Of Historical Documents Through The Use Of Optical Character Recognition

Author(s)

Eleanor Burch

School Name

Governor's School for Science and Math

Grade Level

12th Grade

Presentation Topic

Math and Computer Science

Presentation Type

Mentored

Mentor

Dr. Saqib Bukhari; Knowledge Management, German Research Center for Artificial Intelligence

Oral Presentation Award

2nd Place

Abstract

As our world enters an electronic era, it has become important to be able to preserve documents quickly and easily in an electronic format. The purpose of this project was to build upon a preexisting optical character recognition (OCR) system so that it could analyze and recognize the text in handwritten historical documents. The preexisting system, OCRopus, created by researchers at the German Research Center for Artificial Intelligence in Kaiserslautern, Germany, was designed to recognize computer-generated documents that have a specific font and spacing between words and characters. Historical documents, however, are handwritten, have varied spacing between words and characters, and contain characters that no longer exist in the modern alphabet. To examine handwritten documents, a program was written to divide lines of text into words. While individual characters can be separated by finding the blank space between them, the spacing between words varies, so the average spacing between words was calculated and used to divide lines into words accurately. In addition, the grayscale images of text were binarized into black-and-white images in a way that eliminated as many random marks, or noise, on the page as possible.
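
The two steps described in the abstract, binarization and division of a line into words, can be illustrated with a minimal Python sketch. This is not the project's code and does not use OCRopus. It rests on several assumptions: a fixed gray-level threshold stands in for the binarization method, a column counts as inked only if it holds at least min_ink dark pixels (a crude form of noise suppression), and a gap wider than the average gap is treated as a word boundary, which is only one possible reading of how the average spacing was used. The names binarize, ink_runs, split_words, threshold, and min_ink are all hypothetical.

```python
from statistics import mean


def binarize(gray, threshold=128):
    """Turn grayscale rows (0 = black, 255 = white) into 1 for ink, 0 for background.

    Assumption: a fixed global threshold; the project's method is not specified.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def ink_runs(binary, min_ink=1):
    """Return [start, end) column spans whose ink count is at least min_ink."""
    n_cols = len(binary[0])
    inked = [sum(row[x] for row in binary) >= min_ink for x in range(n_cols)]
    runs, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x                      # a run of inked columns begins
        elif not has_ink and start is not None:
            runs.append((start, x))        # the run ends at the first blank column
            start = None
    if start is not None:
        runs.append((start, n_cols))       # run reaches the right edge
    return runs


def split_words(binary, min_ink=1):
    """Group inked runs into words; gaps wider than the mean gap separate words."""
    runs = ink_runs(binary, min_ink)
    if len(runs) < 2:
        return runs
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(runs, runs[1:])]
    cutoff = mean(gaps)                    # assumption: mean gap is the cutoff
    words = [list(runs[0])]
    for gap, run in zip(gaps, runs[1:]):
        if gap > cutoff:
            words.append(list(run))        # wide gap: start a new word
        else:
            words[-1][1] = run[1]          # narrow gap: extend the current word
    return [tuple(w) for w in words]


# One-row "line image": two strokes, a wide gap, then two more strokes.
line = [[0, 0, 255, 0, 0, 255, 255, 255, 255, 0, 0, 255, 0, 0]]
print(split_words(binarize(line)))  # → [(0, 5), (9, 14)]
```

In this sketch, a line whose gaps are all the same width comes back as a single word, because no gap exceeds the mean. Real handwriting would likely need a more robust cutoff than the plain average.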

Location

Owens 207

Start Date

4-16-2016 9:15 AM
