Analysis Of Historical Documents Through The Use Of Optical Character Recognition
School Name
Governor's School for Science and Math
Grade Level
12th Grade
Presentation Topic
Math and Computer Science
Presentation Type
Mentored
Oral Presentation Award
2nd Place
Abstract
As our world enters an electronic era, it has become important to preserve documents quickly and easily in electronic form. The purpose of this project was to build upon an existing optical character recognition (OCR) system so that it could analyze and recognize the text in handwritten historical documents. The existing system, OCRopus, created by researchers at the German Research Center for Artificial Intelligence in Kaiserslautern, Germany, was designed to recognize computer-generated documents that have a specific font and spacing between words and characters. Historical documents, however, are handwritten, have varied spacing between words and characters, and contain characters that no longer exist in the modern alphabet. To examine handwritten documents, a program was written to divide lines of text into words. While individual characters can be identified by finding the blank space between them, the spacing between words varies, so the average spacing between words was computed and used to divide lines into words accurately. In addition, the grayscale images of text were binarized into black-and-white images in a way that eliminated as many random marks, or noise, on the page as possible.
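The binarization and word-division steps described in the abstract can be illustrated with a minimal sketch. It is not the project's actual program and does not use OCRopus. The function names, the fixed threshold of 128, and the rule that a gap wider than the mean gap width marks a word boundary are all assumptions made for illustration.

```python
from statistics import mean

def binarize(gray, threshold=128):
    """Map grayscale rows (0-255) to 1 = ink, 0 = background.
    The fixed threshold is an assumption, not the project's method."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def split_words(binary):
    """Return (start, end) column spans of words in one binarized text line.
    Assumed rule: a blank gap wider than the mean gap is a word boundary."""
    # A column "has ink" if any row in that column is inked.
    has_ink = [any(col) for col in zip(*binary)]
    if True not in has_ink:
        return []
    first = has_ink.index(True)
    last = len(has_ink) - 1 - has_ink[::-1].index(True)

    # Collect blank runs between the first and last inked columns,
    # so that page margins are not counted as gaps.
    gaps, start = [], None
    for x in range(first, last + 1):
        if not has_ink[x]:
            if start is None:
                start = x
        elif start is not None:
            gaps.append((start, x))
            start = None
    if not gaps:
        return [(first, last + 1)]

    cutoff = mean(end - begin for begin, end in gaps)
    words, word_start = [], first
    for begin, end in gaps:
        if end - begin > cutoff:  # wide gap: treat as a space between words
            words.append((word_start, begin))
            word_start = end
    words.append((word_start, last + 1))
    return words

# Toy line: '#' is dark ink, '.' is blank paper.
pattern = "..##.##....###.#.."
gray = [[0 if ch == "#" else 255 for ch in pattern] for _ in range(2)]
print(split_words(binarize(gray)))  # [(2, 7), (11, 16)]
```

In this toy line the one-column gaps are treated as spaces between characters and the four-column gap as a space between words. Real handwriting would likely need a noise-tolerant threshold and a more robust gap statistic.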
Recommended Citation
Burch, Eleanor, "Analysis Of Historical Documents Through The Use Of Optical Character Recognition" (2016). South Carolina Junior Academy of Science. 85.
https://scholarexchange.furman.edu/scjas/2016/all/85
Location
Owens 207
Start Date
4-16-2016 9:15 AM
Mentor
Dr. Saqib Bukhari; Knowledge Management, German Research Center for Artificial Intelligence