Developing A Document Classifier Using A Part Of Speech Tagger
School Name
Governor's School for Science and Math
Grade Level
12th Grade
Presentation Topic
Math and Computer Science
Presentation Type
Mentored
Abstract
Natural language processing is a branch of artificial intelligence in which human language is interpreted and analyzed. Its applications include summarizing a document into a single paragraph, translating text from one language to another, and answering a given question. The Natural Language Toolkit (NLTK) is a Python software library that offers helpful methods for this area of artificial intelligence. The overall goal of the research was to develop a classifier that could sort documents by type, such as email, essay, or joke, and by tone toward a subject, by tagging the words in each document with their respective parts of speech. As the research progressed, it became clear that the NLTK part-of-speech tagger was not tagging with high accuracy. Therefore, I began to examine the NLTK part-of-speech tagger. Many documents, all of different types, were tagged using NLTK. Those same documents were then manually tagged using a dictionary. The percent accuracy of the NLTK part-of-speech tagger was then determined by comparing its tags with the manual tags, and steps were taken to improve the tagger, which was critical to the success of the classifier.
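The accuracy measurement described in the abstract can be illustrated with a minimal sketch. It assumes each document is available as two aligned lists of (word, tag) pairs: one produced by an automatic tagger (for example, NLTK's `nltk.pos_tag`) and one tagged by hand. The function name `tagging_accuracy`, the sample sentence, and the Penn Treebank-style tags are hypothetical and are not taken from the original study.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, part-of-speech tag)


def tagging_accuracy(predicted: List[TaggedToken],
                     gold: List[TaggedToken]) -> float:
    """Return the percentage of tokens whose predicted tag equals the manual tag.

    Both lists must cover the same tokens in the same order.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of tokens")
    if not gold:
        return 0.0  # assumption: an empty document scores 0 rather than raising
    matches = sum(1 for (_, p_tag), (_, g_tag) in zip(predicted, gold)
                  if p_tag == g_tag)
    return 100.0 * matches / len(gold)


# Hypothetical data: the automatic tagger mislabels "barks" as a plural noun.
predicted = [("The", "DT"), ("dog", "NN"), ("barks", "NNS"), ("loudly", "RB")]
gold = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")]

accuracy = tagging_accuracy(predicted, gold)
print(f"{accuracy:.1f}% of tokens match the manual tags")
# prints "75.0% of tokens match the manual tags"
```

In practice the `predicted` list might come from a call such as `nltk.pos_tag(tokens)` once the required NLTK tagger data is installed; it is hard-coded here so the sketch runs without any downloads.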
Recommended Citation
Babb, Emily, "Developing A Document Classifier Using A Part Of Speech Tagger" (2016). South Carolina Junior Academy of Science. 82.
https://scholarexchange.furman.edu/scjas/2016/all/82
Location
Owens 207
Start Date
4-16-2016 8:30 AM
Mentor
Dr. Rashid; Knowledge Management, German Research Center for Artificial Intelligence