Developing A Document Classifier Using A Part Of Speech Tagger

Author(s)

Emily Babb

School Name

Governor's School for Science and Math

Grade Level

12th Grade

Presentation Topic

Math and Computer Science

Presentation Type

Mentored

Mentor

Dr. Rashid; Knowledge Management, German Research Center for Artificial Intelligence

Abstract

Natural language processing is a branch of artificial intelligence in which human language is interpreted and analyzed. Its techniques make it possible to summarize a document in a single paragraph, to translate text from one language to another, and to answer a given question. The Natural Language Toolkit (NLTK) is a Python software library that offers helpful methods in this subset of artificial intelligence. The overall goal of the research was to develop a classifier that could sort documents by type, such as email, essay, or joke, and by tone toward a subject, by tagging the words in each document with their respective parts of speech. As the research progressed, it became clear that the NLTK part-of-speech tagger was not tagging with high accuracy. Therefore, I began to examine the tagger itself. Many documents, all of different types, were tagged using NLTK. Those same documents were then manually tagged using a dictionary. The percent accuracy of the NLTK part-of-speech tagger was then determined by comparing its tags with the manual ones, and steps were taken to improve the tagger, which was critical to the success of the classifier.
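The accuracy measurement described in the abstract can be illustrated with a minimal sketch. It is not the code used in the project: the function name `tagging_accuracy`, the sample sentence, and the tags shown are all hypothetical, and the sketch assumes that the tagger output and the manual tags cover the same tokens in the same order. In practice the `predicted` list could come from a call such as `nltk.pos_tag(tokens)`, which returns (token, tag) pairs; it is written by hand here so that the example runs without NLTK installed.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, part-of-speech tag)


def tagging_accuracy(predicted: List[TaggedToken], gold: List[TaggedToken]) -> float:
    """Return the percentage of tokens whose predicted tag equals the manual tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty document")
    # Count positions where the two tag sequences agree.
    matches = sum(1 for (_, p_tag), (_, g_tag) in zip(predicted, gold) if p_tag == g_tag)
    return 100.0 * matches / len(gold)


# Hypothetical tagger output (illustrative Penn Treebank style tags).
predicted = [("The", "DT"), ("joke", "NN"), ("was", "VBD"), ("funny", "NN")]
# Hypothetical manual tags for the same sentence, e.g. looked up in a dictionary.
gold = [("The", "DT"), ("joke", "NN"), ("was", "VBD"), ("funny", "JJ")]

accuracy = tagging_accuracy(predicted, gold)
print(f"{accuracy:.1f}% of tags match the manual tags")
```

Averaging this percentage over many documents of different types would give one overall accuracy figure; reporting it per document type would also show where the tagger struggles most.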

Location

Owens 207

Start Date

4-16-2016 8:30 AM
