Abstract
The goal of authorship attribution is to find a set of unconscious writing characteristics or style features that distinguish text written by one person from text written by another. Once these features are found, they can be used to pair a text with the individual who wrote it. It is now well accepted that authors develop distinct and unconscious writing features. Over one thousand stylometric features (style markers) have been proposed in a variety of research disciplines [44], but none of that research has looked at the syntactic structure of the text. I conjecture that the distinct writing features of an author are not limited to the features already studied, but also include syntactic features. To support this hypothesis, I ran experiments using two open source parsing programs and analyzed the results to see whether the features these programs produce are sufficient to determine the most probable author of a text. Parsing programs are designed to determine syntactic structures in natural language. They take a text or a writing sample and produce output showing the grammatical relationships between the words in the text. They provide a means to test the hypothesis that an author's syntactic use of words provides enough identifying characteristics to differentiate that author from others. Using two open source natural language parsing programs, the Link Grammar Parser and Collins' Parser, this research tested whether an author's sentence structure is unique enough to provide a means of recognizing the probable author of a text. Initial data was collected on a pool of test authors. Sample texts by each author were run through both parsers. The output of each parser was analyzed using two multivariate analysis methods: discriminant analysis and k-means clustering. My results show that syntactic sentence structures may be a viable basis for authorship attribution. The Link Grammar Parser shows promise as a way to augment existing authorship attribution methods.
Collins' Parser provided even better results, which should be solid enough to stand on their own as a new and viable alternative to existing methods. Collins' Parser also provided new predictors that might improve current authorship attribution methods. For example, elements and phrases with wh- words and the length of noun phrases are highly correlated with authorship in this study.
Library of Congress Subject Headings
Authorship; Natural language processing (Computer science); Parsing (Computer grammar)
Publication Date
2003
Document Type
Thesis
Student Type
Graduate
Degree Name
Computer Science (MS)
Department, Program, or Center
Computer Science (GCCIS)
Advisor
Edith Hemaspaandra
Advisor/Committee Member
Myroslava Dzikovska
Advisor/Committee Member
Carol Marchetti
Recommended Citation
Magnera, Westerly A D, "Using Natural Language Parsers for Authorship Attribution" (2003). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/7552
Campus
RIT – Main Campus
Comments
Physical copy available from RIT's Wallace Library at QA76.9.N38 M33 2003