Abstract
This study uses linguistic and stylometric analysis to identify the legitimate authors of a text. To this end, it applies a machine learning framework that draws on a range of features to distinguish each author's writing style. The machine learning approach is implemented with SAS (Statistical Analysis System).
SAS provides the algorithms that the machine learning experiments need to run accurately. Within this framework, the experiments process a large amount of data to identify the person responsible for the written content. AI technologies and SAS are the two components that allow the experiments to work accurately and produce reliable results. The study also follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, which suits large data mining projects such as this one, where authors must be identified from a large volume of documents.
The tweet dataset used for this research is available on the Kaggle platform in .csv format. It contains textual data for five authors and approximately 10,000 tweets, with 1,900 to 2,000 tweets per author. In the proposed model, the dataset first passes through preprocessing steps such as removing null values and unnecessary details. Stop words, nouns, adverbs, and other selected parts of speech are also removed from the data.
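The cleansing steps described above can be illustrated with a minimal sketch. It is written in Python purely for illustration; the study itself uses SAS, and nothing here reproduces the authors' actual pipeline. The stop-word list, the sample rows, and the function names `clean_tweet` and `preprocess` are hypothetical. Part-of-speech filtering is omitted because it would require a tagger.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "are", "to", "of", "in", "on", "for"}

def clean_tweet(text):
    """Lower-case a tweet, strip URLs, mentions and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # remove URLs and @mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # remove punctuation and symbols
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def preprocess(rows):
    """Drop rows with a missing author or text, then tokenize the rest."""
    cleaned = []
    for author, text in rows:
        if not author or not text or not text.strip():
            continue  # null or empty values are discarded
        tokens = clean_tweet(text)
        if tokens:
            cleaned.append((author, tokens))
    return cleaned

# Hypothetical sample rows in (author, tweet) form.
rows = [
    ("author_a", "The launch is on for Friday! https://example.com"),
    ("author_b", None),
    ("author_c", "@someone thanks for the support"),
]
print(preprocess(rows))
```

The row with a missing tweet is dropped, and the remaining tweets are reduced to their content-bearing tokens.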
After preprocessing, different SAS-based machine learning models are applied to attribute each text to its author. For this purpose, a CRISP-DM model is adopted and four machine learning algorithms are tested. Each model is trained with an 80-20 train-test split. First, the Bayesian Network is applied to the dataset, followed by the Decision Tree classifier. The results show that the Decision Tree classifier outperforms the Bayesian Network. Gradient Boosting Trees and MBR are then tested on the same data. The final results for each model are: MBR = 97.09, Gradient Boosting = 97.06, Decision Tree = 83.89, HPBNC (Bayesian Network) = 81.36. These results are better than those of most state-of-the-art mechanisms on the same dataset. It is also worth noting that MBR and Gradient Boosting performed exceptionally well on the forensic texts.
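As a rough illustration of the 80-20 train-test split and of how an accuracy score is computed, consider the following Python sketch. It is not the SAS workflow used in the study. Splitting per author (stratification), the fixed seed, the toy data, and the function names `stratified_split` and `accuracy` are all assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(samples, train_fraction=0.8, seed=0):
    """Split (author, features) pairs so each author keeps the same train/test ratio."""
    by_author = defaultdict(list)
    for author, features in samples:
        by_author[author].append((author, features))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for author in sorted(by_author):
        items = by_author[author]
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

def accuracy(true_labels, predicted_labels):
    """Percentage of predictions that match the true author."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# Hypothetical toy data: 5 authors with 10 samples each.
samples = [(f"author_{a}", i) for a in range(5) for i in range(10)]
train, test = stratified_split(samples)
print(len(train), len(test))
```

Splitting within each author keeps the classes balanced in both partitions, which matters when every author contributes a similar number of tweets.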
This research may serve as a starting point for the forensic examination of Twitter data to identify ownership and stylometric style. The accuracy of the models is high; however, it might be improved in the future by using parameters and methodologies different from those in the current research. Lastly, this research could be useful to a country's cybercrime unit in reducing fake news and posts and in determining which news items genuinely belong to their claimed authors.
Publication Date
12-16-2022
Document Type
Master's Project
Student Type
Graduate
Degree Name
Professional Studies (MS)
Department, Program, or Center
Graduate Programs & Research (Dubai)
Advisor
Sanjay Modak
Advisor/Committee Member
Ehsan Warriach
Recommended Citation
Alshuweihi, Abdulla and Alblooshi, Sultan, "Writer Identity using Stylometry and Machine Learning" (2022). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/11410
Campus
RIT Dubai