Theses

An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures

Abstract

Is there any similarity between the contexts of the Holy Bible and the Holy Quran, and can this be proven mathematically? Using the Bible and the Quran as our corpus, this research explores the performance of various feature extraction and machine learning techniques. The unstructured nature of text data adds an extra layer of complexity to the feature extraction task, and the inherently sparse nature of the corresponding data matrices makes text mining a distinctly difficult task. Among other things, we assess the difference between domain-based syntactic feature extraction and domain-free feature extraction, and then use a variety of similarity measures, including Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetrized Kullback-Leibler, Jensen-Shannon, probabilistic chi-square, and Clark, to identify similarities and differences between the sacred texts. We started by comparing chapters of the two raw texts using these proximity measures to visualize their behavior in a high-dimensional, sparse space. It was apparent that some of the chapters were similar, but the result was not conclusive. Therefore, the noise had to be cleaned using natural language processing (NLP). For example, to reduce the size of the two vectors, we built lists of vocabulary that is worded differently in the two texts but carries the same meaning. The program would thus recognize Lord in the Holy Bible and Allah in the Quran as God, and Jacob in the Bible and Yaqub in the Quran as a prophet. This process was repeated many times to give relative comparisons over a variety of different words. After the raw texts were compared, the same comparison was carried out on the processed text. The next comparison applied probabilistic topic modeling to the feature-extracted matrix, projecting it into a low-dimensional topic space for a denser comparison.
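The vocabulary-merging step described above can be illustrated with a minimal sketch. The mapping table, the tokenizer, and the function names below are hypothetical and chosen only for illustration; the abstract does not list the actual vocabulary lists or the preprocessing pipeline used in the thesis.

```python
import re
from collections import Counter

# Hypothetical synonym table (assumption): maps text-specific wordings
# to one shared token, following the examples given in the abstract.
CANONICAL = {
    "lord": "god",
    "allah": "god",
    "jacob": "prophet",
    "yaqub": "prophet",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def normalize(tokens, mapping=CANONICAL):
    """Replace each token by its shared form when one is defined."""
    return [mapping.get(tok, tok) for tok in tokens]

def term_counts(text):
    """Term-frequency vector of the normalized text, as a Counter."""
    return Counter(normalize(tokenize(text)))

# Toy sentences (not quotations from either book).
bible_counts = term_counts("The Lord spoke to Jacob")
quran_counts = term_counts("Allah spoke to Yaqub")

# After normalization the two vectors overlap on more terms.
shared = set(bible_counts) & set(quran_counts)
print(sorted(shared))
```

Merging wordings in this way shrinks the joint vocabulary, so the two count vectors have fewer dimensions and overlap on more of them before any distance is computed.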
Among the distance measures introduced to the sacred corpora, the probability-based measures such as Kullback-Leibler and Jensen-Shannon gave the best results. A further analysis based on the Hellinger distance on the CTM also discriminates well between documents. This work started with the belief that if the Bible and the Quran intersect, the overlap will show clearly between the book of Deuteronomy and some Quranic chapters. It is now not only historically but also mathematically correct to say that the similarity between the Biblical and Quranic contexts is greater than the similarity within the holy books themselves. Furthermore, we conclude that distances based on probabilistic measures, such as the Jeffreys divergence and the Hellinger distance, are the recommended methods for unstructured sacred texts.
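For readers unfamiliar with the probability-based measures named in the abstract, the following is a minimal sketch of their standard textbook definitions on discrete distributions. It is not the thesis implementation: the function names, the additive smoothing constant `alpha`, and the toy distributions are assumptions made for illustration. The Jeffreys divergence is written here as the plain sum of the two directed Kullback-Leibler divergences; some texts halve it.

```python
import math

def to_distribution(counts, vocab, alpha=1.0):
    """Turn a dict of term counts into probabilities over vocab.
    Additive smoothing (alpha > 0, an assumption) keeps every
    probability positive so the KL-based measures stay finite."""
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return [(counts.get(w, 0) + alpha) / total for w in vocab]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), natural logarithm."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jeffreys(p, q):
    """Symmetrized Kullback-Leibler (Jeffreys) divergence."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence; bounded above by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    """Hellinger distance; always between 0 and 1."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * s)

# Toy term counts for two documents (made-up numbers).
vocab = ["god", "prophet", "law"]
p = to_distribution({"god": 6, "prophet": 1}, vocab)
q = to_distribution({"god": 1, "law": 6}, vocab)

for name, fn in [("Jeffreys", jeffreys),
                 ("Jensen-Shannon", jensen_shannon),
                 ("Hellinger", hellinger)]:
    print(name, round(fn(p, q), 4))
```

All three measures compare documents as probability distributions rather than raw count vectors, which is why they can be applied either to normalized term frequencies or to the topic proportions produced by a topic model.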

Library of Congress Subject Headings

Bible--Criticism, Textual; Qur'an--Criticism, Textual; Criticism, Textual--Data processing; Content analysis (Communication)--Data processing

Publication Date

11-2014

Document Type

Thesis

Student Type

Graduate

Degree Name

Applied Statistics (MS)

Advisor

Ernest Fokoué

Advisor/Committee Member

Linlin Chen

Advisor/Committee Member

Robert Parody

Comments

Physical copy available from RIT's Wallace Library at P47 .Q34 2014

Recommended Citation

Qahl, Salha Hassan Muhammed, "An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures" (2014). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/8496

Campus

RIT – Main Campus

Plan Codes

APPSTAT-MS
