Abstract
Documents found on the World Wide Web (WWW) may consist of a single web page or of several web pages linked together by a table of contents or some other commonly known document construct. When a document spans multiple web pages, it is often inconvenient to print or download the entire document using available tools. This thesis introduces a concept called the document boundary to facilitate the representation and analysis of multi-page web documents, and suggests a two-phase approach to the automated identification of document boundaries. In the first phase, individual pages are examined to determine which links are most likely to be intra-document links. This procedure is applied recursively to identify a group of candidate pages that may be part of the same document. In the second phase, the link topology and other features of the identified pages are examined in aggregate for indications of a multi-page document. A test suite of both single- and multi-page web documents was assembled from a mixture of handpicked documents and documents gathered by an arbitrary third party. The document boundary detection system was applied to the main page of each document and achieved a success rate of 73% when its results were compared to the ground truth documents.
Library of Congress Subject Headings
Web sites
Publication Date
4-1-2002
Document Type
Thesis
Department, Program, or Center
Computer Engineering (KGCOE)
Advisor
Harrington, Steven
Advisor/Committee Member
Jones, Price
Advisor/Committee Member
Shaaban, Muhammad
Recommended Citation
Sweet, James, "Reconstructing the Boundary of a Web Document" (2002). Thesis. Rochester Institute of Technology. Accessed from https://repository.rit.edu/theses/3145
Campus
RIT – Main Campus
Comments
Note: imported from RIT’s Digital Media Library running on DSpace to RIT Scholar Works. Physical copy available through RIT’s Wallace Library at: TK5105.888 .S94 2003