Abstract

Mathematical formulas and chemical diagrams appear frequently in scientific documents but are often embedded as visual content, either rasterized or vector-based images, limiting their accessibility and automated analysis. This thesis aims to bridge this gap by presenting a graph-based visual parsing framework that recognizes and parses these notations from both vector and raster image inputs in digital documents. For mathematical formulas in born-digital PDFs, we construct Symbol Layout Trees (SLTs) using a graph defined over vector-based primitives, capturing spatial relationships without relying on OCR. For born-digital chemical diagrams, we introduce a Minimum Spanning Tree (MST)-based technique that extracts molecular structure graphs by interpreting vector graphics using domain-specific spatial and symbolic constraints. To parse rasterized images, we develop a multi-task, segmentation-aware neural network that operates on over-segmented visual primitives extracted via line segment detection and watershed-based segmentation. We create annotated training data by aligning vector-based ground truth with detected visual primitives in raster images. The model jointly performs symbol classification, segmentation, and relationship classification in a multi-task learning framework, using discrete attention mechanisms to dynamically modify input features over iterative passes. We enhance robustness by applying synthetic structural and visual noise at the primitive level to simulate degradations in real document images, and mitigate class imbalance through stratified sampling and loss reweighting strategies, including weighted cross-entropy, class-balanced, and focal losses. We introduce a two-stage graph attention model to support cross-task learning, where class distributions from the first stage inform refinement in the second.
Evaluation metrics compare nodes and edges in the predicted graphs to ground truth using adjacency matrices and Hamming distances to quantify structural and labeling errors. The results and analysis across mathematical and chemical datasets show that (1) the input line-of-sight (LOS) graph representation improves expression coverage (the upper bound on the number of expressions that can be correctly parsed) and reduces the number of edge hypotheses for math, while 6 nearest-neighbor (6NN) graphs are better suited for chemistry due to their local structure, (2) attention mechanisms and cross-task interaction enhance structural prediction, and (3) primitive-level noise augmentation, loss rebalancing, and aggregation improve generalization across input conditions. Together, these findings support the development of a unified and extensible framework for visual parsing of structured scientific notations across domains.
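The evaluation idea described above can be sketched concretely. In this minimal illustration (the function and label names are hypothetical, not taken from the thesis), node labels are encoded on the diagonal of a labeled adjacency matrix and relationship labels off-diagonal; the Hamming distance between the predicted and ground-truth matrices then counts node and edge labeling disagreements in one pass.

```python
import numpy as np

def label_matrix(n, node_labels, edge_labels):
    """Build an n x n labeled adjacency matrix.
    node_labels: {node_index: class_id} placed on the diagonal.
    edge_labels: {(i, j): relation_id} placed off-diagonal.
    0 denotes 'no label' / 'no edge'."""
    A = np.zeros((n, n), dtype=int)
    for i, lbl in node_labels.items():
        A[i, i] = lbl
    for (i, j), lbl in edge_labels.items():
        A[i, j] = lbl
    return A

def hamming(A_pred, A_gt):
    """Count disagreeing entries: node label errors on the diagonal,
    edge/relationship errors off-diagonal."""
    return int(np.sum(A_pred != A_gt))

# Toy 3-node graph: ground truth vs. a prediction that mislabels
# one node (index 2) and one relationship (edge 1 -> 2).
gt = label_matrix(3, {0: 1, 1: 2, 2: 3}, {(0, 1): 4, (1, 2): 5})
pred = label_matrix(3, {0: 1, 1: 2, 2: 6}, {(0, 1): 4, (1, 2): 7})
print(hamming(pred, gt))  # → 2 (one node error + one edge error)
```

Separating diagonal from off-diagonal disagreements in the same comparison also makes it easy to report symbol and relationship errors independently.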

Library of Congress Subject Headings

Mathematical symbols (Typefaces)--Classification; Computer vision; Optical pattern recognition; Mathematics--Formulae; Graph theory

Publication Date

6-2025

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Department of Computing and Information Sciences (Ph.D.)

College

Golisano College of Computing and Information Sciences

Advisor

Richard Zanibbi

Advisor/Committee Member

Qi Yu

Advisor/Committee Member

Weijie Zhao

Campus

RIT – Main Campus

Plan Codes

COMPIS-PHD
