Parsing of Math Formulas and Chemical Diagrams using Graph-Based Representation and Attention Models
Abstract
Mathematical formulas and chemical diagrams appear frequently in scientific documents but are often embedded as visual content, in either raster or vector images, which limits their accessibility and automated analysis. This thesis addresses this gap by presenting a graph-based visual parsing framework that recognizes and parses these notations from both vector and raster image inputs in digital documents. For mathematical formulas in born-digital PDFs, we construct Symbol Layout Trees (SLTs) using a graph defined over vector-based primitives, capturing spatial relationships without relying on OCR. For born-digital chemical diagrams, we introduce a Minimum Spanning Tree (MST)-based technique that extracts molecular structure graphs by interpreting vector graphics using domain-specific spatial and symbolic constraints. To parse rasterized images, we develop a multi-task, segmentation-aware neural network that operates on over-segmented visual primitives extracted via line segment detection and watershed-based segmentation. We create annotated training data by aligning vector-based ground truth with detected visual primitives in raster images. The model jointly performs symbol classification, segmentation, and relationship classification in a multi-task learning framework, utilizing discrete attention mechanisms to dynamically modify input features over iterative passes. We enhance robustness using synthetic structural and visual noise applied at the primitive level to simulate degradations in real document images, and we mitigate class imbalance through stratified sampling and loss reweighting strategies, including weighted cross-entropy, class-balanced, and focal losses. We introduce a two-stage graph attention model to support cross-task learning, where class distributions from the first stage are used to inform refinement in the second.
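As a rough illustration of the loss reweighting strategies named above, the sketch below computes a weighted cross-entropy and a focal loss for a single example. It is not taken from the thesis: the function names, the list-based inputs, and the default gamma of 2.0 are assumptions chosen for illustration, and the class-balanced loss is omitted.

```python
import math

def weighted_cross_entropy(probs, target, class_weights):
    """Cross-entropy for one example, scaled by the weight of the true class.

    probs: predicted class probabilities (hypothetical input format).
    target: index of the true class.
    class_weights: per-class weights, typically larger for rare classes.
    """
    return -class_weights[target] * math.log(probs[target])

def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Focal loss for one example.

    The factor (1 - p_t) ** gamma shrinks the loss of well-classified
    examples, so hard (often minority-class) examples dominate training.
    alpha is an optional list of per-class weights.
    """
    p_t = probs[target]
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma set to 0 and no alpha, the focal loss reduces to plain cross-entropy, which gives a simple sanity check for an implementation.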
Evaluation metrics compare nodes and edges in the predicted graphs to ground truth using adjacency matrices and Hamming distances to quantify structural and labeling errors. The results and analysis across mathematical and chemical datasets show that (1) line-of-sight (LOS) input graphs improve expression coverage (the upper bound on the number of expressions that can be correctly parsed) and reduce the number of edge hypotheses for math, while 6-nearest-neighbor (6NN) graphs are better suited for chemistry due to their local structure; (2) attention mechanisms and cross-task interaction enhance structural prediction; and (3) primitive-level noise augmentation, loss rebalancing, and aggregation improve generalization across input conditions. Together, these findings support the development of a unified and extensible framework for visual parsing of structured scientific notations across domains.
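The adjacency-matrix comparison described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the thesis: it assumes both graphs are defined over the same ordered set of primitives, that diagonal entries hold node (symbol) labels, and that off-diagonal entries hold edge (relationship) labels, with "_" standing for "no edge". The function name and label strings are hypothetical.

```python
def label_hamming(pred, truth):
    """Count disagreeing node and edge labels between two label matrices.

    pred, truth: square lists of lists of labels over the same primitives.
    Returns (node_errors, edge_errors), the Hamming distances over the
    diagonal and off-diagonal entries respectively.
    """
    n = len(truth)
    if len(pred) != n or any(len(row) != n for row in pred + truth):
        raise ValueError("matrices must be square and the same size")
    node_errors = sum(pred[i][i] != truth[i][i] for i in range(n))
    edge_errors = sum(
        pred[i][j] != truth[i][j]
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return node_errors, edge_errors

# Two primitives, "x" and "2", for the formula x^2.
truth = [["x", "Sup"],
         ["_", "2"]]
# Both symbols are right, but the relationship is mislabeled.
pred = [["x", "Right"],
        ["_", "2"]]
errors = label_hamming(pred, truth)
```

Here `errors` is `(0, 1)`: no symbol labels disagree and one relationship label does, which separates labeling errors on nodes from structural errors on edges.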
Library of Congress Subject Headings
Mathematical symbols (Typefaces)--Classification; Computer vision; Optical pattern recognition; Mathematics--Formulae; Graph theory
Publication Date
6-2025
Document Type
Dissertation
Student Type
Graduate
Degree Name
Computing and Information Sciences (Ph.D.)
Department, Program, or Center
Computing and Information Sciences Ph.D, Department of
College
Golisano College of Computing and Information Sciences
Advisor
Richard Zanibbi
Advisor/Committee Member
Qi Yu
Advisor/Committee Member
Weijie Zhao
Recommended Citation
Shah, Ayush Kumar, "Parsing of Math Formulas and Chemical Diagrams using Graph-Based Representation and Attention Models" (2025). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12254
Campus
RIT – Main Campus
Plan Codes
COMPIS-PHD
