Abstract
Contrastive learning methods require well-defined positive pairs, limiting their applicability to domains where complete, high-fidelity pairings are available. In practice, large-scale scientific corpora (including patents, publications, and web-scale data) contain vast quantities of contextually relevant but incompletely paired samples that are discarded under standard training paradigms. In this work, we demonstrate that hard negative mining can be leveraged to construct pseudo-positive supervision signals from unpaired or partially paired data, enabling contrastive learning to exploit the full breadth of available corpora without sacrificing representational quality. Using a large-scale chemical drug patent corpus as a testbed, we train a cross-modal contrastive model aligning chemical text passages with molecular SMILES representations, where a significant fraction of passages lack extractable chemical entity mentions and have no ground-truth positive pairing. We show that incorporating these unpaired passages via hard-negative-derived pseudo-positive objectives yields higher ranking quality (nDCG) than training on strictly paired data alone. The hard negative selection strategy proves critical: random pseudo-positive assignment degrades performance, while geometric hard negative mining provides a meaningful alignment signal. Our findings suggest that partially paired data, when coupled with principled pseudo-supervision, provides complementary context that enriches the contrastive learning objective beyond what high-fidelity pairs alone can offer. This paradigm generalizes naturally to other data-abundant but incompletely paired domains, including biology, physics, and large-scale web and social media corpora, offering a path toward more data-efficient contrastive representation learning at scale.
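As a rough illustration of the pseudo-positive idea described above, the following minimal sketch assigns each unpaired passage the molecule that lies closest to it in embedding space, that is, the candidate that would otherwise act as its hardest negative. The selection rule (cosine nearest neighbor), the function names, and the toy 2-D vectors are all assumptions made for illustration; the abstract does not specify the exact mining procedure used in the dissertation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pseudo_positives(passage_vecs, molecule_vecs):
    """For each unpaired passage, return the index of its most similar molecule.

    Under standard training that molecule would be the passage's hardest
    negative; here it is reused as a pseudo-positive. This rule is an
    assumption for illustration, not the dissertation's confirmed procedure.
    """
    picks = []
    for p in passage_vecs:
        sims = [cosine(p, m) for m in molecule_vecs]
        picks.append(max(range(len(sims)), key=sims.__getitem__))
    return picks

# Toy 2-D embeddings (hypothetical values, for illustration only).
unpaired_passages = [[1.0, 0.1], [0.0, 1.0]]
molecules = [[0.9, 0.2], [0.1, 0.8], [-1.0, 0.0]]

pseudo = mine_pseudo_positives(unpaired_passages, molecules)
print(pseudo)
```

In a training loop, the mined (passage, molecule) pairs would presumably be added to the contrastive objective alongside the ground-truth pairs; a random assignment baseline would replace the nearest-neighbor choice with a uniformly sampled molecule index.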
To rigorously evaluate our model, we constructed an end-to-end test collection grounded in real-world chemical information retrieval needs. First, in collaboration with domain expert chemists, we developed a graded relevance benchmark comprising 35 diverse query topics, each with 10 to 30 expert-assessed candidates pooled from a retrieval pipeline combining specialized models for both chemical text and SMILES-based molecular search. Relevance judgments were assigned at multiple granularity levels reflecting the degree to which each candidate satisfies the query's intended information need, yielding the first publicly available graded relevance collection for chemical information retrieval. Second, we constructed a multi-modal test dataset derived from 142 chemical patent PDFs spanning over 30,000 pages, covering compounds associated with 14 specific human gene targets. Each extracted passage retains provenance metadata linking it to its source PDF, page, and location, enabling fine-grained retrieval evaluation. The pooling pipeline used to generate candidates combined term-based retrieval (BM25, PL2), BERT-based dense retrieval for semantic text matching, Tanimoto similarity and subgraph search for molecular structure matching, and ChemBERTa-based dense retrieval for molecular semantic similarity. Multi-modal queries combining text and molecular inputs were resolved via overlap-based re-ranking over the union of text and molecule retrieval candidates. Together, these components provide a rigorous and reproducible evaluation framework for cross-modal chemical retrieval.
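To make concrete how graded relevance judgments of this kind feed a ranking metric such as nDCG, the following minimal sketch computes nDCG for one query from the expert grades of its ranked candidates. It assumes the common linear-gain, log2-discount formulation and a hypothetical 0 to 3 grade scale; the abstract does not state which gain function, cutoff, or grade scale the dissertation uses.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    """nDCG@k for one query.

    ranked_relevances holds the graded judgment of every assessed candidate,
    in the order the system ranked them. The ideal ranking is the same grades
    sorted from highest to lowest.
    """
    k = len(ranked_relevances) if k is None else k
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical grades (0 = not relevant ... 3 = highly relevant).
perfect_ranking = [3, 2, 1, 0]
reversed_ranking = [0, 1, 2, 3]

print(ndcg(perfect_ranking))
print(ndcg(reversed_ranking))
```

A ranking that places the highest-graded candidates first scores 1.0, and any other ordering of the same candidates scores lower, which is what lets graded judgments distinguish systems that binary relevance would treat as equal.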
Publication Date
4-2026
Document Type
Dissertation
Student Type
Graduate
Degree Name
Computing and Information Sciences (Ph.D.)
Department, Program, or Center
Computing and Information Sciences Ph.D, Department of
College
Golisano College of Computing and Information Sciences
Advisor
Richard Zanibbi
Advisor/Committee Member
Nathaniel H. Stanley
Advisor/Committee Member
Weijie Zhao
Recommended Citation
Dey, Abhisek, "In-Context Retrieval for Molecules and Chemical Synthesis Pathways" (2026). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12617
Campus
RIT – Main Campus
