Abstract

Contrastive learning methods require well-defined positive pairs, limiting their applicability to domains where complete, high-fidelity pairings are available. In practice, large-scale scientific corpora (including patents, publications, and web-scale data) contain vast quantities of contextually relevant but incompletely paired samples that are discarded under standard training paradigms. In this work, we demonstrate that hard negative mining can be leveraged to construct pseudo-positive supervision signals from unpaired or partially paired data, enabling contrastive learning to exploit the full breadth of available corpora without sacrificing representational quality. Using a large-scale chemical drug patent corpus as a testbed, we train a cross-modal contrastive model that aligns chemical text passages with molecular SMILES representations. A significant fraction of these passages lack extractable chemical entity mentions and therefore have no ground-truth positive pairing. We show that incorporating these unpaired passages via hard-negative-derived pseudo-positive objectives yields higher ranking quality (nDCG) than training on strictly paired data alone. The hard negative selection strategy proves critical: random pseudo-positive assignment degrades performance, whereas geometric hard negative mining provides a meaningful alignment signal. Our findings suggest that partially paired data, when coupled with principled pseudo-supervision, provides complementary context that enriches the contrastive learning objective beyond what high-fidelity pairs alone can offer. This paradigm generalizes naturally to other data-abundant but incompletely paired domains, including biology, physics, and large-scale web and social media corpora, offering a path toward more data-efficient contrastive representation learning at scale.
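As a rough illustration of the pseudo-positive idea described above, the following Python sketch assigns each unpaired text embedding its nearest molecule embedding by cosine similarity. This is only one possible reading of "geometric hard negative mining", not the dissertation's implementation; the function names, the `min_sim` threshold, and the use of cosine similarity are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pseudo_positives(text_embs, mol_embs, min_sim=0.0):
    """Hypothetical helper: for each unpaired text embedding, return the index
    of the most similar molecule embedding (the "hardest" candidate), or None
    when no candidate reaches min_sim. Both the threshold and the similarity
    measure are assumptions, not details taken from the source."""
    if not mol_embs:
        return [None] * len(text_embs)
    picks = []
    for t in text_embs:
        sims = [cosine(t, m) for m in mol_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        picks.append(best if sims[best] >= min_sim else None)
    return picks


# Toy usage with 2-dimensional embeddings.
texts = [[1.0, 0.0], [0.0, 1.0]]
molecules = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
pseudo_pairs = mine_pseudo_positives(texts, molecules)
```

In this toy setup each text passage is matched to the molecule that lies closest to it in the embedding space, whereas a random assignment would ignore that geometry entirely.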
To rigorously evaluate our model, we constructed an end-to-end test collection grounded in real-world chemical information retrieval needs. First, in collaboration with domain expert chemists, we developed a graded relevance benchmark comprising 35 diverse query topics, each with 10 to 30 expert-assessed candidates pooled from a retrieval pipeline combining specialized models for both chemical text and SMILES-based molecular search. Relevance judgments were assigned at multiple granularity levels reflecting the degree to which each candidate satisfies the query's intended information need, yielding the first publicly available graded relevance collection for chemical information retrieval. Second, we constructed a multi-modal test dataset derived from 142 chemical patent PDFs spanning over 30,000 pages, covering compounds associated with 14 specific human gene targets. Each extracted passage retains provenance metadata linking it to its source PDF, page, and location, enabling fine-grained retrieval evaluation. The pooling pipeline used to generate candidates combined term-based retrieval (BM25, PL2), BERT-based dense retrieval for semantic text matching, Tanimoto similarity and subgraph search for molecular structure matching, and ChemBERTa-based dense retrieval for molecular semantic similarity. Multi-modal queries combining text and molecular inputs were resolved via overlap-based re-ranking over the union of text and molecule retrieval candidates. Together, the collection provides a rigorous and reproducible evaluation framework for cross-modal chemical retrieval.
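For readers unfamiliar with the ranking metric named above, a minimal Python sketch of nDCG over graded relevance labels follows. It assumes linear gains, a log2 rank discount, and an ideal ranking built only from the judged candidates supplied; the dissertation's exact gain function, cutoff, and grade scale are not stated here, so these choices are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_gains, k=None):
    """nDCG@k for one query. ranked_gains holds the graded relevance of each
    retrieved candidate in ranked order. The ideal ranking is computed from
    the same judged list, which is an assumption of this sketch."""
    k = len(ranked_gains) if k is None else k
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments (higher means more relevant).
perfect = ndcg([3, 2, 1, 0])   # candidates already in ideal order
swapped = ndcg([0, 3])         # best candidate ranked second
```

A ranking that places the most relevant candidates first scores 1.0, and the score falls as highly graded candidates are pushed down the list.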

Publication Date

4-2026

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Computing and Information Sciences Ph.D, Department of

College

Golisano College of Computing and Information Sciences

Advisor

Richard Zanibbi

Advisor/Committee Member

Nathaniel H. Stanley

Advisor/Committee Member

Weijie Zhao

Campus

RIT – Main Campus
