Abstract
We dissect and extend ColBERT, a state-of-the-art multi-vector retrieval model. We first perform a number of experiments to identify the role that structural tokens (i.e., [CLS], [SEP], [MASK], [Q], and [D]) play in retrieval. The most consequential findings are that (1) [MASK] tokens can be remapped to their closest non-[MASK] embedding without a significant degradation in performance, (2) the presence of [MASK], [Q], and [D] has little effect on the contextualization of query text tokens, and (3) [CLS] and [SEP] can act as “summary” embeddings for the model. In another set of experiments, we extend the number of [MASK] tokens far beyond the number the model was trained with. Performance drops sharply when all [MASK] tokens are removed, rises quickly as the number of [MASK] tokens increases to around 9, and then plateaus with no significant reduction. Using this information, we propose GoalBERT, an iterative retrieval model that uses our findings to select and add weight to terms using a reinforcement learning-based strategy. We compare this model to Baleen, another ColBERT-derived iterative retrieval model, to identify how it performs against the full Baleen pipeline and against just its retrieval component. Although our model underperforms the baseline, we attribute this to placing too much responsibility on the retriever, and we identify promising directions for future research.
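To make finding (1) concrete, the following is a minimal sketch of what remapping [MASK] embeddings could look like, alongside the late-interaction (MaxSim) scoring that ColBERT uses. It is an illustration under stated assumptions, not the thesis's implementation: "closest" is assumed here to mean highest cosine similarity, embeddings are plain Python lists, and the function names `remap_mask_embeddings` and `maxsim_score` are hypothetical.

```python
from math import sqrt


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def remap_mask_embeddings(embeddings, is_mask):
    """Replace each [MASK] embedding with its most similar non-[MASK] embedding.

    Assumption: similarity is cosine similarity over unit-normalized vectors.
    """
    unit = [normalize(e) for e in embeddings]
    anchors = [i for i, masked in enumerate(is_mask) if not masked]
    if not anchors:
        raise ValueError("at least one non-[MASK] embedding is required")
    remapped = []
    for i, emb in enumerate(unit):
        if is_mask[i]:
            best = max(anchors, key=lambda j: dot(emb, unit[j]))
            remapped.append(unit[best])
        else:
            remapped.append(emb)
    return remapped


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: for each query vector, take its maximum
    similarity over all document vectors, then sum over query vectors."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy example: two text-token embeddings and one [MASK] embedding.
query = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
mask_flags = [False, False, True]
document = [normalize([1.0, 0.0])]

remapped_query = remap_mask_embeddings(query, mask_flags)
score = maxsim_score(remapped_query, document)
```

In this toy example the [MASK] vector is replaced by the first token's embedding, so after remapping it contributes the same similarity to the score as that token does.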
Library of Congress Subject Headings
Information retrieval; Iterative methods (Mathematics); Reinforcement learning
Publication Date
5-2024
Document Type
Thesis
Student Type
Graduate
Degree Name
Computer Science (MS)
Department, Program, or Center
Computer Science, Department of
College
Golisano College of Computing and Information Sciences
Advisor
Richard Zanibbi
Advisor/Committee Member
Weijie Zhao
Advisor/Committee Member
Arthur Azevedo de Amorim
Recommended Citation
Giacalone, Ben, "GoalBERT: Goal-Directed ColBERT for Iterative Retrieval" (2024). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/11754
Campus
RIT – Main Campus
Plan Codes
COMPSCI-MS