Abstract

We dissect and extend ColBERT, a state-of-the-art multi-vector retrieval model. We first perform a series of experiments to identify the role that structural tokens (i.e., [CLS], [SEP], [MASK], [Q]) play in retrieval. The most consequential findings are that (1) [MASK] tokens can be remapped to their closest non-[MASK] embedding without significant degradation in performance, (2) the presence of [MASK], [Q], and [D] has little effect on the contextualization of query text tokens, and (3) [CLS] and [SEP] can act as “summary” embeddings for the model. In another set of experiments, we extend the number of [MASK] tokens far beyond the number the model was trained with, finding that performance craters when all [MASK]s are removed, rises sharply as the number of [MASK]s increases to around 9, and then plateaus with no significant reduction in performance. Using these findings, we propose GoalBERT, an iterative retrieval model that selects and weights terms using a reinforcement learning-based strategy. We compare this model to Baleen, another ColBERT-derived iterative retrieval model, evaluating it against both the full Baleen pipeline and its retrieval component alone. Though our model underperforms the baseline, we attribute this to placing too much responsibility on the retriever, and we identify promising directions for future research.

Library of Congress Subject Headings

Information retrieval; Iterative methods (Mathematics); Reinforcement learning

Publication Date

5-2024

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Science (MS)

Department, Program, or Center

Computer Science, Department of

College

Golisano College of Computing and Information Sciences

Advisor

Richard Zanibbi

Advisor/Committee Member

Weijie Zhao

Advisor/Committee Member

Arthur Azevedo de Amorim

Campus

RIT – Main Campus

Plan Codes

COMPSCI-MS
