Abstract
Large language models (LLMs) have demonstrated strong performance on a range of reasoning tasks, yet their reliability often depends not only on model size and training data but also on inference-time strategies. Existing inference-time methods, however, are typically evaluated in isolation and under differing experimental assumptions, making it difficult to draw systematic conclusions about their relative effectiveness. This thesis presents a controlled empirical study of inference-time scaling strategies for large language models under fixed inference-time compute budgets. The findings reveal that no single strategy dominates uniformly. PRM-guided selection with the IBM Granite verifier achieves the highest absolute accuracy on arithmetic and compositional tasks. Multi-agent debate with heterogeneous critic-solver pairs achieves the highest overall accuracy on object counting, surpassing PRM guidance. Majority voting provides reliable, compute-efficient improvements across most settings when answer diversity is sufficient, while beam search yields modest gains that rarely grow with beam width. We also show that verifier choice and aggregation method (notably, last-step aggregation) substantially affect PRM performance, and that model heterogeneity in debate consistently outperforms self-debate.
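For illustration, the sketch below shows the simplest of the compared strategies, majority voting (self-consistency), under a fixed sample budget. It is a minimal sketch, not the thesis's experimental code: the `sample_answer` callable and the toy model are hypothetical placeholders standing in for repeated stochastic calls to an actual LLM.

```python
# Minimal sketch of majority voting (self-consistency) under a fixed sample budget.
# `sample_answer` is a hypothetical stand-in for one stochastic model call that
# returns a final answer string; a real implementation would query an LLM here.
import random
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], budget: int) -> str:
    """Draw `budget` independent answers and return the most frequent one."""
    answers = [sample_answer() for _ in range(budget)]
    # Count identical final answers and keep the one with the highest count.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy stand-in model: answers "42" 60% of the time, errs otherwise.
    toy_model = lambda: "42" if random.random() < 0.6 else random.choice(["41", "43"])
    print(majority_vote(toy_model, budget=16))
```

Because each sample is independent, the budget parameter maps directly onto the fixed inference-time compute budgets compared in the study.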
Publication Date
4-2026
Document Type
Thesis
Student Type
Graduate
Degree Name
Computer Science (MS)
Department, Program, or Center
Computer Science, Department of
College
Golisano College of Computing and Information Sciences
Advisor
Christopher Homan
Advisor/Committee Member
Eduardo Coelho De Lima
Advisor/Committee Member
Richard Lange
Recommended Citation
Owolabi, Oluwamayowa, "A Comparative Study of Inference-Time Scaling Strategies for Large Language Models" (2026). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12550
Campus
RIT – Main Campus
