Abstract

Large language models (LLMs) have demonstrated strong performance on a range of reasoning tasks; however, their reliability often depends not only on model size or training data but also on inference-time strategies. Existing inference-time methods are typically evaluated in isolation and under differing experimental assumptions, making it difficult to draw systematic conclusions about their relative effectiveness. This thesis presents a controlled empirical study of inference-time scaling strategies for LLMs under fixed inference-time compute budgets. The findings reveal that no single strategy dominates uniformly. PRM-guided selection with the IBM Granite verifier achieves the highest absolute accuracy across arithmetic and compositional tasks. Multi-agent debate with heterogeneous critic-solver pairs achieves the highest overall accuracy on object counting, surpassing PRM guidance. Majority voting provides reliable, compute-efficient improvements across most settings when answer diversity is sufficient; beam search helps, but its gains rarely grow with beam width. We also show that verifier choice and aggregation method (notably, last-step aggregation) substantially affect PRM performance, and that model heterogeneity in debate consistently outperforms self-debate.

Publication Date

4-2026

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Science (MS)

Department, Program, or Center

Computer Science, Department of

College

Golisano College of Computing and Information Sciences

Advisor

Christopher Homan

Advisor/Committee Member

Eduardo Coelho De Lima

Advisor/Committee Member

Richard Lange

Campus

RIT – Main Campus
