Abstract
Knowledge distillation enables adversaries to replicate the functionality of proprietary machine learning models by querying their APIs and training surrogate student models on the returned soft-label distributions. Antidistillation Sampling (ADS), recently proposed for large language models, perturbs the output distribution of a teacher model at inference time to degrade the quality of the resulting distilled model while preserving utility for legitimate users. We adapt ADS to the supervised classification setting and identify a structural obstacle to its direct transfer: the high-confidence, near-one-hot output distributions characteristic of well-trained classifiers leave insufficient probability mass on non-target classes for the additive penalty to meaningfully redistribute, with a median non-target mass of only 0.20 on CIFAR-100 and 0.09 on CIFAR-10. As a consequence, the unmodified defense reduces distilled-student accuracy by less than 2 percentage points (less than 5% relative) on either dataset.

The central contribution of this thesis is Confidence-based Temperature Scaling (CTS), the mechanism that makes ADS effective in the classification setting. CTS is a new instance of the Adaptive Temperature Scaling family that targets extraction resistance rather than calibration and conditions a per-sample temperature on the top-class probability rather than the full predictive entropy. It applies softening proportional to the teacher's excess confidence above a fixed threshold, aggressively flattening near-one-hot predictions while leaving genuinely uncertain ones untouched, and so creates the non-target probability mass on which ADS can subsequently act.

We then introduce two complementary defense layers that improve the resulting tradeoff but are not themselves what unlocks ADS for classification. Non-target perturbation restricts the ADS penalty to non-target logits, leaving the target-class score unchanged; we deliberately do not enforce preservation of the original argmax, since a sufficiently large non-target perturbation can still flip the top-1 prediction. This non-guarantee is itself a defensive feature: a defense that always preserved the argmax could be circumvented by an attacker who simply trains the student on hard labels read off the (always-correct) top-1 class, whereas the controlled possibility of argmax flips creates uncertainty about which queries carry a reliable hard label. Non-target permutation, which repurposes Furlanello et al.'s DKPP diagnostic as an active defense, randomly shuffles the poisoned non-target probabilities across class indices on every query, destroying the inter-class relational structure on which distillation relies at essentially no cost to the teacher.

We evaluate the complete pipeline on CIFAR-100 (ResNet-110 → ResNet-20) and CIFAR-10 (ResNet-56 → ResNet-20). The full defense reduces distilled-student accuracy by approximately 18 percentage points (41% relative) on CIFAR-100, and the same component ordering yields a 27-point (34% relative) drop on CIFAR-10 at a comparable teacher cost. A component-level ablation confirms that the full pipeline occupies the Pareto frontier of the defense–utility tradeoff, while ADS composed with non-target permutation alone provides a low-cost alternative that captures roughly 63% of the defensive yield at 74% of the teacher cost.
The defense operates entirely at inference time: it requires no modification to the teacher's weights, no access to the original training corpus, and only two additional forward passes through a compact proxy model per query. Our results demonstrate that confidence-aware temperature scaling, augmented with targeted perturbation and permutation strategies, constitutes a viable inference-time defense against unauthorized knowledge distillation in the classification domain.
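
As a concrete illustration of the pipeline summarized above, the following is a minimal NumPy sketch of the three inference-time layers applied to a single query. The threshold, the scaling constants (alpha, eps), and the function names are illustrative placeholders rather than the settings used in the thesis, and the per-class ADS penalty vector (computed in the thesis from proxy-model forward passes) is treated here as given.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z):
        """Numerically stable softmax over the last axis."""
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def cts_temperature(p_top, threshold=0.9, alpha=10.0):
        """Confidence-based Temperature Scaling: the per-sample temperature grows
        in proportion to the teacher's excess confidence above a fixed threshold;
        predictions at or below the threshold keep T = 1 (left untouched).
        threshold and alpha are illustrative values, not the thesis settings."""
        excess = max(p_top - threshold, 0.0)
        return 1.0 + alpha * excess

    def defend_output(teacher_logits, ads_penalty, threshold=0.9, alpha=10.0, eps=1.0):
        """Apply the three defense layers to one query's teacher logits.
        ads_penalty stands in for the per-class ADS penalty; its derivation
        from the proxy model is outside this sketch."""
        probs = softmax(teacher_logits)
        top = int(np.argmax(probs))

        # 1) CTS: soften near-one-hot outputs so non-target classes regain mass.
        T = cts_temperature(probs[top], threshold, alpha)
        z = teacher_logits / T

        # 2) Non-target perturbation: apply the ADS penalty only to non-target
        #    logits; the teacher's top-class score is left unchanged.
        #    The sign and scale convention (eps) is an assumption.
        nontarget = np.arange(z.size) != top
        z[nontarget] -= eps * ads_penalty[nontarget]

        # 3) Non-target permutation: shuffle the poisoned non-target
        #    probabilities across class indices, redrawn on every query.
        out = softmax(z)
        idx = np.flatnonzero(nontarget)
        out[idx] = out[rng.permutation(idx)]
        return out

    # Example query: a confident 5-class teacher prediction and a stand-in penalty.
    logits = np.array([8.0, 1.5, 0.5, -0.3, 0.1])
    penalty = rng.normal(size=logits.shape)
    print(defend_output(logits, penalty))

Because the permutation is redrawn independently on every query, repeated queries of the same input return differently ordered non-target mass, which is what removes the inter-class structure a student would otherwise learn from the soft labels.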
Publication Date
5-2026
Document Type
Thesis
Student Type
Graduate
Degree Name
Artificial Intelligence (MS)
College
Golisano College of Computing and Information Sciences
Advisor
Mohammad Javad Khojasteh
Advisor/Committee Member
Majid Rabbani
Advisor/Committee Member
Sohail Dianat
Recommended Citation
Abaid Ullah, Khawaja, "Antidistillation Sampling for Classification Models" (2026). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12584
Campus
RIT – Main Campus
