Abstract

Malware classification poses unique challenges for continual learning (CL) systems, driven by the daily influx of new samples and the evolving nature of malware threats that exploit new vulnerabilities. Antivirus vendors encounter hundreds of thousands of unique software pieces daily, encompassing both malicious and benign files. Over its operational life, a malware classifier can accumulate more than a billion samples. Training a malware classification system on only new samples and classes leads to catastrophic forgetting (CF), where the system forgets previously learned data distributions. While retraining with all old and new samples effectively combats CF, it is computationally expensive and necessitates storing vast amounts of older software and malware samples. Sequential training with CL strategies offers a potential solution by reducing both training and storage demands. However, the adoption of CL for malware classification has not been extensively explored. This work represents the first in-depth examination of CL not just in the realm of malware classification, but also more broadly within the cybersecurity domain. In this thesis, we first systematize the malware classification pipeline through the lens of three continual learning scenarios: Domain Incremental Learning (Domain-IL), Class Incremental Learning (Class-IL), and Task Incremental Learning (Task-IL), detailed in Chapter 3. Our objective is to bridge the research gap between existing CL literature and the specific needs of malware classification. We undertake a thorough examination of state-of-the-art CL methods within the frameworks of these three CL scenarios. Chapter 4 presents an in-depth exploration of the catastrophic forgetting phenomenon in the context of malware classification.
We analyze the applicability and performance of 11 leading CL techniques across three categories -- regularization, replay, and replay with exemplars -- initially developed for computer vision tasks, to determine whether they can also mitigate catastrophic forgetting in the malware domain. Contrary to expectations, our findings indicate that none of the CL approaches tested successfully mitigates catastrophic forgetting in malware classification systems, pointing to a significant research opportunity in this domain. The unexpected results presented in Chapter 4 prompted a detailed exploratory analysis, in Chapter 5, of the EMBER dataset, which comprises Windows malware and benign software samples. This analysis revealed significant diversity within malware data distributions, both across and within malware families. Drawing on these insights, we developed MADAR -- Malware Analysis with Diversity-Aware Replay. This strategy for malware classification adopts a diversity-aware, replay-based approach, integrating a mix of representative and novel samples into the training regimen to help the model retain learned information and identify emerging malware threats, even with a limited memory budget. Moreover, we have created two new benchmarks using Android malware from the AndroZoo repository for testing in both Domain-IL (AZ-Domain) and Class-IL (AZ-Class) scenarios. The results from these benchmarks underscore the effectiveness of the MADAR framework, establishing it as the new state of the art and demonstrating its enhanced performance over existing leading CL methods in adapting to realistic shifts in malware data distribution. In Chapter 6, we conclude with promising future research directions to advance continual learning for ever-evolving, intelligent malware classification systems, focusing on adaptability to evolving threats and tackling challenges relevant to both industry and academia.

Library of Congress Subject Headings

Malware (Computer software)--Classification; Transfer learning (Machine learning); Cyberterrorism--Prevention

Publication Date

5-2024

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Computing and Information Sciences Ph.D, Department of

College

Golisano College of Computing and Information Sciences

Advisor

Matthew Wright

Advisor/Committee Member

Scott E. Coull

Advisor/Committee Member

Qi Yu

Campus

RIT – Main Campus

Plan Codes

COMPIS-PHD
