Abstract

Post-translational modifications (PTMs) are chemical changes that proteins undergo after translation, and they play a key role in regulating protein function and cellular processes. Their dysregulation is associated with various diseases, making accurate prediction of PTM sites essential for understanding cellular mechanisms and informing therapeutic strategies. This work focuses on the computational prediction of two biologically significant PTMs, succinylation and O-GlcNAcylation, using recent advances in protein language models (pLMs). By analogy with natural language, amino acids are treated as words and sequences as sentences, allowing pLMs to capture contextual dependencies within protein sequences.
One of the earliest contributions of this thesis, LMSuccSite, is a framework that integrates global and local contextual information from protein sequences to improve the identification of succinylation sites. The global context is captured using pretrained pLMs, which learn semantic representations of amino acid sequences by modeling dependencies across distant residues. To complement this, the local context is incorporated through a supervised word embedding model that encodes dense representations of fixed-length sequence windows centered on target residues. The global and local representations are then fused into a unified representation that benefits from both broad sequence-level understanding and fine-grained residue-level patterns. This fusion addresses a key limitation of prior approaches, which rely on handcrafted features that are often restricted to local context. By integrating embeddings from both perspectives, LMSuccSite offers a more comprehensive approach to succinylation site prediction.
Building on the effectiveness of pLMs in succinylation site prediction, the second project in this thesis focuses on identifying O-GlcNAcylation sites with a model named LM-OGlcNAc-Site, for which we explored various strategies for integrating multiple sequence-based pLMs. LM-OGlcNAc-Site combines pLMs with diverse architectures and training objectives, as each captures distinct contextual representations of protein sequences, and their fusion yields a more comprehensive and informative representation of proteins. This integrated approach improved the identification of O-GlcNAc sites compared to individual representations, demonstrating the effectiveness of pLM fusion in enhancing PTM site prediction.
Extending the sequence-based protein representations, we further investigated the integration of structure-aware pLMs for residue-level and protein-level classification tasks, aiming to capture complementary information embedded in protein sequences and their 3D conformations. Several state-of-the-art sequence-based and structure-aware pLMs were systematically evaluated as sources of high-dimensional embeddings, and multiple fusion strategies were explored to integrate these diverse feature sets effectively. While sequence-based models demonstrated strong baseline performance, incorporating structural context yielded competitive results in most of the experiments. These experiments highlight the potential of structural information to enhance protein function prediction.

Library of Congress Subject Headings

Post-translational modifications--Models; Proteomics; Genomics

Publication Date

7-2025

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Computing and Information Sciences Ph.D, Department of

College

Golisano College of Computing and Information Sciences

Advisor

Dukka B. KC

Advisor/Committee Member

Stefan Schulze

Advisor/Committee Member

Christopher Homan

Campus

RIT – Main Campus

Plan Codes

COMPIS-PHD
