Abstract
Proteins of unknown function represent a significant gap in our understanding of biological processes, with many organisms, especially prokaryotes, harboring large portions of their proteomes that remain uncharacterized. For example, in Pseudomonas aeruginosa, a major human pathogen, a large proportion of proteins are categorized as unknown or hypothetical in function. The precise number varies across species, but studies have shown that 30–50% of the proteins in a bacterial proteome can lack functional annotations. Addressing this gap is critical to understanding the biology and pathogenicity of such organisms. Recent advancements in computational tools, particularly those for sequence- and structure-based prediction, offer new opportunities to annotate these proteins of unknown function. Here, we present a computational pipeline, ProtPen, that integrates eggNOG-mapper for sequence-based functional annotations and Foldseek for structural comparisons using AlphaFold-generated models. ProtPen begins by using FASTA sequences to generate functional annotations with eggNOG-mapper, followed by the retrieval of AlphaFold protein structures from UniProt. These structures are then analyzed using Foldseek to identify structural homologs. By combining these results, the pipeline improves the accuracy and comprehensiveness of functional predictions. We applied this pipeline to quantitative proteomics datasets from P. aeruginosa strains PAO1 and LESB58, where a significant proportion of differentially abundant proteins were of unknown or hypothetical function. In the PAO1 strain, 7 out of 21 proteins were of unknown function, and the pipeline provided annotations for 5 of these. In the LESB58 strain, 28 out of 66 proteins were of unknown function, and the pipeline provided annotations for 20 of these. Our findings demonstrate that combining sequence- and structure-based approaches offers complementary insights into protein function.
When integrated with quantitative proteomics data, the pipeline provides functional insights into a large fraction of previously uncharacterized proteins whose significant proteomics results had already demonstrated their importance in antibiotic resistance. Notably, this versatile pipeline is easily extendable to new annotation tools and applicable to protein sequences from a wide range of organisms and datasets.
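The combination step described in the abstract can be illustrated with a minimal Python sketch. This is not the ProtPen implementation: the `Annotation` class, the `combine` and `coverage` functions, the protein identifiers, and the hit descriptions are all hypothetical, and the two input dictionaries simply stand in for already-parsed output from a sequence-based tool and a structure-based tool. The sketch only shows how two evidence sources might be merged per protein and how an annotated fraction could be computed. For reference, the counts reported in the abstract (5 of 7 and 20 of 28) both correspond to roughly 71%.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Annotation:
    """Hypothetical record holding both kinds of evidence for one protein."""
    protein_id: str
    sequence_hit: Optional[str]   # description from a sequence-based tool, if any
    structure_hit: Optional[str]  # description of a structural homolog, if any

    @property
    def evidence(self) -> str:
        """Label which evidence source(s) produced an annotation."""
        if self.sequence_hit and self.structure_hit:
            return "both"
        if self.sequence_hit:
            return "sequence"
        if self.structure_hit:
            return "structure"
        return "none"


def combine(protein_ids: List[str],
            sequence_hits: Dict[str, str],
            structure_hits: Dict[str, str]) -> List[Annotation]:
    """Merge per-protein results from the two (already parsed) sources."""
    return [
        Annotation(pid, sequence_hits.get(pid), structure_hits.get(pid))
        for pid in protein_ids
    ]


def coverage(annotations: List[Annotation]) -> float:
    """Fraction of proteins with at least one annotation from either source."""
    if not annotations:
        return 0.0
    annotated = sum(1 for a in annotations if a.evidence != "none")
    return annotated / len(annotations)


# Made-up placeholder data, for illustration only.
proteins = ["P1", "P2", "P3", "P4"]
seq = {"P1": "putative transporter", "P2": "putative hydrolase"}
struct = {"P2": "hydrolase-like fold", "P3": "regulator-like fold"}

merged = combine(proteins, seq, struct)
for a in merged:
    print(a.protein_id, a.evidence)
print(f"annotated fraction: {coverage(merged):.2f}")
```

Keeping the evidence label separate from the annotation text makes it easy to see where the two approaches agree and where only one of them contributes, which is the complementarity the abstract refers to.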
Library of Congress Subject Headings
Proteins--Analysis--Data processing; Structural bioinformatics; Sequence alignment (Bioinformatics)
Publication Date
4-8-2025
Document Type
Thesis
Student Type
Graduate
Degree Name
Bioinformatics (MS)
Department, Program, or Center
Thomas H. Gosnell School of Life Sciences
College
College of Science
Advisor
Stefan Schulze
Advisor/Committee Member
Gary R. Skuse
Advisor/Committee Member
Paul Craig
Recommended Citation
Mathai, Diya, "ProtPen: A pipeline that combines sequence- and structure-based approaches to predict protein functions" (2025). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/12073
Campus
RIT – Main Campus
Plan Codes
BIOINFO-MS