Using Machine Learning to Determine Peptide Sequences with High Heme Binding Propensity
Session Number
CMPS(ai) 03
Advisor(s)
Dr. Chris Fry, Dr. Henry Chan, Dr. Jesse Prelesnik, Argonne National Laboratory
Discipline
Computer Science
Start Date
17-4-2025 11:10 AM
End Date
17-4-2025 11:25 AM
Abstract
Self-assembling peptides, chains of amino acids that form various structures in response to environmental conditions, have many uses in materials science and in biomedicine, such as drug delivery. We aim to use machine learning to find patterns in peptide sequences and thereby streamline materials discovery. To that end, we experimentally gathered spectroscopy data on >200 peptides, each 16 amino acids long. These rationally designed peptides contained sequence redundancy, which made the dataset ineffective for machine learning. We therefore need more data to build a model capable of predicting the heme binding propensity of a peptide from its amino acid sequence. Here, we aim to create a new pipeline for discovering novel sequences without relying on rational design. We gathered data on >4000 proteins from the Protein Data Bank to build a Long Short-Term Memory (LSTM) model capable of generating amino acid sequences derived from those of naturally occurring heme binding proteins. We used this LSTM to generate >1,000,000 peptide sequences of length 16. We processed these sequences using a variety of techniques and plan to gather experimental data on these diverse peptide sequences to expand and improve the quality of our database.
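The abstract identifies sequence redundancy as the weakness of the first dataset and says the generated sequences were processed before selection, without naming the techniques. As one illustration of what such a processing step could look like, the minimal sketch below greedily keeps a 16-residue sequence only if it is not too similar to any sequence already kept. The function names, the position-wise identity measure, and the 0.75 threshold are all assumptions made for this example, not the authors' method.

```python
from typing import Iterable, List

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_redundant(
    seqs: Iterable[str],
    length: int = 16,            # peptide length used in the abstract
    max_identity: float = 0.75,  # hypothetical cutoff, chosen for illustration
) -> List[str]:
    """Greedily keep sequences that are valid and not near-duplicates."""
    kept: List[str] = []
    for seq in seqs:
        seq = seq.strip().upper()
        # Discard sequences of the wrong length or with non-standard letters.
        if len(seq) != length or not set(seq) <= AMINO_ACIDS:
            continue
        # Keep only if sufficiently different from everything kept so far;
        # an exact duplicate has identity 1.0 and is therefore dropped.
        if all(identity(seq, k) <= max_identity for k in kept):
            kept.append(seq)
    return kept
```

This greedy comparison costs time quadratic in the number of kept sequences, so for a list of more than a million candidates a clustering tool or a hashing scheme would likely be needed in practice.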