Using Machine Learning to Determine Peptide Sequences with High Heme Binding Propensity

Session Number

CMPS(ai) 03

Advisor(s)

Dr. Chris Fry, Dr. Henry Chan, Dr. Jesse Prelesnik, Argonne National Laboratory

Discipline

Computer Science

Start Date

17-4-2025 11:10 AM

End Date

17-4-2025 11:25 AM

Abstract

Self-assembling peptides, chains of amino acids that form various structures in response to environmental conditions, have a variety of uses in materials science and in biomedicine, such as drug delivery. Our work aims to use machine learning to find patterns in peptide sequences and thereby streamline material discovery. To that end, we experimentally gathered spectroscopy data on more than 200 peptides, each 16 amino acids long. These rationally designed peptides contained sequence redundancy, which yielded a dataset that was ineffective for machine learning. We therefore require more data to create a model capable of determining the heme binding propensity of a peptide from its amino acid sequence. Here, we aim to create a new pipeline for discovering novel sequences without relying on rational design. We gathered data on more than 4,000 proteins from the Protein Data Bank to create a Long Short-Term Memory (LSTM) model capable of generating amino acid sequences derived from those of naturally occurring heme-binding proteins. We used this LSTM to generate a list of more than 1,000,000 peptide sequences of length 16. We processed these sequences using a variety of techniques and plan to gather experimental data on these diverse peptide sequences to expand and improve the quality of our database.
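
The abstract does not specify which processing techniques were applied to the generated sequences. As an illustration only, the following minimal Python sketch shows one plausible kind of post-processing: keeping sequences that are exactly 16 residues long and use only the 20 canonical amino acids, removing exact duplicates, and greedily discarding near-duplicates to limit the sequence redundancy mentioned above. The function names, the Hamming-distance criterion, and the `min_distance` threshold are assumptions made for this sketch, not the authors' method.

```python
# Hypothetical post-processing sketch; not the pipeline used in this work.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids
PEPTIDE_LENGTH = 16                      # peptide length stated in the abstract


def is_valid(seq, length=PEPTIDE_LENGTH):
    """Return True if seq has the target length and only canonical residues."""
    return len(seq) == length and set(seq) <= CANONICAL


def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_candidates(seqs, min_distance=4):
    """Drop invalid and duplicate sequences, then greedily keep only
    sequences that differ from every kept sequence in at least
    min_distance positions (an assumed redundancy threshold)."""
    kept = []
    seen = set()
    for seq in seqs:
        seq = seq.strip().upper()
        if seq in seen or not is_valid(seq):
            continue
        seen.add(seq)
        if all(hamming(seq, k) >= min_distance for k in kept):
            kept.append(seq)
    return kept


candidates = [
    "AHHAAKKLLEEGGSTV",
    "AHHAAKKLLEEGGSTV",  # exact duplicate
    "AHHAAKKLLEEGGSTA",  # differs from the first in one position
    "AHXAAKKLLEEGGSTV",  # contains a non-canonical symbol
    "HHHH",              # wrong length
    "WWHHCCMMKKRRDDEE",
]
print(filter_candidates(candidates))
```

The greedy comparison is quadratic in the number of kept sequences, so for a list on the order of 1,000,000 candidates a clustering tool or a hashing-based approach would likely be needed in place of this pairwise loop.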
