A Transformer-Based Approach for Gene Discovery in Radiation Response Under Data-Sparse Conditions

Session Number

CMPS(ai) 01

Advisor(s)

Dr. Tarak Nath Nandi, Argonne National Laboratory

Discipline

Computer Science

Start Date

17-4-2025 11:25 AM

End Date

17-4-2025 11:40 AM

Abstract

This paper investigates the application of Geneformer, a transformer-based model, to identifying genes that drive transitions between radiation exposure levels in data-sparse settings. Traditional differential gene expression (DGE) methods are often limited when few cells are available. High-throughput single-cell RNA sequencing data were preprocessed to support accurate analysis of the genes responsible for transitions between irradiated cell states. Statistical techniques, including t-tests, the Wilcoxon rank-sum test, and logistic regression, were used to rank genes by differential expression across four radiation exposures (0, 10, 100, and 1000 mGy). Geneformer was then fine-tuned on the tokenized data with hyperparameter optimization. This yielded significant improvements in classification accuracy, as validated by two-dimensional embedding representations and in silico perturbation experiments. When both approaches were tested on data subsets of 1024, 256, and 128 cells, the fine-tuned Geneformer model consistently outperformed the traditional DGE method. Overall, the findings show that Geneformer detects subtle shifts in gene expression with high precision and reliably identifies key genetic drivers of radiation response, offering a viable alternative to conventional DGE approaches in low-data environments.
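
The rank-based DGE baseline and the cell subsampling described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the function names `subsample_cells` and `rank_genes_wilcoxon` are hypothetical, and only the Wilcoxon rank-sum test (via SciPy's `ranksums`) is shown, comparing two of the four dose groups.

```python
import numpy as np
from scipy.stats import ranksums


def subsample_cells(expr, labels, n_cells, rng):
    """Draw n_cells cells at random, without replacement (hypothetical helper)."""
    idx = rng.choice(expr.shape[0], size=n_cells, replace=False)
    return expr[idx], labels[idx]


def rank_genes_wilcoxon(expr, labels, dose_a, dose_b, gene_names):
    """Rank genes by the absolute Wilcoxon rank-sum statistic between two doses.

    expr   : (n_cells, n_genes) array of expression values
    labels : (n_cells,) array of dose labels in mGy
    Returns a list of (gene, statistic, p_value), strongest difference first.
    """
    a = expr[labels == dose_a]
    b = expr[labels == dose_b]
    results = []
    for j, gene in enumerate(gene_names):
        stat, pval = ranksums(a[:, j], b[:, j])
        results.append((gene, float(stat), float(pval)))
    # Sort by effect magnitude so the most differentially expressed gene is first.
    results.sort(key=lambda r: abs(r[1]), reverse=True)
    return results


# Synthetic example: 256 cells, 5 genes, two dose groups (0 and 100 mGy).
rng = np.random.default_rng(0)
gene_names = [f"gene_{j}" for j in range(5)]
labels = np.repeat([0, 100], 128)
expr = rng.normal(size=(256, 5))
expr[labels == 100, 2] += 3.0  # assumption: only gene_2 responds to exposure

# Mimic a data-sparse condition by keeping 128 of the 256 cells.
sub_expr, sub_labels = subsample_cells(expr, labels, 128, rng)
ranking = rank_genes_wilcoxon(sub_expr, sub_labels, 0, 100, gene_names)
for gene, stat, pval in ranking:
    print(f"{gene}\tstatistic={stat:.2f}\tp={pval:.3g}")
```

In this toy setting the shifted gene should rank first even after subsampling. Real single-cell data are noisier and sparser, which is the regime the abstract targets.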
