A Transformer-Based Approach for Gene Discovery in Radiation Response Under Data-Sparse Conditions
Session Number
CMPS(ai) 01
Advisor(s)
Dr. Tarak Nath Nandi, Argonne National Laboratory
Discipline
Computer Science
Start Date
17-4-2025 11:25 AM
End Date
17-4-2025 11:40 AM
Abstract
This paper investigates the application of Geneformer, a transformer-based model, to identifying genes that drive transitions between radiation exposure levels when data are sparse. Traditional differential gene expression (DGE) methods often face limitations when little data is available. High-throughput single-cell RNA sequencing data were preprocessed to support accurate analysis of the genes responsible for transitions between irradiated cell states. Statistical techniques, including t-tests, the Wilcoxon rank-sum test, and logistic regression, were used to rank genes by expression across four radiation exposures (0, 10, 100, and 1000 mGy). Geneformer was then fine-tuned on the tokenized data with hyperparameter optimization, which yielded significant improvements in classification accuracy, as validated by two-dimensional embedding representations and in silico perturbation experiments. When both approaches were tested on data subsets of 1024, 256, and 128 cells, the fine-tuned Geneformer model consistently outperformed the traditional DGE method. Overall, the findings indicate that Geneformer detects subtle shifts in gene expression with high precision and reliably identifies key genetic drivers of radiation response, offering a viable alternative to conventional DGE approaches in low-data environments.
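As an illustration of the kind of statistical ranking the abstract mentions, the following is a minimal sketch of a Wilcoxon rank-sum ranking of genes between two dose groups on a 128-cell subset. It is not the authors' pipeline: the data are synthetic, the gene names (`GENE_0` to `GENE_4`), the helper names `rank_sum_z` and `rank_genes`, and the choice of comparing 0 mGy with 1000 mGy are all assumptions made for the example, and the z-score uses the normal approximation without a tie correction.

```python
import math
import random


def rank_sum_z(group_a, group_b):
    """Wilcoxon rank-sum z-score of group_a versus group_b.

    Uses average ranks for ties and the normal approximation;
    the variance is NOT tie-corrected (simplifying assumption).
    """
    combined = [(v, 0) for v in group_a] + [(v, 1) for v in group_b]
    combined.sort(key=lambda pair: pair[0])
    n = len(combined)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    n_a, n_b = len(group_a), len(group_b)
    expected = n_a * (n + 1) / 2
    std_dev = math.sqrt(n_a * n_b * (n + 1) / 12)
    return (rank_sum_a - expected) / std_dev


def rank_genes(expression, doses, dose_a, dose_b):
    """Rank genes by |z| for cells at dose_a versus cells at dose_b."""
    scores = []
    for gene, values in expression.items():
        a = [v for v, d in zip(values, doses) if d == dose_a]
        b = [v for v, d in zip(values, doses) if d == dose_b]
        scores.append((gene, rank_sum_z(a, b)))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Synthetic 128-cell subset: 64 cells at 0 mGy and 64 at 1000 mGy.
rng = random.Random(0)
doses = [0] * 64 + [1000] * 64
expression = {
    f"GENE_{i}": [rng.gauss(0.0, 1.0) for _ in doses] for i in range(5)
}
# Hypothetical responsive gene: shifted upward in the irradiated cells.
expression["GENE_3"] = [
    v + (2.0 if d == 1000 else 0.0)
    for v, d in zip(expression["GENE_3"], doses)
]

ranked = rank_genes(expression, doses, dose_a=1000, dose_b=0)
for gene, z in ranked:
    print(f"{gene}\tz = {z:+.2f}")
```

With fewer cells per group, the standard deviation of the rank sum shrinks more slowly than the signal, so small expression shifts become harder to separate from noise; this is the low-data regime in which the abstract compares DGE against the fine-tuned model.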