A Machine Learning Approach to Predict Schizophrenia from SNP-Array Based Genomic Data
Session Number
Project ID: CMPS 9
Advisor(s)
Dr. Jubao Duan; NorthShore University HealthSystem
Dr. Subhajit Sengupta; NorthShore University HealthSystem
Discipline
Computer Science
Start Date
22-4-2020 10:05 AM
End Date
22-4-2020 10:20 AM
Abstract
Interest in using machine learning to improve disease detection has grown, and although work on identifying mental illnesses such as schizophrenia is under way, diagnostic methods remain largely qualitative. This project aims to predict schizophrenia from genome-wide SNP-array data. Various machine learning procedures were run in Python and TensorFlow on a dataset provided by NorthShore University HealthSystem, covering 17,262 loci in the genomes of 5,334 subjects. A linear discriminant analysis (LDA) run on the raw data revealed that the variables were collinear. Various support vector machine (SVM) tests were then conducted; the RBF kernel gave an average accuracy of 72.97%. A convolutional neural network (CNN) is being designed to improve on this result. The CNN currently yields a lower accuracy than the SVM, but it may improve with different parameters and set-ups. The collinearity found by the LDA indicates that the dataset should undergo dimensionality reduction to improve accuracy. Because the target accuracy is above 95%, the next steps are to improve the CNN and to apply other machine learning techniques, such as random forests, to the same genomic data.
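As a rough illustration of two ideas in the abstract, the sketch below flags collinear loci and evaluates the RBF kernel between two subjects. It is not the project's code. It assumes genotypes are coded as minor-allele counts (0, 1, 2), which the abstract does not state, and the names `pearson`, `collinear_pairs`, and `rbf_kernel`, the 0.95 threshold, and the `gamma` value are all hypothetical choices made for the example.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A locus with no variation has no defined correlation; report 0.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def collinear_pairs(genotypes, threshold=0.95):
    """Return (i, j) index pairs of loci with |correlation| >= threshold.

    `genotypes` is a list of rows, one row per subject, one column per locus.
    The threshold is an assumed value, not one taken from the abstract.
    """
    columns = list(zip(*genotypes))
    return [
        (i, j)
        for i, j in combinations(range(len(columns)), 2)
        if abs(pearson(columns[i], columns[j])) >= threshold
    ]


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy data (fabricated for illustration only): 4 subjects x 3 loci.
toy = [
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
    [1, 1, 2],
]

pairs = collinear_pairs(toy)                         # loci 0 and 1 are identical
similarity = rbf_kernel(toy[0], toy[1], gamma=0.1)   # gamma is an assumed value
```

On real data with thousands of loci, the pairwise loop would be slow, and a vectorised library routine or a dimensionality-reduction step would be the more practical choice; the loop is kept here only to make the idea of collinearity explicit.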