A Machine Learning Approach to Predict Schizophrenia from SNP-Array Based Genomic Data

Session Number

Project ID: CMPS 9

Advisor(s)

Dr. Jubao Duan; NorthShore University HealthSystem

Dr. Subhajit Sengupta; NorthShore University HealthSystem

Discipline

Computer Science

Start Date

22-4-2020 10:05 AM

End Date

22-4-2020 10:20 AM

Abstract

There has been growing interest in using machine learning to improve disease detection. Although machine-learning identification of mental illnesses such as schizophrenia is being pursued, diagnostic methods remain largely qualitative. This project aims to predict schizophrenia from genome-wide SNP-array data. Several machine learning procedures were run in Python and TensorFlow on a dataset provided by NorthShore University HealthSystem, covering 17,262 loci for each of 5,334 subjects. A linear discriminant analysis (LDA) on the raw data revealed that the variables were collinear, which indicates that the dataset needs dimensionality reduction to improve accuracy. Several support vector machine (SVM) configurations were then tested; the RBF kernel gave an average accuracy of 72.97%. A convolutional neural network (CNN) is being designed to improve on this result. The network currently produces lower accuracy than the SVM, but it can be improved by trying different parameters and set-ups. Because the target accuracy is above 95%, the next steps are to improve the CNN and to apply other machine learning techniques, such as random forests, to the same genomic data.
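As a rough illustration of the quantities behind the SVM result, the sketch below computes an RBF kernel value between two genotype vectors and averages accuracy over several evaluation folds. It is not the project's code: the 0/1/2 minor-allele coding, the `gamma` value, the fold counts, and the function names `rbf_kernel` and `mean_accuracy` are all assumptions chosen for the example, and the abstract does not state how the average accuracy was obtained.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_accuracy(fold_results):
    """Average the per-fold accuracy; each fold is (correct, total)."""
    accuracies = [correct / total for correct, total in fold_results]
    return sum(accuracies) / len(accuracies)


# Hypothetical genotype vectors: minor-allele counts (0, 1, 2) at four loci.
subject_a = [0, 1, 2, 1]
subject_b = [0, 1, 2, 1]
subject_c = [2, 0, 0, 1]

gamma = 0.25  # illustrative value, not taken from the study

# Identical genotypes have squared distance 0, so the kernel is at its maximum.
same = rbf_kernel(subject_a, subject_b, gamma)
# Squared distance here is 4 + 1 + 4 + 0 = 9, so the kernel is exp(-2.25).
different = rbf_kernel(subject_a, subject_c, gamma)

# Made-up fold counts, only to show how an average accuracy is formed.
folds = [(70, 100), (75, 100), (74, 100)]
avg = mean_accuracy(folds)
```

The kernel decays with the squared distance between genotype vectors, so subjects with similar genotypes score near 1 and dissimilar ones near 0. With 17,262 loci per subject, distances can grow large, which is one reason the choice of `gamma` and a prior dimensionality-reduction step matter in practice.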
