【Lecture】Feature Selection Bias in Assessing the Predictivity of SNPs for Alzheimer's Disease

Publisher: Tao Qi
Release time: 2019-08-23

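Topic: Feature Selection Bias in Assessing the Predictivity of SNPs for Alzheimer's Disease
Speaker: Longhai Li
Venue: Lecture Hall of the College of Science, Room 331, College Building No. 2, Songjiang Campus
Time: 2019-08-23 14:00:00
Organizer: College of Science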


Prof. Longhai Li received his Ph.D. in statistics from the University of Toronto in 2007 under the supervision of Radford M. Neal of the U of T machine learning group. Before that, he received his B.Sc. in statistics from the University of Science and Technology of China. He joined the Department of Mathematics and Statistics at the University of Saskatchewan in 2007 as an assistant professor and was promoted to professor in 2018. His research focuses on developing and applying statistical learning methods for high-throughput and correlated data. His results have been published in several prestigious statistics journals, such as the Journal of the American Statistical Association, Bayesian Analysis, Statistics in Medicine, and Statistics and Computing. His research has been funded by several agencies, including NSERC, CFI, MITACS and CFREF.


This is joint work with Ms Mei Dong, a student from Donghua University. In the context of identifying SNPs related to a phenotype of interest, we consider how to assess the predictivity of SNPs selected by performing GWAS. There are two kinds of cross-validation methods. One is internal cross-validation (ICV), in which a subset of SNPs is pre-selected using all samples and cross-validation is then applied to the selected subset. The other is external cross-validation (ECV), in which features are re-selected using only the training samples in each fold of cross-validation. The feature selection bias of ICV has not received sufficient attention when predicting a phenotype with SNP data. Using Alzheimer's disease as a case study, we demonstrate that ICV can lead to severe false discoveries. We use a real SNP dataset related to late-onset Alzheimer's disease (LOAD) and two synthetic datasets. For the prediction, we compare the performance of three regularized logistic regression methods. For the LOAD dataset, under ECV, no additional SNPs improve the prediction of LOAD beyond that based on APOE alone. However, the predictivity estimate of the selected SNPs given by ICV can reach an R^2 of 80%. The results for the synthetic datasets are similar to those for the real dataset. Furthermore, we have found that the predictivity estimate given by ICV can be significantly higher than the oracle predictivity based on the truly related SNPs. We have also found that Hyper-LASSO performs better than LASSO and elastic net. We recommend that ICV not be used to measure the predictivity of selected SNPs.
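
The difference between ICV and ECV can be illustrated with a minimal sketch. It is not the speaker's code or data. It assumes purely random genotypes (coded 0/1/2) that are unrelated to a random case/control label, a simple mean-difference filter for SNP selection, and a nearest-centroid classifier in place of regularized logistic regression. All names, sample sizes, and the minor allele frequency are hypothetical choices made for illustration. Since no SNP carries any signal, an honest accuracy estimate should sit near 50%.

```python
import random

def top_k_features(X, y, rows, k):
    """Rank features by |mean(cases) - mean(controls)| using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        diff = sum(X[i][j] for i in pos) / len(pos) - sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(diff), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def centroid(X, rows, features):
    """Mean genotype vector of `rows`, restricted to `features`."""
    return [sum(X[i][j] for i in rows) / len(rows) for j in features]

def predict(x, features, c0, c1):
    """Nearest-centroid rule: predict the class whose centroid is closer."""
    d0 = sum((x[j] - a) ** 2 for j, a in zip(features, c0))
    d1 = sum((x[j] - b) ** 2 for j, b in zip(features, c1))
    return 1 if d1 < d0 else 0

def cv_accuracy(X, y, folds, k, reselect):
    """Cross-validated accuracy.

    reselect=False -> ICV: features are chosen once, using ALL samples.
    reselect=True  -> ECV: features are re-chosen from the training rows of each fold.
    """
    all_rows = [i for fold in folds for i in fold]
    fixed = None if reselect else top_k_features(X, y, all_rows, k)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        feats = top_k_features(X, y, train, k) if reselect else fixed
        c0 = centroid(X, [i for i in train if y[i] == 0], feats)
        c1 = centroid(X, [i for i in train if y[i] == 1], feats)
        correct += sum(predict(X[i], feats, c0, c1) == y[i] for i in test)
    return correct / len(all_rows)

# Hypothetical null data: 40 samples, 2000 SNPs, none related to the label.
rng = random.Random(0)
n, p, k, n_folds, maf = 40, 2000, 10, 5, 0.3
X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]  # balanced case/control labels, independent of X

order = list(range(n))
rng.shuffle(order)
folds = [order[f::n_folds] for f in range(n_folds)]

icv_acc = cv_accuracy(X, y, folds, k, reselect=False)
ecv_acc = cv_accuracy(X, y, folds, k, reselect=True)
print(f"ICV accuracy: {icv_acc:.2f}")
print(f"ECV accuracy: {ecv_acc:.2f}")
```

In this sketch the only difference between the two estimates is the `reselect` flag, that is, whether the held-out samples were allowed to influence which SNPs were chosen. Under these assumptions the ICV estimate is expected to be well above chance even though the data contain no signal, while the ECV estimate stays close to chance.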