Penalized Logistic Regression Analysis for Genetic Association Studies of Binary Phenotypes
Approved Research ID: 78350
Approval date: January 25th 2022
Genetic variants are differences in DNA among individuals in a population. Rare genetic variants are thought to have large effects on complex human diseases such as high blood pressure or cancer. This conjecture has led researchers to study the connections, or associations, between diseases and rare single-nucleotide variants (SNVs). Rare SNVs are positions on the human genome at which there are two known DNA "letters" in the population and one of these letters is rare. Unfortunately, standard statistical methods tend to over-estimate the strength of associations between diseases and rare SNVs. We will study a method called log-F-penalized logistic regression that counteracts over-estimation by "shrinking" the standard estimate toward zero, i.e., toward no association. The log-F approach includes a so-called "tuning" parameter that controls how aggressively it shrinks estimates, but this flexibility leaves researchers without guidance on how much shrinkage to specify. With too little shrinkage, we fail to correct the problem of over-estimation; with too much, we end up missing variants that are truly associated with the disease. We have developed a method that selects the amount of shrinkage based on a dataset of SNVs and a disease outcome, but this tuning procedure is time-consuming and requires specialized computer software. Using a comprehensive set of SNVs from the UK Biobank, we plan to apply the tuning procedure to a variety of disease outcomes, from common ones such as hypertension and high cholesterol to lower-prevalence ones such as rare autoimmune disorders or cancers. We hope to distill these analyses into rules of thumb that tell researchers how much shrinkage to apply, without their needing to tune the method themselves. Our findings will give concrete suggestions on analysis methods for our own research and for that of others who study how genes influence the risk of complex human diseases. The project is expected to be completed within 3 years.
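To make the idea of shrinkage concrete, the following is a minimal sketch of log-F(m, m)-penalized logistic regression for a single variant. It is not the project's software or its tuning procedure. It assumes the usual form of the log-F(m, m) penalty, which adds m*beta/2 - m*log(1 + exp(beta)) to the log-likelihood for each penalized coefficient, and it leaves the intercept unpenalized. The function name `fit_logf` and the case/control counts are hypothetical and chosen purely for illustration; they are not UK Biobank data.

```python
import numpy as np

def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logf(X, y, m, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of logistic regression with a log-F(m, m) penalty.

    X : (n, p) covariates without an intercept column (hypothetical layout).
    y : (n,) binary outcome coded 0/1.
    m : shrinkage (tuning) parameter; m = 0 gives the unpenalized estimate.
    Returns [intercept, beta_1, ..., beta_p]; the intercept is not penalized.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        mu = expit(A @ beta)
        grad = A.T @ (y - mu)                        # score of the log-likelihood
        info = A.T @ (A * (mu * (1 - mu))[:, None])  # observed information
        pj = expit(beta[1:])
        grad[1:] += m / 2.0 - m * pj                 # gradient of the penalty
        info[1:, 1:] += np.diag(m * pj * (1 - pj))   # curvature of the penalty
        step = np.linalg.solve(info, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Made-up toy counts: 4 carriers of a rare variant (3 cases, 1 control)
# and 100 non-carriers (40 cases, 60 controls).
x = np.concatenate([np.ones(4), np.zeros(100)])
y = np.concatenate([np.ones(3), np.zeros(1), np.ones(40), np.zeros(60)])
X = x[:, None]

beta_mle = fit_logf(X, y, m=0)  # standard (unpenalized) estimate
beta_m2 = fit_logf(X, y, m=2)   # mild shrinkage
beta_m8 = fit_logf(X, y, m=8)   # stronger shrinkage

for label, b in [("m=0", beta_mle), ("m=2", beta_m2), ("m=8", beta_m8)]:
    print(label, "log odds ratio:", round(float(b[1]), 3))
```

In this sketch the estimated log odds ratio for the variant moves toward zero as m grows, which is the trade-off described above: a small m leaves the estimate close to the standard one, while a large m can pull a genuine association toward no association.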