Last updated:
ID:
992241
Start date:
11 October 2025
Project status:
Current
Principal investigator:
Dr Bhramar Mukherjee
Lead institution:
Yale University, United States of America

We develop statistical methods to address key biases in electronic health record (EHR) data, including selection, confounding, and information biases, as well as challenges related to the longitudinal structure of the dataset and data integration. Our four primary objectives are as follows:

1. We propose to work on efficient designs that address multiple sources of bias, for example, both selection and information bias. This work will lead to resource-saving designs that yield, e.g., efficient and representative validation samples for addressing outcome and exposure misclassification; such samples require chart review by a healthcare provider.

2. We propose a novel approach that explicitly incorporates information from clinically informative presence and clinically informative observation into the longitudinal model for the biomarker. Our goal is to improve the estimation and prediction accuracy of the biomarker model by accounting for the informative nature of clinical encounters and lab orders in EHR data from the biobank.

3. We propose to develop inverse probability weighting and augmented inverse probability weighting methods designed specifically for multi-source studies where individual-level data cannot be shared. To account for variations in selection mechanisms across different sites, we incorporate site-specific selection models and auxiliary score models, ensuring robust estimation despite sampling heterogeneity in data sources.

4. We aim to develop risk prediction models that integrate multi-modal data (e.g., genomic data, imaging, biomarkers, and health records) and leverage association studies, e.g., environmental exposure analyses and phenome-wide association studies, to identify high-risk individuals and inform strategies for early intervention for diseases including cardiovascular disease and cancer. More details are given in A4.1.1.
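
As a rough illustration of the inverse probability weighting idea named in objective 3, the sketch below reweights a selected sample by the inverse of each participant's selection probability. It is a toy example under stated assumptions, not the project's method: the function name `ipw_mean`, the selection probabilities (0.8 and 0.2), and the sample counts are all hypothetical, the selection probabilities are treated as known rather than estimated from a site-specific selection model, and the augmented (doubly robust) and multi-site aspects are not shown.

```python
def ipw_mean(outcomes, selection_probs):
    """Normalised (Hajek-style) inverse-probability-weighted mean.

    outcomes        -- outcome values observed in the selected sample
    selection_probs -- each participant's probability of being selected
                       (assumed known here; in practice it would be estimated)
    """
    if len(outcomes) != len(selection_probs):
        raise ValueError("outcomes and selection_probs must have equal length")
    if any(p <= 0 or p > 1 for p in selection_probs):
        raise ValueError("selection probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in selection_probs]
    weighted_total = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_total / sum(weights)


# Hypothetical selected sample as (outcome, selection probability) pairs.
# Stratum A is over-represented (selection probability 0.8) and has a
# higher disease rate; stratum B is under-represented (probability 0.2).
sample = (
    [(1, 0.8)] * 160 + [(0, 0.8)] * 240   # stratum A: 400 selected, 160 cases
    + [(1, 0.2)] * 10 + [(0, 0.2)] * 90   # stratum B: 100 selected, 10 cases
)
outcomes = [y for y, _ in sample]
probs = [p for _, p in sample]

# Unweighted prevalence in the selected sample (ignores selection).
naive_estimate = sum(outcomes) / len(outcomes)
# Prevalence after reweighting each participant by 1 / selection probability.
ipw_estimate = ipw_mean(outcomes, probs)

print(f"naive: {naive_estimate:.2f}, IPW: {ipw_estimate:.2f}")
```

In this constructed example the unweighted prevalence is 0.34, while the weighted estimate is 0.25, which matches the prevalence of the hypothetical source population (two strata of 500 people with 200 and 50 cases) from which the sample counts were derived.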