Last updated:
Author(s):
Maxwell Salvatore, Ritoban Kundu, Xu Shi, Christopher R Friese, Seunggeun Lee, Lars G Fritsche, Alison M Mondul, David Hanauer, Celeste Leigh Pearce, Bhramar Mukherjee
Publish date:
14 May 2024
Journal:
Journal of the American Medical Informatics Association
PubMed ID:
38742457

Abstract

OBJECTIVES: To develop recommendations regarding the use of weights to reduce selection bias for commonly performed analyses using electronic health record (EHR)-linked biobank data.

MATERIALS AND METHODS: We mapped diagnosis (ICD code) data to standardized phecodes from 3 EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n = 244 071), Michigan Genomics Initiative (MGI; n = 81 243), and UK Biobank (UKB; n = 401 167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to represent the US adult population more. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted 4 common analyses comparing unweighted and weighted results.

RESULTS: For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted phenome-wide association study for colorectal cancer, the strongest associations remained unaltered, with considerable overlap in significant hits. Weighting affected the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates.

DISCUSSION: Weighting had a limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation. When interested in estimating effect size, specific signals from untargeted association analyses should be followed up by weighted analysis.

CONCLUSION: EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.

Related projects

The growing availability of electronic health record (EHR) linked biobanks brings up new research opportunities but also comes with new methodological challenges. This research aims…

Institution:
University of Michigan, United States of America

All projects