Semi-supervised Phenotyping with Electronic Health Records

Jesse Gronsbell, Chuan Hong, Molei Liu, Clara-Lea Bonzel, Aaron Sonabend

Abstract: Phenotyping is the process of identifying a patient’s health state based on the information in their electronic health records. In this tutorial, we will discuss why phenotyping is a challenging problem from both a practical and a methodological perspective. We will focus primarily on the challenges of obtaining annotated phenotype information from patient records, and we will present statistical learning methods that leverage unlabeled examples to improve model estimation and evaluation, thereby reducing the annotation burden.

Bio: Jesse Gronsbell is an Assistant Professor in the Department of Statistical Sciences at the University of Toronto. Prior to joining U of T, Jesse spent a couple of years as a data scientist in the Mental Health Research and Development Group at Alphabet's Verily Life Sciences. Her primary interest is in the development of statistical methods for modern digital data sources such as electronic health records and mobile health data.

Chuan Hong is an instructor in biomedical informatics in the Department of Biomedical Informatics (DBMI) at Harvard Medical School. She received her PhD in Biostatistics from the University of Texas Health Science Center at Houston. Her doctoral research focused on meta-analysis and DNA methylation detection. At DBMI, Chuan's research interests lie in developing statistical and computational methods for biomarker evaluation, predictive modeling, and precision medicine with biomedical data. In particular, she is interested in combining electronic medical records with biorepositories and relevant resources to improve phenotyping accuracy, detect novel biomarkers, and monitor disease progression in clinical research.

Molei Liu is a fourth-year PhD candidate in the Biostatistics department at Harvard T.H. Chan School of Public Health. He received a Bachelor's degree in Statistics from Peking University. Molei has been working in areas including high-dimensional statistics, distributed learning, semi-supervised learning, semi-parametric inference, and model-X inference. He has also been working on methods for phenome-wide association studies (PheWAS) using electronic health records data.

Clara-Lea Bonzel is a research assistant at the Department of Biomedical Informatics at Harvard Medical School. She is mainly interested in personalized medicine using phenomic and genomic data, and model selection and evaluation. Clara-Lea received her master's degree in Applied Mathematics and Financial Engineering from the Swiss Federal Institute of Technology (EPFL).

Aaron Sonabend is a PhD candidate in the Biostatistics department at Harvard T.H. Chan School of Public Health. He is primarily focused on developing robust reinforcement learning and natural language processing methods for contexts with sampling bias, partially observed rewards, or strong distribution shifts. He is interested in healthcare and biomedical applications, such as finding optimal sequential treatment regimes for complex diseases and phenotyping using electronic health records. Aaron holds a Bachelor's degree in Applied Mathematics and in Economics from the National Autonomous Technological Institute of Mexico (ITAM).