Generation of Differentially Private Heterogeneous Synthetic Electronic Health Records using GANs

Kieran Chin-Cheong, Thomas M. Sutter, Julia E. Vogt

Abstract: Electronic Health Records (EHRs) are commonly used by the machine learning community for research on problems related to health care and medicine. EHRs have the advantages that they can be easily distributed and contain many features useful for tasks such as classification. EHR data sets differ from typical machine learning data sets in that they are often very sparse due to their high dimensionality and often contain heterogeneous data types. Furthermore, the data sets contain sensitive information, and the resulting privacy concerns limit the distribution of any models learned from them. In this work, we explore using Generative Adversarial Networks (GANs) to generate synthetic, heterogeneous EHRs, with the goal of using these synthetic records in place of existing data sets. We further explore applying differential privacy (DP) preserving optimization to produce differentially private synthetic EHR data sets, which provide rigorous privacy guarantees and are therefore more easily shareable. In a binary classification task, the performance (measured by AUROC, AUPRC and accuracy) obtained with the non-DP model's synthetic, heterogeneous data is within 6.4% of that obtained with the original data set. Although it incurs a 20% performance penalty, the DP synthetic data is still useful for machine learning tasks. We additionally perform a sub-population analysis and find that, in terms of classification performance, our model does not introduce any bias into the synthetic EHR data compared to the baseline, either in the male and female populations or in the 0-18, 19-50 and 51+ age groups.
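
To make the phrase "DP preserving optimization" concrete, the following is a minimal sketch of one widely used form of it, a DP-SGD-style gradient step: each example's gradient is clipped to a fixed norm, the clipped gradients are summed, and Gaussian noise scaled to the clipping bound is added before averaging. The abstract does not specify which optimizer the authors use, so this is an illustrative assumption rather than their implementation. The function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and the sketch does not track the resulting privacy budget.

```python
import numpy as np


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style update: clip per-example gradients, add noise, average.

    Illustrative sketch only; all names here are assumptions, not the paper's API.
    params:            array of shape (d,)
    per_example_grads: array of shape (batch, d), one gradient per example
    """
    batch_size = per_example_grads.shape[0]

    # L2 norm of each example's gradient, shape (batch,)
    norms = np.linalg.norm(per_example_grads, axis=1)

    # Rescale so that no single example's gradient exceeds clip_norm
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale[:, None]

    # Gaussian noise with standard deviation proportional to the clipping bound
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)

    # Noisy average gradient, then a plain gradient descent step
    noisy_mean = (clipped.sum(axis=0) + noise) / batch_size
    return params - lr * noisy_mean


rng = np.random.default_rng(0)
params = np.zeros(2)
grads = np.array([[3.0, 4.0], [0.0, 0.0]])  # first gradient has norm 5

# With noise_multiplier=0.0 the step reduces to clipped, non-private SGD
updated = dp_sgd_step(params, grads, lr=1.0, clip_norm=1.0,
                      noise_multiplier=0.0, rng=rng)
```

Clipping bounds the influence any single record can have on the update, which is what allows the added noise to yield a formal privacy guarantee. In a GAN setting, a step like this would typically be applied to the network that sees the real records, although the abstract does not state which networks are trained this way.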