CFP Track 2: Applications

CHIL CFP Track 2, Applications: Investigation, Evaluation, and Interpretation

Track Chairs: Dr. Tristan Naumann, Dr. Andrew Beam, Dr. Joyce Ho, Matthew McDermott, Dr. Shalmali Joshi


The goal of this track is to highlight work that applies robust methods, models, or practices to identify, characterize, audit, evaluate, or benchmark systems. Whereas Track 1 selects papers that show significant technical novelty, submit your work to this track if its contribution focuses on solving a carefully motivated problem grounded in applications. New methods are by no means prohibited here, but the focus should be on methods designed for practical qualities such as robustness (e.g., failing gracefully in practice), scalability in computational runtime or in the data required, and the ability to work across real-world data modalities and systems. Contributions will be evaluated for technical rigor, robustness, and comprehensiveness.


All areas of machine learning and all kinds of healthcare data are in scope for this track. Example topics of interest, with exemplar papers, are listed below. These examples are by no means exhaustive and are meant only as illustration and motivation.

  • Careful examinations of ML system robustness under real-world dataset shift, under adversarial shift, or on minority subpopulations.
    • Nestor, Bret, et al. “Feature Robustness in Non-Stationary Health Records: Caveats to Deployable Model Performance in Common Clinical Machine Learning Tasks.” Proceedings of Machine Learning for Healthcare 2019 (MLHC ’19), 2019.
    • Finlayson, Samuel G., et al. “Adversarial attacks on medical machine learning.” Science 363.6433 (2019): 1287-1289.
  • Investigations into model performance on minority subpopulations, and the implications thereof.
    • Boag, Willie, et al. “Racial Disparities and Mistrust in End-of-Life Care.” Machine Learning for Healthcare Conference, 2018.
    • Chen, Irene Y., Peter Szolovits, and Marzyeh Ghassemi. “Can AI Help Reduce Disparities in General Medical and Mental Health Care?” AMA Journal of Ethics 21.2 (2019): 167-179.
  • Scalable, safe machine learning and inference in clinical environments.
    • Henderson, Jette, et al. “Phenotype Instance Verification and Evaluation Tool (PIVET): A Scaled Phenotype Evidence Generation Framework Using Web-Based Medical Literature.” Journal of Medical Internet Research 20.5 (2018): e164.
  • New tools or comprehensive benchmarks for machine learning for healthcare.
    • Wang, Shirly, et al. “MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III.” Machine Learning for Healthcare, 2019.
  • Development of scalable systems for processing data in practice (demonstrating, e.g., attention to multi-modality, runtime, and robustness, as guided by a clinical use case).
    • Xu, Yanbo, et al. “RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018.
  • Bridging the deployment gap.
    • Tonekaboni, Sana, et al. “What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use.” Machine Learning for Healthcare, 2019.