Improving medical annotation quality to decrease labeling burden using stratified noisy cross-validation

Joy Hsu, Sonia Phene, Akinori Mitani, Jieying Luo, Naama Hammel, Jonathan Krause, Rory Sayres

Abstract: As machine learning has become increasingly applied to medical imaging data, noise in training labels has emerged as an important challenge. Variability in diagnosis of medical images is well established; in addition, variability in training and attention to task among medical labelers may exacerbate this issue. Methods for identifying and mitigating the impact of low-quality labels have been studied, but are not well characterized in medical imaging tasks. For instance, Noisy Cross-Validation splits the training data into halves and has been shown to identify low-quality labels in computer vision tasks, but it has not been applied to medical imaging tasks specifically. In addition, there may be concerns around label imbalance for medical image sets, where relevant pathology may be rare. In this work we introduce Stratified Noisy Cross-Validation (SNCV), an extension of Noisy Cross-Validation. SNCV allows us to measure confidence in model predictions and assign a quality score to each example; supports label stratification to handle class imbalance; and identifies likely low-quality labels so that their causes can be analyzed. In contrast to Noisy Cross-Validation, sample selection for SNCV occurs after training two models, not during training, which simplifies application of the method. We assess the performance of SNCV on diagnosis of glaucoma suspect risk (GSR) from retinal fundus photographs, a clinically important yet nuanced labeling task. Using training data from a previously published deep learning model, we compute a continuous quality score (QS) for each training example. We relabel 1,277 low-QS examples using a trained glaucoma specialist; the new labels agree with the SNCV prediction over the initial label more than 85% of the time, indicating that low-QS examples mostly reflect labeler errors. We then quantify the impact of training with only high-QS labels, showing that strong model performance may be obtained with many fewer examples. By applying the method to randomly sub-sampled training datasets, we show that our method can reduce labeling burden by approximately 50% while achieving model performance non-inferior to using the full dataset on multiple held-out test sets.
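
To make the procedure described in the abstract concrete, the sketch below illustrates one plausible reading of SNCV: the labeled data is split into two label-stratified halves, one model is trained per half, and each example receives a quality score equal to the probability that the other half's model assigns to its given label. The helpers `train_fn` and `predict_proba_fn` are hypothetical placeholders (not from the paper) standing in for whatever model family is used, such as a deep network over fundus images.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_noisy_cross_validation(X, y, train_fn, predict_proba_fn, seed=0):
    """Sketch of SNCV under the assumptions stated above.

    train_fn(X, y) -> fitted model
    predict_proba_fn(model, X) -> array of shape (n_examples, n_classes)
    Returns a continuous quality score (QS) per training example.
    """
    idx = np.arange(len(y))
    # Stratified split keeps the label distribution similar in both halves,
    # which matters when the relevant pathology is rare.
    idx1, idx2 = train_test_split(
        idx, test_size=0.5, stratify=y, random_state=seed
    )

    quality_score = np.empty(len(y), dtype=float)
    for train_idx, score_idx in [(idx1, idx2), (idx2, idx1)]:
        # Train on one half, score the other half (selection happens after
        # both models are trained, not during training).
        model = train_fn(X[train_idx], y[train_idx])
        probs = predict_proba_fn(model, X[score_idx])
        # QS: cross-model probability of the example's assigned label.
        quality_score[score_idx] = probs[np.arange(len(score_idx)), y[score_idx]]
    return quality_score


# Low-QS examples would then be candidates for relabeling or exclusion, e.g.:
# keep_mask = quality_score >= threshold
```

In this reading, the key design choice is that stratifying the split by label prevents the rarer class from being under-represented in either half, so both scoring models see enough positive examples to produce meaningful confidence estimates.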