Revisiting Machine-Learning based Drug Repurposing: Drug Indications Are Not a Right Prediction Target

Siun Kim* (Seoul National University), Jung-Hyun Won (Seoul National University), David Seung U Lee (Seoul National University), Renqian Luo (Microsoft Research), Lijun Wu (Microsoft Research), Yingce Xia (Microsoft Research), Tao Qin (Microsoft Research), Howard Lee (Seoul National University)

Abstract: In this paper, we challenge the utility of approved drug indications as a prediction target for machine learning in drug repurposing (DR) studies. We highlight two major limitations of this approach: 1) strong confounding between drug indications and drug characteristics data, which results in shortcut learning, and 2) inappropriate normalization of indications in existing drug-disease association (DDA) datasets, which leads to an overestimation of model performance. We show that the collection patterns of drug characteristics data are similar among drugs of the same category, and that the Anatomical Therapeutic Chemical (ATC) classification of a drug can be predicted from these data collection patterns. Furthermore, we confirm that the performance of existing DR models degrades significantly under the realistic evaluation setting proposed in this study. We provide realistic data split information for two benchmark datasets, Fdataset and the deepDR dataset.
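
To illustrate the first limitation, the following minimal sketch shows how a drug's category might be guessed from nothing more than which characteristics data were collected for it. It is not the method used in this study: the drug names, source names, category labels, and helper functions are all hypothetical, and the toy records are constructed so that collection patterns align with category. The classifier ignores feature values entirely and uses only a 0/1 availability mask, which is the kind of shortcut a DR model could exploit.

```python
from collections import Counter, defaultdict

# Hypothetical data sources; the names are illustrative assumptions.
SOURCES = ("targets", "side_effects", "pathways", "gene_expression")

# Constructed toy records: drug -> (sources with any entry, category label).
# The labels "C" and "N" stand in for top-level ATC groups.
DRUGS = {
    "drug_a": ({"targets", "side_effects"}, "C"),
    "drug_b": ({"targets", "side_effects"}, "C"),
    "drug_c": ({"targets", "side_effects"}, "C"),
    "drug_d": ({"pathways", "gene_expression"}, "N"),
    "drug_e": ({"pathways", "gene_expression"}, "N"),
    "drug_f": ({"pathways", "gene_expression"}, "N"),
}


def availability_mask(present):
    """Encode which sources were collected as a 0/1 tuple; values are ignored."""
    return tuple(int(source in present) for source in SOURCES)


def fit_pattern_classifier(train):
    """Map each availability mask to the category most often seen with it."""
    votes = defaultdict(Counter)
    for present, category in train:
        votes[availability_mask(present)][category] += 1
    # Fallback for unseen masks: the most frequent category in the training set.
    fallback = Counter(cat for _, cat in train).most_common(1)[0][0]
    table = {mask: counts.most_common(1)[0][0] for mask, counts in votes.items()}
    return table, fallback


def predict(model, present):
    """Predict a category from the collection pattern alone."""
    table, fallback = model
    return table.get(availability_mask(present), fallback)


def leave_one_out_accuracy(drugs):
    """Hold out each drug in turn and score the mask-only classifier."""
    names = sorted(drugs)
    hits = 0
    for held_out in names:
        train = [drugs[name] for name in names if name != held_out]
        model = fit_pattern_classifier(train)
        present, category = drugs[held_out]
        hits += predict(model, present) == category
    return hits / len(names)


if __name__ == "__main__":
    print(leave_one_out_accuracy(DRUGS))
```

Because the toy records are built so that the mask determines the category, a high score here says nothing about real datasets. The point is only that a classifier which never sees a feature value can still recover the category when collection patterns and categories are confounded.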