Researchers have discovered that artificial intelligence models that excel in predicting race and gender from X-ray images also exhibit substantial ‘fairness gaps.’ These gaps refer to disparities in their diagnostic accuracy when analyzing images of individuals from different racial or gender groups.
Artificial intelligence models are widely used in medical diagnostics, particularly for analyzing images such as X-rays. However, studies have found that these models do not perform equally well across all demographic groups, often showing lower accuracy for women and people of color.
In 2022, MIT researchers reported that AI models can accurately predict a patient’s race from chest X-rays, something that even the most experienced radiologists cannot do.
Those researchers have now found a link between how accurately models predict demographics and the size of their fairness gaps. The connection suggests that the models may be relying on “demographic shortcuts” when making their diagnostic assessments, leading to incorrect results for women, Black individuals, and other groups, the researchers say.
Marzyeh Ghassemi, an MIT associate professor of electrical engineering and computer science and the senior author of the study, noted that high-capacity machine-learning models are strong predictors of human demographics. Her team is the first to tie this predictive capacity to the models’ differential performance across groups.
The researchers also found that retraining the models could improve their fairness. However, this “debiasing” worked best when the models were tested on patients similar to those they were trained on, such as patients from the same hospital. When the models were applied to patients from other hospitals’ datasets, the fairness gaps reemerged.
According to Haoran Zhang, an MIT graduate student and lead author of the paper, evaluating external models on one’s own data is crucial since fairness assurances provided by model developers on their training data may not extend to other populations. Additionally, training models on local data when available is recommended for optimal performance.
Addressing Bias
As of May 2024, the FDA had approved 882 AI-enabled medical devices, 671 of them designed for radiology. After Ghassemi and her colleagues showed in 2022 that diagnostic models can predict race from X-rays, subsequent research found that such models are also very good at predicting gender and age, even though they were never trained on those tasks.
Ghassemi pointed out that many machine learning models possess remarkable demographic prediction capabilities, surpassing the abilities of radiologists to detect race from chest X-rays. Despite excelling in disease prediction, these models inadvertently learn to predict unintended attributes during training.
The researchers set out to investigate why these models perform worse for certain groups, and in particular whether they rely on demographic shortcuts to make their predictions. Such shortcuts arise when a model uses demographic attributes to decide whether a medical condition is present, instead of relying on the clinically relevant features of the image, which leads to less accurate predictions for some groups.
Using publicly available chest X-ray datasets from Beth Israel Deaconess Medical Center in Boston, the researchers trained models to predict specific medical conditions and then evaluated them on X-rays held out from training.
While the models generally performed well, most displayed fairness gaps: disparities in accuracy between gender and racial groups. The models were also able to predict the gender, race, and age of the X-ray subjects, and there was a notable correlation between each model’s accuracy at these demographic predictions and the size of its fairness gap, suggesting that the models may rely on demographic cues when making their disease predictions.
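As a rough illustration of how such a fairness gap can be quantified, the sketch below computes the area under the ROC curve (AUC) separately for two demographic subgroups of a held-out test set and reports the difference. The data, group labels, and model scores are synthetic stand-ins, not the study’s data or code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical held-out test set: binary disease labels, model scores, and one
# demographic attribute (0/1). Synthetic stand-ins, not real patient data.
n = 2000
group = rng.integers(0, 2, size=n)
labels = rng.integers(0, 2, size=n)
# Simulate a model whose scores are noisier (less discriminative) for group 1.
noise_scale = np.where(group == 0, 0.8, 1.2)
scores = labels + noise_scale * rng.normal(size=n)

# Fairness gap: difference in AUC between the two demographic subgroups.
auc_by_group = {
    g: roc_auc_score(labels[group == g], scores[group == g]) for g in (0, 1)
}
fairness_gap = abs(auc_by_group[0] - auc_by_group[1])
print(f"AUC by group: {auc_by_group}")
print(f"Fairness gap (AUC difference): {fairness_gap:.3f}")
```

In the study, per-model gaps of this kind were compared against each model’s accuracy at predicting demographics; it is the correlation between those two quantities that points to shortcut behavior.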
To mitigate these fairness gaps, the researchers employed two strategies: optimizing ‘subgroup robustness’ in one set of models, and implementing ‘group adversarial’ approaches in another set to remove demographic information from images. Both strategies yielded positive outcomes.
Ghassemi emphasized that state-of-the-art methods can effectively reduce fairness gaps without compromising overall performance, particularly when applied to in-distribution data. Subgroup robustness enhances sensitivity to a particular subgroup, while group adversarial methods eliminate group information entirely.
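The paper’s training procedures are not reproduced here, but the sketch below shows one common way a group adversarial setup can be implemented: a gradient-reversal layer pushes a shared encoder to discard group information while a disease-prediction head is trained as usual. The architecture, tensor sizes, and adversary weight are illustrative assumptions rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient so the encoder learns to *remove* group information.
        return -ctx.lambd * grad_output, None

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU())
disease_head = nn.Linear(128, 1)   # predicts the medical condition
group_head = nn.Linear(128, 2)     # adversary tries to predict demographics

params = (list(encoder.parameters()) + list(disease_head.parameters())
          + list(group_head.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)
bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
lambda_adv = 1.0  # illustrative weight on the adversarial term

# One illustrative training step on random stand-in "X-ray" tensors.
x = torch.randn(32, 1, 64, 64)
y_disease = torch.randint(0, 2, (32, 1)).float()
y_group = torch.randint(0, 2, (32,))

feats = encoder(x)
loss_disease = bce(disease_head(feats), y_disease)
loss_group = ce(group_head(GradReverse.apply(feats, lambda_adv)), y_group)
opt.zero_grad()
(loss_disease + loss_group).backward()
opt.step()
```

Subgroup robustness methods would instead modify the training objective itself, for example by weighting the loss toward the worst-performing subgroup.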
Challenges with Fairness
However, these approaches only worked when the models were tested on data from the same type of patients they were trained on, in this case patients from the Beth Israel Deaconess Medical Center dataset.
When the researchers evaluated the “debiased” models on patients from five other hospital datasets, overall accuracy remained high, but some models again showed large fairness gaps.
Zhang said this is concerning because hospitals commonly deploy off-the-shelf models obtained from other sources, and the fairness achieved through debiasing did not hold up when the models were applied to these different patient populations.
Ghassemi noted that while models are optimized to perform well on data resembling their training distribution, they often fail to balance overall and subgroup performance in new environments, which mirrors how models are deployed in the real world. The researchers are now exploring additional methods to make models fairer across different datasets.
The findings underscore the importance of hospitals that use AI models evaluating them on their own patient populations before deployment, to avoid inaccurate results for specific groups.
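One way a hospital might act on that recommendation is a simple pre-deployment audit: score the off-the-shelf model on a local held-out set and flag any subgroup whose AUC falls well below the overall AUC. The column names, demographic attributes, and gap threshold below are hypothetical.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def audit_subgroups(df: pd.DataFrame, attrs=("sex", "race"), max_gap=0.05):
    """Flag subgroups whose AUC falls more than `max_gap` below the overall AUC.

    Expects columns 'label' (0/1 ground truth) and 'score' (model output), plus
    one column per demographic attribute in `attrs`. The column names and the
    threshold are illustrative assumptions, not part of the study.
    """
    overall_auc = roc_auc_score(df["label"], df["score"])
    flagged = []
    for attr in attrs:
        for value, sub in df.groupby(attr):
            if sub["label"].nunique() < 2:
                continue  # AUC is undefined if a subgroup has only one class
            sub_auc = roc_auc_score(sub["label"], sub["score"])
            if overall_auc - sub_auc > max_gap:
                flagged.append((attr, value, round(sub_auc, 3)))
    return overall_auc, flagged
```

If a subgroup is flagged, retraining or fine-tuning on local data, as the authors recommend when such data are available, would be a natural next step.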
The study received funding from several sources, including a Google Research Scholar Award, the Robert Wood Johnson Foundation Harold Amos Medical Faculty Development Program, RSNA Health Disparities, the Lacuna Fund, the Gordon and Betty Moore Foundation, the National Institute of Biomedical Imaging and Bioengineering, and the National Heart, Lung, and Blood Institute.