If you ask multiple medical experts to evaluate the same medical image or case, they will often give different opinions. In fact, if you ask the same expert about the same case in the morning and at night, it is not uncommon for them to contradict their own prior opinion.
This simple example is part of the broader problem of medical error, defined as an unintended act (either of omission or commission) or one that does not achieve its intended outcome. Makary and Daniel (2016) claim that “medical error is the third most common cause of death in the US,” with a point estimate of 251,000 deaths per year. The problem is likely underreported, however, since medical error is not included on death certificates or in traditional cause-of-death ranking metrics.
And doctors are humans, who make human errors. Being fatigued, stressed, or distracted while evaluating a case can profoundly impair an expert's decision-making.
So experts often disagree with each other, disagree with themselves, and are prone to human error. These problems are pervasive, but rarely documented. Outside of medicine, we instrument and analyze many aspects of our personal lives: the number of calories burned in our morning workout, how much time we’ve spent on social media, our likelihood of receiving a low interest rate. This carries over to the professional world too: ask an MLB player how they are hitting this season, and they will know their batting average within a few points.
However, for some of the most critical questions and resultant decisions in life, such as “does this X-ray contain cancer”, we do not have an effective way to systematically and continually evaluate the performance of physicians. Furthermore, if you ask a radiologist how good they are at identifying breast cancer, they will probably refer you to their resume, since there are no concrete metrics that they can share for how good they actually are in either absolute terms or relative to their peers.
Along those lines, Berner and Graber (2008) show that physician overconfidence leads to misdiagnosis across dozens of medical conditions. They cite a study by Dreiseitl and Binder (2005) which found that experienced dermatologists were confident in their melanoma diagnoses in 50% of test cases, but were wrong in 30% of those confident decisions.
Even though it may be challenging to measure the accuracy of a physician’s diagnosis, it’s not hard to see the impacts of a mistake. Diagnostic error in medicine is a widespread problem, with the Society to Improve Diagnosis in Medicine suggesting that 40,000 to 80,000 patients die annually in the U.S. from diagnostic errors.
Healthcare AI companies suffer the consequences of diagnostic errors too, because they will often employ physicians to create training data and only solicit a single opinion on a particular case. As a result, noisy or inaccurate labels will negatively impact model performance and force companies to have an overall larger training dataset to reduce model bias. And even so, the model can only be as accurate as the already error-prone individual expert.
One of the best ways to improve data labeling accuracy is to gather and aggregate multiple expert opinions. As Kurvers et al. (2016) found, “Aggregating the independent judgments of doctors outperforms the best doctor in a group.”
In their study, Kurvers and colleagues aggregated opinions using a simple majority rule. However, much more sophisticated aggregation methods become possible once an expert’s performance on a task can be measured over time. This allows for a weighted approach that biases toward the opinions of experts who have demonstrated the highest performance on similar tasks.
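The contrast between the two approaches can be sketched in a few lines of Python. This is an illustrative example, not Centaur Labs' production method: the opinions, accuracy figures, and log-odds weighting scheme below are all assumptions (log-odds weights are the statistically optimal choice only when experts err independently).

```python
import math
from collections import Counter

def majority_vote(opinions):
    """Simple majority rule: the most common label wins."""
    return Counter(opinions).most_common(1)[0][0]

def weighted_vote(opinions, accuracies):
    """Weight each opinion by the expert's measured track record.

    Uses log-odds weights, which are optimal under the (strong)
    assumption that experts make errors independently.
    """
    scores = {}
    for label, acc in zip(opinions, accuracies):
        weight = math.log(acc / (1.0 - acc))
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

# Hypothetical case: two mediocre readers call the case benign,
# while one reader with a strong track record calls it malignant.
opinions   = ["benign", "benign", "malignant"]
accuracies = [0.55, 0.60, 0.92]  # assumed historical accuracies

print(majority_vote(opinions))              # -> benign
print(weighted_vote(opinions, accuracies))  # -> malignant
```

Here the majority rule sides with the two weaker readers, while the performance-weighted rule defers to the expert with the strongest demonstrated track record; which answer is right depends entirely on how reliable those measured accuracies are.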
Improving healthcare AI through a more data-driven approach inspired us to start Centaur Labs. We provide trusted, accurate medical data annotations (classification, segmentation, etc.) from a network of medical experts. Our novel approach intelligently weighs multiple opinions and uses performance-based incentives to ensure that quality assurance is baked into every step of the process.
The value of aggregating multiple opinions is demonstrated in the chart below. As you can see, the greater the number of opinions, the greater the classification accuracy, up until a plateau of around 10 opinions. The actual number of opinions needed varies with the complexity of the task and the difficulty of the case. As we've found, simpler cases require fewer expert opinions, while more difficult cases require more.
For a more thorough analysis of the unique challenges with medical data labeling, the relative lack of accuracy produced by traditional data labeling methods, and a more accurate and scalable alternative for your medical AI based on collective intelligence, download our white paper here.
For more information about how we can offer you accurate medical data labels at scale, contact us today.