Many AI clients come to us after a painstaking journey coordinating with medical labeling experts. We’ve heard this story time and again: hours spent qualifying experts and checking the accuracy of data labels, with no clear success metric or methodology. Without an industry standard for assessing label quality, AI companies either ask an in-house expert to QA the labels (and rely on the accuracy of that single opinion) or feed the labels into their AI model and see what comes out. In the end, they’re rarely happy with the results.
This is why one of the first questions we ask customers is, “How are you measuring accuracy?” At the outset of a project, we seek to determine an appropriate benchmark for success. Customers give us over 100 examples, which we call gold standards, to train our labelers. Gold standards act as a source of truth, letting us give real-time feedback to our labeling network as they complete the task.
As it turns out, customers are not always confident in their gold standard answers. So how can we benchmark and train our labelers if the gold standards are not always right?
Measuring inter-rater reliability (IRR) of the gold standards
In an ideal world, customers create their gold standards from multiple expert opinions. We then calculate the rate of disagreement across the experts as a benchmark.
Sometimes, we coordinate the gold standard creation for the client. For example, we hired three pulmonologists, vetted by our client, to independently listen to one hundred lung sound recordings and identify whether there was an adventitious lung crackle. Almost half of the hundred cases did not reach consensus: the experts disagreed on whether they heard a crackle. The three pulmonologists then listened to the 45 controversial cases together, reaching a final ruling after lengthy discussion and debate.
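As a rough illustration of how a disagreement benchmark like this could be computed, here is a minimal Python sketch. The ratings are made-up toy data, the function names are hypothetical, and Fleiss' kappa is our own choice of a standard chance-corrected IRR statistic; the approach described above only specifies a rate of disagreement across experts.

```python
from collections import Counter

def disagreement_rate(ratings):
    """Fraction of cases where the raters were not unanimous."""
    split = sum(1 for case in ratings if len(set(case)) > 1)
    return split / len(ratings)

def fleiss_kappa(ratings):
    """Fleiss' kappa; every case must have the same number of raters."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})
    counts = [Counter(case) for case in ratings]
    # Per-case agreement: share of rater pairs that gave the same label.
    per_case = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_case) / n_cases
    # Agreement expected by chance, from the overall label proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_cases * n_raters) for cat in categories
    ]
    expected = sum(p ** 2 for p in proportions)
    if expected == 1:  # only one label was ever used
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy data (hypothetical): three raters, four recordings.
ratings = [
    ("crackle", "crackle", "crackle"),
    ("crackle", "none", "crackle"),
    ("none", "none", "none"),
    ("none", "crackle", "none"),
]
rate = disagreement_rate(ratings)
kappa = fleiss_kappa(ratings)
print(f"disagreement rate: {rate:.2f}, Fleiss' kappa: {kappa:.2f}")
```

The raw disagreement rate is easy to explain to a client, while a chance-corrected statistic helps when one label is much more common than the other.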
Gathering anecdotal evidence
The reality is that medical assessments are rarely clear-cut. Together with a pathology client’s in-house MD and AI team, we reviewed slides labeled by their team of pathology consultants. On the same slides, the consultants drew wildly different shapes outlining the cancerous cells. Looking at these examples together set the expectation that this was a hard task, and one where agreement among the labelers was likely to be low.
Confusion matrices to benchmark against gold standards
Once the labeling task is underway, we calculate confusion matrices to benchmark our labelers' performance against the gold standards. These measure our true positive, false positive, false negative, and true negative rates.
Here's the confusion matrix for the lung crackle labeling task. As you can see, our false positive rate for crackle was much higher than our false negative rate: it was really hard to distinguish crackle from background noise or other coarse lung sounds! Since our client had witnessed the high rate of disagreement on the gold standards, they fully expected these results.
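For readers who want to reproduce this kind of benchmark, the following is a small Python sketch of a binary confusion matrix computed against gold standards. The labels and data are hypothetical toy values, not the results from the lung crackle task, and the function names are ours.

```python
def confusion_counts(gold, predicted, positive="crackle"):
    """Count TP, FP, FN, and TN for one positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for gold_label, predicted_label in zip(gold, predicted):
        if predicted_label == positive and gold_label == positive:
            counts["tp"] += 1
        elif predicted_label == positive:
            counts["fp"] += 1
        elif gold_label == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def error_rates(counts):
    """False positive rate = FP / (FP + TN); false negative rate = FN / (FN + TP)."""
    negatives = counts["fp"] + counts["tn"]
    positives = counts["fn"] + counts["tp"]
    return {
        "fpr": counts["fp"] / negatives if negatives else 0.0,
        "fnr": counts["fn"] / positives if positives else 0.0,
    }

# Toy data (hypothetical): gold standard answers vs. labeler consensus.
gold = ["crackle", "crackle", "none", "none", "none", "crackle"]
predicted = ["crackle", "none", "crackle", "crackle", "none", "crackle"]

counts = confusion_counts(gold, predicted)
rates = error_rates(counts)
print(counts, rates)
```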
Agreement and difficulty scores leverage multiple opinions to assess task difficulty
At the individual case level, we calculate two scores to measure our confidence in the answer.
If it’s a gold standard, we measure how often our labelers submitted a different answer than the gold standard answer, which we call the difficulty score. Whether or not it’s a gold standard, we also measure how often labelers agreed with each other, which we call the agreement score. These scores are unique to our solution because we gather multiple opinions per case to come to a consensus answer.
The average of those scores helps us understand at a high level both how hard the task is and how well we are performing in comparison to the client’s gold standards. The information also lets our clients use our data more selectively, sometimes underweighting certain controversial cases in their models or double-checking whether their gold standards were right.
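To make the two scores concrete, here is a minimal Python sketch. The difficulty score follows the definition above. The exact agreement formula is not spelled out here, so the version below (the share of reads that match the most common answer) is an assumption for illustration, as are the function names and the toy data.

```python
from collections import Counter

def difficulty_score(reads, gold_answer):
    """Share of labeler reads that differ from the gold standard answer."""
    return sum(1 for read in reads if read != gold_answer) / len(reads)

def agreement_score(reads):
    """Share of reads matching the most common answer (assumed definition)."""
    top_count = Counter(reads).most_common(1)[0][1]
    return top_count / len(reads)

# Toy data (hypothetical): four reads on one gold standard case.
reads = ["crackle", "crackle", "none", "crackle"]
difficulty = difficulty_score(reads, gold_answer="none")
agreement = agreement_score(reads)
print(f"difficulty: {difficulty:.2f}, agreement: {agreement:.2f}")
```

A case like this one, where labelers largely agree with each other but disagree with the gold standard, is a natural candidate for double-checking the gold standard itself.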
Highlight controversial cases
Instead of random spot checks, we elevate the hardest cases for additional review using these scores. This focuses attention on edge cases and makes QA time more efficient.
Check out this case from a highly validated ISIC dataset with a 100% difficulty score, meaning that all the labelers (30+ reads in this case) disagreed with the “correct” answer. The case shows no dermoscopic features of a benign nevus, and the score elevated it for further review.
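One way to turn per-case scores into a review queue is sketched below in Python. The ranking rule (take the larger of the difficulty score and one minus the agreement score, then review the highest first) is our assumption for illustration, and the case records and field names are hypothetical.

```python
def review_priority(case):
    """Higher means more controversial. Non-gold cases have no difficulty score."""
    difficulty = case["difficulty"] if case["difficulty"] is not None else 0.0
    return max(difficulty, 1 - case["agreement"])

def cases_for_review(cases, top_n):
    """Return the IDs of the top_n most controversial cases."""
    ranked = sorted(cases, key=review_priority, reverse=True)
    return [case["case_id"] for case in ranked[:top_n]]

# Toy data (hypothetical): "difficulty" is None when the case is not a gold standard.
cases = [
    {"case_id": "a", "difficulty": 1.0, "agreement": 0.9},
    {"case_id": "b", "difficulty": None, "agreement": 0.5},
    {"case_id": "c", "difficulty": 0.1, "agreement": 0.95},
    {"case_id": "d", "difficulty": 0.6, "agreement": 0.6},
]
queue = cases_for_review(cases, top_n=3)
print(queue)
```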
Flagged cases identify outliers
Lastly, we have a mechanism for labelers to flag cases. These cases will be pulled from the dataset for manual review. This becomes a crucial tool for a few reasons, including:
My advice to medical AI companies evaluating different labeling vendors is to ask a lot of questions. In particular:
Medical assessments are rarely black and white, and to handle the grey you need a rigorous, data-driven approach.