Our data-driven approach to QA

March 29, 2021

The problem

Many AI clients come to us after a painstaking journey coordinating with medical labeling experts. We’ve heard this story time and again: hours spent checking the accuracy of data labels and qualifying experts, with no clear success metric or methodology. Without an industry standard for assessing label quality, AI companies either ask an in-house expert to QA the labels (relying on the accuracy of that single opinion) or feed the labels into their AI model and see what comes out. In the end, they’re rarely happy with the results.

Benchmarking success

This is why one of the first questions we ask customers is: how are you measuring accuracy? At the outset of a project, we seek to determine an appropriate benchmark for success. Customers give us over 100 examples with known answers, which we call gold standards, to train our labelers. Gold standards act as a source of truth, letting us provide real-time feedback to our labeling network as they complete the task.

As it turns out, customers are not always confident in their gold standard answers. So how can we benchmark and train our labelers, if the gold standards are not always right?

Measuring inter-rater reliability (IRR) of the gold standards

In an ideal world, customers create their gold standards from multiple expert opinions. We then calculate the rate of disagreement across the experts as a benchmark.

Sometimes, we coordinate the gold standard creation for the client. For example, we hired three pulmonologists, vetted by our client, to independently listen to one hundred lung sound recordings and identify whether there was an adventitious lung crackle. Almost half of the hundred cases did not have consensus: the experts disagreed on whether they heard crackle. The three pulmonologists then listened to the 45 controversial cases together, coming to a final ruling after lengthy discussion and debate.
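To make the benchmark concrete, here is a minimal sketch of a disagreement rate, assuming each case carries one label per expert and that any non-unanimous case counts as a disagreement. The function name and sample labels are hypothetical, and this is a simple proportion rather than a chance-corrected statistic such as Fleiss' kappa. Applied to the lung sound example above, 45 non-unanimous cases out of 100 gives a rate of 0.45.

```python
from typing import Sequence


def disagreement_rate(ratings: Sequence[Sequence[str]]) -> float:
    """Fraction of cases where the experts did not all give the same label."""
    if not ratings:
        raise ValueError("ratings must contain at least one case")
    # A case lacks consensus when its experts produced more than one distinct label.
    non_unanimous = sum(1 for case in ratings if len(set(case)) > 1)
    return non_unanimous / len(ratings)


# Hypothetical reads from three experts on four recordings (illustrative only).
example_ratings = [
    ("crackle", "crackle", "crackle"),
    ("crackle", "no crackle", "crackle"),
    ("no crackle", "no crackle", "no crackle"),
    ("no crackle", "crackle", "no crackle"),
]

# Two of the four cases lack consensus.
example_rate = disagreement_rate(example_ratings)
```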

Gathering anecdotal evidence 

The reality is that medical assessments are rarely clear cut. For a pathology client, together with their in-house MD and AI team, we reviewed slides labeled by their team of pathology consultants. On the same slides, the consultants drew wildly different shapes outlining the cancerous cells. Looking at these examples together set the expectation that this was a hard task, and one where the agreement among the labelers was likely to be low.

Calibrating performance 

Confusion matrices to benchmark against gold standards

Once the labeling task is underway, we calculate confusion matrices to benchmark our labelers' performance against the gold standards. These measure our true positive, false positive, false negative, and true negative rates. 
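As a rough illustration of this step, the sketch below tallies the four confusion-matrix cells for a binary task and derives the false positive and false negative rates. It assumes labels are booleans (True meaning the finding is present); the function names and sample data are hypothetical and not taken from our pipeline.

```python
from collections import Counter
from typing import Dict, Sequence


def confusion_counts(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, int]:
    """Count true/false positives and negatives against the gold standards."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    counts: Counter = Counter()
    for g, p in zip(gold, predicted):
        if g and p:
            counts["tp"] += 1
        elif not g and p:
            counts["fp"] += 1
        elif g and not p:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    # Counter returns 0 for cells that never occurred.
    return {cell: counts[cell] for cell in ("tp", "fp", "fn", "tn")}


def error_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """False positive rate = FP / (FP + TN); false negative rate = FN / (FN + TP)."""
    negatives = counts["fp"] + counts["tn"]
    positives = counts["fn"] + counts["tp"]
    return {
        "false_positive_rate": counts["fp"] / negatives if negatives else 0.0,
        "false_negative_rate": counts["fn"] / positives if positives else 0.0,
    }


# Hypothetical gold standards and labeler answers (illustrative only).
example_gold = [True, True, False, False, False]
example_answers = [True, False, True, True, False]

example_counts = confusion_counts(example_gold, example_answers)
example_rates = error_rates(example_counts)
```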

Here's the confusion matrix for the lung crackle labeling task. As you can see, we had a much higher false positive rate than false negative rate for crackle: it was really hard to distinguish crackle from background noise or other coarse lung sounds! Since our client had witnessed the high rate of disagreement on the gold standards, they fully expected these results.


Confusion matrix highlighting agreement between labelers and gold standards.

Agreement and difficulty scores leverage multiple opinions to assess task difficulty

At the individual case level, we calculate two scores to measure our confidence in the answer. 

If it’s a gold standard, we measure how often our labelers submitted an answer different from the gold standard answer, which we term the difficulty score. Whether or not it’s a gold standard, we also measure how often labelers agreed with each other, which we term the agreement score. These scores are unique to our solution because we gather multiple opinions per case to come to a consensus answer.

The average of those scores helps us understand at a high level both how hard the task is and how well we are performing in comparison to the client’s gold standards. The information also enables our clients to intelligently ingest our data, sometimes underweighting certain controversial cases in their models or double checking whether their gold standards were right.
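The two per-case scores can be sketched as follows. The difficulty score follows the description above directly. For the agreement score, this sketch assumes one possible definition, the share of reads that match the most common answer; the exact formula is not specified here, so treat that choice, along with the function names and sample reads, as hypothetical.

```python
from collections import Counter
from typing import Sequence


def difficulty_score(reads: Sequence[str], gold: str) -> float:
    """Share of labeler reads that differ from the gold standard answer."""
    if not reads:
        raise ValueError("reads must contain at least one answer")
    return sum(1 for answer in reads if answer != gold) / len(reads)


def agreement_score(reads: Sequence[str]) -> float:
    """Share of reads matching the most common answer (assumed definition)."""
    if not reads:
        raise ValueError("reads must contain at least one answer")
    most_common_count = Counter(reads).most_common(1)[0][1]
    return most_common_count / len(reads)


# Hypothetical reads for one gold standard case (illustrative only).
example_reads = ["crackle", "crackle", "crackle", "no crackle"]
example_gold = "no crackle"

example_difficulty = difficulty_score(example_reads, example_gold)
example_agreement = agreement_score(example_reads)
```

In this made-up case, three of four labelers agree with each other but not with the gold standard, so both scores come out at 0.75. High agreement paired with high difficulty is the pattern that suggests the gold standard itself deserves a second look.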

Elevating the hardest cases

Highlight controversial cases

Instead of random spot checks, we elevate the hardest cases for additional review using these scores. This focuses attention on edge cases and makes QA time more efficient. 
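One way to picture this triage is a simple filter and sort over the per-case scores. This is a sketch under assumptions: the `CaseScore` fields, the threshold values, and the ordering rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseScore:
    case_id: str
    agreement: float                    # 0.0 to 1.0, higher means labelers agreed more
    difficulty: Optional[float] = None  # None when the case has no gold standard


def cases_for_review(
    cases: List[CaseScore],
    max_agreement: float = 0.6,   # hypothetical threshold
    min_difficulty: float = 0.5,  # hypothetical threshold
) -> List[CaseScore]:
    """Select low-agreement or high-difficulty cases, hardest first."""
    selected = [
        case
        for case in cases
        if case.agreement <= max_agreement
        or (case.difficulty is not None and case.difficulty >= min_difficulty)
    ]
    # Sort by difficulty descending (missing difficulty counts as 0), then agreement ascending.
    return sorted(selected, key=lambda case: (-(case.difficulty or 0.0), case.agreement))


# Hypothetical scores (illustrative only).
example_cases = [
    CaseScore("a", agreement=0.95, difficulty=0.05),
    CaseScore("b", agreement=1.0, difficulty=1.0),  # everyone disagrees with the gold standard
    CaseScore("c", agreement=0.5),                  # no gold standard, labelers split
    CaseScore("d", agreement=0.9),
]

example_review_ids = [case.case_id for case in cases_for_review(example_cases)]
```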

Check out this case from a highly validated ISIC dataset with a 100% difficulty score, meaning all the labelers (30+ reads in this case) disagreed with the “correct” answer. The case shows no dermoscopic features of a benign nevus, and the score elevated it for further review.

A controversial case labeled as "benign nevus" in the dataset


Flagged cases identify outliers

Lastly, we have a mechanism for labelers to flag cases. Flagged cases are pulled from the dataset for manual review. This is a crucial tool in a few situations, including:

  1. The data is irrelevant, for example pictures of an operating room. We want those cases removed from circulation.
  2. The image is poor quality. We often instruct our labelers to flag blurry or low-quality images that cannot be assessed.
  3. The labelers vehemently disagree with the gold standard. Since labelers are paid based on their performance on cases with answers, they are incentivized to flag cases where they are certain the answer is wrong.

OK, so what should I do?

My advice to medical AI companies evaluating different labeling vendors is to ask a lot of questions. In particular:

  1. How are they measuring accuracy?
  2. How are they approaching QC?
  3. What are they doing to add extra review when the task is hard? 

Medical assessments are rarely black and white, and to handle the grey you need a rigorous, data-driven approach.
