How to draw insights from your medical data labels

October 20, 2020

Understand how Centaur Labs' data annotation platform offers richer results than traditional data labeling vendors to improve your medical AI pipeline

As demand for medical AI has grown rapidly, so has the need for accurate and scalable medical data labeling solutions. In recent years, three models for acquiring medical data labels have emerged:

  1. Hire board-certified physicians for $200/hour or more
  2. Recruit medical students and residents for tasks that do not require as much skill
  3. Work with offshore labeling vendors that offer a ‘hybrid team’ consisting of one medical expert supported by a team of non-medical staff

As cited in our white paper, these traditional models fail to deliver highly accurate results. What’s more, the reasoning behind the labels is something of a ‘black box’: no information such as confidence or case difficulty accompanies the labels, which makes quality control (QC) of data labels extremely difficult.

In contrast, labels produced through Centaur Labs’ collective intelligence approach not only achieve high accuracy but also provide deep insight into each label. The additional signal that comes from collecting and aggregating multiple expert opinions opens the door to several beneficial use cases.

Improve the accuracy of existing data labels

If you have a dataset with labels of uncertain quality, whether they were produced by a single expert, supplied by an outsourced labeling vendor, or pre-labeled by an AI model, our network of experts can review each label and flag any that are incorrect. This use case was inspired by the joke among radiologists that “every time a radiologist looks at a scan, they find a new nodule.”

Joking aside, consider the following example: we once worked with colonoscopy data where “ground truth” data was provided by an AI company. Since they had an expert GI doctor perform the labeling, they assumed it was indeed the ‘ground truth’, and assured us that each frame had only one polyp. This polyp was marked by the yellow box in the image below:

"Ground Truth" Segmentation from GI Expert

Upon inspection, it is fairly clear that there are actually two polyps in this image and that the GI doctor simply missed the second one. This is more common than one might expect, because it is easy for a single individual to make a mistake, especially on the tedious tasks common in data labeling.

When we ran the data through our platform, our experts were told that there was only one polyp (as this is what the client had told us). Even so, gathering multiple expert opinions produced three distinct kinds of response:

  1. Some segmented the polyp at the top, which the GI expert had segmented
  2. Some segmented the polyp at the bottom, which the GI expert had missed
  3. Some segmented the entire area

Segmentations from Centaur's Expert Network

From these responses, our collective intelligence algorithm was able to aggregate the result properly and correct the original segmentation. Specifically, it segmented not only the polyp marked by the original expert but also the missed polyp in the lower right. The output of our model is shown in the magenta boxes.

Final Segmentation Output (correcting for the missed polyp in the lower right)
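
To make the idea of aggregating several segmentations concrete, here is a minimal sketch of pixel-wise majority voting. It is not Centaur Labs' actual algorithm, which this post does not describe; the function names (`box_mask`, `vote_mask`, `count_regions`), the toy 4x6 grid, and the 50% threshold are all assumptions made for illustration.

```python
def box_mask(rows, cols, top, left, bottom, right):
    """Binary mask with 1s inside the box [top, bottom) x [left, right)."""
    return [[1 if top <= r < bottom and left <= c < right else 0
             for c in range(cols)] for r in range(rows)]


def vote_mask(masks, min_fraction=0.5):
    """Keep a pixel only if more than min_fraction of annotators marked it."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[1 if sum(m[r][c] for m in masks) / n > min_fraction else 0
             for c in range(cols)] for r in range(rows)]


def count_regions(mask):
    """Count 4-connected regions of 1-pixels with an iterative flood fill."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                regions += 1
                seen.add((r, c))
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return regions


# Hypothetical toy data mirroring the three kinds of response above:
# two annotators mark the upper polyp, two mark the lower one,
# and two mark the entire area.
ROWS, COLS = 4, 6
upper = box_mask(ROWS, COLS, 0, 0, 2, 2)
lower = box_mask(ROWS, COLS, 2, 4, 4, 6)
whole = box_mask(ROWS, COLS, 0, 0, 4, 6)

expert_masks = [upper, upper, lower, lower, whole, whole]
consensus = vote_mask(expert_masks)
region_count = count_regions(consensus)  # two separate regions in this toy example
```

In this toy case each polyp is marked by four of six annotators, while the area between them is marked by only two, so the vote keeps both polyps and drops the space in between. The threshold is a design choice: a stricter cutoff would require more agreement before a region is kept.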

Improve model performance by understanding a case’s relative difficulty

Since we collect multiple expert opinions for each case, we are able to develop a much richer understanding of the quality of the data. One particular insight we can measure is the inter-rater reliability, that is, the level of agreement or disagreement among our experts for a specific case.

Combining this signal with others, such as the time each labeler spends on a given case, allows us to calculate a ‘difficulty score’ for each case. Typically, high-difficulty cases are unlike others in the dataset. For example, a case could be very difficult because the correct answer is controversial, as is common when classifying some types of malignant skin lesions or chest x-rays.
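
As a rough illustration of how disagreement and elapsed time might be combined, here is a minimal sketch. The formula, the `time_weight` parameter, the function names, and the sample votes are all assumptions for the example; the post does not specify how the platform actually computes its difficulty score.

```python
from collections import Counter
from statistics import median


def disagreement(labels):
    """Share of votes that differ from the most common label (0 = unanimous)."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1 - top_count / len(labels)


def difficulty_scores(cases, time_weight=0.3):
    """Blend disagreement with relative labeling time for each case.

    cases maps a case id to a list of (label, seconds_spent) votes.
    time_weight is an illustrative choice, not a documented value.
    """
    median_time = {cid: median(sec for _, sec in votes)
                   for cid, votes in cases.items()}
    slowest = max(median_time.values())
    scores = {}
    for cid, votes in cases.items():
        d = disagreement([label for label, _ in votes])
        t = median_time[cid] / slowest if slowest else 0.0
        scores[cid] = (1 - time_weight) * d + time_weight * t
    return scores


# Hypothetical votes: case_a is quick and unanimous, case_b is slow and split.
cases = {
    "case_a": [("benign", 4), ("benign", 5), ("benign", 3), ("benign", 4)],
    "case_b": [("benign", 12), ("malignant", 15), ("malignant", 9), ("benign", 20)],
}
scores = difficulty_scores(cases)
ranked = sorted(scores, key=scores.get, reverse=True)  # hardest case first
```

Sorting by the score gives a rank ordering of cases by difficulty, with the slow, split case ahead of the quick, unanimous one.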

Difficulty scores are reported on our platform for each case labeled:

Cases rank ordered by difficulty

Clients use the difficulty score in a few different ways, depending on their specific situation and goals:

  1. Incorporate difficulty as a feature in their model. By informing the model about which cases are difficult, our clients have realized improved model performance.
  2. Identify cases that are of poor quality. Images are often blurry or distorted, and audio files can contain a lot of background noise. These poor-quality cases can either be edited so they can be used or simply discarded.
  3. Identify cases that are highly controversial. As is common in medicine, there are times when experts will disagree. For specific cases that are hard to judge, clients can opt to get even more expert opinions, then include the case if consensus can be reached or discard it if not.
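
The third option, gathering more opinions and then including or discarding the case, could be sketched as follows. The function name `resolve_case` and the 70% agreement threshold are hypothetical; what counts as consensus is a decision each team would make for its own data.

```python
from collections import Counter


def resolve_case(votes, extra_votes, min_agreement=0.7):
    """Return the consensus label after adding extra opinions, or None.

    None means agreement stayed below min_agreement, so the case would be
    discarded. The threshold is an illustrative assumption.
    """
    all_votes = list(votes) + list(extra_votes)
    label, count = Counter(all_votes).most_common(1)[0]
    if count / len(all_votes) >= min_agreement:
        return label
    return None


# A split case that settles once four more experts weigh in (6 of 8 agree).
settled = resolve_case(["malignant", "benign", "malignant", "benign"],
                       ["malignant", "malignant", "malignant", "malignant"])

# A case that stays evenly split and is therefore dropped.
unsettled = resolve_case(["malignant", "benign"], ["benign", "malignant"])
```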

Getting multiple opinions doesn’t just give you more accurate answers; it also gives you more insight into your data, which you can use to optimize your data pipeline and improve the accuracy of your models. Curious to give our platform a try? Create an account to preview our portal, explore sample datasets, and download example results of medical data created using collective intelligence.
