How multiple opinions drive huge gains in data labeling accuracy
How Centaur Labs leverages multiple expert opinions to create the most accurate medical data labeling platform for text, image and video data
Have you ever witnessed a group of people estimate the number of jelly beans in a jar? Chances are, each person individually is off the mark, but if you average the guesses of everyone in the group, the result will be pretty close.
This is an illustration of a concept known as “wisdom of the crowd.” The aggregate opinion of many individuals tends to be as accurate as, or more accurate than, any single opinion. This idea, while seemingly simple, has powerful applications.
Centaur Labs harnesses the power of a network of experts and applies it to a crucial area of medicine: medical data labeling. By aggregating the opinions of multiple experts per case (often 5 or more), we can label medical text, images, and video with an accuracy superior to that of individual board certified physicians. Our platform supports classification and segmentation tasks.
Our users label medical images through our mobile app, DiagnosUs, and can practice analyzing a wide range of cases, from classifying skin lesions to pixel-level segmentation of lung nodules in an X-ray. Fittingly, the app is valuable to medical students and professionals looking to practice their diagnostic skills. The tasks are gamified and allow users to compete against each other for cash prizes.
How accurate is our network of medical experts?
In this post, we look at how many users, on average, must review a case to give high assurance of accuracy. We use a dataset of skin images, each labeled as showing psoriasis or not by a large number of DiagnosUs users. This project was undertaken in collaboration with LEO Pharma’s Innovation lab. The “correct answers” of the dataset are determined by a panel of 2–12 board certified dermatologists.
If we take the simple majority vote of all DiagnosUs users on each of these images and assume that the board certified dermatologists are correct, the crowd achieves an accuracy of 97.1% and an AUC of over 0.99! It also turns out that most of the cases our crowd gets “incorrect” had fewer dermatologists (usually 2–3) determining the correct answer, which suggests that the network may have identified cases where the board certified dermatologists were incorrect.
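As a rough illustration of this scoring step, the sketch below takes a simple majority vote per image and compares it with the reference label. The vote lists and labels are made-up toy data, not the psoriasis dataset, and the function names `majority_vote` and `crowd_accuracy` are hypothetical.

```python
def majority_vote(votes):
    """Return True if strictly more than half of the boolean votes are positive.

    Ties count as negative; that tie-breaking rule is an assumption of this sketch.
    """
    return sum(votes) * 2 > len(votes)


def crowd_accuracy(votes_per_image, reference_labels):
    """Fraction of images where the crowd's majority vote matches the reference label."""
    matches = [
        majority_vote(votes) == label
        for votes, label in zip(votes_per_image, reference_labels)
    ]
    return sum(matches) / len(matches)


# Toy data (an assumption for illustration): True means "psoriasis present".
toy_votes = [
    [True, True, False, True, True],     # crowd says positive
    [False, False, True, False, False],  # crowd says negative
    [True, False, False, True, False],   # crowd says negative
]
toy_reference = [True, False, True]      # hypothetical panel labels

print(crowd_accuracy(toy_votes, toy_reference))
```

In this toy example the crowd agrees with the reference on two of the three images.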
How many expert opinions is enough?
How many opinions do we need on a particular image in this dataset to get a reasonable assurance of accuracy?
One approach to answering this question would be to repeatedly take a random sample of n users for each image, perform a simple majority vote, average the results, and then see how the accuracy of this method changes for different values of n. However, repeatedly drawing random samples of n users from the N who rated a particular image is relatively computationally expensive. Ideally, we would want to consider every way of choosing n of the N users. How do we do this efficiently?
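For reference, the brute-force sampling approach described above might look like the following sketch for a single image. The vote counts (100 votes, 60 correct), the trial count, and the function name `sampled_majority_accuracy` are assumptions chosen for illustration.

```python
import random


def sampled_majority_accuracy(votes, n, trials=20_000, seed=0):
    """Estimate how often a random n-user subset's majority vote is correct.

    `votes` is a list of booleans, one per user: True if that user's vote
    matched the reference answer.
    """
    rng = random.Random(seed)
    needed = n // 2 + 1  # smallest strict majority
    hits = 0
    for _ in range(trials):
        subset = rng.sample(votes, n)  # n users, drawn without replacement
        if sum(subset) >= needed:
            hits += 1
    return hits / trials


# Hypothetical image: 100 votes, 60 of them correct.
votes = [True] * 60 + [False] * 40
estimate = sampled_majority_accuracy(votes, n=5)
print(f"estimated accuracy for n=5: {estimate:.3f}")
```

Every value of n on every image needs thousands of draws, and the result is still only an estimate, which is the cost the question above is pointing at.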
This is where hypergeometric distributions come in. A hypergeometric distribution describes the probability of getting k successes (draws with a specific characteristic) in n draws, without replacement, from a finite population of N objects, K of which have that characteristic. For example, say we have 100 votes on an image, 60 of which are positive (presence of psoriasis), and we want to know the probability that we will get exactly 3 positive votes if we draw 5 of them at random (without replacement). The answer to this question would be (60 choose 3) * (40 choose 2) / (100 choose 5). This is the number of ways to choose 3 of the 60 positive votes and 2 of the remaining 40 non-positive votes, over the total number of ways to choose 5 out of 100 votes. The function that assigns these probabilities to each value of k is the probability mass function (PMF), and the cumulative sum of the PMF values is the cumulative distribution function (CDF). For example, we can compute the probability of drawing 3 or fewer positive votes out of 5 exactly, with no sampling, as a value of the CDF.
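The worked example above can be checked with a few lines of standard-library Python. This is a minimal sketch: the helper names `hypergeom_pmf` and `hypergeom_cdf` are hypothetical, with N the total number of votes, K the number of positive votes, n the number of votes drawn, and k the number of positive votes among those drawn.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(exactly k successes in n draws without replacement from N items, K of them successes)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_cdf(k, N, K, n):
    """P(at most k successes), computed as a running sum of the PMF."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k + 1))


# The example from the text: 100 votes, 60 positive, draw 5.
p_exactly_3 = hypergeom_pmf(3, N=100, K=60, n=5)
p_at_most_3 = hypergeom_cdf(3, N=100, K=60, n=5)
print(f"P(exactly 3 positive)  = {p_exactly_3:.4f}")
print(f"P(3 or fewer positive) = {p_at_most_3:.4f}")
```

With SciPy installed, `scipy.stats.hypergeom.pmf(3, 100, 60, 5)` should give the same PMF value. Note that SciPy's argument order is k, population size, number of successes in the population, number of draws.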
So, back to our original question. In order to estimate the accuracy of a subset of n users on a particular image, we can use a CDF. SciPy has a built-in function called hypergeom.cdf that computes this relatively quickly. For a subset of n users, we can determine the probability that a majority of that subset (at least (n+1)/2 users for an odd value of n) voted for the correct answer from two numbers: the total number of users who voted on the image and, within that group, the number who voted correctly. To get the probability that at least a majority voted for the correct answer, we can compute 1 - CDF, where the CDF is evaluated at (n-1)/2 and so gives the probability that fewer than half of the n sampled users voted correctly. If we compute this for various odd values of n (1, 3, 5, etc.) we get the following diagram:
It would appear that the accuracy begins to level off at about n = 5 or n = 7, approaching the maximum accuracy of 97.1%. This indicates that we need surprisingly few of our users to obtain a reasonable assurance of accuracy on this dataset. We could improve the accuracy even further by weighting each individual's opinion according to their skill level.
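The numbers behind a curve like this can be computed along the following lines. This is a sketch for a single hypothetical image (100 votes, 60 of them correct), not the actual dataset or the code used for this post, and `prob_majority_correct` is a made-up name. It sums the upper tail of the hypergeometric PMF directly, which is the same quantity as 1 minus the CDF evaluated just below the majority threshold.

```python
from math import comb


def prob_majority_correct(total_votes, correct_votes, n):
    """Probability that a strict majority of n randomly chosen voters is correct.

    Voters are drawn without replacement from `total_votes` voters, of whom
    `correct_votes` agreed with the reference answer.
    """
    needed = n // 2 + 1  # equals (n + 1) / 2 for odd n
    ways = sum(
        comb(correct_votes, k) * comb(total_votes - correct_votes, n - k)
        for k in range(needed, n + 1)
    )
    return ways / comb(total_votes, n)


# Hypothetical image: 100 votes, 60 correct. Try several odd subset sizes.
for n in (1, 3, 5, 7, 9):
    print(n, round(prob_majority_correct(100, 60, n), 3))
```

With SciPy, `1 - hypergeom.cdf(needed - 1, total_votes, correct_votes, n)` should give the same value. Averaging this probability over all images for each n would give one point of the accuracy curve per subset size.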
At Centaur Labs, we are excited to see how our network of experts will ultimately improve the ways we make decisions in crucial areas like medicine. We believe that the future of medical AI will involve intelligent collaboration between humans and computers, as both bring complementary strengths to the table. Ultimately, the diversity of our network is our greatest asset.