Approximately 2.6 million academic papers are published annually, up from 1.8 million in 2008. These studies push the frontiers of knowledge and shape our individual opinions and actions, as well as public policy and discourse. Yet when you search for information on any popular commercial search engine, you’re more likely to find marketing blogs and op-ed pieces than the rigorous science and thoughtful analysis of professionals. This high-quality information is buried under layers of what is often hastily created fluff.
Imagine, instead, you take your important questions to a search engine that queries the relevant academic literature, and only shares scientifically substantiated search results. This is the solution the team at Consensus is building - aiming to make science accessible and consumable for all.
As the team was building the search algorithm that powers Consensus, and gearing up for a beta launch, they were struggling to build the training, test and validation datasets they needed to improve their models. The bottleneck was their unscalable data annotation system.
The team had developed a stopgap system, posting jobs on UpWork - a popular platform to find freelance workers - looking for people with PhDs who had the expertise required to accurately read and interpret scientific papers, and who were willing to do tedious manual annotation.
“We thought about outsourcing to Amazon Mechanical Turk, or a general data annotation service provider, but we were warned against it. Because we’re labeling scientific documents, our labelers need to be skilled and have experience reading, interpreting and synthesizing these papers. This expertise will produce the high quality annotations we can use to improve model quality,” says Eric Olson, cofounder and CEO of Consensus.
The team then interviewed, hired, and onboarded 12 of these freelance annotators. For each annotation task they would meet with each annotator to train them on the task, answer any of their questions, and then manage the quality of their work, coaching them as needed. They’d also need to track these freelancers down as deadlines were inevitably missed. While this system helped Consensus get to an MVP, they knew it would not scale to support their beta launch and broader ambitions.
“While the labeling software we use is great, the onus is on us to find and manage the labelers. Managing all these people and different moving parts was at least a hundred hours of management work per annotation task. And it is only going to get more complex as we grow. Because of the work to recruit, train and manage the PhD labelers, the system was unscalable,” said Olson.
Consensus started working with Centaur Labs to create a high quality data annotation system that would scale as the company grew. Consensus wanted to evaluate their current model to answer three questions: How relevant are the search results the model currently produces to the initial search query? Is the information shared in the search results simple and easy to understand? Does it make sense when read outside of the context of the entire scientific paper?
To answer these questions, Consensus had a test dataset of 12,500 pieces of data to annotate: 5,000 search results and 7,500 query-search result pairs. Each of the 5,000 search results needed two classifications: first, whether the sentence required additional information to be meaningful outside of the context of the scientific paper, and second, whether the sentence needed to be simplified or rephrased to be understandable to the searcher. The 7,500 query-search result pairs needed to be rated on a 1-3 scale for how well the search result answered the query. Consensus started by annotating subsets of these datasets to serve as examples of high quality annotations - or ‘Gold Standards’. Consensus had 3 of their contracted PhDs annotate each of these Gold Standards, and the majority opinion became the final label.
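The majority-opinion step for the Gold Standards can be sketched in a few lines of Python. This is a minimal illustration, not the Consensus team's actual tooling: the function name, the item IDs, and the votes are all hypothetical. It also assumes that a three-way split on the 1-3 scale is left unresolved, since the article does not say how such cases were handled.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by a strict majority of annotators.

    Returns None when no label has more than half of the votes
    (an assumption; the article does not describe tie handling).
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Hypothetical votes from three PhD annotators on the 1-3 relevance scale.
votes = {
    "pair-001": [3, 3, 2],  # two of three agree on 3
    "pair-002": [1, 2, 3],  # three-way split, no majority
}

gold_standard = {item: majority_label(v) for item, v in votes.items()}
print(gold_standard)
```

With three annotators, the two yes/no classification tasks always produce a majority, so only the 1-3 rating task can end in a split.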
Once the Gold Standards were created, Centaur Labs began generating all 17,500 annotations for the dataset: two classifications for each of the 5,000 search results, plus one rating for each of the 7,500 query-search result pairs. Centaur Labs generated the annotations for the additional information and simplification classification tasks at a rate of approximately 6,600 annotations per week, and the more complex 1-3 rating task at a rate of approximately 1,900 annotations per week. For both tasks, the Centaur Labs network generated an average of 10 qualified opinions per piece of data.
“Annotations that would have taken me 3 months to complete with our prior data annotation system, Centaur Labs completed in only 2 weeks,” said Olson.
On top of this speed, Centaur Labs also delivered high quality annotations. Depending on the task, the majority label generated by Centaur Labs matched the Gold Standard label provided by Consensus in 88.6% to 93% of cases. “Because we have orders of magnitude more reads per case and can measure labeler performance constantly against the Gold Standards - we have much more quality assurance on the data annotations.” Consensus was also able to use these 10 reads per case to draw more meaningful insights about their dataset. They used the labeler agreement rate provided per label to identify the most contentious labels and prioritize investigating them in their model training efforts.
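To make these two measurements concrete, here is a small Python sketch of how a match rate against Gold Standards and a per-item agreement rate could be computed from multiple reads. It assumes agreement is the share of opinions that match the majority label. The article does not give Centaur Labs' actual formula, and the function names, item IDs, and opinions below are hypothetical.

```python
from collections import Counter

def summarize(opinions):
    """Return (majority label, share of opinions that agree with it)."""
    label, votes = Counter(opinions).most_common(1)[0]
    return label, votes / len(opinions)

def gold_match_rate(majority, gold):
    """Fraction of Gold Standard items whose majority label matches."""
    shared = [item for item in gold if item in majority]
    return sum(majority[item] == gold[item] for item in shared) / len(shared)

def most_contentious(agreement, n):
    """Items with the lowest agreement rate, lowest first."""
    return sorted(agreement, key=agreement.get)[:n]

# Hypothetical data: 10 opinions per item on the 1-3 relevance scale.
reads = {
    "a": [1] * 9 + [2],
    "b": [3] * 5 + [2] * 4 + [1],
    "c": [2] * 7 + [3] * 3,
}
gold = {"a": 1, "b": 2, "c": 2}

majority = {item: summarize(ops)[0] for item, ops in reads.items()}
agreement = {item: summarize(ops)[1] for item, ops in reads.items()}

print(gold_match_rate(majority, gold))
print(most_contentious(agreement, 1))
```

Sorting by agreement rate puts the items the labelers disagreed on most at the top of the review queue, which is the kind of prioritization described above.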
With this accelerated data annotation system, Consensus was able to improve their model quality, give their leadership team time back, and create a data annotation system that will scale to support their company’s ambitions. After evaluating their model with this new annotated dataset, the Consensus team began another round of retraining their model. Using the annotations provided by Centaur Labs, Consensus was able to improve their model’s performance on a quality test from 73% up to 87%.
“Working with Centaur Labs to annotate data is better in every way than our prior system. The annotations are more accurate, more affordable and the system is easier for our team to manage,” says Olson. “It improves how we annotate data now, and it gives us the flexibility to stand up larger and more complex annotation projects that will power our model development initiatives as we grow as a company.”