Text Classification Case Study: Centaur Labs Cuts SciBite's Vocabulary Curation Timeline by 2+ Months with Expert Crowd Labeling 

Ali Devaney, Marketing
June 18, 2024

High-quality Text Classification by Centaur Labs Yields 90.3% to 95.1% Accuracy and an Average of up to 10 Reads per Case

So much of scientific information is stored as unstructured text - from scientific literature and internal R&D reports and emails, to patient documents that could qualify a patient for a clinical trial, or describe an adverse reaction to a medication. Mining this unstructured text for insights is key to accelerating scientific advancements.

SciBite, from Elsevier, has developed a semantic analytics platform that enables life sciences organizations to unlock knowledge from unstructured text. At the foundation of this platform are the high-quality vocabularies and ontologies used to detect terms and topics within biomedical text. Maintaining these comprehensive vocabularies is a never-ending task that relies on expert scientific curation.

To scale this critical “humans in the loop” capability, SciBite partnered with Centaur Labs. Read on to learn how our expert-powered text classifications are scaling vocabulary evaluation and expansion for SciBite.

About SciBite  

Founded in 2012 and based in Cambridge, UK, SciBite supports the world’s leading scientific organizations with its pioneering data-first, semantic analytics platform.

At the core of SciBite's platform is TERMite, a high-performance named entity recognition (NER) engine that, when combined with SciBite's domain-specific vocabularies, can recognize and extract relevant terms found in scientific text. 

These capabilities together, among others offered by SciBite, enable leading global pharmaceutical companies to extract insights from unstructured text, powering drug discovery, clinical trial recruitment, pharmacovigilance, and more. 

SciBite’s Challenge: Validating and Scaling Vocabulary Growth 

SciBite has developed over 100 vocabularies that contain as many as hundreds of thousands of entities. The team builds these vocabularies using public ontologies - managed by organizations such as the US National Library of Medicine - as a starting point from which to go into much greater depth. These high-quality vocabularies and ontologies are the critical foundation that enables the TERMite engine to accurately detect important topics within biomedical text.

Keeping these vocabularies up to date is essential, and also complex! New potential synonyms for existing terms can emerge, and ontology creators release regular updates. SciBite’s team of in-house expert curators needs to evaluate each of these external changes and decide if and how to incorporate them into SciBite’s vocabularies.

"We found around 5,000 potential new synonyms for the indication and anatomy vocabularies, but curating that many terms manually would take months of dedicated work," said Mark Streer, Scientific Curator.

After a quick scan of these synonym candidates, sourced from Wikidata and web scraping, it was clear some of the synonyms were high quality, others were definitely wrong, and many warranted further investigation.

Our Solution: Crowdsourced Text Evaluation and Classification 

SciBite partnered with Centaur Labs to design two separate crowdsourcing workflows to evaluate candidate synonyms and accelerate updates to two of SciBite’s most popular vocabularies - their indication and anatomy vocabularies. 

In both cases, the Centaur Labs crowd of medical experts was asked to evaluate the relationship between the candidate synonym and the existing reference term, classifying it as “Exact”, “Broad”, “Narrow”, or “No Match”. 

For both workflows, Centaur Labs was able to evaluate 700 to 1,200 synonyms per day, generating 7 to 10 qualified opinions per case, and achieving high agreement with ground truth reference cases: 90.3% for disease terms and 95.1% for anatomy terms.

The Results: Accelerated Vocabulary Expansion 

Through the crowdsourced efforts, 900 new disease terms and 600 new anatomy terms were classified as “Exact” matches with the reference terms, giving the SciBite team the confidence to add these 1,500 new terms to their vocabularies.

The crowd's high accuracy enabled SciBite to streamline internal curation efforts with our crowd-validated terms. For the most complex synonym candidates, where the interrater agreement was below 75%, the SciBite team performed a review and applied a final classification. Where agreement was above 75%, the classifications were accepted as is, and considered for addition to the vocabularies. 
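As a rough illustration of the review rule described above, here is a minimal Python sketch. It is not Centaur Labs' or SciBite's actual pipeline: the function names, the plurality-vote aggregation, and the definition of agreement as the share of reads that chose the most common label are all assumptions made for this example. The case study also does not say how a case at exactly 75% was handled; this sketch accepts it.

```python
from collections import Counter

# The four classes used in the case study.
LABELS = ("Exact", "Broad", "Narrow", "No Match")

# 75% comes from the case study. Accepting a case at exactly 75% is an assumption.
REVIEW_THRESHOLD = 0.75


def aggregate(reads):
    """Return (plurality_label, agreement) for one candidate synonym.

    Agreement is the share of reads that chose the most common label
    (an assumed definition, not one stated in the case study).
    """
    if not reads:
        raise ValueError("at least one read is required")
    unknown = set(reads) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    label, votes = Counter(reads).most_common(1)[0]
    return label, votes / len(reads)


def route(reads, threshold=REVIEW_THRESHOLD):
    """Accept the crowd label, or flag the case for in-house curator review."""
    label, agreement = aggregate(reads)
    return {
        "label": label,
        "agreement": agreement,
        "needs_review": agreement < threshold,
    }


if __name__ == "__main__":
    # Hypothetical reads for two candidate synonyms.
    clear_case = ["Exact"] * 8 + ["Broad"] * 2
    hard_case = ["Exact"] * 5 + ["Narrow"] * 3 + ["No Match"] * 2
    print(route(clear_case))
    print(route(hard_case))
```

With 7 to 10 reads per case, a tie for the most common label can never reach 75% agreement, so tied cases always fall through to curator review in this sketch.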

"The Centaur Labs crowd offers the scalability and quality we need to rapidly update vocabularies based on new data sources. This frees our curators to focus on the most complex requirements rather than tedious tasks."

The success paved the way for an ongoing partnership, with SciBite looking to leverage Centaur Labs for additional vocabulary updates, generating new vocabularies their customers need, and evaluating the output of new NER models.

Get in touch to learn how Centaur Labs can accelerate your text classification projects and assure the quality of your data pipelines and model outputs.
