Centaur Labs CEO and cofounder, Erik Duhaime, talks Generative AI in healthcare with EMERJ

Ali Devaney, Marketing
August 31, 2023

Centaur Labs CEO and cofounder, Erik Duhaime, sat down with Matthew DeMello from EMERJ to talk about the challenges and opportunities of leveraging Generative AI in healthcare and life sciences.

What you'll hear

✅ Exciting opportunities for impact - more accessible model outputs

✅ Challenges of quality control, model oversight and management

✅ The persistent importance of high quality datasets

✅ Experts in the Loop and Reinforcement Learning with Expert Feedback: Shifting from "Humans" to "Experts"


Matt DeMello: [01:03] The last nine months have completely changed how the world talks about AI - especially with generative AI. How can we see the opportunities and challenges of these new generative AI capabilities in the healthcare space?

Erik Duhaime: [01:18] Yeah, it's been an exciting time. So in terms of opportunities - with the launch of large foundation models, like GPT-3, you can simply build models faster than ever before, basically overnight, and you can interact with them as though they're human, which makes these models far more accessible. 

For example, historically, if you built an AI model to, say, interpret a radiology image, it might spit out a probability like “likelihood of cancer is 70%”. And this can be difficult for people to interpret and work with. People are bad at probabilities - but we're good with language. So now, you have the potential to do something like create a model that can read a radiology image and generate a complete descriptive radiology report just like a human would. So it's an exciting time, and there are a lot of new opportunities.

The challenge is that these more human-like responses produce essentially an infinite number of potential outputs. And this makes quality control more challenging. We've also all heard about issues like “hallucinations”. 

These challenges are especially important in industries like healthcare. You can have a large language model that makes a first draft of marketing copy or drafts an email for you. It can make some mistakes, and that's fine. But how do you maintain quality when the stakes are high?

The "Move fast and break things" approach works well in some areas, but not when it comes to human health. 

One very public example of this recently involved the National Eating Disorders Association. They had an AI powered chatbot that was developed and managed by a third party company. The chatbot would answer questions and offer advice to people seeking advice about their eating disorder. They also had a human powered hotline that patients could call into. Historically, this chatbot was rules-based. So all the answers were pre approved and determined by experts in eating disorders. That way, the company had complete control over the chatbot model - it couldn't do anything unexpected. The AI vendor updated the rules-based chatbot so it leveraged a large language model.

Now, the upside of this change is that the model's responses can be more flexible, more personalized, and sound more human. But I think it was within weeks that there were numerous reports of the model sharing harmful information.

Someone asked the chatbot for help with their eating disorder and was offered information about how to lose weight and restrict calories - behaviors that exacerbate disordered eating.

That's just one example of a company where the developers jumped the gun, and didn't think enough about how to ensure adequate model quality in this new era. So it's certainly a time of exciting opportunities. But there's also a lot of new challenges.

Matt DeMello: [04:12] Absolutely. Let's dig into quality control. I think even from your last answer, our audience could understand being at that juncture of development and thinking, “Well, I’ve tested the model quality - don't I have my bases covered?” and that just not being the case. Your last answer also showed how important this is in the healthcare space: given the regulatory and ethical standards, it's a different ballgame than in other industries. What do healthcare leaders need to know to ensure quality in these models?

Erik Duhaime: [04:38] Ensuring models are accurate - it comes down to three things. So first, it's still all about the data. In specialized domains like healthcare, you need to build large scale training datasets to fine tune those foundation models for your use case. Second, you have to give the model feedback by building expert preference datasets. And finally, you have to recognize that AI development - it's an ongoing iterative process. So you need to rigorously test and monitor models on an ongoing basis. It's not a one and done thing. 

So one thing that's different in domains like healthcare, where accuracy really matters, is you need skilled people - who are experts - involved in all of these model development and management steps.

Historically, most of the grunt work for developing AI - say, in domains like the autonomous vehicle industry - has been done by low-skilled, low-wage workers. But in healthcare, you need experts.

So for that first point, to build large, high quality training datasets, consider the fact that the latest versions of these large language models were essentially trained on the whole corpus of internet data. If you want a more specialized model, you need unique datasets that these models haven't seen before. 

One example is at NYU - they fine tuned Google's BERT model with millions of clinical notes from their EHR. Leveraging that dataset led to significant improvements compared to the baseline, and compared to other models. 

So for the second point - on expert preference datasets - let me say what I mean by this. A lot of models, like GPT-3, were trained with reinforcement learning with human feedback, or RLHF. OpenAI used an army of low-wage workers throughout the world to provide feedback on the model's outputs. These workers are doing things like scoring the summaries and ranking the quality of answers from ChatGPT. When it comes to healthcare, not just anybody can provide adequate feedback on things like the quality of a synthetically generated radiology report, or on the quality of advice for patients with eating disorders. So what you need is people with specialized expertise to provide that feedback.

You have to move beyond reinforcement learning with human feedback, or RLHF, to reinforcement learning with expert feedback, or RLEF. 

So finally, when it comes to ongoing model monitoring and quality control, the same principle applies - it's critical to have experts reviewing model outputs. In many cases, what you want is for a model to generate a response, and then for an expert to review that prediction. It's not really any different from having a model generate an email or draft marketing copy for you, where you still have to review the output before sending that email or posting the marketing copy to the company website. But the difference is that in skilled domains, you need skilled people reviewing the model’s output.

So to take that example of the eating disorder hotline again, you might want to have a quality control system where when a chatbot gets a particular type of question, like something urgent, or maybe something like suicidal ideation, it escalates the response to a human expert for review before it's sent, or to take over the conversation. 
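
As a rough illustration of the escalation idea described above, the sketch below routes a chatbot conversation to a human expert when the user's message matches a trigger phrase. It is only a minimal sketch under stated assumptions: the phrase list, the `Routing` type, and the `route_message` function are hypothetical names invented for this example, not anything described in the interview, and a real system would rely on expert-curated criteria rather than simple keyword matching.

```python
from dataclasses import dataclass

# Hypothetical trigger phrases. In practice this list (or a classifier
# replacing it) would be defined and maintained by clinical experts.
ESCALATION_TERMS = ("suicide", "end my life", "overdose", "emergency")


@dataclass
class Routing:
    escalate: bool  # True if a human expert must review or take over
    reason: str     # short explanation kept for audit purposes


def route_message(user_message: str) -> Routing:
    """Decide whether the model's draft reply needs expert review first."""
    text = user_message.lower()
    for term in ESCALATION_TERMS:
        if term in text:
            return Routing(escalate=True, reason=f"matched term: {term}")
    return Routing(escalate=False, reason="no trigger matched")


if __name__ == "__main__":
    print(route_message("I keep thinking about suicide"))
    print(route_message("What is a balanced breakfast?"))
```

The design choice here is that escalation is decided before any model reply is sent, so the expert can either approve the draft or take over the conversation.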

Regardless, if you don't have real-time experts in the loop, you want to be randomly sampling the model's responses to monitor quality on an ongoing basis. Even if the model seems like it works on day one, that doesn't mean there aren't edge cases that are missed. That doesn't mean new issues won't arise. New people start to use the AI, the world changes, etc - an AI model trained to identify medical misinformation might need to be continually updated as our understanding of diseases and treatments evolves.
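
The random sampling mentioned above could look something like the following. This is a sketch only: the `sample_for_review` function, the 5% default rate, and the log format are assumptions made for illustration, and the interview does not specify how sampling should be done.

```python
import random


def sample_for_review(responses, rate=0.05, seed=None):
    """Pick a uniform random subset of logged responses for expert review.

    responses: list of logged model responses (any objects)
    rate:      fraction of responses to sample (hypothetical default of 5%)
    seed:      optional seed so that an audit sample can be reproduced
    """
    if not responses:
        return []
    rng = random.Random(seed)
    # Always review at least one response, and never more than exist.
    k = min(len(responses), max(1, round(len(responses) * rate)))
    return rng.sample(responses, k)


if __name__ == "__main__":
    log = [f"response-{i}" for i in range(100)]
    for item in sample_for_review(log, rate=0.05, seed=42):
        print(item)
```

Sampling uniformly at random, rather than reviewing only flagged conversations, gives reviewers a chance of catching edge cases that no rule anticipated.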

Making sure these models are accurate requires engaging a workforce of experts at every step of the process. And that's what's going to differentiate those AI models that make an impact - and are lasting - from those that won't be useful, and that might even be harmful.

Matt DeMello: [08:30] We've spoken a lot of times on this show about “humans in the loop”. But I'm seeing now - across the board - a greater emphasis on experts rather than just humans, especially in highly skilled workflows, as you were mentioning. 

Can you put a finer point on “experts”? To me, it doesn't seem to be a matter of just skills or education. Who are these experts? Are you an expert if you’re just spending time with the patient? How do you determine if an expert’s work is high quality? And how can AI business leaders ensure that bringing experts in the loop doesn't slow AI development? 

Erik Duhaime: [09:15] Yeah, I mean, if you're bringing experts into the loop, it's inherently going to be slower than leveraging unskilled workers overseas, which is what most solutions out there provide today. Providing experts in a scalable way - that's what we've set out to do at Centaur Labs. We've certainly built up an army of tens of thousands of medical doctors, PhDs, nurses and other experts to help companies build state of the art AI models. 

But simply having a large workforce of experts isn't enough. We also apply data science principles to ensure quality from those experts.

So motivating experts can be difficult. Even when experts try their hardest, they disagree with each other. How do you get to the truth even when the experts disagree with each other? 

It's actually pretty scary if you look at the literature, and see how much experts disagree with each other when it comes to important decisions like cancer diagnoses. And yet, you're basically trusting one person when you're the patient. That's the question that first motivated me - what do you do when the experts disagree with each other? I was doing my PhD at this place called the Center for Collective Intelligence at MIT and came up with an answer - you basically do two things.

First, you don't just trust experts based on their credential. Just because someone has a fancy degree doesn't mean they're performing well at the task that you care about.

You need to measure performance continually, and in a data driven way. The second thing is you need to harness the collective intelligence of multiple people - don't trust just one expert. 

So for example, we found in research with folks at Memorial Sloan Kettering that - by measuring the performance of people, throwing out the votes of people that are doing a bad job, and intelligently combining the votes of people that are doing a good job, we can get groups of what you might call “semi experts” - like medical students - to outperform board certified dermatologists with 10 years of experience at classifying skin lesions for cancer.

It's not about recruiting the army of doctors and PhDs and experts, as much as it is about continually measuring expertise in a data driven fashion, and then intelligently combining the votes of highly skilled workers. 
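
The two principles above (measure performance, then combine the votes of people who are doing well) can be sketched as a simple accuracy-weighted vote. This is an illustrative sketch, not Centaur Labs' actual method, which the interview does not detail: the function names, the use of a gold-standard set to measure accuracy, and the 0.6 cutoff are all assumptions.

```python
from collections import defaultdict


def accuracy_on_gold(labels, gold):
    """Fraction of gold-standard items that an annotator labeled correctly.

    labels: {item_id: label} given by one annotator
    gold:   {item_id: label} with known correct answers
    """
    scored = [item for item in gold if item in labels]
    if not scored:
        return 0.0
    return sum(labels[item] == gold[item] for item in scored) / len(scored)


def weighted_vote(votes, accuracy, min_accuracy=0.6):
    """Combine votes on one item, weighting each by measured accuracy.

    votes:        {annotator: label} for a single item
    accuracy:     {annotator: measured accuracy between 0 and 1}
    min_accuracy: hypothetical cutoff below which a vote is discarded
    """
    totals = defaultdict(float)
    for annotator, label in votes.items():
        acc = accuracy.get(annotator, 0.0)
        if acc < min_accuracy:
            continue  # throw out votes from annotators doing a bad job
        totals[label] += acc
    if not totals:
        return None  # nobody reliable voted on this item
    return max(totals, key=totals.get)


if __name__ == "__main__":
    accuracy = {"a": 1.0, "b": 0.75, "c": 0.25, "d": 0.25, "e": 0.25}
    votes = {"a": "malignant", "b": "malignant",
             "c": "benign", "d": "benign", "e": "benign"}
    print(weighted_vote(votes, accuracy))
```

In the usage example, a plain majority vote would pick "benign" (three votes to two), while the weighted vote discards the three low-accuracy annotators and picks "malignant".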

In terms of how to do this at a huge scale, one thing that's fun about our company is that the way in which we measure performance and reward performance has actually solved the scale problem. So on our data annotation platform, our experts are literally competing with each other to be the most accurate. And they're only rewarded if they're high up enough on a leaderboard for any given task. And it turns out that experts like medical doctors are also highly competitive, in addition to being highly skilled. So we found that when it comes to skilled work, gamification in this way can drive incredible scale. Today, we collect millions of opinions weekly from experts. And because they're having fun and competing with each other, it's at a price point more similar to hiring unskilled workers at hourly rates.

Matt DeMello: [11:58] Absolutely. And while I think this perspective is falling by the wayside in the mainstream media - this idea that humans will be replaced by AI - I think among my grandma and the pop culture crowd there's a different perception. Something I really appreciated in your last answer is showing just how much these models are really dependent on humans - on human inputs and oversight - as what's going to differentiate them from their foundational counterparts. But given that context, how should humans - particularly experts - think of their role in AI?

Erik Duhaime: [12:34] Yeah, I used to get that question a lot, especially early on, and it seems to be coming back up again these days.

I think what people miss is that new technologies like AI don't necessarily replace work, they change how work is organized. 

So, for example, when ATM machines were deployed, there was fear that bank tellers were going to be out of a job. And what actually happened was it became more affordable for a bank to open up a new branch, and the number of bank tellers increased. Now, that doesn't mean that there weren't changes - what happened is the nature of the job changed. So today, the typical bank teller is more educated than 30 years ago, and they spend less time cashing checks and more time on relatively higher skilled tasks, like answering complicated questions or cross selling products. 

The same thing is happening today when it comes to AI in healthcare, and other industries where you’re traditionally working with experts. Not long ago, you had people like Geoff Hinton, one of the godfathers of modern AI, quoted as saying, “We should stop training radiologists now”. And I think people are starting to better appreciate that AI is not going to replace all doctors, just like ATMs didn’t replace bank tellers. AI and doctors will work together. There will be a lot of changes in terms of how doctors and other experts fit into the picture - the roles are going to change and evolve, and in many ways for the better. 

For example, a lot of companies today are working on automating the creation of doctor's clinical notes based on “listening” to the conversation between a patient and a doctor. And that is going to free up doctors to actually look you in the eye as they're seeing you, instead of at a computer screen. They’re going to get to spend more time seeing more patients and less time writing notes. And that's going to lead to overall improved care.

So no - AI is not going to replace doctors. But I do think certain tasks are going to be replaced by AI, and humans are going to have more time to focus on other tasks where they add the most value.

Matt DeMello: [14:32] So it looks like one of the future roles of experts in healthcare is going to reside a lot in creating high quality data to build out these foundational models and make really strong bespoke models. And then also helping with AI oversight and quality control. What developments are you excited to see in this space for AI and healthcare when it comes to data and data labeling?

Erik Duhaime: [14:51] Well, I think that the data labeling space - and the need for experts as a result - is going to keep changing in a couple of exciting ways. First off, I think demand for data and data labeling is going to keep accelerating - the importance of the role of data labeling is only going to increase. 

Now, this is for a couple of reasons. One, as technologies improve and become lower cost and more accessible, there's more and more data. And that provides more and more opportunities for AI development. 

So for example, we work with a lot of companies in the point of care ultrasound space. These are handheld machines that are cheap and portable. Ultrasound has been around for 67 years. But now you've got these little devices that cost $1,000 that you can take anywhere in the world and basically look anywhere inside the human body. This means there's a massive amount of things that you can do with the data for model development - to analyze the data that's coming from these machines. And this is a new thing that really wasn't around a couple of decades ago.

The evolution of hardware and data capture is leading to exciting opportunities for model development. And that then requires a bunch of data labeling - you've got new devices entirely, that generate entirely new sources of data, creating more opportunities for model development. 

We work with one company, Eko Health, which has built a digital stethoscope. It's digital - so now instead of a doctor needing to be right there listening to your heart or your lung sounds, you can have a nurse doing a home call, take a recording and share with the doctor later, etc. But because it's digital, you can also build an AI to analyze this data. So Eko now has an FDA approved heart murmur detection product, but building that product required annotating tens of thousands of these recordings. And again, it's a new type of data. 

So there's exciting stuff in the data and data labeling space, because there are new types of data and new types of problems to be solved. Think also about all the different wearables and remote monitoring devices. It's a lot of exciting new data - data formats that can be used to build new AI models.

Now, the more that AI proliferates, the more I see demand for labeling increasing, and the more that labeling is going to be about expert labeling - the relative need for skilled labeling is going to increase. Because as these models get extremely good at relatively simple tasks, AI developers are going to increasingly focus on training models on more difficult things.

As we move from models that help you write an email or drive a car - these are things most people can do pretty well - to models that help lawyers draft documents, or doctors diagnose diseases, experts are going to become relatively more important in creating those models.

So as I mentioned, most people today, they think of the grunt work of AI as requiring low wage unskilled workers to do tedious data annotation work.

The AI models that are going to revolutionize industries like healthcare, like law, those require engaging experts at scale at every step in the development process.

Matt DeMello: [17:47] Absolutely. Especially given the influx of new technologies, as you were mentioning, that we know in just a few years are probably going to rise to the standard of commonplace: biometrics, the Internet of Things, the wearable technology you mentioned. Yeah, it's going to be a very new world, very, very quickly. Erik, thank you so much for joining us today on the program to explain it.

Erik Duhaime: [18:15] Thank you for having me.
