➡️ Register here to view the webinar discussion, created in partnership with AIMed! ➡️ https://aimed.swoogo.com/CentaurLabs
📣 Listen in to our discussion with Dr. Matt Lungren (CMIO @ Nuance/Microsoft), Tom Calef (CTO of Activ Surgical), and Dr. Veronica Rotemberg (MSKCC) to learn about the evolving role of expert feedback in healthcare AI development.
📚 Hear who these leaders are following and what they are reading to stay on top of AI in healthcare!
📖 Read the transcript below!
If there ever was a “Year of AI”, it would be 2023. With the launch of new general-purpose services like ChatGPT, as well as specialized services like Med-PaLM for healthcare, AI was center stage, shaking the foundations of every industry. With so much focus on AI alone, the critical human-AI partnership has taken a backseat: the ‘better together’ idea of humans collaborating with AI, both to build new tools and to get higher-quality work done more efficiently.
In this webinar, experienced healthcare AI leaders working across areas like dermatology, surgery, and radiology discuss the nuances of the human/AI partnership. They’ll discuss the critical role that clinicians play in helping to build AI, provide feedback to model developers, and eventually collaborate with new AI-powered tools.
[00:00:20.920] - Freddy White
So good afternoon, everybody. Nice to see you all. It's Freddy here, CEO of AIMed. We've got a fantastic panel lined up for you this afternoon, and my good friend David Gruen is going to take the reins and be our moderator for the next hour. So I'm going to hand it over to you, David. Thanks so much, Freddy. Thanks. And thanks, everybody who's out there joining us, and thanks to this incredible panel. Good morning, good afternoon, good evening. As we were just saying before we went live, this is really a unique opportunity for us. We've each been on multiple panels, but as a practicing radiologist, most of the panels I've been on have been radiology-centric. Maybe that's because many of the algorithms that are available are radiology-focused, but that's not what this panel is about. This panel is much broader. It's about bringing AI to all of healthcare and how AI can impact the entire human domain and really improve outcomes. So we're joined, as you'll hear in a moment, by surgery, by dermatology, and by radiology, and by the entrepreneurial world and the business world and the tech world. Hopefully we'll bring you a broad perspective on not just where we are, but where we collectively see AI going to improve human outcomes over the next couple of years.
[00:01:39.230] - Dr. David Gruen
So without further ado, let's get into it. Let's start with some brief intros. I'd like each of you to tell us a little about yourselves, where you come from, what your focus is. Clearly you're not all just sitting in a dark room as I am, reading mammograms or operating on people. What your connection to AI is, what your company's building, what your lab's about, and how you see the industry changing over the next couple of years. So, Veronica, why don't you lead us off?
[00:02:04.640] - Dr. Veronica Rotemberg
I'm Veronica Rotemberg. I'm a dermatologist practicing at Memorial Sloan Kettering Cancer Center in New York. I also run an AI research group where we focus on building AI models for dermatology applications, as well as more general projects around model validation and assessment.
[00:02:27.800] - Dr. David Gruen
Great, thanks. A little later, I do want to hear about the International Skin Imaging Collaboration. You know, anything that has imaging in its name is of interest to me. Tom, please. Yeah, sure.
[00:02:38.610] - Tom Calef
Thanks, David. And thanks, AIMed, for the invite to the panel. So, Tom Calef. I'm the chief technology officer here at Activ Surgical, one of the founders of the company about six and a half years ago. Our focus right now: we've released an FDA-cleared device for real-time intraoperative visualization of blood flow and tissue perfusion, and we're now using this new data set that we've been collecting for the past few years to drive new AI models for critical structure detection and complication detection, intraoperatively, during the procedure itself, as opposed to waiting till after the procedure is done. So really excited to talk about it. AI is the backbone of where we are here at Activ, and it's just such a good time to talk about it.
[00:03:26.300] - Dr. David Gruen
Cool. I can't wait to hear about improved outcomes, shortened OR times, all the things this can potentially help. Matt. Yeah.
[00:03:33.660] - Matt Lungren
Hey, thanks everybody. Great to be here. I'm Matt Lungren. I'm the chief data science officer at Microsoft Health & Life Sciences. I'm also a part-time interventional radiologist at UCSF. My background: I spent about ten years at Stanford with the clinical practice, but also helping construct a really large health AI center which spanned computer science and the medical school, and I had lots of great experience building that over the boom in AI in healthcare. But really excited to talk about what we're all focusing on, which is leveraging some of the newest technologies as we transition, as you can tell, from narrow to more general capabilities in healthcare applications.
[00:04:17.100] - Dr. David Gruen
Very exciting. For those of you who don't know, Nuance PowerScribe, which is the major voice-recognition system that most of us use, is now part of Microsoft. And clearly this integration of big data to improve workflows and outcomes is high on your radar. So a lot to hear about as well. How many hours do we all have to do this, Freddy? I hope we have all afternoon. Erik.
[00:04:40.800] - Erik Duhaime
Hi everyone. Thanks for being with us today. Erik, I'm one of the founders and the CEO of Centaur Labs. Centaur Labs is a company I founded about five years ago based on my PhD research at this place called the Center for Collective Intelligence at MIT. I was interested in how you combine people's opinions and get to the truth, either multiple people's opinions or people with AI. And the way that we've applied that research is by focusing on data annotation, specifically in healthcare, for companies that are developing AI and ML in the medical and life sciences. Now, I know that all of their intros were extremely impressive, but it's really nice for me to have this group of people here. I just want to quickly point out that Matt has been a longtime advisor since back when he was at Stanford and when I was doing my PhD research. Some of the data sets that they released at the AIMI Center were critical for me getting off the ground. And then same for ISIC, the International Skin Imaging Collaboration, out of Memorial Sloan Kettering. I mean, one of my dissertation essays literally leveraged that data set. So there's this great importance of getting good, high-quality data sets out there.
[00:05:48.540] - Erik Duhaime
And then finally, Tom. As Tom knows, my wife is a general surgeon, so I know very personally the importance of what he's working on and hope that he can make her job easier in the future and help patients, too. So I just wanted to give a shout out that it's especially nice for me to have all these people here together on the same call.
[00:06:10.320] - Dr. David Gruen
Erik, thanks. This is really exciting, as I said. And so, for those of you who are in the audience, please jot down your questions. We'll do our best to get to some of them, but if not, Freddy can pass them along and we'll follow up afterwards as best we can. Now that we know a little bit about each other and what you're doing, let's talk about human expertise. To start: the past year has been really interesting since the release of ChatGPT. I think everyone's been equal parts enthusiastic but also a little bit nervous. Right? What are the quality points? How do we measure it? How do we integrate it? Is there enough high-quality validation for healthcare? Can we jump right in? How are the models trained? And so on. So, does anyone want to take a stab at this? I think this has been a topic at every meeting I've been to, much more so than all of the one-off algorithms. How do we know it works, and where do we bring it?
[00:07:05.680] - Tom Calef
I can. I can take a quick crack at it. We're using generative AI every day now, not only in some of our business systems, but in a lot of the development on some of our AI products. And it's funny, yesterday I heard that the word of the year this year may be "hallucination," right? It's the kind of wrong output you get from a large language model or a generative output. And I think, one, it's really telling. It's telling that these generative models are getting out into common use, which is fantastic; the usability is way up. But that being said, I think that in and of itself shows a need for fine-tuning with experts, especially within medicine, where it takes a long time to build the kind of subject-matter expertise that you and the others on this panel have. And so, as we're developing these models, I think it is of utmost importance to still drive validation just like we would with any other medical device, and make sure that we have experts that provide not only ground truth but validation, and really just help teach these models a little bit better and make them more generalizable.
[00:08:26.140] - Dr. David Gruen
So, for those who may not be familiar, how are surgeons, how is ActivSight, your main product, using generative AI? If you can give us a one-minute background. Yeah, sure.
[00:08:35.920] - Tom Calef
So we're teaching some of the latest models to speak more intelligently about surgery. We're able to give these models some really neat and interesting data sets around how the body is working, where blood is flowing, and then tell the large models where specifically this is in an image. And so, through tens of thousands of data sets, we're starting to fine-tune and train just the conversation that we can have with things like LLaVA-Med and things like GPT. And I think there's a lot to come. A lot of our products will have this at their core, with more specific AI models on top of it for the output. So, really excited to see it.
[00:09:29.360] - Dr. David Gruen
Veronica, what are you thinking? And then I'm going to circle back to Matt and Erik.
[00:09:33.120] - Dr. Veronica Rotemberg
I can't believe you're making me go before Matt.
[00:09:42.020] - Dr. Veronica Rotemberg
I think, as you point out, it's a hugely exciting time. I think one of our challenges, when we apply the large models to medicine, is that all of our benchmarking is still very specific, even if the models themselves are general. So if I'm going to tell you that a model works for dermatology prior auths, then I have to actually test specifically against dermatology prior-auth benchmarks, and I have to calculate how many hallucinations I see and look into the specifics. One of the things that I hope we see over the next period of time, as we're looking at these models, is better science around how we benchmark them for healthcare: better specific outcomes that are relevant to patients. Not just "do they like this," but did it actually relay the information they needed to understand, such that they can repeat it, or actually do the thing that we need them to do? These are just some examples. I think it just highlights how specific everything that we're doing to test them is, and where we need to go. I think we also need to make sure that we're not causing harm to underrepresented populations that are not represented in the data sets used for training.
[00:11:04.670] - Dr. Veronica Rotemberg
I think that's a huge area of active science, even over the past year. And so again, focusing on the science of benchmarking for healthcare applications is going to be really important.
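The kind of specialty-specific benchmarking Veronica calls for can be sketched, at its simplest, as a harness that scores a model against a fixed set of domain questions and counts answers flagged as hallucinations. Everything here is illustrative (the benchmark items, the abstention convention, the toy model), not an actual dermatology benchmark:

```python
def evaluate(answer_fn, benchmark):
    """Score a model callable against a domain-specific benchmark:
    exact-match accuracy, plus the rate of answers a per-item checker
    flags as hallucinated (confidently wrong rather than abstaining)."""
    correct = hallucinated = 0
    for item in benchmark:
        answer = answer_fn(item["question"])
        if answer == item["expected"]:
            correct += 1
        elif item["is_hallucination"](answer):
            hallucinated += 1
    n = len(benchmark)
    return {"accuracy": correct / n, "hallucination_rate": hallucinated / n}

# Toy benchmark: an answer counts as a hallucination if it is wrong
# and is not an explicit abstention.
benchmark = [
    {"question": "q1", "expected": "a1",
     "is_hallucination": lambda a: a not in ("a1", "I don't know")},
    {"question": "q2", "expected": "a2",
     "is_hallucination": lambda a: a not in ("a2", "I don't know")},
]
model = lambda q: {"q1": "a1", "q2": "made-up answer"}.get(q, "I don't know")
print(evaluate(model, benchmark))  # {'accuracy': 0.5, 'hallucination_rate': 0.5}
```

The point of separating "wrong" from "hallucinated" is exactly the one raised on the panel: a model that abstains is a different clinical risk than one that confidently fabricates.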
[00:11:21.280] - Dr. David Gruen
Yeah, I think that you hit on a key point, which is that AI has the opportunity to be a great democratizer, but we need to make sure that we're inclusive and that data doesn't make it less inclusive. Matt, you're chomping at the bit, I can see it.
[00:11:38.840] - Matt Lungren
I appreciate it. I think all the comments are terrific. I'll maybe take a step back just to say this technology is moving at a pace that, at least in my lifetime, I've never seen anything accelerate the way it has. And I think Peter Lee had a great quote about our experiences with GPT-4 across domains, not obviously just in healthcare: it's at the same time the smartest and the dumbest technology we've ever worked with.
[00:12:08.320] - Matt Lungren
And I think part of that stems from the fact that it's still so early for us in understanding truly what the capabilities are and where the limits are in certain areas. You may have seen some of the most recent work: with GPT-4 alone, just with additional prompting, no fine-tuning, no specific effort on feedback and reinforcement, it's outperforming massive models that were specifically meant for healthcare applications or have been fine-tuned in other ways. So I think we just don't have a full picture yet of the extent of the capabilities of where the technology is now, let alone where it continues to accelerate. But on the topic of benchmarks: this whole idea of testing these models on multiple-choice tests, I think it provides some insight. In other words, it tells us, as we probe the model and then ask it to follow up, why did you choose this? Or can you change the question so that a different answer is correct? That interaction helps us understand: does the model have internal representations of real medical concepts that are useful? Right. So that, I think, is valuable. But the benchmarks themselves don't tell the whole story.
[00:13:24.030] - Matt Lungren
And we do need to have dedicated efforts, and this is going to be on the part of the various medical specialties and others in the space, to really focus on what those benchmarks are and how we ensure, just as we did in the quote-unquote narrow AI days, that they're fair and transparent, et cetera. But the other thing I'd say is that what makes this so incredibly powerful is this idea of empathy, which is really just context awareness. The ability to have a model that understands the medical concepts and then has a sense of the context is so powerful for our applications. Again, you have people on this panel who represent completely different areas in healthcare, and yet we're all talking about, at some level, leveraging that same model and taking advantage of the extent of its capabilities. But this context awareness didn't just come out of thin air. I think if you look back at the technical report for GPT-4, what you recognize is that reinforcement learning with human feedback really did get that model to a place where it could understand the context and the purpose of questions, right?
[00:14:42.160] - Matt Lungren
That turned that kind of lump of clay, so to speak, of embeddings into something that we can start to use and interact with in a way that matches our expectations. And so I think, in a similar way, there's a lesson there, and this ties into some of what we're talking about: how do we provide expert feedback, right? Not reinforcement learning with broad, general feedback, but actual, truly domain-expert feedback to some of these models, to either steer them or further tune them to be more powerful or more useful in our domains, while retaining some of that understanding of world knowledge, general knowledge, context awareness, and the medical concepts. I think that's really the direction I see a lot of this going. And as we continue to see the technology accelerate, we're going to see a lot more efforts in that space.
[00:15:31.760] - Dr. David Gruen
So we're going to come right back to that really important point about how we include and leverage domain experts. But Erik, I want to give you a chance to talk for a minute. Coming back to the issue of data curation, democratization, and making sure we have really good representation, I think that's a good intro to what you're doing at Centaur as well.
[00:15:51.080] - Erik Duhaime
Yeah, and I definitely want to piggyback off a little bit of what all three of them have said. You know, OpenAI trained ChatGPT using low-wage, low-skill workers all over the world, at least according to the Wall Street Journal. You can't use the same low-skill, low-wage workers, who don't have domain expertise in healthcare, to provide the expert feedback that Matt's talking about to get large language models to where they need to be in healthcare. It's one thing if you use a large language model to draft an email or draft marketing copy; if it makes a mistake or hallucinates, that's no big deal. But in healthcare, it's a very big deal if it's hallucinating or saying something wrong. So it's been an interesting journey for us in the last year as this stuff has taken off. The way we operate is we have tens of thousands of doctors and medical students and nurses, et cetera, around the world, who are most of the time literally competing with each other to tag data most accurately. And that's how we've sort of democratized the annotation step while focusing on providing that expert feedback.
[00:16:59.860] - Erik Duhaime
But it's been interesting. In the last few years, we've seen relatively fewer tasks where we need our people to draw around the pneumothorax on chest X-rays over and over again, and a lot more tasks where it's things like: hey, is my chatbot being both medically accurate and empathetic right now, and taking into consideration the fact that the person it's chatting with might have a disability, so it can't recommend that the person go out for a run to reduce their blood pressure, or whatever it is. So it's been a really interesting year from our perspective. And I think what we're seeing is this movement towards an appreciation that you do need to continue to measure and monitor these models and give them feedback, to make sure that they actually work well enough in healthcare that you're not going to do anyone any harm.
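The weighted-voting idea behind competitive annotation, combining many opinions while trusting more accurate annotators more, can be sketched like this. The skill scores, names, and default weight are made up for illustration; this is not Centaur Labs' actual algorithm:

```python
from collections import defaultdict

def aggregate_labels(annotations, annotator_skill):
    """Combine several annotators' labels for one case into a consensus
    label, weighting each vote by that annotator's measured skill
    (e.g., accuracy on gold-standard cases with known answers)."""
    votes = defaultdict(float)
    for annotator, label in annotations:
        # Unknown annotators get a neutral default weight.
        votes[label] += annotator_skill.get(annotator, 0.5)
    return max(votes, key=votes.get)

# Hypothetical skill scores and labels for a single chest X-ray case.
skill = {"dr_a": 0.95, "student_b": 0.70, "nurse_c": 0.85}
labels = [("dr_a", "pneumothorax"), ("student_b", "normal"), ("nurse_c", "pneumothorax")]
print(aggregate_labels(labels, skill))  # -> pneumothorax
```

The design choice worth noting is that skill is measured on cases with known answers, which is exactly what lets a platform rank annotators who are "competing to tag data most accurately."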
[00:18:02.240] - Matt Lungren
I'll add one quick thing to that. I think there's also this concept where a lot of people, when they look at the technology and some of the medical capabilities, immediately jump to: okay, this could be used for clinical decision support, for clinician-facing things. And I'm always surprised, because if you take a hard look at, at least, the US health system, there are tasks that could be operationalized in ways that are far removed from clinical diagnosis and far less likely to cause harm if they didn't work correctly. There's something like 600 billion dollars that we spend on administration, right, in this country, in our health system. And I think I heard a statistic of ten-to-one administrators to clinicians in a lot of health systems in this country. When you look at some of the general-purpose applications of this technology, plus you add in the idea that this technology does understand healthcare to some extent, there's still a lot of opportunity on the table without necessarily needing to push directly into clinical care.
[00:19:04.600] - Matt Lungren
So I do want to point that out. I find that to be overlooked in a lot of these kinds of conversations.
[00:19:10.370] - Erik Duhaime
Yeah, I also want to add to that. I think most of the early applications are going to be more operational, back-office logistics, or at least decision support, not making decisions. It's almost incredible to me, five years into this, or ten years, let's say, into the AI boom, how many people still ask me: oh, so is AI going to replace doctors? And to Matt's point, I mean, yeah, being able to do well on the MCAT or whatever is interesting, it's nice. But acing the MCAT doesn't make you a good doctor either; there are a lot of different components to it. So for the clinicians on the call right now, I assume if people are involved with the AIMed community, they're already enough in the know that they're not terrified of AI. But I'm not at all worried about, for example, Tom figuring out how to automate my wife's job. He's going to make her job easier and better. So that's definitely the direction we're headed in. And again, to just emphasize Matt's point, this isn't "hey, let's replace doctors." There are a lot of places where these tools can add incredible efficiency,
[00:20:18.700] - Erik Duhaime
you know, even if it's not an official diagnosis based on an image or whatever it is.
[00:20:25.280] - Dr. Veronica Rotemberg
I do feel like that's one of the challenges, though. It's actually pretty easy to grade whether you got the melanoma right or not, and it's actually really hard to grade whether you appropriately fixed a workflow problem that a human was also not doing perfectly. And so I agree that's where we should really be thinking about what we do next. But it's not as easy as it sounds.
[00:20:56.120] - Dr. David Gruen
I'm sorry. Go ahead, Tom. Go ahead.
[00:20:57.720] - Tom Calef
Yeah, I was just going to say, there's a lot of meat on the bone even around the clinicians. Right? I mean, there's a lot of administrative work that the clinicians have to do every day, and if we can take their time and optimize it into really helping patients hands-on. We work with surgery a lot, so I always think of the operating room: in surgery, how do we make sure that a surgeon is at the OR at the right time, and then that they can move on to the next patient as quickly as possible, minimizing any of that downtime in between? There's so much meat on the bone there, and I totally agree. I think the administrative aspects of healthcare will be addressed first, and then we can all work together on pushing that ball down the court.
[00:21:47.580] - Dr. David Gruen
It's about quality and trust in all of AI, no matter what we're talking about. And I'll push back a little. Five years ago it was Geoffrey Hinton, I think, who said that we should stop training radiologists. Right? Fortunately, I think he got that wrong. And this is controversial, but I think there are things AI should do that replace clinicians, and we can look across an awful lot of medicine. There's good data showing that some of the breast AI products can replace 30% of the screening mammogram reads that bog me down all day long, and probably screening chest X-rays too. In Veronica's field, I'd rather have her looking at my skin. But if I'm in a part of the country where there's not a dermatologist, and there's a GP who hasn't seen a melanoma or a basal cell in half a decade, I'd probably rather have AI looking at my skin than that. So, you know, how do we get to the point where people are comfortable with and trust AI? Both the users, who are going to be us, and the patients, who are going to be the beneficiaries, and the administrators. Right? Maybe that's controversial, but I think if it can do it way better than we can, and faster, and cheaper, then it's addressed the quadruple aim of medicine.
[00:22:59.610] - Dr. David Gruen
Right. And doing a better job. So maybe, Erik, we're getting closer than we think.
[00:23:04.980] - Erik Duhaime
Yeah, I'll agree, but add a nuance there. If AI is drastically better than people at certain tasks, yeah, I want the AI to read my chest X-ray and flag certain things. But I think there's this fear about taking humans out of the loop entirely. Now, if the AI is better in all conditions, great, take the humans out of the loop. But the problem is, these algorithms often don't know when something's out of sample. An example that I really like, actually: we were doing some research with Veronica's team at Memorial Sloan Kettering, where we were showing that our network of annotators could classify skin lesions more accurately than a previous benchmark paper they'd come out with, where they had 27 board-certified dermatologists look at these images. But we wrote them an email and said, hey, we found something weird in the data set; what the heck is this? And I'm pretty sure when I first wrote Veronica, she didn't know, and she needed to go digging around. What they had done is release these data sets of all these skin lesions where everything's biopsy-proven. It's really nice, highly validated data.
[00:24:10.280] - Erik Duhaime
And someone on the Memorial Sloan Kettering team had slipped in an image of HAL, from 2001: A Space Odyssey. So it's this red orb thing; it's not a skin lesion at all. And the Memorial Sloan Kettering, the ISIC folks, they released these data sets for AI challenges, and I think 100 different teams or so of AI developers had trained and submitted models on these images. And correct me if I'm wrong, but I don't think anybody came and said, hey, what the heck is this image doing here? Because they fed it to the model and it would just say, hey, melanoma, 0.7. Veronica, it looks like, is maybe going to correct me.
[00:24:49.410] - Dr. Veronica Rotemberg
Well, no, I'm not going to correct you. I think that's right. I think we've moved to more of the supervised applications here, which I think is fair, because those are the ones that are most commonly approved for this direct patient contact. But yeah, you're right. One of our collaborators, Philipp Tschandl, put HAL in our data set. Almost all the AI models thought it was a vascular lesion, and none of the humans, expert or not, got it wrong. I think one of the challenges that we have is defining relevant outcomes. If you think HAL is a vascular lesion, that's actually not a relevant outcome; you've said it's benign, and it was benign. Maybe it wasn't a skin lesion, but it was benign. And so I think for dermatology especially, we're going to need to do more randomized controlled trials, specifically against the standard of care. And we're going to need to develop infrastructure that makes those not as expensive as a current randomized controlled trial, more efficient, and able to be deployed in a family practice, wherever that is, as well as in an academic center. And so that's one of the things that we need.
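The HAL failure mode described here is a general property of closed-set classifiers: a softmax head must distribute probability over the labels it knows, so even an image that is not a skin lesion at all gets a confident lesion diagnosis. A toy sketch (the class list is reduced and the logits are invented, not from any actual ISIC model):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A closed-set lesion classifier has no "not a skin lesion" option, so
# even an image of HAL's red camera lens is forced onto known classes.
classes = ["melanoma", "nevus", "vascular lesion"]
logits = [0.2, -0.5, 1.4]   # invented logits for the out-of-sample image
probs = softmax(logits)
pred = classes[probs.index(max(probs))]
print(pred, round(max(probs), 2))  # a confident "vascular lesion"
```

Because the probabilities always sum to one over the known classes, high confidence alone cannot signal "this input is out of distribution"; that requires an explicit abstention class or a separate out-of-distribution detector.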
[00:26:08.590] - Dr. Veronica Rotemberg
I know this panel is talking about models, and I'm really excited to talk more about them, but we also need better infrastructure in healthcare overall to deploy models efficiently, evaluate models efficiently, and define the outcomes that are going to actually change patients' lives. And so I hope that's a call to action for all attendees as well.
[00:26:35.260] - Dr. David Gruen
And that gets us to deployment and trust. How do you think the FDA or organized medicine can better partner with organizations? And here we are: academic medical centers, large community practices, industry, academia, entrepreneurial businesses. How can we collectively partner in an ecosystem to move this forward without sacrificing, and ideally enhancing, patient safety and outcomes?
[00:27:04.220] - Tom Calef
So we work with the FDA quite a bit on the device side and on the clearance and approval side. And one of the things that has surprised me in the last eight to twelve months is just the speed with which the application space is changing, right, with the models underneath it and the capabilities. The brain trust that has got behind it, being creative with it, is expanding every day, which means the underlying models are also changing every day. And quite honestly, it's going to be very, very difficult for the FDA to keep up with the latest and greatest of what people are working on. Quite honestly, it's hard for a company that's focused on it to keep up, and we work with some really large partners to make sure that we can. Our device is a platform for releasing models into the operating room. So we take third-party models, we verify them, validate them, and release them into the operating room. And in bringing the FDA along that journey, they have come a long way in understanding AI in a very short time.
[00:28:18.980] - Tom Calef
I would say in the last two years, they've become very intelligent around data aggregation, data bias, all of the things to make sure that the data sets encompass the indicated use and also the population. But I think we have to bring the FDA along with the story and also the development pathway, so that, as scientists and technologists, they can help us in true validation. Now, that's going to require incorporating many medical professionals and data scientists to make sure that it's all done the right way. But this is absolutely a group effort at this point; it's just moving so fast.
[00:29:06.560] - Dr. David Gruen
I'm sorry, one of the attendees asked a question that, I'll throw this out because it's related: what is an acceptable hallucination rate for tools like this, particularly when used for patient care, and what criteria are used to declare a tool ready and safe for clinical purposes? And the third part is, what forms of continuous monitoring and validation are, or would need to be, in place? And so, Matt, why don't I start with you; if you can define, for those who may not have heard the term, what a hallucination rate is. And then I'd like to go back to Veronica, because one of the questions I have is, does one size fit all? If a model is looking for plaque psoriasis, as opposed to malignant melanoma, do we need the same criteria? So, Matt, why don't you lead us off here, and everybody can feel free to jump in. Sure.
[00:29:55.350] - Matt Lungren
I think, you know, just to level-set for the audience: we've used this term hallucination, and folks have thought about some different ways to name it, but there are two ways to look at it. In a lot of applications you look at it as something that's actually a feature, not a bug, right? In other words, if you're trying to do exploratory writing, say, trying to figure out a new grant idea and explore that feature space and think about different ideas creatively, sometimes that sort of stochastic interaction is actually beneficial, especially as a human interacts with the model. And obviously, when you think about very specific tasks, for example, taking a summary of a patient's chart and writing a discharge summary, a sort of preview for the clinician based on the record there, you really don't want to have any hallucinations at all. Frankly, though, I don't think it's quite the issue that people perceive it to be. In other words, when you see these applications, particularly for human-in-the-loop healthcare use cases, it's not just the model API and that's it, right?
[00:31:17.150] - Matt Lungren
So there's actually quite a bit of work. At least where I sit, there is an entire part of the organization focused on responsible AI. But clearly these challenges, around how you steer and guide and structure the output and allow it to provide references, this is not just a healthcare challenge, right? This is a challenge across industries. And so you're able to arbitrage against the different tools. If you actually looked at the stack of what you would call a quote-unquote copilot, again, it's not just the model that you might use in the consumer version of the application. There are actually several layers of different techniques that are used to purpose-build it and constrain it so that the outputs are either verifiable or based in truth. We probably don't have time to get into all of them, but a lot of the techniques are incredibly useful. And so I do want to make that distinction: anyone can pull up GPT-4 on OpenAI and show you a cool demo. Look what I did.
[00:32:21.940] - Matt Lungren
I summarized this chart, I did XYZ, and that's fine. But when you want to scale an application and you want to actually put it into the clinical space, there is quite a bit of development work that needs to be done. And that's kind of where we come in to provide those tools to the ecosystem to make that easier.
[00:32:43.080] - Dr. Veronica Rotemberg
I guess to build off of that, I would say that I agree with Matt: the output of the model isn't the end of the story, it's the beginning of the story. We know, even in dermatology, that the way you present information to the clinician will dramatically change their behavior, even if it comes from the same underlying information. So if you give, for example, a binary output ("this is melanoma" versus "it isn't melanoma"), versus a histogram of multiple diagnoses, versus anything else down the pike, the decision of the clinician based on the exact same image and the exact same model output will be different. And so we really need a lot more information, almost more psychology research, around how these decisions and clinician behaviors are going to impact health. How are we really taking this from the beginning, either from a GPT output or a classification tool, to what the true interface will look like when it's deployed, to what the behavior change might be, to what the impact on the patient might be? And that needs to be able to happen at scale so that we can actually know what this is going to do.
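Veronica's point, that identical model scores can be surfaced to a clinician in very different ways, can be illustrated with a small sketch. The labels, probabilities, and function names below are purely hypothetical, not from any real system:

```python
# Sketch: the same model output presented two ways, to show how the
# presentation layer differs even when the underlying scores are identical.
# All labels and probabilities here are made up for illustration.

def as_binary(probs, target="melanoma", threshold=0.5):
    """Collapse a probability distribution into a single yes/no call."""
    positive = probs.get(target, 0.0) >= threshold
    return f"{target}: {'POSITIVE' if positive else 'negative'}"

def as_histogram(probs, top_k=3):
    """Show the top-k differential as a text histogram instead."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return "\n".join(f"{name:<22s} {'#' * round(p * 20)} {p:.0%}"
                     for name, p in ranked)

probs = {"melanoma": 0.55, "seborrheic keratosis": 0.30, "benign nevus": 0.15}
print(as_binary(probs))     # one hard call
print(as_histogram(probs))  # the full differential, same numbers
```

The same 55% score reads as a flat "POSITIVE" in one view and as a close differential in the other, which is exactly the kind of interface difference she argues needs study.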
[00:34:14.980] - Dr. Veronica Rotemberg
And that's where I don't think, at least in dermatology, we're not there yet. And so we don't even know how to answer those questions yet, and that's something that we really need to work on.
[00:34:26.280] - Freddy White
Tom, are we there?
[00:34:31.400] - Tom Calef
I think establishing rates is interesting. I think hallucination rates may be a little outside of the kind of variance of model output we look at. We actually did a study with some of the work from Erik's group, a concordance study of surgeons: we put out data and asked them to label a structure and label what's happening. We put it out to 50 different surgeons, a very small sample set, I understand that, and our concordance rate was actually pretty low, around 72%. So among humans themselves, as we teach these models, we have to make sure that the labeling approaches standards; that helps us teach these models fundamentally. And then when we look at what's acceptable, we also have to understand what's acceptable in the state of the art today, what's practiced today. I think it's a good way to at least balance the output and understand where we are with technology and new products versus where we are today, when you take a cohort of surgeons at a surgical conference. So there are a couple of different ways to go about it, but I think establishing that rate up front is very important.
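The concordance rate Tom mentions can be computed as simple pairwise percent agreement across raters. The sketch below uses made-up labels on toy cases; a real study would use many cases and likely a chance-corrected statistic such as Fleiss' kappa:

```python
from itertools import combinations

def pairwise_agreement(labels_by_case):
    """Mean fraction of rater pairs that agree, averaged over cases.

    labels_by_case: list of cases, each a list of one label per rater.
    """
    per_case = []
    for labels in labels_by_case:
        pairs = list(combinations(labels, 2))
        per_case.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_case) / len(per_case)

# Toy example: 4 surgeons labeling 3 video frames (labels are invented)
cases = [
    ["artery", "artery", "artery", "vein"],    # 3 of 6 pairs agree
    ["artery", "artery", "artery", "artery"],  # 6 of 6 pairs agree
    ["duct", "duct", "artery", "duct"],        # 3 of 6 pairs agree
]
print(f"concordance: {pairwise_agreement(cases):.0%}")  # 67%
```

Raw percent agreement like this overstates reliability when some labels are common, which is why chance-corrected statistics are usually reported alongside it.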
[00:36:00.100] - Dr. David Gruen
[00:36:04.200] - Erik Duhaime
As you said, whether it's a melanoma or plaque psoriasis, what rate of hallucination is okay depends on the use case. But something Tom said earlier: the FDA is doing a commendable job at learning extremely quickly, given how fast the pace of innovation is in this space. Now, what worries me, and where I think we need to go, is thinking about these more like approvals for drugs, where you don't just have a clinical trial and something is approved; you need continual monitoring. You need to do a phase four clinical trial and evaluate how these things are having an impact in the real world, and you need to track adverse events. The same logic applies across use cases: maybe for a certain cancer drug we're okay with adverse events to a higher degree because it's a life-saving treatment, compared to a drug for mild migraines or something like that. The same is true for AI use cases. But we do need to get to a world where we have structures in place for continual model monitoring. Now, today we've got this large network of people annotating data, and most of our clients are using us for annotating initial training data sets.
[00:37:21.670] - Erik Duhaime
We're only now starting to see more clients who, every time they deploy at a new hospital, need to run another validation study to make sure that everything is working: there are no biases they didn't expect, the devices are a little different, the patient populations are a little different, and so on. And where I think we're headed, and what we've been building our product and technology for, is a world where you need more continual model monitoring. Even once you deploy at that hospital, you can't just set it and forget it for the next ten years. So what I'm imagining is something more like experts continually providing benchmarks and essentially tracking adverse events when these algorithms are deployed. So, to summarize, I'd say there's no fixed rate of what level of hallucination is appropriate; it depends on weighing the costs and benefits of an adverse outcome. But we do need mechanisms in place to make sure that we're actually measuring adverse events.
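A minimal version of the continual monitoring Erik describes is a rolling adverse-event rate with an alert threshold. The class name, window size, and threshold below are illustrative assumptions, not a recommendation for any real deployment:

```python
from collections import deque

class AdverseEventMonitor:
    """Track the adverse-event rate over the last `window` reviewed
    predictions and flag when it exceeds a pre-agreed threshold."""

    def __init__(self, window=500, threshold=0.02):
        self.events = deque(maxlen=window)  # bounded rolling window
        self.threshold = threshold

    def record(self, adverse: bool) -> bool:
        """Record one reviewed prediction; return True if an alert fires."""
        self.events.append(adverse)
        return self.rate() > self.threshold

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

monitor = AdverseEventMonitor(window=100, threshold=0.05)
# Simulate expert review flagging 10% of cases as adverse
alerts = [monitor.record(i % 10 == 0) for i in range(100)]
print(f"rolling rate: {monitor.rate():.0%}, alert: {alerts[-1]}")
```

In practice the "adverse" signal would come from the expert reviewers Erik mentions, and the threshold would be set per use case by weighing the costs of an adverse outcome, as he says.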
[00:38:23.780] - Dr. David Gruen
I'm sorry, Tom. To me, we've been talking for years about the slow pace of AI adoption, and I completely agree that once we have real-world deployment, assuming we get through the FDA and into commercial use, we need "experts," and I'll use that in quotes, to make sure that it's working. How do we do that? We don't have enough people, we don't have enough bodies, we don't have enough technology, we don't have enough resources. Is that slowing down adoption of this incredible technology?
[00:38:53.900] - Matt Lungren
Well, I can make a comment about that. You may have seen that the ONC just came out with another really great guidance, I think it was yesterday, that really is doubling down on the idea of post-market surveillance, monitoring over time. And yes, to Erik's point, I think folks are recognizing that the distribution shifts, even the model challenges in different populations, mean that treating these as static software is not really the optimal use of the technology. Ultimately, these models can learn from new data, but after the approval process, at least currently, they're frozen, right? They're treated as software from there on. So then it's incumbent on us to understand what data the model expects, and then what data the model is actually seeing in real time, and to have a finger on the pulse of that. Part of that is maybe, as you're suggesting, auditing occasionally and looking at a confusion matrix, but that's obviously a resource-intensive approach. So what are some other ways? We've actually open-sourced a lot of work on this, especially in the medical imaging space, where this gets really tricky because of the dimensionality.
[00:40:08.750] - Matt Lungren
But ultimately we can statistically look at what the data looked like that the model was trained on, and then, over a sliding window, compare it to what the model is seeing in real time. Now, that may not perfectly relate to the actual performance of the model, but it gives you an indication that, hey, something's going on: there's been a drift, there's been a change across various axes, whether those are protocol-based, machine-based, the percentage of expected disease, et cetera. And you can use that as a signal. So I do think there's quite a bit of opportunity in that sort of evaluation, but this is, again, kind of in that narrow AI world; we keep moving this conversation between narrow and general. And just to be extra clear, I do think that when it comes to multimodal general AI, and then trying to pull that into the domain we're talking about now in the purview of narrow AI, there's going to be a significant gap, right, in how we get to something that's acceptable for the FDA and the processes around it.
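The sliding-window comparison Matt sketches can be approximated with a two-sample statistic between a reference (training-time) distribution and a recent window of production data. The sketch below computes a plain Kolmogorov-Smirnov statistic on one hypothetical scalar feature (say, mean image intensity); the feature, data, and 0.2 cutoff are all assumptions for illustration:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    values = sorted(set(a) | set(b))

    def cdf(sorted_s, x):
        # fraction of sorted_s <= x (a linear scan is fine at this size)
        return sum(v <= x for v in sorted_s) / len(sorted_s)

    return max(abs(cdf(a, v) - cdf(b, v)) for v in values)

# Reference distribution captured at training time (hypothetical feature)
reference  = [0.1 * i for i in range(100)]          # roughly uniform on [0, 10)
live_ok    = [0.1 * i + 0.05 for i in range(100)]   # same range, tiny offset
live_drift = [0.1 * i + 5.0 for i in range(100)]    # shifted: drift

print(ks_statistic(reference, live_ok) < 0.2)       # small gap, no alarm
print(ks_statistic(reference, live_drift) > 0.2)    # large gap, flag drift
```

As Matt notes, a statistic like this does not directly measure model performance; it is a cheap signal that the live data has moved away from what the model saw in training, which is when a deeper audit is warranted.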
[00:41:12.840] - Matt Lungren
I think we're all still trying to learn what that might look like, but at the very base level, you can imagine that a general-purpose model that has seen a wide variety of medical domain-specific imaging data might actually get you to a place where the development cycle for some of those narrow applications, those task adaptations, could get easier. That could then slot into the framework we're all accustomed to using now: I have a task-specific adaptation, it's a narrow model, I go through the approval process, I need to monitor it over time, et cetera. But maybe there's an easier way to update and change that, taking advantage of the fact that these models can certainly learn; they're not necessarily static unless we force them to be. And I look forward to seeing where that evolves.
[00:41:56.940] - Dr. David Gruen
That's really exciting. We clinicians are notoriously and historically really bad at peer review, and we're even worse at finding, identifying, and helping each other correct errors, or even admitting errors. And this technology, with a capital T, is the opportunity to do a much, much better job than humans. And yet we're putting that up as a barrier to adoption. Somewhere there's a paradox that we need to get over. Tom, I think you wanted to say something.
[00:42:28.610] - Tom Calef
Yeah, it was actually along the same lines as what Matt said: this post-commercialization, post-market surveillance is just so important. We do this on the med device side across a whole bunch of different things.
[00:42:47.320] - Tom Calef
Not only is software, as Matt was saying, kind of locked and loaded today, very static, but there are things that aren't static, like security, where we have to make sure we do things like pen testing all the time. And I think we can draw a lot of parallels to the way models develop over time, to the amount of data they're given and taught and then how they react. The good thing is that device organizations and OEMs are used to this type of periodic maintenance, surveillance, and truly questioning their own devices to make sure there's a high-quality product going out. So I think we can learn from some of the mechanisms in place, and it also helps the FDA come along and say, look, this is a very standard process that you're all used to over here; let's see if we can take it and tweak it a little bit, maybe for things like data ingestion and dynamic models.
[00:43:57.200] - Dr. David Gruen
We only have 15 minutes left, so I'd like to get a little practical for those in the audience. If people are thinking about getting started with AI, who should be on the team they put together to weigh in on the various aspects of risk? What do they need in place to get started? And part two: do you think they're better off starting with a narrow AI solution, or should we bypass that and go right to large language models or administrative tasks? Where should people start? Erik, you're smiling. You go first here.
[00:44:29.720] - Erik Duhaime
Yeah, let's see what Matt thinks about this. But I think if you're a small team starting out, you should probably start thinking relatively narrow: a specific use case, some sort of operational workflow or something. Maybe you're using a generalizable model as the base you're training on. But I'm smiling because I think the challenge with going broad initially is that you're going to lose to teams at Microsoft and other giant companies doing some incredible work internally. It's hard if you don't have large teams of brilliant people, lots of compute power, et cetera. So I would think about what a smaller use case is, just to get started, because also, once you've built a model, one thing that's tough for innovators in this space is to think about distribution after the fact.
[00:45:27.340] - Erik Duhaime
Even if you build something that works pretty well and is relatively general-purpose, that doesn't mean you're going to convince a hospital to fire everybody on staff and replace them with your solution. You need to focus on something small that works, that adds value. But I'm very curious to hear what Matt thinks on whether people should be starting broad or narrow.
[00:45:50.980] - Matt Lungren
Yeah, I guess I'd phrase it a little differently, but I see what you're saying. From my perspective, it's not necessarily that the use cases have shifted; it's that the underlying model and technology has shifted. So I don't need to spend an eight- to 16-month development cycle to automate one task, which required me to get really crisp on my ROI and my development time and how much data I had, and all that work that used to go into automating just one thing. And there's probably a massive list. To me, this is the shift: now there are platforms and capabilities, including from us, that provide both big, super-powerful, extremely useful models like GPT-4 and an ecosystem of other models that are slightly less performant across certain categories but still very useful, plus all these other tools to fine-tune and put guardrails on, knowledge bases, all in a secure environment. And then to me it becomes almost a bottom-up technology instead of top-down. So if I'm in an enterprise, I might say, listen, number one, I think there's a copilot era that we're about to enter into generally as a society, where we have these LLMs in our email and our Word documents.
[00:47:05.780] - Matt Lungren
So we're already going to have a familiarity with where the technology can really help us. And if you look at everyone in a health system ecosystem broadly, probably everyone has a list of things they could name: I'd love to be able to automate this series of things I do that are repetitive and don't require a lot of knowledge. I think that's now the promise; that's why everyone's so excited, because again, I don't need a team of data scientists developing narrow models to accomplish that. Now I have the ability to take that same base core technology and platform to enable all kinds of use cases. Now, for some of those, if you start to get into the clinical spaces we've touched on here, there's a lot more work that needs to be done before we get there on a day-to-day basis. But I'm thinking about all these other tasks we've touched on, which is just to say that the technology now comes from a common place, as opposed to, you know, being scattered across tiny little narrow models everywhere.
[00:48:13.280] - Dr. David Gruen
Veronica, then Tom.
[00:48:16.840] - Dr. Veronica Rotemberg
I guess I want to take another step back, if at all possible. If you're starting in this space from scratch, you need at least three pillars, and we've talked a lot about them. You need a strategy; exactly what Matt has outlined could be your strategy, or you could decide on a more narrow model strategy, or whatever makes sense for you. You need the operational infrastructure to actually do the things we're talking about; a lot of places still don't have that, especially outside of radiology. And you need governance: who is going to be in charge of all of these fun things we talked about in terms of monitoring, validation, testing, asking Erik's annotators to tell us when something is wrong, or, in my domain, when a new camera comes out and all of a sudden it has a weird light that shows up in all the images. So I think you need to have some pillars in place before you start deciding what you're actually going to buy. And that has to do with people who are experts in a lot of the general topics we've talked about, as well as the medical subjects.
[00:49:43.750] - Dr. Veronica Rotemberg
But then, I agree with Erik: you have to think about what you actually want to use in your day-to-day life and go from there.
[00:49:54.960] - Dr. David Gruen
I was hoping for more controversy in this conversation. We haven't had very much; there's a lot of agreement here. Tom, can you shake us up a little bit in our last couple minutes?
[00:50:03.800] - Tom Calef
I don't know if I can shake you up at all, but maybe I'll go to practicality. You know, I'm a small-company guy. All the startups I've ever been part of have been small and have gone from zero to one. And I think jumping into a new problem is similar in this realm too: be focused on the need, be focused on what you're trying to solve, and have the expert who's around it. Because if I'm coming in trying to solve a problem in Veronica's field, I just don't know that field very well.
[00:50:39.330] - Tom Calef
So it's going to be really difficult for me to say, hey, look, I can solve this. Yeah, but what about the rest of the ecosystem that you sit in, everything you just mentioned? So having the expert who can not only talk to you about the problem, or maybe being the person who wants to start and really understands everything around it, I think that's super important. But the biggest thing I've seen personally, working with AI over the last four or five years, is creativity. You have to be creative, because it's going to take multiple different approaches, combining large language models and narrow models. And then you have to be hungry, because it's going to take some time. The first solution is not going to work. It's going to take conversing with a large model, or doing a lot of training on narrow models. So it takes a lot of work; just be ready for it. But at the end of the day, what it can actually do truly is amazing.
[00:51:42.980] - Dr. David Gruen
Which brings us to the last couple of questions. Each of you, in a minute or two: which applications of generative AI in healthcare are you most excited about, and what do you think is going to bring us across the proverbial chasm to get there? What's going to be the transitional moment where all of a sudden it's happening? And let's go in reverse order. Tom, you get to go first this time.
[00:52:06.300] - Tom Calef
So the first thing we're working on is surgical notation and notes directly from the surgical video. We're already in the video workflow in the operating room, so why have a surgeon go back and dictate if we can pull it right from the video? I think it has a lot of promise, and the product is on the way. I think that's a really good use of generative models: to actually tell what happened in the OR. Instead of one short sentence from a surgeon, actually go through: here are all the steps, this is what happened, and be able to build off that in the future. So when we do provide a discharge note, surgery isn't as much of a black hole as it may be today.
[00:52:56.560] - Dr. David Gruen
Sounds like a real benefit to have automated, real-time, completely accurate op notes. I can think of a lot of people in the healthcare continuum that would like that.
[00:53:07.360] - Dr. Veronica Rotemberg
Me, but I think it's very similar to Veronica.
[00:53:11.850] - Dr. David Gruen
I keep calling you.
[00:53:12.560] - Dr. Veronica Rotemberg
It's okay. I think structuring healthcare data for easy understanding, easy interpretation, better metrics around what we're actually doing when we provide care: it's a huge opportunity for the language models, honestly, as they exist today. So that's something I'm really excited about. Even as someone who works in a more narrow domain, getting the structured data I need out of the health record as it exists today would be amazing.
[00:53:50.860] - Dr. David Gruen
[00:53:52.620] - Matt Lungren
Yeah, I agree with both of you; you're touching on two themes that I'm also excited about. One is the idea of multimodality: not constraining ourselves to text-based things, because we recognize healthcare is multimodal. It's episodic, it's temporal. So we really need a way to tokenize all of the knowledge, bring it together, connect the dots, and allow us to interact with the medical information in the way that serves our use case. Whether it's getting your surgery note done, getting a chart biopsy done before you walk in the room for a patient, or actually tying clinical knowledge to even some of the basic science, genomics, some of the work going on there, being able to do that is why I'm so incredibly excited about this technology. It shifts the idea of static information, taking the reports we create and turning them into something potentially dynamic. In other words, a surgeon has different questions about the CT scan that I read than the patient would, or the primary care doc, et cetera.
[00:54:57.660] - Matt Lungren
And so that fluidity of the data, filtered through a reasoning engine that understands the space, is a really powerful idea, and I feel like it's no longer a science fiction concept. It's something we can actually drive towards.
[00:55:13.880] - Dr. David Gruen
Thank you, Erik.
[00:55:17.080] - Erik Duhaime
I think we'll have made it, and I'll be most excited, when I go to the doctor and don't get handed a clipboard and need to write everything down.
[00:55:26.140] - Dr. Veronica Rotemberg
That's going to be the last thing to go.
[00:55:29.980] - Erik Duhaime
One thing that's cool about my job at Centaur is I get to talk to innovators across areas like dermatology, radiology, surgery, and things I never would have expected. And one thing we've been seeing more and more of is people making use of unstructured text data. We've talked about a handful of applications here, but we even see pharma companies having us tag scientific paper abstracts so they can mine the scientific literature and come up with new ideas for drugs. Most information in the world is text data. It's difficult to understand medical and scientific text without good annotation, or without making sure you're validating the results, but there is an enormous amount of text data out there. When most people think about AI in healthcare, they think about radiology instinctively, because that's where there's been a lot of progress and where the most exciting headlines were five or ten years ago. But I think we're going to see greater advances with the ability to mine and summarize text data of all different sorts. And it's not an either/or; it's multimodal, like Matt and Veronica are saying.
[00:56:45.770] - Erik Duhaime
But I think that's what I find most exciting about large language models in the state of AI today: most of the unstructured information out there, which contains most of the info, is actually text. Scientific papers, medical notes, things being dictated, patient-doctor conversations that are now being recorded and can be summarized and used to extract findings. We're seeing even the folks working on computer vision algorithms digging into that text data and having us annotate radiology or pathology reports and things like that. So that's what I find most exciting, and hopefully it leads to me not needing to fill out a clipboard next time I'm at the doctor.
[00:57:30.420] - Dr. David Gruen
So I'll take the last word here. As we shared last night, Erik, I asked my personal doc, who's a smart, Ivy League, fellowship-trained guy: should I take a baby aspirin? And I think we're coming close to the day where he should be asking a large language model: should Dr. Gruen be taking a baby aspirin? It could look at all of my familial and genomic history and everything else far better than he could. And the list goes on, right? Should we start patients on Prozac or Paxil? Do they need surgery or not? Is this lesion bad, and how often does it need to be followed, based on the patient's age and all their other comorbidities? Do they need dermatology, or do they really need hospice? The list goes on of things that can help us as clinicians make better, data-driven decisions for every patient at every moment of every day. And I see that transforming the way we practice, but not replacing us. And I'm really excited. Last question, if we have one minute. This is clearly changing so very quickly. Who do you all follow? Can each of you share a tip on how people might keep up with this fast-changing pace?
[00:58:38.160] - Dr. David Gruen
Do you follow somebody on LinkedIn? Do you read? What do you do? How should people keep up with this? What do y'all think? One second each, because Freddy's giving me the yank here.
[00:58:49.850] - Erik Duhaime
I follow Matt. Follow Matt Lungren on LinkedIn. No, genuinely, follow Matt. He's one of the central nodes in the space, but there are also a bunch of great listservs out there if you Google around. Follow the AIMed community. There's a newsletter I'm on called The Sequence that I think is quite good for AI work. It only takes a little bit of poking around, and you only need to follow one or two AIMeds or Matt Lungrens before you find a bunch of other great resources.
[00:59:22.280] - Dr. Veronica Rotemberg
I agree, Matt's amazing. But there are also great healthcare AI reporters now. I think it's becoming a really growing space, and even very traditional publications are learning how to write about this technology for a wider audience. We're in a very lucky time where more critical thought is going into that. So I also follow healthcare tech reporters, and I think as a group they're doing a lot better than they were even five years ago, which is great.
[01:00:06.260] - Tom Calef
Well, Matt's already been taken, so on the other side of that, I think Renee Yao and Kimberly Powell at Nvidia do a great job. I think they have a great voice in this space.
[01:00:20.680] - Freddy White
Okay, Matt. I always want to know who cuts my barber's hair. Who do you follow?
[01:00:26.220] - Matt Lungren
So, my favorite people: Andrej Karpathy is probably my favorite account to follow, because he's so practical, with really great explanations of some of the breakthroughs. Pranav Rajpurkar, I obviously love his work and the things he's doing. And then clearly the accounts of MSR, but also DeepMind and Google; I think they're just putting out great work. NeurIPS is going to be a great example of some of the incredible work. And then Roxana from Stanford, our mutual friend with Veronica, does some incredible work as well. I would definitely not miss out on her content either.
[01:00:58.790] - Freddy White
Yeah. So, Freddy, will you share that list with the participants, if you got it all from the recording, so people can add it to their reading list? I think we're at the top of the hour. So on behalf of AIMed and this incredible group, and Centaur, who sponsored us, we hope this has been an interesting and enjoyable break from your daily grind, and we look forward to following up over the next months about how you all are adopting AI and moving forward with generative and narrow, or whatever it is you're pursuing to improve patient outcomes. So thank you all and have a great day. Thank you, everybody; great to have everyone on. Thanks a lot.
Chief Medical Information Officer - Nuance Communications
Dr. Lungren is Chief Medical Information Officer at Microsoft+Nuance. As a physician and clinical machine learning researcher, he has a part-time clinical practice at UCSF while also maintaining research and teaching roles at other leading academic medical centers, including Stanford and Duke. Prior to joining Microsoft, Dr. Lungren was an interventional radiologist and research faculty at Stanford University Medical School, where he led the Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI). His scientific work has led to more than 150 publications, including work on multi-modal data fusion models for healthcare applications, new computer vision and natural language processing approaches for healthcare-specific domains, opportunistic screening with machine learning for public health applications, open medical data as a public good, and prospective clinical trials for clinical AI translation. He has served as an advisor for early-stage startups and large Fortune 500 companies on healthcare AI technology development and go-to-market strategy. Dr. Lungren's work has been featured in national news outlets such as NPR, Vice News, and Scientific American, and he regularly speaks at national and international scientific meetings on the topic of AI in healthcare.
Director of the Dermatology Imaging Informatics Group
Memorial Sloan Kettering Cancer Center
Veronica Rotemberg, MD, PhD is the director of the Dermatology Imaging Informatics Group, an organization within Memorial Sloan Kettering Cancer Center focused on developing, benchmarking, and validating artificial intelligence (AI) tools for dermatologic applications, especially melanoma detection. The group also explores model development, human-AI interaction, and prospective validation of image classification tools. In collaboration with the International Skin Imaging Collaboration (ISIC), the group also develops and hosts large imaging datasets for dermatology AI research, and proposes both imaging standards for interoperability, and metadata standards for AI development and for dermatology clinical applications. Dr. Rotemberg received her MD-PhD from Duke University with a PhD in biomedical engineering focusing on elasticity imaging.
Chief Technology Officer
Mr. Calef is a highly accomplished medical roboticist and seasoned engineering leader who has delivered breakthrough innovation in surgical robotics. Previously, he was the Director of Advanced Robotics at Medrobotics, where he led the development and release of the first flexible surgical robotics platform, grew the engineering team from two to 50 people, and guided numerous surgical robotics products through 510(k) clearance. He holds both a B.S. in Computer Engineering and an M.S. in Mechanical Engineering from the University of Massachusetts.
Cofounder and CEO
Erik Duhaime, PhD, is the founder and CEO of Centaur Labs, a venture-backed startup that provides a platform and services for scalable data annotation. The company focuses on skilled data annotation tasks - especially those in the medical and scientific domains - and works with leading biotechnology companies, medical device companies, digital health startups, and researchers at leading academic medical centers. The company is based on his PhD research at the MIT Center for Collective Intelligence on how to best measure the performance of data annotators, motivate them, and intelligently combine their opinions. Prior to his PhD, Erik completed his BA in Economics and Biology at Brown University, and his MPhil in Human Evolution at the University of Cambridge.
We’re humbled and honored to be recognized by CB Insights as one of the top 150 digital health startups in the world!