9 most common types of medical text datasets

Ali Devaney, Marketing
August 31, 2022

This is the second post in our NLP in Healthcare blog series. Read the announcement here - [New Series] Introducing our NLP in Healthcare blog series

As AI leaders consider what medical datasets can support their AI projects, medical text offers a rich source of information, if it can be accessed and annotated accurately. Below are 9 of the most common types of medical text we see being leveraged by AI leaders in the medical and life sciences.

Most common types of medical text datasets for NLP projects

  1. Direct-to-patient communication - When the pandemic made in-person healthcare visits dangerous, the digital front door to healthcare blew open. Patients and providers accelerated their use of many text-based communication tools, such as SMS, chatbots, chat in video conferencing, email, digital forms, and other virtual care tools. Some of these text-based programs even saved lives, bringing only the right patients into the hospital at the right time to get the care they needed. These new sources of medical text offer both inspiration for new care programs and new sources of text for NLP.
  2. Electronic Health Record (EHR) - Clinical notes written by providers summarizing each patient interaction have long been the source of truth that provides a record of what care has been delivered and should be reimbursed, and informs what a patient’s care plan will be going forward. With more than 75% of healthcare institutions now having some EHR in place, the structured text and unstructured clinical notes it houses are ideal text data sources for NLP projects. 
  3. Insurance claims - When we think of insurance, we often think of health insurance - the insurance that covers everything from routine doctor's visits, to unplanned diagnoses, to pregnancy care. However, many other types of insurance cover injury and bodily harm - such as dental, auto, property, and life insurance, as well as workers' compensation, general and product liability, and disability insurance. Whenever a situation arises that causes injury or bodily harm, and this situation is covered by one of these types of insurance, the affected parties must file insurance claims summarizing the situation for the relevant insurance companies to review. Many of these insurance claims are text-based.
  4. Prescriptions - 66% of adults in the US take prescription drugs and each of those prescriptions has both free text directions written by the clinician prescribing the medication, and instructions written by the pharmacist filling the prescription. Most of this text is managed and shared in ePrescription software. 
  5. Transcripts of patient engagements - While more and more healthcare is being delivered asynchronously, the lion's share of care is still delivered face to face - both virtually and in person. As a result, medical transcription is a large and growing source of medical text. Some of this transcription is completely machine generated, some has a human in the loop, and some is still completed by professional human transcriptionists.
  6. Pathology and radiology reports - Pathology reports are requested whenever a tissue requires testing in a lab, for example a blood test or a tumor biopsy. While we often think of imaging when we think of pathology and radiology, these reports also contain written descriptions. For example, a pathology report contains a gross description, microscopic description, and diagnosis section, and may also have disease-specific or open comment sections, all of which are written by the pathologist.
  7. In-product user-generated content - As direct-to-consumer health and wellness applications continue to meet many consumer health needs, more and more health information is shared in 1-to-many in-product user discussion groups, 1-to-1 chats between users and company representatives, user-initiated in-product search, and in-product feedback and ratings portals. All this user-generated content is provided as text.
  8. Scientific studies - In the biomedical field alone, more than 1 million papers are published in the PubMed database annually — about two papers per minute. On top of this publicly available research, academic institutions and other funders of scientific research - primarily pharmaceutical companies - have a stable of unpublished intellectual property they own and manage, some of which is text based. Pharmaceutical companies are transforming this biomedical text data into a knowledge graph of biological, chemical, and medical concepts, and leveraging NLP to infer connections for downstream experimental validation led by their R&D teams.
  9. Social media and public forums - As many as 29% of consumers turn to Google and other search engines for reliable health information, whether to evaluate treatment options or to understand medication side effects, while many others share health information on social networking sites and participate in online health-related support groups. Search queries, along with activity on social media and in public patient support groups, are large and growing sources of text for NLP projects.

All of these sources of medical text are being leveraged for interesting NLP development in healthcare, and could be helpful sources of information for your next project as well. We’ll share some of our favorite companies leveraging these medical text datasets in the final post in the series.

Learn more here about how Centaur Labs can support your medical text annotation needs, so you can label faster, and get your AI to market faster where it can impact patient outcomes. 

NLP in healthcare blog series

  1. [New Series] Introducing our NLP in Healthcare blog series 
  2. 9 most common types of medical text data
  3. Coming soon - 5 types of medical text annotation
  4. Coming soon - 8 best practices for annotating medical text
  5. Coming soon - 12 impressive healthcare companies leveraging medical text for AI
