Google Med-PaLM: The AI Clinician


ARTIFICIAL INTELLIGENCE | MEDICINE | NLP

Image by the author using OpenAI DALL-E

Medicine is, at its heart, based on the interaction between patient and physician. Moreover, even when the patient undergoes various tests and imaging procedures, there is always a written report. So why have AI models for medicine and healthcare failed to fully exploit language?

A foundation model for medicine?

photo by Myriam Zilles on Unsplash

The trend in recent years has been to train large language models (LMs) and then fine-tune them for the required applications. A similar approach can be attempted on large corpora of medical text so that the model learns a useful representation. A model with a good understanding of medical subject matter could be useful for countless applications (patient triage, knowledge retrieval, summarization of key findings, diagnosis assistance, and so on).

The problem is that the medical domain is a special one. In contrast to other fields, it raises different issues and even greater safety concerns. As we have seen, models like ChatGPT can hallucinate and can spread misinformation.

Everything but everything you need to know about ChatGPT

A new study by Google focuses on the ability of a large LM to encode clinical knowledge and assesses its potential in medicine. The authors decided to start with a specific task: medical question answering. This is a fundamental but also difficult task: the model must provide high-quality answers to medical questions. To do so, it must understand the medical context, find the relevant information, and reason about expert-level questions.

Large Language Models Encode Clinical Knowledge

The idea of creating an LM for medicine is not new; in fact, attempts have already been made over the years. This is because an LM can be trained in an unsupervised manner using a large amount of text (usually general text such as books or Wikipedia).

Models are trained without a specific task in mind, but as the scaling laws show, LMs are capable of emergent behaviors that allow them to adapt to particular tasks without the need for gradient updates. One example is in-context few-shot learning, which allows models to "rapidly generalize to unseen tasks and even exhibit apparent reasoning abilities with appropriate prompting strategies." In addition, the models implicitly act as a knowledge base, though they also have the disadvantage of amplifying the biases present in the training dataset.
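To make the idea concrete, here is a toy few-shot prompt in Python. The questions are invented for illustration (they are not taken from any benchmark): the model sees two solved question-answer pairs and is asked to continue the pattern, all within a single input and without any weight update.

# Illustrative few-shot prompt: two worked examples followed by a new question.
# The questions below are invented for illustration, not taken from MultiMedQA.
few_shot_prompt = """Question: Which vitamin deficiency causes scurvy?
Answer: Vitamin C.

Question: Which organ produces insulin?
Answer: The pancreas.

Question: Which blood cells carry oxygen?
Answer:"""
# The model is expected to continue with "Red blood cells", having inferred the
# task format purely from the two in-context examples, with no fine-tuning.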

In any case, several approaches have been attempted, as there are millions of medical articles and a vast amount of medical data that can be exploited. Early models were based on BERT (SciBERT, BioBERT, PubMedBERT, DARE, ScholarBERT). More recently, models based on the GPT architecture, such as BioGPT, have also been tried.

Microsoft BioGPT: Towards the ChatGPT of life science?

So what does this study bring that is new?

  • A new dataset that allows better evaluation of LM in medical question answering.
  • State-of-the-art results on medical question answering benchmarks.
  • Instruction prompt tuning to improve alignment with the medical domain.
  • An in-depth analysis of LM limitations in the medical domain.
"Overview of our contributions We curated MultiMedQA, a benchmark for medical question answering spanning medical exam, medical research, and consumer medical questions. We evaluated PaLM and its instructed-tuned variant, Flan-PaLM, on MultiMedQA. With a combination of prompting strategies, Flan-PaLM exceeded SOTA performance on MedQA (USMLE), MedMCQA, PubMedQA, and MMLU clinical topics. In particular, it improved over the previous SOTA on MedQA (USMLE) by over 17%. We next proposed instruction prompt tuning to further align Flan-PaLM to the medical domain, producing Med-PaLM. Med-PaLM's answers to consumer medical questions compared favorably with clinician-generated answers under our human evaluation framework, demonstrating the effectiveness of instruction prompt tuning". figure source: here

How to evaluate a language model in medicine?

photo by Online Marketing on Unsplash

First, we need a good dataset. The authors note that there are several datasets available for research, but each focuses on a specific aspect or task: medical exam questions, medical research questions, or consumer questions seeking helpful answers to medical information needs.

We acknowledge that medical knowledge is vast in both quantity and quality. Existing benchmarks are inherently limited and only provide partial coverage of the space of medical knowledge. Nonetheless, bringing together a number of different datasets for medical question answering enables deeper evaluation of LLM knowledge than multiple-choice accuracy or natural language generation metrics such as BLEU. (source)

In other words, the authors say that one dataset alone is not enough, since the domain is so large. Moreover, evaluating a metric such as BLEU (or any other automatic metric) does not demonstrate the model's ability to understand the domain.
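A minimal sketch of the problem, assuming NLTK is installed (the sentences are invented): an answer that reverses the clinical recommendation still shares most n-grams with the reference and therefore receives a respectable BLEU score.

# Why n-gram overlap can be misleading for medical answers: reversing the
# recommendation barely changes the score.
from nltk.translate.bleu_score import sentence_bleu

reference = "you should take this medication with food".split()
dangerous = "you should not take this medication with food".split()

# High overlap, opposite clinical meaning: BLEU is still around 0.6.
print(sentence_bleu([reference], dangerous))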

The newly assembled benchmark requires the model to be able to answer multiple-choice questions, open-ended (long-form) questions, closed-domain questions (the answer must be found in a given reference text), and open-domain questions (the answer is not limited to information in a specific source).

In summary, the authors constructed a new benchmark by combining existing datasets with a new dataset of curated, commonly searched health queries. The entire benchmark is in English and covers medical exams, medical research, and consumer queries. Labels and metadata are also included.

(source)

Given the complexity of the domain, they have enriched the dataset with answers written by clinicians. In addition:

Secondly, given the safety-critical requirements of the medical domain, we believe it is important to move beyond automated measures of long-form answer generation quality using metrics such as BLEU to those involving more nuanced human evaluation frameworks such as the one proposed in this study.

An example of a multiple-choice question:

(source)

And of a long-form question:

(source)

The authors then defined a framework with which clinicians could assess the robustness of the model. Indeed, automatic metrics, although useful, omit many important details and can be misleading in the medical context.

The authors used focus groups and interviews with clinicians based in the UK, US, and India to define the axes of evaluation. In addition, they emphasized "notions of agreement with scientific consensus, possibility and the likelihood of harm, completeness, and missingness of answers and possibility of bias."

"Summary of the different axes along which clinicians evaluate the answers in our consumer medical question answering datasets. These include agreement with scientific consensus, possibility and likelihood of harm, evidence of comprehension, reasoning and retrieval ability, presence of inappropriate, incorrect or missing content and possibility of bias in the answer. We use a pool of clinicians to evaluate the quality of model and human-generated answers along these axes." (source)

The table summarizes the kinds of issues the form covers. As the authors state, both harm and bias are complex concepts with no single definition. For example, harm can be defined at different levels: "physical health, mental health, moral, financial, and many others." Therefore, the authors created a form with different questions and provided it to clinicians in different countries (US, UK, and India).
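As a purely hypothetical illustration (the field names below are my own, not the actual form fields used in the study), the rating axes described above could be captured in a record like this:

# Hypothetical sketch of a record for the clinician rating axes described above;
# the field names are illustrative, not the actual form fields from the study.
from dataclasses import dataclass

@dataclass
class ClinicianRating:
    question_id: str
    consensus: str               # aligned with, opposed to, or no scientific consensus
    harm_likelihood: str         # how likely the answer is to lead to harm
    harm_severity: str           # how severe that harm could be
    correct_comprehension: bool  # evidence the question was understood
    correct_retrieval: bool      # correct recall of clinical knowledge
    correct_reasoning: bool      # sound reasoning toward the answer
    incorrect_content: bool      # contains content that should not be there
    missing_content: bool        # omits important information
    possible_bias: bool          # the answer could be biased toward or against a group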

On the other hand, not everyone has medical knowledge, so the authors also decided to evaluate the helpfulness and utility of the answers for consumers. They created a separate form, and the rating was conducted by people with no medical background:

The goal of this exercise was to assess how well the answer addressed the perceived intent underlying the question and how helpful and actionable it was. (source)

"Summary of the different axes along which lay users evaluate the utility of answers in our consumer medical question answering datasets. We use a pool of 5 non-expert lay users to evaluate the quality of model and human-generated answers along these axes." (source)

Which model?

photo by Kimon Maritz on Unsplash

The authors started by using the Pathways Language Model (PaLM) and the Flan-PaLM family.

PaLM is "a densely-activated decoder-only transformer language model trained using Pathways." The model was trained on a huge corpus of 780B tokens including internet data, Wikipedia, source code, social media conversations, and books. PaLM in its largest version contains 540B parameters and has achieved state-of-the-art results on several benchmarks.

Flan-PaLM is PaLM's instruction-tuned counterpart, trained with a collection of instruction-tuning datasets. As has been shown, including chain-of-thought examples in instruction tuning allows the model to generalize better.

Multimodal Chain of Thoughts: Solving Problems in a Multimodal World

Once we have the model, the main problem is how to adapt it to the medical domain:

However, given the safety critical nature of the medical domain, it is necessary to adapt and align the model with domain-specific data. (source)

The authors decided to use prompting and prompt tuning as their strategy. As has been demonstrated, LMs are few-shot learners: with a few carefully selected examples in the prompt (in-context learning), the model can learn a new task without any gradient updates or fine-tuning. The authors used three prompting strategies:

  • Few-shot prompting. Few-shot examples describe the task through text-based demonstrations (input-output pairs). The best demonstrations were crafted in agreement with qualified clinicians (for each of the datasets).
  • Chain-of-thought prompting. A set of intermediate reasoning steps toward the final answer is added to the prompt (an approach that mimics human reasoning in problem-solving). These prompts were also created in conjunction with clinicians.
  • Self-consistency prompting. One strategy to improve the model's performance on multiple-choice questions is to sample multiple decoding outputs from the model (the final answer is then the one that receives the majority of votes). The idea behind this is that in a complex domain there can be multiple potential routes to the correct answer (a minimal sketch of this voting scheme follows below).
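Here is a minimal sketch of the self-consistency idea. The generate function is hypothetical and stands in for whatever sampling API is used; the number of samples and the temperature are arbitrary illustrative values.

# A minimal sketch of self-consistency prompting for a multiple-choice question.
# `generate` is a hypothetical callable that samples one completion from the LM
# and returns the option it picked; it stands in for whatever model API is used.
from collections import Counter

def self_consistency_answer(prompt, generate, n_samples=11, temperature=0.7):
    # Sample several independent reasoning paths with non-zero temperature.
    votes = [generate(prompt, temperature=temperature) for _ in range(n_samples)]
    # The final answer is the option chosen by the majority of the samples.
    answer, count = Counter(votes).most_common(1)[0]
    return answer, count / n_samples  # the answer plus the fraction of agreeing votes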

As mentioned earlier, prompting methods make it possible to improve the model relatively inexpensively (fine-tuning such large models is computationally expensive). But prompting alone is not enough for many tasks, which would still benefit from fine-tuning.

photo by Anton Shuvalov on Unsplash

How do you fine-tune a 540B-parameter model?

The answer: prompt tuning. In short, prompt tuning prepends a small set of prompt tokens to the input to guide the model toward a desired task. There are two types of prompts: those written by humans (hard prompts) and those learned via backpropagation (soft prompts, i.e., learned vectors). This narrows the learnable parameters to those representing a small number of tokens (the rest of the model is frozen).

In this study, the authors used both soft prompts and relevant, task-specific, human-engineered (hard) prompts.

We refer to this method of prompt tuning as "instruction prompt tuning". Instruction prompt tuning can thus be seen as a lightweight way (data-efficient, parameter-efficient, compute-efficient during both training and inference) of training a model to follow instructions in one or more domains. (source)

They then used instruction prompt tuning on a small set of exemplars to adapt Flan-PaLM to the medical domain. Since the exemplars are few, they must be good examples "of medical comprehension, recall of clinical knowledge, and reasoning on medical knowledge unlikely to lead to patient harm." In other words, these examples were carefully curated in collaboration with medical experts (clinicians from different disciplines).
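A minimal PyTorch sketch of the mechanism (not the paper's code), under the assumption of a Hugging Face-style causal LM that accepts inputs_embeds: only the soft prompt vectors are trained, while the hard instruction text and the curated exemplars simply remain part of the tokenized input.

# Only the soft prompt vectors are trained; the LM itself stays frozen.
import torch
import torch.nn as nn

class InstructionPromptTuning(nn.Module):
    def __init__(self, frozen_lm, embed_layer, num_soft_tokens=100):
        super().__init__()
        self.lm = frozen_lm
        self.embed = embed_layer          # the LM's token embedding layer
        # Learnable soft prompt: the only trainable parameters in this setup.
        self.soft_prompt = nn.Parameter(
            torch.randn(num_soft_tokens, embed_layer.embedding_dim) * 0.02)
        for p in self.lm.parameters():
            p.requires_grad = False       # keep the large LM frozen

    def forward(self, input_ids):
        # input_ids already contain the hard (human-written) instruction and the
        # curated exemplar; the learned soft prompt is prepended to their embeddings.
        token_embeds = self.embed(input_ids)                            # (B, T, d)
        soft = self.soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        inputs_embeds = torch.cat([soft, token_embeds], dim=1)          # (B, S+T, d)
        return self.lm(inputs_embeds=inputs_embeds)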

Instruction prompt tuning for Med-PaLM (source)

Med-PaLM: is it better than the other models?

photo by Steven Lelham on Unsplash

The model reached and surpassed the state of the art on:

  • MedQA, multiple choice questions on general medical knowledge (U.S. medical licensing exam).
  • MedMCQA, medical entrance exam questions from India (multiple choice questions).
  • PubMedQA, biomedical scientific literature.
  • MMLU, multiple-choice questions on various topics related to clinical knowledge, medicine, and biology.
left: Comparison of SOTA LLMs on MMLU clinical topics. right: summary of the performance of PaLM and Flan-PaLM models across different model size variants. adapted from here.

The results are also consistent across the various categories of clinical topics, showing that Flan-PaLM reaches SOTA in all of them.

(source)

The authors also evaluated the performance of the model at different model sizes using the medical question-answering datasets in MultiMedQA. The results show that scaling improves performance when using few-shot prompting, and that instruction tuning improves performance over the baseline.

(source)

In addition, the authors note two interesting factors:

  • First, chain-of-thought (CoT) prompting does not bring improvement in this case (which is actually surprising).
  • Second, self-consistency (SC) leads to a strong improvement in multiple-choice performance (which was expected).
left: Summary of the performance of Flan-PaLM models with few-shot and chain-of-thought (CoT) prompting. right: Summary of the performance of Flan-PaLM with and without self-consistency prompting (SC). adapted from here.

The model is also capable of generating an explanation of why it chose a particular response:

"example explanations generated by the Flan-PaLM 540B model to support its multiple-choice answer" (source)

LMs can hallucinate, and in the medical context this can be disastrous. Therefore, the authors investigated the relationship between the model's uncertainty and the accuracy of its statements. In other words, they used the model's confidence (the agreement among self-consistency samples) as a proxy, deferring when it was low, and observed that higher confidence corresponds to higher accuracy.
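Building on the self_consistency_answer sketch above, the deferral idea can be illustrated as follows; the threshold is an arbitrary illustrative value, not the one used in the study.

# Use the self-consistency vote fraction as a proxy for confidence and defer
# (abstain) when it falls below a threshold chosen purely for illustration.
def answer_or_defer(prompt, generate, threshold=0.5):
    answer, vote_fraction = self_consistency_answer(prompt, generate)
    if vote_fraction < threshold:
        return None  # defer: too uncertain, hand the question to a clinician
    return answer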

Analysis of deferral behavior of Flan-PaLM 540B model with self-consistency (source)

Does the model convince the clinicians?

photo by Sander Sammy on Unsplash

Is top benchmark accuracy enough to use the model in the clinic?

Metrics are important but can be misleading. Especially for a sensitive field like medicine, there is a need for more than just the result on a benchmark.

The authors selected 100 questions representative of real consumer inquiries. They then used Flan-PaLM and Med-PaLM (both 540B models) to generate answers and submitted them to a panel of 9 clinicians (based in the US, UK, and India).

While clinician-generated answers were judged to be aligned with the scientific consensus for 92% of the questions, Flan-PaLM's answers were for only 61%.

This suggested that generic instruction tuning on its own was not sufficient to produce scientific and clinically grounded answers. However, we observed that 92.9% of Med-PaLM answers were judged to be in accordance with the scientific consensus, showcasing the strength of instruction prompt tuning as an alignment technique to produce scientifically grounded answers. (source)

(source)

The models were trained on articles and books published in the past, so the authors point out that this may be one reason for failure and that continual learning should be explored.

The authors then asked clinicians whether the answers showed evidence of correct comprehension, knowledge retrieval, and reasoning over medical knowledge, or whether they contained errors along these axes. Here again, Med-PaLM proved superior to Flan-PaLM.

(source)

The authors also asked whether the answers contained incorrect or missing content (the completeness and correctness of the generated answers), that is, whether the model omitted information that should have been there or included information that should not have been there. Med-PaLM answers omitted important information in 15% of cases (compared with 47% for Flan-PaLM). Surprisingly, though, Med-PaLM answers contained incorrect content more often than Flan-PaLM's (18% vs. 16%). The authors explain this result as follows:

instruction prompt tuning teaches the Med-PaLM model to generate significantly more detailed answers than the Flan-PaLM model, reducing the omission of important information. However a longer answer also increases the risk of introducing incorrect content. (source)

The authors also explored the severity and likelihood of potential harm from the generated responses. In other words, they asked whether these responses could lead to actions by either clinicians or consumers/patients that would result in health-related harm. Although the definition is relative and the rating is in this case a subjective measure, instruction prompt tuning produced safer responses than the baseline model.

(source)

In addition, the authors analyzed the potential for the model to amplify bias in healthcare. The model could reflect or amplify patterns present in training data that reflect disparities in health outcomes and access to care. The results show that the new approach significantly reduces the risk of bias.

(source)

Finally, the authors analyzed how non-experts assessed the answers.

While Flan-PaLM answers were judged to be helpful in only 60.6% of the cases, the number improved to 80.3% for Med-PaLM answers. (source)

But they remain inferior to clinician answers. Similar results were obtained by asking the non-expert raters whether the answers directly addressed the user's question.

(source)

Limitations of the model

photo by Joshua Hoehne on Unsplash

The authors note several limitations to the study and future directions:

  • The dataset they proposed (MultiMedQA) is not exhaustive (despite including several sources); for example, it is lacking in biology. In addition, the benchmark should also include questions and answers closer to the real world (multiple-choice questions are easy to score but far from real clinical practice).
  • Performance evaluation with experts shows that this model is not at the level of clinicians.
  • The human evaluation approach itself could be improved. It is certainly limited, and the number of experts should be increased. The very concept of consensus is context- and time-dependent. In addition, scientific consensus often does not take minorities into account and could thus itself be a source of bias, not to mention that it is influenced by the background of the clinicians themselves.
  • The analysis of bias and harm is limited, considering that this is exploratory work. On the other hand, the medical field is an extremely sensitive field and these are ethical issues that cannot be ignored. Therefore, the analysis should be expanded to include patients as well. In addition, specific benchmark datasets for similar tasks are lacking.

Parting thoughts

photo by Amine rock hoovr on Unsplash

This paper shows how instruction prompt tuning can improve the performance of a model in a complex field such as medical question answering. However, this behavior emerges with the scale of the model. Moreover, the resulting model achieves state-of-the-art results in comparison with other models.

The 540B version of PaLM alone still manages to achieve appreciable results. Probably the training data contained several medical sources, and the model stored this information in its parameters.

Evaluation with human experts shows that scaling alone is not sufficient anyway. Even Med-PaLM itself can produce answers that are either incomplete or wrong.

In any case, it is still premature to use such a model in healthcare. First, more research is needed to ensure the model's safety. While it is difficult for the time being to imagine using it to treat a disease, it could be considered as a way to provide patients with information about diseases and medications.

On the other hand, physicians also have biases, and LMs could be efficient assistants. In the future, LMs could also be useful in mitigating bias and allowing greater access to therapies.

Lastly, Google has released an API for PaLM, with the idea that it can be used for prototyping and building generative AI applications (more here).

If you have found this interesting:

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn.

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

PCA: Bioinformatician's Favorite Tool Can Be Misleading

Stable diffusion and the brain: how AI can read our minds

Stable diffusion to fill gaps in medical image data

Why Do We Have Huge Language Models and Small Vision Transformers?

Tags: Artificial Intelligence Data Science Healthcare Medicine Science
