Large language models encode clinical knowledge
Summary
Google's medical AI team tested whether large language models could answer the kinds of questions doctors face. To do so, they built a benchmark called MultiMedQA by combining six existing medical question-answering datasets with a new one, HealthSearchQA, drawn from commonly searched consumer health questions. The point was not just to see whether a model could memorize textbook facts but to test its reasoning across realistic medical situations.
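The previous best result on USMLE-style multiple-choice questions (the MedQA dataset) was 50.3%, set by PubMedGPT. Flan-PaLM, Google's instruction-tuned variant of PaLM, raised that to 67.6%, a gain of more than 17 percentage points. Med-PaLM, a version of Flan-PaLM further adapted to the medical domain through instruction prompt tuning, was then evaluated on free-text answers. A clinician panel rated Med-PaLM's answers as close to clinician-written answers on several quality dimensions but still inferior overall, with gaps remaining on factual completeness and on bias.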
The encouraging finding is that language models can encode meaningful clinical knowledge and answer in a form that physicians and patients find useful. The caveats matter just as much. The models still hallucinate, the gap to expert performance varies by question type, and questions about training-data equity remain open. The central question for AI in medicine is not whether models can know things, but whether that knowledge can be made trustworthy and accountable in clinical settings.