Welcome to Pulse: Amplify, where we sit down with the leaders and changemakers shaping the future of health.
A recent Nature Medicine study went viral after reporting that ChatGPT Health under-triaged more than half of emergency cases when tested using clinician-written scenarios. The finding raised serious concerns about whether consumer AI tools are safe for medical triage.
But researchers from Macquarie University’s Australian Institute of Health Innovation took a closer look at the study design and suspected the results might reflect the evaluation format rather than the AI’s clinical capability.
In this episode of Pulse: Amplify, Louise and George speak with David Fraile Navarro about their follow-up study testing five frontier AI models across more than a thousand trials. Their research suggests that when AI systems are evaluated using more natural, patient-style interactions rather than exam-style prompts, triage performance improves significantly.
The discussion explores why prompt structure, forced answer formats, and restrictions on clarifying questions can dramatically alter model behaviour, and why designing realistic evaluation methods is essential as millions of people begin using AI for health advice.
The conversation also examines broader questions:
How should AI triage tools be evaluated?
What role should clinicians play in AI-mediated care?
And what do patients need to know before trusting AI with health decisions?
References
Fraile Navarro D, Magrabi F, Coiera E. (2026). Evaluation format, not model capability, drives triage failure in the assessment of consumer Health AI. Zenodo. https://doi.org/10.5281/zenodo.18975048
Connect with David Fraile Navarro:
LinkedIn
Visit Pulse+IT.news to subscribe to breaking digital news, weekly newsletters and a rich treasure trove of archival material. People in the know get their news from Pulse+IT – Your leading voice in digital health news.