OpenEvidence's gen AI bests ChatGPT, Med-PaLM 2 on US medical licensing exam

Top Story

By: Olivia Roger

Ref: PR Newswire, OpenEvidence

Published: 07/14/2023

OpenEvidence, which is working to align large language models to the medical domain, said on Friday that its generative artificial intelligence (AI) tool is the first to score above 90% on the United States Medical Licensing Examination (USMLE), outperforming OpenAI's ChatGPT and Google's Med-PaLM 2.

The three-step exam demands a broad understanding of the biomedical and clinical sciences, testing both factual recall and the decision-making needed to provide safe and effective patient care. OpenEvidence measured its model's performance using the official 2022 USMLE sample exam.

Earlier this year, Google revealed that Med-PaLM 2 scored 85% on a USMLE practice test, an improvement over the more than 67% achieved by its first-generation model. Meanwhile, a study published in PLOS Digital Health in February found that OpenAI's ChatGPT performed at or near the exam's passing threshold of about 60% accuracy, while GPT-4 scored 88% on medical challenge problems, according to findings posted to the arXiv preprint server.

OpenEvidence founder Daniel Nadler said "single-point differences…translate into highly impactful differences in AI performance, since the USMLE contains hundreds of questions." He explained that "each additional USMLE score point represents multiple additional correct answers – each one of which corresponds to medical knowledge that could translate into life or death for a patient, if the AI system is used as a physician co-pilot in a clinical setting."
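As a rough sketch of that arithmetic (the 300-question exam length below is assumed purely for illustration, not a figure from OpenEvidence or the USMLE), converting percentage scores into question counts shows how each score point maps to several individual answers:

```python
# Illustrative sketch only: the exam length is an assumption for the example,
# not a published figure.

TOTAL_QUESTIONS = 300  # assumed exam length ("hundreds of questions" per Nadler)

def correct_answers(score_pct: float, total: int = TOTAL_QUESTIONS) -> int:
    """Approximate number of correct answers implied by a percentage score."""
    return round(score_pct / 100 * total)

# One additional score point on a 300-question exam is three more correct answers.
print(correct_answers(91.0) - correct_answers(90.0))  # -> 3
```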

Lowest error rate

OpenEvidence said its tool also achieved the lowest error rate of any AI on the exam, making 77% fewer errors than ChatGPT, 24% fewer than GPT-4 and 31% fewer than Med-PaLM 2.
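Those relative-error figures follow from simple arithmetic on the underlying scores. A minimal sketch, assuming a score just above 90% for OpenEvidence's tool (the company stated only "over 90%") and GPT-4's reported 88%:

```python
# Sketch of the relative-error arithmetic behind "X% fewer errors".
# The 90.9% input is an assumed illustration; 88% is GPT-4's reported score.

def pct_fewer_errors(score_a: float, score_b: float) -> float:
    """Relative reduction in error rate of model A versus model B (scores in %)."""
    errors_a = 100.0 - score_a  # error rate of model A
    errors_b = 100.0 - score_b  # error rate of model B
    return (errors_b - errors_a) / errors_b * 100.0

# A score just above 90% versus GPT-4's reported 88%:
print(round(pct_fewer_errors(90.9, 88.0)))  # -> 24, i.e. ~24% fewer errors
```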

Meanwhile, ChatGPT recently passed a radiology board-style exam and attained an average score of 77%, higher than the human candidates, in a mock obstetrics and gynaecology clinical exam, although it has fallen short on American College of Gastroenterology exams.
