Medical chatbots fail BMJ Open accuracy test

ChatGPT, Gemini, Grok, Meta AI and DeepSeek rarely refuse to answer, and their citations look scientific until readers try to click them

The chatbots were each asked 50 health and medical questions spanning cancer, vaccines, stem cells, nutrition and athletic performance (Getty/iStock)

Five widely used AI chatbots gave answers to health and medical questions that were judged “highly problematic” in nearly one in five cases in a new study published in BMJ Open, according to reporting by The Independent. A team of seven researchers put 50 questions to the free version of each of ChatGPT, Gemini, Grok, Meta AI and DeepSeek, 250 prompts in all, spanning cancer, vaccines, stem cells, nutrition and athletic performance. Only two questions were refused outright.

The researchers had two experts independently rate every response and found that about half of all answers were “problematic” to some degree. The tools did better on vaccines and cancer than on nutrition and sports performance, but even in the stronger categories roughly a quarter of answers were flagged as problematic. Open-ended prompts, which are closer to what patients actually type into a search box, produced a much higher share of highly problematic answers than closed questions.

The study also tested whether the systems could support their claims with scientific references. When asked for ten citations, the median completeness score was 40%, and none of the chatbots produced a single fully accurate reference list across 25 attempts, The Independent reports. The errors ranged from broken links to wrong authors and papers that did not exist. In practice, that means a confident answer can arrive with a bibliography that looks academic while being unusable for verification.

The paper’s setup was deliberately adversarial: the team crafted prompts designed to push the models toward misleading outputs, a technique often used in “red teaming”. That matters because consumer-facing chatbots are increasingly offered as a first stop for health questions, and the frictionless interface rewards speed and plausibility rather than careful triage. In the study, the five systems performed “roughly the same”, suggesting the problem is not confined to one vendor’s model but reflects a broader pattern of tools trained to keep responding.

The researchers note that newer releases or paid tiers may perform better than the free versions tested in February 2025. But the free tier is the product most people use, and it is also the one most likely to be embedded in search, messaging apps and operating systems where a health query can be answered before a user ever reaches a clinician’s advice page.

In the BMJ Open dataset, the chatbots almost always answered. The citations, when requested, often could not be checked.