Science

Medical chatbots fail safety test in BMJ Open study

Reviewers flag about one fifth of answers as highly problematic across ChatGPT, Gemini, Grok, Meta AI and DeepSeek, with citation lists often incomplete or fabricated


Five widely used AI chatbots gave “highly problematic” health advice in nearly one in five answers in a new evaluation published in BMJ Open, according to an article in The Independent. Across 250 prompts, expert reviewers rated about half of responses as problematic and found that refusals to answer were rare.

The researchers tested ChatGPT, Google’s Gemini, xAI’s Grok, Meta AI and DeepSeek on 50 questions each spanning cancer, vaccines, stem cells, nutrition and athletic performance. Two experts independently assessed every response, The Independent reports, and the models clustered tightly in performance: Grok was flagged most often, but none reliably avoided errors.

The failure mode was not limited to obscure edge cases. Even in areas with large bodies of established research—vaccines and cancer—the chatbots still produced problematic answers roughly a quarter of the time, according to the write-up. The biggest drop-off came in nutrition and athletic performance, domains where online information is noisy, contradictory and often commercially motivated.

Open-ended prompts were where the systems most often went off the rails. The study found 32% of open-ended answers were “highly problematic”, compared with 7% for closed questions. That matters because real users rarely ask tidy yes/no queries; they ask for rankings, “best” supplements, or alternative treatments—exactly the kind of framing that rewards confident-sounding speculation.

Citations, when provided, did not function as a safety net. When the chatbots were asked to supply ten scientific references, the median completeness score was 40%, and none produced a fully accurate reference list in 25 attempts, The Independent reports. Errors ranged from incorrect bibliographic details to broken links and fabricated papers—problems that can be hard for non-experts to detect but can lend false authority to the surrounding text.

The authors describe the exercise as a stress test: prompts were designed to push the models toward misleading answers, a standard “red teaming” approach. That likely inflates error rates relative to neutral questions. But it also matches how people actually use these systems in moments of uncertainty—probing, anxious, and often primed to seek confirmation.

In the study, only two out of 250 questions were flatly refused. The rest received an answer polished enough to pass as clinical guidance, even when it wandered into overclaiming. The practical effect is that a chatbot can look like a second opinion while behaving more like an autocomplete engine trained on the internet.

The paper’s dataset was built using free versions of the tools available in February 2025, and performance may change as models are updated. For users, though, the interaction remains the same: a text box, a plausible response, and a reference list that usually cannot be checked in real time.