Passing part of a medical licensing exam doesn’t make ChatGPT a good doctor


The difference between such specialized medical AIs and ChatGPT, though, lies in the data they have been trained on. “Such AIs may have been trained on tons of medical literature and may even have been trained on similar complex cases as well,” Kirpalani explained. “These may be tailored to understand medical terminology, interpret diagnostic tests, and recognize patterns in medical data that are relevant to specific diseases or conditions. In contrast, general-purpose LLMs like ChatGPT are trained on a wide range of topics and lack the deep domain expertise required for medical diagnosis.”

This lack of domain expertise showed in how ChatGPT handled medical shades of gray. “Health care providers learn to look at lab values as part of a bigger picture, and we know that if the ‘normal range’ for a blood test result is ‘10–20’ that a value of 21 is very different from a value of 500,” Kirpalani said. Lacking that nuance, ChatGPT got sidetracked whenever test results fell even slightly outside the normal range.
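To make the distinction concrete, here is a minimal sketch of the two ways of reading a lab result: a naive in-or-out-of-range check, which treats 21 and 500 identically, versus a graded check that weighs how far the value strays from the reference interval. The thresholds and function names are illustrative assumptions, not anything from the study or from how ChatGPT actually works.

```python
# Illustrative only: the reference range and severity cutoff are made up for this example.
def naive_flag(value, low=10, high=20):
    """Flags any result outside the range, treating 21 and 500 the same."""
    return "abnormal" if value < low or value > high else "normal"

def graded_flag(value, low=10, high=20):
    """Weighs how far the result falls outside the range, as a clinician would."""
    if low <= value <= high:
        return "normal"
    span = high - low
    excess = (low - value) if value < low else (value - high)
    if excess <= 0.5 * span:
        return "borderline - likely not clinically significant"
    return "markedly abnormal - warrants attention"

print(naive_flag(21), "|", naive_flag(500))    # abnormal | abnormal (indistinguishable)
print(graded_flag(21), "|", graded_flag(500))  # borderline ... | markedly abnormal ...
```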

But there was another, graver issue. Part of the reason that AMIE and most other medical AIs are not publicly available is what they do when they are wrong. And what they do is exactly what ChatGPT does: They try to con you into thinking they’re right.

Medical AI con man

While ChatGPT got the diagnosis wrong in more than half of the Medscape cases, the rationale it offered for its answers, even the incorrect ones, was remarkably good. “This was both interesting and concerning. On the one hand, this tool is really effective at taking complex topics and simplifying explanations. On the other hand, it can be very convincing, even if it’s wrong, because it explains things in such an understandable way,” Kirpalani said.

The problem with large language models, and with modern AIs in general, is that they have no real comprehension of the subject matter they talk or write about. All they do is predict the next word in a sentence based on probabilities derived from the huge amount of text (medical or not) they ingested during training. Sometimes this leads to AI hallucinations that reduce the responses to gibberish. More often, though, chatbots make compelling, well-structured, and well-written arguments for something that may not be true.
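As a rough illustration of that mechanism, here is a toy sketch: a simple bigram model that picks each next word purely from probabilities counted in a tiny training text. This is nowhere near how GPT-scale models are built (they use neural networks over vastly more data), but it shows the essential point that fluency comes from statistics over ingested text, not from any understanding of whether the output is true.

```python
import random
from collections import Counter, defaultdict

# Toy training text; a real model ingests vast amounts of text, medical or otherwise.
corpus = ("the patient has a fever the patient has a rash "
          "the rash is benign the fever is high").split()

# Count which word follows which word in the training text.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(word):
    """Pick a next word in proportion to how often it followed `word` in training."""
    options = follows[word]
    if not options:  # word never appeared mid-text: dead end
        return None
    words, counts = zip(*options.items())
    return random.choices(words, weights=counts)[0]

# Generate text one word at a time. Any fluency comes from the statistics alone,
# not from any grasp of patients, fevers, or rashes.
out = ["the"]
while len(out) < 10:
    word = next_word(out[-1])
    if word is None:
        break
    out.append(word)
print(" ".join(out))
```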
