Read the conclusions section from the paper you linked:
> GPT-4o’s performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.
The ability of an LLM to pass a multiple-choice test has no bearing on its ability to connect what it observes in the real world to diagnoses in actual cases. Being a doctor is not taking a multiple-choice test. The paper largely concludes that GPT is likely to be used as a study aid by med students, not by experienced doctors in clinical practice.
From the protocol section:
> This protocol for eliciting a response from ChatGPT was as follows: “Answer the following question and provide an explanation for your answer choice.” Data procured from ChatGPT included its selected response, the rationale for its choice, and whether the response was correct (“accurate” or “inaccurate”). Responses were deemed correct if ChatGPT chose the correct multiple-choice answer. To prevent memory retention bias, each vignette was processed in a new chat session.
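In other words, the entire protocol boils down to something like the sketch below (my rough Python reconstruction; the model name, client usage, and letter-extraction step are assumptions on my part, not the authors' code):

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "Answer the following question and provide an explanation for your answer choice."

def extract_option_letter(reply: str) -> str | None:
    """Hypothetical helper: pull the first standalone option letter (A-E) out of the reply."""
    match = re.search(r"\b([A-E])\b", reply)
    return match.group(1) if match else None

def grade_vignette(vignette_text: str, answer_key: str, model: str = "gpt-4o") -> dict:
    """Send one MCQ as a fresh request with no shared history, mirroring the
    paper's 'new chat session per vignette' rule, and mark it accurate only
    if the selected letter matches the answer key."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{vignette_text}"}],
    )
    reply = response.choices[0].message.content
    chosen = extract_option_letter(reply)
    return {
        "chosen": chosen,
        "rationale": reply,
        "result": "accurate" if chosen == answer_key else "inaccurate",
    }
```

Note that the only thing scored is whether the extracted letter matches the answer key; the rationale is recorded but never graded.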
So all this says is that, when you present ChatGPT with a test-format question, a limited set of options, and a guarantee that one of them is correct, it usually picks the right one. That is a much lower hurdle than what you are suggesting. And further, under limitations:
> This study contains several limitations. The 750 MCQs are robust, although they are “USMLE-style” questions and not actual USMLE exam questions. The exclusion of clinical vignettes involving imaging findings limits the findings to text-based accuracy, which potentially skews the assessment of disciplinary accuracies, particularly in disciplines such as anatomy, microbiology, and histopathology. Additionally, the study does not fully explore the quality of the explanations generated by the AI or its ability to handle complex, higher-order information, which are crucial components of medical education and clinical practice—factors that are essential in evaluating the full utility of LLMs in medical education. Previous research has highlighted concerns about the reliability of AI-generated explanations and the risks associated with their use in complex clinical scenarios [10,12]. These limitations are important to consider as they directly impact how well these tools can support clinical reasoning and decision-making processes in real-world scenarios. Moreover, the potential influence of knowledge lagging effects due to the different datasets used by GPT-3.5, GPT-4, and GPT-4o was not explicitly analyzed. Future studies might compare MCQ performance across various years to better understand how the recency of training data affects model accuracy and reliability.
To highlight one specific detail from that:
> Additionally, the study does not fully explore the quality of the explanations generated by the AI or its ability to handle complex, higher-order information, which are crucial components of medical education and clinical practice—factors that are essential in evaluating the full utility of LLMs in medical education.
Finally:
> Previous research has highlighted concerns about the reliability of AI-generated explanations and the risks associated with their use in complex clinical scenarios [10,12]. These limitations are important to consider as they directly impact how well these tools can support clinical reasoning and decision-making processes in real-world scenarios.
You're taking "LLMs are much more accurate than medical students on licensing exam questions" and extrapolating that to "LLMs can currently function as doctors."
What the study says is: "Given a set of text-only questions and a list of possible answers that includes the correct one, one LLM routinely selects the correct answer (as long as you exclude questions involving medical imaging, which it cannot evaluate), but we have not done the validation needed to show that it arrived at that answer in the correct way. It may be useful (or already in use) among students as a study tool, so we should ensure that medical curricula take this into account and provide proper guidelines and education around its limitations."
This is not the success you believe it to be.