IMAJ | Volume 28, Issue 4, April 2026, pages 232-236
1 Department of Ophthalmology, Shamir Medical Center (Assaf Harofeh), Zerifin, Israel
2 Wolfson Medical Center, Holon, Israel
3 Gray Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
Summary
Background:
The rapid evolution of large language models warrants updated benchmarking in ophthalmology to determine whether newer versions offer clinically meaningful improvements over earlier models and human comparators.
Objectives:
To evaluate the diagnostic accuracy of ChatGPT-4o and ChatGPT-5 in ophthalmic cases and to compare it with previously reported results of ChatGPT-3.5, residents, and specialists.
Methods:
This retrospective cohort study was conducted in a single academic tertiary medical center. We reviewed the records of patients admitted to the ophthalmology department from June 2022 to January 2023 and created two clinical cases for each patient: the first based on the medical history alone (Hx), the second adding the findings of the clinical examination (Hx and Ex). For each case, we asked ChatGPT-4o and ChatGPT-5 for the three most likely diagnoses. We then compared the accuracy rates (at least one correct diagnosis) with previously reported results of ChatGPT-3.5, residents, and specialists.
Results:
A total of 63 cases were analyzed, first using history alone and then with examination findings. Based on history alone, GPT-5 and GPT-4o correctly identified 73% and 70% of cases, respectively, outperforming GPT-3.5 (54%, P < 0.05) and approaching the accuracy of residents (75%) and attending physicians (71%, P < 0.05). When physical examination findings were included, diagnostic accuracy rose to 94% for GPT-5 and 89% for GPT-4o, surpassing GPT-3.5 (68%, P < 0.05) and closely matching or exceeding human performance (residents 94%, attendings 87%).
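The abstract does not state which statistical test produced the reported P values. As an illustration only, the comparison of accuracy rates can be sketched with a standard two-proportion z-test on counts back-calculated from the reported percentages (e.g., GPT-5 at 73% of 63 cases is approximately 46/63, GPT-3.5 at 54% is approximately 34/63); the helper below is a hypothetical reconstruction, not the authors' analysis.

```python
from math import sqrt, erf

def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for the difference of two independent proportions,
    using the normal approximation with a pooled standard error.
    NOTE: illustrative reconstruction; the study's actual test is unspecified."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Approximate counts derived from the reported history-only accuracies (n = 63 each):
# GPT-5: 73% -> ~46/63 correct; GPT-3.5: 54% -> ~34/63 correct
z, p = two_proportion_z_test(46, 63, 34, 63)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these approximate counts the test yields P < 0.05, consistent with the direction of the reported comparison, though the exact value depends on the true counts and the test the authors used.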
Conclusions:
ChatGPT-4o and ChatGPT-5 significantly outperformed GPT-3.5 and achieved diagnostic accuracy similar to, or even higher than, that of clinicians in ophthalmology cases.