Diagnostic interpretation of pure tone audiograms by multimodal LLMs: A comparative study of ChatGPT-5.0 and Gemini 2.5


Karaçaylı C., Tahir E., Altuntaş E. E.

European Archives of Oto-Rhino-Laryngology, vol. 283, no. 4, pp. 2227-2236, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 283 Issue: 4
  • Publication Date: 2026
  • DOI: 10.1007/s00405-025-09932-6
  • Journal Name: European Archives of Oto-Rhino-Laryngology
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, EMBASE, MEDLINE
  • Page Numbers: pp. 2227-2236
  • Keywords: Artificial intelligence, ChatGPT, Gemini, Large language models, Pure tone audiometry, Visual diagnostic tools
  • Affiliated with Lokman Hekim University: Yes

Abstract

Objectives: This study aimed to evaluate and compare the diagnostic accuracy of two multimodal large language models—ChatGPT-5.0 Plus and Gemini 2.5—in interpreting pure tone audiograms. The primary hypothesis was that ChatGPT-5.0 Plus would outperform Gemini 2.5 in identifying auditory thresholds, determining the type and degree of hearing loss, detecting masking, and providing treatment recommendations based on standardized visual inputs.

Design: A diagnostic simulation study was conducted using 80 software-generated audiograms representing common hearing loss profiles, including normal, conductive, sensorineural, and mixed types. Each audiogram was interpreted by both models using a structured seven-question diagnostic prompt aligned with professional audiological guidelines. Two independent evaluators—an audiologist and an otolaryngologist with audiology expertise—rated each model's response on a five-point Likert scale. Inter-rater agreement and comparative analyses were performed using non-parametric statistical tests.

Results: ChatGPT-5.0 Plus outperformed Gemini on six of seven diagnostic items and in the overall score. Inter-rater agreement for ChatGPT was almost perfect overall (κ = 0.951), with moderate concordance on Q1 (κ = 0.490) and very high agreement across the remaining items (κ = 0.912–0.981). Gemini also showed strong but lower consistency (overall κ = 0.823; item-level 0.833–0.949), with the weakest agreement on treatment recommendation (Q7). Comparative analyses revealed statistically significant advantages for ChatGPT in air and bone conduction threshold identification, classification of hearing loss type and degree, and diagnostic accuracy. Differences in masking evaluation (Q5) were not significant. Median score differences of 1–2 points on the 5-point scale underscored the clinical relevance of ChatGPT's superior performance.
Conclusions: ChatGPT-5.0 Plus demonstrated superior accuracy and consistency in interpreting pure tone audiograms compared to Gemini 2.5. While not suitable as standalone diagnostic tools, large language models may serve as useful adjuncts in primary care and telehealth environments for preliminary audiological assessment. Further validation in real-world clinical settings is necessary before broader implementation.
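The inter-rater agreement figures reported above (κ) are Cohen's kappa values, which correct raw percent agreement for agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal self-contained sketch of this computation for two raters is shown below; the Likert ratings here are hypothetical illustrations, not data from the study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items.

    p_o: observed proportion of exact agreements.
    p_e: chance agreement, from each rater's marginal category frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[k] / n) * (counts_b[k] / n)
              for k in set(counts_a) | set(counts_b))
    if p_e == 1.0:          # degenerate case: both raters used one category
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 5-point Likert scores from two evaluators on ten responses
eval_1 = [5, 4, 5, 3, 5, 4, 2, 5, 4, 5]
eval_2 = [5, 4, 5, 3, 4, 4, 2, 5, 4, 5]
print(round(cohens_kappa(eval_1, eval_2), 3))  # ≈ 0.85 ("almost perfect" band)
```

On the conventional Landis–Koch scale, κ above 0.80 is read as "almost perfect" agreement, which is the interpretation implied for the overall ChatGPT value of 0.951.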