Expert Evaluation of Artificial Intelligence Generated Answers to Frequently Asked Questions About Rhinoplasty
(Rinoplasti Sık Sorulan Sorularına Yapay Zekâ Yanıtlarının Uzman Onaylı Karşılaştırması)


Şerifle S., Çelik B., Bulut K. Ş., Gül F., Bozdemir K., Babademez M. A.

Genel Tip Dergisi, vol. 36, issue 2026, 2026 (Scopus, TR Dizin)

  • Publication Type: Article / Full Article
  • Volume: 36, Issue: 2026
  • Publication Date: 2026
  • DOI: 10.54005/geneltip.1756002
  • Journal Name: Genel Tip Dergisi
  • Indexed In: Scopus, Directory of Open Access Journals, TR Dizin (ULAKBİM)
  • Keywords: Artificial intelligence, comprehension, large language models, postoperative care, rhinoplasty
  • Affiliated with Lokman Hekim Üniversitesi: Yes

Abstract

Aim: Large language models (LLMs) such as ChatGPT-4, DeepSeek, and Gemini are increasingly explored as tools for patient education and clinical decision support. However, concerns remain about their factual accuracy, completeness, and readability, especially when they address frequently asked patient questions in postoperative care. This study directly compared three leading AI models, ChatGPT-4, DeepSeek, and Gemini, in terms of accuracy, clarity, relevance, and completeness when answering common postoperative rhinoplasty FAQs. A secondary objective was to assess the readability of the AI-generated responses for a general patient audience.

Methods: We selected 14 frequently asked questions based on authoritative AAO-HNS guidelines. Responses from each AI model were independently evaluated by 15 otorhinolaryngologists using a 5-point Likert scale across four domains: accuracy, clarity, relevance, and completeness. Readability was measured with the Flesch Reading Ease Score and the Flesch–Kincaid Grade Level. Data were analyzed with appropriate statistical tests to identify significant differences among the models.

Results: Expert evaluations showed significant performance differences among the models. DeepSeek underperformed ChatGPT-4 and Gemini in both accuracy (p=0.00003) and completeness (p=0.0042). No statistically significant differences were observed for clarity (p=0.52) or relevance (p=0.42). Although readability scores did not differ significantly across models, all responses were judged too complex for the average patient to understand fully.

Conclusions: While ChatGPT-4 and Gemini demonstrated higher accuracy and completeness than DeepSeek, none of the evaluated AI models produced content that met essential patient readability standards. These findings underscore the need for improved content accessibility and ongoing human oversight before LLMs can be reliably integrated into clinical patient education. This study establishes an important benchmark and highlights the urgency for future AI development to prioritize both factual integrity and true patient comprehension.
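For reference, the two readability metrics named in the Methods are standard published formulas. A minimal sketch of how such scores are computed from raw text counts (the function names and the example counts below are illustrative, not taken from the study's data):

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text
    (roughly 60-70 corresponds to plain English)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate U.S. school grade
    needed to understand the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts only: 100 words, 5 sentences, 150 syllables.
print(flesch_reading_ease(100, 5, 150))   # ~59.6: "fairly difficult"
print(flesch_kincaid_grade(100, 5, 150))  # ~9.9: about 10th-grade level
```

In practice, word, sentence, and syllable counting is done heuristically (or with a library such as textstat), so reported scores can vary slightly between tools; the formulas themselves are fixed.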