
Assessing the Quality of Narrative Feedback in Undergraduate Medical Education Using Large Language Models
Tauqeer Iftikhar
Purpose: High-quality narrative feedback is critical for competency development in medical education, but its quality varies widely. This study aimed to evaluate the quality of narrative feedback in undergraduate medical education (UGME) using the Quality of Assessment for Learning (QuAL) score and explore the potential of large language models (LLMs) to automate this assessment.
Methods: A sample of 7,470 de-identified evaluations was obtained from UGME pre-clerkship clinical skills courses. Eleven trained raters scored each evaluation using the QuAL score.
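For context, the QuAL score is a composite of three subscales (evidence of the learner's performance, presence of a suggestion for improvement, and linkage of that suggestion to the observed behavior). The sketch below shows one way a rater's item-level judgments could be combined into the 0–5 composite; the field names and helper class are illustrative assumptions, not the study's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class QualRating:
    """One rater's item-level judgments for a single narrative comment.

    Subscale ranges follow the QuAL rubric: evidence of the learner's
    performance is rated 0-3; the remaining two items are binary.
    """
    evidence: int            # 0-3: amount of specific evidence of performance
    has_suggestion: bool     # was a suggestion for improvement given?
    suggestion_linked: bool  # is that suggestion tied to observed behavior?

    def composite(self) -> int:
        """Sum the subscales into the 0-5 QuAL composite score."""
        if not 0 <= self.evidence <= 3:
            raise ValueError("evidence subscale must be between 0 and 3")
        # A linkage point only counts when a suggestion is actually present.
        linkage = self.has_suggestion and self.suggestion_linked
        return self.evidence + int(self.has_suggestion) + int(linkage)


# Example: moderate evidence (2/3) plus a behavior-linked suggestion -> 4/5
print(QualRating(evidence=2, has_suggestion=True, suggestion_linked=True).composite())
```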
Results: The median QuAL score was 2 out of 5 (IQR: 2–4). Although 70.2% of comments included a suggestion for improvement, only 36.9% of those suggestions explicitly referenced a specific observed performance, and 52.6% of all comments contained no direct evidence of the learner’s performance. Seventeen percent of comments achieved the maximum score of 5, indicating that a substantial minority were of excellent quality despite the low median.
Discussion: Most feedback offered advice for improvement; however, it often lacked specificity and a clear link to the observed behavior, potentially limiting its educational value. Evidence of the learner’s performance was also frequently insufficient. This human-rated dataset will next be used to train and validate an NLP model for automated QuAL scoring and to develop a model for real-time feedback coaching.
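The abstract does not specify the planned modeling approach, so the following is a minimal sketch of one plausible setup, assuming the human-rated comments are used to fine-tune a pretrained transformer that predicts the 0–5 composite as a six-class problem. The model choice (distilbert-base-uncased), column names, and hyperparameters are illustrative assumptions, not the study's pipeline.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)


def build_qual_trainer(comments: list[str], qual_scores: list[int]) -> Trainer:
    """Fine-tune a small transformer to predict the QuAL composite (0-5)."""
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=6)  # classes 0-5

    # Build a dataset of (comment text, human-assigned composite score).
    ds = Dataset.from_dict({"text": comments, "label": qual_scores})
    ds = ds.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                padding="max_length", max_length=256),
        batched=True)
    split = ds.train_test_split(test_size=0.2, seed=42)

    args = TrainingArguments(output_dir="qual-scorer",
                             num_train_epochs=3,
                             per_device_train_batch_size=16,
                             evaluation_strategy="epoch")
    return Trainer(model=model, args=args,
                   train_dataset=split["train"],
                   eval_dataset=split["test"])
```

In practice the held-out evaluation would compare model predictions against the human raters (for example, via weighted kappa or exact agreement) before the model is used for any real-time feedback coaching.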