ABSTRACT
Objective
The aim of this study was to assess the difficulty of artificial intelligence (AI)-generated multiple-choice questions (MCQs) created with different prompts across several large language model (LLM)-based chatbots, compared with human-written questions.
Methods
We generated case-based MCQs on obstetrics and gynecology using two distinct prompts across four LLM-based chatbots. After expert review, the MCQs were administered to 97 medical students who had completed their clerkship training in obstetrics and gynecology. Item difficulty indices were then calculated for each MCQ.
Results
The mean difficulty index of the AI-generated questions was 0.30. One prompt produced questions with a mean difficulty index of 0.34 (classified as difficult), while the other produced a lower mean difficulty index of 0.25 (classified as too difficult). In contrast, the mean difficulty index of the human-written questions was 0.63, indicating a moderate level of difficulty.
Conclusion
Our study highlights the challenges of using AI-generated MCQs in medical education. Although AI offers promising benefits for question generation, the questions produced were generally too difficult for undergraduate medical students. This underscores the need for more detailed and contextually informed prompt designs to better align AI outputs with assessment requirements. While LLM-based chatbots enhance efficiency in question generation, expert review remains essential to ensure the appropriateness and quality of the items.
INTRODUCTION
Case-based multiple-choice questions (MCQs) are widely used in medical education,1 particularly in fields such as obstetrics and gynecology,2, 3 because of their efficiency in assessing knowledge and cognitive skills. However, creating high-quality MCQs is a labor-intensive and time-consuming process that demands significant expertise. To address this challenge, leveraging artificial intelligence (AI) for the development of MCQs has been proposed as a promising solution, given its potential to play a pivotal role in medical education.4-6
Large language models (LLMs) have demonstrated significant potential to automate the generation of MCQs for medical education.7-12 More specifically, recent literature provides evidence supporting the validity of ChatGPT-generated MCQs.13 However, the findings suggest that employing simple prompts, such as “write four MCQs about this topic,” often leads to suboptimal outputs.14, 15 Well-designed11 and detailed16 prompts should therefore be used to generate higher-quality MCQs. It remains unclear, however, whether such prompts can generate MCQs at targeted difficulty levels—defined as the proportion of test-takers who answer an item correctly—across different LLM-based chatbots. This study aims to fill this gap by examining the difficulty levels of AI-generated MCQs in obstetrics and gynecology using various prompts across four chatbots.
This study examines how prompt design and model selection influence the difficulty of AI-generated MCQs in obstetrics and gynecology and addresses how these compare with human-written questions. Our goal was to enhance understanding of how AI can be effectively utilized for case-based MCQ generation and to provide evidence supporting the development of better MCQs that are appropriately challenging for medical students.
METHODS
Study Setting, Design and Participants
This study was conducted in the Department of Obstetrics and Gynecology at University of Health Sciences Türkiye, Gülhane Faculty of Medicine, during the 2023–2024 academic year. Undergraduate medical education at the school spans six years.
The first three years focus on basic medical sciences and offer limited clinical exposure. In the fourth year, the obstetrics and gynecology clerkship is delivered in six rotations of 35–50 students each, and the passing grade is based on a multiple-choice examination. To avoid bias, only fourth-year students who had completed the obstetrics and gynecology clerkship were included. Using convenience sampling, we recruited 97 (88.1%) volunteer participants. The participants completed a test comprising 18 obstetrics and gynecology MCQs: 14 AI-generated questions and four human-written ones. Because the participants were native Turkish speakers, the AI-generated questions, originally in English, were translated into Turkish. After completion of the clerkship, the test was administered in a supervised classroom setting, with informed consent obtained from all participants beforehand.
Multiple-Choice Question Generation
In March 2024, we aimed to generate four MCQs with each of four LLM-based chatbots: ChatGPT-3.5, ChatGPT-4, Gemini 1.0 Pro, and Mixtral-8x7B-Instruct-v0.1. For each chatbot, we used two prompts: Kıyak’s16 prompt and Zuckerman et al.’s11 prompt, both of which have previously been shown to be effective.9, 11 Because of this proven effectiveness, these prompts have also been used to develop a custom GPT for generating case-based MCQs in medical education.17
To generate the questions, we selected two topics from the learning objectives of the obstetrics and gynecology clerkship: the differential diagnosis of vaginal bleeding and the management of postpartum hemorrhage. These topics were provided to the four chatbots using the two prompts. Ultimately, we generated 14 questions—seven each on vaginal bleeding and postpartum hemorrhage—because Gemini 1.0 Pro declined to create a question using Zuckerman et al.’s11 prompt, likely due to its requirement for an NBME-style question. We used ChatGPT’s webpage for ChatGPT-3.5, the Case-based MCQ Generator18 for ChatGPT-4, Gemini’s webpage for Gemini 1.0 Pro, and Hugging Face’s webpage19 for Mixtral. We appended “differential diagnosis of vaginal bleeding” or “management of postpartum hemorrhage” to the prompts and generated each question in a separate conversation. Additionally, because Kıyak’s16 prompt requires selecting a difficulty level, we specifically requested difficult questions.
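For illustration, the sketch below shows how this generation step could be scripted against an OpenAI-style chat API, appending each topic (and, for Kıyak’s prompt, the requested difficulty level) to a prompt template and sending one independent request per question. This is a hypothetical sketch under stated assumptions, not the procedure used in the study: question generation was performed through the chatbots’ web interfaces, and the template placeholder and model name below are illustrative rather than the exact prompts or model configurations.

```python
# Hypothetical sketch: scripting MCQ generation through an OpenAI-style chat API.
# The study itself used the chatbots' web pages, with each question generated
# in a separate conversation; this simply mirrors that workflow in code.
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

# Placeholder for Kıyak's or Zuckerman et al.'s prompt template (not reproduced here).
PROMPT_TEMPLATE = (
    "<MCQ-generation prompt template>\n"
    "Topic: {topic}\n"
    "Difficulty: difficult"  # Kıyak's prompt asks for a target difficulty level
)

TOPICS = [
    "differential diagnosis of vaginal bleeding",
    "management of postpartum hemorrhage",
]

for topic in TOPICS:
    # A new, independent request per question mirrors the separate conversation pages.
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any supported chat model could be substituted
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(topic=topic)}],
    )
    print(response.choices[0].message.content)
```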
In addition to the 14 AI-generated questions, four expert-written MCQs—two on vaginal bleeding and two on postpartum hemorrhage—were provided to the participants.
The test was administered independently of the obligatory examinations in the medical school and did not affect the students’ grades. Informed consent was obtained from each participant. The study was approved by the University of Health Sciences Türkiye, Gülhane Scientific Research Ethics Committee (approval no: 2024/04, date: 24.04.2024).
Expert Panel
The AI-generated questions were reviewed and revised by two obstetrician–gynecologists with expertise in LLMs to ensure clinical accuracy, consistency, and suitability for the intended student level. Revisions were limited to minor refinements in wording, clinical accuracy, and internal consistency, leaving the fundamental structure of the questions unchanged; none of the items required major revision.
Statistical Analysis
In line with our research question, we calculated the item (question) difficulty index for each question as the number of correct answers divided by the total number of test-takers. The difficulty index ranges from zero to one, with one representing the easiest level and zero representing the most difficult level. We classified values <0.30 as “too difficult”, 0.30–0.40 as “difficult”, 0.40–0.80 as “moderate”, and >0.80 as “easy”.20, 21
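To make the computation explicit, the following is a minimal sketch in Python of the difficulty index and the classification described above. The example counts are hypothetical, and the handling of values that fall exactly on a category boundary (0.30, 0.40, 0.80) is an assumption, since the published cut-offs overlap at those points.

```python
def difficulty_index(correct_answers: int, total_test_takers: int) -> float:
    """Proportion of test-takers who answered the item correctly (0 = hardest, 1 = easiest)."""
    return correct_answers / total_test_takers

def classify_difficulty(p: float) -> str:
    """Map a difficulty index to the categories used in this study (boundary handling assumed)."""
    if p < 0.30:
        return "too difficult"
    elif p < 0.40:
        return "difficult"
    elif p <= 0.80:
        return "moderate"
    else:
        return "easy"

# Hypothetical example: 61 of 97 test-takers answer an item correctly.
p = difficulty_index(61, 97)                   # ≈ 0.63
print(f"{p:.2f} -> {classify_difficulty(p)}")  # 0.63 -> moderate
```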
RESULTS
Of the fourth-year medical students who had completed their obstetrics and gynecology clerkship, 88.1% participated in the 18-item multiple-choice test. The mean difficulty index of the 14 AI-generated items was 0.30, at the boundary between the “too difficult” and “difficult” categories. In contrast, the four expert-authored items showed a substantially higher mean difficulty index of 0.63, within the “moderate” range (Table 1).
Notable differences were observed when items were classified by prompt type. Items generated using Kıyak’s16 prompt (n=8) were classified as “difficult,” with a mean difficulty index of 0.34, whereas items derived from Zuckerman et al.’s11 prompt (n=5) showed a lower mean difficulty index of 0.25 and were classified as “too difficult.” These findings suggest that prompt design plays a critical role in shaping the difficulty of AI-generated questions.
In the two thematic areas evaluated—differential diagnosis of vaginal bleeding (n=7) and management of postpartum hemorrhage (n=7)—AI-generated items consistently exhibited lower difficulty indices than expert-written items. Whereas the AI-generated items were uniformly difficult, the expert-written items covered a more balanced spectrum of difficulty. Specifically, none of the AI-generated items reached the “easy” classification (>0.80), whereas two of the four expert-written items fell within the moderate-to-easy range.
DISCUSSION
This study aimed to evaluate the difficulty levels of human-written and AI-generated MCQs in obstetrics and gynecology. The findings highlight the combined influence of prompt design and human judgment on the psychometric properties of MCQs in medical education. To the best of our knowledge, this is the first study to investigate the effect of different prompts and chatbots on the difficulty of MCQs.13 Our key finding is that, when administered to a group of undergraduate medical students, the AI-generated questions were on average too difficult, with a mean difficulty index of 0.30 at the boundary of the “too difficult” category. Questions generated using Kıyak’s16 prompt were classified as difficult, whereas those generated using Zuckerman et al.’s11 prompt were classified as too difficult. In comparison, the human-written questions reflected a moderate level of difficulty.
While our study found the AI-generated questions to be difficult, previous studies of ChatGPT-generated case-based questions reported moderate mean difficulty indices such as 0.71,11 0.69,10 and 0.68.9 This discrepancy may be attributed to differences among the participant groups: variations in background knowledge, experience, and familiarity with the subject matter could have influenced performance. Our results emphasize that prompt design is not just a technical input for AI tools but an educational intervention that can shape learning outcomes by modulating assessment difficulty. This positions prompt engineering as an emerging field of educational design.
The findings of this study highlight a critical issue in AI-based educational content creation: mismatched difficulty levels. Assessments that fail to accurately reflect the expected competency of learners may lead to suboptimal or even negative learning outcomes. When AI-generated items consistently fall outside the optimal difficulty range—particularly in formative assessments—this can compromise both learning efficiency and student motivation. Consequently, aligning the prompt structure with the curriculum level and learner profiles is not merely desirable, but a pedagogical necessity.
Our findings indicate that AI-generated questions, particularly those created using Zuckerman et al.’s11 prompt, tend to be too difficult for undergraduates, likely because the prompt does not specify an intended difficulty level. In contrast, Kıyak’s16 prompt produced questions of varying difficulty, ranging from too difficult to moderate, even though we explicitly requested difficult questions. This variability may arise from two factors: (1) the limited detail provided in the prompts and (2) the inherent limitations of LLMs.22-24 The prompt templates could be improved through prompt-engineering tactics that allow additional context about the local curriculum and target population to be included.25 However, the inherent limitations of LLMs mean that experts will still need to review and revise the outputs to ensure that the MCQs are appropriate for the target population. Better prompts can nevertheless reduce the effort required after generation.
Study Limitations
Several limitations should be considered when interpreting the results of this study. First, the study was conducted with a specific group of fourth-year medical students at a single institution, which may limit the generalizability of the findings. Second, the study evaluated only a small number of questions generated by four AI models. Third, the study focused on two specific topics in obstetrics and gynecology, and the results might differ for other medical topics. Furthermore, the reviewers’ expertise in LLMs could have affected their assessment of question clarity and perceived difficulty. Future research should expand the study to include multiple institutions and a more diverse group of participants to enhance generalizability, investigate the performance of various AI models and versions to identify those most effective in generating optimally difficult questions, and assess AI-generated question difficulty across a wider range of medical topics.
CONCLUSION
In conclusion, our study highlights both the challenges and the potential of utilizing AI-generated MCQs in medical education, particularly in obstetrics and gynecology. Despite promising advancements in AI, the questions generated were generally too difficult for undergraduate medical students. This underscores the necessity for more detailed and contextually informed prompt designs to better align AI outputs with assessment requirements. Additionally, while LLM-based chatbots provide valuable support for efficient question generation, expert review remains crucial to ensure the appropriateness and quality of the items.


