Evaluating ChatGPT and DeepSeek for Obstetric Anesthesia Q&A: A Comparative Study Based on ACOG Guidelines and Prompting Strategies

DOI:https://doi.org/10.65613/736774

Ting Tang¹, Shoujun Tao², Xingcen Wang¹, Huijuan Huang¹, Jianzhi Shi¹, Hua Zhang¹

¹Department of Anesthesiology and Pain Management, Hangzhou Third Hospital, Affiliated to Zhejiang Chinese Medical University, 310009, China
²Department of Anesthesiology, Hangzhou First People’s Hospital, China

Corresponding Author:
Hua Zhang, Hangzhou Third Hospital, Affiliated to Zhejiang Chinese Medical University, 310009, China
Email: [email protected]

Abstract

LLMs are increasingly applied in clinical medicine, but their performance in protocol-driven fields such as anesthesiology remains underexplored. Obstetric anesthesia requires timely and accurate decision-making with strict adherence to established guidelines. This study investigated how prompting strategies and model architectures influence LLM performance in a high-stakes clinical domain by evaluating four models—ChatGPT-4o, ChatGPT-4o-mini, DeepSeek-V3, and DeepSeek-R1—on guideline-based obstetric anesthesia questions using three prompting strategies: Isolated Prompting (IP), Batch Prompting (BP), and Contextual Isolated Prompting (CIP). Eleven clinical questions derived from the 2019 ACOG Practice Bulletin No. 209 were posed to each model, and responses were rated on a 5-point Likert scale across four clinical dimensions (Accuracy, Overconclusiveness, Supplementary Value, and Incompleteness) and assessed for readability using Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL), with statistical analyses including ANOVA and post hoc comparisons. No significant differences were found across models for Accuracy, Overconclusiveness, or Incompleteness (p > 0.05); however, Supplementary Value differed significantly (p = 0.008), with ChatGPT-4o under CIP outperforming DeepSeek-V3 under IP (p = 0.021). ChatGPT-4o demonstrated the highest overall readability (lowest FKGL), while ChatGPT-4o-mini’s readability improved significantly under CIP; additionally, DeepSeek-V3 under BP outperformed DeepSeek-R1 under IP in FKGL scores (p = 0.0294). Overall, LLMs showed comparable core clinical accuracy in obstetric anesthesia tasks, with ChatGPT-4o providing the most readable and context-rich responses, and prompting strategy—particularly CIP—enhancing response quality, supporting the potential use of LLMs as clinical aids contingent on thoughtful prompt design and domain validation.

Keywords: Large Language Models; Obstetric Anesthesia; Prompting Strategies; ChatGPT; DeepSeek

1.INTRODUCTION

Large Language Models (LLMs), such as those developed by OpenAI and DeepSeek, have demonstrated remarkable capabilities in natural language understanding and generation. Their applications in clinical medicine are gaining increasing attention, with significant potential in areas ranging from patient communication and literature integration to decision support[1-3]. In obstetric care specifically, LLMs have been explored for tasks like explaining labor analgesia options to patients, summarizing complex comorbidity profiles for interdisciplinary teams, and providing on-demand guideline reminders, use cases that address longstanding challenges such as information asymmetry and provider burnout. However, in medical subfields that heavily rely on standardized protocols and precise judgment, such as anesthesiology, the actual reliability of LLMs remains to be thoroughly validated. Unlike general medicine, where clinical judgment may allow for flexibility, obstetric anesthesia is governed by strict guidelines to mitigate risks like maternal hypotension, fetal bradycardia, and anesthesia-related complications, which makes LLM accuracy and guideline adherence non-negotiable.

Obstetric anesthesia, as a critical branch of anesthesiology, requires timely and accurate clinical decision-making that profoundly impacts maternal and neonatal outcomes. The American College of Obstetricians and Gynecologists (ACOG) published Practice Bulletin No. 209: Obstetric Analgesia and Anesthesia in 2019[4], providing evidence-based recommendations and expert consensus on labor analgesia, cesarean section anesthesia, and the management of parturients with comorbidities.

This bulletin is considered the gold standard for obstetric anesthesia practice, with recommendations updated based on rigorous systematic reviews and expert panel consensus—making it an ideal framework for evaluating LLM performance in a guideline-driven domain. In such a well-defined field, the ability of LLMs to accurately reproduce or comprehend guideline content becomes a crucial metric for evaluating their potential in clinical support.

Previous studies have explored the performance of LLMs in medical domains, such as answering internal medicine examination questions and generating radiology reports[5]. For example, Kung et al. (2023) demonstrated that ChatGPT achieved passing scores on USMLE exams, while Wang et al. (2025) reported that ChatGPT-4o provided clinically relevant recommendations for lumbar disc herniation management. However, these studies focus on general medical or surgical specialties, where LLM errors may have less immediate consequences than in obstetric anesthesia, and research focusing on anesthesiology, particularly the specialized area of obstetric anesthesia, remains limited. This field necessitates the integration of complex factors, including fetal safety, maternal hemodynamics, and labor progression, thereby imposing higher demands on model capabilities[6]. For instance, selecting an anesthetic agent for a parturient with preeclampsia requires balancing maternal blood pressure control with fetal oxygenation; such decisions depend on nuanced guideline interpretation, not just factual recall.

Moreover, prompt strategies significantly influence the quality of LLM responses. Research indicates that employing different prompt strategies (such as chain-of-thought, role definition, and scenario setting) can markedly enhance LLM performance in medical tasks[7]. Sivarajkumar et al. (2024) showed that contextual prompting improved LLM accuracy in clinical natural language processing tasks by 15–20%, while Mejia et al. (2024) highlighted that role-definition prompts (e.g., “Act as a senior anesthesiologist”) increased response comprehensiveness in spine surgery guidelines. These findings underscore the need to tailor prompting strategies to the unique demands of obstetric anesthesia, where context such as labor stage, maternal comorbidities, and fetal status directly impacts decision-making. In clinical practice, designing effective questioning methods to improve LLM application in specialized fields warrants in-depth exploration.

This study systematically evaluates the performance of four mainstream LLMs (DeepSeek V3, DeepSeek R1, ChatGPT-4o, and ChatGPT-4o-mini) in answering questions derived from the 2019 ACOG Practice Bulletin No. 209. Three distinct questioning strategies were implemented to investigate how the interplay between prompting methods and model architectures affects their adherence to clinical guidelines, thereby offering evidence-based insights for LLM applications in clinical anesthesiology support.

While prior research has predominantly focused on diagnostic accuracy or procedural knowledge, this study extends the evaluation to include dimensions such as Supplementary Value and Readability, which are critical for real-world clinical utility. For instance, models that provide contextual explanations or prioritize accessible language may be more effective in educational or patient-facing scenarios. Supplementary Value is particularly relevant in obstetric anesthesia, where clinicians often require additional context (e.g., drug dosage adjustments for renal impairment, troubleshooting tips for epidural failures) to translate guidelines into practice. Readability, meanwhile, is essential for rapid information retrieval during time-sensitive scenarios, such as emergency cesarean sections, where providers may have only seconds to access key guideline points. The inclusion of prompting strategies like Contextual Isolated Prompting (CIP) allows us to assess whether priming LLMs with clinical background information enhances their ability to generate nuanced, guideline-compliant responses. This approach aligns with recent calls for domain-specific validation of LLMs in high-stakes settings [8, 9].

2.METHODS

This research used publicly accessible versions of four advanced large language models (LLMs): ChatGPT-4o, ChatGPT-4o mini (available for download from the Microsoft Store), DeepSeek-V3, and DeepSeek-R1 (accessible via https://www.deepseek.com/). Because all models were publicly available and no human participants or identifiable patient data were involved, institutional review board approval was not required. To minimize potential priming effects and ensure response consistency, we implemented a strict temporal separation protocol when testing different model versions.

Model responses were evaluated using a 5-point Likert scale across four dimensions: Accuracy, Overconclusiveness, Supplementary Value, and Incompleteness [8,9]. In addition, readability was assessed for each response using the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL) indices [9]. Higher FRE scores indicate easier readability, whereas lower FKGL scores indicate that the text is suitable for readers at a lower grade level.
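The two readability indices follow the standard published formulas, which depend only on words per sentence and syllables per word. The sketch below is illustrative and is not the tool used in this study; in particular, the syllable counter is a rough vowel-group heuristic (an assumption), so its scores can differ slightly from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_scores(words: int, sentences: int, syllables: int) -> tuple[float, float]:
    """Return (FRE, FKGL) from raw counts using the standard formulas."""
    if words == 0 or sentences == 0:
        raise ValueError("text must contain at least one word and one sentence")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl


def readability(text: str) -> tuple[float, float]:
    """Tokenize a response naively and score it."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_scores(len(words), len(sentences), syllables)


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
fre, fkgl = flesch_scores(100, 5, 150)
print(f"FRE={fre:.1f}, FKGL={fkgl:.1f}")  # FRE=59.6, FKGL=9.9
```

Longer sentences and longer words lower FRE and raise FKGL, which is why the two indices move in opposite directions for the same text.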

Accuracy was scored as follows: (1) completely incorrect; (2) more incorrect than correct (>75% incorrect); (3) approximately equal correct and incorrect; (4) more correct than incorrect (>75% correct); and (5) completely correct. Accuracy scoring was directly anchored to ACOG Practice Bulletin No. 209, with each response evaluated against specific guideline recommendations. For example, a response to a question about contraindications to epidural analgesia was scored “5” only if it included all absolute contraindications (e.g., coagulopathy, infection at the injection site) and did not introduce any false contraindications.

Overconclusiveness was scored as: (1) non-overconclusive (0% conflicting); (2) minimally overconclusive (<25% conflicting); (3) partially overconclusive (50% conflicting); (4) mostly overconclusive (>75% conflicting); and (5) fully overconclusive (100% conflicting). Overconclusiveness was defined as making definitive claims that are not supported by ACOG guidelines (e.g., stating that “epidural analgesia is never safe for preeclamptic patients” when guidelines recommend cautious use with appropriate hemodynamic monitoring).

Supplementary Value was scored as: (1) no supplementary value (0% added); (2) low supplementary value (25% added); (3) moderate supplementary value (50% added); (4) high supplementary value (>75% added); and (5) exceptional supplementary value (100% novel). Supplementary Value reflected the extent to which the response provided useful context beyond direct guideline text, such as: (1) evidence-based rationales for recommendations; (2) practical implementation tips (e.g., “monitor maternal blood pressure every 5 minutes after epidural placement”); (3) references to related guideline sections; or (4) warnings about common pitfalls (e.g., “avoid excessive local anesthetic doses to reduce fetal exposure and toxicity risk”).

Incompleteness was scored as: (1) fully complete (100% covered); (2) complete (>75% covered); (3) moderately complete (50–75% covered); (4) incomplete (25–50% covered); and (5) very incomplete (0–25% covered). Incompleteness was assessed by comparing each response with the full scope of ACOG recommendations relevant to the question. For example, a response describing anesthesia options for cesarean delivery was scored “3” if it omitted spinal anesthesia as a first-line choice, which is a key recommendation in the guideline.
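Because the four scales run in different directions (5 is the best score for Accuracy and Supplementary Value, whereas 1 is the best score for Overconclusiveness and Incompleteness), it can help to record each rating with its dimension made explicit. The following is a minimal sketch of such a record; the class and field names are hypothetical and are not taken from the study's materials.

```python
from dataclasses import dataclass

# Direction of each 5-point scale, as described in the rubric above.
HIGHER_IS_BETTER = {
    "Accuracy": True,
    "Overconclusiveness": False,
    "Supplementary Value": True,
    "Incompleteness": False,
}


@dataclass(frozen=True)
class Rating:
    """One reviewer score for one response (hypothetical record layout)."""
    model: str
    strategy: str
    question: int
    dimension: str
    score: int

    def __post_init__(self):
        if self.dimension not in HIGHER_IS_BETTER:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be an integer from 1 to 5")


def best_possible(dimension: str) -> int:
    """Return the score that represents the ideal response on a dimension."""
    return 5 if HIGHER_IS_BETTER[dimension] else 1


rating = Rating("ChatGPT-4o", "CIP", 1, "Accuracy", 5)
print(rating.score == best_possible(rating.dimension))  # True
```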

Prompting Strategies:

Isolated Prompting (IP): Each question was asked in a new, separate session. This strategy simulated ad-hoc queries from clinicians seeking quick guideline references without prior context (e.g., a trainee asking about labor analgesia options during a busy shift).
Batch Prompting (BP): All questions were asked sequentially in a single conversation.
Contextual Isolated Prompting (CIP): Each question was asked in a new session preceded by a fixed clinical background sentence (“Next, I would like to ask you some questions related to obstetric analgesia and anesthesia.”).
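The three strategies differ only in how questions are grouped into chat sessions. The sketch below shows one way to express that grouping; it makes no API calls, the question strings are placeholders, and sending the CIP background sentence as its own preceding turn (rather than prepending it to the question text) is an assumption.

```python
CONTEXT = (
    "Next, I would like to ask you some questions related to "
    "obstetric analgesia and anesthesia."
)


def build_sessions(questions, strategy):
    """Return a list of sessions; each session is an ordered list of user turns."""
    if strategy == "IP":
        # One fresh session per question, no prior context.
        return [[q] for q in questions]
    if strategy == "BP":
        # A single conversation containing every question in order.
        return [list(questions)]
    if strategy == "CIP":
        # One fresh session per question, opened by the fixed background sentence.
        return [[CONTEXT, q] for q in questions]
    raise ValueError(f"unknown strategy: {strategy}")


# Placeholder labels standing in for the 11 guideline-derived questions.
questions = [f"Q{i}" for i in range(1, 12)]
for name in ("IP", "BP", "CIP"):
    sessions = build_sessions(questions, name)
    print(name, len(sessions), len(sessions[0]))
# IP 11 1
# BP 1 11
# CIP 11 2
```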

The responses from ChatGPT and DeepSeek were independently assessed by a panel of three reviewers (pain specialists) to ensure the reliability of the scoring process. Discrepancies in ratings were resolved through consultation with a fourth senior investigator. Statistical analysis was performed using SPSS 26.0. Inter-model comparisons within each evaluation dimension were conducted using Pearson’s chi-square test, while one-way analysis of variance (ANOVA) was applied to compare performance differences across models under identical dimensions.
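For readers unfamiliar with the test, the F statistic of a one-way ANOVA is the ratio of between-group to within-group mean squares. The analysis in this study was run in SPSS; the pure-Python sketch below only illustrates the calculation on made-up scores and does not compute a p value, which requires the F distribution.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three groups (not study data).
toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(one_way_anova(toy))  # (3.0, 2, 6)
```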

The three prompting strategies were designed to simulate common clinical scenarios: IP mimics ad-hoc queries from trainees, BP reflects sequential decision-making during patient rounds, and CIP incorporates contextual priming akin to preoperative briefings. For example, in CIP, the background sentence (“Next, I would like to ask you some questions related to obstetric analgesia and anesthesia.”) explicitly frames the query within a clinical workflow, potentially triggering more structured responses. This methodology builds on prior work showing that contextual prompts improve LLM performance in specialized domains [7].

3.RESULTS

In this study, we evaluated the performance of four large language models (DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, and ChatGPT-4o-mini) by inputting 11 clinical scenario questions derived from ACOG Practice Bulletin No. 209. Each question was posed using three distinct prompt strategies to assess the models’ responses. The outputs were evaluated using a 5-point Likert scale across four dimensions: Accuracy, Overconclusiveness, Supplementary Value, and Incompleteness. This approach allowed for a comprehensive comparison of each model’s alignment with established clinical guidelines. The responses generated by all evaluated models were systematically documented in Supplementary 1 (for more information, please consult the corresponding author).

Significant differences in Supplementary Value were observed among the four models across the three prompting strategies (one-way ANOVA: F(11, 120) = 2.47, p = 0.008, R² = 0.185). Post hoc Tukey’s HSD tests indicated that the V3_IP (DeepSeek-V3 with isolated prompting) and 4o_CIP (ChatGPT-4o with contextual isolated prompting) groups differed significantly (adjusted p = 0.021, 95% CI: -2.096 to -0.086). No other pairwise comparisons reached statistical significance (adjusted p > 0.05 for all other model-strategy combinations) (Figure 1).

Figure 1. Comparison of Supplementary Value Scores Across Models and Prompting Strategies. Asterisks (*, **, and ***) represent statistical significance levels of P < 0.05, P < 0.01, and P < 0.001, respectively.
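The reported degrees of freedom and effect size can be cross-checked from the study design. The sketch below assumes that each of the 12 model-strategy groups contributes one score per question (11 per group), which is consistent with the reported df of 120, and uses the standard identity relating R² (eta squared) to F in a one-way ANOVA.

```python
import math


def eta_squared(f_stat: float, df_between: int, df_within: int) -> float:
    """Proportion of variance explained, recovered from F and its df."""
    return f_stat * df_between / (f_stat * df_between + df_within)


groups = 4 * 3               # four models x three prompting strategies
observations = groups * 11   # assumption: one score per question per group
df_between = groups - 1
df_within = observations - groups

print(df_between, df_within)                               # 11 120
print(math.comb(groups, 2))                                # 66
print(round(eta_squared(2.47, df_between, df_within), 3))  # 0.185
```

The middle value is the number of distinct pairs among 12 groups, that is, the number of pairwise comparisons available to a post hoc Tukey test.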

No significant differences were observed among the four models across the three prompting strategies in terms of Accuracy (p = 0.743), Overconclusiveness (p = 0.118), and Incompleteness (p = 0.391), as determined by one-way ANOVA (Figures 2-4).

Figure 2. Comparison of Accuracy Value Scores Across Models and Prompting Strategies.

Figure 3. Comparison of Overconclusiveness Value Scores Across Models and Prompting Strategies.

Figure 4. Comparison of Incompleteness Value Scores Across Models and Prompting Strategies.

The analysis of FRE scores revealed significant overall differences among the four models under three prompting strategies (one-way ANOVA: F(11, 120) = 2.195, p = 0.0188, R² = 0.168). However, post hoc Tukey’s tests demonstrated no statistically significant differences in any of the 66 pairwise comparisons among the 12 model-strategy groups (adjusted p > 0.05 for every comparison), with confidence intervals consistently spanning zero (Figure 5).

Figure 5. Comparison of FRE Scores Across Models and Prompting Strategies.

Significant differences in FKGL scores were observed across models and prompting strategies (ANOVA: F(11,120) = 6.261, p < 0.0001). ChatGPT-4o consistently achieved the lowest FKGL scores across all strategies, indicating the highest readability. Although differences among its internal strategies were not statistically significant, BP yielded slightly lower values. In contrast, FKGL scores for ChatGPT-4o-mini varied significantly by prompting strategy, with both IP and BP producing higher scores than CIP (e.g., 4o-mini_IP vs 4o-mini_CIP: Mean Diff. = 4.234, p = 0.0007), suggesting that CIP enhances readability in models with lower baseline performance. Among DeepSeek variants, V3 under BP showed significantly lower FKGL scores than R1 under IP (Mean Diff. = -3.254, 95% CI [-6.341, -0.167], adj. p = 0.0294), indicating better readability in this specific comparison. Other pairwise differences among DeepSeek models were not statistically significant.

Figure 6. Comparison of FKGL Scores Across Models and Prompting Strategies. Asterisks (*, **, ***, and ****) represent statistical significance levels of P < 0.05, P < 0.01, P < 0.001, and P < 0.0001 respectively.

Beyond quantitative metrics, qualitative analysis of responses in Supplementary 1 (for more information, please consult the corresponding author) revealed nuanced differences in how models handled complex clinical scenarios. For instance, when asked about contraindications to regional analgesia (Q1), ChatGPT-4o under CIP provided detailed stratification of absolute vs. relative contraindications, while DeepSeek-V3 under IP offered a more concise list without prioritization. Similarly, in discussing epidural-related maternal fever (Q3), ChatGPT-4o explicitly differentiated non-infectious mechanisms from true chorioamnionitis, whereas DeepSeek-R1 under BP focused predominantly on thermoregulatory explanations. These variations highlight how prompting strategies can steer LLMs toward either comprehensive or streamlined responses, depending on clinical needs.

4.DISCUSSION

This study systematically evaluated the performance of four LLMs (DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, and ChatGPT-4o-mini) in responding to clinical scenario questions derived from ACOG Practice Bulletin No. 209[4]. The models were assessed across six dimensions: Accuracy, Overconclusiveness, Supplementary Value, Incompleteness, FRE, and FKGL, using three distinct prompting strategies: IP, BP, and CIP. To our knowledge, this is the first study to evaluate LLM performance in obstetric anesthesia using both quantitative (Likert scales, readability metrics) and qualitative (response structure, contextualization) measures, providing a holistic assessment of clinical utility.

Across the dimensions of Accuracy, Overconclusiveness, and Incompleteness, no statistically significant differences were observed among the four models, regardless of the prompting strategy employed. This uniformity suggests that, in terms of content correctness and fundamental trustworthiness, these LLMs perform comparably. This finding aligns with recent research by Sandmann et al. (2025), who reported similar accuracy across DeepSeek and ChatGPT models in general clinical decision-making tasks[12]. The consistent accuracy observed here may reflect the models’ training on large datasets that include medical guidelines, leading to a shared baseline of guideline knowledge, clinical reasoning, and factual alignment among contemporary models. However, it also highlights a potential ceiling effect when using guideline-derived prompts alone to discriminate between models on correctness-focused metrics.

Importantly, the comparable performance in core dimensions also underscores the growing readiness of LLMs for use in obstetric anesthesiology, a field characterized by high-stakes, time-sensitive decision-making such as during emergency cesarean sections, neuraxial anesthesia complications, or the management of obstetric hemorrhage[2, 10]. In such scenarios, LLMs could serve as “second opinions” for junior providers, rapid-reference tools for senior clinicians, or decision support during off-hours staffing, reducing the time spent searching through lengthy guidelines. For example, a trainee facing an emergency cesarean section for fetal distress could use an LLM to quickly confirm anesthesia options and hemodynamic monitoring recommendations, potentially improving decision-making speed and reducing errors. The models’ ability to instantly retrieve and summarize complex clinical recommendations into digestible, context-specific formats may reduce decision-making latency and improve adherence to guidelines, factors closely linked to maternal outcomes[11].

In contrast, the models demonstrated statistically significant differences in terms of Supplementary Value, with ChatGPT-4o and ChatGPT-4o-mini generally scoring higher than DeepSeek-V3. Notably, ChatGPT-4o under the CIP strategy significantly outperformed DeepSeek-V3 under the IP strategy. This finding indicates that both model architecture and prompting strategy substantially influence the depth, contextuality, and educational value of responses[12]. ChatGPT-4o’s superior Supplementary Value may be attributable to its training data, which may include more clinical case studies and educational materials, allowing it to generate context-rich responses. In contrast, DeepSeek models, while accurate, may prioritize factual recall over explanatory content, limiting their utility in educational or complex clinical scenarios.

CIP appears particularly advantageous in prompting models to generate content with richer clinical framing. The additional background context likely activates model-internal priors, resulting in responses that go beyond the minimal correct answer to include interpretive or educational elements. This is especially relevant as LLMs are increasingly used for patient-facing communication and clinician education. By preloading clinical context, CIP can help generate more tailored, didactic content that aligns with real-world consultation or training scenarios[7]. For example, a CIP response to a patient’s question about epidural risks might include both guideline-based risk data and reassurance framed in plain language, whereas an IP response might only list risks without context, reducing patient understanding and trust.

In terms of readability, ChatGPT-4o consistently delivered the lowest FKGL scores, indicating the most readable output across all conditions. This suggests that its advanced architecture may be better tuned to produce concise, accessible language suitable for clinical communication. Interestingly, while prompting strategy did not significantly affect ChatGPT-4o’s FKGL scores, models with lower baseline performance, such as ChatGPT-4o-mini, benefited noticeably from CIP, producing outputs with significantly improved readability. This finding reinforces the potential of CIP as a mitigation strategy for lower-capacity models in clinical settings. Regarding FRE scores, although the overall ANOVA indicated significant differences among the model-strategy groups, no pairwise comparisons reached statistical significance in the post hoc tests. This pattern likely reflects a weak effect size rather than a methodological issue. ChatGPT-4o’s readability advantage is particularly relevant in time-constrained clinical settings, where providers need to quickly grasp key information. An FKGL score of ~8 corresponds to roughly an eighth-grade US reading level, making the content accessible to interdisciplinary team members (e.g., nurses, midwives) who may not have specialized anesthesiology training.

As LLMs continue to improve in clinical grounding and interpretability, their role in obstetric anesthesia may evolve from passive reference tools to interactive assistants—for instance, helping design personalized postpartum analgesia plans or guiding anesthetic decisions in patients with cardiac conditions. Nevertheless, such applications must be approached with caution and rigorous validation, particularly in safety-critical specialties. The present findings suggest a foundational readiness for LLMs in accuracy-sensitive applications but also underscore the importance of optimizing prompting strategies and interface design for maximum clinical utility.

This study has several limitations. First, the evaluation was based on a limited number of structured guideline-derived scenarios, which may not fully capture the diversity of real-world clinical interactions. Second, the assessment relied on subjective expert ratings, which—despite using standardized scoring criteria—could introduce rater bias. Third, all prompts and outputs were conducted in English, which limits the generalizability of findings to multilingual or non-English-speaking settings. Future studies should explore LLM performance in more dynamic, multilingual, or real-time clinical settings, ideally involving direct clinician interaction and outcome-based evaluation.

The superior Supplementary Value of ChatGPT-4o under CIP suggests that this model-strategy combination may be particularly suited for educational applications, such as generating teaching materials for residents or patient education handouts. For example, in responding to Q9 (optimal post-cesarean analgesia), ChatGPT-4o under CIP not only listed medications but also included evidence-based rationales for multimodal regimens, whereas DeepSeek-V3 under IP provided a bullet-point list without justification. Similarly, the readability advantage of ChatGPT-4o aligns with prior findings that LLMs with lower FKGL scores reduce cognitive load for clinicians under time pressure [9]. These insights can guide the selection of LLMs for specific clinical tasks—e.g., ChatGPT-4o for patient communication and DeepSeek-V3 for quick reference. DeepSeek-V3’s concise, accurate responses make it suitable for rapid guideline checks (e.g., “What is the maximum dose of bupivacaine for epidural analgesia?”), while ChatGPT-4o’s context-rich outputs are better for complex tasks like case discussions or patient counseling.

5.CONCLUSION

This study evaluated the performance of four LLMs in answering obstetric anesthesia questions under three prompting strategies. While all models performed comparably in core clinical dimensions, ChatGPT-4o and 4o-mini offered more context-rich and informative responses, particularly under contextual isolated prompting. Notably, ChatGPT-4o consistently produced outputs with the highest readability, which may enhance comprehension for a wide range of users, including clinicians seeking fast and clear access to guideline-based information. In contrast, smaller models such as 4o-mini and DeepSeek-V3 showed greater variability depending on prompt format, highlighting the importance of prompt design in optimizing model performance. As LLMs evolve, they show promise as clinical aids in obstetric anesthesia, though careful prompt design and domain validation remain essential.

Acknowledgements

We used AI tools ChatGPT and DeepSeek to answer questions from the research survey, and the full answers are attached to the supplementary materials. We would like to thank the panel of pain specialists and the senior obstetric anesthesiologist for their meticulous review of model responses, as well as the OpenAI and DeepSeek teams for making their models publicly accessible for research purposes.

Funding
Not applicable

Competing interests
The authors have nothing to disclose.

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

LLMs: Large Language Models

ACOG: The American College of Obstetricians and Gynecologists

FRE: Flesch Reading Ease

FKGL: Flesch–Kincaid Grade Level

IP: Isolated Prompting

BP: Batch Prompting

CIP: Contextual Isolated Prompting

ANOVA: Analysis of Variance

References

[1] Stokel-Walker C, Van Noorden R. What ChatGPT and generative AI mean for science. Nature. 2023. 614(7947): 214-216.

[2] Goh E, Gallo R, Hom J, et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open. 2024. 7(10): e2440969.

[3] Temsah A, Alhasan K, Altamimi I, et al. DeepSeek in Healthcare: Revealing Opportunities and Steering Challenges of a New Open-Source Artificial Intelligence Frontier. Cureus. 2025. 17(2): e79221.

[4] American College of Obstetricians and Gynecologists’ Committee on Practice B. ACOG Practice Bulletin No. 209: Obstetric Analgesia and Anesthesia. Obstet Gynecol. 2019. 133(3): e208-e225.

[5] Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023. 2(2): e0000198.

[6] Metzger L, Teitelbaum M, Weber G, Kumaraswami S. Complex Pathology and Management in the Obstetric Patient: A Narrative Review for the Anesthesiologist. Cureus. 2021. 13(8): e17196.

[7] Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study. JMIR Med Inform. 2024. 12: e55318.

[8] Mejia MR, Arroyave JS, Saturno M, et al. Use of ChatGPT for Determining Clinical and Surgical Treatment of Lumbar Disc Herniation With Radiculopathy: A North American Spine Society Guideline Comparison. Neurospine. 2024. 21(1): 149-158.

[9] Wang S, Wang Y, Jiang L, et al. Assessing the clinical support capabilities of ChatGPT 4o and ChatGPT 4o mini in managing lumbar disc herniation. Eur J Med Res. 2025. 30(1): 45.

[10] Brügge E, Ricchizzi S, Arenbeck M, et al. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial. BMC Med Educ. 2024. 24(1): 1391.

[11] Silva Y, Araújo FG, Amorim T, Martins EF, Felisbino-Mendes MS. Obstetric analgesia in labor and its association with neonatal outcomes. Rev Bras Enferm. 2020. 73(5): e20180757.

[12] Sandmann S, Hegselmann S, Fujarski M, et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat Med. 2025.
