Publications

Publication No. 1

Publication Year: 2025

Journal: International Journal of Surgery (London, England), IF: 12.5

Title: ChatGPT’s role in alleviating anxiety in total knee arthroplasty consent process: a randomized controlled trial pilot study

Full Author: Wenyi Gan, Jianfeng Ouyang, Guorong She, Zhaowen Xue, Lingxuan Zhu, Anqi Lin, Weiming Mou, Aimin Jiang, Chang Qi, Quan Cheng, Peng Luo, Hua Li, Xiaofei Zheng

Link: https://doi.org/10.1097/JS9.0000000000002223

Summary: Recent advancements in artificial intelligence (AI) tools like ChatGPT have expanded possibilities for patient education, yet their impact on perioperative anxiety in total knee arthroplasty (TKA) patients remains unexplored. In this single-blind, randomized controlled pilot study from April to July 2023, 60 patients were randomly allocated using sealed envelopes to either ChatGPT-assisted or traditional surgeon-led informed consent groups. In the ChatGPT group, physicians used ChatGPT 4.0 to provide standardized, comprehensive responses to patient queries during the consent process, while maintaining their role in interpreting and contextualizing the information. Outcomes were measured using the Hospital Anxiety and Depression Scales (HADS), Perioperative Apprehension Scale-7 (PAS-7), Visual Analogue Scales for Anxiety and Pain (VAS-A, VAS-P), Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and satisfaction questionnaires. Of the 55 patients completing the study, the ChatGPT group showed significantly lower anxiety scores after informed consent (HADS-A: 10.48 ± 3.84 vs 12.75 ± 4.12, P = .04, Power = .67; PAS-7: 12.44 ± 3.70 vs 14.64 ± 2.11, P = .01, Power = .85; VAS-A: 5.40 ± 1.89 vs 6.71 ± 2.27, P = .02, Power = .75) and on the fifth postoperative day (HADS-A: 8.33 ± 3.20 vs 10.71 ± 3.83, P = .01, Power = .79; VAS-A: 3.41 ± 1.58 vs 4.64 ± 1.70, P = .008, Power = .85). The ChatGPT group also reported higher satisfaction with preoperative education (4.22 ± 0.51 vs 3.43 ± 0.84, P<.001, Power = .99) and overall hospitalization experience (4.11 ± 0.65 vs 3.46 ± 0.69, P = .001, Power = .97). No significant differences were found in depression scores, knee function, or pain levels. ChatGPT-assisted informed consent effectively reduced perioperative anxiety and improved patient satisfaction in TKA patients.
While these preliminary findings are promising, larger studies are needed to validate these results and explore broader applications of AI in preoperative patient education.

Citation: Gan W, Ouyang J, She G, Xue Z, Zhu L, Lin A, Mou W, Jiang A, Qi C, Cheng Q, et al. ChatGPT’s role in alleviating anxiety in total knee arthroplasty consent process: a randomized controlled trial pilot study. Int J Surg. 2025. doi:10.1097/JS9.0000000000002223

Publication No. 2

Publication Year: 2025

Journal: NPJ Digital Medicine, IF: 12.4

Title: Computational frameworks transform antagonism to synergy in optimizing combination therapies

Full Author: Jinghong Chen, Anqi Lin, Aimin Jiang, Chang Qi, Zaoqu Liu, Quan Cheng, Shuofeng Yuan, Peng Luo

Link: https://doi.org/10.1038/s41746-025-01435-2

Summary: While drug combinations are increasingly important in disease treatment, predicting their therapeutic interactions remains challenging. This review systematically analyzes computational methods for predicting drug combination effects through multi-omics data integration. We comprehensively assess key algorithms, including DrugComboRanker and AuDNNsynergy, and evaluate integration approaches encompassing kernel regression and graph networks. The review elucidates applications of artificial intelligence in predicting synergistic and antagonistic drug effects.

Citation: Chen J, Lin A, Jiang A, Qi C, Liu Z, Cheng Q, Yuan S, Luo P. Computational frameworks transform antagonism to synergy in optimizing combination therapies. NPJ Digit Med. 2025;8(1):44. doi:10.1038/s41746-025-01435-2

Publication No. 3

Publication Year: 2025

Journal: Clinical Gastroenterology and Hepatology, IF: 11.6

Title: Ensuring Consistency and Accuracy in Evaluating ChatGPT-4 for Clinical Recommendations

Full Author: Lingxuan Zhu, Weiming Mou, Peng Luo

Link: https://doi.org/10.1016/j.cgh.2024.05.028

Summary: The commentary discusses the study by Chang et al. on the use of ChatGPT-4 for providing colonoscopy follow-up recommendations based on clinical data. The authors express concerns regarding the randomness inherent in large language models (LLMs) like ChatGPT-4, which can generate inconsistent or contradictory responses to the same input. They emphasize the importance of evaluating the consistency of AI models through repeated queries to capture variability and ensure reliable outputs for clinical decision-making. Additionally, the commentary critiques the method of converting continuous or categorical variables into numerical ones by taking the mean of a range. This simplification may introduce inaccuracies and fail to reflect the clinical flexibility needed in some situations. The authors suggest that more accurate handling of variable transformations and addressing the randomness of LLM responses are crucial for improving the clinical applicability and reliability of AI systems. In conclusion, the commentary advocates for further research to ensure that AI models like ChatGPT-4 provide consistent and precise recommendations before being integrated into clinical practice.

Citation: Zhu L, Mou W, Luo P. Ensuring Consistency and Accuracy in Evaluating ChatGPT-4 for Clinical Recommendations. Clin Gastroenterol Hepatol. 2025;23(1):189-190. doi:10.1016/j.cgh.2024.05.028

Publication No. 4

Publication Year: 2024

Journal: Journal of Hematology & Oncology, IF: 29.5

Title: ChatGPT’s ability to generate realistic experimental images poses a new challenge to academic integrity

Full Author: Lingxuan Zhu, Yancheng Lai, Weiming Mou, Haoran Zhang, Anqi Lin, Chang Qi, Tao Yang, Liling Xu, Jian Zhang, Peng Luo

Link: https://doi.org/10.1186/s13045-024-01543-8

Summary: The rapid advancements in large language models (LLMs) such as ChatGPT have raised concerns about their potential impact on academic integrity. While initial concerns focused on ChatGPT’s writing capabilities, recent updates have integrated DALL-E 3’s image generation features, extending the risks to visual evidence in biomedical research. Our tests revealed that ChatGPT’s nearly barrier-free image generation feature can be used to generate experimental result images, such as blood smears, Western blots, and immunofluorescence images. Although ChatGPT’s current ability to generate experimental images is limited, the risk of misuse is evident. This development underscores the need for immediate action. We suggest that AI providers restrict the generation of experimental images, develop tools to detect AI-generated images, and consider adding “invisible watermarks” to generated images. By implementing these measures, we can better ensure the responsible use of AI technology in academic research and maintain the integrity of scientific evidence.

Citation: Zhu L, Lai Y, Mou W, Zhang H, Lin A, Qi C, Yang T, Xu L, Zhang J, Luo P. ChatGPT’s ability to generate realistic experimental images poses a new challenge to academic integrity. J Hematol Oncol. 2024;17(1):27. doi:10.1186/s13045-024-01543-8

Publication No. 5

Publication Year: 2024

Journal: JAMA Oncology, IF: 22.5

Title: Ensuring Safety and Consistency in Artificial Intelligence Chatbot Responses

Full Author: Lingxuan Zhu, Weiming Mou, Peng Luo

Link: https://doi.org/10.1001/jamaoncol.2024.4324

Summary: The commentary emphasizes the potential of artificial intelligence (AI) chatbots in providing empathetic and readable responses to cancer-related questions but raises significant concerns regarding their accuracy and consistency. While AI models are effective in certain aspects like engagement and communication, the authors stress that inaccuracies in the information provided could pose serious risks to patient safety. The inherent randomness of large language models (LLMs) can lead to inconsistent responses, further complicating their reliability. The authors argue that for AI to be safely integrated into clinical settings, models must prioritize accuracy, consistency, and safety, ensuring that they consistently deliver reliable information to patients.

Citation: Zhu L, Mou W, Luo P. Ensuring Safety and Consistency in Artificial Intelligence Chatbot Responses. JAMA Oncol. 2024;10(11):1597. doi:10.1001/jamaoncol.2024.4324

Publication No. 6

Publication Year: 2024

Journal: JAMA Internal Medicine, IF: 22.5

Title: Potential of Large Language Models as Tools Against Medical Disinformation

Full Author: Lingxuan Zhu, Weiming Mou, Peng Luo

Link: https://doi.org/10.1001/jamainternmed.2024.0020

Summary: The commentary explores the potential of large language models (LLMs) in combating medical disinformation, acknowledging both the risks and opportunities they present. While agreeing with concerns about LLMs enabling the spread of false medical information, the authors highlight that the problem predates AI technology. They argue that rather than focusing solely on restricting LLMs, efforts should be made to empower users to assess the reliability of online health information. The commentary demonstrates how well-trained LLMs can be powerful tools in identifying and correcting inaccurate medical claims. The authors’ experiment with multiple popular LLMs showed that most responses flagged misinformation, particularly in areas like vaccine safety, by providing evidence-based explanations. These models’ ability to reference the latest authoritative information, such as CDC data, reinforces their potential to challenge health misinformation, including emerging threats like novel infectious diseases. The commentary stresses that the positive potential of LLMs in addressing medical disinformation should not be overlooked.

Citation: Zhu L, Mou W, Luo P. Potential of Large Language Models as Tools Against Medical Disinformation. JAMA Intern Med. 2024;184(4):450. doi:10.1001/jamainternmed.2024.0020

Publication No. 7

Publication Year: 2024

Journal: International Journal of Surgery (London, England), IF: 12.5

Title: Advancing generative artificial intelligence in medicine: recommendations for standardized evaluation

Full Author: Anqi Lin, Lingxuan Zhu, Weiming Mou, Zizhi Yuan, Quan Cheng, Aimin Jiang, Peng Luo

Link: https://doi.org/10.1097/JS9.0000000000001583

Summary: This paper proposes a comprehensive framework for standardized evaluation of generative AI in medicine, addressing the current lack of standardized assessment methods. The authors recommend a three-pronged approach: establishing standardized scoring criteria with multiple complementary evaluation methods (including Reference Answer Accuracy Rate, Subjective Answer Accuracy Rate, and Strict Accuracy Rate), implementing rigorous evaluation processes (including pre-review alignment, multi-reviewer scoring, and independent audits), and utilizing statistical analysis to quantify scoring differences and refine evaluation methods. The recommendations aim to enhance reliability and consistency in assessing generative AI’s capabilities in healthcare applications, acknowledging both the technology’s potential and the need for careful validation before clinical implementation.

Citation: Lin A, Zhu L, Mou W, Yuan Z, Cheng Q, Jiang A, Luo P. Advancing generative artificial intelligence in medicine: recommendations for standardized evaluation. Int J Surg. 2024;110(8):4547-4551. doi:10.1097/JS9.0000000000001583

Publication No. 8

Publication Year: 2024

Journal: International Journal of Surgery (London, England), IF: 12.5

Title: Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)’s ability to interpret radiological images

Full Author: Lingxuan Zhu, Weiming Mou, Yancheng Lai, Jinghong Chen, Shujia Lin, Liling Xu, Junda Lin, Zeji Guo, Tao Yang, Anqi Lin, Chang Qi, Ling Gan, Jian Zhang, Peng Luo

Link: https://doi.org/10.1097/JS9.0000000000001359

Summary: The introduction of ChatGPT-4V’s ‘Chat with images’ feature marks the beginning of the era of large multimodal models (LMMs), allowing ChatGPT to process and answer questions based on uploaded images. This advancement has the potential to transform how surgical teams utilize radiographic data, as radiological interpretation is crucial for surgical planning and postoperative care. However, a comprehensive evaluation of ChatGPT-4V’s ability to interpret radiological images and formulate treatment plans remains to be conducted. Three types of questions were collected: (1) 87 USMLE-style questions, for which only the question stems and images were submitted, without answer options, to assess ChatGPT’s diagnostic capability. For questions involving treatment plan formulation, a five-point Likert scale was used to assess ChatGPT’s proposed treatment plan. The 87 questions were then adapted by removing detailed patient history to assess its contribution to diagnosis, and ChatGPT-4V’s diagnostic performance was also tested when only the medical history was provided. (2) 100 chest radiographs were randomly selected from the ChestX-ray8 database to test ChatGPT-4V’s ability to identify abnormal chest radiographs. (3) Cases from the ‘Diagnose Please’ section of the journal Radiology were collected to evaluate ChatGPT-4V’s performance in diagnosing complex cases. Three responses were collected for each question. ChatGPT-4V achieved a diagnostic accuracy of 77.01% on the USMLE-style questions. The average score of ChatGPT-4V’s treatment plans was 3.97 (interquartile range: 3.33-4.67). Removing detailed patient history dropped the diagnostic accuracy to 19.54% (P<0.0001). ChatGPT-4V achieved an AUC of 0.768 (95% CI: 0.684-0.851) in detecting abnormalities in chest radiographs but could not specify the exact disease in the absence of detailed patient history. For cases from ‘Diagnose Please’, ChatGPT provided diagnoses consistent with or very similar to the reference answers. ChatGPT-4V demonstrated an impressive ability to combine patient history with radiological images to make diagnoses and to design treatment plans directly from images, suggesting its potential for future application in clinical practice.

Citation: Zhu L, Mou W, Lai Y, Chen J, Lin S, Xu L, Lin J, Guo Z, Yang T, Lin A, et al. Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)’s ability to interpret radiological images. Int J Surg. 2024;110(7):4096-4102. doi:10.1097/JS9.0000000000001359

Publication No. 9

Publication Year: 2024

Journal: Cell Reports Medicine, IF: 11.7

Title: Harnessing artificial intelligence for prostate cancer management

Full Author: Lingxuan Zhu, Jiahua Pan, Weiming Mou, Longxin Deng, Yinjie Zhu, Yanqing Wang, Gyan Pareek, Elias Hyams, Benedito A. Carneiro, Matthew J. Hadfield, Wafik S. El-Deiry, Tao Yang, Tao Tan, Tong Tong, Na Ta, Yan Zhu, Yisha Gao, Yancheng Lai, Liang Cheng, Rui Chen, Wei Xue

Link: https://doi.org/10.1016/j.xcrm.2024.101506

Summary: Prostate cancer (PCa) is a common malignancy in males. The pathology review of PCa is crucial for clinical decision-making, but traditional pathology review is labor-intensive and, to some extent, subjective. Digital pathology and whole-slide imaging enable the application of artificial intelligence (AI) in pathology. This review highlights the success of AI in detecting and grading PCa, predicting patient outcomes, and identifying molecular subtypes. We propose that AI-based methods could collaborate with pathologists to reduce workload and assist clinicians in formulating treatment recommendations. We also introduce the general process and challenges in developing AI pathology models for PCa. Importantly, we summarize publicly available datasets and open-source codes to facilitate the utilization of existing data and the comparison of the performance of different models to improve future studies.

Citation: Zhu L, Pan J, Mou W, Deng L, Zhu Y, Wang Y, Pareek G, Hyams E, Carneiro BA, Hadfield MJ, et al. Harnessing artificial intelligence for prostate cancer management. Cell Rep Med. 2024;5(4):101506. doi:10.1016/j.xcrm.2024.101506

Publication No. 10

Publication Year: 2024

Journal: Resuscitation, IF: 6.5

Title: What is the best approach to assessing generative AI in medicine?

Full Author: Lingxuan Zhu, Weiming Mou, Jiarui Xie, Peng Luo, Rui Chen

Link: https://doi.org/10.1016/j.resuscitation.2024.110164

Summary: This correspondence discusses the assessment of generative AI in the field of medicine, specifically focusing on ChatGPT’s capabilities in passing the American Heart Association (AHA) Basic Life Support (BLS) and Advanced Cardiovascular Life Support (ACLS) exams. The authors critique earlier studies that attempted to evaluate ChatGPT-3.5’s performance on these exams, noting that the lack of repeated testing may have underestimated its abilities. The authors’ subsequent study, which involved multiple rounds of questioning, found that ChatGPT-3.5 was capable of passing both exams. However, earlier limitations such as the inability to process image-based questions were overcome with the release of ChatGPT-4V, which was tested on complete AHA exams. While acknowledging the importance of exams in assessing knowledge, the authors argue that evaluating AI’s clinical potential should extend beyond these exams. Just as medical students progress through various stages of evaluation—beginning with written exams and advancing to case-based assessments and clinical practice—the authors suggest that generative AI should also be assessed through simulated clinical scenarios that reflect real-world applications. They emphasize the need for comprehensive evaluations of AI tools in healthcare, moving towards practical, real-world testing to better understand their potential impact in clinical practice.

Citation: Zhu L, Mou W, Xie J, Luo P, Chen R. What is the best approach to assessing generative AI in medicine? Resuscitation. 2024;197:110164. doi:10.1016/j.resuscitation.2024.110164

Publication No. 11

Publication Year: 2024

Journal: Journal of Translational Medicine, IF: 6.1

Title: Language and cultural bias in AI: comparing the performance of large language models developed in different countries on Traditional Chinese Medicine highlights the need for localized models

Full Author: Lingxuan Zhu, Weiming Mou, Yancheng Lai, Junda Lin, Peng Luo

Link: https://doi.org/10.1186/s12967-024-05128-4

Summary: This study highlights the significant language and cultural biases inherent in large language models (LLMs), particularly in the context of Traditional Chinese Medicine (TCM). The research compares the performance of eight prominent LLMs—four developed by Chinese companies and four by Western companies—on the National Medical Licensing Examination for TCM. The results revealed that Chinese-developed models, such as Qwen-max and GLM-4, significantly outperformed their Western counterparts, such as ChatGPT-3.5 and ChatGPT-4, in answering TCM-related questions. The Western models, with their training predominantly on English-language data, struggled with the cultural nuances and specific terminology used in TCM. The study emphasizes the need for localized models that are trained on culturally relevant data to ensure better accuracy and applicability in specific fields, like TCM. It suggests that AI models tailored to local languages and medical practices can offer more precise solutions and meet the unique needs of different populations. Additionally, these localized models could contribute to preserving cultural knowledge, improving data security, and addressing specific healthcare needs. In conclusion, the research advocates for the development of AI systems that integrate multilingual and culturally diverse training to enhance their effectiveness in global healthcare settings.

Citation: Zhu L, Mou W, Lai Y, Lin J, Luo P. Language and cultural bias in AI: comparing the performance of large language models developed in different countries on Traditional Chinese Medicine highlights the need for localized models. J Transl Med. 2024;22(1):319. doi:10.1186/s12967-024-05128-4

Publication No. 12

Publication Year: 2024

Journal: Journal of Medical Internet Research, IF: 5.8

Title: Multimodal ChatGPT-4V for Electrocardiogram Interpretation: Promise and Limitations

Full Author: Lingxuan Zhu, Weiming Mou, Keren Wu, Yancheng Lai, Anqi Lin, Tao Yang, Jian Zhang, Peng Luo

Link: https://doi.org/10.2196/54607

Summary: This study evaluated the capabilities of the newly released ChatGPT-4V, a large language model with visual recognition abilities, in interpreting electrocardiogram waveforms and answering related multiple-choice questions, with the aim of assisting cardiovascular care.

Citation: Zhu L, Mou W, Wu K, Lai Y, Lin A, Yang T, Zhang J, Luo P. Multimodal ChatGPT-4V for Electrocardiogram Interpretation: Promise and Limitations. J Med Internet Res. 2024;26:e54607. doi:10.2196/54607

Publication No. 13

Publication Year: 2024

Journal: JMIR mHealth and uHealth, IF: 5.4

Title: The Evaluation of Generative AI Should Include Repetition to Assess Stability

Full Author: Lingxuan Zhu, Weiming Mou, Chenglin Hong, Tao Yang, Yancheng Lai, Chang Qi, Anqi Lin, Jian Zhang, Peng Luo

Link: https://doi.org/10.2196/57978

Summary: The increasing interest in the potential applications of generative artificial intelligence (AI) models like ChatGPT in health care has prompted numerous studies to explore its performance in various medical contexts. However, evaluating ChatGPT poses unique challenges due to the inherent randomness in its responses. Unlike traditional AI models, ChatGPT generates different responses for the same input, making it imperative to assess its stability through repetition. This commentary highlights the importance of including repetition in the evaluation of ChatGPT to ensure the reliability of conclusions drawn from its performance. Similar to biological experiments, which often require multiple repetitions for validity, we argue that assessing generative AI models like ChatGPT demands a similar approach. Failure to acknowledge the impact of repetition can lead to biased conclusions and undermine the credibility of research findings. We urge researchers to incorporate appropriate repetition in their studies from the outset and transparently report their methods to enhance the robustness and reproducibility of findings in this rapidly evolving field.

Citation: Zhu L, Mou W, Hong C, Yang T, Lai Y, Qi C, Lin A, Zhang J, Luo P. The Evaluation of Generative AI Should Include Repetition to Assess Stability. JMIR Mhealth Uhealth. 2024;12:e57978. doi:10.2196/57978

Publication No. 14

Publication Year: 2024

Journal: JCO Clinical Cancer Informatics, IF: 3.3

Title: Multimodal Approach in the Diagnosis of Urologic Malignancies: Critical Assessment of ChatGPT-4V’s Image-Reading Capabilities

Full Author: Lingxuan Zhu, Yancheng Lai, Na Ta, Liang Cheng, Rui Chen

Link: https://doi.org/10.1200/CCI.23.00275

Summary: This study critically assessed the image-interpretation capabilities of the ChatGPT-4V model in distinguishing kidney and prostate tumors from normal tissue.

Citation: Zhu L, Lai Y, Ta N, Cheng L, Chen R. Multimodal Approach in the Diagnosis of Urologic Malignancies: Critical Assessment of ChatGPT-4V’s Image-Reading Capabilities. JCO Clin Cancer Inform. 2024;8:e2300275. doi:10.1200/CCI.23.00275

Publication No. 15

Publication Year: 2023

Journal: Resuscitation, IF: 6.5

Title: ChatGPT can pass the AHA exams: Open-ended questions outperform multiple-choice format

Full Author: Lingxuan Zhu, Weiming Mou, Tao Yang, Rui Chen

Link: https://doi.org/10.1016/j.resuscitation.2023.109783

Summary: The study by Fijačko et al. tested ChatGPT’s ability to pass the AHA’s BLS and ACLS exams and found that ChatGPT failed both. A limitation of their study was that only one response was generated per question, which may have introduced bias. When three responses were generated per question, ChatGPT passed the BLS exam with an overall accuracy of 84%. When incorrectly answered questions were rewritten as open-ended questions, ChatGPT’s accuracy increased to 96% and 92.1% on the BLS and ACLS exams, respectively, allowing it to pass both exams with outstanding results.

Citation: Zhu L, Mou W, Yang T, Chen R. ChatGPT can pass the AHA exams: Open-ended questions outperform multiple-choice format. Resuscitation. 2023;188:109783. doi:10.1016/j.resuscitation.2023.109783

Publication No. 16

Publication Year: 2023

Journal: Journal of Translational Medicine, IF: 6.1

Title: Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge?

Full Author: Lingxuan Zhu, Weiming Mou, Rui Chen

Link: https://doi.org/10.1186/s12967-023-04123-5

Summary: This correspondence examines whether ChatGPT and other large language models with internet-connected databases can address the questions and concerns of patients with prostate cancer and thereby help democratize medical knowledge.

Citation: Zhu L, Mou W, Chen R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J Transl Med. 2023;21(1):269. doi:10.1186/s12967-023-04123-5