I am a medical AI researcher working at the intersection of clinical medicine, large language model evaluation, and trustworthy AI for healthcare.
My current research focuses on building evaluation and governance infrastructure for medical AI systems, including clinical LLM benchmarks, patient-facing AI safety evaluation, specialty triage, clinical agents, and human-verified reasoning datasets.
- Trustworthy medical AI and clinical LLM evaluation
- Patient-facing AI safety, escalation behavior, and in-loop governance
- Clinical agents, multimodal medical AI, and benchmark infrastructure
- Human-verified clinical reasoning datasets and medical QA-CoT pipelines
- High-stakes agent evaluation and multi-turn safety assessment
- Real-world clinical data science using NHANES, BRFSS, and EHR-derived records
- Advancing medical AI through benchmarking and competition for specialty triage
  npj Digital Medicine, 2026.
  DOI: 10.1038/s41746-026-02433-8 · Project: MedBench
- Beyond Knowledge to Agency: Evaluating Expertise, Autonomy, and Integrity in Finance with CNFinBench
  KDD 2026, accepted.
  arXiv: 2512.09506 · Project: CNFinBench · Code: GitHub
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
  CVPR 2026, accepted.
  arXiv: 2601.03054
- TyG index, depression, and cognitive dysfunction: NHANES with machine learning support
  Journal of Affective Disorders, 2025.
  DOI: 10.1016/j.jad.2025.01.051
- Smoking types and stroke risk: development of a predictive model for identifying stroke risk
  Frontiers in Physiology, 2025.
  DOI: 10.3389/fphys.2025.1528910
- Development of a predictive model for the U-shaped relationship between the triglyceride-glycemic index and depression using machine learning
  Heliyon, 2024.
  DOI: 10.1016/j.heliyon.2024.e38615
I build evaluation frameworks for medical LLMs, multimodal models, and clinical agents. My work includes MedTriage / MedBench, a real-world specialty-triage benchmark derived from hospital intake records, online guidance dialogues, and outpatient clinical notes. The benchmark evaluates strict multi-label department recommendation and supported a MedBench competition with 37 teams.
Project: MedBench
I study how medical LLMs behave in multi-turn patient-facing scenarios, especially when a model should escalate urgent symptoms rather than continue a generic conversation. I designed a 150-case simulated consultation benchmark and evaluated in-loop governance actions such as PASS, REWRITE, ASK-MORE, ESCALATE, and REFUSE.
This line of work focuses on delayed escalation, unsafe reassurance, insufficient triage, and safety-control trade-offs in patient-facing medical AI.
I contribute to clinician-audited benchmark infrastructure and human-verified clinical reasoning datasets, including rotating test pools, safety and ethics rubrics, LLM-as-a-judge calibration, and QA-CoT validation pipelines.
This work aims to make medical AI evaluation more auditable, clinically grounded, and deployment-aware.
I also contributed to CNFinBench, a benchmark for evaluating financial LLM agents across expertise, autonomy, and integrity. The project studies how LLMs behave under decision-intensive workflows, tool-use settings, and multi-turn adversarial pressure.
Project: CNFinBench · Code: GitHub
I apply statistical modeling and machine learning to large-scale clinical datasets, including NHANES and BRFSS. My first-author and co-first-author studies examine the TyG index, depression, cognitive dysfunction, smoking behavior, and stroke-risk prediction, with cohorts ranging from 1,352 to 273,028 participants.
Programming and analysis: Python, R, statistical analysis, reproducible visualization, pipeline development
Machine learning: XGBoost, Random Forest, logistic regression, LASSO, SHAP, ROC/AUC, calibration, cross-validation, bootstrap confidence intervals
Medical AI evaluation: benchmark construction, LLM-as-a-judge, rubric design, multi-turn dialogue evaluation, safety/ethics scoring, agent evaluation
Clinical data: NHANES, BRFSS, EHR/EMR-derived records, patient simulation, clinical terminology
- Shanghai University of Traditional Chinese Medicine
  M.Med., Integrated Traditional Chinese and Western Clinical Medicine, expected Dec 2026
- Shanghai Artificial Intelligence Laboratory
  Research Intern, Medical AI Group
- Putuo District Central Hospital, Shanghai
  Resident Physician / Clinical Trainee
- Email: dingchaochao58@gmail.com
- Google Scholar: Chao Ding
- ORCID: 0009-0004-9652-4585
- GitHub: Dingsuper-creator
- Location: Shanghai, China