Auditing the clinical-evidence claims in OpenAI's HealthBench — its own gold answers & rubrics — for hallucinated, overgeneralized, overlooked & misweighted evidence. By NoBSmed.
benchmark clinical-trials ai-safety rag evidence-based-medicine medical-ai llm-evaluation citation-verification healthbench clinical-evidence medical-ai-evaluation applicability-benchmark
Updated May 30, 2026 - Python