Clinical practice involves incremental information gathering, sequential and irreversible decisions, and inherent uncertainty, and this complexity remains a significant challenge for AI evaluation. Existing benchmarks fall short, often capturing only part of this dynamic process. To address this, researchers have introduced ClinEnv, an interactive benchmark designed to simulate real inpatient admissions and rigorously assess large language models (LLMs) acting as attending physicians.
Simulating the Physician's Sequential, Uncertain Workflow
ClinEnv moves beyond static evaluations by structuring each medical case as an ordered sequence of decision stages. At every stage, the LLM must actively query specialized agents to gather heterogeneous information before committing to decisions about medications, procedures, and diagnoses. This paradigm, termed Longitudinal Inpatient Simulation, mirrors the step-by-step nature of clinical reasoning and provides a more realistic assessment environment than static datasets.
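To make the staged workflow concrete, here is a minimal Python sketch of such a loop. It is not ClinEnv's actual code: the names `Stage`, `ScriptedPolicy`, and `run_episode`, the per-stage query budget, and the toy case are all assumptions made for illustration, with a scripted policy standing in for the LLM.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Stage:
    """One decision stage of a case (hypothetical structure)."""
    name: str
    agents: Dict[str, str]  # agent name -> information it reveals when queried


@dataclass
class EpisodeLog:
    """Records every query issued and the decisions committed per stage."""
    queries: List[Tuple[str, str]] = field(default_factory=list)
    decisions: Dict[str, Set[str]] = field(default_factory=dict)


class ScriptedPolicy:
    """Toy stand-in for an LLM: queries a fixed list of agents, then decides."""

    def __init__(self, plan: Dict[str, List[str]], answers: Dict[str, List[str]]):
        self.plan = plan        # stage name -> agents to query, in order
        self.answers = answers  # stage name -> decisions to commit
        self._asked: Dict[str, int] = {}

    def next_query(self, stage_name: str, notes: List[str]) -> Optional[str]:
        asked = self._asked.get(stage_name, 0)
        todo = self.plan.get(stage_name, [])
        if asked >= len(todo):
            return None  # no more questions; ready to decide
        self._asked[stage_name] = asked + 1
        return todo[asked]

    def decide(self, stage_name: str, notes: List[str]) -> List[str]:
        return self.answers.get(stage_name, [])


def run_episode(stages: List[Stage], policy: ScriptedPolicy,
                max_queries_per_stage: int = 3) -> EpisodeLog:
    """Walk the stages in order; decisions are committed once and never revised."""
    log = EpisodeLog()
    notes: List[str] = []  # information accumulates across stages
    for stage in stages:
        for _ in range(max_queries_per_stage):
            agent = policy.next_query(stage.name, notes)
            if agent is None:
                break
            log.queries.append((stage.name, agent))
            notes.append(stage.agents.get(agent, "no information"))
        log.decisions[stage.name] = set(policy.decide(stage.name, notes))
    return log


# Toy case with two stages (invented data, for illustration only).
stages = [
    Stage("admission", {"nurse": "fever 39C", "lab": "WBC elevated"}),
    Stage("discharge", {"chart": "cultures positive"}),
]
policy = ScriptedPolicy(
    plan={"admission": ["nurse", "lab"], "discharge": ["chart"]},
    answers={"admission": ["blood cultures"], "discharge": ["sepsis"]},
)
log = run_episode(stages, policy)
```

The key design point is that the log keeps the query trace separate from the committed decisions, so the process and the outcome can be scored independently.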
Quantifying Information Acquisition and Decision Quality
The benchmark scores both the final decisions made by the LLM and the quality of the information-gathering process itself. Through deterministic ontology-grounded matching, ClinEnv provides concrete metrics for decision accuracy. Crucially, it makes the information-acquisition gap, which is often invisible in outcome-only evaluations, directly measurable. Across seven evaluated models, the strongest performer achieved a decision F1 score of only 0.31, and outcome quality was sharply decoupled from process quality. Difficulty was concentrated in management decisions and in the later stages of patient care: models recovered discharge diagnoses comparatively well (0.51 F1) but struggled with management actions (0.17 F1) and continued to issue redundant queries.
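As a rough illustration of how these two kinds of scores can be computed, the sketch below implements a set-based F1 after mapping terms to ontology codes, plus a simple redundancy rate over the query trace. The article does not give ClinEnv's exact matching rules or its definition of a redundant query, so the functions, the exact-repeat definition of redundancy, and the codes `C001` and `D010` are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def normalize(terms: Iterable[str], ontology: Dict[str, str]) -> Set[str]:
    """Map free-text terms to ontology codes; unmapped terms stay as lowercased text."""
    cleaned = (t.strip().lower() for t in terms)
    return {ontology.get(t, t) for t in cleaned}


def decision_f1(predicted: Iterable[str], gold: Iterable[str],
                ontology: Dict[str, str]) -> float:
    """Deterministic set-based F1 between predicted and gold decisions."""
    pred = normalize(predicted, ontology)
    ref = normalize(gold, ontology)
    if not pred and not ref:
        return 1.0  # nothing expected, nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def redundant_query_rate(queries: List[Tuple[str, str]]) -> float:
    """Share of queries that exactly repeat an earlier (stage, agent) pair.

    Assumption: 'redundant' means an exact repeat; the benchmark may define it differently.
    """
    if not queries:
        return 0.0
    return 1 - len(set(queries)) / len(queries)


# Hypothetical ontology: synonyms map to the same invented code.
ontology = {
    "heart attack": "C001",
    "myocardial infarction": "C001",
    "aspirin": "D010",
}

f1 = decision_f1(
    predicted=["Heart attack", "ibuprofen"],
    gold=["myocardial infarction", "aspirin"],
    ontology=ontology,
)
rate = redundant_query_rate(
    [("s1", "lab"), ("s1", "lab"), ("s1", "nurse"), ("s2", "lab")]
)
```

In this toy example the synonym "heart attack" matches the gold "myocardial infarction" through the shared code, while "ibuprofen" counts as a false positive and "aspirin" as a miss. Reporting the redundancy rate alongside F1 is what lets a weak process show up even when the final decisions look acceptable.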