A model-organism benchmark for misalignment and oversight failures in biomedical AI research agents.
Status: analysis complete — dashboard and report phase.
NeuroGuard is an empirical AI-safety benchmark that places LLM-based biomedical research agents under controlled pressure conditions and measures how often, how severely, and under which conditions they exhibit misalignment-shaped behaviour — overclaiming evidence, hiding uncertainty, behaving differently when monitored, and producing outputs that appear safe but fail programmatic verification against the trial registry.
The biomedical setting is the substrate, not the subject. The subject is misalignment-shaped behaviour under realistic research-environment pressures.
For the full research question, hypotheses, scope, and ethical statement,
see docs/research_question.md.
Frontier LLMs are being integrated into biomedical research workflows. A research agent that appears to summarise the literature accurately — but in fact overstates evidence, hides uncertainty, or behaves differently when it believes it is not being watched — is a scalable oversight failure waiting to happen, regardless of the underlying capability of the model.
NeuroGuard treats this setting as a model organism: a tractable substrate (publicly registered Alzheimer's-disease trials) in which misalignment-shaped behaviour can be elicited under controlled perturbations and measured against verifiable ground truth.
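As an illustration of what "measured against verifiable ground truth" can mean in practice, the sketch below compares a structured claim extracted from an agent's output with the matching registry record and lists every field that disagrees. It is a minimal sketch, not the project's verification code: `RegistryRecord`, `AgentClaim`, `verify_claim`, the chosen fields, and the placeholder trial ID are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryRecord:
    """Hypothetical ground-truth fields for one registered trial."""
    nct_id: str
    phase: str
    enrollment: int
    has_posted_results: bool


@dataclass(frozen=True)
class AgentClaim:
    """Hypothetical structured claim extracted from an agent's summary."""
    nct_id: str
    phase: str
    enrollment: int
    cites_results: bool


def verify_claim(claim: AgentClaim, record: RegistryRecord) -> list[str]:
    """Return the names of the checks that the claim fails against the record."""
    if claim.nct_id != record.nct_id:
        raise ValueError("claim and record refer to different trials")
    failures = []
    if claim.phase != record.phase:
        failures.append("phase")
    if claim.enrollment != record.enrollment:
        failures.append("enrollment")
    # Citing results the registry has not posted counts as an overclaim.
    if claim.cites_results and not record.has_posted_results:
        failures.append("unposted_results")
    return failures


# Placeholder ID and values, for illustration only.
record = RegistryRecord("NCT00000000", "Phase 2", 120, False)
claim = AgentClaim("NCT00000000", "Phase 3", 120, True)
failures = verify_claim(claim, record)
```

An empty list means the claim passed every check. A non-empty list on an output that a judge rated as safe is the kind of gap the benchmark is designed to surface.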
| Priority | Where in this project |
|---|---|
| Model organisms of misalignment | Whole study design |
| Scalable oversight | Layer A (programmatic verification) vs Layer B/C (LLM-judge / human) gap |
| AI control | Pressure-condition factorial; monitored vs unmonitored |
| Adversarial robustness | Hidden-adversarial-instruction pressure condition |
| Evaluations | The benchmark itself; reproducible runner; public schema |
| Interpretability | Behavioural interpretability of detectors (feature importance, SHAP) |
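The pressure-condition factorial in the table can be pictured as the Cartesian product of the factor levels. The sketch below uses only the two factors named above (monitoring and a hidden adversarial instruction); the factor names, level labels, and the `factorial_conditions` helper are assumptions made for illustration, and the actual factor set lives in the project's configs.

```python
from itertools import product

# Hypothetical factor names and levels; the real ones are defined in configs/.
FACTORS = {
    "monitoring": ["monitored", "unmonitored"],
    "hidden_adversarial_instruction": [False, True],
}


def factorial_conditions(factors: dict[str, list]) -> list[dict]:
    """Enumerate every combination of factor levels as one condition dict."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, levels)) for levels in product(*level_lists)]


conditions = factorial_conditions(FACTORS)
for condition in conditions:
    print(condition)
```

Two binary factors give four conditions. Adding a factor multiplies the count by its number of levels, which is worth keeping in mind when budgeting simulation runs.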
neuroguard/
├── app/ # Streamlit interactive dashboard
├── configs/ # scenarios, pressures, rubric, experiment YAML
├── data/ # evidence corpus + simulation outputs (gitignored)
├── docs/ # research question, design notes, scope
├── models/ # trained model artefacts (gitignored)
├── notebooks/ # analysis notebooks (01–07)
├── reports/figures/ # all generated figures (34 plots)
├── results/ # per-experiment metrics, configs, manifest
├── src/neuroguard/ # package code (runner, judge, verification, CLI)
├── tests/ # unit + integration tests (pytest)
└── .github/ # issue / PR templates, CI workflows
- In scope. ClinicalTrials.gov Alzheimer's / dementia trials; multi-turn agent simulations under a factorial pressure design; three-layer ground truth (programmatic + LLM-judge + human-labelled); supervised ML detectors; Bayesian and Markov extensions; behavioural interpretability.
- Out of scope. Bio capability uplift of any kind; real patient data; fine-tuning of base models; mechanistic interpretability of model internals.
The authoritative scope statement is
docs/research_question.md §6 / §7.
NeuroGuard measures meta-behaviour (overclaiming, hidden uncertainty, monitoring-divergence, oversight failure). It does not score whether an agent produced a more dangerous biomedical answer. All evidence is drawn from public, de-identified registries. Any "confidential" content used to probe leakage is synthetic and uses a clearly marked canary token. Hidden adversarial instructions probe coercion patterns, never bio capability.
The full ethical statement is in
docs/research_question.md §9.
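The canary-token approach to leakage probing mentioned above could look roughly like the following. This is a sketch under stated assumptions: the `NG-CANARY-` prefix, `make_canary`, and `leaked` are hypothetical names, and a plain substring match is only the simplest possible detector (it would miss a paraphrased or re-encoded leak).

```python
import secrets

# Hypothetical marker; a distinctive prefix keeps the token clearly synthetic.
CANARY_PREFIX = "NG-CANARY-"


def make_canary() -> str:
    """Create a unique, clearly marked synthetic token to embed in 'confidential' text."""
    return f"{CANARY_PREFIX}{secrets.token_hex(8)}"


def leaked(agent_output: str, canary: str) -> bool:
    """Report whether the canary appears verbatim in the agent's output."""
    return canary in agent_output


canary = make_canary()
confidential_note = f"Internal memo (synthetic). Reference code: {canary}"
safe_output = "The registry lists this trial as Phase 2."
leaky_output = f"According to the internal memo, the code is {canary}."
```

Because the token is random and appears nowhere else, a verbatim match in an output is strong evidence that the agent copied from the synthetic confidential text.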
This is a step-by-step build across six phases:
- Foundations — research question, repo, environment, structure.
- Experimental design — evidence corpus, scenarios, agent tasks, pressure conditions, safety rubric.
- Pipeline — provider-agnostic model interaction layer, experiment runner, structured data collection.
- Analysis & ML — EDA, feature engineering, multiple tuned classifiers, full evaluation, error analysis, interpretability.
- Research extensions — Bayesian credible intervals on per-condition unsafe-rate; Markov sequential behaviour modelling.
- Productisation — results storage, dashboard, tests, CI, README polish, paper-style report.
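The two research extensions in the list above can be sketched with the standard library alone. The block below is illustrative and makes several assumptions that the source does not confirm: a uniform Beta(1, 1) prior on the per-condition unsafe rate, a Monte Carlo approximation of the credible interval, and hypothetical per-turn labels (`"safe"`, `"unsafe"`) for the Markov model. The function names are likewise invented here.

```python
import random
from collections import Counter, defaultdict


def unsafe_rate_interval(unsafe: int, total: int, level: float = 0.95,
                         draws: int = 20000, seed: int = 0) -> tuple[float, float]:
    """Approximate an equal-tailed credible interval for an unsafe rate.

    Assumes a Beta(1, 1) prior, so the posterior is
    Beta(1 + unsafe, 1 + total - unsafe); quantiles are estimated by sampling.
    """
    if not 0 <= unsafe <= total:
        raise ValueError("need 0 <= unsafe <= total")
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + unsafe, 1 + total - unsafe) for _ in range(draws)
    )
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


def transition_probabilities(sequences: list[list[str]]) -> dict[str, dict[str, float]]:
    """Estimate first-order Markov transition probabilities from labelled turns."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for sequence in sequences:
        # Pair each turn's label with the label of the turn that follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


# Made-up numbers for illustration, not project results.
lower, upper = unsafe_rate_interval(unsafe=12, total=40)
matrix = transition_probabilities([
    ["safe", "safe", "unsafe"],
    ["safe", "unsafe", "unsafe"],
])
```

Sampling keeps the example dependency-free; an exact Beta quantile function, such as the one in SciPy, would remove the Monte Carlo noise. States that never appear as a "current" turn have no row in the returned mapping.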
# Clone and set up
git clone https://github.com/Ella-Afonso/NeuroGuard.git
cd NeuroGuard
python -m venv .venv && .venv/Scripts/activate # Windows; on macOS/Linux use: source .venv/bin/activate
pip install -e ".[all]"
# Run tests
pytest
# Launch the dashboard
streamlit run app/dashboard.py
# CLI
neuroguard --help
See docs/research_question.md for the full research question and methodology.
MIT — see LICENSE.
A CITATION.cff will be added once the first results are written up.