AutoDataScientists: Self-Organizing Agent Teams for Clinical, Healthcare, and Biomedical Data Science
Research use only. This system is for methods research and exploratory analysis. It is not a medical device, is not FDA/CE cleared, and must not be used for diagnosis, treatment decisions, or any clinical purpose without independent validation and the appropriate regulatory clearance. See Disclaimer.
AutoDataScientists is a decentralized team of AI agents that runs end-to-end data science on clinical, healthcare, and biomedical problems: ingesting messy real-world data (EHR, multi-omics, imaging, trials), proposing and critiquing analysis plans, building and validating models, and producing interpretable, reproducible write-ups.
Unlike a single autonomous agent that follows one analysis trajectory, AutoDataScientists agents self-organize into teams around competing hypotheses and modeling strategies, critique each other's analysis plans before spending compute, and share results, dead-ends, and intermediate artifacts so the system avoids redundant work and sustains parallel exploration as evidence accumulates over hours or days. Domain guardrails (privacy/de-identification, statistical rigor, subgroup fairness, and leakage detection) are first-class steps in the workflow, not afterthoughts.
This repository packages the system as Claude Code subagents coordinating through a local message-board / workspace server. The orchestrator is a pure coordinator: it launches agents and harvests their results; it never analyzes data itself.
Generic AutoML tends to fail in biomedicine for predictable, domain-specific reasons. AutoDataScientists is built to respect them:
- Privacy & governance: PHI/PII handling, de-identification, IRB/ethics approval, and data-use agreements gate what leaves the machine. No identifiable data is sent to external APIs without a BAA / equivalent.
- Statistical rigor over leaderboard chasing: proper study design, confounder handling, multiple-testing correction, calibration, and time-to-event (survival) framing rather than naive accuracy.
- Leakage is everywhere: patient-level and site-level grouping in cross-validation, temporal splits for prospective questions, and explicit checks for target leakage from clinical workflows.
- Batch & site effects: harmonization across assays, platforms, and hospitals before modeling.
- Fairness & equity: performance reported across demographic and clinical subgroups, not just in aggregate.
- Interpretability & biological plausibility: feature importance, pathway/enrichment context, and clinically meaningful explanations are required outputs, not optional.
- Reproducibility: every run logs data versions, seeds, parameters, and decisions.
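To make the patient-level grouping point concrete, here is a minimal sketch of a grouped k-fold split using only the Python standard library. It is not the repository's implementation; the function name `grouped_kfold` and the toy `patient_ids` are illustrative assumptions. In practice a library splitter such as scikit-learn's `GroupKFold` serves the same purpose.

```python
from collections import defaultdict


def grouped_kfold(groups, n_splits=3):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides.

    Hypothetical helper for illustration; `groups` holds one group label
    (e.g. a patient or site ID) per row.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    if n_splits > len(by_group):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[g])

    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train, test


# Toy data: several rows (encounters) per patient.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
splits = list(grouped_kfold(patient_ids, n_splits=3))

for train, test in splits:
    # Leakage check: the same patient must never be on both sides.
    overlap = {patient_ids[i] for i in train} & {patient_ids[i] for i in test}
    assert not overlap
```

The same function works for site-level grouping by passing hospital or site IDs as `groups`.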
The agents cover the full data-science lifecycle for a given task:
- Ingest & profile: load data; map to standards (FHIR / OMOP / HGNC / SNOMED-CT / LOINC); profile schema, missingness, and distributions.
- De-identify & govern: verify de-identification, flag PHI, record consent/IRB constraints.
- Design: frame the question (predictive vs. causal vs. descriptive), define endpoints, power/sample-size sanity checks, and the validation protocol.
- Prepare: QC, batch-effect correction/harmonization, feature selection, dimensionality reduction, embeddings.
- Model: propose candidate approaches, run sweeps, train with grouping-aware cross-validation.
- Validate: leakage audit, external/temporal validation, calibration, subgroup/fairness analysis.
- Interpret & report: feature attributions, biological/clinical context, limitations, and a reproducible report.
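As an illustration of the power/sample-size sanity check in the Design step, the sketch below applies the standard normal-approximation formula for comparing two proportions (unpooled variance, two-sided test). The function name and the example event rates are assumptions made for illustration, not values from this repository, and a real study design should be reviewed by a biostatistician.

```python
import math
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect p1 vs. p2.

    Normal approximation with unpooled variance; a rough sanity check only.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical example: 20% vs. 15% event rate.
n = n_per_group_two_proportions(0.20, 0.15)
print(f"about {n} patients per group")
```

If the available cohort is far smaller than this estimate, that is a signal to reframe the question or endpoint before any modeling compute is spent.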
Clinical
- Risk prediction (ICU mortality, 30-day readmission, sepsis early warning, length-of-stay)
- EHR phenotyping and cohort definition
- Clinical deterioration / time-to-event modeling
Personalized / precision medicine
- Patient subtyping and stratification
- Treatment-response prediction
- Polygenic risk scoring and pharmacogenomics
Biomedical
- Biomarker discovery and prioritization
- Drug-response and target-identification analyses
- Assay/screen data modeling
Computational biology
- Single-cell cell-type annotation and differential expression
- Variant effect prediction and interpretation
- Multi-omics integration (genomics + transcriptomics + proteomics)
EHR / claims (FHIR, OMOP CDM) · genomics (WGS/WES, variants) · transcriptomics (bulk & single-cell) · proteomics / metabolomics · medical imaging (radiology, digital pathology) · clinical-trial data · wearable / sensor time series.
Each modality has its own loader and QC profile under task-<name>/. Start with the bundled examples and add your own.
A lightweight coordination layer hosts shared workspaces and a message board; agents post proposals, critiques, and results there. Roles include:
| Agent | Responsibility |
|---|---|
| Data Steward | Ingestion, standards mapping, de-identification & PHI checks, missingness, batch detection |
| Biostatistician | Study design, confounders, power, multiple testing, survival/causal framing |
| Feature / Representation | Feature selection, harmonization, embeddings, dimensionality reduction |
| Modeler(s) | Candidate models and sweeps; grouping-aware cross-validation (parallel teams) |
| Validator / Critic | Leakage audits, external/temporal validation, calibration, subgroup fairness |
| Translator | Feature attribution, pathway/clinical context, limitations, final report |
| Orchestrator | Pure coordinator: launches agents, harvests results, never analyzes data |
Agents self-organize around promising directions and must pass peer critique before consuming compute, so the system explores several strategies in parallel without duplicating effort.
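The critique-before-compute rule can be pictured as a small gate on each proposal. The sketch below is purely hypothetical: the `Proposal` and `Critique` classes, the verdict strings, and the `min_approvals` threshold are assumptions for illustration and do not describe the actual message-board schema.

```python
from dataclasses import dataclass, field


@dataclass
class Critique:
    reviewer: str
    verdict: str  # assumed vocabulary: "approve" or "block"
    note: str = ""


@dataclass
class Proposal:
    author: str
    summary: str
    critiques: list = field(default_factory=list)


def may_consume_compute(proposal, min_approvals=2):
    """Hypothetical gate: enough peer approvals and no standing block."""
    latest = {}
    for c in proposal.critiques:
        if c.reviewer != proposal.author:  # self-review does not count
            latest[c.reviewer] = c.verdict  # a later verdict replaces an earlier one
    approvals = sum(v == "approve" for v in latest.values())
    blocked = any(v == "block" for v in latest.values())
    return approvals >= min_approvals and not blocked


plan = Proposal(author="modeler-1", summary="Gradient boosting with patient-grouped CV")
plan.critiques.append(Critique("validator", "block", "split is not temporal"))
plan.critiques.append(Critique("biostatistician", "approve"))
before_fix = may_consume_compute(plan)

# The validator re-reviews after the plan is revised.
plan.critiques.append(Critique("validator", "approve", "temporal split added"))
after_fix = may_consume_compute(plan)
```

Counting only each reviewer's latest verdict lets a blocked plan be revised and re-reviewed without discarding the critique history.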
Prerequisites: Python 3.10+, Node.js 22+ (for npx), and the Claude Code CLI (claude).
# 1. Start the local coordination server (agents coordinate through this)
# Replace with your chosen message-board/workspace server.
npx <coordination-server> start
# 2. Python dependencies
pip install -r requirements.txt

Data security: run on infrastructure approved for your data classification. Configure external model access so that no identifiable data leaves the environment without a BAA / DUA in place.
From the repo root, in a separate shell:
claude -p "Read runbook.md and execute. Task: task-readmission-risk. Run name: readmit_v1."
claude -p "Read runbook.md and execute. Task: task-singlecell-annotation. Run name: scrna_v1."
claude -p "Read runbook.md and execute. Task: task-biomarker-discovery. Run name: biomarker_v1."

Each launch materializes a new sibling directory ../<run-name>/ with its own copy of the system, agents, workspace, and logs, so the template stays clean across runs. Hardware/data requirements vary per task; see each task-<name>/README.md.
Drop a task-<name>/ directory at the repo root with:
- TASK.md: the spec. YAML frontmatter sets task_type (e.g. ehr-risk, omics-classification, survival, imaging, singlecell), name, endpoint, and validation (e.g. patient-grouped-cv, temporal-split). The body describes the data, cohort, constraints, and success criteria.
- LAUNCH.md: fills the workflow hooks the runbook references (launch_command, deident_policy, cv_strategy, fairness_subgroups, leakage_checks, promotion_criteria, exit_condition, …). Easiest path: copy the closest bundled task-*/LAUNCH.md and edit.
Optionally add a download_data.sh / loader to fetch the dataset. Then launch it the same way as the bundled tasks, with Task: task-<name> in the claude -p prompt.
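Before launching a new task, it can help to check that TASK.md carries the frontmatter fields named above. The following is a minimal sketch under stated assumptions: it handles only flat `key: value` lines (not full YAML), treats all four fields as required (the source does not say which are mandatory), and the example values are illustrative rather than copied from a bundled task.

```python
REQUIRED_KEYS = ("task_type", "name", "endpoint", "validation")  # assumed required


def read_frontmatter(text):
    """Parse a flat 'key: value' frontmatter block delimited by '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("TASK.md must start with a '---' frontmatter block")
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        meta[key.strip()] = value.strip()
    raise ValueError("frontmatter block is not closed with '---'")


def missing_keys(meta):
    """Return the assumed-required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not meta.get(k)]


# Illustrative TASK.md content (hypothetical values).
example = """---
task_type: ehr-risk
name: readmission-risk
endpoint: 30-day readmission
validation: patient-grouped-cv
---
Body: data, cohort, constraints, and success criteria.
"""

meta = read_frontmatter(example)
print(missing_keys(meta) or "frontmatter looks complete")
```

For nested or multi-line frontmatter, a real YAML parser would be the better choice; this sketch only covers the simple flat case.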
| Benchmark | Metric | AutoDataScientists | Strongest baseline | Δ |
|---|---|---|---|---|
| e.g. MIMIC readmission | AUROC | TBD | TBD | TBD |
| e.g. single-cell annotation | macro-F1 | TBD | TBD | TBD |
| e.g. ProteinGym subset | Spearman | TBD | TBD | TBD |
Report subgroup/fairness breakdowns and calibration alongside headline metrics.
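One way to produce such a breakdown is sketched below with the standard library only: AUROC computed as the fraction of correctly ordered positive/negative pairs (ties count half) and the Brier score as a simple calibration-sensitive summary. The function names and the toy labels, scores, and subgroups are illustrative assumptions, and the pairwise AUROC is quadratic in sample size, so it suits small examples rather than production use.

```python
from collections import defaultdict


def auroc(y_true, y_score):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return None  # undefined when a subgroup has only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, y_prob):
    """Mean squared error between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def subgroup_report(y_true, y_prob, groups):
    """Return (overall_metrics, metrics_by_subgroup)."""
    buckets = defaultdict(lambda: ([], []))
    for y, p, g in zip(y_true, y_prob, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(p)
    overall = {"n": len(y_true), "auroc": auroc(y_true, y_prob), "brier": brier(y_true, y_prob)}
    by_group = {
        g: {"n": len(ys), "auroc": auroc(ys, ps), "brier": brier(ys, ps)}
        for g, (ys, ps) in sorted(buckets.items())
    }
    return overall, by_group


# Toy example with two hypothetical subgroups.
y = [1, 0, 1, 0, 1, 0]
prob = [0.9, 0.1, 0.8, 0.3, 0.4, 0.6]
grp = ["A", "A", "A", "B", "B", "B"]
overall, by_group = subgroup_report(y, prob, grp)
for name, metrics in by_group.items():
    print(name, metrics)
```

In this toy data the aggregate AUROC hides a much weaker subgroup B, which is exactly the pattern the subgroup breakdown is meant to surface.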
- Use only data you are authorized to use, under an approved IRB/ethics protocol and any applicable DUA.
- De-identify per HIPAA Safe Harbor / Expert Determination or your local equivalent (e.g. GDPR) before analysis.
- Keep an audit log of data access and agent decisions.
- Do not transmit identifiable data to third-party services without a Business Associate Agreement (or equivalent).
- This repository ships no patient data; example tasks reference public/synthetic datasets only.
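For the audit-log item above, one possible design is an append-only log in which each entry records the hash of the previous one, so later edits become detectable. This is a hypothetical sketch, not the repository's logging format: the field names and the example actors are assumptions, and a real log would also carry timestamps and be written to protected storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log, actor, action, detail):
    """Append an entry chained to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "detail": detail, "prev": prev_hash}
    log.append({**body, "hash": _digest(body)})
    return log[-1]


def verify(log):
    """Return True if every entry is intact and correctly chained."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("actor", "action", "detail", "prev")}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "data-steward", "data_access", "loaded synthetic cohort v1")
append_entry(audit_log, "validator", "decision", "rejected plan: temporal leakage")
assert verify(audit_log)
```

Hash chaining makes tampering evident but does not prevent it; access controls on the log itself are still needed.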
This software is provided for research and educational purposes only. It is not a medical device and has not been reviewed or cleared by any regulatory authority. Outputs may be incorrect, biased, or incomplete and must be independently validated. Nothing produced by this system constitutes medical advice or a substitute for the judgment of qualified clinicians. The authors accept no liability for use of this software in clinical or operational settings.