# DecisionLENS

AI-augmented clinical trial enrollment forecasting and decision support.
DecisionLENS is an end-to-end data science portfolio project that applies predictive modeling, survival analysis, and large-language-model reasoning to the problem of clinical trial enrollment planning — one of the costliest and most schedule-critical challenges in drug development.
Built with: Python · XGBoost · Lifelines · Groq (Llama 3.3-70B) · Streamlit · Plotly · AACT / ClinicalTrials.gov data

**Modules**

| Module | Question answered |
|---|---|
| Enrollment Forecast | What is the probability this trial will complete enrollment? When? |
| Competitive Intelligence | How saturated is the patient pool for this condition? Who are the competing sponsors? |
| Investigator Insights | Which sites and countries have the strongest historical completion records? |
| Eligibility Analyzer | Which inclusion/exclusion criteria most restrict the eligible population, and how can they be relaxed? |

**Tech stack**

| Layer | Technology |
|---|---|
| Data | AACT / ClinicalTrials.gov (public) |
| Data wrangling | pyarrow, polars |
| Modeling | Cox PH (lifelines), XGBoost, scikit-learn |
| Explainability | SHAP (TreeExplainer) |
| Generative AI | `llama-3.3-70b-versatile` via the Groq API |
| Visualization | Plotly Express / Graph Objects |
| Dashboard | Streamlit |
| Testing | pytest |
**Architecture**

```text
┌─────────────────────────────────────────────────────────────────┐
│ Data Layer                                                      │
│ AACT (ClinicalTrials.gov) flat-file parquets                    │
│ setup_data.py ──► data/processed/*.parquet                      │
│ (falls back to 10,000-study synthetic dataset if unavailable)   │
└────────────┬────────────────────────────────────────────────────┘
             │
┌────────────▼────────────────────────────────────────────────────┐
│ src/ — Core Modules                                             │
│                                                                 │
│ data_pipeline.py          TrialDataPipeline                     │
│   load_raw_data() ──► clean_studies() ──► engineer_features()   │
│                                                                 │
│ models.py                 EnrollmentForecaster                  │
│   XGBClassifier (P_completed)                                   │
│   XGBRegressor (duration_days)                                  │
│   CoxPHFitter (survival curve S(t))                             │
│   TreeExplainer (SHAP waterfall)                                │
│                                                                 │
│ competitive_intel.py      CompetitiveAnalyzer                   │
│   get_landscape() plot_competition_map() plot_timeline()        │
│                                                                 │
│ investigator_insights.py  InvestigatorAnalyzer                  │
│   get_top_sites() get_country_performance() recommend_sites()   │
│                                                                 │
│ genai_utils.py            EligibilityAnalyzer (Groq / Llama 3.3)│
│   analyze_criteria() compare_criteria() exec_briefing()         │
└────────────┬────────────────────────────────────────────────────┘
             │
┌────────────▼────────────────────────────────────────────────────┐
│ app/ — Streamlit Dashboard                                      │
│                                                                 │
│ streamlit_app.py                    Home / KPI overview         │
│ pages/1_Enrollment_Forecast.py      Risk gauge, survival curve  │
│ pages/2_Competitive_Intelligence.py Choropleth, Gantt, treemap  │
│ pages/3_Investigator_Insights.py    Site ranking, country heatmap│
│ pages/4_Eligibility_Analyzer.py     LLM risk analysis, briefing │
└─────────────────────────────────────────────────────────────────┘
```
**Project structure**

```text
decisionlens/
├── app/
│   ├── streamlit_app.py                  # Home page
│   ├── components/
│   │   ├── charts.py                     # Reusable Plotly chart functions
│   │   ├── sidebar.py                    # Global filter sidebar
│   │   └── _theme.py                     # Shared CSS theme
│   └── pages/
│       ├── 1_Enrollment_Forecast.py
│       ├── 2_Competitive_Intelligence.py
│       ├── 3_Investigator_Insights.py
│       └── 4_Eligibility_Analyzer.py
├── src/
│   ├── data_pipeline.py                  # TrialDataPipeline
│   ├── models.py                         # EnrollmentForecaster
│   ├── competitive_intel.py              # CompetitiveAnalyzer
│   ├── investigator_insights.py          # InvestigatorAnalyzer
│   └── genai_utils.py                    # EligibilityAnalyzer (Groq API)
├── notebooks/
│   ├── 01_eda_trial_landscape.ipynb      # Exploratory data analysis
│   ├── 02_feature_engineering.ipynb      # Feature construction & validation
│   ├── 03_enrollment_model.ipynb         # Model training & tuning
│   ├── 04_competitive_intelligence.ipynb
│   └── 05_model_evaluation.ipynb         # Threshold tuning, SHAP analysis
├── sql/
│   ├── enrollment_extract.sql            # Main modeling query
│   ├── competitor_landscape.sql          # Competing-trial query
│   └── investigator_sites.sql            # Site-performance query
├── tests/
│   ├── conftest.py                       # Shared fixtures
│   ├── test_pipeline.py
│   ├── test_models.py
│   └── test_genai.py
├── docs/
│   └── methodology.md                    # Full technical methodology
├── models/
│   └── forecaster.joblib                 # Saved EnrollmentForecaster
├── setup_data.py                         # One-time data download & preprocessing
├── requirements.txt
└── .env.example
```
**Quickstart**

```bash
git clone https://github.com/yourhandle/decisionlens.git
cd decisionlens
python -m venv venv
source venv/bin/activate    # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
```

Edit `.env` and set:
```ini
# Required for the Eligibility Analyzer (Page 4).
# Free key: https://console.groq.com/keys
# If not set, the app runs in demo mode with realistic mock responses.
GROQ_API_KEY=your_groq_key_here

# Path to your AACT data directory (created by setup_data.py).
# Defaults to ./data if not set.
AACT_DATA_DIR=./data
```
`setup_data.py` downloads the latest AACT flat-file snapshot (~2 GB) and converts it to Parquet. If the download fails, it generates a 10,000-study synthetic dataset automatically.
```bash
python setup_data.py
```

Expected output:

```text
Trying 2026-03-01 … OK
Downloaded: data/raw/20260301_clinical_trials.zip
Extracting … done
Converting to parquet … 10 tables
Preprocessing … done
Saved to data/processed/ (10 tables)
```
Run notebook 03 to train and save the EnrollmentForecaster:

```bash
jupyter notebook notebooks/03_enrollment_model.ipynb
# Or: jupyter nbconvert --to notebook --execute notebooks/03_enrollment_model.ipynb
```

This writes `models/forecaster.joblib`. A pre-trained model is included for convenience.
```bash
streamlit run app/streamlit_app.py
```

The dashboard opens at http://localhost:8501; all four pages are available from the sidebar.
Performance was evaluated on a 10,000-study synthetic AACT snapshot using 5-fold stratified cross-validation (March 2026).
| Metric | Default threshold (0.50) | Tuned threshold (0.93) |
|---|---|---|
| ROC-AUC | 0.787 ± 0.004 | 0.787 ± 0.004 |
| F1-macro | 0.564 | 0.660 |
| F1 (Completed) | 0.956 | 0.869 |
| F1 (Terminated) | 0.170 | 0.368 |
The threshold was raised from 0.50 to 0.93 to improve sensitivity to the Terminated (minority) class while preserving ROC-AUC. At the 0.50 threshold, the model nearly always predicted "Completed" — clinically useless. At 0.93, it surfaces roughly twice as many true termination risks.
Why F1-Terminated is still modest: AACT captures structural attributes (phase, sponsor, facility count) but lacks the operational signals that most reliably predict termination — site activation delays, protocol amendments, funding discontinuation, and regulatory holds. ROC-AUC = 0.787 confirms that a real signal exists; F1-Terminated will improve as site-level features are added.
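The threshold selection amounts to a sweep over candidate cutoffs, scoring each by macro-F1. A minimal numpy sketch (function names are illustrative, not the notebook's actual API):

```python
import numpy as np


def f1(y_true, y_pred, positive):
    """Binary F1 treating one class label as positive."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_threshold(y_true, p_completed, grid=None):
    """Return the P(completed) cutoff that maximises macro-F1 over both classes."""
    grid = np.linspace(0.05, 0.95, 91) if grid is None else grid
    best_t, best_score = 0.5, -1.0
    for t in grid:
        y_pred = (p_completed >= t).astype(int)  # 1 = Completed, 0 = Terminated
        macro = 0.5 * (f1(y_true, y_pred, 1) + f1(y_true, y_pred, 0))
        if macro > best_score:
            best_t, best_score = t, macro
    return best_t, best_score
```

With a heavy Completed majority, the sweep naturally lands well above 0.50, which is what pushes the chosen cutoff to 0.93.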
The duration regressor (XGBRegressor) has limited standalone accuracy:

| Metric | Value |
|---|---|
| R² | ~0.04 |
| Interpretation | Duration is driven by operational factors absent from AACT structural data. |
| Use | Rank-ordering and rough planning; prefer Cox PH intervals for uncertainty. |
The Cox PH model fits a semiparametric time-to-completion model on labeled trials, with right-censoring for active/recruiting studies. It generates the enrollment-completion probability curve P(completed by day t) = 1 − S(t) shown on Page 1.
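The project uses lifelines' CoxPHFitter for this; the right-censoring mechanics can be illustrated with a plain Kaplan-Meier estimator in numpy (a simplified stand-in with no covariates, not the actual Page 1 code):

```python
import numpy as np


def completion_curve(duration_days, completed):
    """Kaplan-Meier estimate of P(enrollment completed by day t).

    `completed` is 1 for observed completions, 0 for right-censored
    (still recruiting/active) trials.
    """
    t = np.asarray(duration_days, dtype=float)
    e = np.asarray(completed, dtype=int)
    s, curve = 1.0, []
    for ti in np.unique(t[e == 1]):           # step only at observed event times
        at_risk = np.sum(t >= ti)             # censored trials still count as at risk
        events = np.sum((t == ti) & (e == 1))
        s *= 1.0 - events / at_risk           # survival S(t)
        curve.append((ti, 1.0 - s))           # P(completed by t) = 1 - S(t)
    return curve
```

Censored trials contribute to the at-risk denominator until they drop out, which is exactly why active studies can be included without biasing the curve downward.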
**Page 1 — Enrollment Forecast**

- Configure a hypothetical trial (phase, sites, countries, sponsor type, masking)
- Outputs: completion probability gauge, predicted duration in months, risk label
- Survival curve: P(enrollment completed by day t) with predicted median marker
- SHAP waterfall: top 12 risk drivers with signed feature contributions
- Benchmark: box plot of historical duration for same phase + 10 nearest-neighbor trials
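The nearest-neighbor benchmark boils down to a z-scored Euclidean lookup; a minimal sketch (names are illustrative, not the actual `models.py` API):

```python
import numpy as np


def nearest_trials(x, X, k=10):
    """Indices of the k historical trials most similar to a configured trial.

    x: 1-D feature vector for the new trial; X: (n_trials, n_features)
    matrix of historical trials. Features are z-scored so that site counts
    and enrollment targets contribute on a comparable scale.
    """
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-9   # +eps avoids divide-by-zero
    dist = np.linalg.norm((X - mu) / sd - (x - mu) / sd, axis=1)
    return np.argsort(dist)[:k]
```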
**Page 2 — Competitive Intelligence**

- KPI cards: total/active/competing/terminated trials, competition intensity index
- Choropleth world map: trial site density by country
- Gantt timeline: concurrent trial start/end dates coloured by phase
- Sponsor treemap: market share by active-trial count
**Page 3 — Investigator Insights**

- Sortable site ranking table: trials, completion rate, avg enrollment, city/country
- Country heatmap: z-score normalised performance across 5 metrics
- Site co-participation network: nodes = sites, edges = shared trials
- Site allocation recommender: distributes a patient target across countries by performance score
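The allocation recommender reduces to proportional allocation with largest-remainder rounding; a minimal sketch (function and key names are illustrative):

```python
def allocate_patients(target: int, scores: dict) -> dict:
    """Split a patient enrollment target across countries by performance score.

    Largest-remainder rounding guarantees the allocations sum exactly to
    `target`. Illustrative sketch of the Page 3 recommender logic.
    """
    total = sum(scores.values())
    raw = {c: target * s / total for c, s in scores.items()}
    alloc = {c: int(v) for c, v in raw.items()}       # floor every share
    leftover = target - sum(alloc.values())           # patients lost to flooring
    by_remainder = sorted(raw, key=lambda c: raw[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:                 # biggest fractions win
        alloc[c] += 1
    return alloc
```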
**Page 4 — Eligibility Analyzer**

- Paste eligibility criteria text (or load an NSCLC example)
- LLM (Llama 3.3-70B via Groq) identifies risk factors, assigns severity (high/medium/low), and suggests relaxations with estimated population impact
- Streaming executive briefing: 1-page VP-level summary (~300 words)
- Side-by-side protocol comparison: which criteria differ, which protocol is more enrollment-friendly
- Demo mode: runs with realistic mock responses when `GROQ_API_KEY` is not set
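The demo-mode fallback pattern looks roughly like this. The prompt, response schema, and mock content below are invented for illustration; the project's actual versions live in `genai_utils.py`:

```python
import json
import os


def analyze_criteria(criteria_text: str) -> dict:
    """Structured eligibility-risk analysis with a demo-mode fallback."""
    if not os.getenv("GROQ_API_KEY"):
        # Demo mode: deterministic mock so the dashboard works without a key.
        return {"mode": "demo", "risk_factors": [
            {"criterion": "ECOG 0-1 only", "severity": "medium",
             "suggestion": "Consider allowing ECOG 2 with added monitoring."}]}
    from groq import Groq  # lazy import: only required in live mode
    client = Groq()        # reads GROQ_API_KEY from the environment
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system",
             "content": ("Return JSON only: {\"risk_factors\": "
                         "[{\"criterion\", \"severity\", \"suggestion\"}]}")},
            {"role": "user", "content": criteria_text},
        ],
    )
    return {"mode": "live", **json.loads(resp.choices[0].message.content)}
```

Keeping the Groq import lazy means the demo path has no dependency on the SDK at all.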
| Notebook | Purpose |
|---|---|
| 01_eda_trial_landscape.ipynb | Distribution of trial statuses, phases, enrollment figures; temporal trends |
| 02_feature_engineering.ipynb | Feature construction, correlation analysis, missing-value audit |
| 03_enrollment_model.ipynb | XGBoost training, cross-validation, SHAP importances, model serialisation |
| 04_competitive_intelligence.ipynb | Competitive landscape deep-dive, sponsor analysis, site co-participation network |
| 05_model_evaluation.ipynb | Threshold sweep (F1 vs threshold curve), ROC/PR curves, calibration, confusion matrices |
```bash
pytest tests/ -v
# 89 passed, 7 skipped (live-mode Groq tests require GROQ_API_KEY) in ~5 s
```

| Test file | What's tested |
|---|---|
| `test_pipeline.py` | `TrialDataPipeline` — init, clean_studies, engineer_features, load_raw_data |
| `test_models.py` | `EnrollmentForecaster` — fit, predict, survival, save/load, pre-fit guards |
| `test_genai.py` | `EligibilityAnalyzer` — demo mode, all three methods, caching, live mocks |
The `sql/` directory contains equivalent PostgreSQL queries for teams with live AACT access:

| Script | Purpose |
|---|---|
| `enrollment_extract.sql` | Full modeling dataset — joins 8 AACT tables, engineers all features |
| `competitor_landscape.sql` | Active competing trials, sponsor market share, site distribution |
| `investigator_sites.sql` | Site completion rates, enrollment performance by country |
All settings are loaded from `.env` at startup (see `.env.example`):

| Variable | Default | Description |
|---|---|---|
| `GROQ_API_KEY` | (none) | Groq API key for the Eligibility Analyzer. Get a free key at console.groq.com/keys. App runs in demo mode without it. |
| `AACT_DATA_DIR` | `./data` | Root directory for AACT flat files. `setup_data.py` creates `raw/` and `processed/` subdirectories inside it. |
| `AACT_USERNAME` | (none) | AACT PostgreSQL credentials — needed only for live database queries, not for flat-file mode. |
| `AACT_PASSWORD` | (none) | See above. |
See `docs/methodology.md` for a detailed walkthrough of:
- AACT data schema and preprocessing decisions
- Feature engineering rationale (all 14 features)
- Classifier threshold selection and imbalanced-class handling
- Survival model right-censoring and interpretation
- LLM prompt design and structured output schema
- Source: AACT flat-file downloads — pipe-delimited (`|`), UTF-8, ~2.2 GB compressed
- Scope: Interventional trials started ≥ 2008 with target enrollment ≥ 10
- Labels: COMPLETED → 1, TERMINATED → 0; RECRUITING/ACTIVE/etc. → unlabeled
- Synthetic fallback: `setup_data.py` generates 10,000 seeded synthetic studies when the download is unavailable; all structural patterns are preserved
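The labeling rule can be sketched with pandas (the `overall_status` column name is assumed from the AACT schema):

```python
import pandas as pd


def make_labels(studies: pd.DataFrame) -> pd.Series:
    """Map trial status to a completion label.

    COMPLETED -> 1, TERMINATED -> 0; RECRUITING/ACTIVE/etc. stay <NA>
    (unlabeled) so they can be used for censored survival fitting but are
    excluded from classifier training.
    """
    status = studies["overall_status"].str.upper()
    labels = pd.Series(pd.NA, index=studies.index, dtype="Int64")
    labels[status == "COMPLETED"] = 1
    labels[status == "TERMINATED"] = 0
    return labels
```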
MIT — see LICENSE for details.