A machine learning ranking tool that assigns a vigilance score to crypto protocols (DeFi, CEX, bridges) to prioritize human security reviews.
CRS is a machine learning project that ranks crypto protocols by vigilance score in order to prioritize human security reviews. Given ~500 active protocols and limited analyst capacity, the goal is to surface the 30 highest-risk candidates each quarter. The aim is not to predict hacks but to support risk-based triage.
The project covers data collection, feature engineering, model comparison, ranking-oriented evaluation, and interpretability.
The notebooks are primarily written in French as they document the project methodology in detail. Each notebook includes an English summary at the top to make the workflow understandable for non-French readers.
A security team with limited capacity (one part-time analyst) cannot review every protocol from scratch each quarter. The question is: which 30 protocols out of ~500 should be reviewed first?
A naive approach (biggest TVL, most recent launch) is not systematic and misses important structural signals. CRS replaces ad hoc selection with a reproducible, model-driven ranking.
Does:
- Learn the structural profile of protocols historically targeted in the DefiLlama hack dataset.
- Assign a `risk_score` to each protocol to sort ~500 candidates.
- Direct limited analyst capacity toward the highest-priority cases.
Does not:
- Predict whether a protocol will be hacked.
- Use pre-hack features. Current version uses snapshot-based inputs; temporal alignment is documented as the main limitation and the primary improvement path.
- Replace human judgment. Every alert requires analyst validation.
Sources used:
| Source | Role |
|---|---|
| DefiLlama `/protocols` | Metadata: TVL, launch date, active chains, audit count (4,800+ protocols) |
| DefiLlama `/hacks` | Labels: 472 documented hacks, $15.85B in losses, 2016–2026 |
Sources explored and discarded:
- Rekt News: fully parsed (286 entries) → 0 usable ML features. Tags encode blockchains and attack techniques, not structured audit status.
- Dune Analytics: 0.8% coverage of DefiLlama protocols after join. Uncontrollable selection bias, fragile matching.
- CertiK Skynet / DeFiSafety: no stable public API, incomplete coverage, non-reproducible scraping.
Final dataset: 4,883 active protocols (TVL > 0, valid launch date), of which 164 hacked (3.4%) and 4,719 clean (96.6%). Multi-incident protocols appear multiple times, handled by grouped split (see Methodology).
Note: DefiLlama public API endpoints were migrated to a paid plan in April 2026. Data used here reflects a March 2026 snapshot.
01_Crypto_Protocol_Risk_Scoring_Collecte_des_données.ipynb
- DefiLlama `/hacks` call → labels
- DefiLlama `/protocols` call → raw features
- Exploration and rejection of Rekt News, Dune, CertiK
- Initial feature engineering
- Documentation of temporal bias (2026 snapshots vs. 2016–2025 hacks)
- Parquet export → `data/`
02_Crypto_Protocol_Risk_Scoring.ipynb
- EDA: feature distributions, class imbalance (naive baseline = 95.9% accuracy with zero learning)
- sklearn pipeline: `RobustScaler` + `OneHotEncoder` in a `ColumnTransformer`
- Train/val/test split grouped by protocol (`GroupShuffleSplit`); no protocol shared across sets
- Model comparison: Logistic Regression, Random Forest, XGBoost
- Imbalance handling: `class_weight='balanced'`, `scale_pos_weight=28.9` (XGBoost)
- Threshold tuning (0.3 for max-recall mode)
- Final evaluation on test set: protocol-level ranking by `risk_score`
- Interpretability: feature importances, SHAP
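The preprocessing and grouped split listed above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the column names, the `protocol_id` group key, and all values are assumptions, and the split is reduced to train/test for brevity (a validation set could be carved out of the training rows the same way).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Synthetic stand-in for the real dataset (all values are invented).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "protocol_id": rng.integers(0, 300, size=n),  # hypothetical group key
    "log_tvl": rng.normal(15, 3, size=n),
    "age_days": rng.integers(30, 3000, size=n),
    "category": rng.choice(["dex", "lending", "bridge"], size=n),
    "hacked": rng.binomial(1, 0.1, size=n),
})

numeric_cols = ["log_tvl", "age_days"]
categorical_cols = ["category"]
feature_cols = numeric_cols + categorical_cols

# Scale numeric columns robustly, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", RobustScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0)),
])

# Grouped split: every row of a given protocol lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(
    splitter.split(df, df["hacked"], groups=df["protocol_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

model.fit(train[feature_cols], train["hacked"])
risk_score = model.predict_proba(test[feature_cols])[:, 1]
```

Keeping the scaler and encoder inside the `Pipeline` means they are fitted on training rows only, so no statistics from the test protocols leak into preprocessing.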
| Feature | Rationale |
|---|---|
| `log_tvl` | Compresses asymmetric TVL distribution ($10K to $3B) |
| `age_days` | Protocol age from launch date |
| `tvl_per_day` | TVL / age; captures "recent honey pot" profile (high TVL, low audit exposure) |
| `is_multichain` | Quantifies cross-chain attack surface |
| `is_bridge` | Bridge protocols have specific risk exposure |
| `is_dex`, `is_lending` | Protocol category flags |
| `audit_status` | Presence/absence of documented audits |
In the modeling notebook, additional derived features are tested, including `audit_per_year`, `lending_audit_score` (`is_lending` × `audit_count`), and `category` (one-hot encoded).
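As a rough illustration of how the features in the table could be derived, here is a sketch on three toy rows. The raw column names (`tvl`, `launch_date`, `chains`, `category`), the category labels, the snapshot date, and the use of `log1p` rather than a plain logarithm are all assumptions made for the example; the notebooks may differ.

```python
import numpy as np
import pandas as pd

# Toy raw rows; column names and values are hypothetical.
raw = pd.DataFrame({
    "name": ["A", "B", "C"],
    "tvl": [10_000.0, 3_000_000_000.0, 50_000_000.0],
    "launch_date": pd.to_datetime(["2020-01-01", "2025-12-01", "2023-06-15"]),
    "chains": [["Ethereum"], ["Ethereum", "Arbitrum", "Base"], ["Solana"]],
    "category": ["dex", "bridge", "lending"],
})
snapshot_date = pd.Timestamp("2026-03-01")  # assumed snapshot date

features = pd.DataFrame({
    # log1p compresses the heavy right tail and is defined at TVL = 0.
    "log_tvl": np.log1p(raw["tvl"]),
    "age_days": (snapshot_date - raw["launch_date"]).dt.days,
})
# Guard against division by zero for protocols launched on the snapshot day.
features["tvl_per_day"] = raw["tvl"] / features["age_days"].clip(lower=1)
features["chain_count"] = raw["chains"].apply(len)
features["is_multichain"] = (features["chain_count"] > 1).astype(int)
features["is_bridge"] = (raw["category"] == "bridge").astype(int)
features["is_lending"] = (raw["category"] == "lending").astype(int)
```

In this toy data, protocol B (very high TVL, 90 days old) gets by far the largest `tvl_per_day`, which is the "recent honey pot" profile the table describes.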
| Model | Notes |
|---|---|
| Logistic Regression | Baseline, interpretable, linear decision boundary |
| Random Forest | Ensemble, handles non-linearities |
| XGBoost | Gradient boosting, scale_pos_weight for imbalance |
Final model: Random Forest refitted on train + validation (best ROC-AUC on validation set, confirmed by Recall@k on test set).
Standard accuracy is meaningless on a 3.4% / 96.6% imbalance. The primary metrics are:
- Recall@k: among the top k% of protocols by predicted score, what fraction of true hacks are captured?
- Lift@k: how much better than random selection is the model at rank k?
- ROC-AUC: global discrimination ability.
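Recall@k and Lift@k as described above can be computed with a few lines of NumPy. The function below is a sketch under one assumption about rounding: the number of alerts is the floor of k times the population size. The labels and scores in the example are invented.

```python
import numpy as np

def recall_and_lift_at_k(y_true, scores, k_frac):
    """Recall and lift when alerting on the top `k_frac` share of items."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_alerts = max(1, int(len(scores) * k_frac))      # size of the alert list
    top_idx = np.argsort(-scores, kind="stable")[:n_alerts]
    hits = y_true[top_idx].sum()                      # positives caught
    recall = hits / y_true.sum()
    # Lift: recall achieved divided by the recall a random list
    # of the same size would achieve on average.
    lift = recall / (n_alerts / len(scores))
    return n_alerts, recall, lift

# Toy example: 10 items, 2 positives, alert on the top 20%.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.1, 0.3, 0.2, 0.05, 0.4, 0.15, 0.25, 0.35]
n_alerts, recall, lift = recall_and_lift_at_k(y_true, scores, 0.2)
print(n_alerts, recall, lift)  # 2 0.5 2.5
```

Lift defined this way is equivalent to precision among the alerts divided by the base rate of positives, so a lift of 1.0 means the ranking is no better than random selection.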
Evaluation on the test set (976 unique protocols, 25 hacked, never seen during training):
| Top k% | Alerts | Recall@k | Lift@k |
|---|---|---|---|
| 5% | 48 | 40% | 8.1x |
| 10% | 97 | 44% | 4.4x |
| 15% | 146 | 48% | 3.2x |
| 20% | 195 | 60% | 3.0x |
ROC-AUC (test): 0.7836
Interpretation: by reviewing the 97 highest-scored protocols (top 10%), we cover 44% of documented hacks in the test set: 4.4x better than random selection.
Feature importances and SHAP values are computed in Notebook 2. Key findings:
- Tree-based feature importances highlight `chain_count`, `age_days`, `is_lending`, `tvl_per_day`, and `log_tvl` as the main signals.
- `audit_count` remains present but must be interpreted cautiously: it partly reflects a size effect, because large protocols are both more audited and more targeted.
- SHAP summary plots show individual protocol contributions to each prediction.
```
CRS/
├── 01_Crypto_Protocol_Risk_Scoring_Collecte_des_données.ipynb   # Data collection
├── 02_Crypto_Protocol_Risk_Scoring.ipynb                        # Modeling & evaluation
├── data/
│   └── df_defi_risk.parquet                                     # Processed dataset
├── exports/
│   ├── 01_Crypto_Protocol_Risk_Scoring_Collecte_des_données.pdf
│   └── 02_Crypto_Protocol_Risk_Scoring.pdf
├── crypto_protocol_risk_scoring.pkl                             # Saved model
├── pyproject.toml
├── README.md                                                    # This file (English)
└── README.fr.md                                                 # French version
```
```bash
# Clone the repository
git clone https://github.com/V-Vaal/CRS.git
cd CRS

# Create a Python 3.12 virtual environment with uv and install dependencies
uv venv --python 3.12 .venv
source .venv/bin/activate
uv sync

# Launch JupyterLab
jupyter lab
```

Run notebooks in order:
- `01_Crypto_Protocol_Risk_Scoring_Collecte_des_données.ipynb` → generates `data/df_defi_risk.parquet`
- `02_Crypto_Protocol_Risk_Scoring.ipynb` → trains the model and evaluates on the test set
No API key required. Data comes from DefiLlama public APIs (March 2026 snapshot). Note: these endpoints moved to a paid plan in April 2026.
Temporal bias. TVL, audit_count, and chain_count are March 2026 snapshots applied to hacks from 2016–2025. A hacked protocol may show collapsed TVL post-incident; the model observes a post-hack reality for positive examples.
Correlation ≠ causation. High audit_count correlates with risk because large protocols are both more audited and more targeted. This is a size proxy, not a causal signal usable in production.
CEX bias. audit_count is a native DeFi indicator (smart contract audits). CEX protocols surface with high scores because they have 0 smart contract audits, not because they are intrinsically riskier in a DeFi sense.
Incident-level dataset. A multi-incident protocol carries proportionally more weight in training. Correction path: deduplication or per-protocol weighting before fit.
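One way to implement the per-protocol weighting mentioned above is to give each protocol a total weight of 1, shared equally across its rows. The sketch below uses invented data and a hypothetical `protocol` column; it illustrates the idea and is not part of the current notebooks.

```python
import pandas as pd

# Toy incident-level rows: protocol "A" appears three times (three incidents).
rows = pd.DataFrame({
    "protocol": ["A", "A", "A", "B", "C"],
    "hacked": [1, 1, 1, 1, 0],
})

# Number of rows per protocol, aligned back onto each row.
rows_per_protocol = rows["protocol"].map(rows["protocol"].value_counts())

# Each protocol contributes a total weight of 1, whatever its incident count.
rows["sample_weight"] = 1.0 / rows_per_protocol
```

The resulting column could then be passed through the `sample_weight` argument that scikit-learn estimators such as `RandomForestClassifier.fit` accept.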
hacked=0 ≠ safe. The negative class mixes genuinely safe protocols, protocols too small to be targeted, and undocumented hacks (label noise).
Temporal feature alignment (high priority). Pull pre-hack TVL via /api/protocol/{slug} and /api/inflows/{protocol}/{timestamp} → time-aligned features (TVL at T-30, 90-day volatility). Partially resolves the temporal bias.
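Once a TVL history is available, the time-aligned features could be derived along these lines. The sketch does not call the DefiLlama API: it starts from a hypothetical daily TVL series with invented values and a hypothetical hack date. Two further assumptions are made: the 90-day window ends at T-30, and volatility is the standard deviation of daily percentage changes.

```python
import pandas as pd

# Hypothetical daily TVL history for one protocol (values invented).
history = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=120, freq="D"),
    "tvl": [1_000_000 + 10_000 * i for i in range(120)],
}).set_index("date")

hack_date = pd.Timestamp("2024-04-15")            # hypothetical incident date
window_end = hack_date - pd.Timedelta(days=30)    # T-30
window_start = window_end - pd.Timedelta(days=90)

# Last known TVL at or before T-30 (index must be sorted by date).
tvl_t_minus_30 = history["tvl"].asof(window_end)

# Only observations up to T-30 enter the window, so no post-hack data leaks in.
window = history.loc[window_start:window_end, "tvl"]
volatility_90d = window.pct_change().std()
```

For protocols with no recorded hack, a reference date would still have to be chosen (for example the snapshot date); that choice is left open here.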
Graph Neural Networks. Tabular models cannot capture inter-protocol interactions. PyTorch Geometric would allow modeling transaction patterns: flash loans, cross-protocol interactions, liquidity contagion.
Anomaly detection. Isolation Forest or autoencoders to identify structurally atypical protocols, useful when reliable labels are absent for new protocols.
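A minimal sketch of the Isolation Forest idea with scikit-learn, on a synthetic feature matrix in which five rows are deliberately shifted away from the rest. The data and hyperparameters are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic features: 200 "typical" protocols plus 5 structurally atypical ones.
typical = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
atypical = rng.normal(loc=6.0, scale=1.0, size=(5, 3))
X = np.vstack([typical, atypical])

# No labels are used: the forest isolates points with random splits,
# and points that are isolated quickly are considered anomalous.
iso = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
iso.fit(X)

# score_samples returns lower values for more anomalous points;
# negate it so that a higher value means "more atypical".
anomaly_score = -iso.score_samples(X)
most_atypical = np.argsort(-anomaly_score)[:5]
```

Such a score could complement `risk_score` for newly launched protocols, where hack labels are not yet informative.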
Survival analysis. Model time-to-hack rather than a binary classifier. Better suited to censored data (active protocols with unknown futures).
On-chain features. Dune Analytics with robust matching (by contract address, not name) would add wallet concentration, activity metrics, and liquidity flow data.
| Component | Version |
|---|---|
| Python | 3.12 |
| pandas | ≥ 2.0 |
| numpy | ≥ 1.24 |
| scikit-learn | ≥ 1.3 |
| XGBoost | ≥ 2.0 |
| SHAP | ≥ 0.44 |
| matplotlib / seaborn | ≥ 3.7 / 0.12 |
| pyarrow | ≥ 14.0 |
| uv | dependency management |
MIT - See LICENSE
Valentin Valluet
- GitHub: github.com/V-Vaal
- LinkedIn: linkedin.com/in/valentin-valluet
- X: @val2_x