This repo contains a proof-of-concept pipeline for turning smart contract audit PDFs (and some native Markdown reports) into a normalized schema (SCVD v0.1), plus:
- a per-report extractor,
- a normalizer to SCVD records,
- a JSON Schema validator, and
- a local Streamlit dashboard.
The main flow:
- extract_report.py: PDF/Markdown → report JSON (per report)
- normalize_report.py: report JSON → normalized findings (one SCVD record per line, JSONL)
- validate_scvd.py: validate normalized findings against the SCVD v0.1 JSON Schema
- dashboard.py: local visual explorer (Streamlit) over normalized findings
- run_pipeline.py: orchestrate extract → normalize → validate for a whole tree of reports
- Python 3.10+ (3.11 recommended)
- Dependencies (minimal set):
pip install \
requests \
pandas \
streamlit \
jsonschema \
torch \
"transformers>=4.45" \
numpy
- For PDF extraction:
  - marker Python package installed
  - Model weights configured for marker (already handled in extract_report.py via create_model_dict())
- For metadata inference (optional):
  - Ollama running locally
  - A model like qwen3:8b pulled: ollama pull qwen3:8b
Directory layout (suggested):
- data/raw/<provider>/... – original PDFs / MD files (inputs only)
- data/extracted/... – extractor outputs (HTML, Markdown, per-report JSON)
- data/normalized/... – normalized SCVD v0.1 JSONL findings
run_pipeline.py assumes this kind of layout by default.
Purpose: Convert an audit report (PDF or “nice” Markdown) into:
- HTML + Markdown (via Marker, for PDFs)
- A structured JSON object with:
  - doc_id
  - source_pdf, source_mtime
  - extracted_at, extractor_version
  - repositories (GitHub URLs + commits + evidence)
  - report_schema (per-report metadata schema, optionally inferred via LLM)
  - vulnerability_sections (index, headings, markdown, description, metadata, etc.)
Extract from a single PDF:
python extract_report.py path/to/report.pdf --use-ollama
This will produce, alongside the PDF:
- path/to/report.html
- path/to/report.md
- path/to/report.json
Common options:
- --doc-id DOCID: Override the doc_id in the JSON (default: PDF filename stem).
- --out-json OUTPUT.json: Custom path for the JSON output.
- --save-html OUTPUT.html / --save-md OUTPUT.md: Custom paths for HTML/Markdown outputs.
- --use-ollama: Use an LLM (via Ollama) to:
  - infer the per-report metadata schema (field names + meanings),
  - extract metadata (Severity, Difficulty, Type, Finding ID, Target, etc.) per vulnerability, and
  - as a last resort, segment findings and descriptions when heuristics fail.
- --ollama-base-url URL: Base URL for Ollama (default: http://localhost:11434).
- --ollama-model MODEL_NAME: Model name to use (default: qwen3:8b).
- -v / -vv: Increase logging verbosity.
Process all PDFs in a directory (non-recursive):
python extract_report.py --pdf-dir ./reports --use-ollama
This will:
- Find all *.pdf under ./reports (no recursion).
- For each foo.pdf, create:
  - foo.html
  - foo.md
  - foo.json
Options:
- --force: Re-run extraction even if a .json already exists for a given PDF.
Example:
python extract_report.py --pdf-dir ./reports --use-ollama --force -v
You can pull Code4rena findings repositories (e.g. code-423n4/2024-11-nibiru-findings) directly from GitHub Issues and feed them through the exact same normalization + dashboard flow.
What it does
- Fetches all issues (or just open, if you choose) from each specified Code4rena repo.
- Skips non-findings (see filters below).
- Writes a synthetic report.json per repo.
- Normalizes those into SCVD v0.1 JSONL, just like PDFs/MDs.
Each line can be either owner/repo or a full GitHub URL:
# comments and blank lines are ignored
code-423n4/2024-11-nibiru-findings
https://github.com/code-423n4/2024-08-chakra-findings
code-423n4/2023-04-rubicon-findings
A common place to keep this file is:
data/raw/code4rena/repos.txt
…but it can live anywhere; just pass the path to the flags below.
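As a rough illustration of the repos file format described above, the following sketch parses lines into owner/repo slugs. The function name parse_repo_lines is hypothetical and is not taken from the pipeline's code; stripping a trailing .git is also an assumption.

```python
from urllib.parse import urlparse


def parse_repo_lines(lines):
    """Return 'owner/repo' slugs from repos-file lines (hypothetical helper)."""
    slugs = []
    for raw in lines:
        line = raw.strip()
        # Comments and blank lines are ignored.
        if not line or line.startswith("#"):
            continue
        if line.startswith(("http://", "https://")):
            # Full GitHub URL: keep the first two path segments.
            parts = [p for p in urlparse(line).path.split("/") if p]
        else:
            parts = line.split("/")
        if len(parts) < 2 or not all(parts[:2]):
            raise ValueError(f"not an owner/repo entry: {raw!r}")
        owner, repo = parts[0], parts[1]
        # Assumption: a trailing '.git' on a clone URL is not part of the name.
        slugs.append(f"{owner}/{repo.removesuffix('.git')}")
    return slugs
```

Both entry styles end up in the same normalized form, so the rest of an ingestion step would only need to handle owner/repo strings.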
Set a token to avoid 401s and tight rate limits:
export GITHUB_TOKEN=ghp_yourtokenhere
Then pass the env var name to the CLI (defaults to GITHUB_TOKEN):
--github-token-env GITHUB_TOKEN
Tip: verify the token works
curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit
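For readers scripting against the GitHub API themselves, here is a minimal sketch of how a token could be read from a named environment variable and turned into request headers. The helper github_headers is hypothetical; how the pipeline actually builds its requests is not shown in this README.

```python
import os


def github_headers(token_env="GITHUB_TOKEN", environ=None):
    """Build GitHub API headers from a named env var (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    headers = {"Accept": "application/vnd.github+json"}
    token = environ.get(token_env, "").strip()
    if token:
        # Same header shape as the curl check above.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Without a token the function returns unauthenticated headers, which is where the tighter rate limits mentioned above would apply.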
Purpose:
Convert the per-report JSON from extract_report.py into SCVD v0.1 finding records.
- Input: report.json
- Output: findings.jsonl (one JSON object per line, each a single finding)
Each SCVD record includes (schema v0.1):
- schema_version, scvd_id
- doc_id, finding_index, page_start
- title, description_md, full_markdown
- severity, difficulty, type, finding_id
- target (path + placeholders for language/chain/contract/func/etc.)
- repo (best-effort repo context chosen from repositories)
- taxonomy (SWC/CWE/tags – currently empty lists)
- status (fix status, CVSS, exploit info – currently mostly null)
- references (currently empty list)
- provenance (timestamps + versions from both extraction & normalization)
- metadata_raw (original metadata block from the report for this finding)
python normalize_report.py path/to/report.json --out path/to/findings.jsonl
Arguments:
- report_json (positional): Path to report.json produced by extract_report.py.
- --source-pdf PATH: PDF filename/path to store in provenance.source_pdf. If omitted, the script tries:
  - report["source_pdf"], or
  - <doc_id>.pdf as a fallback.
- --extraction-version VERSION: Label for provenance.scvd_normalizer_version (default: poc-0.1).
- --out OUTPUT.jsonl: Output file for SCVD findings. If omitted, JSONL is printed to stdout.
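The --source-pdf fallback order above can be sketched as a small function. This is an illustration of the documented precedence only; the name resolve_source_pdf is hypothetical and the real script may structure this differently.

```python
def resolve_source_pdf(report, cli_source_pdf=None):
    """Pick the value for provenance.source_pdf (hypothetical helper)."""
    # 1) An explicit --source-pdf value wins.
    if cli_source_pdf:
        return cli_source_pdf
    # 2) Otherwise use the path recorded by the extractor, if any.
    if report.get("source_pdf"):
        return report["source_pdf"]
    # 3) Last resort: derive a filename from doc_id.
    return f"{report['doc_id']}.pdf"
```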
If you have multiple *.json reports (e.g. from directory mode):
# Example: normalize all .json reports in ./reports
for f in reports/*.json; do
out="${f%.json}.scvd.jsonl"
python normalize_report.py "$f" --out "$out"
done
# Combine all into a single corpus
cat reports/*.scvd.jsonl > all_findings.jsonl
You can then run validation and the dashboard on all_findings.jsonl.
Purpose:
Validate SCVD v0.1 records in a JSONL file against a formal JSON Schema
(e.g. schema/scvd_finding_v0_1.json).
This is a lint/check only – it does not modify your data.
python validate_scvd.py path/to/findings.jsonl
Default schema path:
schema/scvd_finding_v0_1.json
Override the schema path:
python validate_scvd.py path/to/findings.jsonl \
--schema path/to/custom_schema.json
Behavior:
- Reads findings.jsonl line by line.
- Parses each line as JSON.
- Validates against the schema.
- On errors, prints messages like:
  - [line 12] 'schema_version' is a required property at
  - [line 34] 'severity' is not of type 'string' at severity
- Exits with:
  - 0 and ✅ all good if everything matches the schema.
  - Non-zero and ❌ validation failed with N error(s) otherwise.
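To make the line-by-line behavior concrete, here is a simplified, standard-library-only sketch. It is a stand-in, not the validator itself: it only checks for a hypothetical subset of required keys instead of running full JSON Schema validation, and skipping blank lines is an assumption.

```python
import json

# Hypothetical subset of required fields; the real list lives in the schema.
REQUIRED_KEYS = ("schema_version", "scvd_id", "doc_id")


def check_jsonl(lines, required=REQUIRED_KEYS):
    """Return a list of '[line N] ...' error messages for a JSONL stream."""
    errors = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"[line {lineno}] invalid JSON: {exc.msg}")
            continue
        if not isinstance(record, dict):
            errors.append(f"[line {lineno}] record is not a JSON object")
            continue
        for key in required:
            if key not in record:
                errors.append(f"[line {lineno}] '{key}' is a required property")
    return errors


def exit_code(errors):
    """0 when everything passed, non-zero otherwise."""
    return 0 if not errors else 1
```

Reporting the line number with each message keeps errors traceable even in a combined corpus, where a single file may hold findings from many reports.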
Purpose: Provide a small local dashboard to explore SCVD findings visually:
- Show basic stats and a severity distribution chart.
- List all normalized findings in a table (no filters).
- Let you inspect a single finding in detail (markdown, repo, provenance, etc.).
pip install streamlit pandas
streamlit run dashboard.py -- --jsonl path/to/findings.jsonl
Notes:
- The -- separates Streamlit’s own args from your script’s args.
- --jsonl is the path to the normalized findings file produced by normalize_report.py (can be a combined corpus).
Streamlit will print a URL, usually:
http://localhost:8501
- Overview:
  - Total number of findings.
  - Number of unique reports (doc_id).
  - Number of unique repositories.
- Chart:
  - Bar chart of findings by severity.
- Table:
  - All findings, with columns:
    - scvd_id
    - doc_id
    - finding_index
    - title
    - severity
    - difficulty
    - type
    - target path (target.path)
    - repo URL (repo.url)
- Detail view:
  - Select one scvd_id and see:
    - Title, SCVD ID, doc_id, finding index
    - Severity / difficulty / type
    - Target (path, contract name, function, chain, contract address)
    - Description / markdown (rendered)
    - Repository info
    - Taxonomy (SWC/CWE/tags)
    - Status (fix status, exploit info, etc.)
    - Provenance (source PDF, extraction/normalization timestamps and versions)
    - Raw metadata from the report (metadata_raw)
There are deliberately no filters in this PoC dashboard; it always shows all findings in the provided JSONL file.
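The overview numbers and the severity chart can be approximated without Streamlit. The sketch below computes them from JSONL lines using only the standard library; the function name overview and the "Unknown" bucket for missing severities are assumptions, not the dashboard's actual code.

```python
import json
from collections import Counter


def overview(lines):
    """Summarize SCVD JSONL lines roughly the way the dashboard overview does."""
    records = [json.loads(line) for line in lines if line.strip()]
    # repo may be missing or null on some records, so guard the lookup.
    repo_urls = {(r.get("repo") or {}).get("url") for r in records}
    repo_urls.discard(None)
    return {
        "findings": len(records),
        "reports": len({r.get("doc_id") for r in records}),
        "repositories": len(repo_urls),
        "by_severity": Counter(r.get("severity") or "Unknown" for r in records),
    }
```

This can serve as a quick sanity check on a combined corpus before opening the dashboard.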
# 1) Extract structured report from a single PDF
python extract_report.py reports/timeboost.pdf --use-ollama
# 2) Normalize to SCVD v0.1
python normalize_report.py reports/timeboost.json \
--out reports/timeboost.scvd.jsonl
# 3) Validate SCVD records
python validate_scvd.py reports/timeboost.scvd.jsonl
# 4) Explore visually
streamlit run dashboard.py -- --jsonl reports/timeboost.scvd.jsonl
# 1) Extract all PDFs in a directory
python extract_report.py --pdf-dir ./reports --use-ollama --force
# 2) Normalize each report.json to .scvd.jsonl
for f in reports/*.json; do
out="${f%.json}.scvd.jsonl"
python normalize_report.py "$f" --out "$out"
done
# 3) Combine into a single corpus
cat reports/*.scvd.jsonl > all_findings.jsonl
# 4) Validate combined corpus
python validate_scvd.py all_findings.jsonl
# 5) Run dashboard on the combined data
streamlit run dashboard.py -- --jsonl all_findings.jsonl
You can export CSV from any JSONL file, or from a single JSON array file, without running the full pipeline:
python -m scvd.run_pipeline \
--export-csv \
--jsonl-in path/to/findings.dedup.jsonl \
--csv-out path/to/findings.dedup.csv
Optionally select a tidy subset of columns:
python -m scvd.run_pipeline \
--export-csv \
--jsonl-in path/to/findings.jsonl \
--csv-fields "scvd_id,title,severity.level,target.chain,repo.url"
There is also a convenience script that runs the full pipeline for you:
- extract (PDF/Markdown → report.json + HTML/MD)
- normalize (report.json → *.scvd.jsonl)
- optionally validate against the SCVD v0.1 schema
- optionally combine everything into a single all_findings.jsonl
Example:
python -m scvd.run_pipeline \
--raw-dir data/raw \
--extracted-dir data/extracted \
--normalized-dir data/normalized \
--use-ollama \
--force \
-v
This will:
- Recursively find all *.pdf and *.md under data/raw.
- For each input:
  - Write report.json + HTML/Markdown under data/extracted/...
  - Write *.scvd.jsonl under data/normalized/...
- Combine all normalized findings into data/normalized/combined/all_findings.jsonl
- Optionally validate the combined file with the SCVD v0.1 schema (controlled via --schema / --skip-validate).
You can then point the dashboard at:
streamlit run dashboard.py -- --jsonl data/normalized/combined/all_findings.jsonl
Directory mode (recommended) - with dedup
python -m scvd.run_pipeline \
--raw-dir data/raw \
--extracted-dir data/extracted \
--normalized-dir data/normalized \
--use-ollama \
--force \
--run-dedup \
--dedup-model Snowflake/snowflake-arctic-embed-l-v2.0 \
--dedup-sim-th 0.82 \
--dedup-hard-boost 0.10 \
--dedup-embed-cache disk \
--dedup-topk 5 \
-v
What this does
- Produces data/normalized/combined/all_findings.jsonl
- Runs the semantic dedup post-step and writes data/normalized/combined/all_findings.dedup.jsonl
- The dedup file adds duplicate_of, dedup, and duplicates fields
python -m scvd.run_pipeline path/to/report.pdf --use-ollama -v
This will:
- write report.json, report.html, report.md, and report.scvd.jsonl next to the input file, and
- validate that single *.scvd.jsonl (unless --skip-validate is set).
Runs PDFs/MDs and Code4rena ingestion in one go, then combines outputs.
python -m scvd.run_pipeline \
--raw-dir data/raw \
--extracted-dir data/extracted \
--normalized-dir data/normalized \
--use-ollama \
--force \
-v \
--code4rena-repos data/raw/code4rena/repos.txt \
--github-token-env GITHUB_TOKEN
Or, with deduplication:
python -m scvd.run_pipeline \
--raw-dir data/raw \
--extracted-dir data/extracted \
--normalized-dir data/normalized \
--use-ollama \
--force \
--code4rena-repos data/raw/code4rena/repos.txt \
--github-token-env GITHUB_TOKEN \
--run-dedup \
--dedup-model Snowflake/snowflake-arctic-embed-l-v2.0 \
--dedup-sim-th 0.82 \
--dedup-hard-boost 0.10 \
--dedup-embed-cache disk \
--dedup-topk 5 \
-v
Outputs (mirrors the tree just like PDFs/MDs):
- Synthetic reports: data/extracted/code4rena/<owner>/<repo>/report.json
- Normalized findings: data/normalized/code4rena/<owner>/<repo>/report.scvd.jsonl
- Combined corpus (all sources): data/normalized/combined/all_findings.jsonl
Issue state control (optional):
- --c4-state {all|open|closed} (default: all)
- --c4-open-only (shortcut for --c4-state open)
If you only want to ingest Code4rena repos:
python -m scvd.run_pipeline \
--code4rena-repos data/raw/code4rena/repos.txt \
--extracted-dir data/extracted \
--normalized-dir data/normalized \
--github-token-env GITHUB_TOKEN \
-v
Or, with deduplication:
python -m scvd.run_pipeline \
--code4rena-repos data/raw/code4rena/repos.txt \
--extracted-dir data/extracted \
--normalized-dir data/normalized \
--github-token-env GITHUB_TOKEN \
--run-dedup \
--dedup-model Snowflake/snowflake-arctic-embed-l-v2.0 \
--dedup-sim-th 0.82 \
--dedup-hard-boost 0.10 \
--dedup-embed-cache disk \
--dedup-topk 5 \
-v
Flags (summary)
- --run-dedup: enable the post-processing dedup pass
- --dedup-model: HF embedding model (default: Snowflake/snowflake-arctic-embed-l-v2.0)
- --dedup-sim-th: cosine similarity threshold (default: 0.82)
- --dedup-hard-boost: additive boost if repo/commit/path match (default: 0.10)
- --dedup-embed-cache {none|disk}: store computed embeddings on disk (recommended: disk)
- --dedup-topk: keep top-K candidate duplicates per record (default: 5)
Outputs and combined corpus are written under the same data/extracted / data/normalized roots as above.
The pipeline can export a CSV for spreadsheets/BI tools. It works in directory/C4 modes (using the combined or deduped corpus) or in a convert-only mode where you point to any JSONL/JSON file.
Directory mode → combined → CSV (default source)
python -m scvd.run_pipeline \
--raw-dir data/raw \
--export-csv
# Writes:
# JSONL: data/normalized/combined/all_findings.jsonl
# CSV: data/normalized/combined/all_findings.csv
With dedup (CSV from dedup by default)
python -m scvd.run_pipeline \
--raw-dir data/raw \
--run-dedup \
--export-csv
# Source = all_findings.dedup.jsonl → all_findings.dedup.csv
C4-only mode → combined → CSV
python -m scvd.run_pipeline \
--code4rena-repos data/raw/code4rena/repos.txt \
--export-csv
Convert-only (any JSONL/JSON file → CSV)
python -m scvd.run_pipeline \
--export-csv \
--jsonl-in path/to/whatever.jsonl \
--csv-out out/whatever.csv
Pick specific columns
python -m scvd.run_pipeline \
--export-csv \
--jsonl-in data/normalized/combined/all_findings.jsonl \
--csv-fields "scvd_id,title,severity.level,taxonomy.swc,target.chain,contract_address,provenance.source_url"
Flags (CSV)
- --export-csv – enable CSV export
- --jsonl-in – explicit source JSONL/JSON (overrides combined/dedup default)
- --csv-out – CSV output path (default: same as source with .csv)
- --csv-fields – comma-separated dotted keys; if omitted, exports all flattened keys
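The "dotted keys" used by --csv-fields refer to paths into nested records. The following sketch shows one plausible way to flatten a record and write selected columns; flatten and to_csv are hypothetical names, and how the pipeline serializes list values (such as taxonomy.swc) is not specified here.

```python
import csv
import io


def flatten(obj, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} (lists are left as-is)."""
    flat = {}
    for key, value in obj.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def to_csv(records, fields=None):
    """Render records as CSV text; default to all flattened keys, sorted."""
    rows = [flatten(r) for r in records]
    if fields is None:
        fields = sorted({key for row in rows for key in row})
    buf = io.StringIO()
    # Unselected keys are dropped; missing keys become empty cells.
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore", restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A field that is absent from a record simply yields an empty cell, so sparse placeholder fields do not break the export in this sketch.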
- taxonomy.swc, taxonomy.cwe, and many target/status fields are defined in the schema but intentionally null/empty in this PoC. They are placeholders for future enrichment (SWC tagging, CWE mapping, fix-status parsing, CVSS-like scoring, etc.).
- The pipeline is read-only: it never writes back into PDFs or source repos.
- JSONL (*.scvd.jsonl) and the SCVD JSON Schema are the main artifacts meant for discussion, experimentation, and future standardization of smart contract vulnerability data.
This repo also includes a lightweight read-only API to serve normalized findings.
pip install "fastapi>=0.103" "uvicorn[standard]>=0.23" "python-dateutil>=2.8" "PyYAML>=6.0"
From the repository root (scvd/):
uvicorn api.app:app --reload --host 127.0.0.1 --port 8000
If you move things around, point to the module path of your app file, e.g. uvicorn scvd.api.app:app --reload.
- SCVD_DATA_JSONL: path to combined findings file (default: data/normalized/combined/all_findings.jsonl)
- SCVD_SNAPSHOTS_DIR: directory for monthly JSONL snapshots (default: data/snapshots)
- SCVD_API_KEY: optional. If set, the API will require X-API-Key for access. If unset, the API is public (recommended for PoC).
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
- Raw spec: http://127.0.0.1:8000/openapi.json
Main endpoints:
- GET /health: health check ({"status":"ok","loaded": N})
- GET /findings: list with filters, pagination via X-Next-Cursor
- GET /findings/{scvd_id}: single record
- GET /stats: corpus summary
- GET /snapshots: list available monthly snapshots
- GET /snapshots/{period}: download a snapshot (JSON Lines stream)
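Cursor pagination on GET /findings can be followed from Python as well. This is a minimal sketch using only the standard library: fetch_page and iter_pages are hypothetical names, the default limit and max_pages values are assumptions, and the sketch yields each parsed response body as-is rather than assuming its shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(base_url, params):
    """Fetch one /findings page; return (parsed_body, next_cursor_or_None)."""
    with urlopen(f"{base_url}/findings?{urlencode(params)}") as resp:
        # Header lookup on the response object is case-insensitive.
        return json.load(resp), resp.headers.get("X-Next-Cursor")


def iter_pages(base_url, limit=50, max_pages=100, fetch=fetch_page):
    """Yield response bodies until the server stops sending X-Next-Cursor."""
    params = {"limit": limit}
    for _ in range(max_pages):  # hard cap so a misbehaving server cannot loop forever
        body, cursor = fetch(base_url, params)
        yield body
        if not cursor:
            break
        params = {"limit": limit, "cursor": cursor}
```

For example, `for page in iter_pages("http://127.0.0.1:8000", limit=100): ...` would walk the whole corpus against a locally running API. The fetch parameter exists so the loop can be exercised without a network.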
# Health
curl http://127.0.0.1:8000/health
# One Medium finding (limit 1)
curl "http://127.0.0.1:8000/findings?severity=Medium&limit=1" | jq .
# Free-text search
curl "http://127.0.0.1:8000/findings?q=malleability&limit=5" | jq .
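# Pagination (header names are case-insensitive and are often sent in lowercase, so match accordingly)
FIRST=$(curl -i -s "http://127.0.0.1:8000/findings?limit=1" | tee /dev/tty | awk -F': ' 'tolower($1) == "x-next-cursor" {print $2}' | tr -d '\r')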
curl "http://127.0.0.1:8000/findings?limit=1&cursor=$FIRST" | jq .
# Snapshots
curl http://127.0.0.1:8000/snapshots | jq .
curl http://127.0.0.1:8000/snapshots/2025-11 -o 2025-11.jsonl
Option A: Postman UI
- Run the API locally.
- In Postman, Import → Link and paste http://127.0.0.1:8000/openapi.json (or import the YAML file).
- Choose Generate collection. Folder strategy Tags works well.
Option B: CLI → collection file
npm i -g openapi-to-postmanv2
curl http://127.0.0.1:8000/openapi.json -o openapi.json
openapi2postmanv2 -s openapi.json -o scvd.postman_collection.json -p -O folderStrategy=Tags
Import scvd.postman_collection.json into Postman.
Swagger UI examples troubleshooting: if examples don’t render, either (1) use OpenAPI 3.0.3 + nullable, or (2) keep 3.1.0 and replace type: [string, "null"] with oneOf and add a top-level example: under the media type.
The same FastAPI app is deployed read-only on Google Cloud Run and fronted by the custom domain api.scvd.dev.
Base URL
https://api.scvd.dev
Auth
- Public by default.
- If the service is started with SCVD_API_KEY, every request must include: X-API-Key: <your-key>
CORS
- Enabled for all origins (read-only).
Key endpoints
- GET /health – quick status and loaded record count
- GET /findings – list with filters (q, severity, swc, cwe, doc_id, chain, repo, since, until, sort, order, limit, cursor)
- GET /findings/{scvd_id} – fetch one record
- GET /stats – summary counters and top SWC
- GET /snapshots – list monthly JSONL snapshots
- GET /snapshots/{period} – download a snapshot (streaming JSON Lines)
- Docs: GET /docs (Swagger), GET /redoc, raw spec at GET /openapi.json
Examples
# health
curl https://api.scvd.dev/health
# first page of Medium findings
curl "https://api.scvd.dev/findings?severity=Medium&limit=5" | jq .
# free-text search
curl "https://api.scvd.dev/findings?q=reentrancy&limit=5" | jq .
# follow cursor
NEXT=$(curl -i -s "https://api.scvd.dev/findings?limit=1" \
| awk -F': ' 'tolower($1) == "x-next-cursor" {print $2}' | tr -d '\r')
curl "https://api.scvd.dev/findings?limit=1&cursor=$NEXT" | jq .
# stats for a time range
curl "https://api.scvd.dev/stats?since=2025-11-01T00:00:00Z&until=2025-12-01T00:00:00Z" | jq .
OpenAPI in Postman
- Import from URL: https://api.scvd.dev/openapi.json
- Or generate a collection via CLI:
npx -y openapi-to-postmanv2 \
-s https://api.scvd.dev/openapi.json \
-o scvd.postman_collection.json \
-p -O folderStrategy=Tags
Rolling out new data
- The service reads SCVD_DATA_JSONL (default data/normalized/combined/all_findings.jsonl) at startup.
- To publish new data:
  - build & push a new image with the updated JSONL,
  - deploy a new revision of the same Cloud Run service.
- Your domain mapping to api.scvd.dev stays the same.
This PoC includes an optional post-processing step that detects near-duplicate findings across reports and marks a canonical record.
- Computes embeddings for each finding using a local HF model (default: Snowflake/snowflake-arctic-embed-l-v2.0).
- Scores pairwise similarity with cosine similarity.
- Adds a small hard boost if repo/commit/path agree (we’ve found this hugely helpful).
- Selects a canonical record per duplicate cluster and annotates:
  - duplicate_of (on duplicates)
  - dedup (decision metadata on every record)
  - duplicates (top-K scored neighbors per record)
No vectors are written into your JSON. If --dedup-embed-cache disk is used, embeddings are cached locally under the embeddings root (default passed from run_pipeline as --emb-root data) to speed up subsequent runs.
You can run it on any combined corpus:
python -m scvd.dedup.run_dedup \
--in data/normalized/combined/all_findings.jsonl \
--out data/normalized/combined/all_findings.dedup.jsonl \
--model Snowflake/snowflake-arctic-embed-l-v2.0 \
--sim-th 0.82 \
--hard-boost 0.10 \
--topk 5 \
--embed-cache disk \
--emb-root data
- text_sim = cosine(embedding_i, embedding_j)
- hard signals (binary):
  - repo_match = 1.0 if repo URL/org/name agree
  - commit_match = 1.0 if same commit
  - path_match = 1.0 if same repo.relative_file
- final score = text_sim + hard_boost * (repo_match + commit_match + path_match)
  - Clipped to [0, 1]
  - Default hard_boost = 0.10
- A pair is a duplicate if final score >= sim_th (default 0.82).
We also compute a swc_overlap (Jaccard) and include it in duplicates.signals for transparency; it’s not part of the default boost.
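The scoring rule above is small enough to write out directly. The sketch below is a plain-Python illustration of the documented formula, not the module's actual code; the function names are hypothetical, and real embeddings would come from the HF model rather than hand-written vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def final_score(text_sim, repo_match=0.0, commit_match=0.0, path_match=0.0,
                hard_boost=0.10):
    """text_sim plus hard_boost per matching hard signal, clipped to [0, 1]."""
    raw = text_sim + hard_boost * (repo_match + commit_match + path_match)
    return max(0.0, min(1.0, raw))


def is_duplicate(score, sim_th=0.82):
    """A pair counts as a duplicate when the final score reaches the threshold."""
    return score >= sim_th
```

With the defaults, a pair with text_sim 0.75 falls below the threshold on text alone but crosses it once a single hard signal (for example the same repo) adds 0.10. That is the effect the boost is meant to have.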
For a group of mutual duplicates, the canonical record is the first in encounter order (stable across runs for the same input order). You can later swap this out to prefer older provenance or richer context.
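One way to realize "first in encounter order" is a union-find over duplicate pairs that always keeps the earliest record as the cluster root. This is a sketch under assumptions: it treats duplicate groups as connected components (which may be looser than "mutual duplicates" in the actual implementation), and assign_canonicals is a hypothetical name.

```python
def assign_canonicals(ids, duplicate_pairs):
    """Map each scvd_id to its duplicate_of value (None if canonical/unique)."""
    order = {scvd_id: i for i, scvd_id in enumerate(ids)}
    parent = {scvd_id: scvd_id for scvd_id in ids}

    def find(x):
        # Follow parents to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the earlier-encountered record as the cluster root.
            if order[ra] > order[rb]:
                ra, rb = rb, ra
            parent[rb] = ra

    return {i: (None if find(i) == i else find(i)) for i in ids}
```

Because the root only depends on input order, the result is stable across runs for the same input. Preferring older provenance or richer context would mean replacing the order lookup with a different ranking.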
On each SCVD record:
{
"duplicate_of": "scvd_abc123" // or null if canonical/unique
}
{
"dedup": {
"decision": "canonical | duplicate | unique | uncertain",
"canonical_id": "scvd_abc123",
"model": "Snowflake/snowflake-arctic-embed-l-v2.0",
"sim_threshold": 0.82,
"hard_boost": 0.1,
"run_at": "2025-12-10T12:34:56Z"
},
"duplicates": [
{
"scvd_id": "scvd_def456",
"score": 1.0,
"signals": {
"text_sim": 0.86,
"swc_overlap": 1.0,
"repo_match": 1.0,
"commit_match": 1.0,
"path_match": 1.0
},
"shared": {
"repo": "https://github.com/acme/proj",
"commit": "deadbeef",
"path": "contracts/Token.sol"
}
}
]
}
- Model: Snowflake/snowflake-arctic-embed-l-v2.0 is robust, multilingual, and fast on GPU. You can swap to any HF text embedding model.
- Threshold: raise --dedup-sim-th to cut false positives; lower it to catch more.
- Boosts: if your data always includes code paths/commits, the default 0.10 boost is conservative; feel free to raise it.
- Caching: --dedup-embed-cache disk is recommended on larger corpora; it avoids recomputing embeddings.
The default model fits comfortably and uses GPU automatically via transformers (no extra flags needed). You can pin a device with CUDA_VISIBLE_DEVICES.