Reproducible, 0-cost, local-first analytics pipeline (macOS / Apple Silicon). Anchor project: "AI adoption across Polish & EU enterprises" — public data → audited pipeline → cited report. Built as a portfolio + hard-skills learning vehicle for a junior data/business-analyst role in the PL/EU financial sector.
What this is — and isn't (read first). The analytical subject is AI adoption in the Polish business economy benchmarked against the EU (GUS × Eurostat). The financial sector (NACE K) is excluded from these official statistics — there is no public dataset of AI adoption specific to Polish banks (none from NBP/KNF/EBA/ECB as of mid-2026). So this is not a banking data-cut. Banking shows up here only in two, clearly-separated ways: (1) it's the sector I'm targeting for work — the motivation, not the dataset; and (2) a separate qualitative public-evidence synthesis on AI in EU/PL banking (
reports/evidence_report.html), which collates published sources — it is not derived from the GUS/Eurostat figures. SeeFACT_CHECK.md.
AI-augmented, not AI-substituted. Code is generated/reviewed with Claude Code (Claude Max) and, for local models, LM Studio (OpenAI-compatible local server) — never a metered API key. Every generated SQL / pandas / statistic is rewritten and understood by hand so it survives a technical interview.
| Question | How does AI adoption among Polish enterprises compare with the EU, and how is it trending — reconciled across two independent official statistics offices? |
| Audience | Hiring managers (junior DA/BA, PL/EU finance) + my own decision workflow. |
| Decision it supports | A defensible, source-cited read on PL vs EU enterprise AI adoption — and a demonstrable, reproducible analyst workflow. |
| In scope | Public data only: GUS (Wykorzystanie ICT w przedsiębiorstwach), Eurostat (isoc_eb_ai); NBP rates as a pipeline-skeleton placeholder. |
| Out of scope | A banking-specific AI-adoption data-cut (no clean public source exists → handled separately as a qualitative evidence synthesis, scoped explicitly). |
| Success | Every figure traceable to a source; the whole report rebuilds from raw with one command (dvc repro). |
CCDS-style layout + DVC + pandera + Quarto + MLflow(SQLite), AI assist via Claude Code (Max) / LM Studio (local).
Design v2 was chosen after a compose-vs-build prior-art scan (decision record kept in private notes).
| # | Component | Status |
|---|---|---|
| 00 | problem-frame | ✅ this README |
| 01 | data-acquire | ✅ src/data/ingest_nbp.py (NBP, with provenance) |
| 02 | data-version | ✅ DVC pipeline (dvc.yaml) |
| 03 | data-quality-validate | ✅ pandera contract (schemas/nbp_schema.py + src/data/validate.py) |
| 04 | source-claim-validate | ✅ FACT_CHECK.md (SIFT template) |
| 05 | data-wrangle | ✅ src/features/wrangle.py (wide pivot, log-returns, rolling vol, join log) |
| 06 | stat-validate | ✅ src/analysis/stat_validate.py (scipy: Shapiro + Mann–Whitney + Cohen's d + CI) |
| 07 | data-analyze | ✅ hand-written SQL via DuckDB (sql/fx_analysis.sql) + local AI assist (src/analysis/text_to_sql.py) |
| 08 | insight-synth | ✅ src/analysis/insight_synth.py (findings → evidence + confidence) |
| 09 | decision-report | ✅ Quarto HTML showcase (reports/report.qmd → report.html, with chart) + Markdown fallback (build_report.py → report.md) |
| 10 | monitor-and-maintain | ✅ src/pipeline/monitor.py → reports/run_log.jsonl (MLflow = planned upgrade) |
The headline data deliverable. A focused SQL thread that reconciles the same indicator — enterprises using at least one AI technology — across two official statistics offices: GUS (Poland) vs Eurostat (EU). It exists to demonstrate hand-written JOIN + reconciliation + window functions on real public data.
| Stage | What |
|---|---|
ingest_gus_ai |
GUS Wykorzystanie ICT w przedsiębiorstwach 2025, Tab.15 → tidy parquet + provenance (src/data/ingest_gus_ai.py) |
ingest_eurostat_ai |
Eurostat isoc_eb_ai (E_AI_TANY, SDMX-CSV) → tidy parquet + provenance (src/data/ingest_eurostat_ai.py) |
validate_ai |
pandera contracts for both tables (schemas/gus_ai_schema.py, schemas/eurostat_ai_schema.py) |
analyze_ai |
4 hand-written DuckDB queries (sql/ai_*.sql) → reports/ai_sql_metrics.json + reports/ai_adoption_reconciliation.md |
ai_report |
branded charts + print-ready HTML (src/report/build_ai_report.py → reports/ai_adoption_report.html); render to PDF with the Chrome one-liner in the script docstring |
uv run dvc repro analyze_ai # ingest GUS + Eurostat -> validate -> 4 SQL queries -> report
uv run pytest tests/ # pandera contract tests + SQL invariantsScope, stated honestly: both sources cover the whole economy excluding the financial sector
(NACE K) — there is no public clean "AI adoption in PL banks" table. The banking read is contextual
(consolidated, large-firm sectors map to the high-adoption 250+ segment), not a financial-sector cut. The
finding is a synthesis of public figures (reconcile + benchmark + trend), not new facts — see
FACT_CHECK.md and reports/ai_adoption_reconciliation.md.
Prerequisite: uv — the only thing you install by hand; it provisions Python and the full locked env (DVC and pandera included).
curl -LsSf https://astral.sh/uv/install.sh | sh # install uv once (macOS/Linux)
uv sync # restore the locked environment
uv run dvc repro # ingest data -> validate -> analyze -> reportOutputs: data/raw/*.parquet (+ .provenance.json), data/processed/*.parquet, and reports/ (ai_adoption_report, evidence_report, report, stats, insights, sql_metrics, run_log).
Three self-contained outputs, with distinct purposes — don't conflate them:
reports/ai_adoption_report.html/.pdf— the flagship analysis: the GUS × Eurostat reconciliation on enterprise AI adoption (PL vs EU, 2021–2025), with branded charts, the four SQL queries' results, and the scope caveat upfront. This is the data deliverable.reports/evidence_report.html— sector context (qualitative): a public-evidence synthesis on AI in EU/PL banking — the sector I'm targeting (20 source-cited claims, SIFT-verified, with strength + triangulation). It collates published sources; it is not derived from the GUS/Eurostat dataset above.reports/report.html— a pipeline-mechanics showcase rendered fromreport.qmd. It runs on placeholder NBP FX data (labelled "skeleton") to demonstrate the end-to-end DVC → validate → wrangle → render flow.
The HTML reports need the quarto CLI on PATH (the pip-only pipeline runs fine without it — the report_html
stage just renders the HTML):
# install Quarto once, no sudo — macOS tarball into ~/.local, symlink onto PATH
# (pick the latest VER from https://github.com/quarto-dev/quarto-cli/releases)
VER=1.6.40
curl -L "https://github.com/quarto-dev/quarto-cli/releases/download/v$VER/quarto-$VER-macos.tar.gz" | tar -xz -C ~/.local
ln -sf ~/.local/quarto-$VER/bin/quarto ~/.local/bin/quarto
uv run dvc repro report_html # or: quarto render reports/report.qmdAsk questions in natural language; a local model (LM Studio, OpenAI-compatible endpoint) writes the SQL,
which is shown, saved to sql/generated/, and run on DuckDB. 0 PLN beyond hardware, nothing leaves the machine.
# LM Studio: load a model, Start Server (port 1234), turn the model's "thinking/reasoning" mode OFF
uv run python src/analysis/text_to_sql.py "which date had the highest USD mid rate?"Provider-abstracted via env (LLM_BASE_URL, LLM_MODEL). Not a pipeline stage (interactive + needs the server).
Literacy guard: the generated SQL is a draft to review/rewrite — in an interview the SQL must be yours.
- Environment lock:
pyproject.toml+uv.lock. - Data versioning: DVC (
dvc.yamlstages,dvc.locklineage record); local cache, no cloud. - Provenance: every dataset ships a
*.provenance.json(source, endpoint, access time, SHA-256, license). - Data contract: pandera schema validates dtypes, ranges, uniqueness, no future dates — fails loudly on drift.