Skip to content

martin0ne/analytics-stack-pl

Repository files navigation

analytics-stack-pl

Reproducible, 0-cost, local-first analytics pipeline (macOS / Apple Silicon). Anchor project: "AI adoption across Polish & EU enterprises" — public data → audited pipeline → cited report. Built as a portfolio + hard-skills learning vehicle for a junior data/business-analyst role in the PL/EU financial sector.

What this is — and isn't (read first). The analytical subject is AI adoption in the Polish business economy benchmarked against the EU (GUS × Eurostat). The financial sector (NACE K) is excluded from these official statistics — there is no public dataset of AI adoption specific to Polish banks (none from NBP/KNF/EBA/ECB as of mid-2026). So this is not a banking data-cut. Banking shows up here only in two, clearly-separated ways: (1) it's the sector I'm targeting for work — the motivation, not the dataset; and (2) a separate qualitative public-evidence synthesis on AI in EU/PL banking (reports/evidence_report.html), which collates published sources — it is not derived from the GUS/Eurostat figures. See FACT_CHECK.md.

AI-augmented, not AI-substituted. Code is generated/reviewed with Claude Code (Claude Max) and, for local models, LM Studio (OpenAI-compatible local server) — never a metered API key. Every generated SQL / pandas / statistic is rewritten and understood by hand so it survives a technical interview.

00 — Problem frame

Question How does AI adoption among Polish enterprises compare with the EU, and how is it trending — reconciled across two independent official statistics offices?
Audience Hiring managers (junior DA/BA, PL/EU finance) + my own decision workflow.
Decision it supports A defensible, source-cited read on PL vs EU enterprise AI adoption — and a demonstrable, reproducible analyst workflow.
In scope Public data only: GUS (Wykorzystanie ICT w przedsiębiorstwach), Eurostat (isoc_eb_ai); NBP rates as a pipeline-skeleton placeholder.
Out of scope A banking-specific AI-adoption data-cut (no clean public source exists → handled separately as a qualitative evidence synthesis, scoped explicitly).
Success Every figure traceable to a source; the whole report rebuilds from raw with one command (dvc repro).

Stack (DESIGN v2, 2026-06-13)

CCDS-style layout + DVC + pandera + Quarto + MLflow(SQLite), AI assist via Claude Code (Max) / LM Studio (local). Design v2 was chosen after a compose-vs-build prior-art scan (decision record kept in private notes).

Pipeline components (v1)

# Component Status
00 problem-frame ✅ this README
01 data-acquire src/data/ingest_nbp.py (NBP, with provenance)
02 data-version ✅ DVC pipeline (dvc.yaml)
03 data-quality-validate ✅ pandera contract (schemas/nbp_schema.py + src/data/validate.py)
04 source-claim-validate FACT_CHECK.md (SIFT template)
05 data-wrangle src/features/wrangle.py (wide pivot, log-returns, rolling vol, join log)
06 stat-validate src/analysis/stat_validate.py (scipy: Shapiro + Mann–Whitney + Cohen's d + CI)
07 data-analyze ✅ hand-written SQL via DuckDB (sql/fx_analysis.sql) + local AI assist (src/analysis/text_to_sql.py)
08 insight-synth src/analysis/insight_synth.py (findings → evidence + confidence)
09 decision-report ✅ Quarto HTML showcase (reports/report.qmdreport.html, with chart) + Markdown fallback (build_report.pyreport.md)
10 monitor-and-maintain src/pipeline/monitor.pyreports/run_log.jsonl (MLflow = planned upgrade)

Flagship analysis — AI-adoption SQL cannonball (GUS × Eurostat reconciliation)

The headline data deliverable. A focused SQL thread that reconciles the same indicator — enterprises using at least one AI technology — across two official statistics offices: GUS (Poland) vs Eurostat (EU). It exists to demonstrate hand-written JOIN + reconciliation + window functions on real public data.

Stage What
ingest_gus_ai GUS Wykorzystanie ICT w przedsiębiorstwach 2025, Tab.15 → tidy parquet + provenance (src/data/ingest_gus_ai.py)
ingest_eurostat_ai Eurostat isoc_eb_ai (E_AI_TANY, SDMX-CSV) → tidy parquet + provenance (src/data/ingest_eurostat_ai.py)
validate_ai pandera contracts for both tables (schemas/gus_ai_schema.py, schemas/eurostat_ai_schema.py)
analyze_ai 4 hand-written DuckDB queries (sql/ai_*.sql) → reports/ai_sql_metrics.json + reports/ai_adoption_reconciliation.md
ai_report branded charts + print-ready HTML (src/report/build_ai_report.pyreports/ai_adoption_report.html); render to PDF with the Chrome one-liner in the script docstring
uv run dvc repro analyze_ai      # ingest GUS + Eurostat -> validate -> 4 SQL queries -> report
uv run pytest tests/             # pandera contract tests + SQL invariants

Scope, stated honestly: both sources cover the whole economy excluding the financial sector (NACE K) — there is no public clean "AI adoption in PL banks" table. The banking read is contextual (consolidated, large-firm sectors map to the high-adoption 250+ segment), not a financial-sector cut. The finding is a synthesis of public figures (reconcile + benchmark + trend), not new facts — see FACT_CHECK.md and reports/ai_adoption_reconciliation.md.

Run

Prerequisite: uv — the only thing you install by hand; it provisions Python and the full locked env (DVC and pandera included).

curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv once (macOS/Linux)
uv sync                                            # restore the locked environment
uv run dvc repro                                   # ingest data -> validate -> analyze -> report

Outputs: data/raw/*.parquet (+ .provenance.json), data/processed/*.parquet, and reports/ (ai_adoption_report, evidence_report, report, stats, insights, sql_metrics, run_log).

Reports

Three self-contained outputs, with distinct purposes — don't conflate them:

  • reports/ai_adoption_report.html / .pdfthe flagship analysis: the GUS × Eurostat reconciliation on enterprise AI adoption (PL vs EU, 2021–2025), with branded charts, the four SQL queries' results, and the scope caveat upfront. This is the data deliverable.
  • reports/evidence_report.htmlsector context (qualitative): a public-evidence synthesis on AI in EU/PL banking — the sector I'm targeting (20 source-cited claims, SIFT-verified, with strength + triangulation). It collates published sources; it is not derived from the GUS/Eurostat dataset above.
  • reports/report.html — a pipeline-mechanics showcase rendered from report.qmd. It runs on placeholder NBP FX data (labelled "skeleton") to demonstrate the end-to-end DVC → validate → wrangle → render flow.

The HTML reports need the quarto CLI on PATH (the pip-only pipeline runs fine without it — the report_html stage just renders the HTML):

# install Quarto once, no sudo — macOS tarball into ~/.local, symlink onto PATH
# (pick the latest VER from https://github.com/quarto-dev/quarto-cli/releases)
VER=1.6.40
curl -L "https://github.com/quarto-dev/quarto-cli/releases/download/v$VER/quarto-$VER-macos.tar.gz" | tar -xz -C ~/.local
ln -sf ~/.local/quarto-$VER/bin/quarto ~/.local/bin/quarto

uv run dvc repro report_html        # or: quarto render reports/report.qmd

Local AI assist — text-to-SQL (the AI edge)

Ask questions in natural language; a local model (LM Studio, OpenAI-compatible endpoint) writes the SQL, which is shown, saved to sql/generated/, and run on DuckDB. 0 PLN beyond hardware, nothing leaves the machine.

# LM Studio: load a model, Start Server (port 1234), turn the model's "thinking/reasoning" mode OFF
uv run python src/analysis/text_to_sql.py "which date had the highest USD mid rate?"

Provider-abstracted via env (LLM_BASE_URL, LLM_MODEL). Not a pipeline stage (interactive + needs the server). Literacy guard: the generated SQL is a draft to review/rewrite — in an interview the SQL must be yours.

Reproducibility & provenance

  • Environment lock: pyproject.toml + uv.lock.
  • Data versioning: DVC (dvc.yaml stages, dvc.lock lineage record); local cache, no cloud.
  • Provenance: every dataset ships a *.provenance.json (source, endpoint, access time, SHA-256, license).
  • Data contract: pandera schema validates dtypes, ranges, uniqueness, no future dates — fails loudly on drift.

About

Reproducible, 0-cost analytics pipeline (DVC + pandera + DuckDB + Quarto): AI adoption across Polish vs EU enterprises (GUS × Eurostat reconciliation). Portfolio + hard-skills vehicle for a junior data/BA role in PL/EU finance.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors