analytics-stack-pl

Reproducible, 0-cost, local-first analytics pipeline (macOS / Apple Silicon). Anchor project: "AI adoption across Polish & EU enterprises" — public data → audited pipeline → cited report. Built as a portfolio + hard-skills learning vehicle for a junior data/business-analyst role in the PL/EU financial sector.

What this is — and isn't (read first). The analytical subject is AI adoption in the Polish business economy benchmarked against the EU (GUS × Eurostat). The financial sector (NACE K) is excluded from these official statistics — there is no public dataset of AI adoption specific to Polish banks (none from NBP/KNF/EBA/ECB as of mid-2026). So this is not a banking data-cut. Banking shows up here only in two, clearly-separated ways: (1) it's the sector I'm targeting for work — the motivation, not the dataset; and (2) a separate qualitative public-evidence synthesis on AI in EU/PL banking (reports/evidence_report.html), which collates published sources — it is not derived from the GUS/Eurostat figures. See FACT_CHECK.md.

AI-augmented, not AI-substituted. Code is generated/reviewed with Claude Code (Claude Max) and, for local models, LM Studio (OpenAI-compatible local server) — never a metered API key. Every generated SQL / pandas / statistic is rewritten and understood by hand so it survives a technical interview.

00 — Problem frame


Question	How does AI adoption among Polish enterprises compare with the EU, and how is it trending — reconciled across two independent official statistics offices?
Audience	Hiring managers (junior DA/BA, PL/EU finance) + my own decision workflow.
Decision it supports	A defensible, source-cited read on PL vs EU enterprise AI adoption — and a demonstrable, reproducible analyst workflow.
In scope	Public data only: GUS (Wykorzystanie ICT w przedsiębiorstwach), Eurostat (`isoc_eb_ai`); NBP rates as a pipeline-skeleton placeholder.
Out of scope	A banking-specific AI-adoption data-cut (no clean public source exists → handled separately as a qualitative evidence synthesis, scoped explicitly).
Success	Every figure traceable to a source; the whole report rebuilds from raw with one command (`dvc repro`).

Stack (DESIGN v2, 2026-06-13)

CCDS-style layout + DVC + pandera + Quarto + MLflow(SQLite), AI assist via Claude Code (Max) / LM Studio (local). Design v2 was chosen after a compose-vs-build prior-art scan (decision record kept in private notes).

Pipeline components (v1)

#	Component	Status
00	problem-frame	✅ this README
01	data-acquire	✅ `src/data/ingest_nbp.py` (NBP, with provenance)
02	data-version	✅ DVC pipeline (`dvc.yaml`)
03	data-quality-validate	✅ pandera contract (`schemas/nbp_schema.py` + `src/data/validate.py`)
04	source-claim-validate	✅ `FACT_CHECK.md` (SIFT template)
05	data-wrangle	✅ `src/features/wrangle.py` (wide pivot, log-returns, rolling vol, join log)
06	stat-validate	✅ `src/analysis/stat_validate.py` (scipy: Shapiro + Mann–Whitney + Cohen's d + CI)
07	data-analyze	✅ hand-written SQL via DuckDB (`sql/fx_analysis.sql`) + local AI assist (`src/analysis/text_to_sql.py`)
08	insight-synth	✅ `src/analysis/insight_synth.py` (findings → evidence + confidence)
09	decision-report	✅ Quarto HTML showcase (`reports/report.qmd` → `report.html`, with chart) + Markdown fallback (`build_report.py` → `report.md`)
10	monitor-and-maintain	✅ `src/pipeline/monitor.py` → `reports/run_log.jsonl` (MLflow = planned upgrade)

Flagship analysis — AI-adoption SQL cannonball (GUS × Eurostat reconciliation)

The headline data deliverable. A focused SQL thread that reconciles the same indicator — enterprises using at least one AI technology — across two official statistics offices: GUS (Poland) vs Eurostat (EU). It exists to demonstrate hand-written JOIN + reconciliation + window functions on real public data.

Stage	What
`ingest_gus_ai`	GUS Wykorzystanie ICT w przedsiębiorstwach 2025, Tab.15 → tidy parquet + provenance (`src/data/ingest_gus_ai.py`)
`ingest_eurostat_ai`	Eurostat `isoc_eb_ai` (`E_AI_TANY`, SDMX-CSV) → tidy parquet + provenance (`src/data/ingest_eurostat_ai.py`)
`validate_ai`	pandera contracts for both tables (`schemas/gus_ai_schema.py`, `schemas/eurostat_ai_schema.py`)
`analyze_ai`	4 hand-written DuckDB queries (`sql/ai_*.sql`) → `reports/ai_sql_metrics.json` + `reports/ai_adoption_reconciliation.md`
`ai_report`	branded charts + print-ready HTML (`src/report/build_ai_report.py` → `reports/ai_adoption_report.html`); render to PDF with the Chrome one-liner in the script docstring

uv run dvc repro analyze_ai      # ingest GUS + Eurostat -> validate -> 4 SQL queries -> report
uv run pytest tests/             # pandera contract tests + SQL invariants

Scope, stated honestly: both sources cover the whole economy excluding the financial sector (NACE K) — there is no public clean "AI adoption in PL banks" table. The banking read is contextual (consolidated, large-firm sectors map to the high-adoption 250+ segment), not a financial-sector cut. The finding is a synthesis of public figures (reconcile + benchmark + trend), not new facts — see FACT_CHECK.md and reports/ai_adoption_reconciliation.md.

Run

Prerequisite: uv — the only thing you install by hand; it provisions Python and the full locked env (DVC and pandera included).

curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv once (macOS/Linux)
uv sync                                            # restore the locked environment
uv run dvc repro                                   # ingest data -> validate -> analyze -> report

Outputs: data/raw/*.parquet (+ .provenance.json), data/processed/*.parquet, and reports/ (ai_adoption_report, evidence_report, report, stats, insights, sql_metrics, run_log).

Reports

Three self-contained outputs, with distinct purposes — don't conflate them:

reports/ai_adoption_report.html / .pdf — the flagship analysis: the GUS × Eurostat reconciliation on enterprise AI adoption (PL vs EU, 2021–2025), with branded charts, the four SQL queries' results, and the scope caveat upfront. This is the data deliverable.
reports/evidence_report.html — sector context (qualitative): a public-evidence synthesis on AI in EU/PL banking — the sector I'm targeting (20 source-cited claims, SIFT-verified, with strength + triangulation). It collates published sources; it is not derived from the GUS/Eurostat dataset above.
reports/report.html — a pipeline-mechanics showcase rendered from report.qmd. It runs on placeholder NBP FX data (labelled "skeleton") to demonstrate the end-to-end DVC → validate → wrangle → render flow.

The HTML reports need the quarto CLI on PATH (the pip-only pipeline runs fine without it — the report_html stage just renders the HTML):

# install Quarto once, no sudo — macOS tarball into ~/.local, symlink onto PATH
# (pick the latest VER from https://github.com/quarto-dev/quarto-cli/releases)
VER=1.6.40
curl -L "https://github.com/quarto-dev/quarto-cli/releases/download/v$VER/quarto-$VER-macos.tar.gz" | tar -xz -C ~/.local
ln -sf ~/.local/quarto-$VER/bin/quarto ~/.local/bin/quarto

uv run dvc repro report_html        # or: quarto render reports/report.qmd

Local AI assist — text-to-SQL (the AI edge)

Ask questions in natural language; a local model (LM Studio, OpenAI-compatible endpoint) writes the SQL, which is shown, saved to sql/generated/, and run on DuckDB. 0 PLN beyond hardware, nothing leaves the machine.

# LM Studio: load a model, Start Server (port 1234), turn the model's "thinking/reasoning" mode OFF
uv run python src/analysis/text_to_sql.py "which date had the highest USD mid rate?"

Provider-abstracted via env (LLM_BASE_URL, LLM_MODEL). Not a pipeline stage (interactive + needs the server). Literacy guard: the generated SQL is a draft to review/rewrite — in an interview the SQL must be yours.

Reproducibility & provenance

Environment lock: pyproject.toml + uv.lock.
Data versioning: DVC (dvc.yaml stages, dvc.lock lineage record); local cache, no cloud.
Provenance: every dataset ships a *.provenance.json (source, endpoint, access time, SHA-256, license).
Data contract: pandera schema validates dtypes, ranges, uniqueness, no future dates — fails loudly on drift.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

analytics-stack-pl

00 — Problem frame

Stack (DESIGN v2, 2026-06-13)

Pipeline components (v1)

Flagship analysis — AI-adoption SQL cannonball (GUS × Eurostat reconciliation)

Run

Reports

Local AI assist — text-to-SQL (the AI edge)

Reproducibility & provenance

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.dvc		.dvc
data		data
notebooks		notebooks
references		references
reports		reports
schemas		schemas
sql		sql
src		src
tests		tests
.dvcignore		.dvcignore
.gitignore		.gitignore
.python-version		.python-version
FACT_CHECK.md		FACT_CHECK.md
LICENSE		LICENSE
README.md		README.md
dvc.lock		dvc.lock
dvc.yaml		dvc.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

analytics-stack-pl

00 — Problem frame

Stack (DESIGN v2, 2026-06-13)

Pipeline components (v1)

Flagship analysis — AI-adoption SQL cannonball (GUS × Eurostat reconciliation)

Run

Reports

Local AI assist — text-to-SQL (the AI edge)

Reproducibility & provenance

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages