ci: lint/type/test floor + dataset & eval-baseline gates (Milestone A) #5

Merged
helebest merged 1 commit into main from ci/lint-and-eval-gates on Jun 28, 2026

Conversation

@helebest

Collaborator

What

Closes the CI/CD gap in dikw-data so it has the same deterministic floor as dikw-core, before Phase 1 dataset construction begins. This is Milestone A of the eval-plan Phase 0→1 work (see docs/dikw-eval-plan.md); the public-anchor calibration (scifact/cmteb) follows as Milestone B.

Changes

  • pyproject.toml — ruff + mypy (strict) config mirroring dikw-core. Ignores RUF001/2/3 (ambiguous-unicode false positives on this bilingual zh/en codebase's embedded CJK text) and scope-ignores E702 to the one procedural-pictogram generator.
  • .pre-commit-config.yaml — local ruff + mypy hooks (uv run), matching CI. Install with uv run pre-commit install.
  • .github/workflows/ci.yml: uv sync → ruff → mypy src → pytest → validate every dataset (shape gate, $0, no provider keys). Python matrix: 3.12 / 3.13.
  • .github/workflows/eval-gate.yml + tools/check_baselines.py — a datasets/** change must land a new dated reports/BASELINES.md entry naming a retrieval metric; override with the no-baseline-needed label. The pure check is unit-tested (tests/test_check_baselines.py).
  • .gitignore — track reports/BASELINES.md (the baseline log) while keeping per-run artifacts ignored; ignore .impeccable/.
  • Applied ruff autofixes (import sorting, unused-import / whitespace cleanup) across scripts/, src/, web/, tests/ so the existing tree passes the new gate. Fixed 3 real lint findings (RUF005 in run_eval.py, dead code in generate_queries_local.py) and 2 mypy findings (unused ignore; int**int Any-widening in llm_client.py).
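
For readers unfamiliar with the two less obvious findings above, here is a minimal sketch of the patterns involved. The function names (`with_extra_flag`, `backoff_seconds`) are hypothetical and not taken from run_eval.py or llm_client.py, and the fixes actually applied in this PR may differ.

```python
def with_extra_flag(args: list[str], flag: str) -> list[str]:
    # RUF005 flags list concatenation such as `args + [flag]`;
    # unpacking into a new literal builds the same list.
    return [*args, flag]


def backoff_seconds(base: int, attempt: int) -> int:
    # typeshed types `int ** int` as Any when the exponent is not a literal,
    # because a negative exponent produces a float. Under strict mypy a bare
    # `return base ** attempt` is reported as returning Any, so the result is
    # pinned with an explicit int() call (one possible fix, assumed here).
    return int(base ** attempt)
```

Both functions behave the same as their flagged forms for non-negative exponents; only the typing and lint status change.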

Verification (local)

  • uv run ruff check . → clean
  • uv run mypy src → clean
  • uv run pytest → 50 passed
  • scripts/validate_dataset.py over all 3 datasets → valid
  • eval-gate plumbing tested end-to-end: non-dataset change → no-op pass; dataset change without a baseline entry → fail (exit 1); with a proper entry → pass.
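
The three eval-gate outcomes listed above can be sketched as a pure function. This is not the code in tools/check_baselines.py: the function name `check`, its parameters, the ISO date format, and the set of metric names are all assumptions made for illustration. Only the `datasets/` path prefix and the `no-baseline-needed` label come from this PR.

```python
import re

OVERRIDE_LABEL = "no-baseline-needed"  # label named in this PR

# Assumptions: a baseline entry carries an ISO date (YYYY-MM-DD), and these
# names are what counts as a "retrieval metric".
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
METRIC_RE = re.compile(r"\b(?:ndcg|mrr|recall|precision)(?:@\d+)?\b", re.IGNORECASE)


def check(
    changed_files: list[str],
    added_baseline_lines: list[str],
    labels: list[str],
) -> tuple[bool, str]:
    """Return (passed, reason) for one pull request."""
    # Case 1: nothing under datasets/ changed, so the gate is a no-op.
    if not any(path.startswith("datasets/") for path in changed_files):
        return True, "no dataset change"
    # Explicit opt-out via the PR label.
    if OVERRIDE_LABEL in labels:
        return True, "overridden by label"
    # Cases 2 and 3: the lines added to reports/BASELINES.md must contain
    # both a date and a metric name.
    added = "\n".join(added_baseline_lines)
    if DATE_RE.search(added) and METRIC_RE.search(added):
        return True, "dated baseline entry names a retrieval metric"
    return False, "dataset change without a new dated baseline entry"
```

A command-line wrapper around such a function would exit with status 1 when the first element is `False`, which matches the failure case described above. Keeping the check free of git and network calls is what makes it straightforward to unit-test.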

🤖 Generated with Claude Code

Close the CI/CD gap so dikw-data has the same deterministic floor as dikw-core
before Phase 1 dataset construction begins.

- pyproject: ruff + mypy (strict) config, mirroring dikw-core; ignore RUF001/2/3
  (false positives on this bilingual zh/en codebase's embedded CJK text) and
  scope-ignore E702 to the one procedural-pictogram generator.
- .pre-commit-config.yaml: local ruff + mypy hooks (uv run), matching CI.
- .github/workflows/ci.yml: uv sync -> ruff -> mypy src -> pytest -> validate
  every dataset (shape gate, $0, no provider keys). Matrix 3.12/3.13.
- .github/workflows/eval-gate.yml + tools/check_baselines.py: a dataset change
  (datasets/**) must land a new dated reports/BASELINES.md entry naming a
  retrieval metric; override with the `no-baseline-needed` label. Unit-tested.
- .gitignore: track reports/BASELINES.md (the baseline log) while keeping per-run
  artifacts ignored; ignore .impeccable/.
- Apply ruff autofixes (import sorting, unused-import / whitespace cleanup) across
  scripts/, src/, web/, tests/ so the existing tree passes the new gate. Fix 3
  real lint findings (RUF005 in run_eval.py, dead code in generate_queries_local)
  and 2 mypy findings (unused ignore; int**int Any-widening in llm_client).

All green locally: ruff clean, mypy clean, 50 tests pass, 3 datasets validate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@helebest force-pushed the ci/lint-and-eval-gates branch from 72527bd to 6d62374 on June 28, 2026 12:57
@helebest merged commit 6639f08 into main on Jun 28, 2026
2 checks passed
@helebest deleted the ci/lint-and-eval-gates branch on June 28, 2026 12:57