Summary
Implement a dataset quality scoring system for doctest corpora, inspired by rust-project-score in paiml-mcp-agent-toolkit.
Background
QA process on data/corpora/cpython-doctests.parquet revealed:
- Grade: D (Major rework needed)
- 101 primary key duplicates
- Prompt leakage (>>>) in expected output
- Multi-line input coverage below threshold (5.4% < 10%)
- Module/function distribution imbalance
- Missing schema_version metadata
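Three of the findings above can be sketched as simple row-level checks. This is a minimal illustration, not the actual QA tooling: it assumes each row is a dict with module, function, input and expected fields (names taken from the key and details strings in this issue; the real parquet schema may differ), and the function names are hypothetical.

```python
from collections import Counter

# Assumed row layout: {"module", "function", "input", "expected"}.
# These field names are an assumption based on this issue, not a confirmed schema.

def count_key_duplicates(rows):
    """Count rows whose (module, function, input) key repeats an earlier row."""
    keys = Counter((r["module"], r["function"], r["input"]) for r in rows)
    return sum(n - 1 for n in keys.values() if n > 1)

def find_prompt_leakage(rows):
    """Return rows whose expected output has a line starting with the >>> prompt."""
    return [
        r for r in rows
        if any(line.lstrip().startswith(">>>") for line in r["expected"].splitlines())
    ]

def multiline_input_ratio(rows):
    """Fraction of rows whose input spans more than one line."""
    if not rows:
        return 0.0
    return sum("\n" in r["input"] for r in rows) / len(rows)
```

A coverage check would then compare multiline_input_ratio(rows) against the 10% threshold quoted above.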
Requirements
1. Quality Score Calculator
- Compute weighted score (0-100) based on 100-point QA checklist
- Severity weights: Critical (2x), High (1.5x), Medium (1x), Low (0.5x)
- Output letter grade: A (95+), B (85-94), C (70-84), D (50-69), F (<50)
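One possible reading of the calculator, as a rough sketch: the score is the weight earned by passing checks divided by the total weight, scaled to 100. The issue does not say how pending checks are treated; here they are assumed to count as not passed. All names below are hypothetical.

```python
# Severity multipliers and grade floors as listed in this issue.
SEVERITY_WEIGHTS = {"critical": 2.0, "high": 1.5, "medium": 1.0, "low": 0.5}
GRADE_FLOORS = [(95, "A"), (85, "B"), (70, "C"), (50, "D")]

def weighted_score(checks):
    """checks: list of (severity, passed) pairs; returns a score in 0-100.

    Assumption: score = earned weight / total weight, and a pending
    check is passed in as passed=False.
    """
    total = sum(SEVERITY_WEIGHTS[severity] for severity, _ in checks)
    if total == 0:
        return 0.0
    earned = sum(SEVERITY_WEIGHTS[severity] for severity, ok in checks if ok)
    return 100.0 * earned / total

def letter_grade(score):
    """Map a 0-100 score to A/B/C/D/F using the floors above."""
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"
```

With these floors, the score of 67 used in the report example maps to D, matching the grade shown there.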
2. Score Report Format
{
"dataset": "cpython-doctests.parquet",
"score": 67,
"grade": "D",
"timestamp": "2025-11-29T15:30:00Z",
"checks": {
"passed": 72,
"failed": 18,
"pending": 10
},
"critical_failures": [
{"id": 8, "check": "Primary key uniqueness", "details": "101 duplicates"},
{"id": 31, "check": "No prompt leakage", "details": ">>> found in expected"}
],
"suggestions": [
"Deduplicate rows using (module, function, input) key",
"Regenerate with ALIM-R001 prose detection fix",
"Add schema_version to parquet metadata"
]
}
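A report in this shape could be assembled along the following lines. This is only a sketch: build_report and its parameters are hypothetical, and the timestamp is assumed to be UTC in the same format as the example.

```python
import json
from datetime import datetime, timezone

def build_report(dataset, score, grade, passed, failed, pending,
                 critical_failures, suggestions, now=None):
    """Assemble a report dict with the fields shown in the example above.

    `now` must be a timezone-aware datetime; it defaults to the current UTC time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "dataset": dataset,
        "score": score,
        "grade": grade,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "checks": {"passed": passed, "failed": failed, "pending": pending},
        "critical_failures": list(critical_failures),
        "suggestions": list(suggestions),
    }

def report_to_json(report):
    """Serialize the report for --format json style output."""
    return json.dumps(report, indent=2)
```

In the example, passed, failed and pending sum to 100 (72 + 18 + 10), which matches the 100-point checklist; a builder could assert that invariant before emitting the report.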
3. CLI Integration
# Score a dataset
alimentar quality score data/doctests.parquet
# Score with JSON output
alimentar quality score data/doctests.parquet --format json
# Score with suggestions
alimentar quality score data/doctests.parquet --suggest
4. Badge Generation
Generate shields.io compatible badge for README:

Acceptance Criteria
- make quality target
References
- QA Checklist: docs/dataset-publication-qa-checklist.md
- alimentar doctest module: ../alimentar/src/doctest/
- rust-project-score pattern: ../paiml-mcp-agent-toolkit/
Labels
enhancement, data-quality, tooling