feat: Add dataset quality scoring system with improvement suggestions #6

Description

@noahgift

Summary

Implement a dataset quality scoring system for doctest corpora, inspired by rust-project-score in paiml-mcp-agent-toolkit.

Background

A QA pass over data/corpora/cpython-doctests.parquet revealed:

  • Grade: D (Major rework needed)
  • 101 primary key duplicates
  • Prompt leakage (>>>) in expected output
  • Multi-line input coverage below threshold (5.4% < 10%)
  • Module/function distribution imbalance
  • Missing schema_version metadata
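Three of the findings above can be checked mechanically. The following is a minimal sketch, assuming rows are plain dicts with hypothetical `module`, `function`, `input`, and `expected` fields (the issue names only the (module, function, input) key, not the actual parquet column names), and assuming "duplicates" means rows that repeat an earlier key:

```python
from collections import Counter

# Hypothetical column names; the issue only names the (module, function, input) key.
KEY_COLUMNS = ("module", "function", "input")


def count_key_duplicates(rows):
    """Count rows that repeat an already-seen (module, function, input) key."""
    counts = Counter(tuple(row[c] for c in KEY_COLUMNS) for row in rows)
    return sum(n - 1 for n in counts.values() if n > 1)


def has_prompt_leakage(expected):
    """True if any line of the expected output starts with a doctest prompt."""
    return any(line.lstrip().startswith(">>>") for line in expected.splitlines())


def multiline_input_ratio(rows):
    """Fraction of rows whose input spans more than one line."""
    if not rows:
        return 0.0
    return sum("\n" in row["input"].strip() for row in rows) / len(rows)


# Made-up sample rows for illustration only, not taken from the real corpus.
sample = [
    {"module": "os", "function": "getcwd", "input": "os.getcwd()", "expected": "'/tmp'"},
    {"module": "os", "function": "getcwd", "input": "os.getcwd()", "expected": "'/tmp'"},
    {"module": "re", "function": "sub", "input": "for x in y:\n    print(x)", "expected": ">>> leaked"},
]
```

A multi-line ratio below 0.10 would correspond to the threshold failure reported above (5.4% < 10%).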

Requirements

1. Quality Score Calculator

  • Compute a weighted score (0-100) based on the 100-point QA checklist
  • Severity weights: Critical (2x), High (1.5x), Medium (1x), Low (0.5x)
  • Output letter grade: A (95+), B (85-94), C (70-84), D (50-69), F (<50)
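One possible reading of these requirements, as a sketch rather than the intended implementation: the score is the weighted share of passed checks, scaled to 0-100. The normalization formula and the treatment of pending checks are assumptions, since the issue does not specify them; here the caller decides whether a pending check counts as passed. The function and constant names are hypothetical.

```python
# Severity multipliers and grade floors as listed in the requirements.
SEVERITY_WEIGHTS = {"critical": 2.0, "high": 1.5, "medium": 1.0, "low": 0.5}
GRADE_FLOORS = [(95, "A"), (85, "B"), (70, "C"), (50, "D")]


def weighted_score(checks):
    """Return a 0-100 integer score from (severity, passed) pairs.

    Assumption: score = 100 * (weight of passed checks) / (weight of all checks).
    """
    checks = list(checks)
    total = sum(SEVERITY_WEIGHTS[severity] for severity, _ in checks)
    if total == 0:
        return 0
    earned = sum(SEVERITY_WEIGHTS[severity] for severity, passed in checks if passed)
    return round(100 * earned / total)


def letter_grade(score):
    """Map a 0-100 score to A (95+), B (85-94), C (70-84), D (50-69), F (<50)."""
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"


# Illustrative input: one failed critical check drags the score down sharply.
example = [("critical", True), ("critical", False), ("medium", True), ("low", True)]
print(weighted_score(example), letter_grade(weighted_score(example)))
```

Under this reading a single failed Critical check costs four times as much as a failed Low check, which is the point of the severity weights.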

2. Score Report Format

{
  "dataset": "cpython-doctests.parquet",
  "score": 67,
  "grade": "D",
  "timestamp": "2025-11-29T15:30:00Z",
  "checks": {
    "passed": 72,
    "failed": 18,
    "pending": 10
  },
  "critical_failures": [
    {"id": 8, "check": "Primary key uniqueness", "details": "101 duplicates"},
    {"id": 31, "check": "No prompt leakage", "details": ">>> found in expected"}
  ],
  "suggestions": [
    "Deduplicate rows using (module, function, input) key",
    "Regenerate with ALIM-R001 prose detection fix",
    "Add schema_version to parquet metadata"
  ]
}
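To make "JSON output format validated" concrete, a structural check along these lines could be run against a generated report. This is a sketch using only the standard library; the rules it enforces (required keys, score range, grade letters, UTC timestamp shape, integer check counts) are inferred from the example above rather than from a published schema, and `validate_report` is a hypothetical name.

```python
import json
from datetime import datetime

REQUIRED_KEYS = {
    "dataset", "score", "grade", "timestamp",
    "checks", "critical_failures", "suggestions",
}


def validate_report(text):
    """Return a list of problems found in a JSON score report (empty if none)."""
    report = json.loads(text)
    if not isinstance(report, dict):
        return ["report must be a JSON object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - report.keys())]
    if problems:
        return problems
    score = report["score"]
    if not isinstance(score, int) or not 0 <= score <= 100:
        problems.append("score must be an integer in 0-100")
    if report["grade"] not in {"A", "B", "C", "D", "F"}:
        problems.append("grade must be one of A, B, C, D, F")
    try:
        # Matches the timestamp shape in the example, e.g. 2025-11-29T15:30:00Z.
        datetime.strptime(report["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    except (ValueError, TypeError):
        problems.append("timestamp must look like 2025-11-29T15:30:00Z")
    checks = report["checks"]
    for key in ("passed", "failed", "pending"):
        if not isinstance(checks, dict) or not isinstance(checks.get(key), int):
            problems.append(f"checks.{key} must be an integer")
    return problems
```

Returning a list of problems instead of raising on the first one lets a CI job print every issue in a single run.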

3. CLI Integration

# Score a dataset
alimentar quality score data/doctests.parquet

# Score with JSON output
alimentar quality score data/doctests.parquet --format json

# Score with suggestions
alimentar quality score data/doctests.parquet --suggest

4. Badge Generation

Generate a shields.io-compatible badge for the README:

![Dataset Quality](https://img.shields.io/badge/dataset_quality-D_67%25-red)

Acceptance Criteria

  • Score calculation matches 100-point checklist weights
  • JSON output format validated
  • CLI commands implemented in alimentar
  • Badge URL generation
  • Integration with make quality target

References

  • QA Checklist: docs/dataset-publication-qa-checklist.md
  • alimentar doctest module: ../alimentar/src/doctest/
  • rust-project-score pattern: ../paiml-mcp-agent-toolkit/

Labels

enhancement, data-quality, tooling
