MiMo-CodeHarness is a repository-level CodeAgent evaluation harness powered by the Xiaomi MiMo Pro API.
It scans real code repositories, reasons over cross-file dependencies, generates coding tasks, runs model-based agents, checks patches/build signals, records token usage, and produces dashboards and technical reports.
Most LLM coding evaluations focus on isolated algorithm problems or short code snippets. However, real engineering work requires much more than writing one function:
- understanding a real repository structure;
- locating relevant files across modules;
- reasoning about cross-file dependencies;
- generating repository-level tasks;
- applying or checking patch-style modifications;
- running lightweight build/test checks;
- recording token usage and evaluation evidence;
- producing reproducible reports.
MiMo-CodeHarness was built to evaluate these repository-level CodeAgent capabilities in a more engineering-oriented workflow.
This project is designed especially for AI + Software Engineering + IoT/Embedded/Tooling scenarios. It uses the Xiaomi MiMo Pro API as the main model backend and applies the harness to real repositories such as an STM32 RFID access control project and a Python SlideNotes GUI tool.
This project is not just a single LLM API demo.
A simple API demo usually looks like:
Prompt -> LLM response
MiMo-CodeHarness runs a structured evaluation pipeline:
Real Repository
-> Repository Scanner
-> Dependency Reasoner
-> Task Generator
-> Model Runner
-> Patch Checker
-> Build/Test Checker
-> Evidence-aware Scorer
-> Dashboard / Technical Report / Token Log
In other words, this project works as a harness: it provides the task environment, execution pipeline, evidence collection, scoring logic, and reproducible report generation for CodeAgent evaluation.
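The stage chain above can be pictured as a list of functions that each read and extend a shared context. The sketch below is illustrative only: the stage names mirror the first three boxes of the diagram, but the `Context` type, the `run_pipeline` helper, the function bodies, and the sample file names are all hypothetical and are not taken from the MiMo-CodeHarness source.

```python
from typing import Callable

# Hypothetical shared state handed from stage to stage (not the project's real type).
Context = dict

def scan_repository(ctx: Context) -> Context:
    # Placeholder data: a real scanner would walk ctx["repo_path"].
    ctx["files"] = ["main.c", "rfid.c", "rfid.h"]
    return ctx

def reason_dependencies(ctx: Context) -> Context:
    # Placeholder data: a real reasoner would parse includes and imports.
    ctx["edges"] = [("main.c", "rfid.h"), ("rfid.c", "rfid.h")]
    return ctx

def generate_tasks(ctx: Context) -> Context:
    # One toy task per dependency edge.
    ctx["tasks"] = [f"Explain how {src} depends on {dst}" for src, dst in ctx["edges"]]
    return ctx

STAGES: list[Callable[[Context], Context]] = [
    scan_repository,
    reason_dependencies,
    generate_tasks,
]

def run_pipeline(repo_path: str) -> Context:
    ctx: Context = {"repo_path": repo_path, "log": []}
    for stage in STAGES:
        ctx = stage(ctx)
        ctx["log"].append(stage.__name__)  # evidence trail of executed stages
    return ctx

result = run_pipeline("path/to/repo")
print(result["log"])
```

Keeping each stage as a plain function over one context object makes it easy to log what ran and to add later stages (model runner, patch checker, scorer) without changing the earlier ones.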
Repository Scanner:
- Counts files, languages, suffixes, and source/documentation roles.
- Extracts representative files from the target repository.
- Supports multiple code types including C/C++/embedded C, Python, JavaScript/TypeScript, ArkTS, Markdown, JSON, and shell scripts.
Dependency Reasoner:
- Extracts cross-file relations such as C includes, Python imports, Markdown links, and symbol-level references.
- Produces dependency summaries and structured dependency files.
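As a rough illustration of suffix counting and of extracting C includes and Python imports, here is a minimal sketch that uses only the standard library. It is not the project's implementation: the function names, the regular expression, and the demo files are assumptions made for this example, and Markdown links and symbol-level references are left out.

```python
import ast
import re
import tempfile
from collections import Counter
from pathlib import Path

# Matches `#include "x.h"` and `#include <x.h>` at the start of a line.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^">]+)[">]', re.MULTILINE)

def count_suffixes(root: Path) -> Counter:
    """Count files under root by suffix."""
    return Counter(p.suffix for p in root.rglob("*") if p.is_file())

def c_includes(text: str) -> list[str]:
    return INCLUDE_RE.findall(text)

def python_imports(text: str) -> list[str]:
    """Collect imported module names; returns [] if the file does not parse."""
    try:
        tree = ast.parse(text)
    except SyntaxError:
        return []
    modules = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.append(node.module)
    return modules

def dependency_edges(root: Path) -> list[tuple[str, str]]:
    """Return (source file name, referenced target) pairs."""
    edges = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if path.suffix in {".c", ".h"}:
            targets = c_includes(text)
        elif path.suffix == ".py":
            targets = python_imports(text)
        else:
            continue
        edges.extend((path.name, target) for target in targets)
    return edges

# Demo on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "main.c").write_text('#include "rfid.h"\n#include <stdio.h>\nint main(void) { return 0; }\n')
    (root / "rfid.h").write_text("void rfid_init(void);\n")
    (root / "tool.py").write_text("import json\nfrom pathlib import Path\n")
    suffix_counts = count_suffixes(root)
    edges = dependency_edges(root)

print(dict(suffix_counts))
print(edges)
```

Parsing Python with `ast` rather than a regular expression avoids false matches on the word `import` inside strings or comments; the C side uses a regex because no C parser ships with the standard library.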
The Task Generator produces repository-level tasks such as:
- repository architecture understanding;
- cross-file dependency explanation;
- bug localization;
- test plan generation;
- patch-style maintainability improvement;
- engineering risk analysis.
Model Runner:
- Calls the Xiaomi MiMo Pro API, with the key, base URL, and model name read from environment variables.
- Records model outputs, response time, and token usage.
- API keys are never stored in source code.
Patch Checker and Build/Test Checker:
- Extracts patch-style outputs when available.
- Applies patch checks in isolated worktrees.
- Runs lightweight build/test signals such as Python syntax check, JSON parsing, and C/C++ brace scan.
- For embedded projects, the current build/test check is not a full Keil firmware compilation; it is a lightweight static safety check.
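The three lightweight signals named above (Python syntax check, JSON parsing, C/C++ brace scan) can each be approximated in a few lines. The following is a minimal sketch under that assumption; the function names are hypothetical, and the harness's own checks may differ in detail.

```python
import ast
import json

def python_syntax_ok(source: str) -> bool:
    """True if the text parses as Python; nothing is executed."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def json_ok(text: str) -> bool:
    """True if the text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def c_braces_balanced(source: str) -> bool:
    """Naive scan: counts every brace, including those inside
    string literals and comments, so it can misreport on such code."""
    depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace appeared before its opener
                return False
    return depth == 0

print(python_syntax_ok("x = 1\n"))
print(json_ok('{"a": 1}'))
print(c_braces_balanced("int main(void) { return 0; }"))
```

Checks like these catch grossly malformed patches cheaply, which is why they work as a signal but not as a substitute for real compilation.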
The final clean results use the v0.3/v0.3.1 evidence-aware automatic scorer. The score is based on repository grounding, task-rubric coverage, answer completeness, patch evidence, and build/test evidence.
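To make the idea of combining several evidence signals concrete, here is a sketch of a weighted score over the five signals listed above. The weights, the signal keys, and the `evidence_score` function are invented for illustration and are not the project's values; the actual rubric is presumably the one described in docs/SCORING_SYSTEM_V03.md and docs/SCORING_SYSTEM_V031.md.

```python
# Hypothetical weights for illustration only; they are NOT the project's values.
WEIGHTS = {
    "repository_grounding": 0.30,
    "rubric_coverage": 0.30,
    "answer_completeness": 0.20,
    "patch_evidence": 0.10,
    "build_test_evidence": 0.10,
}

def evidence_score(signals: dict[str, float]) -> float:
    """Weighted sum of per-signal scores (each clamped to [0, 1]), scaled to 0-100."""
    missing = set(WEIGHTS) - set(signals)
    if missing:
        raise KeyError(f"missing signals: {sorted(missing)}")
    total = sum(
        WEIGHTS[name] * min(max(signals[name], 0.0), 1.0)
        for name in WEIGHTS
    )
    return round(100 * total, 2)

# Made-up signal values for one task.
example = {
    "repository_grounding": 0.8,
    "rubric_coverage": 0.5,
    "answer_completeness": 1.0,
    "patch_evidence": 0.0,
    "build_test_evidence": 1.0,
}
print(evidence_score(example))
```

A design like this keeps every score traceable to named signals, which is what makes the result auditable without a model acting as judge.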
LLM Judge and human review scripts were explored experimentally, but they are not used as the final reported scoring basis in the clean results, because model-as-judge can be biased and needs further calibration.
The following results are from final clean runs. They are the recommended results to cite in README, reports, and presentations.
| Case Study | Repository Type | Files | Dependency Edges | Tasks | Avg. Score | Estimated Tokens | Status |
|---|---|---|---|---|---|---|---|
| STM32 RFID Access Control | Embedded C / IoT | 16 | 86 | 6 | 78.76 | 41,520 | OK |
| SlideNotes GUI Tool | Python GUI / Document Export Tool | 91 | 505 | 6 | 66.71 | 38,675 | OK |
Recommended result directories:
outputs/mimo_codeharness_v02/final_stm32_v031_clean__stm32_rfid/
outputs/mimo_codeharness_v02/final_slidenotes_v031_clean__slidenotes/
Each final result directory contains:
dashboard.html
TECHNICAL_REPORT.md
PAPER_DRAFT.md
token_usage.csv
evaluation_scores.csv
model_outputs.jsonl
repo_summary.json
dependency_summary.json
tasks.json
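A quick way to confirm that a run produced everything listed above is to check the result directory for each expected file. The sketch below does this in Python; `missing_artifacts` is a hypothetical helper name, and the demo runs against a temporary directory rather than a real run.

```python
import tempfile
from pathlib import Path

# File names taken from the list above.
EXPECTED_ARTIFACTS = [
    "dashboard.html",
    "TECHNICAL_REPORT.md",
    "PAPER_DRAFT.md",
    "token_usage.csv",
    "evaluation_scores.csv",
    "model_outputs.jsonl",
    "repo_summary.json",
    "dependency_summary.json",
    "tasks.json",
]

def missing_artifacts(result_dir: Path) -> list[str]:
    """Return the expected file names that are absent from result_dir."""
    return [name for name in EXPECTED_ARTIFACTS if not (result_dir / name).is_file()]

# Demo: a temporary directory that deliberately lacks tasks.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in EXPECTED_ARTIFACTS[:-1]:
        (demo_dir / name).write_text("", encoding="utf-8")
    demo_missing = missing_artifacts(demo_dir)

print(demo_missing)
```

For a real run, pass one of the recommended result directories instead of the temporary one; an empty list means every artifact is present.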
Each case study produces an HTML dashboard showing:
- repository file count;
- dependency edge count;
- estimated token usage;
- task count;
- average score;
- build/test status;
- task-level score table.
Open a dashboard locally:
ii "outputs\mimo_codeharness_v02\final_stm32_v031_clean__stm32_rfid\dashboard.html"
or:
ii "outputs\mimo_codeharness_v02\final_slidenotes_v031_clean__slidenotes\dashboard.html"
Each case study also generates a technical report:
TECHNICAL_REPORT.md
The report summarizes repository structure, dependency analysis, model evaluation results, token usage, and engineering observations.
git clone https://github.com/hongzuoj-pixel/MiMo-CodeHarness.git
cd MiMo-CodeHarness
The current version mainly uses Python standard libraries. If your local environment needs additional packages, install them according to your project configuration.
python --version
Python 3.10+ is recommended.
Do not write your API key into source code or commit it to GitHub.
PowerShell example:
$env:MIMO_API_KEY='YOUR_MIMO_API_KEY'
$env:MIMO_BASE_URL='https://api.xiaomimimo.com/v1'
$env:MIMO_MODEL='mimo-v2.5-pro'
Test the API connection:
python src\test_mimo_api_connection.py
Expected result:
[OK] API connected successfully.
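Before any request is sent, it can help to fail fast when one of the three environment variables is missing. The sketch below shows one way to load and validate them; `MimoConfig`, `load_config`, and `masked_key` are hypothetical names, not part of the project's scripts, and no network call is made. The demo passes a fake key through a plain dictionary instead of reading the real environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("MIMO_API_KEY", "MIMO_BASE_URL", "MIMO_MODEL")

@dataclass(frozen=True)
class MimoConfig:
    api_key: str
    base_url: str
    model: str

    def masked_key(self) -> str:
        """Short, log-safe form of the key (never print the full key)."""
        return self.api_key[:4] + "..." if len(self.api_key) > 8 else "***"

def load_config(env: Mapping[str, str] = os.environ) -> MimoConfig:
    """Read the three variables from env; raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return MimoConfig(
        api_key=env["MIMO_API_KEY"],
        base_url=env["MIMO_BASE_URL"].rstrip("/"),  # normalize a trailing slash
        model=env["MIMO_MODEL"],
    )

# Demo with a fake key supplied via a dict rather than the real environment.
demo_env = {
    "MIMO_API_KEY": "sk-test-123456",
    "MIMO_BASE_URL": "https://api.xiaomimimo.com/v1/",
    "MIMO_MODEL": "mimo-v2.5-pro",
}
cfg = load_config(demo_env)
print(cfg.masked_key(), cfg.base_url, cfg.model)
```

Calling `load_config()` with no argument reads `os.environ`, which matches the PowerShell setup shown above.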
python src\run_real_repo_case_studies.py --cases config\case_stm32_clean.local.json --suite-name final_stm32_v031_clean --config config\api_models_mimo.json --execute --clone-missing --apply-patches --limit-cases 1
Check result:
Get-Content outputs\mimo_codeharness_v02\final_stm32_v031_clean\case_study_summary.csv
Test-Path "outputs\mimo_codeharness_v02\final_stm32_v031_clean__stm32_rfid\dashboard.html"
python src\run_real_repo_case_studies.py --cases config\case_slidenotes_clean.local.json --suite-name final_slidenotes_v031_clean --config config\api_models_mimo.json --execute --clone-missing --apply-patches --limit-cases 1
Check result:
Get-Content outputs\mimo_codeharness_v02\final_slidenotes_v031_clean\case_study_summary.csv
Test-Path "outputs\mimo_codeharness_v02\final_slidenotes_v031_clean__slidenotes\dashboard.html"
MiMo-CodeHarness/
├── config/
│ ├── api_models_mimo.json
│ ├── case_stm32_clean.local.json
│ └── case_slidenotes_clean.local.json
├── src/
│ ├── run_real_repo_case_studies.py
│ ├── run_mimo_codeharness_v02.py
│ ├── test_mimo_api_connection.py
│ ├── validate_mimo_codeharness_v02.py
│ └── scoring_v03.py
├── docs/
│ ├── SCORING_SYSTEM_V03.md
│ └── SCORING_SYSTEM_V031.md
├── outputs/
│ └── mimo_codeharness_v02/
│     ├── final_stm32_v031_clean__stm32_rfid/
│     └── final_slidenotes_v031_clean__slidenotes/
├── README.md
└── README_CN.md
This project is an engineering prototype, not yet an industrial-grade benchmark.
Current limitations:
- The build/test stage is lightweight and does not replace full project-specific compilation.
- The embedded STM32 case does not perform full Keil firmware compilation yet.
- LLM Judge and human review are not used in the final clean scores.
- Scores should be interpreted as evidence-aware automatic evaluation results, not absolute ground truth.
- More repositories and more repeated trials are needed for stronger benchmark-level conclusions.
Planned improvements:
- Add project-specific build/test adapters, such as Keil/STM32 build checks and Python unit tests.
- Add more real repositories, including HarmonyOS, robotics, and backend projects.
- Add an optional, calibrated LLM Judge with a stricter JSON schema and cross-model judging.
- Add a manual review protocol for small-sample validation.
- Build a web dashboard for comparing multiple repositories and multiple models.
- Convert the current technical report into a more complete undergraduate research report or workshop-style paper draft.
Security and usage notes:
- Do not commit API keys.
- Use environment variables for all model credentials.
- Review generated patches before applying them to production code.
- Treat automatic scores as evaluation evidence, not final truth.