docs(harness-eval): add EFC paper reference notes for design and experiments #134
Open
howie wants to merge 3 commits into
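Why
The harness-eval skill's current D1–D11 scoring has two fundamental limitations. This PR borrows two methodological concepts from the paper Scaling Laws for Agent Harnesses via Effective Feedback Compute (EFC) to start moving harness-eval from "intuition-based scoring" toward a "falsifiable predictive model".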
What (what this PR does)
Three commits, from lightest to heaviest:
1. Research notes (docs/research/2026-06-03-efc-...reference.md, documentation only): maps the EFC paper's concepts one by one onto harness-eval's existing scanners, splits them into Track A (design tuning) and Track B (data collection / validation), and ranks them by impact × effort. No code changes.
2. Two Spectra change proposals (openspec/changes/, specification only):
   - add-task-demand-normalization (tracks feat(harness-eval): normalize scores by task demand so repos of different sizes are comparable (EFC D_task) #136)
   - add-harness-eval-validation-protocol (tracks feat(harness-eval): establish an R²/MAE prospective-holdout validation protocol to test the score's predictive power #143)
3. Implementation of D_repo task-demand normalization (#136, the only commit that touches code):
   - models.py: ScanOutput gains d_repo / d_repo_components / size_adjusted_score / size_adjusted_note, and model_post_init computes size_adjusted_score = round(total / d_repo, 1)
   - service.py: adds _count_source_loc / _count_skills / _count_hooks / _count_rules and _compute_d_repo (log-scaled, always ≥ 1.0, defensive parsing)
   - cli.py + SKILL.md: the report prints both the raw score and the size_adjusted score, and explicitly marks the latter as provisional (uncalibrated, see #143)
   - tests/test_scanners.py: adds 13 TDN-DT / SMK tests
Value
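- size_adjusted_score = mechanical total / D_repo offsets the additive inflation in which more artifacts automatically mean a higher score, so repos of different sizes can be compared on relative maturity.
- size_adjusted_score is explicitly marked provisional (not calibrated against outcomes). Real weight calibration needs empirical data (the work tracked in #143); until then the score is for relative comparison only and must not be used as an absolute threshold.
- Acknowledge the uncertainty first, then strengthen in stages.
How to test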
Verification points:
- size_adjusted_score = round(total_mechanical / d_repo, 1), and d_repo is always ≥ 1.0
- _compute_d_repo is deterministic for the same repo (TDN-DT-008)
🔗 Related issues: #136 (task-demand normalization), #143 (validation protocol), #140 / #142 (prerequisites)
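The verification points above can be illustrated with a minimal sketch. It is not the code in service.py or models.py: the function names compute_d_repo and size_adjusted_score, the log10 form, and the equal weighting of the four counts are all assumptions made for illustration. Only the properties stated in this PR (log scaling, a floor of 1.0, defensive handling of bad input, and rounding to one decimal) are taken from the description.

```python
import math


def compute_d_repo(source_loc: int, skills: int, hooks: int, rules: int) -> float:
    """Hypothetical stand-in for _compute_d_repo: log-scaled and never below 1.0."""
    counts = (source_loc, skills, hooks, rules)
    # Defensive parsing (assumed behavior): anything that is not a positive int counts as 0.
    safe = [c if isinstance(c, int) and c > 0 else 0 for c in counts]
    # Assumed form: 1 + mean of log10(1 + count). log10(1 + 0) is 0, so the floor is 1.0.
    return 1.0 + sum(math.log10(1 + c) for c in safe) / len(safe)


def size_adjusted_score(total_mechanical: float, d_repo: float) -> float:
    """Relation stated in the PR: round(total_mechanical / d_repo, 1)."""
    return round(total_mechanical / d_repo, 1)


# Illustrative inputs only, not measurements from any real repo.
small = compute_d_repo(source_loc=0, skills=0, hooks=0, rules=0)
large = compute_d_repo(source_loc=999, skills=9, hooks=9, rules=9)

# d_repo never drops below 1.0, even for empty or malformed input.
assert small >= 1.0 and large >= 1.0
assert compute_d_repo(-5, 0, 0, 0) >= 1.0

# Determinism: the same counts always give the same d_repo.
assert compute_d_repo(999, 9, 9, 9) == large

print(size_adjusted_score(42.0, small), size_adjusted_score(50.0, large))
```

Under these assumed weights a larger repo needs a higher raw total to reach the same size-adjusted score, which is the cross-repo comparability the Value section describes. The actual weights remain uncalibrated until the work in #143 lands.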