Skip to content

fix(benchmark): restore offline AfSP WRB benchmark in the suite (v0.9.111)#63

Open
HugoMachadoRodrigues wants to merge 1 commit into
mainfrom
claude/agitated-payne-111bb5
Open

fix(benchmark): restore offline AfSP WRB benchmark in the suite (v0.9.111)#63
HugoMachadoRodrigues wants to merge 1 commit into
mainfrom
claude/agitated-payne-111bb5

Conversation

@HugoMachadoRodrigues

Copy link
Copy Markdown
Owner

Summary

The offline AfSP (Africa Soil Profiles) WRB benchmark was silently dropped from run_all_benchmarks()'s consolidated report — .suite_run_afsp() returned NULL and AfSP never appeared as a row. This restores it as a 2nd independent offline WRB benchmark alongside FEBR. Surfaced during the v0.9.110 "benchmark methodology" (B1) work; scoped out then as a separate harness issue.

The classification engine is untouched — every edit is in the AfSP loader/benchmark and the suite wiring.

Root cause — two bugs, one mechanism

The bundled inst/extdata/afsp_sample.rds was not a stub: it already holds the full stratified 5 × 24 = 120-pedon sample inside a wrapper list(pedons, pulled_on, source, filter) (mirroring load_wosis_stratified_sample() / load_kssl_sample()).

  1. load_afsp_sample() returns that 4-element wrapper; the suite fed it straight into benchmark_afsp() instead of its $pedons slot. (The "only 4 pedons" symptom was the wrapper's length, not a truncated sample.)
  2. benchmark_afsp() then iterated the 4 wrapper elements and hit $site on the atomic ones (pulled_on, source) → "$ operator is invalid for atomic vectors" → swallowed by .suite_run_afsp()'s tryCatch → silent NULL.

Fix

  • New internal .afsp_as_pedons() unwraps $pedons when handed the wrapper and passes a bare pedon list through unchanged.
  • benchmark_afsp() calls it (so it accepts either form) and skips any non-PedonRecord member instead of erroring.
  • .suite_run_afsp() unwraps before its length-guard and max_n slice.
  • load_afsp_sample() docs corrected from the never-shipped "130 / 26 RSGs" to the actual 120 / 24, and clarify the wrapper return; 3 Rd files regenerated.

Result

run_all_benchmarks() now surfaces a real afsp_sample / wrb2022 row with the full v0.9.110 metric set:

n Accuracy Bal. acc Macro-F1 Kappa NIR
120 25.0% 25.0% 0.22 0.22 4.2%

Tests

  • New test-v09111-afsp-benchmark-harness.R: asserts benchmark_afsp(load_afsp_sample()) (the raw wrapper) returns non-NULL with a non-empty confusion matrix, junk members are skipped, .suite_run_afsp() yields a full-metric row, and run_all_benchmarks() surfaces AfSP.
  • test-v09106-benchmark-suite.R: replaced the stale assertion that encoded the old buggy "only canonical survives" behaviour with one asserting external datasets are skipped while offline rows (canonical + AfSP) appear.

Full suite: 1702 tests, 0 failures, 0 errors.

Version bumped to 0.9.111 with a NEWS entry.

🤖 Generated with Claude Code

….111)

The offline AfSP (Africa Soil Profiles) WRB benchmark was silently dropped
from run_all_benchmarks()'s consolidated report. Two bugs, one mechanism:

  1. load_afsp_sample() returns a wrapper list(pedons, pulled_on, source,
     filter) -- length 4 -- mirroring load_wosis_stratified_sample() /
     load_kssl_sample(). The suite fed that wrapper straight into
     benchmark_afsp() instead of its $pedons slot.
  2. benchmark_afsp() then iterated the 4 wrapper elements and hit $site on
     the atomic ones (pulled_on, source) -> "$ operator is invalid for
     atomic vectors", which .suite_run_afsp()'s tryCatch swallowed to NULL,
     so AfSP never appeared.

The bundled inst/extdata/afsp_sample.rds was NOT a stub: it already holds the
full stratified 5 x 24 = 120-pedon sample.

Fix (engine untouched -- harness only):
  - New internal .afsp_as_pedons() unwraps $pedons when handed the wrapper
    and passes a bare pedon list through unchanged.
  - benchmark_afsp() calls it (accepts either form) and skips any
    non-PedonRecord member instead of erroring.
  - .suite_run_afsp() unwraps before its length-guard and max_n slice.
  - load_afsp_sample() docs corrected from the never-shipped "130 / 26 RSGs"
    to the actual 120 / 24, and clarify the wrapper return.

Result: run_all_benchmarks() surfaces a real afsp_sample/wrb2022 row (n=120,
acc 25%, bal-acc 25%, macro-F1 0.22, kappa 0.22, NIR 4.2%, bootstrap CI) -- a
2nd independent OFFLINE WRB benchmark alongside FEBR -- with the full v0.9.110
metric set.

Tests:
  - New test-v09111-afsp-benchmark-harness.R: benchmark_afsp(load_afsp_sample())
    (raw wrapper) returns non-NULL with a non-empty confusion matrix, junk
    members are skipped, .suite_run_afsp() yields a full-metric row, and
    run_all_benchmarks() surfaces AfSP.
  - test-v09106-benchmark-suite.R: replace the stale assertion that encoded the
    old buggy "only canonical survives" behaviour with one that asserts external
    datasets are skipped while offline rows (canonical + AfSP) appear.

Full suite: 1702 tests, 0 failures, 0 errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 12, 2026 18:25

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Restores the offline AfSP (Africa Soil Profiles) WRB benchmark row in run_all_benchmarks() by unwrapping the bundled load_afsp_sample() wrapper correctly and hardening the AfSP benchmark harness so it no longer fails silently.

Changes:

  • Added internal unwrapping helper .afsp_as_pedons() and updated benchmark_afsp() to accept either the wrapper or a bare pedon list (skipping non-PedonRecord members).
  • Updated the benchmark suite’s AfSP wiring to unwrap before length checks / slicing so the full sample is benchmarked.
  • Added regression tests and refreshed documentation/NEWS + version bump to 0.9.111.

Reviewed changes

Copilot reviewed 6 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/testthat/test-v09111-afsp-benchmark-harness.R New regression tests covering wrapper unwrapping, junk-member skipping, suite row surfacing, and consolidated suite visibility.
tests/testthat/test-v09106-benchmark-suite.R Updates suite expectations to allow offline AfSP row while still skipping absent external datasets.
R/benchmark-suite.R Unwraps AfSP wrapper before guards/slicing to prevent feeding atomic wrapper members into the benchmark loop.
R/afsp.R Adds .afsp_as_pedons() and makes benchmark_afsp() resilient to wrapper inputs and non-pedon elements; updates load_afsp_sample() docs.
NEWS.md Adds 0.9.111 entry describing the harness fix and restored AfSP offline benchmark row.
man/load_afsp_sample.Rd Updates documented sample size and clarifies wrapper return shape.
man/dot-afsp_as_pedons.Rd New generated Rd for the internal unwrapping helper.
man/benchmark_afsp.Rd Updates parameter docs to reflect wrapper acceptance.
DESCRIPTION Version/date bump to 0.9.111 / 2026-06-12.
Files not reviewed (3)
  • man/benchmark_afsp.Rd: Generated file
  • man/dot-afsp_as_pedons.Rd: Generated file
  • man/load_afsp_sample.Rd: Generated file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/afsp.R
Comment on lines +296 to +298
.afsp_as_pedons <- function(x) {
if (is.list(x) && !is.null(x$pedons) && is.list(x$pedons)) x$pedons else x
}
Comment on lines +96 to +98
expect_true(any(hit))
expect_equal(out$summary$n_compared[hit], 120L)
expect_gt(out$summary$accuracy[hit], 0.15)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants