Fix #356 compatibility #361
The new cross-validation kwargs (split_coord="story", kfold="group") change the ceiling from 0.2103 to 0.1446.
- blank2014: add story coord to predictions (needed for split_coord="story")
- fedorenko2016: add sentence_id coord to predictions (needed for split_coord="sentence_id")
- pereira2018: replace .coords check with try/except to handle xarray MultiIndex levels not appearing in .coords (xarray 2022.3.0 behavior)
- pereira2018/test.py: update test_dummy_bad expected scores for new GroupKFold CV strategy (243: 0.0186->0.0, 384: 0.0334->0.0168)
- linear_predictivity/test.py: update expected score (0.0283->0.0410)
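As a rough illustration of why the group-based split needs a grouping coord on the predictions, here is a minimal sketch using scikit-learn's GroupKFold directly. The toy stories and array names are hypothetical and are not taken from the benchmark code; the sketch only shows that no story ends up on both sides of a fold, which plain shuffled KFold does not guarantee.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical toy data: 12 stimuli drawn from 4 stories, 3 stimuli each.
story = np.repeat(["story_a", "story_b", "story_c", "story_d"], 3)
X = np.arange(len(story), dtype=float).reshape(-1, 1)


def leaky_folds(splits, groups):
    """Count folds in which at least one group is in both train and test."""
    return sum(
        bool(set(groups[train_idx]) & set(groups[test_idx]))
        for train_idx, test_idx in splits
    )


# Group-based CV: every story is held out as a whole.
group_leaks = leaky_folds(GroupKFold(n_splits=4).split(X, groups=story), story)

# Shuffled CV ignores the story labels, so stories can straddle a fold.
shuffle_leaks = leaky_folds(
    KFold(n_splits=4, shuffle=True, random_state=0).split(X), story
)

print("folds with story leakage (group CV):", group_leaks)
print("folds with story leakage (shuffle CV):", shuffle_leaks)
```

Holding out whole stories typically makes the task harder than a shuffled split, which is consistent with the lower scores and ceiling reported above.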
@BKHMSI Would like your feedback on my changes for your PR when you have the chance.
Hi @KartikP, thanks for making it work with the testing infrastructure. The only comment I have is that the benchmarks with linear regression (e.g., …
I'll verify that your original implementation can reproduce previous scores. Otherwise, I'll mark the …
Renames Blank2014-linear, Fedorenko2016-linear, and Pereira2018.{243,384}sentences-linear
to *-linear-shuffle to disambiguate the shuffle-CV legacy variants from the new
group-CV ridge variants. Makes the shuffle cv_kwargs explicit on the renamed variants.
Pereira2018 keeps loading its cached ceiling from the legacy S3 identifier so the
existing S3 artifacts are reused. Tests, integration tests, and examples updated.
# Conflicts:
#	brainscore_language/benchmarks/pereira2018/benchmark.py
…arks (fedorenko2016,pereira2018,tuckute2024,blank2014)
Pereira2018_*_ridge() now loads its cached extrapolation ceiling from brainscore-storage/brainscore-language/ (ceiling, raw, raw_raw) instead of recomputing on every benchmark instantiation.
The linear ceiling files migrated from brainscore-language (direct) to brainscore-storage/brainscore-language/ without versioning enabled, so the old direct-bucket version_ids return a 404 when load_from_s3 hits storage. The SHA-1 hashes match the storage objects exactly, so the integrity check still verifies the download.
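To make the integrity argument concrete, here is a minimal sketch of a SHA-1 check on a downloaded file. It uses only the standard library; `sha1_of_file` and `verify_download` are hypothetical helpers written for illustration and are not the load_from_s3 implementation. The point is that the check depends only on the file bytes, so it keeps working regardless of which bucket or version the bytes came from.

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha1):
    """Hypothetical helper: raise if the file does not match the expected hash."""
    actual = sha1_of_file(path)
    if actual != expected_sha1:
        raise ValueError(f"sha1 mismatch for {path}: {actual} != {expected_sha1}")
    return path


# Usage with placeholder bytes standing in for a downloaded ceiling file.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "ceiling.nc"
    payload = b"placeholder ceiling bytes"
    artifact.write_bytes(payload)
    expected = hashlib.sha1(payload).hexdigest()

    verified_ok = verify_download(artifact, expected) == artifact

    try:
        verify_download(artifact, "0" * 40)  # deliberately wrong hash
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```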
Bootstrap iterations whose sampled scores do not fit the exponential (seen with the ridge metric on both Blank2014 and Fedorenko2016, where curve_fit hits its maxfev limit) now mark params as NaN and continue, matching the Pereira2018 ceiling_packaging.py behavior. Also filters NaN/Inf inputs before curve_fit.
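A minimal sketch of that fallback, assuming a saturating exponential of the form v0 * (1 - exp(-x / tau0)); the functional form, the name `fit_or_nan`, and the starting values are assumptions for illustration and are not copied from the ceiling code. Non-finite inputs are dropped first, and a RuntimeError from scipy's curve_fit (which is what it raises when the fit does not converge within maxfev) is turned into NaN parameters so the bootstrap loop can continue.

```python
import numpy as np
from scipy.optimize import curve_fit


def saturating_exp(x, v0, tau0):
    # Assumed extrapolation curve; the real ceiling code may use another form.
    return v0 * (1 - np.exp(-x / tau0))


def fit_or_nan(x, y, p0=(1.0, 1.0), maxfev=1000):
    """Fit the curve; return NaN params instead of raising on a failed fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.isfinite(x) & np.isfinite(y)  # filter NaN/Inf before fitting
    x, y = x[keep], y[keep]
    if len(x) < len(p0):
        return np.full(len(p0), np.nan)
    try:
        params, _ = curve_fit(saturating_exp, x, y, p0=p0, maxfev=maxfev)
    except RuntimeError:
        # curve_fit could not find optimal parameters (e.g. maxfev reached).
        return np.full(len(p0), np.nan)
    return params


# Usage: noiseless toy scores with one NaN mixed in.
num_subjects = np.arange(1, 11, dtype=float)
scores = saturating_exp(num_subjects, 0.5, 2.0)
scores[3] = np.nan

good_params = fit_or_nan(num_subjects, scores)
failed_params = fit_or_nan(num_subjects, scores, maxfev=1)  # forced failure
empty_params = fit_or_nan(num_subjects, np.full(10, np.nan))
```

Downstream aggregation then needs NaN-aware reductions (for example np.nanmedian) so that the failed iterations are skipped rather than poisoning the estimate.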
From #356, with additional changes to get it to work with the testing infrastructure.