Add weighted MAE to results table and tests#2
Merged
Conversation
andrewpbray
commented
May 26, 2026
Collaborator
- compute_mae_and_isp now returns MAE, wMAE, and ISP separately; wMAE uses rubric point weights from metadata when supplied, NA otherwise
- generate_results_row passes metadata.json to compute_mae_and_isp and populates wMAE_ columns alongside MAE_ and ISP_
- generate_gt_results_table adds wMAE spanner and label stripping
- New test file test-compute-mae-and-isp.R with 18 tests covering all three metrics across no-metadata, with-metadata, and edge cases
- compute_mae_and_isp now returns MAE, wMAE, and ISP separately; wMAE uses rubric point weights from metadata when supplied, NA otherwise - generate_results_row passes metadata.json to compute_mae_and_isp and populates wMAE_ columns alongside MAE_ and ISP_ - generate_gt_results_table adds wMAE spanner and label stripping - New test file test-compute-mae-and-isp.R with 18 tests covering all three metrics across no-metadata, with-metadata, and edge cases
Contributor
There was a problem hiding this comment.
Pull request overview
This PR extends the grading-accuracy metrics pipeline to include a point-weighted MAE (wMAE) derived from rubric point values in metadata.json, and surfaces it in the generated results table output, along with new unit tests for the updated metric output shape.
Changes:
- Update
compute_mae_and_isp()to return a 3-metric list:MAE,wMAE, andISP. - Add
wMAE_columns to generated results rows and display them ingttables via a new “wMAE” spanner and label stripping. - Add a new
testthatfile covering corecompute_mae_and_isp()behavior with and without metadata.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
R/computing-accuracy.R |
Changes compute_mae_and_isp() to return MAE, wMAE, ISP and introduces weighted computation via metadata-derived rubric scores. |
R/generate-results-table.R |
Plumbs metadata into metric computation, adds wMAE_ columns, and updates gt formatting to include wMAE. |
tests/testthat/test-compute-mae-and-isp.R |
Adds new tests validating the new return structure and wMAE behavior in basic scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
274
to
+279
| eval1 <- readr::read_csv(file1, show_col_types = FALSE) | ||
| eval2 <- readr::read_csv(file2, show_col_types = FALSE) | ||
| weights <- if (!is.null(metadata_file)) scores_from_metadata(metadata_file) else NULL | ||
| list(MAE = rubric_mae(eval1, eval2, weights = weights), | ||
| ISP = isp(eval1, eval2)) | ||
| list(MAE = rubric_mae(eval1, eval2), | ||
| wMAE = if (!is.null(weights)) rubric_mae(eval1, eval2, weights = weights) else NA_real_, | ||
| ISP = isp(eval1, eval2)) |
Comment on lines
276
to
+279
| weights <- if (!is.null(metadata_file)) scores_from_metadata(metadata_file) else NULL | ||
| list(MAE = rubric_mae(eval1, eval2, weights = weights), | ||
| ISP = isp(eval1, eval2)) | ||
| list(MAE = rubric_mae(eval1, eval2), | ||
| wMAE = if (!is.null(weights)) rubric_mae(eval1, eval2, weights = weights) else NA_real_, | ||
| ISP = isp(eval1, eval2)) |
Comment on lines
+277
to
+279
| list(MAE = rubric_mae(eval1, eval2), | ||
| wMAE = if (!is.null(weights)) rubric_mae(eval1, eval2, weights = weights) else NA_real_, | ||
| ISP = isp(eval1, eval2)) |
Comment on lines
+108
to
+113
| metrics <- compute_mae_and_isp(paste0(dir, f1), paste0(dir, f2), | ||
| metadata_file = metadata_file) | ||
|
|
||
| mae_vals[[pair_name]] <- metrics$MAE | ||
| isp_vals[[pair_name]] <- metrics$ISP | ||
| mae_vals[[pair_name]] <- metrics$MAE | ||
| wmae_vals[[pair_name]] <- metrics$wMAE | ||
| isp_vals[[pair_name]] <- metrics$ISP |
Comment on lines
+86
to
+90
| metadata_file <- paste0(dir, "/metadata.json") | ||
|
|
||
| mae_vals <- list() | ||
| wmae_vals <- list() | ||
| isp_vals <- list() |
Comment on lines
+10
to
+19
| # compute_mae_and_isp - without metadata ------------------------------------ | ||
|
|
||
| test_that("compute_mae_and_isp - returns MAE, wMAE (NA), and ISP without metadata", { | ||
| result <- compute_mae_and_isp(experts_csv, students_csv) | ||
|
|
||
| expect_named(result, c("MAE", "wMAE", "ISP")) | ||
| expect_true(is.numeric(result$MAE)) | ||
| expect_true(is.na(result$wMAE)) | ||
| expect_true(is.numeric(result$ISP)) | ||
| }) |
Comment on lines
+76
to
+80
| # All other rows match. Expected wMAE = 0.5 / 5 = 0.1 | ||
| weights <- c(1.0, 0.5, 0.5, 0.0) | ||
| result <- compute_mae_and_isp(experts_csv, students_csv, | ||
| metadata_file = metadata_json) | ||
| expected_wmae <- (0 + 0.5 + 0 + 0 + 0) / 5 |
| @@ -274,8 +274,9 @@ compute_mae_and_isp <- function(file1, file2, metadata_file = NULL){ | |||
| eval1 <- readr::read_csv(file1, show_col_types = FALSE) | |||
| eval2 <- readr::read_csv(file2, show_col_types = FALSE) | |||
| weights <- if (!is.null(metadata_file)) scores_from_metadata(metadata_file) else NULL | |||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.