fix: consume pdex's log2_fold_change directly (stop double-logging)#234
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the cell-eval package to version 0.7.1 and transitions the differential expression (DE) logic to use log2 fold change as the primary metric instead of raw fold change. Key changes include removing the fold_change_col parameter from DEResults and initialization functions, updating integrity checks to validate log2 fold change data, and adjusting the Spearman correlation metric. A new test case was added to ensure negative log2 fold change values are correctly handled. Feedback highlights potential AttributeErrors due to renamed enum members, an opportunity to optimize integrity checks using a single select pass, and a suggestion to improve type safety by explicitly casting columns before numeric operations.
| f"Identified {fc_num_nan} NaN fold change values ({self.name})" | ||
| f"Identified {lfc_num_nan} NaN log2 fold change values ({self.name})" | ||
| ) | ||
| logger.info(f"DE data integrity check complete. ({self.name})") |
There was a problem hiding this comment.
The PR summary mentions renaming DESortBy.FOLD_CHANGE to LOG2_FOLD_CHANGE and ABS_FOLD_CHANGE to ABS_LOG2_FOLD_CHANGE. However, several call sites in this file (e.g., lines 153 and 219) still use the old names in the provided context. Please ensure all references are updated to match the new enum definition to avoid AttributeError.
There was a problem hiding this comment.
Outdated — lines 153 and 219 are updated in this PR (DESortBy.LOG2_FOLD_CHANGE / DESortBy.ABS_LOG2_FOLD_CHANGE). Search the file for the old names returns no hits.
| lfc_num_null = self.data.filter( | ||
| pl.col(self.log2_fold_change_col).is_null() | ||
| ).height | ||
| lfc_num_inf = self.data.filter( | ||
| pl.col(self.log2_fold_change_col).is_infinite() | ||
| ).height | ||
| lfc_num_nan = self.data.filter( | ||
| pl.col(self.log2_fold_change_col).is_nan() | ||
| ).height |
There was a problem hiding this comment.
The integrity check performs three separate filter operations on the dataframe, which is inefficient for large datasets. Additionally, is_infinite() and is_nan() will raise an error if the column is not already a floating-point type. Since type enforcement happens later in __post_init__, it is more efficient and safer to compute these counts in a single select pass while ensuring the column is cast to a numeric type for the check.
lfc_stats = self.data.select(
null_count=pl.col(self.log2_fold_change_col).is_null().sum(),
inf_count=pl.col(self.log2_fold_change_col).cast(pl.Float32).is_infinite().sum(),
nan_count=pl.col(self.log2_fold_change_col).cast(pl.Float32).is_nan().sum(),
)
lfc_num_null = lfc_stats.item(0, "null_count")
lfc_num_inf = lfc_stats.item(0, "inf_count")
lfc_num_nan = lfc_stats.item(0, "nan_count")There was a problem hiding this comment.
Out of scope. This block predates this PR — we only changed the column it inspects (fold_change_col → log2_fold_change_col). It's one-time init logging on an in-memory frame of at most ~10k×500, so the perf gain is negligible. The "cast to numeric for safety" half isn't needed either: the column is required and gets cast to Float32 further down, and is_nan/is_infinite don't error on numeric input.
| self.data = self.data.with_columns( | ||
| pl.col(self.fold_change_col) | ||
| .log(base=2) | ||
| .alias(self.log2_fold_change_col) | ||
| .fill_nan(0.0) | ||
| ).with_columns( | ||
| pl.col(self.log2_fold_change_col) | ||
| .abs() | ||
| .alias(self.abs_log2_fold_change_col) |
There was a problem hiding this comment.
To ensure type safety when calculating the absolute values, it is recommended to cast the column to a numeric type here, especially since the global type enforcement occurs later in the method.
| self.data = self.data.with_columns( | |
| pl.col(self.fold_change_col) | |
| .log(base=2) | |
| .alias(self.log2_fold_change_col) | |
| .fill_nan(0.0) | |
| ).with_columns( | |
| pl.col(self.log2_fold_change_col) | |
| .abs() | |
| .alias(self.abs_log2_fold_change_col) | |
| self.data = self.data.with_columns( | |
| pl.col(self.log2_fold_change_col) | |
| .cast(pl.Float32) | |
| .abs() | |
| .alias(self.abs_log2_fold_change_col) | |
| ) |
There was a problem hiding this comment.
Already handled. The trailing block in post_init casts every column in numeric_cols to Float32, and abs_log2_fold_change_col is in that list. Adding an explicit cast here would be redundant.
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Two typing chore commits in here that are unrelated to the
Both surfaced because |
pdex 0.2.2is published to PyPI.This PR depends on the new
log2_fold_changecolumn that pdex 0.2.2 emits. Until that release is on PyPI,uv syncin this repo cannot resolve dependencies (pdex>=0.2.2is unsatisfiable) and CI will fail.Summary
Cell-eval was double-logging fold changes.
DEResults.__post_init__appliedlog(base=2).fill_nan(0.0)to thefold_changecolumn on the assumption it was a linear ratio. Since pdex 0.2.0,fold_changehas actually heldlog2(target/ref)— so we were computinglog2(log2(FC)), which produces NaN for every down regulated gene (negatives) and was then zeroed byfill_nan(0.0). Result: roughly half of all DE values silently set to zero, corrupting every metric that ranks byabs_log2_fold_change(overlap_at_N,precision_at_N,de_direction_match,DESpearmanLFC).Discussion: ArcInstitute/cell-eval#232.
The upstream fix in pdex#74 adds an explicit
log2_fold_changecolumn. This PR consumes it directly and stops touchingfold_change— so cell-eval also survives pdex 0.3.0, which removes thefold_changealias entirely.Changes
src/cell_eval/_types/_de.pylog2_fold_changeis now the required column;fold_changeis no longer read.pl.col(fold_change).log(base=2).fill_nan(0.0)derivation.abs_log2_fold_changeis derived fromlog2_fold_change(when not already provided).fold_change_colkwarg fromDEResultsandinitialize_de_comparison. It pointed at a column cell-eval no longer reads or requires; silent-ignore would have been a footgun.src/cell_eval/_types/_enums.pyDESortBy.FOLD_CHANGE→DESortBy.LOG2_FOLD_CHANGEandDESortBy.ABS_FOLD_CHANGE→DESortBy.ABS_LOG2_FOLD_CHANGE. The values were already"log2_fold_change"/"abs_log2_fold_change"; themember names were misleading.
src/cell_eval/metrics/_de.pyDESpearmanLFCnow readslog2_fold_change_colinstead offold_change_col. The class name says LFC; usingfold_changewas only "correct" because of the pdex bug.DEDirectionMatchalready usedlog2_fold_change_col— no change.DESortBy, whose values are already"log2_fold_change"and"abs_log2_fold_change". The corruption was purely in the column values; once__post_init__is fixed theyauto-recover.
pyproject.tomlpdex>=0.2.0→pdex>=0.2.2.0.7.0→0.7.1.tests/test_de_float_types.py)log2_fold_change.log2_fold_change = [-1.0, 0.0, 1.0, 2.0]must yieldabs_log2_fold_change = [1.0, 0.0, 1.0, 2.0](pre-fix the negative was zeroed via the NaN path).Breaking changes
DESortBy.FOLD_CHANGEandDESortBy.ABS_FOLD_CHANGEare renamed toDESortBy.LOG2_FOLD_CHANGEandDESortBy.ABS_LOG2_FOLD_CHANGE. No deprecation aliases — pre-1.0 cleanup.fold_change_colkwarg is removed fromDEResultsandinitialize_de_comparison. Callers that explicitly passed it must drop the argument.MetricsEvaluator(de_pred=..., de_real=...)(CSV/parquet) must now contain alog2_fold_changecolumn. Tables with only a linearfold_changewill fail withValueError: Missing required columns: {'log2_fold_change'}. This is intentional — silently re-deriving was the original bug, and cell-eval cannot infer the semantics of an externalfold_changecolumn.Test plan
pdex 0.2.2available on PyPIuv syncresolvesuv run pytest -v— all tests pass, including the regression testuv run ruff formatcleanabs_log2_fold_change, finiteDESpearmanLFCcorrelations on frames containing negative log2 FC)