Fix MissingDataError in multi_mra when features contain NaN #350

Open
drussellmrichie wants to merge 2 commits into larsiusprime:master from drussellmrichie:pr/multi-mra-nan
@drussellmrichie

Contributor

Problem

run_multi_mra / _run_multi_mra crashes with:

statsmodels.tools.sm_exceptions.MissingDataError: exog contains inf or nans

when any feature in ind_vars has NaN values. This is common with real-world property data: fields such as bldg_parking_spaces, bldg_has_garage, and bldg_central_air are sparsely recorded by assessors and arrive as NaN after the astype(float) conversion on line 1940.

Root cause

_run_multi_mra one-hot encodes categoricals and calls astype(float), which carries missing values in nullable Float64 columns through as float64 NaN. There is no imputation step before sm.OLS(y_train, X_train).fit(), so statsmodels raises as soon as the model is constructed.
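
A minimal sketch of how the NaN survives, using only pandas and NumPy. The column names are borrowed from the description above, but the values and the bldg_type column are invented for illustration, and the finiteness check at the end is only a stand-in for whatever statsmodels does internally:

```python
import numpy as np
import pandas as pd

# Toy feature frame: one nullable Float64 column with a missing value,
# one categorical column. Values are made up for illustration.
df = pd.DataFrame(
    {
        "bldg_parking_spaces": pd.array([2.0, None, 1.0], dtype="Float64"),
        "bldg_type": ["house", "condo", "house"],
    }
)

# One-hot encode the categorical, then cast everything to plain float64,
# mirroring the two steps described in the root cause.
X = pd.get_dummies(df, columns=["bldg_type"]).astype(float)

# The missing value is now an ordinary float64 NaN, so the design matrix
# is no longer entirely finite.
has_non_finite = not np.isfinite(X.to_numpy()).all()
print(X)
print("contains inf or NaN:", has_non_finite)
```

Passing a matrix like X to an OLS fit without imputing first is the situation this PR addresses.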

Fix

Impute NaN with training-set column medians immediately before the global OLS fit, and propagate the same medians to X_test, X_sales, and X_univ so all splits are treated consistently:

col_medians = X_train.median()
X_train = X_train.fillna(col_medians)
ds_prepped.X_test  = ds_prepped.X_test.fillna(col_medians)
ds_prepped.X_sales = ds_prepped.X_sales.fillna(col_medians)
ds_prepped.X_univ  = ds_prepped.X_univ.fillna(col_medians)

This is the same median-imputation pattern already applied in utilities/stats.py (calc_elastic_net_regularization, calc_p_values_recursive_drop, calc_t_values_recursive_drop, calc_vif_recursive_drop) and accepted upstream in PRs #313, #316, #318.
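
To illustrate why the medians are taken from X_train only and then reused, here is a small sketch on toy frames. The column names and values are hypothetical, and plain DataFrames stand in for the ds_prepped attributes:

```python
import numpy as np
import pandas as pd

# Hypothetical training and test splits with gaps in both.
X_train = pd.DataFrame(
    {
        "sqft": [1000.0, 1200.0, np.nan, 2000.0],
        "parking": [1.0, np.nan, 2.0, 3.0],
    }
)
X_test = pd.DataFrame(
    {
        "sqft": [np.nan, 1500.0],
        "parking": [np.nan, 0.0],
    }
)

# DataFrame.median skips NaN by default, so each median is computed
# from the observed training values only.
col_medians = X_train.median()

# The same Series fills every split, so no statistic from the test rows
# leaks into the values the model sees.
X_train_filled = X_train.fillna(col_medians)
X_test_filled = X_test.fillna(col_medians)

print(col_medians)
print(X_test_filled)
```

Filling X_test from its own medians would also avoid the crash, but the splits would then be imputed with different values for the same column.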

Test plan

  • Run multi_mra against a dataset with at least one feature containing NaN values — confirm it completes without MissingDataError
  • Confirm predictions are produced for all rows (train, test, sales, universe)
  • Confirm no change in behaviour when all features are fully populated
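
The checks above could be sketched roughly as follows. Both helpers are hypothetical and not part of the codebase: impute_with_train_medians wraps the fillna pattern from the fix, and fit_predict is a NumPy least-squares stand-in for the statsmodels OLS fit, used here only so the sketch runs without extra dependencies:

```python
import numpy as np
import pandas as pd


def impute_with_train_medians(X_train, *others):
    """Hypothetical helper: fill NaN in every split with X_train's medians."""
    medians = X_train.median()
    return tuple(df.fillna(medians) for df in (X_train, *others))


def fit_predict(X_train, y_train, X_other):
    """Least-squares stand-in for the OLS fit (not the statsmodels call)."""
    A = np.column_stack([np.ones(len(X_train)), X_train.to_numpy()])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.column_stack([np.ones(len(X_other)), X_other.to_numpy()])
    return B @ beta


rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(20, 2)), columns=["a", "b"])
X_test = pd.DataFrame(rng.normal(size=(5, 2)), columns=["a", "b"])
y_train = 3.0 + X_train["a"].to_numpy() - 2.0 * X_train["b"].to_numpy()

# Fully populated features: imputation must leave the frames untouched.
clean_train, clean_test = impute_with_train_medians(X_train, X_test)
pd.testing.assert_frame_equal(clean_train, X_train)
pd.testing.assert_frame_equal(clean_test, X_test)

# Features with gaps: every row should still receive a finite prediction.
holey_train = X_train.copy()
holey_train.loc[0, "a"] = np.nan
holey_test = X_test.copy()
holey_test.loc[1, "b"] = np.nan
filled_train, filled_test = impute_with_train_medians(holey_train, holey_test)
preds = fit_predict(filled_train, y_train, filled_test)
```

A real test would call run_multi_mra on a prepared dataset instead of these stand-ins.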

🤖 Generated with Claude Code

Real-world property data routinely has missing values (e.g. parking
spaces, garage counts, central air) that survive as NaN through the
one-hot encoding and astype(float) steps.  statsmodels OLS raises
MissingDataError on any NaN in the exog matrix.

Guard by imputing NaN with training-set column medians before the
global OLS fit, then propagate the same medians to X_test, X_sales,
and X_univ so all splits are treated consistently.  This is the same
median-imputation pattern already applied in utilities/stats.py
(calc_elastic_net_regularization, calc_p_values_recursive_drop, etc.)
and accepted upstream in PRs larsiusprime#313, larsiusprime#316, larsiusprime#318.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 6, 2026

Contributor

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.

No action is required from you in this PR thread. Once you have signed the CLA externally, a maintainer will verify your signature and record it here on your behalf by commenting:


I affirm that this contributor has signed the CLA


Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

The initial fix only called fillna(col_medians), but two further cases
caused the error to persist:
- inf values (e.g. FAR = area/land_area where land_area=0) are not
  caught by fillna; replace inf→NaN first.
- Columns that are entirely NaN produce median()=NaN, making fillna
  a no-op; fall back to 0.0 for such columns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
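
The two extra cases from this commit message can be sketched as follows. The helper name impute_non_finite and the toy columns are hypothetical, and applying the inf replacement to the non-training splits as well is an assumption here; the commit's actual diff may be structured differently:

```python
import numpy as np
import pandas as pd


def impute_non_finite(X_train, *others, fallback=0.0):
    """Hypothetical helper; not the function name used in the PR."""
    # 1. Treat +/-inf as missing so that fillna can see it.
    X_train = X_train.replace([np.inf, -np.inf], np.nan)
    # 2. Medians come from training rows only. A column that is entirely
    #    NaN has a NaN median, so fall back to a constant for it.
    medians = X_train.median().fillna(fallback)
    filled = [X_train.fillna(medians)]
    for df in others:
        filled.append(df.replace([np.inf, -np.inf], np.nan).fillna(medians))
    return tuple(filled)


X_train = pd.DataFrame(
    {
        "far": [0.5, np.inf, 1.5],            # e.g. area / land_area with land_area == 0
        "bldg_central_air": [np.nan] * 3,     # never recorded in training rows
    }
)
X_test = pd.DataFrame(
    {
        "far": [-np.inf, 2.0],
        "bldg_central_air": [np.nan, 1.0],
    }
)

train_f, test_f = impute_non_finite(X_train, X_test)
print(train_f)
print(test_f)
```

Without step 1 the inf in far would pass through fillna unchanged, and without the fallback the all-NaN column would stay NaN in every split.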