Fix MissingDataError in multi_mra when features contain NaN by drussellmrichie · Pull Request #350 · larsiusprime/openavmkit

drussellmrichie · 2026-06-06T01:41:14Z

Problem

run_multi_mra / _run_multi_mra crashes with:

statsmodels.tools.sm_exceptions.MissingDataError: exog contains inf or nans

when any feature in ind_vars has NaN values. This is common with real-world property data — fields like bldg_parking_spaces, bldg_has_garage, bldg_central_air, etc. are sparsely recorded by assessors and arrive as NaN after the astype(float) conversion on line 1940.

Root cause

_run_multi_mra one-hot encodes categoricals and calls astype(float), which turns missing values in nullable Float64 columns into float64 NaN. There is no imputation step before sm.OLS(y_train, X_train).fit(), so statsmodels raises immediately.
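The following is a minimal sketch of how the missing value reaches the design matrix, using only pandas and NumPy. The frame and the bldg_style column are made up for illustration, and the encoding call is an assumption about what _run_multi_mra does rather than a copy of its code.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame: a nullable Float64 column with one missing
# value, plus a categorical column that gets one-hot encoded.
df = pd.DataFrame(
    {
        "bldg_parking_spaces": pd.array([2.0, None, 1.0], dtype="Float64"),
        "bldg_style": ["ranch", "colonial", "ranch"],
    }
)

# One-hot encode the categorical, then cast everything to plain float64.
X = pd.get_dummies(df, columns=["bldg_style"]).astype(float)

# The missing entry is now an ordinary float64 NaN, so the matrix is not
# fully finite, which is the condition the error message complains about.
has_missing = bool(X.isna().to_numpy().any())
is_finite = bool(np.isfinite(X.to_numpy()).all())
print(X.dtypes.unique(), has_missing, is_finite)
```

Passing a matrix like X to sm.OLS without imputation is the situation the traceback above describes.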

Fix

Impute NaN with training-set column medians immediately before the global OLS fit, and propagate the same medians to X_test, X_sales, and X_univ so all splits are treated consistently:

col_medians = X_train.median()
X_train = X_train.fillna(col_medians)
ds_prepped.X_test  = ds_prepped.X_test.fillna(col_medians)
ds_prepped.X_sales = ds_prepped.X_sales.fillna(col_medians)
ds_prepped.X_univ  = ds_prepped.X_univ.fillna(col_medians)

This is the same median-imputation pattern already applied in utilities/stats.py (calc_elastic_net_regularization, calc_p_values_recursive_drop, calc_t_values_recursive_drop, calc_vif_recursive_drop) and accepted upstream in PRs #313, #316, #318.
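As a self-contained illustration of the pattern, the sketch below computes medians on the training split only and reuses them for another split. The helper name impute_with_train_medians and the toy frames are hypothetical and are not part of openavmkit.

```python
import numpy as np
import pandas as pd


def impute_with_train_medians(X_train, *others):
    """Fill NaN in every split using medians computed on X_train only."""
    col_medians = X_train.median()  # NaN entries are skipped by default
    filled = [X_train.fillna(col_medians)]
    filled += [X.fillna(col_medians) for X in others]
    return col_medians, filled


# Toy splits standing in for X_train and X_test.
X_train = pd.DataFrame({"spaces": [1.0, np.nan, 3.0], "garage": [0.0, 1.0, 1.0]})
X_test = pd.DataFrame({"spaces": [np.nan, 4.0], "garage": [np.nan, 0.0]})

col_medians, (X_train_f, X_test_f) = impute_with_train_medians(X_train, X_test)
print(col_medians)
print(X_test_f)
```

Computing the medians once on the training data keeps information from the test, sales, and universe rows out of the fill values, and it guarantees that every split is filled with identical numbers.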

Test plan

- Run multi_mra against a dataset with at least one feature containing NaN values and confirm it completes without MissingDataError
- Confirm predictions are produced for all rows (train, test, sales, universe)
- Confirm no change in behaviour when all features are fully populated
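The first and third checks can be approximated without running the full pipeline. The sketch below is only a stand-in under that assumption: is_clean is a hypothetical helper, the frames are toy data, and multi_mra itself is not invoked.

```python
import numpy as np
import pandas as pd


def is_clean(X: pd.DataFrame) -> bool:
    """True when X contains no NaN and no +/-inf."""
    return bool(np.isfinite(X.to_numpy(dtype=float)).all())


# A fully populated frame and a copy with one missing value.
full = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 1.0, 0.0]})
sparse = full.copy()
sparse.loc[1, "a"] = np.nan

imputed_full = full.fillna(full.median())
imputed_sparse = sparse.fillna(sparse.median())

# Imputation makes the sparse frame usable and leaves the full one untouched.
print(is_clean(sparse), is_clean(imputed_sparse), imputed_full.equals(full))
```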

🤖 Generated with Claude Code

Real-world property data routinely has missing values (e.g. parking spaces, garage counts, central air) that survive as NaN through the one-hot encoding and astype(float) steps. statsmodels OLS raises MissingDataError on any NaN in the exog matrix.

Guard by imputing NaN with training-set column medians before the global OLS fit, then propagate the same medians to X_test, X_sales, and X_univ so all splits are treated consistently.

This is the same median-imputation pattern already applied in utilities/stats.py (calc_elastic_net_regularization, calc_p_values_recursive_drop, etc.) and accepted upstream in PRs larsiusprime#313, larsiusprime#316, larsiusprime#318.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-06-06T01:41:26Z

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.

No action is required from you in this PR thread. Once you have signed the CLA externally, a maintainer will verify your signature and record it here on your behalf by commenting:

I affirm that this contributor has signed the CLA

Russell Richie does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

The initial fix only called fillna(col_medians), but two further cases caused the error to persist:
- inf values (e.g. FAR = area/land_area where land_area=0) are not caught by fillna; replace inf→NaN first.
- Columns that are entirely NaN produce median()=NaN, making fillna a no-op; fall back to 0.0 for such columns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
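The two extra cases can be sketched as follows. This is an illustration of the approach described in the commit message, not the PR's actual diff; safe_medians, impute, and the toy columns are hypothetical names.

```python
import numpy as np
import pandas as pd


def safe_medians(X_train: pd.DataFrame) -> pd.Series:
    """Training medians with inf treated as missing and all-NaN columns set to 0.0."""
    cleaned = X_train.replace([np.inf, -np.inf], np.nan)
    # An entirely missing column has median NaN; fall back to 0.0 there.
    return cleaned.median().fillna(0.0)


def impute(X: pd.DataFrame, medians: pd.Series) -> pd.DataFrame:
    """Replace inf with NaN first, then fill every NaN from the given medians."""
    return X.replace([np.inf, -np.inf], np.nan).fillna(medians)


# FAR = area / land_area yields inf where land_area is 0.
area = pd.Series([1000.0, 1200.0, 3000.0])
land_area = pd.Series([2000.0, 0.0, 2000.0])

X_train = pd.DataFrame(
    {
        "far": area / land_area,
        "all_missing": [np.nan, np.nan, np.nan],
    }
)

medians = safe_medians(X_train)
X_train_f = impute(X_train, medians)
print(medians)
print(X_train_f)
```

The same medians would then be passed to impute for X_test, X_sales, and X_univ, so an inf or an all-NaN column in any split is handled the same way as in training.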

drussellmrichie wants to merge 2 commits into larsiusprime:master from drussellmrichie:pr/multi-mra-nan