Detect orphaned multi-parcel (bulk) deeds in sales scrutiny by drussellmrichie · Pull Request #351 · larsiusprime/openavmkit

drussellmrichie · 2026-06-06T19:20:48Z

Problem

A bulk deed — one key_sale recorded against many parcels — carries a single consideration covering the whole bundle, which gets stamped onto each parcel. The per-parcel sale price is therefore not a usable arm's-length signal, and these sales badly distort land/value modeling.

run_heuristics already tries to catch these with two duplicate-based heuristics: repeated deed + date (flag_dupe_deed_date) and repeated date + price (flag_dupe_date_price). Those work only while the deed's sibling rows are still in the sales table.

But process_data (1) filters to valid_sale == True and (2) de-duplicates sales to one row per parcel. After that thinning, a multi-parcel deed can be reduced to a single orphan row — the other parcels' rows are gone (filtered out, or displaced by a later sale on those parcels). The orphan still carries the inflated bundle price, but there's no longer any duplicate for the date/price or deed/date heuristics to match against, so it passes scrutiny silently.
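The orphaning described above can be reproduced on a toy table. This is a minimal sketch, not the project's code: the data is invented, `dupe_date_price` is a hypothetical stand-in for the date+price heuristic, and "keep the most recent sale per parcel" is an assumed de-duplication rule.

```python
import pandas as pd

# Toy sales table (all values invented for illustration).
# Deed D1 covers three parcels and stamps the same bundle price on each.
sales = pd.DataFrame(
    {
        "key_sale": ["D1", "D1", "D1", "D2", "D3", "D4"],
        "key": ["P1", "P2", "P3", "P2", "P3", "P4"],
        "sale_date": pd.to_datetime(
            ["2020-01-15", "2020-01-15", "2020-01-15",
             "2021-06-01", "2022-03-10", "2020-02-20"]
        ),
        "sale_price": [900_000, 900_000, 900_000, 150_000, 175_000, 120_000],
        "valid_sale": [True] * 6,
    }
)


def dupe_date_price(df):
    """Hypothetical stand-in: rows sharing (sale_date, sale_price) with another row."""
    return df.duplicated(subset=["sale_date", "sale_price"], keep=False)


# Before thinning, all three D1 rows match each other and are flagged.
flagged_before = sales.loc[dupe_date_price(sales), "key_sale"].tolist()

# Assumed thinning rule: keep valid sales, then the latest sale per parcel.
thinned = (
    sales[sales["valid_sale"]]
    .sort_values("sale_date")
    .drop_duplicates(subset="key", keep="last")
)

# After thinning, D1 survives only on P1 and has nothing left to match against.
flagged_after = thinned.loc[dupe_date_price(thinned), "key_sale"].tolist()

print(flagged_before)  # ['D1', 'D1', 'D1']
print(flagged_after)   # []
```

In this sketch, P2 and P3 are displaced by later single-parcel sales, so the only remaining D1 row still carries the 900,000 bundle price but no longer looks like a duplicate.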

Real-world impact

On a full county dataset (Philadelphia, ~770k RTT records), ~370 bulk-deed sales passed scrutiny and leaked into the vacant-land training set. Every surviving deed had identical date+price across all its raw parcels, but only one row remained in sup.sales. The contamination inflated the vacant-land median price-per-sqft from ~$39 to ~$172 (≈4×). Dropping these (plus a vacant $/sqft sanity cap, not part of this PR) cut the vacant-land ratio-study COD (trimmed) from 52.1 to 44.2.

Fix

- process_data: compute deed→parcel multiplicity (sale_parcel_count) right after the sales merge, before the valid-sale filter and de-dup, so the signal survives the thinning.
- flag_bulk_deeds() helper + flag_bulk_deed heuristic in run_heuristics: prefers sale_parcel_count (catches orphans); falls back to counting distinct parcels per deed in the current table (catches bulk deeds whose duplicate rows survive). Writes the usual out/sales_scrutiny/multi_parcel_bulk_deeds.xlsx report.
- Gated on column presence, so it is a no-op for datasets without key/key_sale. Honors the existing drop flag (flag-only when drop=False).
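The two-stage approach in the Fix section can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `add_sale_parcel_count` and `flag_bulk_deeds_sketch` are hypothetical names, the `threshold` parameter is invented, missing counts are treated as "not bulk", and the thinning rule (latest sale per parcel) is assumed.

```python
import pandas as pd


def add_sale_parcel_count(sales):
    """Count distinct parcels per deed on the full table, before any thinning."""
    out = sales.copy()
    out["sale_parcel_count"] = out.groupby("key_sale")["key"].transform("nunique")
    return out


def flag_bulk_deeds_sketch(df, threshold=2):
    """Return a boolean Series: True where the row belongs to a multi-parcel deed."""
    if "sale_parcel_count" in df.columns:
        # Preferred path: the precomputed count still remembers thinned-away siblings.
        counts = pd.to_numeric(df["sale_parcel_count"], errors="coerce")
    elif {"key", "key_sale"}.issubset(df.columns):
        # Fallback path: only sees siblings that are still present in df.
        counts = df.groupby("key_sale")["key"].transform("nunique")
    else:
        # No usable columns: flag nothing.
        return pd.Series(False, index=df.index)
    # Assumption: a missing count means "not known to be bulk".
    return counts.fillna(0) >= threshold


# Toy data (invented): deed D1 covers P1-P3; P2 and P3 later resell individually.
raw = pd.DataFrame(
    {
        "key_sale": ["D1", "D1", "D1", "D2", "D3"],
        "key": ["P1", "P2", "P3", "P2", "P3"],
        "sale_date": pd.to_datetime(
            ["2020-01-15", "2020-01-15", "2020-01-15", "2021-06-01", "2022-03-10"]
        ),
        "sale_price": [900_000, 900_000, 900_000, 150_000, 175_000],
    }
)

counted = add_sale_parcel_count(raw)
thinned = counted.sort_values("sale_date").drop_duplicates(subset="key", keep="last")

orphan_flags = flag_bulk_deeds_sketch(thinned)
fallback_flags = flag_bulk_deeds_sketch(thinned.drop(columns="sale_parcel_count"))

print(orphan_flags.tolist())    # [True, False, False]
print(fallback_flags.tolist())  # [False, False, False]

# drop=True style behavior: remove flagged rows; flag-only would keep them.
kept = thinned[~orphan_flags]
```

The contrast between the two flag lists is the point: once the siblings are gone, only a count taken before thinning can identify the orphan.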

Behavior change

With drop=True (the default), orphaned bulk deeds that previously survived will now be dropped — this is the intended correctness improvement. Non-orphan bulk deeds were already dropped by flag_dupe_date_price, so for most datasets the incremental change is the small orphan set. drop=False preserves flag-only behavior.

Tests

tests/test_sales_scrutiny.py covers the new flag_bulk_deeds helper: precomputed-count (orphan) path, fallback path, precomputed-takes-precedence, NA handling, and the no-usable-columns no-op.

(Note: I couldn't run the full tests/test_data.py locally — master's openavmkit.modeling imports ngboost, which wasn't in my environment. The new tests pass and the touched modules compile.)

A bulk deed (one `key_sale` recorded against many parcels) carries a single consideration covering the whole bundle, which gets stamped onto each parcel, so the per-parcel sale price is not a usable arm's-length signal.

The existing duplicate-based heuristics (repeated deed+date, repeated date+price) only catch bulk deeds while their sibling rows are still present in the sales table. But `process_data` de-duplicates sales to one row per parcel and filters out invalid sales, so a deed can be reduced to a single "orphan" row whose siblings are gone -- it still carries the inflated bundle price, but there is no duplicate left for those heuristics to match against.

In a real county dataset this leaked ~370 bulk-deed sales into the vacant-land training set (it inflated the vacant median price-per-sqft ~4x) that scrutiny silently passed.

Fix:
- `process_data`: compute deed->parcel multiplicity (`sale_parcel_count`) right after the sales merge, before the valid-sale filter and de-dup, so the signal survives the thinning.
- `flag_bulk_deeds()` helper + a `flag_bulk_deed` heuristic in `run_heuristics`: prefers `sale_parcel_count` (catches orphans), falls back to counting parcels per deed in the current table (catches bulk deeds whose rows survive).
- Gated on column presence -> no-op for datasets without `key`/`key_sale`. Honors the existing `drop` flag (flag-only when `drop=False`).
- Unit tests for the helper (precomputed, fallback, NA, and no-op paths).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-06T19:20:55Z

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.

No action is required from you in this PR thread. Once you have signed the CLA externally, a maintainer will verify your signature and record it here on your behalf by commenting:

I affirm that this contributor has signed the CLA

Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

drussellmrichie wants to merge 1 commit into larsiusprime:master from drussellmrichie:bulk-deed-multiplicity
