Skip to content

Detect orphaned multi-parcel (bulk) deeds in sales scrutiny#351

Open
drussellmrichie wants to merge 1 commit into
larsiusprime:masterfrom
drussellmrichie:bulk-deed-multiplicity
Open

Detect orphaned multi-parcel (bulk) deeds in sales scrutiny#351
drussellmrichie wants to merge 1 commit into
larsiusprime:masterfrom
drussellmrichie:bulk-deed-multiplicity

Conversation

@drussellmrichie

Copy link
Copy Markdown
Contributor

Problem

A bulk deed — one key_sale recorded against many parcels — carries a single consideration covering the whole bundle, which gets stamped onto each parcel. The per-parcel sale price is therefore not a usable arm's-length signal, and these sales badly distort land/value modeling.

run_heuristics already tries to catch these with two duplicate-based heuristics: repeated deed + date (flag_dupe_deed_date) and repeated date + price (flag_dupe_date_price). Those work only while the deed's sibling rows are still in the sales table.

But process_data (1) filters to valid_sale == True and (2) de-duplicates sales to one row per parcel. After that thinning, a multi-parcel deed can be reduced to a single orphan row — the other parcels' rows are gone (filtered out, or displaced by a later sale on those parcels). The orphan still carries the inflated bundle price, but there's no longer any duplicate for the date/price or deed/date heuristics to match against, so it passes scrutiny silently.

Real-world impact

On a full county dataset (Philadelphia, ~770k RTT records), this leaked ~370 bulk-deed sales into the vacant-land training set that scrutiny passed. Every surviving deed had identical date+price across all its raw parcels, but only one row remained in sup.sales. The contamination inflated the vacant-land median price-per-sqft from ~$39 to ~$172 (≈4×). Dropping these (plus a vacant $/sqft sanity cap, not part of this PR) cut the vacant-land ratio-study COD from 52.1 → 44.2 (trimmed).

Fix

  • process_data — compute deed→parcel multiplicity (sale_parcel_count) right after the sales merge, before the valid-sale filter and de-dup, so the signal survives the thinning.
  • flag_bulk_deeds() helper + flag_bulk_deed heuristic in run_heuristics — prefers sale_parcel_count (catches orphans); falls back to counting distinct parcels per deed in the current table (catches bulk deeds whose duplicate rows survive). Writes the usual out/sales_scrutiny/multi_parcel_bulk_deeds.xlsx report.
  • Gated on column presence → no-op for datasets without key/key_sale. Honors the existing drop flag (flag-only when drop=False).

Behavior change

With drop=True (the default), orphaned bulk deeds that previously survived will now be dropped — this is the intended correctness improvement. Non-orphan bulk deeds were already dropped by flag_dupe_date_price, so for most datasets the incremental change is the small orphan set. drop=False preserves flag-only behavior.

Tests

tests/test_sales_scrutiny.py covers the new flag_bulk_deeds helper: precomputed-count (orphan) path, fallback path, precomputed-takes-precedence, NA handling, and the no-usable-columns no-op.

(Note: I couldn't run the full tests/test_data.py locally — master's openavmkit.modeling imports ngboost, which wasn't in my environment. The new tests pass and the touched modules compile.)

A bulk deed (one `key_sale` recorded against many parcels) carries a single
consideration covering the whole bundle, which gets stamped onto each parcel,
so the per-parcel sale price is not a usable arm's-length signal.

The existing duplicate-based heuristics (repeated deed+date, repeated
date+price) only catch bulk deeds while their sibling rows are still present
in the sales table. But `process_data` de-duplicates sales to one row per
parcel and filters out invalid sales, so a deed can be reduced to a single
"orphan" row whose siblings are gone -- it still carries the inflated bundle
price, but there is no duplicate left for those heuristics to match against.
In a real county dataset this leaked ~370 bulk-deed sales into the vacant-land
training set (it inflated the vacant median price-per-sqft ~4x) that scrutiny
silently passed.

Fix:
- `process_data`: compute deed->parcel multiplicity (`sale_parcel_count`)
  right after the sales merge, before the valid-sale filter and de-dup, so the
  signal survives the thinning.
- `flag_bulk_deeds()` helper + a `flag_bulk_deed` heuristic in `run_heuristics`:
  prefers `sale_parcel_count` (catches orphans), falls back to counting parcels
  per deed in the current table (catches bulk deeds whose rows survive).
- Gated on column presence -> no-op for datasets without `key`/`key_sale`.
  Honors the existing `drop` flag (flag-only when `drop=False`).
- Unit tests for the helper (precomputed, fallback, NA, and no-op paths).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.

No action is required from you in this PR thread. Once you have signed the CLA externally, a maintainer will verify your signature and record it here on your behalf by commenting:


I affirm that this contributor has signed the CLA


Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant