Detect orphaned multi-parcel (bulk) deeds in sales scrutiny#351
Open
drussellmrichie wants to merge 1 commit into
Open
Detect orphaned multi-parcel (bulk) deeds in sales scrutiny#351drussellmrichie wants to merge 1 commit into
drussellmrichie wants to merge 1 commit into
Conversation
A bulk deed (one `key_sale` recorded against many parcels) carries a single consideration covering the whole bundle, which gets stamped onto each parcel, so the per-parcel sale price is not a usable arm's-length signal. The existing duplicate-based heuristics (repeated deed+date, repeated date+price) only catch bulk deeds while their sibling rows are still present in the sales table. But `process_data` de-duplicates sales to one row per parcel and filters out invalid sales, so a deed can be reduced to a single "orphan" row whose siblings are gone -- it still carries the inflated bundle price, but there is no duplicate left for those heuristics to match against. In a real county dataset this leaked ~370 bulk-deed sales into the vacant-land training set (it inflated the vacant median price-per-sqft ~4x) that scrutiny silently passed. Fix: - `process_data`: compute deed->parcel multiplicity (`sale_parcel_count`) right after the sales merge, before the valid-sale filter and de-dup, so the signal survives the thinning. - `flag_bulk_deeds()` helper + a `flag_bulk_deed` heuristic in `run_heuristics`: prefers `sale_parcel_count` (catches orphans), falls back to counting parcels per deed in the current table (catches bulk deeds whose rows survive). - Gated on column presence -> no-op for datasets without `key`/`key_sale`. Honors the existing `drop` flag (flag-only when `drop=False`). - Unit tests for the helper (precomputed, fallback, NA, and no-op paths). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
|
Thank you for your contribution. I affirm that this contributor has signed the CLA Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
A bulk deed — one
key_salerecorded against many parcels — carries a single consideration covering the whole bundle, which gets stamped onto each parcel. The per-parcel sale price is therefore not a usable arm's-length signal, and these sales badly distort land/value modeling.run_heuristicsalready tries to catch these with two duplicate-based heuristics: repeated deed + date (flag_dupe_deed_date) and repeated date + price (flag_dupe_date_price). Those work only while the deed's sibling rows are still in the sales table.But
process_data(1) filters tovalid_sale == Trueand (2) de-duplicates sales to one row per parcel. After that thinning, a multi-parcel deed can be reduced to a single orphan row — the other parcels' rows are gone (filtered out, or displaced by a later sale on those parcels). The orphan still carries the inflated bundle price, but there's no longer any duplicate for the date/price or deed/date heuristics to match against, so it passes scrutiny silently.Real-world impact
On a full county dataset (Philadelphia, ~770k RTT records), this leaked ~370 bulk-deed sales into the vacant-land training set that scrutiny passed. Every surviving deed had identical date+price across all its raw parcels, but only one row remained in
sup.sales. The contamination inflated the vacant-land median price-per-sqft from ~$39 to ~$172 (≈4×). Dropping these (plus a vacant $/sqft sanity cap, not part of this PR) cut the vacant-land ratio-study COD from 52.1 → 44.2 (trimmed).Fix
process_data— compute deed→parcel multiplicity (sale_parcel_count) right after the sales merge, before the valid-sale filter and de-dup, so the signal survives the thinning.flag_bulk_deeds()helper +flag_bulk_deedheuristic inrun_heuristics— preferssale_parcel_count(catches orphans); falls back to counting distinct parcels per deed in the current table (catches bulk deeds whose duplicate rows survive). Writes the usualout/sales_scrutiny/multi_parcel_bulk_deeds.xlsxreport.key/key_sale. Honors the existingdropflag (flag-only whendrop=False).Behavior change
With
drop=True(the default), orphaned bulk deeds that previously survived will now be dropped — this is the intended correctness improvement. Non-orphan bulk deeds were already dropped byflag_dupe_date_price, so for most datasets the incremental change is the small orphan set.drop=Falsepreserves flag-only behavior.Tests
tests/test_sales_scrutiny.pycovers the newflag_bulk_deedshelper: precomputed-count (orphan) path, fallback path, precomputed-takes-precedence, NA handling, and the no-usable-columns no-op.(Note: I couldn't run the full
tests/test_data.pylocally — master'sopenavmkit.modelingimportsngboost, which wasn't in my environment. The new tests pass and the touched modules compile.)