morton index surface followup: points= encode + __from_arrow__ #86

Draft
espg wants to merge 3 commits into main from claude/79-morton-index-surface

@espg espg commented Jun 27, 2026

Closes #79

Surfaces the two morton_index skin gaps flagged in issue #79 (followups from #73 / #51): the Kind::Point encode path and the Arrow -> pandas __from_arrow__ hook. One PR, two phases.

Phase (a) - thread points=True -> Kind::Point

The pandas/pyarrow skins only ever produced area cells; the kernel's point encode (encode_point, decimal_morton.rs:206) had no public surface. This adds the full bridge:

  • Rust kernel (src_rust/src/decimal_morton.rs): new from_nested_point(nested) - the point twin of from_nested. It decomposes an order-29 nested id into its 29 stored 0..=3 tuples and calls encode_point, so the result decodes with kind == Kind::Point, order == 29. It guards an oversized nested index (base > 11) the same way from_nested does. Unit tests: from_nested_point_matches_encode_point (bit-identical to encode_point, Kind::Point, order 29, shares the nested cell with the area word) and from_nested_point_rejects_oversized_nested.
  • PyO3 binding (src_rust/src/lib.rs): a dedicated single-purpose rust_mi_from_nested_point(nested_array), mirroring rust_mi_from_nested (no point: bool flag), registered alongside it.
  • Python (mortie/morton_index.py): MortonIndexArray.from_latlon gains points=False. When points=True, lat/lon is hashed at order 29 and routed through rust_mi_from_nested_point; an explicit order != 29 passed with points=True raises ValueError. The default order is already MAX_ORDER (29), so the point path needs no extra argument.
  • Tests (mortie/tests/test_morton_index.py, TestPointEncode): points=True words decode as Kind::Point/order 29; points share a nested cell with the matching area word; coarsening a point to a lower order yields an area cell; points=False (and the default) stay Kind::Area; order != 29 + points=True raises.
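
The tuple decomposition described for from_nested_point can be sketched in Python as below. This is only an illustration under the standard HEALPix nested layout (base cell in the top bits, one 2-bit pair per order beneath it); `decompose_nested` and `recompose_nested` are hypothetical names, not functions in mortie, and the real kernel is Rust and feeds the tuples to encode_point rather than returning them.

```python
MAX_ORDER = 29      # maximum order, as stated in the PR description
N_BASE_CELLS = 12   # HEALPix has 12 base cells, hence the base > 11 guard


def decompose_nested(nested: int) -> tuple[int, list[int]]:
    """Split an order-29 nested id into (base_cell, 29 tuples in 0..=3).

    Hypothetical helper: tuples are ordered coarsest level first, so the
    order-28 and order-29 pairs land in tuples[27] and tuples[28].
    """
    if nested < 0:
        raise ValueError("nested id must be non-negative")
    base = nested >> (2 * MAX_ORDER)
    if base >= N_BASE_CELLS:
        raise ValueError(f"nested id {nested} has base cell {base} > 11")
    tuples = [
        (nested >> (2 * (MAX_ORDER - level))) & 3
        for level in range(1, MAX_ORDER + 1)
    ]
    return base, tuples


def recompose_nested(base: int, tuples: list[int]) -> int:
    """Inverse of decompose_nested, used here only to check the round trip."""
    nested = base
    for pair in tuples:
        nested = (nested << 2) | pair
    return nested
```

The round trip through `recompose_nested` is the Python analogue of the "shares the nested cell with the area word" check in the kernel tests; the sketch raises ValueError where the Rust kernel panics.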

Answer to the encode_point order-arg question

No, encode_point does not take an order argument. Its signature is pub fn encode_point(base_cell: u8, tuples: &[u8]) -> u64 (decimal_morton.rs:206). Point encoding is order-29-only by construction: it asserts tuples.len() >= MAX_ORDER (29) and hardcodes MAX_ORDER into both the body pack and the point suffix (pack_prefix_body(base_cell, tuples, MAX_ORDER) | build_point_suffix(tuples[27], tuples[28])). A point is the max-resolution "this exact location, no area claim" encoding, so there is no coarser-order point - that is why no order parameter exists. (By contrast the area encode/from_nested do take an order/depth, because area cells exist at every order 0..=29.) The Python side mirrors this: points=True forces order 29 and rejects any explicit order != 29, so a caller cannot ask for a non-existent "order-N point".

Phase (b) - wire __from_arrow__

Before: table.to_pandas() on a morton_index Arrow column produced a plain int64 Series (the dtype had no __from_arrow__ hook). Now:

  • Python (mortie/morton_index.py): MortonIndexDtype.__from_arrow__(self, array) accepts a pa.Array or pa.ChunkedArray, delegates to mortie.arrow.to_morton_index, iterates chunks and concatenates via MortonIndexArray._concat_same_type for the chunked case, and returns a MortonIndexArray. The pyarrow import stays lazy (reuses arrow._require_pyarrow), so morton_index.py remains numpy-only importable.
  • Tests (mortie/tests/test_arrow.py, TestFromArrowHook): table.to_pandas() yields a morton_index-dtyped Series backed by a MortonIndexArray with bit-identical words; __from_arrow__ handles both plain and chunked arrays; full Parquet -> read_table -> to_pandas() preserves words + dtype.
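
The control flow of the hook (convert each chunk, concatenate in order, fall through for a plain array) can be sketched without pyarrow installed. Everything below is a stand-in: `PlainArray` and `ChunkedArray` imitate pa.Array and pa.ChunkedArray, `to_words` imitates mortie.arrow.to_morton_index, and a Python list stands in for MortonIndexArray. It is a minimal sketch of the dispatch, not the code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class PlainArray:
    """Stand-in for pa.Array: one contiguous run of 64-bit words."""
    words: list[int]


@dataclass
class ChunkedArray:
    """Stand-in for pa.ChunkedArray: zero or more arrays in order."""
    chunks: list[PlainArray] = field(default_factory=list)


def to_words(array: PlainArray) -> list[int]:
    """Stand-in for the per-array conversion (to_morton_index in the PR)."""
    return list(array.words)


def from_arrow(array) -> list[int]:
    """Sketch of the __from_arrow__ dispatch described above."""
    if isinstance(array, ChunkedArray):
        parts = [to_words(chunk) for chunk in array.chunks]
        if not parts:
            return []  # zero chunks: return an empty result, not an error
        out: list[int] = []
        for part in parts:  # concatenate in chunk order
            out.extend(part)
        return out
    return to_words(array)
```

Handling the zero-chunk case separately matters in the real hook because concatenating an empty list of arrays has no well-defined dtype; here it simply returns an empty list.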

Phases checklist

  • Phase (a): from_nested_point kernel + rust_mi_from_nested_point binding + from_latlon(points=...) + tests
  • Phase (b): MortonIndexDtype.__from_arrow__ (Array + ChunkedArray) + tests

How it was tested

All run in an isolated venv on this branch:

  • cargo fmt (applied) + cargo test -> 175 passed (includes the 2 new kernel tests)
  • cargo clippy -> only pre-existing warnings (the useless_vec warnings in coverage/tests.rs and the useless_conversion on PyResult<PyObject> shared by all 31 existing #[pyfunction] bindings); the new binding matches that established pattern, so no new warning is introduced.
  • maturin develop --release -> builds clean
  • flake8 mortie --select=E9,F63,F7,F82 -> clean; non-blocking --max-line-length=88 pass clean on touched files
  • pytest -v -> 450 passed, 8 skipped (31 in test_morton_index.py, 15 in test_arrow.py)

Questions for review

  • Pre-existing clippy warnings, untouched. cargo clippy reports 31 useless_conversion warnings on PyResult<PyObject> (every existing binding) and useless_vec warnings in coverage/tests.rs. None are introduced by this change; the new rust_mi_from_nested_point deliberately mirrors the existing rust_mi_from_nested style rather than diverging. Left as-is per the "don't fix unrelated CI noise / match the surrounding code" conventions - flag if you'd rather I clean up the binding pattern repo-wide in a separate PR.
  • __from_arrow__ reuses arrow._require_pyarrow (a leading-underscore helper) to keep the lazy-import contract in one place. If you'd prefer a public accessor instead of importing the private name, say so.

Generated by Claude Code

@espg espg added the implement label Jun 27, 2026

codecov Bot commented Jun 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.77%. Comparing base (8482da6) to head (91b73a7).

@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
+ Coverage   93.61%   93.77%   +0.15%     
==========================================
  Files          25       25              
  Lines        3728     3822      +94     
==========================================
+ Hits         3490     3584      +94     
  Misses        238      238              
Flag Coverage Δ
unittests 93.77% <100.00%> (+0.15%) ⬆️

Files with missing lines Coverage Δ
mortie/morton_index.py 90.82% <100.00%> (+0.71%) ⬆️
mortie/tests/test_arrow.py 100.00% <100.00%> (ø)
mortie/tests/test_morton_index.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@espg espg left a comment

🤖 from Claude (review)

Phase 1 review (commit 3ab4bea, issue #79 phase (a)). The change is correct and meets the acceptance criteria: from_latlon(..., points=True) produces Kind::Point words at order 29, and points=False (default) is unchanged Kind::Area.

Verified:

  • from_nested_point tuple decomposition is bit-faithful to encode_point (orders 28/29 land in tuples[27]/tuples[28]); pair extraction mirrors from_nested's body loop; the base>11 guard mirrors from_nested.
  • The PyO3 binding rust_mi_from_nested_point faithfully mirrors rust_mi_from_nested: allow_threads, catch_unwind, par_iter, and the panic_msg/PyValueError error path all match.
  • Kernel tests assert encode_point equality, Kind::Point, order==29, nested round-trip, and the oversized-nested panic. Python tests assert order 29 + kind Point, area-unchanged default, coarsen→area, and the non-29 reject.
  • No flake8 style-pass (88-col) violations in the Python diff; no over-long Rust lines.

Findings (all minor / non-blocking, details inline):
(1) from_nested_point silently masks order-≥30 bits via & 3 rather than rejecting them — consistent with from_nested, flagged only to make it a conscious choice.
(2) The ValueError guard accepts an explicit order=29, points=True (indistinguishable from default); a non-integer order raises TypeError from int() rather than the guard's ValueError. Both look intended.
(3) Test nice-to-haves: assert point words differ from the order-29 area words for the same location, add the accept-explicit-29 path, and confirm the >180 longitude fixture matches rust_ang2pix's expected range.

Nothing here blocks advancing to the next phase. Note decimal_morton.rs is already ~1568 lines (pre-existing, not from this diff) — standing observation for @espg per §4, not a finding against this PR.


Generated by Claude Code

Comment thread mortie/morton_index.py
together with ``points=True`` is a ``ValueError`` (the default
``order`` is ``MAX_ORDER`` so the point path needs no extra argument).
"""
if points and int(order) != MAX_ORDER:

🤖 from Claude (review)

The guard is points and int(order) != MAX_ORDER. Worth being explicit that an explicit order=29, points=True is accepted silently (it can't be distinguished from the default, since both resolve to int(order)==29). The docstring already says "pass order=29 (the default) or omit it", so this is intended — just confirming the edge case is by design and the test test_points_true_with_non29_order_raises covers only the reject path (order=12), not the accept-explicit-29 path. A one-line test that from_latlon(..., order=29, points=True) succeeds and yields points would lock that contract in.

Also note int(order) is evaluated before the lat/lon coercion, so an order that int() cannot convert at all (for example None) with points=True raises a TypeError from int() rather than the guard's ValueError. That is fine, but it is a different exception type than the explicit guard suggests.
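
To make the edge cases concrete, here is a standalone sketch of the guard. `check_point_order` is a hypothetical function written for this comment and its error message is invented; only the condition `points and int(order) != MAX_ORDER` is taken from the diff, and the real from_latlon may validate more than this.

```python
MAX_ORDER = 29


def check_point_order(order=MAX_ORDER, points=False) -> int:
    """Hypothetical standalone copy of the guard quoted in this thread."""
    if points and int(order) != MAX_ORDER:
        # Message text is illustrative, not the one in mortie.
        raise ValueError(
            f"points=True requires order={MAX_ORDER}, got {order!r}"
        )
    return int(order)
```

With plain Python `int()` semantics: `check_point_order(points=True)` and `check_point_order(29, points=True)` both return 29; `check_point_order(12, points=True)` raises the guard's ValueError; `check_point_order(None, points=True)` raises TypeError from `int()`; a non-numeric string such as `"abc"` raises ValueError from `int()` itself; and a float such as 29.9 is truncated to 29 and accepted.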


Generated by Claude Code

_, orders, kinds, _ = _rustie.rust_mi_decode(a._data)
np.testing.assert_array_equal(orders, np.full(len(self.LATS), 29))
np.testing.assert_array_equal(kinds, np.ones(len(self.LATS), dtype=np.uint8))

🤖 from Claude (review)

Coverage is solid — test_points_true_yields_order29_points asserts both order==29 and kind==1 (Point), test_points_false_default_is_area pins the area default, and test_points_coarsen_and_to_nested confirms coarsen→area. The kinds==1 assertion is what guards against a regression that silently routed points=True back through the area path.

Two small additions worth considering (non-blocking): (1) a direct assertion that points=True words differ from the order=29 area words for the same location (the kind bit makes them distinct; asserting pts._data != area._data documents that intent next to test_points_share_nested_cell_with_area, which only checks the shared nested cell). (2) The LONS fixture uses values >180 (200.0, 305.0, 270.2) — confirm that's the intended 0–360 convention rust_ang2pix expects, since the area tests elsewhere may use a different range.
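
For reference, the two longitude conventions in question differ only by a wrap, which can be checked with a couple of hypothetical helpers. `wrap_360` and `wrap_180` are not part of mortie, and whether rust_ang2pix wraps its input internally is not something this sketch establishes.

```python
def wrap_360(lon_deg: float) -> float:
    """Map any longitude in degrees onto the 0-360 convention, [0, 360)."""
    return lon_deg % 360.0


def wrap_180(lon_deg: float) -> float:
    """Map any longitude in degrees onto the signed convention, [-180, 180)."""
    return (lon_deg + 180.0) % 360.0 - 180.0
```

Under these helpers the fixture value 200.0 corresponds to -160.0 and 305.0 to -55.0, so a test comparing the two conventions could feed both forms and assert that the resulting words match.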


Generated by Claude Code

Comment thread mortie/morton_index.py
            if not parts:
                return MortonIndexArray(np.empty(0, dtype=np.uint64))
            return MortonIndexArray._concat_same_type(parts)
        return to_morton_index(array)

🤖 from Claude (review)

Null/NA round-trip is silently unhandled here, and this is the one path where it could bite. to_morton_index does storage.to_numpy(zero_copy_only=False) on the chunk/array storage; if an Arrow morton_index column carries nulls (a real scenario once a MortonIndexArray with isna() sentinels is written to Parquet and read back), the uint64 to_numpy either raises or fills with an arbitrary value rather than mapping back to the all-zero _SENTINEL that isna() expects. Phase (b)'s stated acceptance is words+dtype preservation for non-null data, so this isn't a blocker — but the whole Arrow suite (test_arrow.py) never exercises a null, so the gap is invisible. Standing for @espg: is null/NA Arrow interop in scope for issue #79, or a follow-up? If in scope, a test writing a column with a pd.NA element through Parquet and asserting isna() survives would pin it.
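
One possible shape for the missing mapping, sketched in pure Python so it runs without pyarrow: take the storage values plus a validity mask and write the sentinel into every null slot. `nulls_to_sentinel` and `isna` here are hypothetical helpers, and the all-zero sentinel value is assumed from the description of _SENTINEL above rather than read from the code.

```python
SENTINEL = 0  # assumed: the all-zero word that isna() treats as missing


def nulls_to_sentinel(values, valid):
    """Replace every slot whose validity flag is False with SENTINEL.

    values: sequence of ints (or None in null slots)
    valid:  sequence of bools, True where the Arrow slot is non-null
    """
    if len(values) != len(valid):
        raise ValueError("values and valid must have the same length")
    return [v if ok else SENTINEL for v, ok in zip(values, valid)]


def isna(words):
    """Sketch of the sentinel-based missing-value check."""
    return [w == SENTINEL for w in words]
```

A round-trip test along these lines would write a column with one null through Parquet, read it back, and assert that `isna()` is True exactly at that position.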


Generated by Claude Code

Comment thread mortie/morton_index.py
"""
from .arrow import _require_pyarrow, to_morton_index

pa = _require_pyarrow()

🤖 from Claude (review)

Lazy-import discipline is correctly preserved: the from .arrow import _require_pyarrow, to_morton_index is method-local, the module top still imports only numpy and _rustie, so import mortie.morton_index stays numpy-only. Reusing the sibling module's _require_pyarrow helper matches the existing intra-package pattern (arrow.py and morton_index.py both gate optional deps through their own _require_*), so the private-name import is consistent with the codebase rather than a smell. One small note: to_morton_index already calls _require_pyarrow() internally, so the pa bound here is needed only for the isinstance(array, pa.ChunkedArray) discriminator — which is legitimate, just worth knowing it isn't doing double duty as the import guard for the conversion itself.


Generated by Claude Code

@espg espg left a comment

🤖 from Claude (review)

Phase 2 review (commit ddfda07, issue #79 phase (b): wire MortonIndexDtype.__from_arrow__ so table.to_pandas() on a morton_index column yields a MortonIndexArray). The change is correct and meets the stated acceptance criteria.

Verified:

  • pyarrow stays lazy — the from .arrow import _require_pyarrow, to_morton_index is method-local; module top still imports only numpy and _rustie, so import mortie.morton_index remains numpy-only importable. No module-level pyarrow import added.
  • __from_arrow__ is correctly a method of MortonIndexDtype (8-space indent, same level as construct_array_type).
  • ChunkedArray path iterates array.chunks in order, converts each via to_morton_index, and concatenates with MortonIndexArray._concat_same_type (which does np.concatenate in list order) — order and words preserved across chunks.
  • Empty ChunkedArray (zero chunks) returns an empty MortonIndexArray(np.empty(0, dtype=np.uint64)).
  • Plain pa.Array path returns to_morton_index(array), i.e. a MortonIndexArray.
  • Wiring is complete: arrow.py's MortonIndexType.to_pandas_dtype() returns MortonIndexDtype(), which is what routes to_pandas() into this hook.
  • Tests cover dtype-name + backing-array-type + bit-identical words (in-memory), the chunked order-preservation path, the plain-array path, and the full Parquet -> to_pandas() round-trip.
  • Line lengths within the 88-col style pass; no E9/F63/F7/F82 concerns; commit message is title-only per §3.

Findings (all minor / non-blocking, details inline):
(1) Null/NA Arrow round-trip is silently unhandled (to_morton_index -> storage.to_numpy(zero_copy_only=False) won't map Arrow nulls back to the all-zero _SENTINEL that isna() expects); the whole Arrow suite never exercises a null, so the gap is invisible. Phase (b)'s acceptance is non-null words+dtype, so not a blocker — standing for @espg whether null interop is in scope for issue #79 or a follow-up.
(2) Test nice-to-haves: assert isinstance(series.array, MortonIndexArray) directly in the Parquet round-trip test, and add a one-line empty-ChunkedArray test to lock that handled-but-untested edge.
(3) Note on the private _require_pyarrow import: consistent with the codebase's intra-package optional-dep pattern, and pa is genuinely needed for the isinstance(..., pa.ChunkedArray) discriminator (to_morton_index already guards pyarrow internally).

Nothing here blocks the phase. None of the findings imply a side-effecting action; items for @espg are flagged as standing, not acted on.


Generated by Claude Code

🤖 from Claude (review)

Test coverage for phase (b) is solid and meets the acceptance criteria: test_to_pandas_yields_morton_index_series asserts all three of dtype.name == "morton_index", isinstance(series.array, MortonIndexArray), and bit-identical words; test_from_arrow_handles_chunked_array exercises a 2-chunk ChunkedArray and confirms both order preservation across chunks and bit-identical words; test_from_arrow_handles_plain_array covers the pa.Array branch; and test_parquet_to_pandas_full_round_trip pins the Parquet -> to_pandas() path end to end.

Two non-blocking nits:
(1) test_parquet_to_pandas_full_round_trip asserts dtype.name and compares series.array._data, but unlike the in-memory test it doesn't assert isinstance(series.array, mortie.MortonIndexArray) directly (it's only implied by the ._data access). One explicit isinstance line would make the full-roundtrip path assert the backing-array type the same way the in-memory test does.
(2) The empty-ChunkedArray branch (pa.chunked_array([], type=morton_index) -> empty MortonIndexArray) is handled in the code but never tested. A one-line test would lock that edge in.


Generated by Claude Code

codspeed-hq Bot commented Jun 27, 2026

Merging this PR will improve performance by 13.69%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

⚡ 1 improved benchmark
✅ 66 untouched benchmarks
⏩ 1 skipped benchmark [1]

Performance Changes

Benchmark BASE HEAD Efficiency
coverage_triangle[4] 131.6 µs 115.8 µs +13.69%

Comparing claude/79-morton-index-surface (91b73a7) with main (8482da6)

Footnotes

  1. 1 benchmark was skipped, so the baseline result was used instead. If it was deleted from the codebase, click here and archive it to remove it from the performance reports.

espg commented Jun 27, 2026

🤖 from Claude

Folded the diff-scoped findings from the two adversarial self-reviews into commit 91b73a7 ("fold self-review on #79"):

  • Phase (a): added test_point_words_differ_from_area_words (asserts the packed point word != the order-29 area word for the same location) and test_points_true_explicit_order29_is_accepted (locks the accept-explicit-order=29 path next to the existing reject-order=12 test).
  • Phase (b): test_parquet_to_pandas_full_round_trip now asserts isinstance(series.array, MortonIndexArray) directly (matching the in-memory test), and added test_from_arrow_handles_empty_chunked_array to cover the empty-ChunkedArray branch.

Addressed-by-design (no code change), grounding the self-review notes:

  • The points and int(order) != MAX_ORDER guard accepting an explicit order=29 is intentional (the docstring says "pass order=29 (the default) or omit it"); the new accept-explicit-29 test pins it.
  • from_nested_point masking order-≥30 bits via & 3 mirrors from_nested exactly — kept consistent on purpose.
  • The LONS fixture values >180 (200.0, 305.0, 270.2) use the 0–360 convention rust_ang2pix expects; this is the same spread the existing TestVsCdshealpix.test_latlon_matches_cdshealpix_nested already uses, so it's consistent with the area path.

Standing for review (1): Null/NA Arrow interop. The phase-(b) self-review flagged that __from_arrow__ / to_morton_index don't round-trip a null/NA element back to the all-zero _SENTINEL that isna() expects, and the whole test_arrow.py suite never exercises a null. Issue #79 phase (b)'s stated acceptance is words+dtype preservation for non-null data, so I scoped this out. Options: (a) leave it for a dedicated follow-up issue (Arrow null <-> sentinel is its own contract spanning from_morton_index/to_morton_index too), or (b) fold a null round-trip (write a pd.NA-bearing column through Parquet, assert isna() survives) into this PR. I left it out pending your call — happy to do (b) here if you'd prefer it in scope.

All phases complete; local gates green (cargo test 175 passed, cargo clippy only pre-existing warnings, flake8 --select=E9,F63,F7,F82 clean, pytest 452 passed / 8 skipped). PR kept in draft.


Generated by Claude Code

@espg espg added the waiting label Jun 27, 2026
@espg espg mentioned this pull request Jun 27, 2026