[Data] Add calendar and sub-second extractors to the datetime expression namespace#64392
Jenson97 wants to merge 1 commit into
Code Review
This pull request adds several new datetime extractors to the dt namespace, including sub-second extractors (millisecond, microsecond, nanosecond) and calendar extractors (quarter, day_of_week, day_of_year, iso_week, iso_year, is_leap_year), along with corresponding unit tests. The review feedback suggests using pd.Timestamp instead of datetime.datetime in the tests to properly test the nanosecond extractor with a non-zero value.
ds = ray.data.from_items(
    [
        # Leap day, a Thursday, with sub-second precision.
        {"ts": datetime.datetime(2024, 2, 29, 13, 45, 30, 123456)},
Using datetime.datetime limits the precision to microseconds, which means the nanosecond extractor is only tested with a value of 0. Using pd.Timestamp with nanosecond precision allows us to fully exercise and verify the nanosecond extractor with a non-zero value.
-         {"ts": datetime.datetime(2024, 2, 29, 13, 45, 30, 123456)},
+         {"ts": pd.Timestamp("2024-02-29 13:45:30.123456789")},
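The precision point can be checked without Ray or pandas. The sketch below uses only the standard library; `split_subsecond` is a hypothetical helper written for illustration, not part of the PR. It assumes each sub-second field is the remainder within the next-larger unit, which matches the expected values in the test (123 / 456 / 0).

```python
import datetime
from typing import Tuple

# datetime.datetime stores at most microseconds, so a nanosecond field
# derived from it can only ever be 0.
assert datetime.datetime.resolution == datetime.timedelta(microseconds=1)


def split_subsecond(fraction_ns: int) -> Tuple[int, int, int]:
    """Split a sub-second fraction given in nanoseconds into (ms, us, ns).

    Hypothetical helper: each part is the remainder within the
    next-larger unit, e.g. 123_456_789 ns -> (123, 456, 789).
    """
    millisecond, rest = divmod(fraction_ns, 1_000_000)
    microsecond, nanosecond = divmod(rest, 1_000)
    return millisecond, microsecond, nanosecond


# Fraction carried by datetime.datetime(2024, 2, 29, 13, 45, 30, 123456):
from_datetime = split_subsecond(123456 * 1_000)
# Fraction carried by pd.Timestamp("2024-02-29 13:45:30.123456789"):
from_timestamp = split_subsecond(123_456_789)

print(from_datetime)   # (123, 456, 0)
print(from_timestamp)  # (123, 456, 789)
```

With the microsecond-precision input, the last field is always 0, so a test built on it cannot tell a working nanosecond extractor from one that returns a constant.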
"is_leap_year": True,
"millisecond": 123,
"microsecond": 456,
"nanosecond": 0,
…ion namespace

Extends the `.dt` expression namespace (ray-project#58674) with the calendar and sub-second accessors that PyArrow already exposes but Ray Data did not:

- quarter, day_of_week (Monday=0), day_of_year
- iso_week, iso_year (ISO 8601 calendar)
- is_leap_year
- millisecond, microsecond, nanosecond

Each is a 1:1 wrapper over the corresponding pyarrow.compute function, mirroring the existing year/month/day extractors, so they also lower to native pyarrow.compute expressions for predicate pushdown.

Adds an integration test covering a leap-day timestamp with sub-second precision and an ISO 8601 year-boundary case (2021-01-01 -> ISO year 2020, week 53).

Signed-off-by: Jenson97 <jenson.strive@gmail.com>
3e20dd1 to 441fcdb
Good catch @gemini-code-assist — |
|
Using |
Why

The `.dt` expression namespace currently exposes `year`/`month`/`day`/`hour`/`minute`/`second` plus `strftime` and the `ceil`/`floor`/`round` rounding helpers, but it's missing a number of common calendar and sub-second accessors that PyArrow already provides. This adds them so users don't have to drop into a `map_batches` UDF for everyday datetime feature extraction.

Part of #58674 (Ray Data Compute Expressions).

What

Adds the following methods to `_DatetimeNamespace`, each a 1:1 wrapper over the corresponding `pyarrow.compute` function (mirroring the existing extractors, so they also lower to native `pyarrow.compute` expressions for predicate pushdown):

- `quarter`, `day_of_week` (Monday=0), `day_of_year`, `iso_week`, `iso_year`, `is_leap_year`
- `millisecond`, `microsecond`, `nanosecond`

Tests

Adds `test_datetime_namespace_calendar_extractors`, exercising all nine accessors against:

- a leap-day timestamp with sub-second precision (`2024-02-29 13:45:30.123456`), and
- an ISO 8601 year-boundary case (`2021-01-01` → ISO year 2020, week 53), which verifies `iso_year`/`iso_week` differ from the calendar year as expected.

Checks

- `pytest python/ray/data/tests/expressions/test_namespace_datetime.py`
- `black`/`ruff` clean
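The expected calendar values for the two test timestamps can be cross-checked independently with the standard library. This is only a sanity-check sketch, not the test from the PR; `calendar_fields` is a hypothetical helper, and the quarter formula is an assumption based on the usual three-month split.

```python
import calendar
import datetime


def calendar_fields(d: datetime.date) -> dict:
    """Compute the calendar fields for a date using only the stdlib."""
    iso = d.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return {
        "quarter": (d.month - 1) // 3 + 1,  # assumed: Jan-Mar = 1, ..., Oct-Dec = 4
        "day_of_week": d.weekday(),  # Monday=0, same convention as the PR
        "day_of_year": d.timetuple().tm_yday,
        "iso_year": iso[0],
        "iso_week": iso[1],
        "is_leap_year": calendar.isleap(d.year),
    }


leap_day = calendar_fields(datetime.date(2024, 2, 29))
year_boundary = calendar_fields(datetime.date(2021, 1, 1))

print(leap_day)
print(year_boundary)
```

For 2021-01-01 the ISO year is 2020 and the ISO week is 53, while the calendar year is 2021, which is the divergence the year-boundary case is meant to catch.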