[Data] Add calendar and sub-second extractors to the datetime expression namespace #64392

Open
Jenson97 wants to merge 1 commit into ray-project:master from Jenson97:data-expr-dt-calendar

@Jenson97

Why

The .dt expression namespace currently exposes year/month/day/hour/minute/second plus strftime and the ceil/floor/round rounding helpers, but it's missing a number of common calendar and sub-second accessors that PyArrow already provides. This adds them so users don't have to drop into a map_batches UDF for everyday datetime feature extraction.

Part of #58674 (Ray Data Compute Expressions).

What

Adds the following methods to _DatetimeNamespace, each a 1:1 wrapper over the corresponding pyarrow.compute function (mirroring the existing extractors, so they also lower to native pyarrow.compute expressions for predicate pushdown):

  • Calendar: quarter, day_of_week (Monday=0), day_of_year, iso_week, iso_year, is_leap_year
  • Sub-second: millisecond, microsecond, nanosecond

ds = (
    ds.with_column("q", col("ts").dt.quarter())
    .with_column("dow", col("ts").dt.day_of_week())
    .with_column("leap", col("ts").dt.is_leap_year())
)

Tests

Adds test_datetime_namespace_calendar_extractors, exercising all nine accessors against:

  • a leap-day timestamp with sub-second precision (2024-02-29 13:45:30.123456), and
  • an ISO 8601 year-boundary case (2021-01-01 → ISO year 2020, week 53), which verifies iso_year/iso_week differ from the calendar year as expected.
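
The year-boundary case can be checked independently of Ray and PyArrow with the standard library, which follows the same ISO 8601 week rules. The sketch below is only a cross-check of the expected values; `iso_year_week` is a hypothetical helper, not part of the test file.

```python
import datetime


def iso_year_week(d: datetime.date) -> tuple:
    """Return (ISO year, ISO week) for a calendar date."""
    iso = d.isocalendar()
    return iso[0], iso[1]


boundary = datetime.date(2021, 1, 1)   # a Friday
leap_day = datetime.date(2024, 2, 29)  # a Thursday

boundary_iso = iso_year_week(boundary)
leap_day_iso = iso_year_week(leap_day)

# date.weekday() also numbers Monday as 0, matching day_of_week (Monday=0).
leap_day_dow = leap_day.weekday()
```

January 1, 2021 falls in the last ISO week of 2020 because ISO week 1 is the week containing the year's first Thursday, so `boundary_iso` differs from the calendar year while `leap_day_iso` does not.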

Checks

  • pytest python/ray/data/tests/expressions/test_namespace_datetime.py
  • black / ruff clean

@Jenson97 Jenson97 requested a review from a team as a code owner June 27, 2026 08:12

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds several new datetime extractors to the dt namespace, including sub-second extractors (millisecond, microsecond, nanosecond) and calendar extractors (quarter, day_of_week, day_of_year, iso_week, iso_year, is_leap_year), along with corresponding unit tests. The review feedback suggests using pd.Timestamp instead of datetime.datetime in the tests to properly test the nanosecond extractor with a non-zero value.

ds = ray.data.from_items(
    [
        # Leap day, a Thursday, with sub-second precision.
        {"ts": datetime.datetime(2024, 2, 29, 13, 45, 30, 123456)},

Contributor

medium

Using datetime.datetime limits the precision to microseconds, which means the nanosecond extractor is only tested with a value of 0. Using pd.Timestamp with nanosecond precision allows us to fully exercise and verify the nanosecond extractor with a non-zero value.

Suggested change
- {"ts": datetime.datetime(2024, 2, 29, 13, 45, 30, 123456)},
+ {"ts": pd.Timestamp("2024-02-29 13:45:30.123456789")},
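
The precision limit behind this suggestion can be seen with the standard library alone. The following sketch assumes nothing about Ray or PyArrow; `nanosecond_component` and `EPOCH` are hypothetical names used only to show that a `datetime.datetime` value has no digits below the microsecond.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def nanosecond_component(dt: datetime.datetime) -> int:
    """Nanoseconds within the current microsecond for a stdlib datetime."""
    # Integer microseconds since the epoch; this is all a datetime can hold.
    micros = (dt - EPOCH) // datetime.timedelta(microseconds=1)
    # Converting to nanoseconds only appends three zeros.
    return (micros * 1000) % 1000


leap_day = datetime.datetime(2024, 2, 29, 13, 45, 30, 123456)

# Smallest difference between two distinct datetime values.
smallest_step = datetime.datetime.resolution
```

Because `smallest_step` is one microsecond, `nanosecond_component` returns 0 for every input, so a test built from `datetime.datetime` values cannot distinguish a correct nanosecond extractor from one that always returns 0.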

"is_leap_year": True,
"millisecond": 123,
"microsecond": 456,
"nanosecond": 0,

Contributor

medium

Update the expected nanosecond value to 789 to match the nanosecond precision of the updated input timestamp.

Suggested change
- "nanosecond": 0,
+ "nanosecond": 789,

@ray-gardener ray-gardener Bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Jun 27, 2026
…ion namespace

Extends the `.dt` expression namespace (ray-project#58674) with the calendar and
sub-second accessors that PyArrow already exposes but Ray Data did not:

- quarter, day_of_week (Monday=0), day_of_year
- iso_week, iso_year (ISO 8601 calendar)
- is_leap_year
- millisecond, microsecond, nanosecond

Each is a 1:1 wrapper over the corresponding pyarrow.compute function,
mirroring the existing year/month/day extractors, so they also lower to
native pyarrow.compute expressions for predicate pushdown.

Adds an integration test covering a leap-day timestamp with sub-second
precision and an ISO 8601 year-boundary case (2021-01-01 -> ISO year
2020, week 53).

Signed-off-by: Jenson97 <jenson.strive@gmail.com>
@Jenson97 Jenson97 force-pushed the data-expr-dt-calendar branch from 3e20dd1 to 441fcdb June 28, 2026 05:12
@Jenson97

Author

Good catch @gemini-code-assist: datetime.datetime caps at microsecond precision, so the nanosecond extractor was only exercised with 0. Switched the leap-day input to pd.Timestamp("2024-02-29 13:45:30.123456789") and updated the expected nanosecond to 789 (pushed in 441fcdbd), so millisecond/microsecond/nanosecond now all assert non-zero values.
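
The expected values 123, 456 and 789 follow from reading the fractional second as three-digit groups: each extractor reports the count within the next larger unit, not the total since the start of the second. A minimal sketch of that arithmetic, assuming this convention (it matches the asserted values in the test); `split_subsecond` is a hypothetical helper:

```python
def split_subsecond(fraction_ns: int) -> tuple:
    """Split a fractional second, given in nanoseconds, into
    (millisecond, microsecond, nanosecond) components, each in 0..999."""
    if not 0 <= fraction_ns < 1_000_000_000:
        raise ValueError("fraction_ns must be within one second")
    millisecond = fraction_ns // 1_000_000
    microsecond = fraction_ns // 1_000 % 1_000
    nanosecond = fraction_ns % 1_000
    return millisecond, microsecond, nanosecond


# .123456789 s, the fraction carried by the pd.Timestamp input.
with_nanos = split_subsecond(123_456_789)

# .123456 s, the fraction carried by the original datetime.datetime input.
without_nanos = split_subsecond(123_456_000)
```

Only the third component differs between the two inputs, which is why the change touches the expected nanosecond value and leaves millisecond and microsecond as they were.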

@gemini-code-assist

Contributor

Using pd.Timestamp with nanosecond precision correctly ensures that the pyarrow compute functions are exercised for sub-second components, as datetime.datetime is limited to microsecond precision. This change effectively validates the precision handling for the new extractors.

Labels

community-contribution Contributed by the community data Ray Data-related issues
