Fix issue #28#32
Conversation
…ch#28) `get_flights()` previously returned `time_hour` with airport-local wall clocks implicitly tagged as UTC, while `get_weather()` returned true UTC instants tagged as GMT. The two columns therefore disagreed for any non-UTC origin and the canonical `(origin, time_hour)` join produced no rows. After this change both functions return `time_hour` in the airport's IANA `tzone` (looked up from `get_airports()`), matching `nycflights13`'s convention: flights are forced into the local zone (wall clock preserved, instant corrected) and weather is converted into the local zone (instant preserved, wall clock + year/month/day/hour recomputed). New unit tests in `test-8-time-hour-tz.R` exercise both modes and lock in the cross-table join invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tag `time_hour` with each origin airport's local time zone (simonpcouch#28)
simonpcouch
left a comment
There was a problem hiding this comment.
Ughhh time zones! This is gnarly, thanks for giving this a go.
The bug is real, but I’m not sure time_hour should be made “origin-local” here. nyclights13 is able to do this because each of the origin airports share a time zone, but while a mixed-origin POSIXct column can store many instants it only has one timezone attribute for the whole column. So for get_flights(c("PDX", "DEN", "JFK"), ...), the column cannot actually be simultaneously Pacific, Mountain, and Eastern time.
That means this PR can preserve the right underlying instants for joins, but the printed clock times will be misleading for some origins. Say each airport has a 9am local departure:
PDX 9am PST = 17:00 UTC
DEN 9am MST = 16:00 UTC
JFK 9am EST = 14:00 UTCIf the combined column ends up carrying "America/Denver", those print as:
PDX 10:00
DEN 09:00
JFK 07:00So the join works now, which is great, but the docs saying time_hour is "in the airport's local time zone" aren't really true when there's more than one timezone in play.
I think the cleaner contract is just: time_hour is the correct instant, in UTC. Flights would convert their local wall-clock times to proper UTC instants, and weather would keep its UTC observation instants as-is. The join key stays honest regardless of how many timezones are involved:
dplyr::inner_join(flights, weather, by = c("origin", "time_hour"))(There's also a boundary bug with weather: the API is queried in UTC and months are filtered in UTC, but this PR then converts to local time afterward. For a western timezone like PDX, that means a January query can pick up rows that are locally December 31 (early UTC January hours) and miss the local evening of January 31 (those hours are UTC February and were never fetched). That'd need a padded UTC query range, then filtering on local-time components after conversion. This shouldn't matter if we switch back to UTC.)
| hour = unname(utc_hour), | ||
| time_hour = ISOdatetime(2023, 1, 1, unname(utc_hour), 0, 0, tz = "GMT") | ||
| ) %>% | ||
| anyflights:::adjust_time_hour_tz(airports, action = "convert") |
There was a problem hiding this comment.
| anyflights:::adjust_time_hour_tz(airports, action = "convert") | |
| adjust_time_hour_tz(airports, action = "convert") |
testthat should make this function available in this namespace, I think
|
|
||
| test_that("force mode preserves wall clock and corrects the instant per origin", { | ||
| # Simulates raw flights data: airport-local wall clock implicitly tagged UTC | ||
| flights <- tibble::tibble( |
There was a problem hiding this comment.
I'm seeing:
❯ checking for unstated dependencies in ‘tests’ ... WARNING
'::' or ':::' import not declared from: ‘tibble’
|
|
||
| * `time_hour` in both `get_flights()` and `get_weather()` is now tagged with | ||
| each origin airport's local IANA time zone (looked up from | ||
| `get_airports()$tzone`), matching the convention used by `nycflights13`. |
There was a problem hiding this comment.
| `get_airports()$tzone`), matching the convention used by `nycflights13`. | |
| `get_airports()$tzone`), matching the convention used by nycflights13. |
Tidy style (which, admittedly, might not be uniformly applied across this repo): no formatting for package names
| } else { | ||
| dplyr::mutate( | ||
| .x, | ||
| time_hour = lubridate::with_tz(time_hour, tzone = tz), |
There was a problem hiding this comment.
Item from an AI agent that I find reasonable:
- R/utils.R:412 / R/utils.R:480 / R/utils.R:224: weather is still queried
and filtered on UTC month boundaries before time_hour is converted to
local time. For western time zones, get_weather("PDX", 2023, 1) can return
local 2022-12-31 rows and miss the final local evening of 2023-01-31.ISOdatetime(2023, 1, 1, 0, 0, 0, tz = "GMT") |> lubridate::with_tz("America/Los_Angeles") # 2022-12-31 16:00:00 PSTI’d either query a padded UTC window and filter after recomputing local year/
month/day/hour, or query ASOS in the station-local timezone if that gives the
desired invariant.
| dplyr::select(airports_data, faa, tzone), | ||
| by = c("origin" = "faa") | ||
| ) %>% | ||
| dplyr::group_by(tzone) %>% |
There was a problem hiding this comment.
Another item that I find reasonable:
- R/utils.R:215: grouping by tzone before group_modify() reorders rows by
timezone group. This means get_flights() no longer preserves the explicit
arrange(year, month, day, dep_time) done at R/get_flights.R:125. It also
means mixed-origin output does not actually have a per-row timezone; a
POSIXct vector has one tzone attribute, so later-origin rows print in
whatever timezone survives the bind.origin x time_hour DEN 3 2023-01-01 09:00:00 PDX 1 2023-01-01 10:00:00 JFK 2 2023-01-01 07:00:00The instants may be right, but the documented “airport’s local time zone”
display is not true for multi-timezone outputs.
A POSIXct column carries one `tzone` attribute, so the previous "tag each row with its origin's local tz" approach couldn't actually hold for mixed-origin queries -- `bind_rows` collapses the per-group tzs and the printed display silently lies for the losing rows. Reframe the contract: `get_flights()$time_hour` is the correct UTC instant (BTS wall clocks reinterpreted via each origin's IANA tz), `get_weather()$time_hour` already publishes UTC instants, and the `(origin, time_hour)` join works honestly across any combination of time zones. Also removes the weather-side UTC-boundary bug (querying & filtering on UTC months but then converting to local time pulled in adjacent local months), since weather no longer gets a tz conversion. Tests cover DST crossing within a single airport, round-trip back to local wall clocks, multi-tz row ordering, and empty input. Drops `tibble::` usage in tests (undeclared dependency WARNING) in favour of `dplyr::tibble`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch `time_hour` contract to UTC instants
|
Thanks for the review! That makes a lot more sense, and I'm glad you caught the bugginess of trying to get everything on local. I made some updates to hopefully clear up these issues to be more in alignment with what your review added. It does look like some lingering bugs still with GA though. |
It only took me a year! 😅
get_flights()previously returnedtime_hourwith airport-local wall clocks implicitly tagged as UTC, whileget_weather()returned true UTC instants tagged as GMT. The two columns therefore disagreed for any non-UTC origin and the canonical(origin, time_hour)join produced no rows. After this change both functions returntime_hourin the airport's IANAtzone(looked up fromget_airports()), matchingnycflights13's convention: flights are forced into the local zone (wall clock preserved, instant corrected) and weather is converted into the local zone (instant preserved, wall clock + year/month/day/hour recomputed). New unit tests intest-8-time-hour-tz.Rexercise both modes and lock in the cross-table join invariant.