Add Nuplan central token extraction and USDZ dataset support for Alpasim integration by WCJ-BERT · Pull Request #62 · NVlabs/trajdata

WCJ-BERT · 2026-03-19T12:53:35Z

Overview

This PR enables trajdata as a unified data source for Alpasim by adding:

Nuplan central token extraction - Efficient scenario fragment extraction
Multiprocessing parameter passing - Proper config propagation in parallel mode

All changes are fully backward compatible with existing trajdata usage, and ready to support Unified Scene Data Flow changes in Alpasim.

Add USDZ dataset implementation with the following features: - Wrap alpasim_utils.Artifact for stable USDZ parsing - Convert Artifact data to trajdata's standard DataFrame format - Extract velocity and acceleration from trajectory positions - Use project's arr_utils.quaternion_to_yaw() for consistency - Calculate actual dt from timestamps (fallback to 0.1s) - Optimize derivative computation with single groupby pass - Support maps and agent metadata extraction Technical improvements: - Proper quaternion order conversion ([x,y,z,w] -> [w,x,y,z]) - Dynamic time step calculation from trajectory timestamps - Efficient pandas operations for velocity/acceleration - Clean integration with env_utils Total: 461 lines of new USDZ dataset code

- Add dataset_kwargs parameter to ParallelDatasetPreprocessor - Pass dataset_kwargs through multiprocessing to child processes - Save dataset_kwargs in UnifiedDataset for parallel preprocessing - This enables dataset-specific configurations to work in parallel mode Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…path Major changes: - Change dataset_kwargs format from flat to nested dict structure Old: dataset_kwargs={'param': value} (shared by all datasets) New: dataset_kwargs={'dataset_name': {'param': value}} (per-dataset) - Remove yaml_config_path parameter from NuplanDataset and NuPlanObject Use central_tokens_config directly instead - Update env_utils.get_raw_datasets() to support nested dict format Each dataset now receives only its own specific parameters - Optimize parallel preprocessing to avoid redundant parameter extraction ParallelDatasetPreprocessor now correctly handles nested dict format - Clean up df_cache.py: remove unused resolution parameter Benefits: - Clearer separation of per-dataset parameters - More flexible multi-dataset configuration - Simpler parallel worker logic Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>

Revert the usdz dataset added in c60712d: drop the src/trajdata/dataset_specific/usdz package, the usdz dispatch branch in env_utils, and the README table row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mwatson-nvidia · 2026-05-22T21:08:11Z

            kdtrees_path,
            rtrees_path,
-        ) = DataFrameCache.get_map_paths(cache_path, env_name, map_name, resolution)
+        ) = DataFrameCache.get_map_paths(cache_path, env_name, map_name)


I'm not sure I understand if this change was actually needed. It seems like is_map_cached now has an unused argument and that existing callers may have issues.

This has been updated so is_map_cached now consistently takes only cache_path, env_name, and map_name across the base class, DataFrameCache, and the caller in dataset.py.
The removed resolution argument is no longer used anywhere.

mwatson-nvidia · 2026-05-22T21:09:29Z

 class NuplanDataset(RawDataset):
+    def __init__(
+        self,
+        name: str,


nit: suggest keeping this variable the same name as the super class (env_name). Especially important since this is different from self.name

Done, renamed this argument to env_name to match the superclass.

mwatson-nvidia · 2026-05-22T21:17:59Z

+        num_timesteps_before: Optional[int] = None,
+        num_timesteps_after: Optional[int] = None,
+        use_central_tokens: bool = False,
+        **kwargs,


Yes, will delete.

mwatson-nvidia · 2026-05-22T21:19:44Z

+        central_tokens_config: Optional[List[Dict[str, Any]]] = None,
+        num_timesteps_before: Optional[int] = None,
+        num_timesteps_after: Optional[int] = None,
+        use_central_tokens: bool = False,


Based on my understanding, this might be redundant. If I read this correctly, use_central_tokens is equivalent to central_tokens_config is not None

This also makes me wonder if self._use_central_tokens is necessary or if this could be a property (dependent on central_tokens_config) instead

This comment also trickles down to the NuplanObject class

self._use_central_tokens has been removed in favor of a derived property in both NuplanDataset and NuPlanObject.
We used bool(self._central_tokens_config) rather than is not None to also handle the edge case of an empty list being treated as "disabled", which seems like the right semantic.

mwatson-nvidia · 2026-05-22T21:27:30Z

+        data_dir: str,
+        parallelizable: bool = True,
+        has_maps: bool = True,
+        central_tokens_config: Optional[List[Dict[str, Any]]] = None,


I think it's possible that these optional arguments might currently be incompatible with the way that caching works in that the env_is_cached() method won't be aware of changes to these arguments.

The current cache key is based solely on env_name, so changing central_tokens_config, num_timesteps_before, or num_timesteps_after won't automatically invalidate an existing cache. I prepare to add a docstring note to make this explicit and advise users to pass rebuild_cache=True when these arguments change.

A more robust fix would be to incorporate a hash of these parameters into the cache key (e.g. a config fingerprint stored alongside scenes_list.dill). I'm happy to discuss whether that's worth a follow-up change.

mwatson-nvidia · 2026-05-23T00:21:38Z

+            }
+        }
+        try:
+            import pickle


this introduces pickle.loads() on values read from the NuPlan SQLite database. That is only safe if we treat the raw NuPlan dataset as trusted input (the concern being malicious pickle execution). Is this required by the NuPlan DB format, and if so can we document the trust assumption clearly or use an official NuPlan deserialization path instead

Yes, this is required by the NuPlan DB representation for these columns. The official ORM defines the camera calibration fields as pickle-backed SQL types, and SimplePickleType.process_result_value() uses pickle.loads() before wrapping the value in the NuPlan data type.
Details in:

https://github.com/motional/nuplan-devkit/blob/e924167/nuplan/database/common/sql_types.py#L74-L96

https://github.com/motional/nuplan-devkit/blob/e924167/nuplan/database/nuplan_db_orm/camera.py#L28-L31

I agree this still carries the normal pickle trust assumption. Since trajdata reads the SQLite DB directly instead of instantiating the NuPlan SQLAlchemy ORM objects, this mirrors the official path, but the DB must be treated as trusted NuPlan input.
I will strengthen the inline comment near this deserialization code to make that trust assumption explicit.

mwatson-nvidia · 2026-05-23T00:25:31Z

+                        else:
+                            translation = np.array([trans_obj.x, trans_obj.y, trans_obj.z]) if hasattr(trans_obj, 'x') else np.array([0.0, 0.0, 0.0])
+                    except Exception:
+                        translation = np.array([0.0, 0.0, 0.0])


I wonder what the consequences of the default value are here. For instance, if there is an issue loading this data, would it be better to raise the exception or set this to None to avoid someone accidentally using this downstream and getting corrupt results.

same question for the quaternions

Yes, this has been updated for camera calibration fields. If translation, rotation/quaternion, intrinsic, or distortion cannot be parsed or converted, the value is now set to None instead of falling back to synthetic defaults like zero translation, identity quaternion, or hardcoded intrinsics. That should prevent downstream consumers from accidentally treating invalid calibration as valid data.

mwatson-nvidia · 2026-05-23T00:27:34Z

+                            # Try to get quaternion components.
+                            if hasattr(rot_obj, 'quaternion'):
+                                q = rot_obj.quaternion
+                                rotation = np.array([q.w, q.x, q.y, q.z]) if hasattr(q, 'w') else np.array([q.x, q.y, q.z, q.w])


this line doesn't quite look right to me. it uses q.w even when it doesn't exist

this is another point where I wonder if None is better

mwatson-nvidia · 2026-05-23T00:28:40Z

+                        elif hasattr(intrinsic_obj, '__iter__') and not isinstance(intrinsic_obj, str):
+                            intrinsic = np.array(intrinsic_obj)
+                        else:
+                            intrinsic = np.array([[1545.0, 0.0, 960.0], [0.0, 1545.0, 560.0], [0.0, 0.0, 1.0]])


where do these numbers come from?
add reference source? or maybe this should be another None case instead of fallback?

Intrinsic now falls back to None on parse failure, consistent with the other fields.

mwatson-nvidia · 2026-05-23T00:38:40Z

-        self.envs: List[RawDataset] = env_utils.get_raw_datasets(data_dirs)
+        # Pass dataset-specific kwargs to raw datasets
+        dataset_kwargs = dataset_kwargs or {}
+        self.dataset_kwargs = dataset_kwargs  # Save for later use in parallel preprocessing


I wonder if we can change this to

self.dataset_kwargs = dataset_kwargs or {}

and then just use self.dataset_kwargs everywhere?

- rename NuplanDataset.__init__ param `name` → `env_name` to match RawDataset convention - always copy scene_record to data_access_info to avoid mutating shared scenes list - simplify dataset_kwargs: assign self.dataset_kwargs = dataset_kwargs or {} directly - convert use_central_tokens to derived property in both NuplanDataset and NuPlanObject - consolidate log-name parsing into _get_log_name_from_scene_name, use in all 4 callsites - remove resolution param from is_map_cached (SceneCache, DFCache, UnifiedDataset) - raise TypeError instead of silently wrapping non-dict data_access_info - add source comment for NUPLAN_REAL_LIDAR2EGO_* constants - add pickle trust comment in get_sensor_calibration - replace silent except-pass with logger.warning/debug, return None on failure - fix quaternion fallback to return None instead of garbage value - add public map_api, scene_name_to_index properties and get_scene_cache() helper Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove unused **kwargs from NuplanDataset.__init__ - Add docstring note on cache invalidation when central_tokens_config or timestep args change (rebuild_cache=True required) - Strengthen pickle.loads() comment to reference official NuPlan ORM deserialization path (sql_types.SimplePickleType, camera.Camera) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

NuplanDataset.__init__ no longer accepts **kwargs, so callers must pass only the parameters it knows about. Add _NUPLAN_VALID_KWARGS whitelist in get_raw_dataset() to silently drop extra keys such as config_dir and asset_base_path that belong to the caller (alpasim) rather than to trajdata. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

WCJ-BERT mentioned this pull request Apr 24, 2026

Feat: Unified Scene Data Flow NVlabs/alpasim#56

Open

WCJ-BERT and others added 5 commits May 15, 2026 16:20

[Nuplan] Support central token extraction and dual-mode data loading

c827a73

[USDZ] Remove usdz dataset support

3e89999

Revert the usdz dataset added in c60712d: drop the src/trajdata/dataset_specific/usdz package, the usdz dispatch branch in env_utils, and the README table row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

WCJ-BERT force-pushed the alpasim branch from d22a271 to 3e89999 Compare May 15, 2026 16:20

minor fix

494904d

mwatson-nvidia reviewed May 23, 2026

View reviewed changes

WCJ-BERT and others added 4 commits June 6, 2026 15:54

dataset: expose scene lookup and cache helpers

dbad30a

Conversation

WCJ-BERT commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WCJ-BERT commented Mar 19, 2026 •

edited

Loading