[misc] feat: support fsspec (gs:// , s3://) sources in copy_to_local#6850
[misc] feat: support fsspec (gs:// , s3://) sources in copy_to_local#6850dkondoetsy wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds support for remote object-store URLs (such as gs:// and s3://) using fsspec. It updates verl/utils/fs.py to detect fsspec paths, fetch remote files/directories, and handle trailing slashes, while adding comprehensive unit tests. The feedback suggests wrapping both the fsspec import and the get_fs_token_paths call in a try-except block to ensure that missing backend errors (like gcsfs or s3fs) are properly caught and reported with a helpful error message.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
verl's copy_to_local handled local + hdfs:// only, so configs that point data.train_files / model paths at gs:// (or other fsspec object stores) were treated as local and failed to resolve. Add fsspec-backed fetching: - is_fsspec_path(): detects a remote fsspec scheme (gs://, s3://, ...), kept distinct from is_non_local() (hdfs://, fetched by hdfs_io) and from local paths / HF model ids. is_non_local() is unchanged, so its other callers keep hdfs-only semantics. - copy_local_path_from_hdfs() now also fetches fsspec paths; _fetch_remote() dispatches hdfs:// -> hdfs_io.copy, otherwise fsspec fs.get (file or dir), reusing the existing md5 cache + filelock + directory-record machinery. - fsspec is imported lazily with a clear error naming the backend to install (gcsfs for gs://, s3fs for s3://); added fsspec to requirements.txt. Tests (tests/utils/test_fs_on_cpu.py) cover is_fsspec_path classification, a copy_to_local fetch via the always-available in-memory fsspec backend (no network), and unchanged local-path passthrough. AI assistance (Claude Code) was used. Signed-off-by: Derrick Kondo <dkondo@etsy.com> Co-authored-by: Claude
25ac281 to
bc85b71
Compare
|
@eric-haibin-lin, @tongyx361 when you have a chance. |
Support GCS and S3 in copy_to_local via fsspec.
copy_to_local handled local + hdfs:// only. So configs that point data.train_files / model paths at gs:// (or other fsspec object stores) are treated as local and failed to resolve.
Add fsspec-backed fetching:
is_fsspec_path(): detects a remote fsspec scheme (gs://, s3://, ...) and is kept distinct from is_non_local()
(hdfs://, fetched by hdfs_io) and from local paths / HF model ids. is_non_local() is unchanged
copy_local_path_from_hdfs() now also fetches fsspec paths; _fetch_remote() dispatches hdfs:// -> hdfs_io.copy,
otherwise it uses fsspec's fs.get (file or dir), reusing the existing md5 cache + filelock + directory-record machinery.
fsspec is imported lazily with a clear error naming the backend to install (gcsfs for gs://, s3fs for s3://); added fsspec to requirements.txt.
Tests (tests/utils/test_fs_on_cpu.py) cover is_fsspec_path classification, a copy_to_local fetch via the always-available in-memory fsspec backend (no network), and unchanged local-path passthrough.
Why not use fsspec for hdfs?
verl's hdfs_io.copy is tuned to a cluster's HDFS CLI/env (Kerberos auth, HADOOP_HOME/CLASSPATH, libhdfs). fsspec's HDFS goes through pyarrow's HadoopFileSystem (needs a JVM + classpath) or webhdfs (different endpoint/auth). Equivalence isn't guaranteed.
This PR was developed with AI assistance (Claude Code). I have reviewed every changed line and run the tests.
Checklist
Test
tests/utils/test_fs_on_cpu.py(new cases, using the always-available in-memory fsspecbackend — no network/gcsfs needed):
test_is_fsspec_path— scheme classification (gs/s3 → fsspec; hdfs/file/local/HF-id → not).test_copy_to_local_fetches_fsspec_file— fetch a single file to the local md5 cache.test_copy_to_local_local_path_passthrough— local paths returned unchanged (no regression).test_copy_to_local_rejects_fsspec_glob— a glob fails loud (no silent single-shard).test_copy_to_local_fsspec_dir_trailing_slash—memory://dir/fetches recursively.Run:
pytest tests/utils/test_fs_on_cpu.pyAPI and Usage Example
No API change — same
copy_to_localsignature; remote schemes are now resolved.Design & Code Changes
hdfs:// (kept on hdfs_io) and file:///local/HF-ids. is_non_local() is left
hdfs-only, so its other callers (rm_dataset, checkpoint managers) are unchanged.
(file or dir). Raises on a multi-path glob match instead of silently using one shard.
fsspec imported lazily with a clear "install gcsfs/s3fs" error.
reusing the existing md5 cache + filelock + directory-record machinery. Trailing slash is
stripped for fsspec dir URLs.
Checklist Before Submitting
mainstill hasis_non_local→ hdfs-only. Not a duplicate.pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.