[misc] feat: support fsspec (gs:// , s3://) sources in copy_to_local#6850

Open
dkondoetsy wants to merge 2 commits into verl-project:main from dkondoetsy:feat-fsspec-copy-to-local

@dkondoetsy dkondoetsy commented Jun 25, 2026

Support GCS and S3 in copy_to_local via fsspec.

copy_to_local previously handled only local and hdfs:// paths, so configs that point data.train_files / model paths at gs:// (or other fsspec object stores) were treated as local and failed to resolve.

Add fsspec-backed fetching:

  • is_fsspec_path(): detects a remote fsspec scheme (gs://, s3://, ...) and is kept distinct from is_non_local()
    (hdfs://, fetched by hdfs_io) and from local paths / HF model ids. is_non_local() is unchanged.

  • copy_local_path_from_hdfs() now also fetches fsspec paths; _fetch_remote() dispatches hdfs:// -> hdfs_io.copy,
    otherwise it uses fsspec's fs.get (file or dir), reusing the existing md5 cache + filelock + directory-record machinery.

  • fsspec is imported lazily with a clear error naming the backend to install (gcsfs for gs://, s3fs for s3://); added fsspec to requirements.txt.
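
The classification described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the name is_fsspec_path_sketch, the regex, and the excluded-scheme set are assumptions chosen to match the behavior the description states (gs:// and s3:// are fsspec paths; hdfs://, file://, local paths, and HF model ids are not).

```python
import re

# Assumption: any "scheme://" prefix other than hdfs:// and file:// counts as
# an fsspec remote. The regex used in the PR itself may differ.
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*)://")
_NON_FSSPEC_SCHEMES = {"hdfs", "file"}


def is_fsspec_path_sketch(path: str) -> bool:
    """Return True for remote object-store URLs such as gs:// or s3://."""
    match = _SCHEME_RE.match(path)
    if match is None:
        # Plain local paths and HF model ids ("org/model") carry no scheme.
        return False
    return match.group(1).lower() not in _NON_FSSPEC_SCHEMES
```

Keeping this check separate from is_non_local() means existing hdfs-only callers see no change in behavior.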

Tests (tests/utils/test_fs_on_cpu.py) cover is_fsspec_path classification, a copy_to_local fetch via the always-available in-memory fsspec backend (no network), and unchanged local-path passthrough.

Why not use fsspec for hdfs?

verl's hdfs_io.copy is tuned to a cluster's HDFS CLI/env (Kerberos auth, HADOOP_HOME/CLASSPATH, libhdfs). fsspec's HDFS goes through pyarrow's HadoopFileSystem (needs a JVM + classpath) or webhdfs (different endpoint/auth). Equivalence isn't guaranteed.

This PR was developed with AI assistance (Claude Code). I have reviewed every changed line and run the tests.

Checklist

Test

tests/utils/test_fs_on_cpu.py (new cases, using the always-available in-memory fsspec
backend — no network/gcsfs needed):

  • test_is_fsspec_path: scheme classification (gs/s3 → fsspec; hdfs/file/local/HF-id → not).
  • test_copy_to_local_fetches_fsspec_file: fetch a single file to the local md5 cache.
  • test_copy_to_local_local_path_passthrough: local paths returned unchanged (no regression).
  • test_copy_to_local_rejects_fsspec_glob: a glob fails loud (no silent single-shard).
  • test_copy_to_local_fsspec_dir_trailing_slash: memory://dir/ fetches recursively.

Run: pytest tests/utils/test_fs_on_cpu.py

API and Usage Example

No API change — same copy_to_local signature; remote schemes are now resolved.

from verl.utils.fs import copy_to_local
local = copy_to_local("gs://my-bucket/gsm8k/train.parquet")  # requires `gcsfs` installed
# e.g. data.train_files=gs://my-bucket/gsm8k/train.parquet now resolves on the worker node

Design & Code Changes

  • is_fsspec_path(): detects a remote fsspec scheme (gs://, s3://, …) via regex, excluding
    hdfs:// (kept on hdfs_io), file://, local paths, and HF ids. is_non_local() is left
    hdfs-only, so its other callers (rm_dataset, checkpoint managers) are unchanged.
  • _fetch_remote(): dispatches hdfs:// → hdfs_io.copy, otherwise fsspec fs.get
    (file or dir). Raises on a multi-path glob match instead of silently using one shard.
    fsspec imported lazily with a clear "install gcsfs/s3fs" error.
  • copy_local_path_from_hdfs(): guard broadened to is_non_local(src) or is_fsspec_path(src),
    reusing the existing md5 cache + filelock + directory-record machinery. Trailing slash is
    stripped for fsspec dir URLs.
  • fsspec added to requirements.txt.
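
The dispatch and glob handling listed above might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's code: fetch_remote_sketch and pick_single_match are hypothetical names, the two copy functions are injected as parameters so the control flow can be shown without a cluster or network, and the empty-match behavior is a guess (the description only states that a multi-path match raises).

```python
from typing import Callable, Sequence

CopyFn = Callable[[str, str], None]


def pick_single_match(src: str, matches: Sequence[str]) -> str:
    """Fail loudly unless the source resolves to exactly one remote path."""
    if len(matches) > 1:
        # Per the PR text: never silently fetch a single shard of a glob.
        raise ValueError(f"{src!r} matched {len(matches)} paths; pass one path or a directory")
    if not matches:
        # Assumption: the description does not say what an empty match does.
        raise FileNotFoundError(src)
    return matches[0]


def fetch_remote_sketch(src: str, dst: str, hdfs_copy: CopyFn, fsspec_get: CopyFn) -> str:
    """Route hdfs:// to the HDFS copier and every other remote URL to fsspec."""
    if src.startswith("hdfs://"):
        hdfs_copy(src, dst)
        return "hdfs"
    # Strip the trailing slash so "scheme://dir/" and "scheme://dir" behave alike.
    fsspec_get(src.rstrip("/"), dst)
    return "fsspec"
```

In the real change the two callables would presumably be hdfs_io.copy and a wrapper around fsspec's fs.get; they are parameters here only to keep the sketch self-contained.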

Checklist Before Submitting

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds support for remote object-store URLs (such as gs:// and s3://) using fsspec. It updates verl/utils/fs.py to detect fsspec paths, fetch remote files/directories, and handle trailing slashes, while adding comprehensive unit tests. The feedback suggests wrapping both the fsspec import and the get_fs_token_paths call in a try-except block to ensure that missing backend errors (like gcsfs or s3fs) are properly caught and reported with a helpful error message.
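
One way to act on that suggestion is sketched below, under assumptions: resolve_fsspec_sketch and _BACKEND_HINTS are hypothetical names, only the gs/gcsfs and s3/s3fs pairs come from the PR text, and the sketch relies on fsspec reporting a missing backend as an ImportError when the filesystem is resolved, which is how current releases appear to behave.

```python
# Only the gs/s3 pairs come from the PR text; other schemes get a generic hint.
_BACKEND_HINTS = {"gs": "gcsfs", "s3": "s3fs"}


def resolve_fsspec_sketch(src: str):
    """Resolve a URL to (filesystem, paths), with one install hint on failure."""
    scheme = src.split("://", 1)[0].lower()
    try:
        # Both steps share one try block: a missing fsspec fails at the import,
        # while a missing backend (e.g. gcsfs) fails at resolution time.
        import fsspec

        fs, _token, paths = fsspec.get_fs_token_paths(src)
    except ImportError as exc:
        backend = _BACKEND_HINTS.get(scheme)
        packages = f"fsspec {backend}" if backend else "fsspec"
        raise ImportError(
            f"Cannot read {scheme}:// paths; try `pip install {packages}`."
        ) from exc
    return fs, paths
```

With only the import guarded, a user who has fsspec but not gcsfs would see fsspec's own error instead of the message naming the backend to install.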

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread verl/utils/fs.py Outdated
verl's copy_to_local handled local + hdfs:// only, so configs that point
data.train_files / model paths at gs:// (or other fsspec object stores) were
treated as local and failed to resolve. Add fsspec-backed fetching:

- is_fsspec_path(): detects a remote fsspec scheme (gs://, s3://, ...), kept
  distinct from is_non_local() (hdfs://, fetched by hdfs_io) and from local
  paths / HF model ids. is_non_local() is unchanged, so its other callers keep
  hdfs-only semantics.
- copy_local_path_from_hdfs() now also fetches fsspec paths; _fetch_remote()
  dispatches hdfs:// -> hdfs_io.copy, otherwise fsspec fs.get (file or dir),
  reusing the existing md5 cache + filelock + directory-record machinery.
- fsspec is imported lazily with a clear error naming the backend to install
  (gcsfs for gs://, s3fs for s3://); added fsspec to requirements.txt.

Tests (tests/utils/test_fs_on_cpu.py) cover is_fsspec_path classification, a
copy_to_local fetch via the always-available in-memory fsspec backend (no
network), and unchanged local-path passthrough.

AI assistance (Claude Code) was used.

Signed-off-by: Derrick Kondo <dkondo@etsy.com>
Co-authored-by: Claude
@dkondoetsy dkondoetsy force-pushed the feat-fsspec-copy-to-local branch from 25ac281 to bc85b71 Compare June 25, 2026 15:17
@dkondoetsy dkondoetsy changed the title [misc] feat: support gs:// (fsspec) sources in copy_to_local [misc] feat: support fsspec (gs:// , s3://) sources in copy_to_local Jun 25, 2026

CLAassistant commented Jun 25, 2026

CLA assistant check
All committers have signed the CLA.

@dkondoetsy dkondoetsy marked this pull request as ready for review June 25, 2026 22:08
@dkondoetsy

Author

@eric-haibin-lin, @tongyx361 when you have a chance.
