fine_tune fails: SDK generates dataset.data_dir=/workspace/dataset/. (trailing /.) #32

Description

@abhijeet-dhumal

When a fine_tune job is submitted with a top-level HuggingFace dataset URI (e.g. hf://tatsu-lab/alpaca), the Kubeflow Trainer SDK passes dataset.data_dir=/workspace/dataset/. to torchtune as a CLI override. torchtune's alpaca_cleaned_dataset then uses this value as a path component inside an HF URI, producing an invalid path (hf:///workspace/dataset/./tatsu-lab/alpaca), and the training job crashes at dataset load time.

Root cause (SDK-level): kubeflow.trainer.backends.kubernetes.utils.get_args_using_torchtune_config unconditionally calls os.path.join(constants.DATASET_PATH, relative_path), even when relative_path == ".". In that case Python's os.path.join returns /workspace/dataset/. (note the trailing /.).
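The join behaviour behind the root cause can be checked in isolation. The snippet below is a minimal sketch: DATASET_PATH is assumed to be /workspace/dataset (inferred from the override shown in this issue, not read from the SDK's constants), and posixpath is used so the result matches what os.path.join produces inside a Linux container.

```python
import posixpath

# Assumed value of constants.DATASET_PATH, inferred from the override in this issue.
DATASET_PATH = "/workspace/dataset"

# Joining with "." keeps the dot as a literal trailing path component.
joined = posixpath.join(DATASET_PATH, ".")
print(joined)  # /workspace/dataset/.

# normpath collapses the redundant "/." segment.
normalized = posixpath.normpath(joined)
print(normalized)  # /workspace/dataset
```

os.path.join only concatenates components and does not normalize them, which is why the trailing /. survives into the CLI override.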

Fix:

Strip the trailing /. and append dataset.source=<local_path> and dataset.data_dir=null, so that torchtune calls load_dataset(source) directly against the PVC-resident parquet files.

The actual fix belongs in the Kubeflow Trainer SDK: when relative_path == ".", skip the os.path.join and append the base path as is, i.e. args.append(f"dataset.data_dir={DATASET_PATH}").
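A minimal sketch of the two approaches described above, under stated assumptions: build_data_dir_override and strip_trailing_dot are hypothetical helper names (they are not functions in the Kubeflow Trainer SDK), and DATASET_PATH is assumed to be /workspace/dataset. The first helper mirrors the proposed SDK-side guard; the second mirrors the "strip trailing /." step applied to an already-built override.

```python
import posixpath

# Assumed value of constants.DATASET_PATH, inferred from the override in this issue.
DATASET_PATH = "/workspace/dataset"


def build_data_dir_override(relative_path: str) -> str:
    """Hypothetical helper: build the dataset.data_dir override for torchtune.

    For a top-level dataset (relative_path == ".") the base path is used
    as is, so no trailing "/." is produced.
    """
    if relative_path == ".":
        return f"dataset.data_dir={DATASET_PATH}"
    return f"dataset.data_dir={posixpath.join(DATASET_PATH, relative_path)}"


def strip_trailing_dot(override: str) -> str:
    """Hypothetical helper: drop a trailing "/." from an existing override."""
    return override[:-2] if override.endswith("/.") else override


print(build_data_dir_override("."))
print(build_data_dir_override("data/train"))
print(strip_trailing_dot("dataset.data_dir=/workspace/dataset/."))
```

The guard handles only the "." case named in this issue; whether other values of relative_path need similar treatment would have to be checked against the SDK code.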

Steps to Reproduce

  1. Configure fine_tune with a top-level HF dataset: dataset="hf://tatsu-lab/alpaca"
  2. Submit the job
  3. Check the torchtune logs: dataset loading fails with an HF URI error

Expected Behavior

dataset.data_dir=/workspace/dataset (no trailing /.)

Actual Behavior

dataset.data_dir=/workspace/dataset/. causes torchtune to construct invalid HF URIs

Version

No response

Python Version

3.11

Metadata

Labels

bug, good first issue, help wanted
