Description
When submitting a fine_tune job with a top-level HuggingFace dataset URI (e.g. hf://tatsu-lab/alpaca), the Kubeflow Trainer SDK passes dataset.data_dir=/workspace/dataset/. to torchtune as a CLI override. torchtune's alpaca_cleaned_dataset uses this value as a path component when building the HF URI, which produces an invalid path with doubled slashes and a stray /. segment (hf:///workspace/dataset/./tatsu-lab/alpaca) and crashes the training job at dataset load time.
Root cause (SDK-level): kubeflow.trainer.backends.kubernetes.utils.get_args_using_torchtune_config unconditionally calls os.path.join(constants.DATASET_PATH, relative_path), even when relative_path == ".". In that case os.path.join returns "/workspace/dataset/." instead of "/workspace/dataset".
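The join behaviour described above can be checked in isolation. This is a minimal sketch, not SDK code: it assumes constants.DATASET_PATH is "/workspace/dataset" (inferred from the override shown in this report) and uses posixpath, which is what os.path resolves to inside a Linux container.

```python
import posixpath

# Assumption: this mirrors constants.DATASET_PATH in the SDK, inferred from
# the dataset.data_dir override quoted in this report.
DATASET_PATH = "/workspace/dataset"

# Joining with "." keeps the "." as a literal trailing path segment.
joined = posixpath.join(DATASET_PATH, ".")
print(joined)  # /workspace/dataset/.

# normpath collapses the "." segment and the trailing slash.
normalized = posixpath.normpath(joined)
print(normalized)  # /workspace/dataset
```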
Fix:
Workaround: strip the trailing /. and append dataset.source=<local_path> and dataset.data_dir=null, so that torchtune calls load_dataset(source) directly against the PVC-resident parquet files. The proper fix belongs in the Kubeflow Trainer SDK: when relative_path == ".", skip the os.path.join and emit args.append(f"dataset.data_dir={DATASET_PATH}") instead.
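A possible shape for the SDK-side guard is sketched below. The helper name dataset_data_dir_arg is hypothetical and does not exist in the SDK, and DATASET_PATH is again assumed to be "/workspace/dataset"; the real change would live inside get_args_using_torchtune_config.

```python
import posixpath

# Assumption: stands in for constants.DATASET_PATH in the SDK.
DATASET_PATH = "/workspace/dataset"


def dataset_data_dir_arg(relative_path: str) -> str:
    """Build the dataset.data_dir override (hypothetical helper).

    For a top-level dataset (relative_path == "."), return the dataset
    root itself instead of joining, so no trailing "/." is produced.
    """
    if relative_path == ".":
        return f"dataset.data_dir={DATASET_PATH}"
    return f"dataset.data_dir={posixpath.join(DATASET_PATH, relative_path)}"


args = []
args.append(dataset_data_dir_arg("."))
print(args)  # ['dataset.data_dir=/workspace/dataset']
```

Nested dataset paths still go through the join unchanged, so only the top-level case behaves differently.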
Steps to Reproduce
- Configure fine_tune with a top-level HF dataset: dataset="hf://tatsu-lab/alpaca"
- Submit the job
- Check the torchtune logs: dataset loading fails with an HF URI error
Expected Behavior
dataset.data_dir=/workspace/dataset (no trailing /.)
Actual Behavior
dataset.data_dir=/workspace/dataset/. (with a trailing /.), which causes torchtune to construct invalid HF URIs
Version
No response
Python Version
3.11