Description
On OpenShift (restricted SCC), fine_tune jobs fail during model/dataset download because neither the initializer pods nor the training node pod sets HF_HOME. With HF_HOME unset, the HuggingFace library defaults to writing to /.cache/huggingface, which is on the root filesystem and is not writable under OpenShift's restricted SCC.
Both the dataset-initializer and model-initializer pods crash during download. Even if the initializers succeed (via a hostPath/PVC workaround), the training node pod itself fails when torchtune tries to access the tokenizer config.
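To illustrate why the path ends up at /.cache, here is a minimal sketch of the cache lookup order that huggingface_hub documents (HF_HOME first, then XDG_CACHE_HOME, then ~/.cache). The function name resolve_hf_home is hypothetical, and the sketch assumes the pod's HOME resolves to "/", which matches the path in the error but was not verified separately.

```python
import os


def resolve_hf_home(env):
    """Approximate the documented lookup: HF_HOME, then XDG_CACHE_HOME, then ~/.cache.

    `env` is a plain dict standing in for the pod's environment.
    Assumption: HOME falls back to "/" when it is not set in the pod.
    """
    home = env.get("HOME", "/")
    default_cache = os.path.join(home, ".cache")
    cache_root = env.get("XDG_CACHE_HOME", default_cache)
    return env.get("HF_HOME", os.path.join(cache_root, "huggingface"))


# Without HF_HOME the cache lands on the root filesystem.
print(resolve_hf_home({"HOME": "/"}))  # /.cache/huggingface

# With HF_HOME set, the cache moves to the writable workspace volume.
print(resolve_hf_home({"HOME": "/", "HF_HOME": "/workspace/.hf"}))  # /workspace/.hf
```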
Steps to Reproduce
- Use an OpenShift cluster with the restricted SCC
- Submit fine_tune(model="hf://Qwen/Qwen2.5-1.5B-Instruct", dataset="hf://tatsu-lab/alpaca", runtime="torchtune-...")
- Watch the initializer pods: they crash with PermissionError: [Errno 13] Permission denied: '/.cache'
Expected Behavior
HF model/dataset downloads succeed; job runs to completion.
Actual Behavior
All pods that write to /.cache crash immediately. Job fails.
Fix
Inject HF_HOME=/workspace/.hf into:
- HuggingFaceModelInitializer and HuggingFaceDatasetInitializer, via an hf_home field (SDK initializer ENV support)
- spec.trainer.env on the TrainJob CR (not via runtimePatches, which are blocked by the admission webhook)
/workspace is always writable (it's the ClusterTrainingRuntime PVC).
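As a rough sketch of the TrainJob half of the proposed fix, the helper below adds HF_HOME to spec.trainer.env on a TrainJob manifest represented as a plain dict. The name with_hf_home and the dict-based manifest are assumptions made for illustration; this is not the SDK's implementation, and the hf_home field on the initializers is not shown because it does not exist yet.

```python
import copy

# Path proposed in this issue; /workspace is the ClusterTrainingRuntime PVC.
HF_HOME_PATH = "/workspace/.hf"


def with_hf_home(train_job, hf_home=HF_HOME_PATH):
    """Return a copy of a TrainJob manifest dict with HF_HOME in spec.trainer.env.

    Hypothetical helper: any existing HF_HOME entry is replaced, other
    entries are kept, and the input manifest is left unmodified.
    """
    patched = copy.deepcopy(train_job)
    trainer = patched.setdefault("spec", {}).setdefault("trainer", {})
    env = [e for e in trainer.get("env", []) if e.get("name") != "HF_HOME"]
    env.append({"name": "HF_HOME", "value": hf_home})
    trainer["env"] = env
    return patched


job = {"kind": "TrainJob", "spec": {"trainer": {"env": [{"name": "FOO", "value": "1"}]}}}
patched = with_hf_home(job)
print(patched["spec"]["trainer"]["env"])
```

Setting the variable through spec.trainer.env rather than runtimePatches follows the constraint noted above, since the admission webhook rejects the latter.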
Version: Kubeflow SDK 0.4.0, OpenShift 4.17+
Version
No response
Python Version
3.11