fine_tune fails on OpenShift: no HF_HOME set, pods write to read-only /.cache #33

Description

@abhijeet-dhumal

On OpenShift (restricted SCC), fine_tune jobs fail during model/dataset download because neither the initializer pods nor the training node pod sets HF_HOME. The Hugging Face libraries then fall back to their default cache location, which resolves to /.cache/huggingface in these pods. That path is on the root filesystem, which is read-only under OpenShift's restricted SCC.

Both the dataset-initializer and model-initializer pods crash during download. Even if the initializers succeed (via a hostPath/PVC workaround), the training node pod itself fails when torchtune tries to read the tokenizer config.
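To illustrate why the cache ends up under /.cache, here is a minimal sketch of the fallback order. It only approximates the Hugging Face behavior (HF_HOME, then XDG_CACHE_HOME, then ~/.cache), and it assumes HOME is "/" in the affected pods. The helper name resolve_hf_home is hypothetical and not part of any library.

```python
import posixpath


def resolve_hf_home(env):
    """Approximate how the Hugging Face cache root is picked from env vars.

    `env` is a plain dict standing in for the pod environment.
    Assumption: HOME falls back to "/" when it is not set.
    """
    home = env.get("HOME", "/")
    # XDG_CACHE_HOME, when set, replaces the default ~/.cache.
    xdg_cache = env.get("XDG_CACHE_HOME", posixpath.join(home, ".cache"))
    # An explicit HF_HOME takes priority over everything else.
    return env.get("HF_HOME", posixpath.join(xdg_cache, "huggingface"))


print(resolve_hf_home({"HOME": "/"}))
print(resolve_hf_home({"HOME": "/", "HF_HOME": "/workspace/.hf"}))
```

With only HOME=/ set, the result is /.cache/huggingface; once HF_HOME is set, it is used as-is.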

Steps to Reproduce

  1. On an OpenShift cluster with restricted SCC
  2. Submit fine_tune(model="hf://Qwen/Qwen2.5-1.5B-Instruct", dataset="hf://tatsu-lab/alpaca", runtime="torchtune-...")
  3. Watch the initializer pods: they crash with PermissionError: [Errno 13] Permission denied: '/.cache'
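To confirm the cause from inside a pod, a small writability probe might help. This is a sketch only: cache_dir_writable is a hypothetical helper, and the paths in the usage lines are the ones mentioned in this issue.

```python
import os
import tempfile


def cache_dir_writable(path):
    """Return True if `path` can be created (if missing) and written to."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating a temporary file proves write access; it is removed on close.
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        # PermissionError and read-only filesystem errors are OSError subclasses.
        return False


if __name__ == "__main__":
    for candidate in ("/.cache/huggingface", "/workspace/.hf"):
        print(candidate, cache_dir_writable(candidate))
```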

Expected Behavior

HF model/dataset downloads succeed; job runs to completion.

Actual Behavior

All pods that write to /.cache crash immediately. Job fails.

Proposed Fix

Inject HF_HOME=/workspace/.hf into:

  1. HuggingFaceModelInitializer and HuggingFaceDatasetInitializer, via an hf_home field (SDK initializer env support)
  2. spec.trainer.env on the TrainJob CR (not via runtimePatches, which are blocked by the admission webhook)

/workspace is always writable (it's the ClusterTrainingRuntime PVC).
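The trainer-side part of this fix could look roughly like the sketch below. It works on a TrainJob manifest as a plain dict rather than through the SDK. The helper name with_hf_home is hypothetical, and only the spec.trainer.env path named above is assumed.

```python
import copy


def with_hf_home(trainjob, hf_home="/workspace/.hf"):
    """Return a copy of a TrainJob manifest (plain dict) with HF_HOME set
    in spec.trainer.env. The input manifest is left unmodified."""
    patched = copy.deepcopy(trainjob)
    trainer = patched.setdefault("spec", {}).setdefault("trainer", {})
    # Drop any existing HF_HOME entry so the injected value is the only one.
    env = [e for e in (trainer.get("env") or []) if e.get("name") != "HF_HOME"]
    env.append({"name": "HF_HOME", "value": hf_home})
    trainer["env"] = env
    return patched


job = {"kind": "TrainJob", "spec": {"trainer": {"env": [{"name": "X", "value": "1"}]}}}
print(with_hf_home(job)["spec"]["trainer"]["env"])
```

Returning a copy keeps the caller's manifest intact, so the same base job can be submitted with or without the override.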

Version

Kubeflow SDK 0.4.0, OpenShift 4.17+

Python Version

3.11

Labels

bug, good first issue, help wanted
