fine_tune fails on OpenShift: no HF_HOME set, pods write to read-only /.cache

### Description

On OpenShift (restricted SCC), `fine_tune` jobs fail during model/dataset download because neither the initializer pods nor the training node pod sets `HF_HOME`. The HuggingFace library then falls back to its default cache location, which resolves to `/.cache/huggingface` in these pods; that path is on the read-only root filesystem under OpenShift's restricted SCC.

Both the dataset-initializer and model-initializer pods crash during download. Even if the initializers succeed (via a hostPath/PVC workaround), the training node pod itself fails when torchtune tries to access the tokenizer config.
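To illustrate why the pods end up at `/.cache`, here is a minimal sketch of the cache-path fallback order as I understand it from `huggingface_hub` (`HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache`). `resolve_hf_home` is a hypothetical helper written for this issue, not a library function, and it assumes the container runs with `HOME=/`, which is what the `/.cache` in the traceback suggests.

```python
import posixpath


def resolve_hf_home(env):
    """Approximate the HuggingFace cache root for a given environment mapping.

    Hypothetical helper: mirrors the HF_HOME -> XDG_CACHE_HOME -> ~/.cache
    fallback order. Assumes HOME is "/" when it is missing or empty.
    """
    home = env.get("HOME") or "/"
    default_cache = posixpath.join(home, ".cache")
    cache_root = env.get("XDG_CACHE_HOME", default_cache)
    return env.get("HF_HOME", posixpath.join(cache_root, "huggingface"))


# Without HF_HOME the cache lands on the root filesystem.
print(resolve_hf_home({"HOME": "/"}))
# With HF_HOME set, the cache moves to the writable workspace volume.
print(resolve_hf_home({"HOME": "/", "HF_HOME": "/workspace/.hf"}))
```

The second call shows the effect of the proposed fix: once `HF_HOME` is present in the pod environment, nothing is written under `/.cache`.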

### Steps to Reproduce

1. Use an OpenShift cluster with the restricted SCC.
2. Submit `fine_tune(model="hf://Qwen/Qwen2.5-1.5B-Instruct", dataset="hf://tatsu-lab/alpaca", runtime="torchtune-...")`.
3. Watch the initializer pods: they crash with `PermissionError: [Errno 13] Permission denied: '/.cache'`.

### Expected Behavior

HF model/dataset downloads succeed; job runs to completion.

### Actual Behavior

All pods that write to /.cache crash immediately. Job fails.

### Fix

Inject `HF_HOME=/workspace/.hf` into:

- `HuggingFaceModelInitializer` and `HuggingFaceDatasetInitializer`, via an `hf_home` field (SDK initializer ENV support)
- `spec.trainer.env` on the TrainJob CR (not via runtimePatches, which are blocked by the admission webhook)

`/workspace` is always writable (it's the ClusterTrainingRuntime PVC).
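A rough sketch of the `spec.trainer.env` half of the fix, operating on a TrainJob manifest as a plain dict so it does not depend on any SDK internals. `with_hf_home` is a hypothetical helper, not existing SDK code, and the manifest shape (a list of `name`/`value` entries under `spec.trainer.env`) is assumed from the standard Kubernetes env var format.

```python
import copy

# Writable path on the ClusterTrainingRuntime PVC, as proposed above.
HF_HOME_PATH = "/workspace/.hf"


def with_hf_home(trainjob, hf_home=HF_HOME_PATH):
    """Return a copy of a TrainJob manifest with HF_HOME set in spec.trainer.env.

    Hypothetical helper: replaces any existing HF_HOME entry and leaves the
    input manifest untouched.
    """
    patched = copy.deepcopy(trainjob)
    trainer = patched.setdefault("spec", {}).setdefault("trainer", {})
    # Drop a stale HF_HOME entry, if any, before appending the new one.
    env = [e for e in trainer.get("env", []) if e.get("name") != "HF_HOME"]
    env.append({"name": "HF_HOME", "value": hf_home})
    trainer["env"] = env
    return patched


job = {"kind": "TrainJob", "spec": {"trainer": {"env": [{"name": "FOO", "value": "1"}]}}}
patched = with_hf_home(job)
print(patched["spec"]["trainer"]["env"])
```

Setting the variable on the TrainJob itself, rather than patching the runtime, keeps the change on the path the admission webhook allows.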

### Version

Kubeflow SDK 0.4.0, OpenShift 4.17+

### Python Version

3.11
