Description
When run_custom_training(packages=["torch", "transformers"], ...) is used on OpenShift, the pre-script install step runs pip install --user, which writes to /.local. Under OpenShift's restricted SCC, the root filesystem is read-only and /.local is not backed by a writable volume, so the job fails immediately with:
PermissionError: [Errno 13] Permission denied: '/.local'
The volumes parameter does auto-inject a writable emptyDir at /workspace, but user-defined emptyDir volumes (e.g. one for /.local) are not mounted as volumeMounts on the training container; only /workspace is.
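To confirm this diagnosis from inside a pod, a small check along the following lines may help. It is only a sketch: the helper names nearest_existing_parent and can_create_under are hypothetical and not part of the SDK, and it assumes that site.getuserbase() resolves to the same directory that pip install --user targets in the training container.

```python
import os
import site


def nearest_existing_parent(path: str) -> str:
    """Walk up from path until a directory that exists is found."""
    current = os.path.abspath(path)
    while not os.path.isdir(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def can_create_under(path: str) -> bool:
    """Return True if path (or its nearest existing parent) is writable."""
    parent = nearest_existing_parent(path)
    return os.access(parent, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # User base is where `pip install --user` writes (e.g. /.local when HOME is /).
    user_base = site.getuserbase()
    print(f"user base {user_base} writable: {can_create_under(user_base)}")
    print(f"/workspace writable: {can_create_under('/workspace')}")
```

If the first line reports False and the second True, the pod matches the situation described here: the user site is not writable, while /workspace is.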
Steps to Reproduce
- On an OpenShift cluster with restricted SCC
- Submit:
run_custom_training(
    runtime="torch-distributed",
    script="import torch; print(torch.__version__)",
    packages=["torch"],
    confirmed=True,
)
- Check logs → PermissionError: /.local
Expected Behavior
Packages are installed successfully and the training script runs.
Actual Behavior
Job fails at pip install with PermissionError.
Workaround
Do NOT use the packages parameter. Instead, install inside the script into the writable /workspace/lib directory:
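import os
import subprocess
import sys

# /workspace is the auto-injected writable emptyDir
lib_dir = '/workspace/lib'
os.makedirs(lib_dir, exist_ok=True)

# Install into lib_dir instead of the user site (/.local)
subprocess.run([
    sys.executable, '-m', 'pip', 'install',
    '--target', lib_dir, '--quiet',
    'transformers', 'peft', 'trl',
], check=True)

# Make the installed packages importable for the rest of the script
sys.path.insert(0, lib_dir)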
MCP Server Version
No response
Python Version
Python-3.11
Kubernetes Version
No response
MCP Client
None
References
Proposed Fixes
- Documentation: Update platform-fixes.md resource to document this gap and the workaround
- System prompt: Add guidance to INSTRUCTION_SECTIONS["training"] telling agents not to use packages on OpenShift
- Log analysis: Add OpenShift-specific failure pattern to monitoring.py (_FAILURE_PATTERNS) so get_training_logs returns actionable hints
- Long-term: Change the SDK's pip install step to use --target=/workspace/lib instead of --user when running on OpenShift
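The log-analysis and long-term fixes above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the (regex, hint) tuple shape for a _FAILURE_PATTERNS entry, the helper names hint_for_logs and build_pip_command, and the on_openshift flag are hypothetical, since this issue shows neither the real structure in monitoring.py nor the SDK's install step.

```python
import re
import sys
from typing import List, Optional, Sequence

# Hypothetical entry shape: (compiled regex, actionable hint).
OPENSHIFT_PIP_USER_PATTERN = (
    re.compile(r"PermissionError: \[Errno 13\] Permission denied: '/\.local"),
    "pip install --user cannot write to /.local under the restricted SCC; "
    "install with --target /workspace/lib inside the script instead.",
)


def hint_for_logs(logs: str) -> Optional[str]:
    """Return the hint if the logs contain the /.local permission error."""
    pattern, hint = OPENSHIFT_PIP_USER_PATTERN
    return hint if pattern.search(logs) else None


def build_pip_command(
    packages: Sequence[str],
    on_openshift: bool,
    target: str = "/workspace/lib",
) -> List[str]:
    """Build the pip command, using --target instead of --user on OpenShift."""
    cmd = [sys.executable, "-m", "pip", "install"]
    cmd += ["--target", target] if on_openshift else ["--user"]
    return cmd + list(packages)
```

With --target, the chosen directory also has to be importable by the training script, for example through PYTHONPATH or sys.path.insert as in the workaround. How the SDK would detect OpenShift is left open here.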