run_custom_training packages parameter fails on OpenShift (read-only /.local) #41

Description

@abhijeet-dhumal

When run_custom_training(packages=["torch", "transformers"], ...) is used on OpenShift, the pre-script pip install step runs pip install --user, which writes to /.local. Under OpenShift's restricted SCC, the root filesystem is read-only and /.local is not backed by a writable volume, so the job fails immediately with:

PermissionError: [Errno 13] Permission denied: '/.local'

The volumes parameter does auto-inject a writable emptyDir at /workspace. However, emptyDir volumes defined by the user (e.g. for /.local) are not added as volumeMounts on the training container; only /workspace is mounted.
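To confirm which locations are actually writable inside the training container, a small diagnostic like the one below could be run as the training script. It is only a sketch: the helper names nearest_existing_ancestor and can_create_under are made up for illustration, and it relies on site.getuserbase() reporting the base directory that pip install --user writes under.

```python
import os
import site


def nearest_existing_ancestor(path: str) -> str:
    """Walk up from path until a path that exists is found."""
    current = os.path.abspath(path)
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def can_create_under(path: str) -> bool:
    """True if path is a writable directory, or its nearest existing ancestor is."""
    anchor = nearest_existing_ancestor(path)
    return os.path.isdir(anchor) and os.access(anchor, os.W_OK | os.X_OK)


if __name__ == "__main__":
    user_base = site.getuserbase()  # base directory used by `pip install --user`
    print(f"HOME={os.environ.get('HOME')!r}")
    print(f"user base {user_base}: writable={can_create_under(user_base)}")
    print(f"/workspace: writable={can_create_under('/workspace')}")
```

On an affected pod, the user base would be expected to show up as not writable while /workspace is writable.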

Steps to Reproduce

  1. On an OpenShift cluster with restricted SCC
  2. Submit:
run_custom_training(
    runtime="torch-distributed",
    script="import torch; print(torch.__version__)",
    packages=["torch"],
    confirmed=True
)
  3. Check the logs → PermissionError: /.local

Expected Behavior

Packages are installed successfully and the training script runs.

Actual Behavior

Job fails at pip install with PermissionError.

Workaround

Do NOT use the packages parameter. Instead, install from inside the training script into the writable /workspace/lib directory:

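import os
import subprocess
import sys

# /workspace is the auto-injected writable emptyDir
lib_dir = '/workspace/lib'
os.makedirs(lib_dir, exist_ok=True)

# Install into lib_dir instead of the read-only user site (/.local)
subprocess.run([
    sys.executable, '-m', 'pip', 'install',
    '--target', lib_dir, '--quiet',
    'transformers', 'peft', 'trl'
], check=True)

# Make the installed packages importable by the rest of the script
sys.path.insert(0, lib_dir)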
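The same workaround can be wrapped in a reusable helper. The sketch below is an illustration, not part of the SDK: install_to_target and its run parameter are hypothetical names. It also exports PYTHONPATH, on the assumption that the script may start child Python processes, which do not inherit changes made to sys.path.

```python
import os
import subprocess
import sys


def install_to_target(packages, lib_dir="/workspace/lib", run=subprocess.run):
    """Install packages into lib_dir and make them importable.

    `run` is injectable so the command can be inspected without invoking pip.
    """
    os.makedirs(lib_dir, exist_ok=True)
    cmd = [sys.executable, "-m", "pip", "install",
           "--target", lib_dir, "--quiet", *packages]
    run(cmd, check=True)

    # Importable in the current process.
    if lib_dir not in sys.path:
        sys.path.insert(0, lib_dir)

    # Importable in child Python processes started from this script.
    existing = os.environ.get("PYTHONPATH", "")
    if lib_dir not in existing.split(os.pathsep):
        os.environ["PYTHONPATH"] = os.pathsep.join(
            p for p in (lib_dir, existing) if p
        )
    return cmd
```

Calling install_to_target(["transformers", "peft", "trl"]) at the top of the training script would then replace the inline snippet above.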

MCP Server Version

No response

Python Version

Python-3.11

Kubernetes Version

No response

MCP Client

None

References

Proposed Fixes

  • Documentation: Update platform-fixes.md resource to document this gap and the workaround
  • System prompt: Add guidance to INSTRUCTION_SECTIONS["training"] telling agents not to use packages on OpenShift
  • Log analysis: Add OpenShift-specific failure pattern to monitoring.py (_FAILURE_PATTERNS) so get_training_logs returns actionable hints
  • Long-term: Change the SDK's pip install step to use --target=/workspace/lib instead of --user when running on OpenShift
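The log-analysis and long-term items above could look roughly like the following. This is a sketch under stated assumptions: the real structure of _FAILURE_PATTERNS in monitoring.py is not shown in this issue, so FAILURE_PATTERNS, hints_for_logs, and pip_install_args are hypothetical names, and how OpenShift is detected is left to the caller.

```python
import re

# Hypothetical pattern table; the actual _FAILURE_PATTERNS layout may differ.
FAILURE_PATTERNS = [
    (
        re.compile(r"PermissionError: \[Errno 13\] Permission denied: '/\.local"),
        "pip install --user could not write to /.local (read-only root "
        "filesystem). Drop the packages parameter and install with "
        "--target /workspace/lib inside the script.",
    ),
]


def hints_for_logs(log_text):
    """Return the hint for every known failure pattern found in log_text."""
    return [hint for pattern, hint in FAILURE_PATTERNS if pattern.search(log_text)]


def pip_install_args(packages, on_openshift, target_dir="/workspace/lib"):
    """Choose --target on OpenShift and --user elsewhere."""
    location = ["--target", target_dir] if on_openshift else ["--user"]
    return ["-m", "pip", "install", *location, *packages]
```

Note that packages installed with --target are only importable if the target directory is also placed on sys.path or PYTHONPATH, so the SDK change would need to cover that as well.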
