Restrict Python to <3.14 to prevent CPU RAM OOM in distributed training #128

Merged
hanaol merged 1 commit into main from hanaol/python-3.14-oom on May 21, 2026

hanaol commented May 4, 2026

Collaborator

Summary

  • Restrict requires-python to <3.14 after discovering Python 3.14.0 causes progressive CPU RAM exhaustion during distributed training
  • Pin .python-version to 3.13 and update uv.lock accordingly
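The version bound above is enforced at install time by packaging tools. As a hedged illustration only, the same range could also be checked at runtime before training starts. The helper names below (`is_supported`, `require_supported_python`) are hypothetical and not part of this PR.

```python
import sys

# Bounds mirror requires-python = ">=3.11,<3.14" from this PR.
MIN_SUPPORTED = (3, 11)
MAX_EXCLUSIVE = (3, 14)


def is_supported(version_info=None) -> bool:
    """Return True when (major, minor) falls inside [3.11, 3.14)."""
    major, minor = (version_info or sys.version_info)[:2]
    return MIN_SUPPORTED <= (major, minor) < MAX_EXCLUSIVE


def require_supported_python() -> None:
    """Hypothetical guard to call at the top of a training entry point."""
    if not is_supported():
        found = ".".join(map(str, sys.version_info[:3]))
        raise RuntimeError(
            f"Python {found} is outside the supported range >=3.11,<3.14"
        )
```

A guard like this would only matter when the package is run from a source checkout that bypasses the installer's `requires-python` check.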

Background

Under Python 3.14.0, training processes were consistently OOM-killed by the Linux kernel across all dataset configurations. Downgrading to Python 3.13.9, with an otherwise identical environment and the same torch version, resolved the issue. The root cause has not been confirmed; the likely candidate is an incompatibility between Python 3.14's memory management and PyTorch's C extensions under multi-process DDP workloads.
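For anyone trying to reproduce the behaviour, a minimal sketch of how progressive host-memory growth might be detected from inside a training process is shown below. It is an assumption-laden illustration, not the method used to diagnose this PR: the function names and both thresholds are hypothetical, and `current_rss_mib` relies on the Linux `/proc` filesystem.

```python
def current_rss_mib() -> float:
    """Read this process's resident set size from /proc (Linux only)."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # value is reported in kB
    raise RuntimeError("VmRSS not found in /proc/self/status")


def grows_progressively(
    rss_samples_mib,
    min_total_growth_mib=1024.0,  # hypothetical threshold
    min_rising_fraction=0.9,      # hypothetical threshold
) -> bool:
    """Heuristic: RSS rose on most intervals and by a large total amount."""
    if len(rss_samples_mib) < 2:
        return False
    deltas = [b - a for a, b in zip(rss_samples_mib, rss_samples_mib[1:])]
    rising_fraction = sum(1 for d in deltas if d > 0) / len(deltas)
    total_growth = rss_samples_mib[-1] - rss_samples_mib[0]
    return (
        rising_fraction >= min_rising_fraction
        and total_growth >= min_total_growth_mib
    )
```

One possible usage is to append `current_rss_mib()` to a list every N training steps on each rank and compare the result of `grows_progressively` between a Python 3.13 run and a Python 3.14 run.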

hanaol commented May 4, 2026

Collaborator Author

@claude review this PR

claude Bot commented May 4, 2026


Claude finished @hanaol's task in 33s


PR Review: Restrict Python to <3.14 to prevent CPU RAM OOM

  • Gather context and read changed files
  • Review pyproject.toml changes
  • Review .python-version
  • Review uv.lock changes
  • Post final review

Summary

This is a straightforward, well-scoped fix. The changes are minimal and targeted.

Changes Reviewed

pyproject.toml: requires-python = ">=3.11,<3.14"

The constraint is correct and clearly communicates the intention. One minor observation: the classifiers list (lines 15–26) does not include "Programming Language :: Python :: 3.14", so no update is needed there — but it's worth confirming the classifiers are still accurate as Python 3.13 is now the pinned version.

.python-version: 3.13

Pinning to 3.13 is the right call given the observed OOM behavior. This ensures consistent local development and CI environments via uv.

uv.lock — regenerated for Python 3.13

The large reduction (-572 lines) reflects the removal of Python 3.14-specific resolved dependencies/markers. This is expected and correct.

Notes

  • The PR description is clear and provides sufficient reproduction context (DDP workloads, Linux OOM kill, resolved by downgrading to 3.13.9).
  • No tracking issue is referenced, but for a workaround of this nature (pending upstream fix in Python 3.14 or PyTorch), it may be worth filing an issue to revisit this constraint once the root cause is identified upstream.
  • Consider whether CI is pinned to a specific Python version and whether it needs to be updated to match (e.g., GitHub Actions python-version matrix or setup-python step).

Verdict

Approve — the change is correct, minimal, and well-justified. No issues found.

hanaol force-pushed the hanaol/python-3.14-oom branch from 4b0ef2f to 6b92dd1 on May 20, 2026 17:54

hanaol commented May 20, 2026

Collaborator Author

@claude re-review this PR

hanaol requested a review from forklady42 on May 20, 2026 17:55

claude Bot commented May 20, 2026


Claude finished @hanaol's task in 1m 0s


PR Re-Review: Restrict Python to <3.14 to prevent CPU RAM OOM

  • Gather context
  • Read changed files (pyproject.toml, .python-version, uv.lock)
  • Check CI workflows for Python version alignment
  • Post final review

Summary

The PR is correct and minimal. No issues found. All three changed files are consistent with each other and with the stated intent.


pyproject.toml

requires-python = ">=3.11,<3.14"

Correct constraint. The classifiers list (lines 13–26) already stops at "Programming Language :: Python :: 3.13" — no update needed there.

.python-version

3.13

Pinning to 3.13 is correct and ensures uv uses a consistent, known-good interpreter for both local dev and CI.

uv.lock

The large reduction (−572 net lines) is expected: the Python 3.14 wheel entries for every package have been removed. The lock header correctly reflects requires-python = ">=3.11, <3.14". No issues.

CI Workflows

All CI workflows (gpu-e2e.yml, gen-expected.yml, gpu-benchmark.yml) run uv sync without an explicit Python version step — they rely on uv reading .python-version automatically. This means CI will use Python 3.13 as intended, with no workflow changes required.
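Because CI depends on `.python-version` and `pyproject.toml` agreeing, a small consistency check could guard against the two drifting apart. The sketch below is a hypothetical helper, not something in this repository. It handles only `>=` and `<` clauses, which is all the constraint in this PR uses, and it ignores whitespace around commas so the `pyproject.toml` and `uv.lock` spellings are treated alike.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '3.13' or '3.13.9' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(pin: str, requires_python: str) -> bool:
    """Check a pin against comma-separated '>=' and '<' clauses only."""
    version = parse_version(pin)
    for clause in requires_python.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            if version < parse_version(clause[2:]):
                return False
        elif clause.startswith("<") and not clause.startswith("<="):
            if version >= parse_version(clause[1:]):
                return False
        else:
            # Other operators (==, ~=, <=, !=) are out of scope for this sketch.
            raise ValueError(f"unsupported clause: {clause!r}")
    return True
```

For anything beyond this simple case, a full specifier parser such as the one in the `packaging` library would be the safer choice than tuple comparison.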


Minor Notes

  • No tracking issue referenced for revisiting this constraint once Python 3.14 + PyTorch DDP compatibility is resolved upstream. Worth filing one to avoid the restriction becoming permanent by default.
  • The requires-python in the lock file (>=3.11, <3.14) has a space after the comma while pyproject.toml has none (>=3.11,<3.14). This is cosmetic and generated by uv — no action needed.

Verdict

Approve — changes are correct, consistent, and minimal. Safe to merge.

Python 3.14.0 caused progressive CPU RAM exhaustion during DDP training.
Downgrading to Python 3.13 resolved the issue. Pin .python-version to
3.13 and update uv.lock accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
hanaol force-pushed the hanaol/python-3.14-oom branch from 6b92dd1 to 90880bb on May 21, 2026 20:05
hanaol merged commit d715bb9 into main on May 21, 2026
3 checks passed
hanaol deleted the hanaol/python-3.14-oom branch on May 21, 2026 20:14