fix(training): scale data prep + dataloader to 3B-token corpora (ZEB-137)#256
fix(training): scale data prep + dataloader to 3B-token corpora (ZEB-137)#256jenglund wants to merge 6 commits into
Conversation
Pretrain HarmonyModelConfig.target() (24L/1280H/~350M params) on ~2B tokens of expanded FineWeb-Edu (Option B, ~35% of Chinchilla- optimal). Produces 6 intermediate checkpoints spanning 20-100% of training budget for ZEB-138 capacity-curve sweep, plus the final canonical checkpoint that serves as substrate for both ZEB-138 scale-up and teacher-match experiments. Zero code changes - train.py --config target is already wired at line 847, prepare_data.py --max-tokens is parameterized, and --save-every / --checkpoint-interval handle artifact drops and resumability. Primary hardware: KRILE (RTX 4090), estimated 5-7 days wall time. Cloud A100 fallback opt-in when wall time justifies: ~2 days wall, ~$55-75 spend, within the $1000 research-program cloud budget cap. Embraces the commodity-hardware research axis (what can be built on single-GPU gaming rigs) as part of the Harmony project framing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KRILE flagged during prereq check: HarmonyModelConfig.target() materializes to ~474M params (L=24 × H=1280 × ffn=3413 SwiGLU × vocab=32000 = 473.5M), not the ~350M the spec initially claimed. Updates throughout: - Model-size label: 350M → 474M across all body text - Chinchilla math: 20× params = 9.5B tokens (not 7B); Option B at ~21% of Chinchilla (not 35%) - Step-time estimate: 85-105s at optimized throughput (up from 60-80s); KRILE to verify in Step 1 smoke test - Wall time: ~8.1 days (up from 6.3d); acceptance range 7.6-9.5d depending on observed step time - Cloud A100: ~2.7 days at ~$70-90 (up from 2d / $50-60) - Option C: 9.5B tokens = ~38 days on KRILE / ~$350 cloud (up from 18-25 days / $250) - Smoke-test acceptance window updated to 85-105s - Pre-flight checklist step-time window updated Checkpoint naming corrected to match actual train.py output: model_step_N.safetensors + optimizer_step_N.pt pairs plus rolling checkpoint.pt from --checkpoint-interval (previously placeholder ckpt_N.pt names that don't correspond to actual artifacts). Added naming note under header explaining filename/branch still reference "350M" as a legacy handle; content reflects the true 474M figure. No code change required — target() as-defined is the intended substrate; the 350M label was loose at spec-write time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…137)
Both prepare_data.py and make_hf_dataloader accumulated tokens as list[int]
(~36 B/token of PyLong overhead), OOM'ing a 46 GB host around 1.4 B tokens
and blocking ZEB-137 Step 0 onwards. Scaling to the 3 B-token FineWeb-Edu-3B
target required four independent fixes, each surfacing as the one before it
was removed:
1. prepare_data: token_stream is array.array('H') (2 B/token; Mistral
vocab 32000 fits uint16) instead of list[int].
2. prepare_data: gc + pa.default_memory_pool().release_unused() +
malloc_trim(0) every 10k docs. Without this, glibc holds freed
tokenizer-churn segments in its arenas and pyarrow retains streamed
shard chunks in its pool, together leaking ~40 B/token of unreleased
fragmentation.
3. prepare_data: write arrow IPC stream directly in 10k-row batches via
pyarrow, bypassing Dataset.from_dict which materializes the full
dataset in memory before saving. Uses FixedSizeList schema because
variable-length ListArray's int32 offsets overflow above ~1.05 M rows
(our 3 B-token prep produces ~1.46 M rows). Emits the dataset_info.json
and state.json sidecars load_from_disk expects.
4. make_hf_dataloader: drops to dataset.data.column("input_ids") and
concatenates arrow chunks into a single int32 numpy buffer (zero-copy
from mmap), then torch.from_numpy. Batches cast to int64 at yield
for nn.Embedding compatibility. Handles both FixedSizeList (new prep
output) and legacy variable-length List columns.
Validated: all 11 prepare_data tests + 2 make_hf_dataloader tests pass
(6 unrelated ι CSV-column-count test failures predate this branch). Full
3B prep: rc=0, 47 min wall time, 8.99 GB peak RSS, 12 GB train + 115 MB
val on disk, load_from_disk roundtrips correctly at all offsets. Dataloader
on the 3B dataset: 6.5 s init, 22.7 GB peak RSS, batches shape (32, 2049)
int64 with values in vocab range.
Also ships two tooling additions used during this investigation:
diag_memory.py reproduces the glibc/pyarrow leak patterns in 4 isolated
scenarios; run_prep_with_rss.sh wraps a prep run with an RSS sampler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
PR author is in the excluded authors list. |
|
CodeAnt AI is reviewing your PR. |
📝 WalkthroughWalkthroughAdds a Harmony pretraining spec and tooling: Arrow-backed data preparation with typed buffers and periodic memory reclamation, a memory diagnostic CLI, an RSS-monitoring prep wrapper, dataloader changes to use Arrow/NumPy zero-copy buffers, and tests for validation and empty-dataset handling. No public API signatures were changed. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Prep as prepare_data.py
participant Tokenizer as AutoTokenizer
participant Mem as MemoryManager
participant Arrow as PyArrowWriter
participant Disk
User->>Prep: start run_prepare_data(max_tokens, seq_len, val_fraction)
Prep->>Tokenizer: stream text -> tokenize
Tokenizer-->>Prep: token_ids
Prep->>Prep: append token_ids to array.array("H")
loop every 10000 docs
Prep->>Mem: gc.collect() / pyarrow.pool.release_unused()
alt malloc_trim available
Prep->>Mem: malloc_trim(0)
end
end
Prep->>Prep: numpy.frombuffer -> reshape -> split train/val
Prep->>Arrow: write streaming IPC (.arrow) + dataset_info.json/state.json
Arrow->>Disk: persist arrow batches & metadata
Disk-->>User: prepared dataset written
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Review Summary by QodoFix memory leaks in 3B-token data prep and dataloader; add Harmony-474M pretraining spec
WalkthroughsDescription• Fix four critical memory leaks in data preparation and dataloader that blocked 3B-token corpus
processing
• Replace Python list[int] token accumulation with array.array('H') to reduce per-token overhead
from 36B to 2B
• Add periodic heap trimming (gc.collect + malloc_trim + pyarrow pool release) every 10k docs to
prevent RSS climb
• Replace Dataset.from_dict materialization with direct pyarrow RecordBatchStreamWriter to avoid 2×
peak memory spike
• Fix int32 offset overflow in ListArray by switching to FixedSizeListArray schema for >1.05M rows
• Optimize dataloader to use zero-copy numpy views of arrow buffers instead of materializing Python
lists
• Add memory diagnostic tool and RSS monitoring script for validation
• Add comprehensive ZEB-137 pretraining specification for 474M Harmony model on 2B tokens
Diagramflowchart LR
A["Python list[int]<br/>36B/token"] -->|Replace| B["array.array'H'<br/>2B/token"]
C["Dataset.from_dict<br/>2x peak memory"] -->|Replace| D["RecordBatchStreamWriter<br/>80MB batches"]
E["ListArray int32<br/>offsets overflow"] -->|Replace| F["FixedSizeListArray<br/>no offsets"]
G["Unbounded heap<br/>fragmentation"] -->|Add| H["gc.collect +<br/>malloc_trim every 10k"]
I["Python list extend<br/>in dataloader"] -->|Replace| J["Zero-copy numpy<br/>arrow views"]
B --> K["3B tokens<br/>fits in 8.99GB RSS"]
D --> K
F --> K
H --> K
J --> K
File Changes1. training/ct87/diag_memory.py
|
Code Review by Qodo
1.
|
User descriptionSummaryFour independent memory bugs collectively blocked ZEB-137 Step 0 (FineWeb-Edu-3B prep) and the downstream smoke test. Each surfaced as the one before it was removed; all four are fixed in this PR. The four bugs1.
|
PR Code Suggestions ✨Latest suggestions up to commit
|
| Category | Suggestion | Severity |
| Possible bug |
✅
libc.so.6 is unavailable | Major |
| Logic error |
✅
| Major |
|
CodeAnt AI finished reviewing your PR. |
There was a problem hiding this comment.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/superpowers/specs/2026-04-17-harmony-350m-pretraining-design.md`:
- Around line 195-208: The fenced code block that renders the checkpoint
file-tree (the block beginning with "training/checkpoints/harmony_474m_v1/")
should include a language specifier for consistent rendering; update the opening
fence from ``` to ```text (or ```plaintext) so the tree is rendered as plain
text across viewers and add no other changes to the block contents.
In `@training/ct87/diag_memory.py`:
- Around line 51-65: Several print statements use f-strings even though they
contain no placeholders; update them to plain string literals to avoid
misleading f-prefixes. Locate the print calls inside scenario_a_text_only (e.g.,
print(f"\n=== A: iterate text only (no tokenizer, no accumulate) ==="), print(f"
baseline rss=..."), print(f" final rss=...")) and the analogous print usages in
the other scenario functions referenced (the lines at 70, 91, 115), and remove
the leading "f" so they become normal strings (e.g., print("\n=== A: ... ===")).
Do not change content or interpolation where placeholders are actually present
(keep f-strings only when formatting variables).
- Around line 113-137: scenario_d_accumulate_trim currently only calls
gc.collect() and malloc_trim() and misses releasing pyarrow's pool like
_release_unused_heap() does; update scenario_d_accumulate_trim to call the same
pyarrow release helper (either call _pa_pool.release_unused() directly or invoke
the existing release_all() helper used in prepare_data.py) at the same points
where gc.collect() and malloc_trim() are called (e.g., inside the n % 10_000
branch and at the end) so pyarrow memory is reclaimed for an accurate comparison
with prepare_data.py.
- Around line 162-171: Several lines use semicolon-joined statements (e.g.,
"gc.collect(); malloc_trim()" and print calls) which static analysis flagged;
split each into separate statements to improve readability and satisfy linters.
Locate the blocks around scenario_b_tokenize_only, scenario_c_accumulate and the
gc/malloc_trim calls and replace combined statements like "gc.collect();
malloc_trim()" with two lines calling gc.collect() and malloc_trim() separately,
and ensure the print(...) calls are on their own lines after the cleanup calls
so rss_kb(), malloc_trim, and gc.collect are each invoked on separate
statements.
In `@training/ct87/prepare_data.py`:
- Around line 98-106: The except block that sets _pa_pool is too broad; narrow
the caught exceptions to only ImportError and AttributeError so you don't mask
other failures: in the try that does "import pyarrow as _pa" and "_pa_pool =
_pa.default_memory_pool()", change the blanket "except Exception" to "except
(ImportError, AttributeError)" to handle only import or attribute access
problems while allowing other errors to propagate.
- Around line 228-229: The nested context managers using pa.OSFile(arrow_path,
"wb") as sink and pa.ipc.new_stream(sink, arrow_schema) as writer should be
combined into a single with statement to simplify the code; replace the
two-level nesting around arrow_path, arrow_schema and writer with a single "with
pa.OSFile(arrow_path, 'wb') as sink, pa.ipc.new_stream(sink, arrow_schema) as
writer:" and keep the inner block unchanged.
In `@training/ct87/train.py`:
- Around line 92-94: Pre-allocate a reusable CPU tensor and fill slices instead
of calling torch.stack and .to each iteration: create a buffer like batch_buffer
= torch.empty((batch_size, window), dtype=torch.long) once (adjacent to where
rng/all_tokens_t/window/batch_size are defined), then inside the loop generate
starts = torch.randint(..., generator=rng) and for i, s in enumerate(starts):
batch_buffer[i].copy_(all_tokens_t[s : s + window].to(dtype=batch_buffer.dtype))
(or use .to(dtype=batch_buffer.dtype, copy=False) if needed), and yield
batch_buffer; this avoids repeated torch.stack and extra .to allocations while
keeping symbols starts, batch_buffer, all_tokens_t, window, batch_size, and rng
unchanged.
In `@training/run_prep_with_rss.sh`:
- Around line 16-17: The script assigns WT and then changes directory with cd
"$WT" but doesn't handle a failing cd; update the cd invocation (the line using
the WT variable) to bail out on failure by appending an exit on error (e.g., use
the shellcheck-suggested "|| exit 1" behavior) so that if cd "$WT" fails the
script stops instead of continuing in the wrong directory.
- Line 8: The script currently enables only 'set -u' which leaves the script
non-fail-fast; update the startup options by adding '-e' so failures abort early
(replace the existing 'set -u' occurrence with a combined option such as 'set
-eu' or 'set -euo pipefail') and ensure this change is placed at the top of the
script before any commands or background processes are started.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a9287484-65bb-4c48-a3a4-0e83a60eb114
📒 Files selected for processing (5)
docs/superpowers/specs/2026-04-17-harmony-350m-pretraining-design.mdtraining/ct87/diag_memory.pytraining/ct87/prepare_data.pytraining/ct87/train.pytraining/run_prep_with_rss.sh
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{sh,bash}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{sh,bash}: Always use non-interactive flags with shell file operations (cp, mv, rm) to avoid hanging on confirmation prompts. Use: cp -f source dest, mv -f source dest, rm -f file, rm -rf directory, cp -rf source dest
Use non-interactive flags with scp: use -o BatchMode=yes for non-interactive mode
Use non-interactive flags with ssh: use -o BatchMode=yes to fail instead of prompting
Use non-interactive flags with apt-get: use -y flag
Use non-interactive flags with brew: use HOMEBREW_NO_AUTO_UPDATE=1 environment variable
Files:
training/run_prep_with_rss.sh
🪛 markdownlint-cli2 (0.22.0)
docs/superpowers/specs/2026-04-17-harmony-350m-pretraining-design.md
[warning] 195-195: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.15.10)
training/ct87/prepare_data.py
[warning] 105-105: Do not catch blind exception: Exception
(BLE001)
[warning] 228-229: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
training/ct87/diag_memory.py
[error] 53-53: f-string without any placeholders
Remove extraneous f prefix
(F541)
[warning] 60-60: Use enumerate() for index variable n in for loop
(SIM113)
[error] 70-70: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 91-91: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 115-115: f-string without any placeholders
Remove extraneous f prefix
(F541)
[warning] 154-154: Missing return type annotation for private function fresh
(ANN202)
[error] 162-162: Multiple statements on one line (semicolon)
(E702)
[error] 166-166: Multiple statements on one line (semicolon)
(E702)
[error] 170-170: Multiple statements on one line (semicolon)
(E702)
🪛 Shellcheck (0.11.0)
training/run_prep_with_rss.sh
[warning] 17-17: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
🔇 Additional comments (14)
training/ct87/train.py (3)
47-54: LGTM! Clear documentation of the memory-efficient approach.The docstring accurately explains the memory savings (12 B vs ~108 GB for 3B tokens) and the cast-to-int64 contract for nn.Embedding compatibility.
83-88: LGTM! Defensive lifetime management.The
_keep_alivetuple correctly holds references to bothdatasetandflatto prevent HF's internals from closing mmap handles or numpy deallocating the buffer while the iterator is active. Thenoqa: F841acknowledges the intentional unused variable.
67-73: Consider clarifying the "legacy POC" format support with concrete handling or tests.The comment claims
FixedSizeList (new prep output) and variable-length List (legacy POC)both expose.valuesas int32 arrays. However, onlyFixedSizeListArraywith int32 is generated inprepare_data.py, and no tests validate compatibility with legacy datasets. If legacy data must be supported, either:
- Add a test case demonstrating a legacy dataset that works with the current code, or
- Implement fallback error handling: wrap
.to_numpy(zero_copy_only=True)in a try/except and retry withzero_copy_only=FalseonArrowInvalid.Currently, the code has no protection against
zero_copy_only=Trueraising an exception if the Arrow buffer cannot be zero-copied.training/ct87/prepare_data.py (6)
72-74: LGTM! Clear docstring update.The docstring accurately reflects the new implementation using typed buffers with EOS separators.
117-125: LGTM! Robust vocab size validation.The assertions provide clear failure modes if a larger vocab tokenizer is ever used, preventing silent truncation. The error message helpfully suggests the fix (widening to 'I').
136-139: LGTM! Memory-efficient token accumulation.Using
array.array("H")(2 bytes/token) instead oflist[int](~36 bytes/token) is the correct fix for scaling to 3B tokens. The typecode 'H' matches the uint16 range validated above.
168-173: LGTM! Zero-copy numpy view for chunking.
np.frombuffercreates a view without copying, and the reshape operation is also a view. This keeps memory usage minimal during the chunking phase.
245-257: LGTM! HF-compatible metadata files.The
state.jsonstructure matches what HuggingFace'sload_from_diskexpects. The fingerprint format is reasonable for caching purposes.
202-204: No changes needed—pa.list_(..., list_size=seq_len)is the correct API for FixedSizeListArray.The PyArrow documentation confirms that
pa.list_(value_type, list_size)is the correct and intended way to define a fixed-size list type schema. The code correctly uses this pattern.training/run_prep_with_rss.sh (2)
27-33: LGTM! Proper background process handling.The
-uflag for unbuffered Python output ensures logs are flushed immediately, and capturing$!for the PID is correct for the monitoring loop.
39-53: LGTM! Robust RSS polling with graceful handling.The loop correctly handles process termination (
kill -0), missing proc files (conditional|| true), and empty values (the-n "$rss"check). The polling approach is appropriate for long-running processes.docs/superpowers/specs/2026-04-17-harmony-350m-pretraining-design.md (1)
1-334: LGTM! Comprehensive and well-structured spec.The document clearly defines the training protocol, acceptance criteria, risk mitigations, and handoff expectations. The references to code locations align with the implementation changes in this PR.
training/ct87/diag_memory.py (2)
24-29: LGTM! Correct RSS reading from procfs.Reading
/proc/{pid}/statusand parsing VmRSS is the standard approach for process memory monitoring on Linux.
89-110: LGTM! Scenario C correctly mirrorsprepare_data.pyaccumulation pattern.The
array.array("H")usage with.extend()and.append(eos)matches the production code inprepare_data.py(context snippet 1), making this a valid baseline for comparison.
| ``` | ||
| training/checkpoints/harmony_474m_v1/ | ||
| ├── model_step_1500.safetensors + optimizer_step_1500.pt # ~20% budget — early undertrained teacher candidate | ||
| ├── model_step_3000.safetensors + optimizer_step_3000.pt # ~40% budget | ||
| ├── model_step_4500.safetensors + optimizer_step_4500.pt # ~60% budget | ||
| ├── model_step_6000.safetensors + optimizer_step_6000.pt # ~80% budget | ||
| ├── model_step_7500.safetensors + optimizer_step_7500.pt # ~95% budget | ||
| ├── model_step_7800.safetensors + optimizer_step_7800.pt # final (100%) — canonical checkpoint | ||
| ├── checkpoint.pt # rolling resumable (from --checkpoint-interval, overwritten) | ||
| ├── train.csv # training log | ||
| └── config.json # config metadata (auto-generated by train.py) | ||
|
|
||
| Step 3 post-processing may rename/package these for ZEB-138 consumer convenience (e.g., stripping optimizer state, packaging model+config+metadata as a single safetensors + JSON sidecar). | ||
| ``` |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add language specifier to fenced code block.
The file tree listing should specify a language (e.g., text or plaintext) for consistent rendering.
📝 Proposed fix
-```
+```text
training/checkpoints/harmony_474m_v1/🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 195-195: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/specs/2026-04-17-harmony-350m-pretraining-design.md` around
lines 195 - 208, The fenced code block that renders the checkpoint file-tree
(the block beginning with "training/checkpoints/harmony_474m_v1/") should
include a language specifier for consistent rendering; update the opening fence
from ``` to ```text (or ```plaintext) so the tree is rendered as plain text
across viewers and add no other changes to the block contents.
| starts = torch.randint(0, total - window + 1, (batch_size,), generator=rng) | ||
| batch = torch.stack([all_tokens_t[s : s + window] for s in starts]) | ||
| yield batch | ||
| yield batch.to(torch.long) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider pre-allocating the batch tensor to avoid repeated allocations.
Each iteration calls torch.stack() which allocates a new tensor, then .to(torch.long) allocates another. For high-throughput training, pre-allocating a reusable buffer could reduce allocation overhead.
That said, since the tensors move to GPU immediately after yield, the CPU allocation pressure may be negligible in practice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@training/ct87/train.py` around lines 92 - 94, Pre-allocate a reusable CPU
tensor and fill slices instead of calling torch.stack and .to each iteration:
create a buffer like batch_buffer = torch.empty((batch_size, window),
dtype=torch.long) once (adjacent to where rng/all_tokens_t/window/batch_size are
defined), then inside the loop generate starts = torch.randint(...,
generator=rng) and for i, s in enumerate(starts):
batch_buffer[i].copy_(all_tokens_t[s : s + window].to(dtype=batch_buffer.dtype))
(or use .to(dtype=batch_buffer.dtype, copy=False) if needed), and yield
batch_buffer; this avoids repeated torch.stack and extra .to allocations while
keeping symbols starts, batch_buffer, all_tokens_t, window, batch_size, and rng
unchanged.
Smoke test passed — all spec acceptance criteria met ✅Ran Config adjustment from the spec's smoke command: Results:
Training-loss curve: 10.87 → 10.65 → 9.93 → 8.74 → 7.72 → 7.52 (end warmup) → 7.41 → 7.31 → 7.25 → 7.07 → 6.95 → 6.80 → 6.66 → 6.54 → 6.44 → 6.37 → 6.25 → 6.17 → 6.05 → 5.98 Wall time: 71 minutes (22:12 → 23:23). Extrapolating to Step 2's 7800 steps at ~21s/step = ~46 h ≈ 2 days rather than the spec's 6-day projection. The 4090 is running ~3× faster than the conservative estimate (plausible: recent PyTorch Flash Attention + BF16 on Ada tensor cores). Operational note on VRAM: Test plan checkbox update:
Step 2 (full ~7800-step run) ready to launch once this PR is reviewed / merged. |
Bugs (Qodo / CodeAnt / CodeRabbit):
1. Empty-chunks crash in make_hf_dataloader: np.concatenate([]) raises
an unhelpful ValueError when the dataset has zero rows (can happen when
prep sees fewer tokens than seq_len and drops the partial chunk). Now
builds the chunk list first and falls back to np.empty(0, int32) when
empty so the informative `total < window` guard fires downstream.
New test (test_zero_rows_raises_friendly_error) locks in the behaviour
by writing a real 0-row IPC stream + sidecars and asserting the
actionable ValueError fires.
2. Docstring inaccuracy: the tensor isn't a zero-copy view of the arrow
buffer — np.concatenate materialises a new contiguous int32 buffer.
Per-chunk conversion is still zero-copy, but the final stream is a
copy; docstring now says so explicitly and calls out the 4 B/token
peak during init.
3. Hard glibc dependency: ctypes.CDLL("libc.so.6") raises OSError on
macOS/musl. Wrapped in try/except with a no-op fallback so prep can
still run there (at reduced heap-release effectiveness — the 40 B/token
regime stays on non-glibc). Same fix in diag_memory.py.
Nits picked up:
- Narrow `except Exception` → `except (ImportError, AttributeError)`
around pyarrow pool access (prepare_data.py, diag_memory.py).
- Combine nested `with pa.OSFile(...)` / `with pa.ipc.new_stream(...)`
into a single parenthesised with block.
- Remove extraneous f-prefixes from f-strings without placeholders
(diag_memory.py scenario headers).
- Split semicolon-joined `gc.collect(); malloc_trim()` calls into
separate statements (diag_memory.py main).
- Scenario D in diag_memory.py now calls the same release_all() helper
(gc + pyarrow pool + glibc trim) that prepare_data.py uses — previously
it was missing the pyarrow pool release and understated the
reclamation effectiveness.
- run_prep_with_rss.sh: `set -u` → `set -eu`, add `|| exit 1` on cd.
Not taken:
- Pre-allocate batch tensor in make_hf_dataloader (reviewer noted the
CPU allocation pressure is negligible since tensors move to GPU
immediately; agreed).
- Spec doc's fenced code block language specifier (not in this PR's
diff — it's from the upstream spec-correction commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review feedback addressed — commit
|
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@training/ct87/diag_memory.py`:
- Around line 23-28: rss_kb() currently assumes /proc is present and will crash
on non-Linux systems; update it to first detect platform or availability of
/proc and fall back to a portable method (or raise a clear error). Specifically,
in function rss_kb check for existence of f"/proc/{os.getpid()}/status" (or
sys.platform.startswith("linux")) and only parse VmRSS when present; otherwise
use a portable fallback like resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
(convert units as needed) or, if no reliable method is available, raise an
explicit NotImplementedError with a clear message; keep the function name rss_kb
and its int return contract.
In `@training/ct87/prepare_data.py`:
- Around line 180-197: Add early validation at the start of run_prepare_data():
ensure seq_len is an integer > 0 and val_fraction is a number in [0,1) (allow 0
but reject negative or >=1) and raise a clear ValueError or exit with a
descriptive message if these checks fail; this prevents proceeding to
tokenization/stream creation with invalid parameters (refer to run_prepare_data,
seq_len, val_fraction in the diff).
In `@training/ct87/train.py`:
- Around line 65-70: The code assumes load_from_disk(data_path) returns a single
Dataset with a .data attribute but run_prepare_data() may write train/ and val/
subdirs so callers can pass the parent dir and load_from_disk will return a
DatasetDict; update the logic after load_from_disk(data_path) to detect a
DatasetDict (instance of datasets.DatasetDict or has keys like "train"/"val")
and either select the appropriate split (e.g., dataset["train"] or
dataset.get("train")) before accessing .data.column("input_ids"), or raise a
clear ValueError telling callers to pass the specific split directory (e.g.,
".../train" or ".../val"); reference the variables/functions dataset,
load_from_disk, and the access to .data.column("input_ids") when making the
change.
In `@training/run_prep_with_rss.sh`:
- Around line 55-59: Because set -e can cause the script to exit immediately
when wait "$PID" fails, replace the bare wait with an if/else that runs wait
"$PID" and captures its exit status into RC (e.g., if wait "$PID"; then RC=0;
else RC=$?; fi) so that the subsequent lines that append "EXIT=$RC TS=..." to
"$RSS_CSV" and exit with RC always execute; locate the block using wait, PID, RC
and RSS_CSV and implement the if/else around wait, then use the captured RC for
the logging and final exit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7e07aa42-8527-4745-9eba-d0a96a1635c1
📒 Files selected for processing (5)
training/ct87/diag_memory.pytraining/ct87/prepare_data.pytraining/ct87/train.pytraining/run_prep_with_rss.shtraining/tests/test_train.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{sh,bash}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{sh,bash}: Always use non-interactive flags with shell file operations (cp, mv, rm) to avoid hanging on confirmation prompts. Use: cp -f source dest, mv -f source dest, rm -f file, rm -rf directory, cp -rf source dest
Use non-interactive flags with scp: use -o BatchMode=yes for non-interactive mode
Use non-interactive flags with ssh: use -o BatchMode=yes to fail instead of prompting
Use non-interactive flags with apt-get: use -y flag
Use non-interactive flags with brew: use HOMEBREW_NO_AUTO_UPDATE=1 environment variable
Files:
training/run_prep_with_rss.sh
🪛 Ruff (0.15.10)
training/ct87/diag_memory.py
[warning] 81-81: Use enumerate() for index variable n in for loop
(SIM113)
[warning] 176-176: Missing return type annotation for private function fresh
(ANN202)
🔇 Additional comments (2)
training/tests/test_train.py (1)
113-171: Good regression coverage for empty Arrow datasets.This exercises the same streaming-IPC + sidecar layout that
prepare_datanow emits and locks in the friendlyValueErrorinstead of NumPy's concatenate failure.training/ct87/train.py (1)
75-89: Nice empty-dataset fallback.Falling back to an empty
int32buffer preserves the user-facingtotal < windowerror instead of leakingnp.concatenate([])out of the loader.
Issues addressed (CodeRabbit + Cursor):
1. Redundant pyarrow import (Cursor): pa is already imported unconditionally
above the try/except, so the fallback branch was unreachable. Dropped
the try/except — _pa_pool comes from the unconditional `pa` import.
2. Parameter validation in run_prepare_data (CodeRabbit): added fail-fast
checks for seq_len > 0 and val_fraction in [0.0, 1.0). A bad arg would
otherwise surface hours into a prep run after CPU+network tokenization
had already completed.
3. DatasetDict detection in make_hf_dataloader (CodeRabbit): if a caller
passes a parent directory containing train/val subdirs, load_from_disk
returns a DatasetDict and the subsequent .data access would AttributeError.
Now detects and raises ValueError with a clear fix hint ("pass
.../train or .../val").
4. /proc guard in diag_memory.rss_kb (CodeRabbit): used to open
/proc/<pid>/status unconditionally; now raises an explicit RuntimeError
on platforms without /proc. Chose fail-fast over falling back to
resource.getrusage.ru_maxrss — the latter is peak RSS, not current,
which would be a confusing semantic shift for a growth-monitoring
diagnostic.
5. wait + set -e in run_prep_with_rss.sh (CodeRabbit): `set -e` caused a
bare `wait "$PID"` to exit the script before the CSV's EXIT= row and
the stdout DONE line were written on a failed prep run — defeating
the whole post-mortem purpose. Now wraps wait in if/else and captures
RC so the exit record always writes.
Tests added:
- test_rejects_zero_seq_len, test_rejects_negative_seq_len,
test_rejects_negative_val_fraction, test_rejects_full_val_fraction
(TestParameterValidation class in test_prepare_data.py)
- test_datasetdict_raises_friendly_error (TestHfDataloaderGuard in
test_train.py; builds a real DatasetDict, saves, verifies the
dataloader surfaces the actionable error)
All 18 prepare_data + dataloader-guard tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 2 of review feedback addressed — commit
|
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
training/ct87/train.py (1)
65-99:⚠️ Potential issue | 🟡 MinorValidate
seq_lenandbatch_sizebefore loading data.
seq_len=0currently slips through aswindow=1, andbatch_size=0does not fail untiltorch.stack([]). Both should be rejected here with a clearValueErrorinstead of surfacing as downstream training/runtime errors.♻️ Proposed fix
def make_hf_dataloader( data_path: str, seq_len: int, batch_size: int, seed: int = 42, ) -> Iterator[torch.Tensor]: @@ import numpy as np from datasets import DatasetDict, load_from_disk + + if seq_len <= 0: + raise ValueError(f"seq_len must be > 0, got {seq_len!r}") + if batch_size <= 0: + raise ValueError(f"batch_size must be > 0, got {batch_size!r}") dataset = load_from_disk(data_path)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@training/ct87/train.py` around lines 65 - 99, Before calling load_from_disk in train.py, validate that seq_len and batch_size are positive integers: check that seq_len > 0 and batch_size > 0 and raise a clear ValueError if not (e.g. "seq_len must be > 0" and "batch_size must be > 0"); this should be done prior to computing window = seq_len + 1 (and before any data loading such as load_from_disk or operations that use seq_len/window) so invalid values are rejected early with an actionable error instead of failing later in torch.stack or downstream logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@training/ct87/diag_memory.py`:
- Around line 92-98: Scenario A currently counts empty/whitespace examples (in
the loop that updates total_chars, n, and calls report("A", ...)), causing
n_docs to represent a different workload than B/C/D; update the loop that
iterates over ds to skip examples where example["text"] is empty or only
whitespace (e.g., check example["text"].strip() == "" and continue) before
incrementing total_chars or n and before the n >= n_docs break, so counting and
reporting in Scenario A match the tokenizing scenarios.
In `@training/run_prep_with_rss.sh`:
- Around line 27-65: The script currently launches ct87.prepare_data in
background and never forwards INT/TERM, leaving an orphaned child; add a trap at
top-level that on SIGINT and SIGTERM sends a termination signal to the
background PID (use the PID variable set after launching), then waits for that
PID so the existing footer (the wait/RC capture, RSS_CSV final line and exit)
always runs; ensure the trap handler checks PID is set/non-empty before kill,
forwards the signal (TERM) and optionally falls back to KILL if the child
doesn't exit within a short timeout, then returns so the script continues to the
existing wait/exit path.
- Line 25: The header written to "$RSS_CSV" declares columns
"ts,rss_kb,vmhwm_kb,vmpeak_kb,vmsize_kb,state" but later the script appends a
footer like "EXIT=... TS=..." which is not a valid CSV row for that schema;
update the header to include explicit footer/event columns (e.g., add
"exit_code,event_ts") or change the footer emission to write a normal CSV row
matching the existing header (populate ts and state columns and put the exit
code into a new column if you add one). Locate the header write (echo
"ts,rss_kb,vmhwm_kb,vmpeak_kb,vmsize_kb,state" > "$RSS_CSV") and the
footer/emitted line (the EXIT=... TS=... write around line 63) and make them
consistent so all appended lines are valid CSV rows.
---
Outside diff comments:
In `@training/ct87/train.py`:
- Around line 65-99: Before calling load_from_disk in train.py, validate that
seq_len and batch_size are positive integers: check that seq_len > 0 and
batch_size > 0 and raise a clear ValueError if not (e.g. "seq_len must be > 0"
and "batch_size must be > 0"); this should be done prior to computing window =
seq_len + 1 (and before any data loading such as load_from_disk or operations
that use seq_len/window) so invalid values are rejected early with an actionable
error instead of failing later in torch.stack or downstream logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 41acd4e8-3824-4128-869e-d82e700d5837
📒 Files selected for processing (6)
training/ct87/diag_memory.pytraining/ct87/prepare_data.pytraining/ct87/train.pytraining/run_prep_with_rss.shtraining/tests/test_prepare_data.pytraining/tests/test_train.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{sh,bash}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{sh,bash}: Always use non-interactive flags with shell file operations (cp, mv, rm) to avoid hanging on confirmation prompts. Use: cp -f source dest, mv -f source dest, rm -f file, rm -rf directory, cp -rf source dest
Use non-interactive flags with scp: use -o BatchMode=yes for non-interactive mode
Use non-interactive flags with ssh: use -o BatchMode=yes to fail instead of prompting
Use non-interactive flags with apt-get: use -y flag
Use non-interactive flags with brew: use HOMEBREW_NO_AUTO_UPDATE=1 environment variable
Files:
training/run_prep_with_rss.sh
🪛 Ruff (0.15.10)
training/tests/test_prepare_data.py
[warning] 98-99: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
[warning] 107-108: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
[warning] 116-117: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
[warning] 125-126: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
training/ct87/train.py
[warning] 71-75: Prefer TypeError exception for invalid type
(TRY004)
[warning] 71-75: Avoid specifying long messages outside the exception class
(TRY003)
training/ct87/diag_memory.py
[warning] 32-36: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 94-94: Use enumerate() for index variable n in for loop
(SIM113)
[warning] 189-189: Missing return type annotation for private function fresh
(ANN202)
training/ct87/prepare_data.py
[warning] 81-81: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 83-85: Avoid specifying long messages outside the exception class
(TRY003)
1. Scenario A workload consistency (CodeRabbit, minor): scenario A was counting empty/whitespace rows while B/C/D skipped them, so n_docs meant different workloads across scenarios and the RSS comparison drifted depending on how many blanks happened to land in the first n_docs of the stream. Now all four scenarios apply the same `if not text or not text.strip(): continue` filter. 2. Signal forwarding in run_prep_with_rss.sh (CodeRabbit, major): if the wrapper was interrupted (Ctrl-C / SIGTERM), the background ct87.prepare_data kept running as an orphan — a real problem for multi-hour prep jobs where the operator thought they stopped the run. Now traps INT/TERM and forwards SIGTERM to $PID, then lets the existing monitor loop see the child die and run the normal footer/exit path. Trap disarms itself on first fire so a second Ctrl-C can force immediate wrapper exit. 3. CSV-schema-valid footer in run_prep_with_rss.sh (CodeRabbit, minor): the old `EXIT=$RC TS=...` footer wasn't a valid row for the declared 6-column header, so csv.DictReader / pandas.read_csv would choke on the last line. Now emits `ts,,,,,EXIT:<rc>` — 6 fields matching the header, with the exit code in the `state` column (distinguishable from normal single-char process states R/S/D/Z). 4. seq_len + batch_size validation in make_hf_dataloader (CodeRabbit, outside-diff, minor): seq_len=0 would slip through as window=1 and yield degenerate (batch, 1) batches; batch_size=0 would later crash in torch.stack on an empty list. Now fails fast with clear ValueErrors at function entry. Two new unit tests. All 20 prepare_data + dataloader-guard tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 3 of review feedback addressed — commit
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c35eca5. Configure here.
| train_np = chunks_np | ||
| val_np = None | ||
| num_train = int(train_np.shape[0]) | ||
| num_val = int(val_np.shape[0]) if val_np is not None else 0 |
There was a problem hiding this comment.
Production chunking/splitting logic lacks dedicated unit tests
Low Severity
run_prepare_data no longer calls chunk_stream, concatenate_and_chunk, or split_chunks — it uses inline numpy operations (np.frombuffer / reshape / slice) for chunking and splitting. However, the unit tests in TestConcatenateAndChunk and TestSplitChunks still only exercise the old list-based helpers, which are now dead code in the production path. The actual production chunking/splitting logic has no dedicated unit-level coverage (only the end-to-end @pytest.mark.network tests exercise it). The old functions could be removed or the tests redirected to cover the new numpy-based logic.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit c35eca5. Configure here.
🎉 Phase 2 pretraining completeFull 7800-step run of 474M target model on FineWeb-Edu-3B finished cleanly.
Config adjustment from spec: launched with B=8 grad_accum=16 + Loss trajectory (train / val at each archival):
Val loss dipped slightly in the late steps (3.31 → 3.36 → 3.23) as the lr-decay schedule kicked in at ~step 7000 (1.62e-4 → 1.15e-4 → 5.4e-5 → 0 linear to step 7800). Loss curve is clean — no divergence, no plateau before decay, no overfitting signal. On-disk artifacts in
Ready for ZEB-138's 2×2 scale + teacher-match matrix. PR test-plan checkbox update
|
|
CodeAnt AI is running the review. |
User descriptionSummaryFour independent memory bugs collectively blocked ZEB-137 Step 0 (FineWeb-Edu-3B prep) and the downstream smoke test. Each surfaced as the one before it was removed; all four are fixed in this PR. The four bugs1.
|
Sequence DiagramThis PR changes how FineWeb Edu data is preprocessed and consumed by training, switching to a memory efficient streaming pipeline and a flat int32 token buffer that supports multi billion token datasets. sequenceDiagram
participant Operator
participant DataPrep
participant HFData as HF dataset API
participant Storage
participant Training
participant Dataloader
Operator->>DataPrep: run_prepare_data(output_dir, seq_len, val_fraction, max_tokens)
DataPrep->>DataPrep: Validate parameters and initialize tokenizer and token buffer
DataPrep->>HFData: Stream documents from FineWeb Edu
HFData-->>DataPrep: Texts to tokenize and append with EOS into flat token stream
DataPrep->>Storage: Write train and val Arrow datasets plus metadata
Training->>Dataloader: make_hf_dataloader(data_path, seq_len, batch_size)
Dataloader->>Storage: Load Arrow dataset from disk
Dataloader->>Dataloader: Build flat int32 token tensor and sample random windows
Dataloader-->>Training: Yield infinite batches of token windows for training
Generated by CodeAnt AI |
|
CodeAnt AI is running the review. |
User descriptionSummaryFour independent memory bugs collectively blocked ZEB-137 Step 0 (FineWeb-Edu-3B prep) and the downstream smoke test. Each surfaced as the one before it was removed; all four are fixed in this PR. The four bugs1.
|
Sequence DiagramThis PR changes how FineWeb-Edu is tokenized and stored, and how the training loop loads it, to support 3B token corpora with bounded memory by streaming into a compact Arrow format and building a flat token tensor for random-window batching. sequenceDiagram
participant Operator
participant PrepPipeline
participant HFSource as HF dataset API
participant Storage
participant TrainLoop
participant Model
Operator->>PrepPipeline: Run prepare_data with seq_len, max_tokens, val_fraction
PrepPipeline->>HFSource: Stream FineWeb Edu documents
PrepPipeline->>PrepPipeline: Tokenize to uint16 buffer with EOS and periodically release heap
PrepPipeline->>Storage: Chunk buffer, split train and val, write Arrow datasets with fixed size lists
Operator->>TrainLoop: Run train with target config and prepared train and val paths
TrainLoop->>TrainLoop: Initialize dataloader that builds flat int32 token tensor from Arrow chunks
TrainLoop->>TrainLoop: Sample random windows of seq_len plus one tokens and cast batches to int64
TrainLoop->>Model: Feed batches for forward and backward training steps
Generated by CodeAnt AI |
PR Code Suggestions ✨Latest suggestions up to commit
|
| Category | Suggestion | Severity |
| Logic error |
An interrupted sleep causes premature script termination under errexit, skipping normal child shutdown and final loggingWith training/run_prep_with_rss.sh [64] Why it matters? 🤔
Steps of Reproduction ✅1. Run the wrapper script `training/run_prep_with_rss.sh` with valid arguments (e.g., `./training/run_prep_with_rss.sh /tmp/out 1000000`) so it starts `ct87.prepare_data` in the background at lines 27–33 and records logs to `training/logs/prep_<tag>.log`.
2. Observe that the script has `set -eu` enabled at line 8 and that it enters the sampling loop at line 51 (`while kill -0 "$PID" 2>/dev/null; do`) where it periodically reads `/proc/$PID/status` and then executes `sleep "$INTERVAL"` at line 64 between samples.
3. While the script is running and currently in the `sleep "$INTERVAL"` call at line 64, send SIGINT from the controlling terminal by pressing Ctrl-C; the shell receives SIGINT, runs the `_forward_signal` trap handler defined at lines 39–44 and installed at line 45, which forwards `TERM` to the child process.
4. Because `sleep` is interrupted by SIGINT, it exits with a non-zero status, and under `set -e` at line 8 this non-zero exit from `sleep "$INTERVAL"` (a simple command in the loop body) causes the wrapper to terminate immediately, so the script never reaches the `wait "$PID"` block at lines 68–72 nor the final `EXIT:<rc>` CSV write at line 75, leaving the RSS CSV without a terminal exit-row despite the child having been signaled.Fix in Cursor | Fix in VSCode Claude (Use Cmd/Ctrl + Click for best experience) Prompt for AI Agent 🤖This is a comment left during a code review.
**Path:** training/run_prep_with_rss.sh
**Line:** 64:64
**Comment:**
*Logic Error: With `set -e` enabled, `sleep` can return a non-zero status when interrupted by INT/TERM, which makes the wrapper exit immediately before `wait` runs and before the final `EXIT:<rc>` row is written. Make the sleep interruption non-fatal so the script can always reach the controlled shutdown path.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix | Major |
|
CodeAnt AI finished running the review. |
PR Code Suggestions ✨Latest suggestions up to commit
|
| Category | Suggestion | Severity |
| Possible bug |
Running the module from the wrong working directory can make the training package unimportable at runtimeThe script changes into the repository root and then runs training/run_prep_with_rss.sh [16-17] Why it matters? 🤔
Steps of Reproduction ✅1. Observe package layout: `ct87` exists only as a source package under `training/ct87` (verified via `ls /workspace/harmony/training`, which shows a `ct87` directory, and `/workspace/harmony/training/ct87/prepare_data.py` lines 1–12 define the module).
2. Note the documented usage in `/workspace/harmony/training/ct87/prepare_data.py:7-11`, which explicitly says:
`Run from the training/ directory:` followed by `python3 -m ct87.prepare_data --output ../data/fineweb-edu-poc ...`, implying reliance on `training/` being on `sys.path` rather than an installed `ct87` package.
3. Also see `/workspace/harmony/docs/findings/2026-04-18-harmony-teacher-oracle-validation.md:147-148`, which documents running another module similarly via `cd ~/work/zeblithic/harmony/training` then `.venv/bin/python -m ct87.generate_oracle_table`, again indicating a workflow that runs modules from `training/` with source layout, not via an installed package.
4. From the repo root `/workspace/harmony`, run the new wrapper (added in this PR) as `bash training/run_prep_with_rss.sh /tmp/out 1000000`. The script at `training/run_prep_with_rss.sh:16-17` sets `WT="$(cd "$(dirname "$0")/.." && pwd)"` and `cd "$WT"`, so it executes in the repo root, then invokes `training/.venv/bin/python -u -m ct87.prepare_data` at line 27. In an environment matching the documented workflow (where `ct87` lives only under `training/ct87` and has not been installed into `training/.venv`), Python's import machinery does not see a top-level `ct87` package from the repo root, and the process fails immediately with `ModuleNotFoundError: No module named 'ct87'` before any data prep or RSS sampling occurs.Fix in Cursor | Fix in VSCode Claude (Use Cmd/Ctrl + Click for best experience) Prompt for AI Agent 🤖This is a comment left during a code review.
**Path:** training/run_prep_with_rss.sh
**Line:** 16:17
**Comment:**
*Possible Bug: The script changes into the repository root and then runs `python -m ct87.prepare_data`, but `ct87` lives under `training/` and is typically run from that directory. Without installing the package into the venv, this will fail with `ModuleNotFoundError`. Run from `training/` (or set `PYTHONPATH`/use an installed package path) before invoking the module.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix | Critical |
|
CodeAnt AI finished running the review. |


Summary
Four independent memory bugs collectively blocked ZEB-137 Step 0 (FineWeb-Edu-3B prep) and the downstream smoke test. Each surfaced as the one before it was removed; all four are fixed in this PR.
The four bugs
1.
prepare_data.py—list[int]token accumulationPython lists of ints cost ~36 B/token (28 B PyLong + 8 B slot). At 3 B tokens that's 108 GB. Fixed with
array.array('H')(2 B/token; Mistral vocab 32000 fits uint16 with a loud-fail assert for larger vocabs).2.
prepare_data.py— glibc + pyarrow pool retentionEven after Python frees per-document
tokenizer.encode()allocations, glibc holds freed segments in its arenas (~33 B/token leaked fragmentation) and pyarrow retains streamed shard chunks in its own pool (~1 GB). Without draining, RSS climbs past 7 GB at 100 M tokens; at 3 B it OOMs. Fixed withgc.collect()+pa.default_memory_pool().release_unused()+libc.malloc_trim(0)every 10k docs — keeps RSS flat at ~2.6 GB steady state.3.
prepare_data.py—Dataset.from_dictmaterialization + int32 offset overflowDataset.from_dict(arr_2d).save_to_disk()materializes the full dataset in memory during conversion (needs >2× data size), spiking 29 GB peak at 3 B tokens. The default variable-lengthListArraypath ALSO usespa.int32()offsets, overflowing whenn_rows × seq_len > 2^31 ≈ 2.15 B— the 3 B-token prep produces ~2.99 B items and fails withArrowInvalid: Value 2147483648 too large to fit in C integer type. Fixed with apa.ipc.RecordBatchStreamWriterwriting 10k-row batches directly. Schema isFixedSizeListArray(no offsets at all; also ~0.05% more compact on disk). Manually emit thedataset_info.jsonandstate.jsonsidecars thatload_from_diskexpects. Peak per-batch memory bounded at ~80 MB.4.
train.py::make_hf_dataloader— samelist[int]bug, different code pathfor example in dataset: all_tokens.extend(example["input_ids"])→torch.tensor(all_tokens, dtype=torch.long). Same 36 B/token overhead, OOM at the same threshold. Fixed withdataset.data.column("input_ids").chunks[*].values.to_numpy(zero_copy_only=True)→np.concatenateinto an int32 buffer →torch.from_numpy. Batches cast to int64 at yield time fornn.Embeddingcompatibility. Handles bothFixedSizeList(new prep output) and legacy variable-lengthListcolumns.Validation
TestCsvLogging,TestOverfit,TestGradientAccumulation) exist on HEAD without this patch — they track against recent ι work adding forensic columns that the tests haven't been updated for.data/fineweb-edu-3b/train: 1,451,598 chunks = 2.97 B tokens, 12 GBdata/fineweb-edu-3b/val: 14,662 chunks = 30 M tokens, 115 MBList(Value('int32'), length=2048)(FixedSizeList)load_from_diskroundtrips cleanly, including row 1,000,000 (above the old int32-offset-overflow threshold)(32, 2049)int64 with values in Mistral vocab range [2, 30827]. Steady-state RAM during training projects to ~18-20 GB.Operational note
The worktree's
training/.venvis a symlink tomain/training/.venv, whose editablect87install has a hardcodedMAPPINGin__editable___ct87_0_1_0_finder.pypointing atmain/training/ct87/. Edits to worktree source files are silently ignored unless the MAPPING is redirected. Not in-code but noted here and in the session memory for future sessions.Also included
training/run_prep_with_rss.sh: wraps a prep run with an RSS sampler writing CSV alongside the stdout log. Useful for any future streaming-tokenization runs.training/ct87/diag_memory.py: reproduces the four memory-leak patterns (A: iterate text only, B: tokenize + discard, C: tokenize + accumulate, D: C with periodic trim) in a single script. Concrete artifact for the investigation; kept for future debugging.Test plan
🤖 Generated with Claude Code
Note
Medium Risk
Touches the data-prep and training dataloader paths, changing in-memory representations and Arrow serialization; regressions could surface as subtle data corruption or high RAM usage during training at scale.
Overview
Enables multi-billion-token pretraining runs by reworking
ct87.prepare_dataandmake_hf_dataloaderto avoid Pythonlist[int]token materialization and to load tokens from Arrow more efficiently.prepare_data.pynow tokenizes into anarray.array('H'), periodically releases retained heap/Arrow memory during streaming, chunks via zero-copy NumPy views, and writes HF-compatible datasets directly with PyArrow streaming IPC using fixed-size lists to avoid offset overflow and large peak memory.Adds operational tooling (
ct87.diag_memory.py,run_prep_with_rss.sh) plus new unit tests for parameter validation and friendlier dataloader failure modes (bad args, DatasetDict paths, empty datasets).Reviewed by Cursor Bugbot for commit c35eca5. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit
Documentation
Chores
Tests