Issue 42: add Boltz tracing + SLURM runner + docs #45

Open
ArishV1 wants to merge 6 commits into AI2Science:main from ArishV1:issue42-boltz-tracing

@ArishV1 ArishV1 commented Feb 22, 2026

Context

  • Delivers end-to-end Boltz-2 inference + tracing for Issue #42 (VizFold-style attention traces and arc diagrams).
  • Prioritizes reproducibility on PACE ICE / Slurm GPU: scratch env, deterministic runner, stable OUT_RUN naming, consistent layout, and automated validation, including strict checks aligned with what the Slurm runner enforces.

New Functionality

Tracing injection (no Boltz fork)

  • Tracing remains via boltz_trace/sitecustomize.py, active when BOLTZ_TRACE_DIR is set (runner puts boltz_trace first on PYTHONPATH).
  • Exports attention-style traces in VizFold’s text format:
    • msa_row_attn_layer{L}.txt
    • triangle_start_attn_layer{L}_residue_idx_{r}.txt
    • triangle_end_attn_layer{L}_residue_idx_{r}.txt
  • Supports BOLTZ_TRACE_LAYERS, BOLTZ_TRACE_RESIDUES, BOLTZ_TRACE_TOPK, BOLTZ_TRACE_HEAD=all|<idx> (runner maps BOLTZ_TRACE_HEAD → internal TRACE_HEAD; default all).
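
As a rough illustration of how the knobs above could be read, here is a minimal sketch. The helper name `read_trace_config` and the comma-separated syntax for layers/residues are assumptions for illustration; they are not taken from boltz_trace/sitecustomize.py.

```python
import os
from typing import Mapping, Optional


def _parse_int_list(raw: str) -> list[int]:
    """Parse a comma-separated list such as "0,3,7" (assumed syntax)."""
    return [int(tok) for tok in raw.split(",") if tok.strip()]


def read_trace_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return trace settings, or None when tracing is disabled.

    Hypothetical helper: the real hook may parse these variables differently.
    """
    trace_dir = env.get("BOLTZ_TRACE_DIR")
    if not trace_dir:
        return None  # tracing is only active when BOLTZ_TRACE_DIR is set
    head = env.get("BOLTZ_TRACE_HEAD", "all")
    topk = env.get("BOLTZ_TRACE_TOPK")
    return {
        "trace_dir": trace_dir,
        "layers": _parse_int_list(env.get("BOLTZ_TRACE_LAYERS", "")),
        "residues": _parse_int_list(env.get("BOLTZ_TRACE_RESIDUES", "")),
        "topk": int(topk) if topk else None,
        # "all" keeps every head; anything else is treated as a head index.
        "head": "all" if head == "all" else int(head),
    }
```

Returning None when BOLTZ_TRACE_DIR is unset mirrors the "active only when set" behaviour described above, so an untraced run pays no parsing cost.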

Proxy-head expansion for format compatibility

  • scripts/boltz/expand_proxy_heads.py: when traces are effectively single-head in proxy mode, duplicates blocks so the arc pipeline can target a consistent multi-head layout (default 4 heads in the runner).
  • Explicitly documented and logged as a format shim (not independent per-head attention in proxy mode); exit status wired for scripting.
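
To make the shim concrete, the sketch below duplicates a single-head block into N identical heads. It assumes a hypothetical text layout in which each block starts with a `Layer L, Head H` header followed by edge lines; the actual VizFold header syntax and the real logic in expand_proxy_heads.py may differ.

```python
import re

# Assumed header syntax, for illustration only.
HEADER_RE = re.compile(r"^Layer (\d+), Head (\d+)$")


def expand_single_head(text: str, n_heads: int = 4) -> str:
    """Duplicate a single-head trace block so the file lists n_heads heads.

    The copies are identical: this is a format shim, not per-head attention.
    Files that do not contain exactly one head block are returned unchanged.
    """
    lines = text.splitlines()
    headers = [i for i, ln in enumerate(lines) if HEADER_RE.match(ln)]
    if len(headers) != 1:
        return text
    start = headers[0]
    layer = HEADER_RE.match(lines[start]).group(1)
    edges = lines[start + 1:]
    out = lines[:start]  # keep anything that precedes the header
    for head in range(n_heads):
        out.append(f"Layer {layer}, Head {head}")
        out.extend(edges)
    return "\n".join(out) + "\n"
```

Because already-expanded files are left untouched, running the sketch twice on the same file is harmless.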

Optional activation dumps

  • Writes act_npz/pairformer_boltz/*.npz when BOLTZ_ACT_DIR is set (the runner sets it); under --strict, the validator checks the expected keys/shapes (pair_norm, pair_slice).

Offline format + layout regression (CPU, no GPU)

  • scripts/boltz/check_trace_format_fixtures.py + committed scripts/boltz/fixtures/*.txt for trace text format checks.
  • tests/test_boltz_trace_validate.py: builds a minimal OUT_RUN and runs validate_boltz_traces.py --strict, plus a negative test for manifest.json run_dir mismatch.

Runner / Reproducibility

Environment build (scratch, GPU-ready, avoids conflicts)

  • environment_boltz.yml, scripts/boltz/boltz_pip_extras.txt, scripts/boltz/setup_boltz_env.sh:
    • conda env create/update on $SCR, pip extras + pinned boltz, smoke checks, boltz CLI verification.
    • Header hints emphasize fresh prefix vs in-place update when debugging RDKit/Boltz import issues (see Boltz_Inference.md).

Slurm runner

  • scripts/boltz/run_boltz_trace.sbatch, scripts/boltz/run_boltz_trace.sh
  • Runner behavior:
    • default BOLTZ_ENV=$SCR/conda/envs/boltz_clean (override with BOLTZ_ENV / sbatch --export=ALL).
    • RUN_ID includes ${SLURM_JOB_ID:-$$} to avoid collisions when multiple jobs start in the same second.
    • outputs pred/, attn_txt/, arc_png/, act_npz/, manifest.json; scratch caches (HF_HOME, TORCH_HOME, etc.).
    • boltz predict --no_kernels for ICE GPU compatibility (H100 and other allocated GPU types per site #SBATCH).
    • copies components/pairformer_boltz/attn_txt/*.txt to attn_txt/ root for legacy plot/validate consumers.
    • runs expand_proxy_heads.py when TRACE_HEAD=all (with explicit log line about duplicated proxy heads).
    • ends with validate_boltz_traces.py --strict (fail the job if validation fails).

Manifest

  • manifest.json: run_dir, repo path + short git sha, inputs, trace knobs, output subdirs, cache + Boltz flags; validated under --strict against the actual filesystem tree.

Validation + Visualization

Validation (scripts/boltz/validate_boltz_traces.py)

  • Core checks (always): required trace files, header/edge parsing, per-file head counts, heuristic arc_png count vs heads/layers/residues.
  • --strict additionally requires:
    • manifest.json present with required keys and realpath-aligned run_dir / outputs.* vs --run_dir
    • non-empty pred/ file tree
    • well-formed attn_txt/component_status.json (msa, pairformer_boltz, sm_boltz entries)
    • if act_npz/pairformer_boltz/*.npz exist, NumPy shape/key sanity checks

Arc diagram generation

  • Same VizFold utilities as before; PNG naming along the lines of:
    • arc_png/msa_row_head_*_layer_*_BOLTZ_arc.png
    • arc_png/tri_start_res_*_...
    • arc_png/tri_end_res_*_...

CI + repo hygiene

  • .github/workflows/undefined_names.yml: new boltz_trace_format job (fixture checker + unittest tests.test_boltz_trace_validate on Python 3.11); existing flake8 job unchanged.
  • .gitignore: ignore large run artifacts; un-ignore scripts/boltz/fixtures/*.txt and submission/issue42-demonstration-evidence/screenshots/*.png (root *.png is otherwise ignored).

Documentation

  • docs/source/Boltz_Inference.md:
    • Issue #42 (Extend VizFold Inference and Visualization to Boltz) scope table (env, traces, activations, structures, metadata, validation, CI).
    • Quickstart framed as PACE ICE / Slurm GPU (not H100-only); partition/GRES edit note.
    • Stronger clean-prefix / scratch guidance; RDKit sanity check; proxy vs true attention; expand_proxy_heads semantics.
    • Reviewer pack pointer: submission/issue42-demonstration-evidence/REPRODUCIBILITY.md (commands + embedded screenshots).
  • README.md: links Boltz doc + submission REPRODUCIBILITY.md; fixes “Openfold” to “Openfold implementation” spelling on the OpenFold line.

How to test

CPU (matches CI)

cd "$(git rev-parse --show-toplevel)"
python3 scripts/boltz/check_trace_format_fixtures.py
python3 -m unittest tests.test_boltz_trace_validate -v
# optional local parity with Actions’ syntax scan:
python3 -m venv /path/on/scratch/flake8-venv && . /path/on/scratch/flake8-venv/bin/activate
pip install flake8
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
deactivate

ICE / Slurm (full pipeline)

cd "$(git rev-parse --show-toplevel)"
export SCR=/storage/ice1/2/0/$USER

export ENV="$SCR/conda/envs/boltz_clean_fresh"
rm -rf "$ENV"   # first-time / troubleshooting
bash scripts/boltz/setup_boltz_env.sh

if ! command -v module >/dev/null 2>&1; then
  [ -f /etc/profile.d/modules.sh ] && source /etc/profile.d/modules.sh
  [ -f /usr/share/Modules/init/bash ] && source /usr/share/Modules/init/bash
fi
module load anaconda3 >/dev/null 2>&1 || true
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$ENV"

python - <<'PY'
from rdkit import rdBase
from rdkit.Chem import Mol
print("RDKit OK:", rdBase.rdkitVersion)
PY

export BOLTZ_ENV="$ENV"
mkdir -p outputs/boltz_runs
JOBID=$(sbatch --export=ALL scripts/boltz/run_boltz_trace.sbatch | awk '{print $4}')
echo "JOBID=$JOBID"

Wait for COMPLETED, then:

sacct -j "$JOBID" --format=JobID,State,ExitCode,Elapsed -n
tail -n 100 "outputs/boltz_runs/slurm-${JOBID}.out"

OUT_RUN=$(grep -m1 '^\[INFO\] OUT_RUN=' "outputs/boltz_runs/slurm-${JOBID}.out" | sed 's/.*OUT_RUN=//')
echo "OUT_RUN=$OUT_RUN"
[ -n "$OUT_RUN" ] || { echo "[ERROR] OUT_RUN is empty." >&2; exit 2; }

find "$OUT_RUN/attn_txt" -name "*.txt" | wc -l
find "$OUT_RUN/act_npz"  -name "*.npz" | wc -l
find "$OUT_RUN/arc_png"  -name "*.png" | wc -l

python3 scripts/boltz/validate_boltz_traces.py \
  --run_dir "$OUT_RUN" --layers 0 --residues 18 --strict

Expected: Slurm stdout ends with [PASS] validate_boltz_traces.py; strict re-run exits 0 with [OK] manifest.json ... paths match --run_dir and no [FAIL].
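
For scripted checks, the OUT_RUN lookup done above with grep/sed can also be sketched in Python. The `[INFO] OUT_RUN=` line format comes from the commands above; the helper name `find_out_run` is hypothetical.

```python
import re
from typing import Optional

# Matches the runner's "[INFO] OUT_RUN=<path>" log line.
OUT_RUN_RE = re.compile(r"^\[INFO\] OUT_RUN=(.*)$")


def find_out_run(log_text: str) -> Optional[str]:
    """Return the first non-empty OUT_RUN path in a Slurm log, else None."""
    for line in log_text.splitlines():
        match = OUT_RUN_RE.match(line)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return None
```

A caller would read `outputs/boltz_runs/slurm-<JOBID>.out`, pass its text to the function, and treat a None result as the empty-OUT_RUN error case.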

@Elliptic461

Arish, I am getting this error. It seems like nothing is being written to attn_txt/.
[Screenshot: Screenshot_20260319_223048]

@Vishram1123 Vishram1123 mentioned this pull request Mar 20, 2026
@jayvenn21

hey team! this is Jayanth, the team lead from the apache airavata esmfold team and i wanted to touch base since the semester is wrapping up. On the ESMFold side we're basically done as all intermediate extraction (attention, activations, trunk, structure module), validation tests, docs, and a frontend dashboard are merged on our fork. Suresh mentioned in our last meeting that he wants both ESMFold and Boltz work merged into the main vizfold repo before the semester ends.
a few questions on your end:

  • where are you at with Boltz extraction? Last I heard you were finishing the last of the three sections. is that done or close?
  • are your traces writing to the same archive format (structure/predicted.pdb, trace/attention/, trace/activations/, meta.json, etc.)? That'll make the merge a lot smoother if both backends produce the same layout.
  • do you guys have a branch or fork we should be looking at? Happy to review PRs or help resolve any integration issues on our side.

no rush on a detailed response, just want to make sure we're aligned so the merge doesn't become a last-minute scramble. Let me know if you need help with anything.

@ArishV1

ArishV1 commented Apr 24, 2026

Author

Hi,

1. Boltz extraction: we're done / merge-ready on our side. Slurm runs on ICE produce structure outputs under pred/, VizFold-style attention traces under attn_txt/, arc PNGs, optional act_npz/ summaries, and manifest.json, plus CPU validation and an extra Actions job so the Boltz path is checked without a GPU. The final step is landing the PR on main.

2. Same archive layout as ESMFold: not exactly. We use a flat OUT_RUN layout (pred/, attn_txt/, arc_png/, act_npz/, manifest.json) rather than structure/, trace/attention/, meta.json, etc. Same kind of artifacts, different paths/names. We should check with Suresh on whether we need a shared canonical tree or a small adapter before/after merge.

3. Branch: issue42-boltz-tracing. Here is the link: Issue 42: add Boltz tracing + SLURM runner + docs #45. It is pretty much final, except I might make a screen recording of the results and push it. Feel free to leave a review.

If any changes need to be made for alignment, Yin, Kevin and Ramasubramanian, Vishal Subramanian, feel free to coordinate with Vennamreddy, Jayanth. I believe the two PRs should not cause any automatic breakage on merge. For conflicts, you can check our PR description, which summarizes every file we edited in detail. Thank you!
