Download structural data from PDB-REDO and cluster local PDB/mmCIF complexes using chain-level sequence evidence from MMseqs and multimer structural evidence from Foldseek.
Python dependencies are managed with uv:
uv sync --devMMseqs and Foldseek are external binaries. To install the official project-local GPU-capable builds:
uv run pdbcluster bootstrap-tools --tool-dir .tools
uv run pdbcluster validate-tools --tool-dir .toolsThe bootstrap command downloads:
https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gzhttps://mmseqs.com/foldseek/foldseek-linux-gpu.tar.gz
The clustering CLI expects:
<data_dir>/<pdb_id>/<pdb_id>_final.cif
<data_dir>/<pdb_id>/<pdb_id>_final.pdb
CIF is preferred when both CIF and PDB exist. Foldseek receives native structure files through generated symlink directories. Biotite is used to extract every protein-chain sequence for MMseqs; final clusters are still assigned per PDB entry, not per chain.
uv run pdbcluster run \
--data-dir /path/to/pdbredo \
--out-dir /path/to/clusters \
--pdb-list ~/conf_dataset_list.txt \
--seq-id 0.30 \
--seq-cov 0.80 \
--complex-seq-cov 0.50 \
--tm 0.50 \
--struct-cov 0.80 \
--max-seqs 0 \
--gpu-devices 0,1,2,3 \
--threads 32The pipeline first runs MMseqs all-vs-all on chain FASTA records. Passing chain
hits are collapsed into structure-level sequence edges: for each candidate
PDB-entry pair, chains are matched 1-to-1 with an optimal (Hungarian) assignment,
and the pair is kept only if the matched chains cover at least --complex-seq-cov
of both entries. This min(qcov, tcov) complex-coverage rule keeps assemblies of
different stoichiometry (e.g. a monomer vs. its homo-tetramer) in separate clusters.
The connected components of the sequence-edge graph then gate Foldseek: structural
comparison only runs within a component, so Foldseek never wastes TMalign on
sequence-dissimilar pairs, and sequence singletons skip Foldseek entirely. Each
component with two or more members is compared all-vs-all with a single
foldseek easy-multimersearch call (--alignment-type 1), with monomers and
monomer-vs-multimer pairs handled via --monomer-include-mode 0 and
--min-aligned-chains 1. Structure coverage is enforced at alignment time by
Foldseek's -c <struct-cov>.
A pair survives fusion only if it has both a sequence edge and a structure edge
with min(qTM, tTM) >= --tm. Final clusters are the connected components of that
fused graph (single linkage), giving one cluster assignment per PDB entry.
--max-seqs 0 (the default) means "never truncate": it resolves to the total chain
count for the MMseqs search and to the current sequence-component size for each
Foldseek search. A positive --max-seqs is passed directly to both tools as an
approximate sensitivity/speed cap.
Pass --force to ignore all cached stage outputs and recompute from scratch. Use
--no-gpu to run the searches on CPU.
--pdb-list is optional. When present, discovery is restricted to the requested
entries instead of scanning every PDB directory under --data-dir, and only those
structures are prepared and symlinked into work/structures/. The list file should
contain one entry per line, either as a PDB-REDO stem such as 1abc_final or as a
bare PDB ID such as 1abc. Blank lines and comment lines starting with # are
ignored. The underscore alias --pdb_list is also accepted.
After clustering, generate train/valid/test split files from final_clusters.tsv:
uv run pdbcluster split \
--out-dir /path/to/clusters \
--split-name mydb \
--train 0.8 \
--valid 0.1 \
--test 0.1 \
--seed 42This writes three files to --out-dir:
mydb_train.txtmydb_valid.txtmydb_test.txt
Each line contains one PDB identifier in the form <pdb_id>_final.
Splitting is cluster-aware: all members of the same cluster are always placed in
the same split, preventing data leakage across sets. The --train/--valid/--test
values are relative weights (not required to sum to 1.0, so 8 1 1 is equivalent to
0.8 0.1 0.1).
Clusters with >= --max-cluster-size members (default 500) are unconditionally
placed in the training set. The remaining clusters are sorted by size descending
and assigned in test → valid → train order, so the largest eligible clusters land in
the eval sets and small or singleton clusters fill the training set. --seed breaks
ties among equal-size clusters and makes the assignment reproducible.
Tool stages write sidecar *.params.json files. A cached stage is reused only when
its expected outputs exist and the cached params exactly match the current command,
thresholds, resolved --max-seqs, tool version, and input fingerprint. Changing
inputs or relevant arguments reruns the affected stage automatically; Foldseek is
cached per sequence component, so adding new structures only recomputes the
components they touch. Pass --force to bypass every cache and recompute.
pdbcluster run prints concise progress lines as stages start, finish, hit the
cache, skip singleton sequence components, or fail. The same events are appended
to progress.jsonl as one JSON object per event. They include component IDs where
relevant and the tool log path for command stages. External tool stdout and stderr
are written to the stage .log file along with the command, exit code, and
elapsed time.
manifest.tsv: one row per protein chain for entries that parse successfully, one row per PDB entry for parse failures. Columns:pdb_id,chain_uid,chain_id,source_path,structure_path,format,sequence_length,status,error.work/chains.fasta: chain-level FASTA input for MMseqs.work/structures/: symlinks to native structure files.progress.jsonl: append-only stage progress events; the CLI also prints these events live while running.mmseqs/seq_edges.tsv: raw chain-level MMseqs evidence.mmseqs/seq_edges.log: MMseqs command, stdout/stderr, exit code, and elapsed time.mmseqs/structure_seq_edges.tsv: passing sequence evidence collapsed to PDB-entry pairs with matched chains, complex coverage, and evidence count.foldseek/components/<Sxxxxxx>/: per-sequence-component Foldseek work dirs, including the component structure symlinks andsearch.logfor Foldseek stdout/stderr.foldseek/structure_edges.tsv: multimer structure edges (min(qTM, tTM)per PDB-entry pair) parsed from Foldseekeasy-multimersearch_reportfiles. This is sequence-gated, not a global all-vs-all structure clustering. Coverage and interface LDDT are recorded asNA(Foldseek's-cenforces coverage during alignment; it does not emit per-complex coverage in this report).final_edges.tsv: fused edges that cleared both sequence and structure thresholds, with the sequence-support summary, multimer TM scores, and source sequence component.final_clusters.tsv: final cluster assignment per PDB entry — columnspdb_id,final_cluster,final_representative,sequence_component,sequence_length.run_manifest.json: commands run, versions, config, counts, and elapsed time.
The downloader accepts the RCSB search filters as command-line flags and writes the expected directory layout:
python download_pdb_redo.py /path/to/output \
--workers 16 \
--include-pdb \
--method "X-RAY DIFFRACTION" \
--max-resolution 3.0 \
--max-rfree 0.25 \
--min-residues 50 \
--max-residues 500 \
--polymer-entity-type "Protein (only)"Write one <pdb_id>_final stem per matching entry to a file. No output directory
is needed:
python download_pdb_redo.py \
--file-list queried_files.txt \
--method "X-RAY DIFFRACTION" \
--max-resolution 3.0 --max-rfree 0.25 \
--min-residues 50 --max-residues 500 \
--polymer-entity-type "Protein (only)"To see which entries from the query are absent or incomplete in an existing output
directory, use --check-dir. This logs a count to stderr and, combined with
--file-list, writes only the missing stems to a file:
python download_pdb_redo.py \
--check-dir /path/to/output \
--file-list missing_files.txt \
[filter flags...]--download-missing (requires --check-dir) downloads the entries that are absent
or incomplete. Before starting, it reads download_manifest.tsv from the output
directory (if present) and skips any entries that previously failed, so interrupted
runs can be resumed cleanly:
python download_pdb_redo.py \
--check-dir /path/to/output \
--download-missing \
[filter flags...]Run python download_pdb_redo.py --help to see the full set of knobs and defaults.