Lightweight Python toolkit for parsing Protein Data Bank (PDB) Nuclear Magnetic Resonance (NMR) structures and generating distance and torsion-angle constraints tailored for Distance Geometry Problem (DGP) workflows.
The package extracts structural information from PDB files and produces constraint sets that can be used in algorithms for protein structure determination.
Current release: v0.2.0 (2026-05-20).
Typical applications include:
- generation of distance-constraint lists
- generation of backbone torsion angles
- extraction of 3D coordinates
pdb-parser/
│
├── data/ # configuration and input lists
│ ├── params.cfg
│ ├── instance_reorder.cfg
│ └── pdb_ids.txt
│
├── src/pdb_parser/
│ ├── pdb_parser.py # CLI for parsing and writing output files
│ ├── instance_reorder.py # CLI for reordering previously written outputs
│ ├── geometry/ # distance geometry utilities
│ ├── io/ # PDB parsing and filtering routines
│ ├── pipeline/ # main parsing pipeline
│ ├── reordering/ # vertex ordering strategies
│ └── utils/ # auxiliary utilities
│
├── pyproject.toml
├── README.md
├── CITATION.cff
└── .gitignore
For development use from a local clone, install the package in editable mode:
python -m pip install -e .
This exposes the command-line entry points:
pdb-parser --help
instance-reorder --help
If you prefer not to install the package, run the modules directly from the source tree with PYTHONPATH=src:
PYTHONPATH=src python -m pdb_parser.pdb_parser --help
PYTHONPATH=src python -m pdb_parser.instance_reorder --help
From the repository root, the following commands should display the available command-line options:
PYTHONPATH=src python -m pdb_parser.pdb_parser --help
PYTHONPATH=src python -m pdb_parser.instance_reorder --help
./run_pipeline.sh --help
The parser stage is controlled by data/params.cfg.
The instance reordering stage is controlled by data/instance_reorder.cfg.
Both files use the same key: value syntax and support # comments.
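As a rough illustration of the `key: value` syntax, the sketch below reads such a file into a dictionary. It is not the parser's own loader: the function name `parse_cfg` is hypothetical, values are kept as plain strings, and treating `#` as a comment marker anywhere on a line (not only at the start) is an assumption.

```python
def parse_cfg(text):
    """Parse 'key: value' lines, skipping blank lines and '#' comments."""
    params = {}
    for raw_line in text.splitlines():
        # Assumption: everything after the first '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got: {raw_line!r}")
        params[key.strip()] = value.strip()
    return params


example = """
# parser settings
model_number: 1
chain_id: A            # chain to process
distance_constraints: interval_centered
"""
print(parse_cfg(example))
```

Type conversion (for example `float` for `max_distance`) is left to the caller in this sketch.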
Below is a complete description of the parameters currently supported by each stage.
After installing the package with pip install -e ., the parsing stage can be executed as
pdb-parser data/pdb_ids.txt data/params.cfg data/pdb data/seeds data/outputs
or, equivalently,
python -m pdb_parser.pdb_parser data/pdb_ids.txt data/params.cfg data/pdb data/seeds data/outputs
The arguments are:
| argument | description |
|---|---|
| `data/pdb_ids.txt` | text file containing the list of PDB identifiers to process |
| `data/params.cfg` | configuration file controlling the parser behavior |
| `data/pdb` | directory where PDB structures will be stored |
| `data/seeds` | directory where per-PDB seed files will be read and written |
| `data/outputs` | directory where the generated constraint files will be written |
The instance reordering stage consumes files already written by the parser and can be executed as
instance-reorder data/pdb_ids.txt data/instance_reorder.cfg data/outputs
or, equivalently,
python -m pdb_parser.instance_reorder data/pdb_ids.txt data/instance_reorder.cfg data/outputs
The arguments are:
| argument | description |
|---|---|
| `data/pdb_ids.txt` | text file containing the list of PDB identifiers to reorder |
| `data/instance_reorder.cfg` | configuration file controlling the instance reordering behavior |
| `data/outputs` | directory containing the files generated by the parser stage |
The ordering choice is configured through order_id in data/instance_reorder.cfg, with supported values from 1 to 10.
Among these orderings, order_id=1 corresponds to the preprint A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems (arXiv:2510.19970), and order_id=9 corresponds to the preprint An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem (arXiv:2508.09143).
The reordered files are written to data/outputs/<pdb_id>/reordered/.
The repository also provides run_pipeline.sh, a wrapper for running the parser and instance reordering stages from the source tree without installing the package.
Default usage runs the full pipeline:
./run_pipeline.sh data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
The script always accepts the same six positional arguments:
| argument | description |
|---|---|
| `data/pdb_ids.txt` | text file containing the list of PDB identifiers |
| `data/params.cfg` | configuration file for the parser stage |
| `data/instance_reorder.cfg` | configuration file for the instance reordering stage |
| `data/pdb` | directory where downloaded PDB structures will be stored |
| `data/seeds` | directory where per-PDB seed files will be read and written |
| `data/outputs` | directory where parser and reordered outputs are written |
Execution modes are selected with --mode:
./run_pipeline.sh --mode full data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
./run_pipeline.sh --mode parser data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
./run_pipeline.sh --mode reorder data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
./run_pipeline.sh --mode reorder-all data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
The reorder-all mode rewrites order_id in a temporary copy of instance_reorder.cfg and runs the reordering stage for order_id=1..10 by default. A custom list can be provided with --orders, for example:
./run_pipeline.sh --mode reorder-all --orders 1,3,9 data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
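The idea behind `reorder-all` (rewrite `order_id` in a copy of the configuration, then run the reordering stage once per value) can be sketched in Python. This is only an illustration of the text transformation, not the code of `run_pipeline.sh`; the helper name `with_order_id` is hypothetical, and appending the key when it is missing is an assumption.

```python
import re


def with_order_id(cfg_text, order_id):
    """Return a copy of the config text with its order_id value replaced."""
    if not 1 <= order_id <= 10:
        raise ValueError("order_id must be an integer from 1 to 10")
    new_text, count = re.subn(
        r"(?m)^(\s*order_id\s*:\s*)\S+", rf"\g<1>{order_id}", cfg_text
    )
    if count == 0:
        # Assumption: if the key is absent, append it at the end.
        new_text = cfg_text.rstrip("\n") + f"\norder_id: {order_id}\n"
    return new_text


base_cfg = "model_number: 1\nchain_id: A\norder_id: 9\n"
for order in (1, 3, 9):
    # Each copy would be written to a temporary file and passed to the
    # reordering stage; the original text is never modified.
    print(with_order_id(base_cfg, order))
```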
To save stage logs instead of printing command output directly, use --log-dir:
./run_pipeline.sh --mode reorder-all --log-dir logs data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
For each processed pdb_id, the parser stage writes files to
data/outputs/<pdb_id>/
Main parser outputs:
- `X_<pdb_id>_model<model_number>_chain<chain_id>.dat`: structure file with one atom per line.
  - Columns: `atom_id atom_name resid resname x y z`
  - Format: `%5d %-4s %4d %3s %8.3f%8.3f%8.3f`
- `A_<pdb_id>_model<model_number>_chain<chain_id>.dat`: angular constraint file with one residue per line.
  - Columns: `resid resname omega_center omega_radius phi_center phi_radius psi_center psi_radius`
  - Format: `%4d %3s %9.4f %9.4f %9.4f %9.4f %9.4f %9.4f`
- `I_<pdb_id>_model<model_number>_chain<chain_id>.dat`: distance-constraint file.
  - Columns: `atom_id_i atom_id_j resid_i resid_j d_l d_u atom_name_i atom_name_j resname_i resname_j`
  - Format: `%5d %5d %6d %6d %20.16f %20.16f %4s %4s %s %s`
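Because the parser-stage `X_*` format writes the three coordinates as `%8.3f%8.3f%8.3f` with no separator, neighbouring values can touch (for example `-100.500-200.250`), so splitting on whitespace alone is not safe. The sketch below is one way to read such a line; the function names are hypothetical, and it assumes each coordinate fits in 8 characters and that lines carry no trailing whitespace.

```python
X_FORMAT = "%5d %-4s %4d %3s %8.3f%8.3f%8.3f"


def format_atom(atom_id, atom_name, resid, resname, x, y, z):
    """Render one atom using the documented X_* line format."""
    return X_FORMAT % (atom_id, atom_name, resid, resname, x, y, z)


def parse_atom(line):
    """Parse one X_* line; coordinates are read as fixed-width fields."""
    line = line.rstrip("\n")
    # The last 3 * 8 characters hold x, y, z without separators.
    head, coords = line[:-24], line[-24:]
    atom_id, atom_name, resid, resname = head.split()
    x, y, z = (float(coords[k:k + 8]) for k in (0, 8, 16))
    return int(atom_id), atom_name, int(resid), resname, x, y, z


line = format_atom(12, "HA", 3, "GLY", -100.5, -200.25, 3.0)
print(repr(line))
print(parse_atom(line))
```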
Per-PDB seed files are stored in
data/seeds/<pdb_id>/
using the filename
seed_<pdb_id>.dat
If at least one random stream is needed for the current execution, the parser reuses the stored seeds when the file already exists, or creates the directory tree and writes the file before generating the random outputs.
If the current execution is fully deterministic, no seed file is needed.
In that case, the parser does not create seed_<pdb_id>.dat.
The file stores only the seed fields that can actually affect the
current execution. If a later run changes the stochastic components for
the same pdb_id, the parser updates the file accordingly.
The supported seed fields are:
- `seed_interval_centered_distances`: only when `distance_constraints = interval_centered`
- `seed_talos_angle_centers`: only when the effective torsion interval width is greater than `0`
- `seed_phi_psi_mask`: only when the `phi`/`psi` selection is genuinely random; for example, it is not stored when `percentage_backbone_torsion_angles = 100`
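The rules above can be summarized as a small decision function. This is a sketch of the documented conditions, not the parser's implementation: the name `active_seed_fields`, the default values used when a key is missing, and the reading of "genuinely random" as "percentage strictly between 0 and 100" are all assumptions. The `precise` override reflects the behavior described for that mode (zero torsion width, 100% of the angles).

```python
def active_seed_fields(cfg):
    """Return the seed fields that could affect a run with this config."""
    mode = cfg["distance_constraints"]
    width = float(cfg.get("torsion_angle_width", 0.0))  # assumed default
    pct = float(cfg.get("percentage_backbone_torsion_angles", 100.0))  # assumed default
    if mode == "precise":
        # precise mode forces exact torsion angles for every residue.
        width, pct = 0.0, 100.0

    fields = []
    if mode == "interval_centered":
        fields.append("seed_interval_centered_distances")
    if width > 0:
        fields.append("seed_talos_angle_centers")
    if 0 < pct < 100:  # assumption: the mask is random only in this range
        fields.append("seed_phi_psi_mask")
    return fields


cfg = {
    "distance_constraints": "interval_centered",
    "torsion_angle_width": "50.0",
    "percentage_backbone_torsion_angles": "100.0",
}
print(active_seed_fields(cfg))
```

An empty result corresponds to the fully deterministic case, in which no `seed_<pdb_id>.dat` file is written.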
The instance reordering stage writes files to
data/outputs/<pdb_id>/reordered/
Main reordered outputs:
- `X_<pdb_id>_model<model_number>_chain<chain_id>_ddgpHCorder<order_id>.dat`: coordinate-only file, one line per atom in reordered order.
  - Columns: `x y z`
  - Format: `%.3f %.3f %.3f`
- `I_<pdb_id>_model<model_number>_chain<chain_id>_ddgpHCorder<order_id>.dat`: reordered distance-constraint file.
  - Columns: `atom_id_i atom_id_j resid_i resid_j d_l d_u atom_name_i atom_name_j resname_i resname_j`
  - Format: `%5d %5d %6d %6d %20.16f %20.16f %4s %4s %s %s`
- `T_<pdb_id>_model<model_number>_chain<chain_id>_ddgpHCorder<order_id>.dat`: clique matrix used by the reordered instance.
  - Columns: `atom_id_i3 atom_id_i2 atom_id_i1 atom_id_i sign_tau abs_tau delta_tau`
  - Format: `%d %d %d %d %d %.6f %.6f`
Important: the reordered X_* file is intentionally different from the parser-stage X_* file. The parser-stage file stores atom identifiers and residue metadata together with coordinates, while the reordered file stores only the x y z coordinates in reordered atom order.
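For scripts that consume these files, the naming scheme of both stages can be assembled as follows. This is a convenience sketch with hypothetical helper names; it assumes the `pdb_id` appears in file and directory names exactly as written in `pdb_ids.txt`.

```python
from pathlib import Path


def parser_outputs(outputs_dir, pdb_id, model_number, chain_id):
    """Expected parser-stage files (X, A, I) for one structure."""
    stem = f"{pdb_id}_model{model_number}_chain{chain_id}"
    base = Path(outputs_dir) / pdb_id
    return {kind: base / f"{kind}_{stem}.dat" for kind in ("X", "A", "I")}


def reordered_outputs(outputs_dir, pdb_id, model_number, chain_id, order_id):
    """Expected reordering-stage files (X, I, T) for one structure."""
    stem = f"{pdb_id}_model{model_number}_chain{chain_id}_ddgpHCorder{order_id}"
    base = Path(outputs_dir) / pdb_id / "reordered"
    return {kind: base / f"{kind}_{stem}.dat" for kind in ("X", "I", "T")}


for path in reordered_outputs("data/outputs", "1TOS", 1, "A", 9).values():
    print(path.as_posix())
```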
The file
data/pdb_ids.txt
contains a list of PDB identifiers, one per line.
Example
1TOS
1UAO
Each identifier corresponds to a structure that will be retrieved and processed by the pipeline.
Selects which model from the NMR PDB structure will be used.
Specifies which chain in the selected model will be processed. Only atoms belonging to this chain are considered.
Defines which atoms are extracted from the PDB structure to build the protein graph used by the parser and by the reordering stage.
For the current public interface, the active option is:
| option | description |
|---|---|
| `backbone_plus_hydrogens` | backbone atoms N, CA, C plus the directly attached hydrogens needed by the article-oriented iDDGP construction |
This is also the atom-selection layout assumed by the current instance reordering stage.
The current workflow follows the protein-chain model described in the
manuscript section Generating iDDGP Instances from Protein Structures.
For residue i, the backbone is represented by the triplet
{N_i, Cα_i, C_i} and the backbone torsions are
- `phi_i := C_{i-1} - N_i - Cα_i - C_i`
- `psi_i := N_i - Cα_i - C_i - N_{i+1}`
- `omega_i := Cα_{i-1} - C_{i-1} - N_i - Cα_i`
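Each of these torsions is the dihedral angle defined by four consecutive atoms. The sketch below computes such an angle from Cartesian coordinates with the standard `atan2` formula; it is a generic illustration, not the parser's own routine, and the function name `dihedral_deg` is hypothetical. The sign convention shown here gives 0° for a cis arrangement and ±180° for trans; whether the parser uses the same convention is not stated in this README.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0, p1, p2, p3):
    """Dihedral angle p0-p1-p2-p3 in degrees, in the range [-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# phi_i would be dihedral_deg(C_prev, N, CA, C) on the residue coordinates.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```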
Only the backbone is modeled explicitly. Side-chain atoms are not part of
the generated instances. To improve triangulation, the parser keeps the
hydrogens attached to N and CA, with residue-specific handling:
- standard residues use one backbone amide hydrogen among `H`, `H1`, `H2`, or `H3`, plus `HA`
- glycine uses `HA2` when available, otherwise `HA3`
- proline has no backbone amide hydrogen, so one of `HD2` or `HD3` is used as the `HN` surrogate, together with `HA`
The reorder stage resolves these concrete atom names to the logical labels used by the construction:
- `HN` -> `H1`, `H`, `HD2`, or `HD3`
- `HA` -> `HA`, `HA2`, or `HA3`
For each processed chain, the final distance-constraint file I_* is
obtained by merging several families of restraints:
- exact covalent and local geometric distances induced by the filtered backbone topology
- exact peptide-plane distances between consecutive residues under the planar peptide-group assumption
- optional van der Waals lower bounds
- NMR-like inter-hydrogen distance intervals generated from the reference PDB coordinates
- distance intervals induced by sampled backbone torsion intervals
This matches the article-oriented workflow: backbone geometry is treated as the structural scaffold, while uncertain NMR information is represented as interval data.
Defines how NMR-derived distance constraints are generated.
Currently supported options:
| option | description |
|---|---|
| `precise` | generates exact NMR distances $[d_{ij}, d_{ij}]$ for atom pairs within the cutoff |
| `interval_centered` | generates synthetic intervals around the reference PDB distance |
| `interval_experimental` | generates NOESY-like intervals using strong, medium, and weak experimental classes |
Note that covalent distances, planar constraints, and peptide-group distances are always treated as precise.
When using `precise`, the parser keeps each accepted NMR-derived distance as
an exact value,

$$[d_{ij},\, d_{ij}],$$

where $d_{ij}$ is the reference distance measured in the PDB structure.
Pairs with d_ij >= max_distance are ignored.
In this mode, the parser also forces the backbone angular intervals to be exact:
- `torsion_angle_width = 0`
- `percentage_backbone_torsion_angles = 100`
So the generated A_* file contains zero-width angular intervals instead
of sampled angular ranges.
Parameter used by this mode:
| parameter | meaning |
|---|---|
| `max_distance` | maximum allowed distance for keeping an exact NMR-derived pair |
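The `precise` rule (keep a pair as $[d_{ij}, d_{ij}]$ unless $d_{ij} \geq$ `max_distance`) can be sketched as below. The function name `precise_pairs` and the input layout (a dictionary from atom identifier to coordinates) are hypothetical, and the sketch leaves the choice of candidate atoms to the caller rather than reproducing the parser's own pair selection.

```python
import math
from itertools import combinations


def precise_pairs(coords, max_distance):
    """Return (atom_i, atom_j, d_l, d_u) with d_l == d_u for kept pairs.

    coords maps an atom identifier to its (x, y, z) reference position.
    """
    pairs = []
    for (i, p_i), (j, p_j) in combinations(sorted(coords.items()), 2):
        d = math.dist(p_i, p_j)
        if d >= max_distance:
            continue  # pairs at or beyond the cutoff are ignored
        pairs.append((i, j, d, d))
    return pairs


hydrogens = {1: (0.0, 0.0, 0.0), 2: (3.0, 0.0, 0.0), 3: (0.0, 6.0, 0.0)}
print(precise_pairs(hydrogens, max_distance=5.0))
```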
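When using `interval_centered`, distance intervals are generated around the reference distance extracted
from the PDB.
Let $d_{ij}$ be the reference distance. The resulting interval $[\underline{d}_{ij}, \overline{d}_{ij}]$ has a nominal width given by `epsilon_short` or `epsilon_long`, a lower bound that is never smaller than `vdwr_hh`, and an upper bound that is never larger than `max_distance`. Here `vdwr_hh` is the lower-bound parameter from `params.cfg`.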
The distance-generation parameters depend on the selected mode.
| parameter | meaning |
|---|---|
| `epsilon_short` | interval width for atoms in the same or adjacent residues |
| `epsilon_long` | interval width for atoms in non-adjacent residues |
| `vdwr_hh` | lower-bound floor used for hydrogen-hydrogen distance intervals in both interval modes |
| `max_distance` | cutoff used by `precise` and maximum allowed upper bound used by `interval_centered` |
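To make the role of each parameter concrete, the sketch below builds one interval under a simplifying assumption: the interval is placed symmetrically around the reference distance and then clipped to `[vdwr_hh, max_distance]`. The parser may place the interval differently (the existence of `seed_interval_centered_distances` suggests a random component that is not modeled here), so treat this as an illustration of the parameters only. The function name `centered_interval` is hypothetical.

```python
def centered_interval(d_ref, resid_i, resid_j, *, epsilon_short,
                      epsilon_long, vdwr_hh, max_distance):
    """Symmetric interval around d_ref, clipped to [vdwr_hh, max_distance]."""
    # Same or adjacent residues use the short width, others the long one.
    width = epsilon_short if abs(resid_i - resid_j) <= 1 else epsilon_long
    lower = max(vdwr_hh, d_ref - width / 2)  # assumption: symmetric placement
    upper = min(max_distance, d_ref + width / 2)
    return lower, upper


params = dict(epsilon_short=1.0, epsilon_long=2.0, vdwr_hh=1.8, max_distance=5.0)
print(centered_interval(3.0, 4, 5, **params))   # adjacent residues
print(centered_interval(4.8, 2, 9, **params))   # upper bound clipped
print(centered_interval(2.0, 7, 7, **params))   # lower bound raised to vdwr_hh
```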
When using interval_experimental, the parser generates NOESY-like
intervals from the reference PDB distances using three intensity classes:
- strong: `[vdwr_hh, noe_strong]`
- medium: `[vdwr_hh, noe_medium]`
- weak: `[vdwr_hh, noe_weak]`
Pairs whose reference distance is larger than noe_weak are ignored.
Parameters used by this mode:
| parameter | meaning |
|---|---|
| `noe_strong` | upper bound for strong NOE peaks |
| `noe_medium` | upper bound for medium NOE peaks |
| `noe_weak` | upper bound for weak NOE peaks |
| `vdwr_hh` | shared lower-bound floor, already configurable in `data/params.cfg` |
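One plausible reading of the three classes is that a pair falls into the tightest class whose upper bound still covers its reference distance. The sketch below encodes that reading; the assignment rule is an assumption (this README does not spell it out), and the function name `noe_interval` is hypothetical.

```python
def noe_interval(d_ref, *, noe_strong, noe_medium, noe_weak, vdwr_hh):
    """Return (class_label, (d_l, d_u)), or None if the pair is ignored."""
    classes = (("strong", noe_strong), ("medium", noe_medium), ("weak", noe_weak))
    for label, upper in classes:
        # Assumption: pick the first (tightest) class that contains d_ref.
        if d_ref <= upper:
            return label, (vdwr_hh, upper)
    return None  # reference distance larger than noe_weak


params = dict(noe_strong=2.5, noe_medium=3.5, noe_weak=5.0, vdwr_hh=1.8)
for d in (2.2, 3.0, 4.9, 5.4):
    print(d, noe_interval(d, **params))
```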
This experimental option is already available in the parser through
distance_constraints: interval_experimental.
Optional lower-bound distance constraints based on the van der Waals radii of protein atoms can be included.
Options:
| value | meaning |
|---|---|
| `yes` | include van der Waals lower-bound constraints |
| `no` | ignore van der Waals constraints |
Torsion angles are derived from the PDB structure and converted into intervals.
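Given a reference torsion angle $\theta$ measured in the PDB structure, the parser builds an interval around it,
where `torsion_angle_width` defines the total interval width. This corresponds to a radius equal to half of `torsion_angle_width`, stored together with the interval center in the `*_center` and `*_radius` columns of `A_*`.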
The parser writes these sampled omega/phi/psi intervals to A_*. It then
converts them to additional distance intervals and merges them into the
final I_* file.
The parameter `percentage_backbone_torsion_angles` sets the percentage of backbone torsion angles that are used:
| value | behavior |
|---|---|
| `100` | all backbone torsion angles are used |
| `<100` | a random subset of torsion angles is used |
Angles that are not selected receive the default
range
When distance_constraints: precise is used, the parser overrides this
parameter internally and always uses 100.
model_number: 1
chain_id: A
atom_selection: backbone_plus_hydrogens
distance_constraints: interval_centered
epsilon_short: 1.0
epsilon_long: 2.0
max_distance: 5.0
vdwr_hh: 1.8
vdw_constraints: yes
torsion_angle_width: 50.0
percentage_backbone_torsion_angles: 100.0
Experimental distance-constraint example:
model_number: 1
chain_id: A
atom_selection: backbone_plus_hydrogens
distance_constraints: interval_experimental
noe_strong: 2.5
noe_medium: 3.5
noe_weak: 5.0
vdwr_hh: 1.8
vdw_constraints: yes
torsion_angle_width: 50.0
percentage_backbone_torsion_angles: 100.0
Precise distance-constraint example:
model_number: 1
chain_id: A
atom_selection: backbone_plus_hydrogens
distance_constraints: precise
max_distance: 5.0
vdwr_hh: 1.8
vdw_constraints: yes
torsion_angle_width: 0.0
percentage_backbone_torsion_angles: 100.0
The instance reordering stage uses its own configuration file.
Currently supported parameters:
| parameter | meaning |
|---|---|
| `model_number` | model number used when the parser-stage files were generated |
| `chain_id` | chain identifier used when the parser-stage files were generated |
| `order_id` | ordering choice; supported values are integers from 1 to 10 |
For order_id, any value from 1 to 10 generates a constant DDGP ordering vector filled with that identifier for all internal residues. Among these choices, order_id=1 is the ordering used in A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems (arXiv:2510.19970), while order_id=9 is the ordering used in An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem (arXiv:2508.09143).
The parser-stage X_* file preserves the filtered atom list in PDB order.
The reordering stage is what turns that output into a DDGP-compatible
instance.
Important: this stage currently assumes that the parser was run with
atom_selection: backbone_plus_hydrogens.
For the first residue, the code chooses one initialization pattern
according to the hydrogen names available in that residue, covering the
expected N-terminus variants (H3, H2, H1, H) and the proline/glycine
special cases discussed above.
For the remaining residues, atoms are appended residue by residue using a
fixed five-atom internal pattern controlled by order_id.
The internal-residue patterns available in the current public interface are:
- `order_id=1`: `N, CA, C, HN, HA`
- `order_id=2`: `N, CA, C, HA, HN`
- `order_id=3`: `N, CA, HN, HA, C`
- `order_id=4`: `N, CA, HN, C, HA`
- `order_id=5`: `N, HN, CA, HA, C`
- `order_id=6`: `N, HN, CA, C, HA`
- `order_id=7`: `HN, N, CA, HA, C`
- `order_id=8`: `HN, N, CA, C, HA`
- `order_id=9`: `HN, CA, N, HA, C`
- `order_id=10`: `HN, CA, N, C, HA`
In all cases, the concrete aliases HN -> H/H1/HD2/HD3 and
HA -> HA/HA2/HA3 are resolved from the actual residue contents. In
practice, the first residue may contain N-terminus variants such as H3,
H2, and H1, while non-terminal non-proline residues are expected to
provide H for the amide-hydrogen position.
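The internal-residue patterns and the alias resolution can be written down as a lookup table plus a small resolver. This sketch covers internal residues only (not the first-residue initialization), uses hypothetical names (`ORDER_PATTERNS`, `ALIASES`, `order_residue`), and assumes that aliases are tried in the order in which they are listed in this README.

```python
ORDER_PATTERNS = {
    1: ("N", "CA", "C", "HN", "HA"),
    2: ("N", "CA", "C", "HA", "HN"),
    3: ("N", "CA", "HN", "HA", "C"),
    4: ("N", "CA", "HN", "C", "HA"),
    5: ("N", "HN", "CA", "HA", "C"),
    6: ("N", "HN", "CA", "C", "HA"),
    7: ("HN", "N", "CA", "HA", "C"),
    8: ("HN", "N", "CA", "C", "HA"),
    9: ("HN", "CA", "N", "HA", "C"),
    10: ("HN", "CA", "N", "C", "HA"),
}

# Logical label -> concrete PDB atom names (priority order is an assumption).
ALIASES = {"HN": ("H1", "H", "HD2", "HD3"), "HA": ("HA", "HA2", "HA3")}


def order_residue(order_id, atom_names):
    """Return the residue's concrete atom names in the chosen pattern order."""
    present = set(atom_names)
    ordered = []
    for label in ORDER_PATTERNS[order_id]:
        candidates = ALIASES.get(label, (label,))
        name = next((c for c in candidates if c in present), None)
        if name is None:
            raise KeyError(f"no atom found for logical label {label!r}")
        ordered.append(name)
    return ordered


print(order_residue(9, ["N", "CA", "C", "H", "HA"]))            # standard residue
print(order_residue(1, ["N", "CA", "C", "H", "HA2", "HA3"]))    # glycine
print(order_residue(7, ["N", "CA", "C", "HD2", "HD3", "HA"]))   # proline
```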
Besides the reordered X_* and I_* files, this stage also writes the
clique matrix T_*, which is the discrete structure used by the reordered
instance. The selected order_id is appended to each reordered filename
as _ddgpHCorder<order_id>.dat.
model_number: 1
chain_id: A
order_id: 9
Clone the repository:
git clone https://github.com/wdarocha/pdb-parser.git
cd pdb-parser
This project requires Python 3.10 or newer.
Create and activate a virtual environment if desired, then install the package in editable mode:
pip install -e .
The package depends on the following Python libraries:
- `MDAnalysis`
- `numpy`
- `pandas`
- `scipy`
- `requests`
With the current pyproject.toml, these dependencies are installed automatically when running pip install -e ..
If you prefer to install them manually first, you can use:
pip install MDAnalysis numpy pandas scipy requests
If this code is useful in your research, please cite the repository (also available via the “Cite this repository” button on GitHub thanks to the included CITATION.cff) and cite the preprint associated with the ordering used in your experiments.
The reordering code supports order_id values from 1 to 10. At the moment, the preprint references documented in this repository are:
- `order_id=1`: *A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems* (arXiv:2510.19970).
- `order_id=9`: *An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem* (arXiv:2508.09143).
If your work uses one of these two orderings, cite the corresponding preprint. If your work uses both, cite both preprints.
Leonardo D. Secchin, Wagner da Rocha, Mariana da Rosa, Leo Liberti, Carlile Lavor.
A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems.
arXiv:2510.19970, 2025.
https://arxiv.org/abs/2510.19970
@misc{secchin2025hybrid,
title = {A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems},
author = {Leonardo D. Secchin and Wagner da Rocha and Mariana da Rosa and Leo Liberti and Carlile Lavor},
year = {2025},
eprint = {2510.19970},
archivePrefix= {arXiv},
primaryClass = {math.OC},
url = {https://arxiv.org/abs/2510.19970}
}

Wagner A. A. da Rocha, Carlile Lavor, Leo Liberti, Leticia de Melo Costa, Leonardo D. Secchin, Therese E. Malliavin.
An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem.
arXiv:2508.09143, 2025.
https://arxiv.org/abs/2508.09143
@misc{darocha2025anglebased,
title = {An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem},
author = {Wagner A. A. da Rocha and Carlile Lavor and Leo Liberti and Leticia de Melo Costa and Leonardo D. Secchin and Therese E. Malliavin},
year = {2025},
eprint = {2508.09143},
archivePrefix= {arXiv},
primaryClass = {q-bio.BM},
url = {https://arxiv.org/abs/2508.09143}
}

This repository is licensed under the MIT License. © 2025 Wagner Alan Aparecido da Rocha
Developed and maintained by Wagner Alan Aparecido da Rocha.
Special thanks to Leonardo D. Secchin for the valuable support provided during the development of the code.