pdb-parser

Lightweight Python toolkit for parsing Protein Data Bank (PDB) Nuclear Magnetic Resonance (NMR) structures and generating distance and torsion-angle constraints tailored for Distance Geometry Problem (DGP) workflows.

The package extracts structural information from PDB files and produces constraint sets that can be used in algorithms for protein structure determination.

Current release: v0.2.0 (2026-05-20).

Typical applications include:

  • generation of distance-constraint lists
  • generation of backbone torsion angles
  • extraction of 3D coordinates

🔬 Project structure

pdb-parser/
│
├── data/                  # configuration and input lists
│   ├── params.cfg
│   ├── instance_reorder.cfg
│   └── pdb_ids.txt
│
├── src/pdb_parser/
│   ├── pdb_parser.py      # CLI for parsing and writing output files
│   ├── instance_reorder.py # CLI for reordering previously written outputs
│   ├── geometry/          # distance geometry utilities
│   ├── io/                # PDB parsing and filtering routines
│   ├── pipeline/          # main parsing pipeline
│   ├── reordering/        # vertex ordering strategies
│   └── utils/             # auxiliary utilities
│
├── pyproject.toml
├── README.md
├── CITATION.cff
└── .gitignore

Installation

For development use from a local clone, install the package in editable mode:

python -m pip install -e .

This exposes the command-line entry points:

pdb-parser --help
instance-reorder --help

If you prefer not to install the package, run the modules directly from the source tree with PYTHONPATH=src:

PYTHONPATH=src python -m pdb_parser.pdb_parser --help
PYTHONPATH=src python -m pdb_parser.instance_reorder --help

Quick check

From the repository root, the following commands should display the available command-line options:

PYTHONPATH=src python -m pdb_parser.pdb_parser --help
PYTHONPATH=src python -m pdb_parser.instance_reorder --help
./run_pipeline.sh --help

📂 Input configuration

The parser stage is controlled by data/params.cfg.

The instance reordering stage is controlled by data/instance_reorder.cfg.

Both files use the same key: value syntax and support # comments.
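As a minimal sketch of this syntax, the reader below parses `key: value` text and drops `#` comments. The function name `parse_cfg` is hypothetical and is not part of the package. Treating everything after the first `#` on a line as a comment is an assumption of this sketch.

```python
def parse_cfg(text):
    """Parse `key: value` lines, ignoring blank lines and # comments."""
    params = {}
    for raw in text.splitlines():
        # Assumption: everything after the first '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got: {raw!r}")
        # Values are kept as strings; type conversion is left to the caller.
        params[key.strip()] = value.strip()
    return params
```

For example, `parse_cfg(open("data/params.cfg").read())` would return a dictionary of string values keyed by parameter name.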

Below is a complete description of the parameters currently supported by each stage.


Running the parser stage

After installing the package with pip install -e ., the parsing stage can be executed as

pdb-parser data/pdb_ids.txt data/params.cfg data/pdb data/seeds data/outputs

or, equivalently,

python -m pdb_parser.pdb_parser data/pdb_ids.txt data/params.cfg data/pdb data/seeds data/outputs

The arguments are:

| argument | description |
| --- | --- |
| data/pdb_ids.txt | text file containing the list of PDB identifiers to process |
| data/params.cfg | configuration file controlling the parser behavior |
| data/pdb | directory where PDB structures will be stored |
| data/seeds | directory where per-PDB seed files will be read and written |
| data/outputs | directory where the generated constraint files will be written |

Running the instance reordering stage

The instance reordering stage consumes files already written by the parser and can be executed as

instance-reorder data/pdb_ids.txt data/instance_reorder.cfg data/outputs

or, equivalently,

python -m pdb_parser.instance_reorder data/pdb_ids.txt data/instance_reorder.cfg data/outputs

The arguments are:

| argument | description |
| --- | --- |
| data/pdb_ids.txt | text file containing the list of PDB identifiers to reorder |
| data/instance_reorder.cfg | configuration file controlling the instance reordering behavior |
| data/outputs | directory containing the files generated by the parser stage |

The ordering choice is configured through order_id in data/instance_reorder.cfg, with supported values from 1 to 10.

Among these orderings, order_id=1 corresponds to the preprint A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems (arXiv:2510.19970), and order_id=9 corresponds to the preprint An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem (arXiv:2508.09143).

The reordered files are written to data/outputs/<pdb_id>/reordered/.


Running pipeline stages with one script

The repository also provides run_pipeline.sh, a wrapper for running the parser and instance reordering stages from the source tree without installing the package.

Default usage runs the full pipeline:

./run_pipeline.sh data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs

The script always accepts the same six positional arguments:

| argument | description |
| --- | --- |
| data/pdb_ids.txt | text file containing the list of PDB identifiers |
| data/params.cfg | configuration file for the parser stage |
| data/instance_reorder.cfg | configuration file for the instance reordering stage |
| data/pdb | directory where downloaded PDB structures will be stored |
| data/seeds | directory where per-PDB seed files will be read and written |
| data/outputs | directory where parser and reordered outputs are written |

Execution modes are selected with --mode:

./run_pipeline.sh --mode full data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
./run_pipeline.sh --mode parser data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
./run_pipeline.sh --mode reorder data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs
./run_pipeline.sh --mode reorder-all data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs

The reorder-all mode rewrites order_id in a temporary copy of instance_reorder.cfg and runs the reordering stage for order_id=1..10 by default. A custom list can be provided with --orders, for example:

./run_pipeline.sh --mode reorder-all --orders 1,3,9 data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs

To save stage logs instead of printing command output directly, use --log-dir:

./run_pipeline.sh --mode reorder-all --log-dir logs data/pdb_ids.txt data/params.cfg data/instance_reorder.cfg data/pdb data/seeds data/outputs

Output files

For each processed pdb_id, the parser stage writes files to

data/outputs/<pdb_id>/

Main parser outputs:

  • X_<pdb_id>_model<model_number>_chain<chain_id>.dat
    Structure file with one atom per line.
    Columns: atom_id atom_name resid resname x y z
    Format: %5d %-4s %4d %3s %8.3f%8.3f%8.3f

  • A_<pdb_id>_model<model_number>_chain<chain_id>.dat
    Angular constraint file with one residue per line.
    Columns: resid resname omega_center omega_radius phi_center phi_radius psi_center psi_radius
    Format: %4d %3s %9.4f %9.4f %9.4f %9.4f %9.4f %9.4f

  • I_<pdb_id>_model<model_number>_chain<chain_id>.dat
    Distance-constraint file.
    Columns: atom_id_i atom_id_j resid_i resid_j d_l d_u atom_name_i atom_name_j resname_i resname_j
    Format: %5d %5d %6d %6d %20.16f %20.16f %4s %4s %s %s
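The X_* format places the three coordinates in adjacent 8-character fields with no separator, so splitting on whitespace can merge neighboring values. The sketch below reads a line by column position instead. The helper names and the sample values are made up for illustration and are not taken from the package.

```python
# Format string quoted from the X_* description above.
X_FORMAT = "%5d %-4s %4d %3s %8.3f%8.3f%8.3f"


def format_x_line(atom_id, atom_name, resid, resname, x, y, z):
    """Write one atom record using the documented format."""
    return X_FORMAT % (atom_id, atom_name, resid, resname, x, y, z)


def parse_x_line(line):
    """Read one atom record; the slices follow the field widths in X_FORMAT."""
    return {
        "atom_id": int(line[0:5]),
        "atom_name": line[6:10].strip(),
        "resid": int(line[11:15]),
        "resname": line[16:19].strip(),
        "x": float(line[20:28]),
        "y": float(line[28:36]),
        "z": float(line[36:44]),
    }


# Hypothetical record: two wide negative coordinates touch each other.
line = format_x_line(1, "N", 1, "MET", -123.456, -100.0, 12.5)
record = parse_x_line(line)
```

The slicing assumes that no field overflows its width, for example that atom identifiers have at most five digits.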

Per-PDB seed files are stored in

data/seeds/<pdb_id>/

using the filename

seed_<pdb_id>.dat

If the current execution needs at least one random stream, the parser reuses the stored seeds when the file already exists. Otherwise, it creates the directory tree and writes the file before generating the random outputs.

If the current execution is fully deterministic, no seed file is needed. In that case, the parser does not create seed_<pdb_id>.dat.

The file stores only the seed fields that can actually affect the current execution. If a later run changes the stochastic components for the same pdb_id, the parser updates the file accordingly.

The supported seed fields are:

  • seed_interval_centered_distances: stored only when distance_constraints = interval_centered
  • seed_talos_angle_centers: stored only when the effective torsion interval width is greater than 0
  • seed_phi_psi_mask: stored only when the phi/psi selection is genuinely random; for example, it is not stored when percentage_backbone_torsion_angles = 100

The instance reordering stage writes files to

data/outputs/<pdb_id>/reordered/

Main reordered outputs:

  • X_<pdb_id>_model<model_number>_chain<chain_id>_ddgpHCorder<order_id>.dat
    Coordinate-only file, one line per atom in reordered order.
    Columns: x y z
    Format: %.3f %.3f %.3f

  • I_<pdb_id>_model<model_number>_chain<chain_id>_ddgpHCorder<order_id>.dat
    Reordered distance-constraint file.
    Columns: atom_id_i atom_id_j resid_i resid_j d_l d_u atom_name_i atom_name_j resname_i resname_j
    Format: %5d %5d %6d %6d %20.16f %20.16f %4s %4s %s %s

  • T_<pdb_id>_model<model_number>_chain<chain_id>_ddgpHCorder<order_id>.dat
    Clique matrix used by the reordered instance.
    Columns: atom_id_i3 atom_id_i2 atom_id_i1 atom_id_i sign_tau abs_tau delta_tau
    Format: %d %d %d %d %d %.6f %.6f

Important: the reordered X_* file is intentionally different from the parser-stage X_* file. The parser-stage file stores atom identifiers and residue metadata together with coordinates, while the reordered file stores only the x y z coordinates in reordered atom order.
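For scripts that consume these files, the naming scheme can be captured in two small helpers. This is a sketch with hypothetical function names. It assumes the PDB identifier appears in the directory and file names exactly as given.

```python
from pathlib import Path


def parser_output(outputs_dir, kind, pdb_id, model_number, chain_id):
    """Path of a parser-stage file; kind is 'X', 'A', or 'I'."""
    name = f"{kind}_{pdb_id}_model{model_number}_chain{chain_id}.dat"
    return Path(outputs_dir) / pdb_id / name


def reordered_output(outputs_dir, kind, pdb_id, model_number, chain_id, order_id):
    """Path of a reordered file; kind is 'X', 'I', or 'T'."""
    name = (f"{kind}_{pdb_id}_model{model_number}_chain{chain_id}"
            f"_ddgpHCorder{order_id}.dat")
    return Path(outputs_dir) / pdb_id / "reordered" / name
```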


PDB list

The file

data/pdb_ids.txt

contains a list of PDB identifiers, one per line.

Example

1TOS
1UAO

Each identifier corresponds to a structure that will be retrieved and processed by the pipeline.


Parser configuration parameters (params.cfg)

Model selection

Selects which model from the NMR PDB structure will be used.


Chain selection

Specifies which chain in the selected model will be processed. Only atoms belonging to this chain are considered.


Atom selection strategy

Defines which atoms are extracted from the PDB structure to build the protein graph used by the parser and by the reordering stage.

For the current public interface, the active option is:

| option | description |
| --- | --- |
| backbone_plus_hydrogens | backbone atoms N, CA, C plus the directly attached hydrogens needed by the article-oriented iDDGP construction |

This is also the atom-selection layout assumed by the current instance reordering stage.


Protein-chain model used for instance generation

The current workflow follows the protein-chain model described in the manuscript section Generating iDDGP Instances from Protein Structures. For residue i, the backbone is represented by the triplet {N_i, Cα_i, C_i} and the backbone torsions are

  • phi_i := C_{i-1} - N_i - Cα_i - C_i
  • psi_i := N_i - Cα_i - C_i - N_{i+1}
  • omega_i := Cα_{i-1} - C_{i-1} - N_i - Cα_i
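Each of these torsions is the signed dihedral angle of four consecutive points. The sketch below uses the standard atan2 formula; it is not taken from the package, and the function names are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed torsion angle p0-p1-p2-p3 in degrees, within [-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

With coordinates taken from an X_* file, phi_i would be `dihedral_deg(C_prev, N, CA, C)`, where the four arguments are the (x, y, z) tuples of C_{i-1}, N_i, Cα_i, and C_i.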

Only the backbone is modeled explicitly. Side-chain atoms are not part of the generated instances. To improve triangulation, the parser keeps the hydrogens attached to N and CA, with residue-specific handling:

  • standard residues use one backbone amide hydrogen among H, H1, H2, or H3, plus HA
  • glycine uses HA2 when available, otherwise HA3
  • proline has no backbone amide hydrogen, so one of HD2 or HD3 is used as the HN surrogate, together with HA

The reorder stage resolves these concrete atom names to the logical labels used by the construction:

  • HN -> H1, H, HD2, or HD3
  • HA -> HA, HA2, or HA3

Constraint families used to build one instance

For each processed chain, the final distance-constraint file I_* is obtained by merging several families of restraints:

  • exact covalent and local geometric distances induced by the filtered backbone topology
  • exact peptide-plane distances between consecutive residues under the planar peptide-group assumption
  • optional van der Waals lower bounds
  • NMR-like inter-hydrogen distance intervals generated from the reference PDB coordinates
  • distance intervals induced by sampled backbone torsion intervals

This matches the article-oriented workflow: backbone geometry is treated as the structural scaffold, while uncertain NMR information is represented as interval data.


Distance constraint model

Defines how NMR-derived distance constraints are generated.

Currently supported options:

| option | description |
| --- | --- |
| precise | generates exact NMR distances $[d_{ij}, d_{ij}]$ for atom pairs within the cutoff |
| interval_centered | generates synthetic intervals around the reference PDB distance |
| interval_experimental | generates NOESY-like intervals using strong, medium, and weak experimental classes |

Note that covalent distances, planar constraints, and peptide-group distances are always treated as precise.


Precise distance constraints

When using precise, the parser keeps each accepted NMR-derived distance as an exact value,

$$ \mathcal{D}_{ij} = [d_{ij}, d_{ij}] $$

where d_ij is the reference distance measured in the PDB structure. Pairs with d_ij >= max_distance are ignored.

In this mode, the parser also forces the backbone angular intervals to be exact:

  • torsion_angle_width = 0
  • percentage_backbone_torsion_angles = 100

So the generated A_* file contains zero-width angular intervals instead of sampled angular ranges.

Parameter used by this mode:

| parameter | meaning |
| --- | --- |
| max_distance | maximum allowed distance for keeping an exact NMR-derived pair |

Synthetic distance intervals

When using interval_centered, distance intervals are generated around the reference distance extracted from the PDB.

The reference distance, $d_{ij}$, is perturbed as

$$ d_{ij}^* \sim \mathcal{N}\left(d_{ij},\left(\frac{\varepsilon_{ij}}{8}\right)^2\right) $$

and the resulting interval is

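$$ \mathcal{D}_{ij} = \left[ \max\left(d_{ij}^* - \frac{\varepsilon_{ij}}{2},\ \mathrm{vdwr\_hh}\right), \min\left(d_{ij}^* + \frac{\varepsilon_{ij}}{2},\ d_{\mathrm{max}}\right) \right] $$

where the interval width $\varepsilon_{ij}$ is epsilon_short or epsilon_long, depending on the residue separation of the pair. The width satisfies $\varepsilon_{ij} = 8\sigma$, so the half-width $\varepsilon_{ij}/2$ equals $4\sigma$ of the sampling distribution. Here vdwr_hh is the lower-bound parameter and $d_{\mathrm{max}}$ is max_distance, both from params.cfg.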


Distance parameters

The distance-generation parameters depend on the selected mode.

| parameter | meaning |
| --- | --- |
| epsilon_short | interval width for atoms in the same or adjacent residues |
| epsilon_long | interval width for atoms in non-adjacent residues |
| vdwr_hh | lower-bound floor used for hydrogen-hydrogen distance intervals in both interval modes |
| max_distance | cutoff used by precise and maximum allowed upper bound used by interval_centered |
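The interval_centered rule, combined with these parameters, can be sketched as follows. The function name is hypothetical, the default values are copied from the example params.cfg in this README, and Python's `random` module stands in for whatever generator the parser actually uses. Reading "adjacent residues" as a residue-number difference of at most 1 is also an assumption.

```python
import random


def centered_interval(d_ref, resid_i, resid_j, rng,
                      epsilon_short=1.0, epsilon_long=2.0,
                      vdwr_hh=1.8, max_distance=5.0):
    """Sample one interval_centered distance interval (d_l, d_u)."""
    # Assumption: same or adjacent residues means |resid_i - resid_j| <= 1.
    eps = epsilon_short if abs(resid_i - resid_j) <= 1 else epsilon_long
    d_star = rng.gauss(d_ref, eps / 8.0)         # sigma = eps / 8
    d_l = max(d_star - eps / 2.0, vdwr_hh)       # clamp at the lower floor
    d_u = min(d_star + eps / 2.0, max_distance)  # clamp at the cutoff
    return d_l, d_u


rng = random.Random(0)  # fixed seed, for reproducibility of this sketch only
d_l, d_u = centered_interval(3.0, 4, 5, rng)
```

Because of the two clamps, the resulting interval can be narrower than the nominal width and need not be centered on the sampled value.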

Experimental distance intervals

When using interval_experimental, the parser generates NOESY-like intervals from the reference PDB distances using three intensity classes:

  • strong: [vdwr_hh, noe_strong]
  • medium: [vdwr_hh, noe_medium]
  • weak: [vdwr_hh, noe_weak]

Pairs whose reference distance is larger than noe_weak are ignored.
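The text above does not state how a pair is assigned to a class. The sketch below assumes the simplest rule: each pair receives the tightest class whose upper bound is not exceeded by its reference distance. The function name is hypothetical, and the default values are copied from the example params.cfg in this README.

```python
def noe_interval(d_ref, vdwr_hh=1.8, noe_strong=2.5, noe_medium=3.5, noe_weak=5.0):
    """Return (d_l, d_u) for a reference distance, or None if the pair is ignored."""
    # Assumption: choose the tightest class whose upper bound covers d_ref.
    for upper in (noe_strong, noe_medium, noe_weak):
        if d_ref <= upper:
            return vdwr_hh, upper
    return None  # reference distance larger than noe_weak
```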

Parameters used by this mode:

| parameter | meaning |
| --- | --- |
| noe_strong | upper bound for strong NOE peaks |
| noe_medium | upper bound for medium NOE peaks |
| noe_weak | upper bound for weak NOE peaks |
| vdwr_hh | shared lower-bound floor, already configurable in data/params.cfg |

This experimental option is already available in the parser through distance_constraints: interval_experimental.


van der Waals constraints

Optional lower-bound distance constraints based on the van der Waals radii of protein atoms can be included.

Options:

| value | meaning |
| --- | --- |
| yes | include van der Waals lower-bound constraints |
| no | ignore van der Waals constraints |

Torsion-angle intervals

Torsion angles are derived from the PDB structure and converted into intervals.

Given a reference torsion angle $\tau_{i}$, taken from the PDB structure, a perturbed value is sampled as

$$ \tau_i^* \sim \mathcal{N}\left(\tau_i,\left(\frac{\Delta\tau_i}{8}\right)^2\right) $$

and the resulting interval is

$$ \left[ \tau_i^* - \frac{\Delta\tau_i}{2}, \tau_i^* + \frac{\Delta\tau_i}{2} \right], $$

where $\Delta\tau_i$ is the total interval width, set by torsion_angle_width. This corresponds to $\Delta\tau_i = 8\sigma$.

The parser writes these sampled omega/phi/psi intervals to A_*. It then converts them to additional distance intervals and merges them into the final I_* file.
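How the parser performs this conversion is not described here. One standard way, sketched below, uses the closed-form distance between the end atoms of a chain 1-2-3-4 with fixed bond lengths and bond angles. All names are hypothetical, and the caller supplies the geometry.

```python
import math


def d14(tau_deg, d12, d23, d34, theta2_deg, theta3_deg):
    """Distance between atoms 1 and 4 of a chain 1-2-3-4 at torsion tau."""
    t2, t3 = math.radians(theta2_deg), math.radians(theta3_deg)
    sq = (d12**2 + d23**2 + d34**2
          - 2 * d12 * d23 * math.cos(t2)
          - 2 * d23 * d34 * math.cos(t3)
          + 2 * d12 * d34 * (math.cos(t2) * math.cos(t3)
                             - math.sin(t2) * math.sin(t3)
                             * math.cos(math.radians(tau_deg))))
    return math.sqrt(max(sq, 0.0))


def _contains(lo, hi, target):
    """True if target + 360*k lies in [lo, hi] for some integer k."""
    k = math.ceil((lo - target) / 360.0)
    return target + 360.0 * k <= hi


def torsion_to_distance_interval(center, radius, **geometry):
    """Map the torsion interval center +/- radius (degrees) to (d_l, d_u)."""
    lo, hi = center - radius, center + radius
    ends = (d14(lo, **geometry), d14(hi, **geometry))
    # d14 grows with |tau| on [0, 180], so the extremes sit at 0, 180,
    # or at an endpoint of the torsion interval.
    d_l = d14(0.0, **geometry) if _contains(lo, hi, 0.0) else min(ends)
    d_u = d14(180.0, **geometry) if _contains(lo, hi, 180.0) else max(ends)
    return d_l, d_u
```

For phi_i, for example, atoms 1 to 4 would be C_{i-1}, N_i, Cα_i, and C_i, and the result bounds the distance between C_{i-1} and C_i.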


Backbone torsion selection

The percentage of backbone torsion angles ($\phi/\psi$) included as interval constraints is controlled by percentage_backbone_torsion_angles.

| value | behavior |
| --- | --- |
| 100 | all backbone torsion angles are used |
| <100 | a random subset of torsion angles is used |

Angles that are not selected receive the default range $(-180^{\circ},\ 180^{\circ}]$.

When distance_constraints: precise is used, the parser overrides this parameter internally and always uses 100.
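The selection can be pictured with the sketch below. It is illustrative only. The function name, the rounding of the subset size, and the encoding of the default range as center 0 with radius 180 are all assumptions and are not taken from the package.

```python
import random


def select_torsions(angles, percentage, rng):
    """Keep a random share of (name, center, radius) entries.

    Entries that are not selected get the full range, written here
    as center 0 and radius 180 (an assumption of this sketch).
    """
    # Assumption: the subset size is the rounded share of the list.
    n_keep = round(len(angles) * percentage / 100.0)
    keep = set(rng.sample(range(len(angles)), n_keep))
    return [a if i in keep else (a[0], 0.0, 180.0)
            for i, a in enumerate(angles)]
```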


Parser configuration parameters example

model_number: 1
chain_id: A
atom_selection: backbone_plus_hydrogens
distance_constraints: interval_centered
epsilon_short: 1.0
epsilon_long: 2.0
max_distance: 5.0
vdwr_hh: 1.8
vdw_constraints: yes
torsion_angle_width: 50.0
percentage_backbone_torsion_angles: 100.0

Experimental distance-constraint example:

model_number: 1
chain_id: A
atom_selection: backbone_plus_hydrogens
distance_constraints: interval_experimental
noe_strong: 2.5
noe_medium: 3.5
noe_weak: 5.0
vdwr_hh: 1.8
vdw_constraints: yes
torsion_angle_width: 50.0
percentage_backbone_torsion_angles: 100.0

Precise distance-constraint example:

model_number: 1
chain_id: A
atom_selection: backbone_plus_hydrogens
distance_constraints: precise
max_distance: 5.0
vdwr_hh: 1.8
vdw_constraints: yes
torsion_angle_width: 0.0
percentage_backbone_torsion_angles: 100.0

Instance reordering configuration (instance_reorder.cfg)

The instance reordering stage uses its own configuration file.

Currently supported parameters:

| parameter | meaning |
| --- | --- |
| model_number | model number used when the parser-stage files were generated |
| chain_id | chain identifier used when the parser-stage files were generated |
| order_id | ordering choice; supported values are integers from 1 to 10 |

For order_id, any value from 1 to 10 generates a constant DDGP ordering vector filled with that identifier for all internal residues. Among these choices, order_id=1 is the ordering used in A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems (arXiv:2510.19970), while order_id=9 is the ordering used in An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem (arXiv:2508.09143).

How the reordering stage builds the DDGP/iDDGP instance

The parser-stage X_* file preserves the filtered atom list in PDB order. The reordering stage is what turns that output into a DDGP-compatible instance.

Important: this stage currently assumes that the parser was run with atom_selection: backbone_plus_hydrogens.

For the first residue, the code chooses one initialization pattern according to the hydrogen names available in that residue, covering the expected N-terminus variants (H3, H2, H1, H) and the proline/glycine special cases discussed above.

For the remaining residues, atoms are appended residue by residue using a fixed five-atom internal pattern controlled by order_id.

The internal-residue patterns available in the current public interface are:

  • order_id=1: N, CA, C, HN, HA
  • order_id=2: N, CA, C, HA, HN
  • order_id=3: N, CA, HN, HA, C
  • order_id=4: N, CA, HN, C, HA
  • order_id=5: N, HN, CA, HA, C
  • order_id=6: N, HN, CA, C, HA
  • order_id=7: HN, N, CA, HA, C
  • order_id=8: HN, N, CA, C, HA
  • order_id=9: HN, CA, N, HA, C
  • order_id=10: HN, CA, N, C, HA

In all cases, the concrete aliases HN -> H/H1/HD2/HD3 and HA -> HA/HA2/HA3 are resolved from the actual residue contents. In practice, the first residue may contain N-terminus variants such as H3, H2, and H1, while non-terminal non-proline residues are expected to provide H for the amide-hydrogen position.
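The internal-residue patterns and the alias resolution can be sketched as a lookup table plus a small resolver. The names below are hypothetical, and the search order inside each alias tuple is an assumption. The patterns themselves are copied from the order_id list above.

```python
# Internal-residue patterns, copied from the order_id list above.
ORDER_PATTERNS = {
    1: ("N", "CA", "C", "HN", "HA"),
    2: ("N", "CA", "C", "HA", "HN"),
    3: ("N", "CA", "HN", "HA", "C"),
    4: ("N", "CA", "HN", "C", "HA"),
    5: ("N", "HN", "CA", "HA", "C"),
    6: ("N", "HN", "CA", "C", "HA"),
    7: ("HN", "N", "CA", "HA", "C"),
    8: ("HN", "N", "CA", "C", "HA"),
    9: ("HN", "CA", "N", "HA", "C"),
    10: ("HN", "CA", "N", "C", "HA"),
}

# Concrete atom names that may fill each logical label.
# Assumption: candidates are tried in the order listed here.
ALIASES = {
    "HN": ("H", "H1", "HD2", "HD3"),
    "HA": ("HA", "HA2", "HA3"),
}


def order_residue_atoms(atom_names, order_id):
    """Return the residue's concrete atom names in the order_id pattern."""
    present = set(atom_names)
    ordered = []
    for label in ORDER_PATTERNS[order_id]:
        candidates = ALIASES.get(label, (label,))
        match = next((name for name in candidates if name in present), None)
        if match is None:
            raise ValueError(f"no atom found for label {label}")
        ordered.append(match)
    return ordered
```

For a proline with atoms N, CA, HA, C, HD2, and HD3, `order_residue_atoms(..., 3)` would return N, CA, HD2, HA, C under these assumptions.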

Besides the reordered X_* and I_* files, this stage also writes the clique matrix T_*, which is the discrete structure used by the reordered instance. The selected order_id is appended to each reordered filename as _ddgpHCorder<order_id>.dat.

Instance reordering configuration example

model_number: 1
chain_id: A
order_id: 9

⚙️ Installation

Clone the repository:

git clone https://github.com/wdarocha/pdb-parser.git
cd pdb-parser

This project requires Python 3.10 or newer.

Create and activate a virtual environment if desired, then install the package in editable mode:

pip install -e .

The package depends on the following Python libraries:

  • MDAnalysis
  • numpy
  • pandas
  • scipy
  • requests

With the current pyproject.toml, the command pip install -e . installs these dependencies automatically.

If you prefer to install them manually first, you can use:

pip install MDAnalysis numpy pandas scipy requests

📖 Citation

If this code is useful in your research, please cite the repository (also available via the “Cite this repository” button on GitHub thanks to the included CITATION.cff) and cite the preprint associated with the ordering used in your experiments.

The reordering code supports order_id values from 1 to 10. At the moment, the preprint references documented in this repository are:

  • order_id=1: A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems (arXiv:2510.19970).
  • order_id=9: An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem (arXiv:2508.09143).

If your work uses one of these two orderings, cite the corresponding preprint. If your work uses both, cite both preprints.

Preprint for order_id=1

Leonardo D. Secchin, Wagner da Rocha, Mariana da Rosa, Leo Liberti, Carlile Lavor.
A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems.
arXiv:2510.19970, 2025.
https://arxiv.org/abs/2510.19970

BibTeX for order_id=1

@misc{secchin2025hybrid,
  title        = {A hybrid combinatorial-continuous strategy for solving molecular distance geometry problems},
  author       = {Leonardo D. Secchin and Wagner da Rocha and Mariana da Rosa and Leo Liberti and Carlile Lavor},
  year         = {2025},
  eprint       = {2510.19970},
  archivePrefix= {arXiv},
  primaryClass = {math.OC},
  url          = {https://arxiv.org/abs/2510.19970}
}

Preprint for order_id=9

Wagner A. A. da Rocha, Carlile Lavor, Leo Liberti, Leticia de Melo Costa, Leonardo D. Secchin, Therese E. Malliavin.
An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem.
arXiv:2508.09143, 2025.
https://arxiv.org/abs/2508.09143

BibTeX for order_id=9

@misc{darocha2025anglebased,
  title        = {An Angle-Based Algorithmic Framework for the Interval Discretizable Distance Geometry Problem},
  author       = {Wagner A. A. da Rocha and Carlile Lavor and Leo Liberti and Leticia de Melo Costa and Leonardo D. Secchin and Therese E. Malliavin},
  year         = {2025},
  eprint       = {2508.09143},
  archivePrefix= {arXiv},
  primaryClass = {q-bio.BM},
  url          = {https://arxiv.org/abs/2508.09143}
}

📜 License

This repository is licensed under the MIT License. © 2025 Wagner Alan Aparecido da Rocha


👤 Author

Developed and maintained by Wagner Alan Aparecido da Rocha.


🙏 Acknowledgments

Special thanks to Leonardo D. Secchin for the valuable support provided during the development of the code.
