molecular_simulations.analysis.ipSAE module

Interaction prediction Score from Aligned Errors (ipSAE) module.

This module computes interaction prediction scores from pLDDT and PAE data, adapted from https://doi.org/10.1101/2025.02.10.637595. Supports outputs from structure prediction tools like Boltz and AlphaFold.

class molecular_simulations.analysis.ipSAE.ipSAE(structure_file, plddt_file, pae_file, out_path=None, skip_chains=None)[source]

Bases: object

Compute interaction prediction Score from Aligned Errors.

Computes various model quality scores including pDockQ, pDockQ2, LIS, ipTM, and ipSAE for structure predictions.

parser

ModelParser instance for structure file.

plddt_file

Path to pLDDT data file.

pae_file

Path to PAE data file.

path

Output directory path.

scores

Polars DataFrame of computed scores after run().

Parameters:
  • structure_file (Path | str) – Path to PDB/CIF model file.

  • plddt_file (Path | str) – Path to pLDDT numpy file (.npz with ‘plddt’ key).

  • pae_file (Path | str) – Path to PAE numpy file (.npz with ‘pae’ key).

  • out_path (Path | str | None) – Output directory path. If None, uses parent directory of plddt_file.

Example

>>> scorer = ipSAE("model.pdb", "plddt.npz", "pae.npz")
>>> scorer.run()
>>> print(scorer.scores)

Initialize the ipSAE scorer.

Parameters:
  • structure_file (Path | str) – Path to structure file.

  • plddt_file (Path | str) – Path to pLDDT data file.

  • pae_file (Path | str) – Path to PAE data file.

  • out_path (Path | str | None) – Output directory path.

  • skip_chains (set[str] | list[str] | None) – Chain IDs to exclude from scoring (e.g. glycan or ligand chains). Their pLDDT/PAE tokens are inferred from remaining tokens and dropped before scoring.

parse_structure_file()[source]

Parse the structure file and extract relevant details.

Runs the parser to read the structure file and classifies chains as protein or nucleic acid.

Return type:

None

prepare_scorer()[source]

Initialize the ScoreCalculator for computing scores.

Creates a ScoreCalculator instance with chain information extracted from the parsed structure.

Return type:

None

run()[source]

Execute the complete ipSAE scoring workflow.

Parses structure, computes distogram, loads pLDDT and PAE data, runs the scorer, and saves results.

Return type:

None

save_scores()[source]

Save scores DataFrame to a Parquet file.

Return type:

None

load_pLDDT_file()[source]

Load and scale pLDDT data.

Return type:

ndarray

Returns:

pLDDT array scaled to 0-100 range.

load_PAE_file()[source]

Load PAE data from file.

Return type:

ndarray

Returns:

PAE array from the ‘pae’ key in the npz file.

class molecular_simulations.analysis.ipSAE.ScoreCalculator(chains, chain_pair_type, n_residues, pdockq_cutoff=8.0, pae_cutoff=12.0)[source]

Bases: object

Calculate model quality scores from structure predictions.

Computes pDockQ, pDockQ2, LIS, ipTM, and ipSAE scores for all chain pairs in a structure.

chains

Array of chain IDs for each residue.

unique_chains

Unique chain IDs in the structure.

chain_pair_type

Dictionary mapping chain ID to type.

n_res

Array of residue types.

permuted

List of all chain pairs to evaluate.

scores

DataFrame of computed scores after compute_scores().

Parameters:
  • chains (ndarray) – Array of chain IDs.

  • chain_pair_type (dict[str, str]) – Dictionary mapping chain ID to chain type (‘protein’ or ‘nucleic_acid’).

  • n_residues (int) – Number of residues per chain.

  • pdockq_cutoff (float) – Distance cutoff for pDockQ in Angstroms. Defaults to 8.0.

  • pae_cutoff (float) – PAE cutoff for ipSAE in Angstroms. Defaults to 12.0.

Example

>>> calc = ScoreCalculator(chains, chain_types, n_residues)
>>> calc.compute_scores(distances, plddt, pae)
>>> print(calc.scores)

Initialize the ScoreCalculator.

Parameters:
  • chains (ndarray) – Array of chain IDs.

  • chain_pair_type (dict[str, str]) – Chain ID to type mapping.

  • n_residues (int) – Residue type array.

  • pdockq_cutoff (float) – pDockQ distance cutoff.

  • pae_cutoff (float) – PAE cutoff.

compute_scores(distances, pLDDT, PAE)[source]

Compute all scores for all chain pairs.

Calculates pDockQ, pDockQ2, LIS, ipTM, and ipSAE scores for each permutation of chain pairs.

Parameters:
  • distances (ndarray) – Pairwise distance matrix between all residues.

  • pLDDT (ndarray) – Per-residue pLDDT values (0-100 scale).

  • PAE (ndarray) – Predicted aligned error matrix.

Return type:

None

compute_pDockQ_scores(chain1, chain2)[source]

Compute pDockQ and pDockQ2 scores for a chain pair.

pDockQ is computed from pLDDT and the inter-chain contacts within the distance cutoff, with no PAE term, while pDockQ2 depends on both pLDDT and PAE.

Parameters:
  • chain1 (str) – First chain identifier.

  • chain2 (str) – Second chain identifier.

Return type:

tuple[float, float]

Returns:

Tuple of (pDockQ, pDockQ2) scores.

compute_LIS(chain1, chain2)[source]

Compute Local Interaction Score (LIS) for a chain pair.

LIS is computed from the subset of inter-chain predicted aligned error (PAE) values below a 12 Å cutoff. Values lie in (0, 1], where 1 indicates perfect accuracy.

Adapted from: https://doi.org/10.1101/2024.02.19.580970

Parameters:
  • chain1 (str) – First chain identifier.

  • chain2 (str) – Second chain identifier.

Return type:

float

Returns:

LIS value for the chain pair.
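
As a rough illustration of the description above, the following minimal sketch computes a LIS-style value from the inter-chain block of a PAE matrix. The function name local_interaction_score, the nested-list input, the strict < comparison, and the 0.0 fallback for an empty selection are assumptions made for illustration, not the module's actual implementation.

```python
def local_interaction_score(pae_block, cutoff=12.0):
    """Mean of (cutoff - PAE) / cutoff over inter-chain PAE values below cutoff.

    pae_block holds PAE values (in Angstroms) with rows from chain1 and
    columns from chain2. Hypothetical helper, not part of the module.
    """
    kept = [(cutoff - pae) / cutoff
            for row in pae_block
            for pae in row
            if pae < cutoff]
    # Assumption: report 0.0 when no residue pair passes the cutoff.
    return sum(kept) / len(kept) if kept else 0.0


block = [[3.0, 15.0],
         [6.0, 20.0]]
lis_value = local_interaction_score(block)
print(lis_value)  # 0.625: mean of (12 - 3) / 12 and (12 - 6) / 12
```

Only the two values below 12 Å contribute, so a pair with a few confident contacts can score well even when most of the PAE block is high.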

compute_ipTM_ipSAE(chain1, chain2)[source]

Compute ipTM and ipSAE scores for a chain pair.

ipTM uses d0 based on total chain pair length and averages over all chain2 residues. ipSAE uses a per-residue d0 based on the count of chain2 residues with PAE below the cutoff for each aligned residue in chain1, averaging only over those valid residues.

Parameters:
  • chain1 (str) – First chain identifier (aligned chain).

  • chain2 (str) – Second chain identifier (scored chain).

Return type:

tuple[Any, Any]

Returns:

Tuple of (ipTM, ipSAE) scores.
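
The difference between the two scores may be easier to see in code. The sketch below is a pure-Python approximation under several assumptions: the helper names (d0, ptm, iptm_ipsae) are hypothetical, chains are passed as lists of row/column indices, the protein floor of 1.0 is used for d0, and the final score is taken as the maximum over aligned residues as in the referenced paper. The module itself may differ in these details.

```python
def d0(length, min_value=1.0):
    """d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8), with L clamped to 27."""
    length = max(float(length), 27.0)
    return max(min_value, 1.24 * (length - 15.0) ** (1.0 / 3.0) - 1.8)


def ptm(x, d0_value):
    """pTM-style transform of a single PAE value."""
    return 1.0 / (1.0 + (x / d0_value) ** 2)


def iptm_ipsae(pae, idx1, idx2, pae_cutoff=12.0):
    """Return (ipTM, ipSAE) for chain1 rows idx1 scored against chain2 columns idx2."""
    d0_pair = d0(len(idx1) + len(idx2))  # ipTM: one d0 from the pair length
    iptm_rows, ipsae_rows = [], []
    for i in idx1:
        row = [pae[i][j] for j in idx2]
        # ipTM averages over every chain2 residue.
        iptm_rows.append(sum(ptm(x, d0_pair) for x in row) / len(row))
        # ipSAE keeps only chain2 residues with PAE below the cutoff and
        # derives d0 from how many survive for this aligned residue.
        valid = [x for x in row if x < pae_cutoff]
        if valid:
            d0_i = d0(len(valid))
            ipsae_rows.append(sum(ptm(x, d0_i) for x in valid) / len(valid))
        else:
            ipsae_rows.append(0.0)
    # Assumption: the pair score is the best aligned residue.
    return max(iptm_rows), max(ipsae_rows)


# Toy 4-token PAE matrix: chain A is tokens 0-1, chain B is tokens 2-3.
pae = [[0.0, 1.0, 2.0, 3.0],
       [1.0, 0.0, 4.0, 20.0],
       [10.0, 15.0, 0.0, 1.0],
       [25.0, 30.0, 1.0, 0.0]]
iptm_ab, ipsae_ab = iptm_ipsae(pae, [0, 1], [2, 3])
iptm_ba, ipsae_ba = iptm_ipsae(pae, [2, 3], [0, 1])
print(ipsae_ab, ipsae_ba)
```

Because PAE is not symmetric, the A-to-B and B-to-A results differ in this toy matrix, which is why both directions are scored.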

get_max_values()[source]

Extract maximum scores for undirected chain pairs.

Because some scores like ipSAE are asymmetric (A->B != B->A), takes the maximum score for either direction as the undirected score.

Return type:

None
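
A minimal sketch of the reduction described above, assuming directed scores are held in a plain dictionary keyed by (chain1, chain2). The function name undirected_max and the dictionary layout are hypothetical; the module stores its scores in a DataFrame.

```python
def undirected_max(directed_scores):
    """Collapse directed pair scores to one value per unordered pair."""
    best = {}
    for (chain1, chain2), score in directed_scores.items():
        key = tuple(sorted((chain1, chain2)))  # (A, B) and (B, A) share a key
        best[key] = max(score, best.get(key, float("-inf")))
    return best


directed = {("A", "B"): 0.62, ("B", "A"): 0.48,
            ("A", "C"): 0.10, ("C", "A"): 0.31}
undirected = undirected_max(directed)
print(undirected)  # {('A', 'B'): 0.62, ('A', 'C'): 0.31}
```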

permute_chains()[source]

Generate all permutations of chain pairs.

Creates all unique ordered pairs of chains, excluding self-pairs.

Return type:

None
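
For illustration, ordered pairs without self-pairs can be generated with itertools.permutations. This is a sketch of the idea only; the function name ordered_chain_pairs is hypothetical and the module may build its list differently.

```python
from itertools import permutations


def ordered_chain_pairs(unique_chains):
    """Return every ordered pair (a, b) of distinct chains."""
    return list(permutations(unique_chains, 2))


pairs = ordered_chain_pairs(["A", "B", "C"])
print(pairs)
# [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
```

For n chains this yields n * (n - 1) pairs, so both directions of every pair are scored.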

static pDockQ_score(x)[source]

Compute pDockQ score.

Formula: pDockQ = 0.724 / (1 + exp(-0.052 * (x - 152.611))) + 0.018

Reference: https://doi.org/10.1038/s41467-022-28865-w

Parameters:

x (float) – Mean pLDDT scaled by log10 of the number of residue pairs meeting pLDDT and distance cutoffs.

Return type:

float

Returns:

pDockQ score.
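
The formula above can be written as a short standalone function. In this sketch the name pdockq_score and the example inputs (mean_plddt, n_contacts) are hypothetical, and "scaled by" is read as multiplication.

```python
import math


def pdockq_score(x):
    """Sigmoid fit: 0.724 / (1 + exp(-0.052 * (x - 152.611))) + 0.018."""
    return 0.724 / (1.0 + math.exp(-0.052 * (x - 152.611))) + 0.018


# Hypothetical interface: mean pLDDT of 85 over 100 contacting residue pairs.
mean_plddt = 85.0
n_contacts = 100
x = mean_plddt * math.log10(n_contacts)  # 85 * 2 = 170
score = pdockq_score(x)
print(x, score)
```

The sigmoid is bounded between 0.018 and 0.742 and passes through 0.38 at its midpoint x = 152.611.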

static pDockQ2_score(x)[source]

Compute pDockQ2 score.

Formula: pDockQ2 = 1.31 / (1 + exp(-0.075 * (x - 84.733))) + 0.005

Reference: https://doi.org/10.1093/bioinformatics/btad424

Parameters:

x (float) – Mean pLDDT scaled by mean PAE score.

Return type:

float

Returns:

pDockQ2 score.
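
A sketch of how the input x might be assembled before applying the formula above. It assumes, following the pDockQ2 reference, that interface PAE values are first transformed with 1 / (1 + (PAE / 10)^2) and that the mean of those values multiplies the mean interface pLDDT. The function names and the d0 = 10 constant are assumptions about this module, not confirmed details.

```python
import math


def pdockq2_score(x):
    """Sigmoid fit: 1.31 / (1 + exp(-0.075 * (x - 84.733))) + 0.005."""
    return 1.31 / (1.0 + math.exp(-0.075 * (x - 84.733))) + 0.005


def pdockq2_input(mean_plddt, interface_pae, d0=10.0):
    """Mean pLDDT multiplied by the mean transformed PAE (assumed d0 = 10)."""
    transformed = [1.0 / (1.0 + (pae / d0) ** 2) for pae in interface_pae]
    return mean_plddt * sum(transformed) / len(transformed)


# Two interface PAE values transform to 0.8 and 0.5, so x = 90 * 0.65.
x2 = pdockq2_input(90.0, [5.0, 10.0])
score2 = pdockq2_score(x2)
print(x2, score2)
```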

static compute_pTM(x, d0)[source]

Compute pTM score.

Formula: pTM = 1.0 / (1 + (x / d0)^2)

Parameters:
  • x (ndarray) – PAE values (scalar or array).

  • d0 (float) – Distance parameter from compute_d0.

Return type:

ndarray

Returns:

pTM score(s).
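
A scalar sketch of the formula, useful for checking its limits. The name ptm_score is hypothetical, and the module's version operates on NumPy arrays rather than Python floats.

```python
def ptm_score(x, d0):
    """pTM-style transform: 1 / (1 + (x / d0)^2)."""
    return 1.0 / (1.0 + (x / d0) ** 2)


d0 = 4.0
values = [ptm_score(x, d0) for x in (0.0, 4.0, 12.0)]
print(values)  # [1.0, 0.5, 0.1]
```

A PAE of zero maps to 1, a PAE equal to d0 maps to 0.5, and larger errors decay toward 0.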

static compute_d0(L, pair_type)[source]

Compute d0 parameter for pTM calculation.

Formula: d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8)

Parameters:
  • L (int) – Sequence length (minimum 27).

  • pair_type (str) – ‘protein’ or ‘nucleic_acid’.

Return type:

float

Returns:

d0 parameter value.
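
The following sketch shows one way to evaluate the formula. Two details are assumptions taken from the reference ipSAE script rather than from this documentation: L is clamped to 27 before use, and min_value is 1.0 for protein pairs and 2.0 for nucleic acid pairs. The name d0_value is hypothetical.

```python
def d0_value(length, pair_type="protein"):
    """d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8)."""
    length = max(float(length), 27.0)  # assumed clamp to the minimum of 27
    # Assumed floors: 2.0 for nucleic acid pairs, 1.0 otherwise.
    min_value = 2.0 if pair_type == "nucleic_acid" else 1.0
    return max(min_value, 1.24 * (length - 15.0) ** (1.0 / 3.0) - 1.8)


short_protein = d0_value(10)                  # same as d0_value(27)
short_nucleic = d0_value(27, "nucleic_acid")  # floor of 2.0 applies
long_protein = d0_value(1015)                 # 1.24 * 10 - 1.8
print(short_protein, short_nucleic, long_protein)
```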

static compute_d0_array(L, pair_type)[source]

Compute d0 parameter for an array of sequence lengths.

Vectorized version of compute_d0 for per-residue d0 calculation used in ipSAE scoring.

Parameters:
  • L (ndarray) – Array of sequence lengths.

  • pair_type (str) – ‘protein’ or ‘nucleic_acid’.

Return type:

ndarray

Returns:

Array of d0 parameter values.

class molecular_simulations.analysis.ipSAE.ModelParser(structure, skip_chains=None)[source]

Bases: object

Parse structure files to extract residue and atom information.

Handles both PDB and CIF format files, extracting C-alpha, C-beta, and nucleic acid backbone atom coordinates.

structure

Path to the structure file.

protein_token_indices

pLDDT/PAE indices for kept-chain anchor tokens; populated by build_protein_token_indices().

residues

List of dictionaries containing residue information.

chains

List of chain IDs for each residue.

chain_types

Dictionary mapping chain ID to type after classify_chains().

Parameters:

structure (Path | str) – Path to PDB or CIF file.

Example

>>> parser = ModelParser("model.pdb")
>>> parser.parse_structure_file()
>>> parser.classify_chains()

Initialize the ModelParser.

Parameters:
  • structure (Path | str) – Path to PDB or CIF file.

  • skip_chains (set[str] | list[str] | None) – Chain IDs to exclude (e.g. glycan/ligand chains whose per-atom token counts in the pLDDT/PAE arrays aren’t recoverable from the structure file alone).

parse_structure_file()[source]

Parse the structure file and extract atom/residue data.

Atoms in skip_chains are ignored. Chain order is preserved so that build_protein_token_indices() can later infer the token span of skipped chains by arithmetic.

Return type:

None

build_protein_token_indices(total_tokens)[source]

Derive pLDDT/PAE indices for kept-chain anchor tokens.

Each kept chain contributes one token per residue; any run of consecutive skipped chains occupies the token span left over between kept chains. Solvable when skipped chains form at most one contiguous block in chain_order — the default layout emitted by Chai, Boltz, and AlphaFold (polymers first, then non-polymer chains).

Parameters:

total_tokens (int) – Length of the pLDDT array, which equals the size of each dimension of the square PAE matrix.

Raises:

ValueError – If skipped chains appear in multiple non-contiguous runs, making per-run sizes ambiguous.

Return type:

None
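
The token arithmetic can be sketched as follows. This is an illustrative reconstruction, not the module's code: the function name kept_token_indices, its arguments, and the extra sanity check on the leftover span are all assumptions.

```python
def kept_token_indices(chain_order, n_residues, skip_chains, total_tokens):
    """Return pLDDT/PAE indices of kept-chain tokens, one per residue.

    chain_order: chain IDs in file order.
    n_residues: residue count for each kept chain.
    """
    # Count runs of consecutive skipped chains; more than one is ambiguous.
    runs, prev_skipped = 0, False
    for chain in chain_order:
        skipped = chain in skip_chains
        if skipped and not prev_skipped:
            runs += 1
        prev_skipped = skipped
    if runs > 1:
        raise ValueError("skipped chains form multiple non-contiguous runs")

    n_kept = sum(n_residues[c] for c in chain_order if c not in skip_chains)
    gap = total_tokens - n_kept  # tokens owned by the skipped block
    # Extra sanity check (an assumption, not stated in the docs).
    if gap < 0 or (runs == 0 and gap != 0):
        raise ValueError("token count does not match the kept residues")

    indices, cursor, gap_used = [], 0, False
    for chain in chain_order:
        if chain in skip_chains:
            if not gap_used:  # jump over the whole skipped block once
                cursor += gap
                gap_used = True
            continue
        indices.extend(range(cursor, cursor + n_residues[chain]))
        cursor += n_residues[chain]
    return indices


# Chain G (a glycan, say) sits between A and B and owns 9 - 5 = 4 tokens.
idx = kept_token_indices(["A", "G", "B"], {"A": 3, "B": 2}, {"G"}, 9)
print(idx)  # [0, 1, 2, 7, 8]
```

With two separate skipped runs only their combined size is known, which is why that layout is rejected.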

classify_chains()[source]

Classify chains as protein or nucleic acid.

Reads through residue data to assign chain identity based on whether nucleic acid residues are detected.

Return type:

None
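
A minimal sketch of the classification rule, assuming each residue is a dictionary with chain_id and residue_name keys and that a single nucleic acid residue is enough to label the whole chain. Both the key names and that rule are assumptions; only the NUCLEIC_ACIDS set is taken from this class.

```python
NUCLEIC_ACIDS = frozenset({"A", "C", "G", "U", "DA", "DC", "DG", "DT"})


def classify_chains(residues):
    """Map each chain ID to 'protein' or 'nucleic_acid'."""
    chain_types = {}
    for residue in residues:
        chain = residue["chain_id"]
        chain_types.setdefault(chain, "protein")
        if residue["residue_name"] in NUCLEIC_ACIDS:
            chain_types[chain] = "nucleic_acid"
    return chain_types


residues = [
    {"chain_id": "A", "residue_name": "ALA"},
    {"chain_id": "A", "residue_name": "GLY"},
    {"chain_id": "B", "residue_name": "DA"},
    {"chain_id": "B", "residue_name": "DT"},
]
chain_types = classify_chains(residues)
print(chain_types)  # {'A': 'protein', 'B': 'nucleic_acid'}
```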

NUCLEIC_ACIDS = frozenset({'A', 'C', 'DA', 'DC', 'DG', 'DT', 'G', 'U'})

STANDARD_RESIDUES = frozenset({'A', 'ALA', 'ARG', 'ASN', 'ASP', 'C', 'CYS', 'DA', 'DC', 'DG', 'DT', 'G', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'U', 'VAL'})

static parse_pdb_line(line, *args, **kwargs)[source]

Parse a single line of a PDB file.

Parameters:
  • line (str) – Line from the PDB file.

  • *args, **kwargs – Unused, for API compatibility with parse_cif_line.

Return type:

dict[str, Any]

Returns:

Dictionary with atom/residue information.
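
PDB ATOM records use fixed columns, so a line can be parsed by slicing. The sketch below follows the standard PDB column layout; the function name parse_pdb_atom, the dictionary keys (borrowed from the package_line parameters), and the conversion to int and float are assumptions rather than the module's behavior.

```python
def parse_pdb_atom(line):
    """Slice one ATOM/HETATM record using the standard PDB columns."""
    return {
        "atom_num": int(line[6:11]),           # columns 7-11
        "atom_name": line[12:16].strip(),      # columns 13-16
        "residue_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21].strip(),          # column 22
        "residue_id": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),               # columns 31-38
        "y": float(line[38:46]),               # columns 39-46
        "z": float(line[46:54]),               # columns 47-54
    }


# Build a correctly aligned sample record instead of typing spaces by hand.
sample = "ATOM  {:>5} {:<4} {:>3} {}{:>4}    {:>8}{:>8}{:>8}".format(
    1, " CA", "ALA", "A", 12, "27.340", "24.430", "2.614")
atom = parse_pdb_atom(sample)
print(atom)
```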

static parse_cif_line(line, fields, allow_missing_seq_id=False)[source]

Parse a single line of a CIF file.

Parameters:
  • line (str) – Line from the CIF file.

  • fields (dict[str, int]) – Dictionary mapping field names to column indices.

  • allow_missing_seq_id (bool) – If True, fall back to auth_seq_id when label_seq_id is ‘.’. Used for non-polymer residues (ligands, glycans) which lack a label_seq_id.

Return type:

dict[str, Any] | None

Returns:

Dictionary with atom/residue information, or None if residue_id is missing and fallback is disabled.
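
The fallback described for allow_missing_seq_id can be sketched like this. The field names are standard mmCIF _atom_site items, but which of them the module actually reads, the column order of the sample lines, the returned keys, and the function name parse_cif_atom are assumptions. Splitting on whitespace also ignores quoted CIF values, which a full parser would need to handle.

```python
def parse_cif_atom(line, fields, allow_missing_seq_id=False):
    """Parse one _atom_site row given a field-name to column-index map."""
    cols = line.split()
    residue_id = cols[fields["label_seq_id"]]
    if residue_id == ".":
        # Non-polymer residues (ligands, glycans) have no label_seq_id.
        if not allow_missing_seq_id:
            return None
        residue_id = cols[fields["auth_seq_id"]]
    return {
        "atom_num": cols[fields["id"]],
        "atom_name": cols[fields["label_atom_id"]],
        "residue_name": cols[fields["label_comp_id"]],
        "chain_id": cols[fields["label_asym_id"]],
        "residue_id": residue_id,
        "x": float(cols[fields["Cartn_x"]]),
        "y": float(cols[fields["Cartn_y"]]),
        "z": float(cols[fields["Cartn_z"]]),
    }


# Hypothetical column order for the two sample rows below.
fields = {"group_PDB": 0, "id": 1, "label_atom_id": 2, "label_comp_id": 3,
          "label_asym_id": 4, "label_seq_id": 5, "auth_seq_id": 6,
          "Cartn_x": 7, "Cartn_y": 8, "Cartn_z": 9}
polymer = "ATOM 1 CA ALA A 12 12 27.340 24.430 2.614"
glycan = "HETATM 900 C1 NAG C . 401 1.000 2.000 3.000"

protein_atom = parse_cif_atom(polymer, fields)
dropped = parse_cif_atom(glycan, fields)
glycan_atom = parse_cif_atom(glycan, fields, allow_missing_seq_id=True)
print(protein_atom["residue_id"], dropped, glycan_atom["residue_id"])
```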

static package_line(atom_num, atom_name, residue_name, chain_id, residue_id, x, y, z)[source]

Package parsed line data into a dictionary.

Parameters:
  • atom_num (str) – Atom index.

  • atom_name (str) – Atom name (e.g., ‘CA’, ‘CB’).

  • residue_name (str) – Residue name (e.g., ‘ALA’).

  • chain_id (str) – Chain identifier.

  • residue_id (str) – Residue sequence number.

  • x (str) – X coordinate as string.

  • y (str) – Y coordinate as string.

  • z (str) – Z coordinate as string.

Return type:

dict[str, Any]

Returns:

Dictionary containing parsed atom/residue data.