molecular_simulations.analysis.ipSAE module¶

Interface prediction Score from Aligned Errors (ipSAE) module.

This module computes interaction prediction scores from pLDDT and PAE data, adapted from https://doi.org/10.1101/2025.02.10.637595. Supports outputs from structure prediction tools like Boltz and AlphaFold.

class molecular_simulations.analysis.ipSAE.ipSAE(structure_file, plddt_file, pae_file, out_path=None, skip_chains=None)[source]¶

Bases: object

Compute interaction prediction Score from Aligned Errors.

Computes various model quality scores including pDockQ, pDockQ2, LIS, ipTM, and ipSAE for structure predictions.

parser¶: ModelParser instance for structure file.

plddt_file¶: Path to pLDDT data file.

pae_file¶: Path to PAE data file.

path¶: Output directory path.

scores¶: Polars DataFrame of computed scores after run().

Parameters:

structure_file (Path | str) – Path to PDB/CIF model file.
plddt_file (Path | str) – Path to pLDDT numpy file (.npz with ‘plddt’ key).
pae_file (Path | str) – Path to PAE numpy file (.npz with ‘pae’ key).
out_path (Path | str | None) – Output directory path. If None, uses parent directory of plddt_file.
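The two array inputs can be prepared with NumPy. The sketch below writes and re-reads .npz files using the 'plddt' and 'pae' keys named above. The array sizes, random values, and temporary directory are placeholders for illustration, not output from a real predictor.

```python
import tempfile
from pathlib import Path

import numpy as np

n_tokens = 5  # placeholder: one token per residue in a tiny complex
rng = np.random.default_rng(0)
plddt = rng.uniform(50.0, 95.0, size=n_tokens)        # per-token confidence
pae = rng.uniform(0.5, 30.0, size=(n_tokens, n_tokens))  # pairwise error

out_dir = Path(tempfile.mkdtemp())
np.savez(out_dir / "plddt.npz", plddt=plddt)  # stored under the 'plddt' key
np.savez(out_dir / "pae.npz", pae=pae)        # stored under the 'pae' key

# Reading the arrays back by key
loaded_plddt = np.load(out_dir / "plddt.npz")["plddt"]
loaded_pae = np.load(out_dir / "pae.npz")["pae"]
```

Note that the PAE matrix is square and its dimension matches the length of the pLDDT array.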

Example

>>> scorer = ipSAE("model.pdb", "plddt.npz", "pae.npz")
>>> scorer.run()
>>> print(scorer.scores)

__init__(structure_file, plddt_file, pae_file, out_path=None, skip_chains=None)[source]¶

Initialize the ipSAE scorer.

Parameters:

structure_file (Path | str) – Path to structure file.
plddt_file (Path | str) – Path to pLDDT data file.
pae_file (Path | str) – Path to PAE data file.
out_path (Path | str | None) – Output directory path.
skip_chains (set[str] | list[str] | None) – Chain IDs to exclude from scoring (e.g. glycan or ligand chains). Their pLDDT/PAE tokens are inferred from remaining tokens and dropped before scoring.

parse_structure_file()[source]¶

Parse the structure file and extract relevant details.

Runs the parser to read the structure file and classifies chains as protein or nucleic acid.

Return type: None

prepare_scorer()[source]¶

Initialize the ScoreCalculator for computing scores.

Creates a ScoreCalculator instance with chain information extracted from the parsed structure.

Return type: None

run()[source]¶

Execute the complete ipSAE scoring workflow.

Parses structure, computes distogram, loads pLDDT and PAE data, runs the scorer, and saves results.

Return type: None

save_scores()[source]¶

Save scores DataFrame to a Parquet file.

Return type: None

load_pLDDT_file()[source]¶

Load and scale pLDDT data.

Return type: ndarray
Returns: pLDDT array scaled to 0-100 range.
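A minimal sketch of such scaling, assuming (this is not confirmed by the documentation above) that some predictors store pLDDT as fractions in 0-1 while others already use 0-100. The helper name scale_plddt and the max <= 1 heuristic are hypothetical.

```python
import numpy as np


def scale_plddt(plddt):
    """Return pLDDT on a 0-100 scale (hypothetical helper)."""
    plddt = np.asarray(plddt, dtype=float)
    # Assumption: values that never exceed 1 are fractional confidences.
    if plddt.size and plddt.max() <= 1.0:
        return plddt * 100.0
    return plddt


fractional = scale_plddt([0.5, 0.9])
already_scaled = scale_plddt([50.0, 90.0])
```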

load_PAE_file()[source]¶

Load PAE data from file.

Return type: ndarray
Returns: PAE array from the ‘pae’ key in the npz file.

class molecular_simulations.analysis.ipSAE.ScoreCalculator(chains, chain_pair_type, n_residues, pdockq_cutoff=8.0, pae_cutoff=12.0)[source]¶

Bases: object

Calculate model quality scores from structure predictions.

Computes pDockQ, pDockQ2, LIS, ipTM, and ipSAE scores for all chain pairs in a structure.

chains¶: Array of chain IDs for each residue.

unique_chains¶: Unique chain IDs in the structure.

chain_pair_type¶: Dictionary mapping chain ID to type.

n_res¶: Array of residue types.

permuted¶: List of all chain pairs to evaluate.

scores¶: DataFrame of computed scores after compute_scores().

Parameters:

chains (ndarray) – Array of chain IDs.
chain_pair_type (dict[str, str]) – Dictionary mapping chain ID to chain type (‘protein’ or ‘nucleic_acid’).
n_residues (int) – Number of residues per chain.
pdockq_cutoff (float) – Distance cutoff for pDockQ in Angstroms. Defaults to 8.0.
pae_cutoff (float) – PAE cutoff for ipSAE in Angstroms. Defaults to 12.0.

Example

>>> calc = ScoreCalculator(chains, chain_types, n_residues)
>>> calc.compute_scores(distances, plddt, pae)
>>> print(calc.scores)

__init__(chains, chain_pair_type, n_residues, pdockq_cutoff=8.0, pae_cutoff=12.0)[source]¶

Initialize the ScoreCalculator.

Parameters:

chains (ndarray) – Array of chain IDs.
chain_pair_type (dict[str, str]) – Chain ID to type mapping.
n_residues (int) – Number of residues per chain.
pdockq_cutoff (float) – pDockQ distance cutoff.
pae_cutoff (float) – PAE cutoff.

compute_scores(distances, pLDDT, PAE)[source]¶

Compute all scores for all chain pairs.

Calculates pDockQ, pDockQ2, LIS, ipTM, and ipSAE scores for each permutation of chain pairs.

Parameters:

distances (ndarray) – Pairwise distance matrix between all residues.
pLDDT (ndarray) – Per-residue pLDDT values (0-100 scale).
PAE (ndarray) – Predicted aligned error matrix.

Return type:

None

compute_pDockQ_scores(chain1, chain2)[source]¶

Compute pDockQ and pDockQ2 scores for a chain pair.

pDockQ depends solely on pLDDT, while pDockQ2 depends on both pLDDT and PAE.

Parameters:

chain1 (str) – First chain identifier.
chain2 (str) – Second chain identifier.

Return type:

tuple[float, float]

Returns:

Tuple of (pDockQ, pDockQ2) scores.

compute_LIS(chain1, chain2)[source]¶

Compute Local Interaction Score (LIS) for a chain pair.

LIS is computed from the subset of predicted aligned error (PAE) values that fall below a 12 Å cutoff. Values lie in (0, 1], where 1 indicates perfect predicted accuracy.

Adapted from: https://doi.org/10.1101/2024.02.19.580970

Parameters:

chain1 (str) – First chain identifier.
chain2 (str) – Second chain identifier.

Return type:

float

Returns:

LIS value for the chain pair.
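The calculation can be sketched as follows, assuming the rescaling (cutoff - PAE) / cutoff used in the referenced LIS preprint. The function name, the per-token chain array, and the 0.0 returned when no value passes the cutoff are illustrative choices, not the module's confirmed behavior.

```python
import numpy as np


def local_interaction_score(pae, chains, chain1, chain2, cutoff=12.0):
    """Mean rescaled PAE over chain1 -> chain2 entries below the cutoff."""
    chains = np.asarray(chains)
    # Inter-chain block: rows are chain1 tokens, columns are chain2 tokens.
    block = pae[np.ix_(chains == chain1, chains == chain2)]
    valid = block[block < cutoff]
    if valid.size == 0:
        return 0.0  # assumption: no confident contacts means no score
    return float(np.mean((cutoff - valid) / cutoff))


chains = ["A", "A", "B", "B"]
pae = np.zeros((4, 4))
pae[0:2, 2:4] = [[0.0, 6.0], [12.0, 20.0]]  # A -> B block
lis_ab = local_interaction_score(pae, chains, "A", "B")
lis_ba = local_interaction_score(pae, chains, "B", "A")
```

Only the entries 0.0 and 6.0 pass the cutoff in the A -> B block, so that direction averages 1.0 and 0.5.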

compute_ipTM_ipSAE(chain1, chain2)[source]¶

Compute ipTM and ipSAE scores for a chain pair.

ipTM uses d0 based on total chain pair length and averages over all chain2 residues. ipSAE uses a per-residue d0 based on the count of chain2 residues with PAE below the cutoff for each aligned residue in chain1, averaging only over those valid residues.

Parameters:

chain1 (str) – First chain identifier (aligned chain).
chain2 (str) – Second chain identifier (scored chain).

Return type:

tuple[Any, Any]

Returns:

Tuple of (ipTM, ipSAE) scores.
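The difference between the two scores can be sketched as below. Several details are assumptions not stated above: the final score is taken as the maximum over aligned chain1 residues (as in the referenced preprint), the d0 floor is 1.0 (a protein pair), lengths below 27 are clamped to 27, and aligned residues with no valid partners score 0. All function names are hypothetical.

```python
import numpy as np


def d0(L, min_value=1.0):
    """d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8), with L clamped to 27."""
    L = np.maximum(np.asarray(L, dtype=float), 27.0)
    return np.maximum(min_value, 1.24 * np.cbrt(L - 15.0) - 1.8)


def iptm_ipsae(pae, chains, chain1, chain2, pae_cutoff=12.0):
    chains = np.asarray(chains)
    rows, cols = chains == chain1, chains == chain2
    block = pae[np.ix_(rows, cols)]  # aligned chain1 x scored chain2

    # ipTM: a single d0 from the combined length, mean over all chain2 residues.
    d0_pair = d0(rows.sum() + cols.sum())
    iptm_per_res = (1.0 / (1.0 + (block / d0_pair) ** 2)).mean(axis=1)

    # ipSAE: per-residue d0 from the count of chain2 residues under the cutoff,
    # averaging only over those valid residues.
    valid = block < pae_cutoff
    n_valid = valid.sum(axis=1)
    d0_res = d0(n_valid)[:, None]
    ptm = 1.0 / (1.0 + (block / d0_res) ** 2)
    sums = np.where(valid, ptm, 0.0).sum(axis=1)
    ipsae_per_res = np.divide(
        sums, n_valid, out=np.zeros_like(sums), where=n_valid > 0
    )
    return float(iptm_per_res.max()), float(ipsae_per_res.max())


chains = ["A", "A", "B", "B"]
confident = iptm_ipsae(np.zeros((4, 4)), chains, "A", "B")
uncertain = iptm_ipsae(np.full((4, 4), 30.0), chains, "A", "B")
```

With a uniformly high PAE of 30 Å, no chain2 residue passes the cutoff, so the sketch's ipSAE is zero while ipTM is small but nonzero.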

get_max_values()[source]¶

Extract maximum scores for undirected chain pairs.

Some scores, such as ipSAE, are asymmetric (A->B != B->A), so the larger of the two directional scores is taken as the undirected score.

Return type: None
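The reduction can be illustrated in plain Python. The dictionary layout and the function name are hypothetical; the module itself stores scores in a DataFrame.

```python
def undirected_max(directed):
    """Collapse {(chain1, chain2): score} to the best score per unordered pair."""
    best = {}
    for (a, b), score in directed.items():
        key = tuple(sorted((a, b)))  # (A, B) and (B, A) share one key
        best[key] = max(score, best.get(key, float("-inf")))
    return best


directed_ipsae = {("A", "B"): 0.4, ("B", "A"): 0.7, ("A", "C"): 0.2, ("C", "A"): 0.1}
undirected = undirected_max(directed_ipsae)
```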

permute_chains()[source]¶

Generate all permutations of chain pairs.

Creates all unique ordered pairs of chains, excluding self-pairs.

Return type: None
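One way to produce such ordered pairs is itertools.permutations, shown here as a sketch; whether the module uses this exact call is an assumption, and the function name is hypothetical.

```python
from itertools import permutations


def ordered_chain_pairs(unique_chains):
    """All (chain1, chain2) pairs with chain1 != chain2, in both directions."""
    return list(permutations(unique_chains, 2))


pairs = ordered_chain_pairs(["A", "B", "C"])
```

Three chains yield six ordered pairs, since each unordered pair appears in both directions.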

static pDockQ_score(x)[source]¶

Compute pDockQ score.

Formula: pDockQ = 0.724 / (1 + exp(-0.052 * (x - 152.611))) + 0.018

Reference: https://doi.org/10.1038/s41467-022-28865-w

Parameters: x (float) – Mean pLDDT scaled by log10 of the number of residue pairs meeting pLDDT and distance cutoffs.
Return type: float
Returns: pDockQ score.
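The formula above translates directly to Python. In this sketch the function name and the example inputs (a mean interface pLDDT of 85 over 100 contacting residue pairs) are made up for illustration.

```python
import math


def pdockq(x):
    """Logistic fit: 0.724 / (1 + exp(-0.052 * (x - 152.611))) + 0.018."""
    return 0.724 / (1.0 + math.exp(-0.052 * (x - 152.611))) + 0.018


mean_plddt = 85.0   # placeholder value
n_contacts = 100    # placeholder count of qualifying residue pairs
x = mean_plddt * math.log10(n_contacts)
score = pdockq(x)
midpoint = pdockq(152.611)  # half of the logistic amplitude plus the offset
```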

static pDockQ2_score(x)[source]¶

Compute pDockQ2 score.

Formula: pDockQ2 = 1.31 / (1 + exp(-0.075 * (x - 84.733))) + 0.005

Reference: https://doi.org/10.1093/bioinformatics/btad424

Parameters: x (float) – Mean pLDDT scaled by mean PAE score.
Return type: float
Returns: pDockQ2 score.
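A direct transcription of the formula, evaluated at a few points to show the shape of the curve. The function name and the sample x values are illustrative only.

```python
import math


def pdockq2(x):
    """Logistic fit: 1.31 / (1 + exp(-0.075 * (x - 84.733))) + 0.005."""
    return 1.31 / (1.0 + math.exp(-0.075 * (x - 84.733))) + 0.005


low, mid, high = pdockq2(0.0), pdockq2(84.733), pdockq2(100.0)
```

Because the logistic amplitude is 1.31, the fitted curve itself is not bounded by 1 for very large x.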

static compute_pTM(x, d0)[source]¶

Compute pTM score.

Formula: pTM = 1.0 / (1 + (x / d0)^2)

Parameters:

x (ndarray) – PAE values (scalar or array).
d0 (float) – Distance parameter from compute_d0.

Return type:

ndarray

Returns:

pTM score(s).
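A NumPy transcription of the formula that accepts either a scalar or an array, as the parameter description suggests. The function name ptm_term is hypothetical.

```python
import numpy as np


def ptm_term(x, d0):
    """1 / (1 + (x / d0)^2), evaluated elementwise."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + (x / d0) ** 2)


at_zero = ptm_term(0.0, 5.0)       # zero error gives the maximum value
at_d0 = ptm_term(5.0, 5.0)         # error equal to d0 gives one half
matrix = ptm_term(np.array([[0.0, 5.0], [10.0, 15.0]]), 5.0)
```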

static compute_d0(L, pair_type)[source]¶

Compute d0 parameter for pTM calculation.

Formula: d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8)

Parameters:

L (int) – Sequence length; values below 27 are treated as 27.
pair_type (str) – ‘protein’ or ‘nucleic_acid’.

Return type:

float

Returns:

d0 parameter value.
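A scalar sketch of the formula. The floor values used for min_value (1.0 for protein, 2.0 for nucleic acid) are an assumption taken from the reference ipSAE implementation and are not stated in the documentation above.

```python
def compute_d0_sketch(L, pair_type):
    """d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8), with L clamped to 27."""
    L = max(float(L), 27.0)
    # Assumed floors; the documented formula leaves min_value unspecified.
    min_value = 2.0 if pair_type == "nucleic_acid" else 1.0
    return max(min_value, 1.24 * (L - 15.0) ** (1.0 / 3.0) - 1.8)


short_protein = compute_d0_sketch(10, "protein")
at_minimum = compute_d0_sketch(27, "protein")
long_protein = compute_d0_sketch(1015, "protein")
short_nucleic = compute_d0_sketch(27, "nucleic_acid")
```

Clamping L keeps the cube root's argument positive, so short chains never produce an undefined or negative d0.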

static compute_d0_array(L, pair_type)[source]¶

Compute d0 parameter for an array of sequence lengths.

Vectorized version of compute_d0 for per-residue d0 calculation used in ipSAE scoring.

Parameters:

L (ndarray) – Array of sequence lengths.
pair_type (str) – ‘protein’ or ‘nucleic_acid’.

Return type:

ndarray

Returns:

Array of d0 parameter values.
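A vectorized counterpart might look like the following. As with the scalar formula, the clamp of L to 27 and the floors of 1.0 (protein) and 2.0 (nucleic acid) are assumptions, and the function name is hypothetical.

```python
import numpy as np


def compute_d0_array_sketch(L, pair_type):
    """Elementwise d0 for an array of lengths."""
    L = np.maximum(np.asarray(L, dtype=float), 27.0)  # clamp short lengths
    min_value = 2.0 if pair_type == "nucleic_acid" else 1.0  # assumed floors
    return np.maximum(min_value, 1.24 * np.cbrt(L - 15.0) - 1.8)


d0_values = compute_d0_array_sketch([0, 10, 27, 1015], "protein")
```

In ipSAE scoring the lengths are per-residue counts, which can be zero, so the clamp matters in practice.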

class molecular_simulations.analysis.ipSAE.ModelParser(structure, skip_chains=None)[source]¶

Bases: object

Parse structure files to extract residue and atom information.

Handles both PDB and CIF format files, extracting C-alpha, C-beta, and nucleic acid backbone atom coordinates.

structure¶: Path to the structure file.

protein_token_indices¶: pLDDT/PAE indices for kept-chain anchor tokens; populated by build_protein_token_indices().

residues¶: List of dictionaries containing residue information.

chains¶: List of chain IDs for each residue.

chain_types¶: Dictionary mapping chain ID to type after classify_chains().

Parameters: structure (Path | str) – Path to PDB or CIF file.

Example

>>> parser = ModelParser("model.pdb")
>>> parser.parse_structure_file()
>>> parser.classify_chains()

__init__(structure, skip_chains=None)[source]¶

Initialize the ModelParser.

Parameters:

structure (Path | str) – Path to PDB or CIF file.
skip_chains (set[str] | list[str] | None) – Chain IDs to exclude (e.g. glycan/ligand chains whose per-atom token counts in the pLDDT/PAE arrays aren’t recoverable from the structure file alone).

parse_structure_file()[source]¶

Parse the structure file and extract atom/residue data.

Atoms in skip_chains are ignored. Chain order is preserved so that build_protein_token_indices() can later infer the token span of skipped chains by arithmetic.

Return type: None

build_protein_token_indices(total_tokens)[source]¶

Derive pLDDT/PAE indices for kept-chain anchor tokens.

Each kept chain contributes one token per residue; any run of consecutive skipped chains occupies the token span left over between kept chains. This is solvable only when skipped chains form at most one contiguous block in chain_order, which is the default layout emitted by Chai, Boltz, and AlphaFold (polymers first, then non-polymer chains).

Parameters: total_tokens (int) – Length of the pLDDT array (== PAE dim).
Raises: ValueError – If skipped chains appear in multiple non-contiguous runs, making per-run sizes ambiguous.
Return type: None
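The arithmetic described above can be sketched as a standalone function. The name token_indices, its arguments, and the check for a negative leftover span are hypothetical; the real method works on parser state instead of explicit arguments.

```python
def token_indices(chain_order, residues_per_chain, skip_chains, total_tokens):
    """Return pLDDT/PAE indices of kept-chain tokens, in chain order."""
    skip = set(skip_chains)

    # Count runs of consecutive skipped chains.
    runs, prev_skipped = 0, False
    for chain in chain_order:
        skipped = chain in skip
        if skipped and not prev_skipped:
            runs += 1
        prev_skipped = skipped
    if runs > 1:
        raise ValueError("skipped chains form multiple non-contiguous runs")

    # Whatever the kept chains do not account for belongs to the skipped run.
    kept = sum(residues_per_chain[c] for c in chain_order if c not in skip)
    skipped_span = total_tokens - kept
    if skipped_span < 0:
        raise ValueError("kept chains need more tokens than are available")

    indices, cursor, span_used = [], 0, False
    for chain in chain_order:
        if chain in skip:
            if not span_used:  # jump over the whole skipped block once
                cursor += skipped_span
                span_used = True
            continue
        n = residues_per_chain[chain]
        indices.extend(range(cursor, cursor + n))
        cursor += n
    return indices


trailing = token_indices(["A", "B", "L"], {"A": 3, "B": 2}, {"L"}, 9)
middle = token_indices(["A", "L", "B"], {"A": 3, "B": 2}, {"L"}, 9)
```

With two separated skipped runs, only the sum of their sizes is known, which is why the method raises instead of guessing.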

classify_chains()[source]¶

Classify chains as protein or nucleic acid.

Reads through residue data to assign chain identity based on whether nucleic acid residues are detected.

Return type: None
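A minimal sketch of this classification, assuming residue dictionaries carry 'chain_id' and 'residue_name' keys (names borrowed from the package_line parameters) and that a single nucleic acid residue is enough to label a chain. Both points are assumptions.

```python
NUCLEIC_ACIDS = frozenset({"A", "C", "DA", "DC", "DG", "DT", "G", "U"})


def classify(residues):
    """Map each chain ID to 'protein' or 'nucleic_acid'."""
    chain_types = {}
    for res in residues:
        chain = res["chain_id"]
        chain_types.setdefault(chain, "protein")  # default until proven otherwise
        if res["residue_name"] in NUCLEIC_ACIDS:
            chain_types[chain] = "nucleic_acid"
    return chain_types


residues = [
    {"chain_id": "A", "residue_name": "ALA"},
    {"chain_id": "A", "residue_name": "GLY"},
    {"chain_id": "B", "residue_name": "DA"},
    {"chain_id": "B", "residue_name": "DT"},
]
chain_types = classify(residues)
```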

NUCLEIC_ACIDS = frozenset({'A', 'C', 'DA', 'DC', 'DG', 'DT', 'G', 'U'})¶

STANDARD_RESIDUES = frozenset({'A', 'ALA', 'ARG', 'ASN', 'ASP', 'C', 'CYS', 'DA', 'DC', 'DG', 'DT', 'G', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'U', 'VAL'})¶

static parse_pdb_line(line, *args, **kwargs)[source]¶

Parse a single line of a PDB file.

Parameters:

line (str) – Line from the PDB file.
*args, **kwargs – Unused; accepted for API compatibility with parse_cif_line.

Return type:

dict[str, Any]

Returns:

Dictionary with atom/residue information.
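PDB ATOM records are fixed-width, so parsing amounts to slicing columns. The sketch below follows the standard PDB column layout; the function name, the returned keys, and the conversion of coordinates to float are illustrative and may differ from what parse_pdb_line returns.

```python
def parse_atom_record(line):
    """Slice an ATOM/HETATM record using standard PDB column positions."""
    return {
        "atom_num": line[6:11].strip(),       # columns 7-11
        "atom_name": line[12:16].strip(),     # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],                 # column 22
        "residue_id": line[22:26].strip(),    # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
    }


# Build a record field by field so every value lands in the right column.
example = (
    "ATOM  " + f"{1:>5}" + " " + " CA " + " " + "ALA" + " " + "A"
    + f"{1:>4}" + " " + "   " + f"{11.104:8.3f}{6.134:8.3f}{-6.504:8.3f}"
)
record = parse_atom_record(example)
```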

static parse_cif_line(line, fields, allow_missing_seq_id=False)[source]¶

Parse a single line of a CIF file.

Parameters:

line (str) – Line from the CIF file.
fields (dict[str, int]) – Dictionary mapping field names to column indices.
allow_missing_seq_id (bool) – If True, fall back to auth_seq_id when label_seq_id is ‘.’. Used for non-polymer residues (ligands, glycans) which lack a label_seq_id.

Return type:

dict[str, Any] | None

Returns:

Dictionary with atom/residue information, or None if residue_id is missing and fallback is disabled.
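The fallback behavior can be sketched as below. The whitespace split is a simplification (real mmCIF values may be quoted), the five-column row is a toy layout, and the function name and returned keys are hypothetical. The field names are standard atom_site items.

```python
def parse_cif_atom(line, fields, allow_missing_seq_id=False):
    """Return residue info, or None when label_seq_id is '.' and no fallback."""
    tokens = line.split()  # simplification: ignores quoted values
    residue_id = tokens[fields["label_seq_id"]]
    if residue_id == ".":
        if not allow_missing_seq_id:
            return None
        residue_id = tokens[fields["auth_seq_id"]]  # fallback for non-polymers
    return {
        "atom_name": tokens[fields["label_atom_id"]],
        "residue_name": tokens[fields["label_comp_id"]],
        "chain_id": tokens[fields["label_asym_id"]],
        "residue_id": residue_id,
    }


# Toy column layout; real files declare many more atom_site columns.
fields = {
    "label_atom_id": 0,
    "label_comp_id": 1,
    "label_asym_id": 2,
    "label_seq_id": 3,
    "auth_seq_id": 4,
}
ligand_line = "C1 NAG C . 501"
strict = parse_cif_atom(ligand_line, fields)
lenient = parse_cif_atom(ligand_line, fields, allow_missing_seq_id=True)
```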

static package_line(atom_num, atom_name, residue_name, chain_id, residue_id, x, y, z)[source]¶

Package parsed line data into a dictionary.

Parameters:

atom_num (str) – Atom index.
atom_name (str) – Atom name (e.g., ‘CA’, ‘CB’).
residue_name (str) – Residue name (e.g., ‘ALA’).
chain_id (str) – Chain identifier.
residue_id (str) – Residue sequence number.
x (str) – X coordinate as string.
y (str) – Y coordinate as string.
z (str) – Z coordinate as string.

Return type:

dict[str, Any]

Returns:

Dictionary containing parsed atom/residue data.