molecular_simulations.analysis.ipSAE module¶
Interface prediction Score from Aligned Errors (ipSAE) module.
This module computes interaction prediction scores from pLDDT and PAE data, adapted from https://doi.org/10.1101/2025.02.10.637595. Supports outputs from structure prediction tools like Boltz and AlphaFold.
- class molecular_simulations.analysis.ipSAE.ipSAE(structure_file, plddt_file, pae_file, out_path=None, skip_chains=None)[source]¶
Bases: object
Compute interaction prediction Score from Aligned Errors.
Computes various model quality scores including pDockQ, pDockQ2, LIS, ipTM, and ipSAE for structure predictions.
- parser¶
ModelParser instance for structure file.
- plddt_file¶
Path to pLDDT data file.
- pae_file¶
Path to PAE data file.
- path¶
Output directory path.
- scores¶
Polars DataFrame of computed scores after run().
- Parameters:
Example
>>> scorer = ipSAE("model.pdb", "plddt.npz", "pae.npz")
>>> scorer.run()
>>> print(scorer.scores)
- __init__(structure_file, plddt_file, pae_file, out_path=None, skip_chains=None)[source]¶
Initialize the ipSAE scorer.
- parse_structure_file()[source]¶
Parse the structure file and extract relevant details.
Runs the parser to read the structure file and classifies chains as protein or nucleic acid.
- Return type:
- prepare_scorer()[source]¶
Initialize the ScoreCalculator for computing scores.
Creates a ScoreCalculator instance with chain information extracted from the parsed structure.
- Return type:
- run()[source]¶
Execute the complete ipSAE scoring workflow.
Parses structure, computes distogram, loads pLDDT and PAE data, runs the scorer, and saves results.
- Return type:
- class molecular_simulations.analysis.ipSAE.ScoreCalculator(chains, chain_pair_type, n_residues, pdockq_cutoff=8.0, pae_cutoff=12.0)[source]¶
Bases: object
Calculate model quality scores from structure predictions.
Computes pDockQ, pDockQ2, LIS, ipTM, and ipSAE scores for all chain pairs in a structure.
- chains¶
Array of chain IDs for each residue.
- unique_chains¶
Unique chain IDs in the structure.
- chain_pair_type¶
Dictionary mapping chain ID to type.
- n_res¶
Array of residue types.
- permuted¶
List of all chain pairs to evaluate.
- scores¶
DataFrame of computed scores after compute_scores().
- Parameters:
chains (ndarray) – Array of chain IDs.
chain_pair_type (dict[str, str]) – Dictionary mapping chain ID to chain type (‘protein’ or ‘nucleic_acid’).
n_residues (int) – Number of residues per chain.
pdockq_cutoff (float) – Distance cutoff for pDockQ in Angstroms. Defaults to 8.0.
pae_cutoff (float) – PAE cutoff for ipSAE in Angstroms. Defaults to 12.0.
Example
>>> calc = ScoreCalculator(chains, chain_types, n_residues)
>>> calc.compute_scores(distances, plddt, pae)
>>> print(calc.scores)
- __init__(chains, chain_pair_type, n_residues, pdockq_cutoff=8.0, pae_cutoff=12.0)[source]¶
Initialize the ScoreCalculator.
- compute_scores(distances, pLDDT, PAE)[source]¶
Compute all scores for all chain pairs.
Calculates pDockQ, pDockQ2, LIS, ipTM, and ipSAE scores for each permutation of chain pairs.
- compute_pDockQ_scores(chain1, chain2)[source]¶
Compute pDockQ and pDockQ2 scores for a chain pair.
pDockQ depends solely on pLDDT, while pDockQ2 depends on both pLDDT and PAE.
- compute_LIS(chain1, chain2)[source]¶
Compute Local Interaction Score (LIS) for a chain pair.
LIS is based on the subset of predicted aligned error (PAE) values selected with a cutoff of 12 Å. Values lie in (0, 1], where 1 indicates perfect accuracy.
Adapted from: https://doi.org/10.1101/2024.02.19.580970
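A minimal sketch of an LIS-style score, not the module's own code. It assumes the rescaling (cutoff - PAE) / cutoff averaged over entries below the cutoff, which is how the cited preprint is understood here. The function name `local_interaction_score`, the `pae_block` argument (the chain1-by-chain2 slice of the PAE matrix), and the 0.0 fallback for an empty selection are hypothetical.

```python
import numpy as np


def local_interaction_score(pae_block, cutoff=12.0):
    """LIS-style score for one inter-chain PAE block (illustrative only)."""
    pae_block = np.asarray(pae_block, dtype=float)
    # Keep only the confident entries, i.e. PAE strictly below the cutoff.
    kept = pae_block[pae_block < cutoff]
    if kept.size == 0:
        # Assumption: no confident pairs means no detectable interaction.
        return 0.0
    # Rescale so PAE = 0 maps to 1 and PAE near the cutoff maps to ~0.
    return float(np.mean((cutoff - kept) / cutoff))


example = local_interaction_score([[0.0, 6.0], [12.0, 30.0]])
```

Only the entries 0.0 and 6.0 pass the cutoff in the example, so the result is the mean of 1.0 and 0.5.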
- compute_ipTM_ipSAE(chain1, chain2)[source]¶
Compute ipTM and ipSAE scores for a chain pair.
ipTM uses d0 based on total chain pair length and averages over all chain2 residues. ipSAE uses a per-residue d0 based on the count of chain2 residues with PAE below the cutoff for each aligned residue in chain1, averaging only over those valid residues.
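The difference between the two averages can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the helper names (`d0_from_length`, `ptm_term`, `iptm_ipsae_sketch`) are hypothetical, the `min_value` of 1.0 is a placeholder, the per-pair TM term 1 / (1 + (PAE / d0)^2) and the final maximum over chain1 residues are taken from the usual pTM convention rather than from this page.

```python
import numpy as np


def d0_from_length(length, min_value=1.0):
    """d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8); min_value is assumed."""
    if length <= 15:
        # The cube-root term is not positive here, so the floor applies.
        return min_value
    return max(min_value, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)


def ptm_term(pae, d0_value):
    """Per-pair TM-score-like term (assumed form)."""
    return 1.0 / (1.0 + (pae / d0_value) ** 2)


def iptm_ipsae_sketch(pae_block, pae_cutoff=12.0):
    """pae_block[i, j]: PAE for aligned residue i (chain1), scored residue j (chain2)."""
    pae_block = np.asarray(pae_block, dtype=float)
    len1, len2 = pae_block.shape

    # ipTM-like: one d0 from the total pair length, mean over ALL chain2 residues.
    iptm_per_res = ptm_term(pae_block, d0_from_length(len1 + len2)).mean(axis=1)

    # ipSAE-like: per-residue d0 from the count of chain2 residues under the
    # cutoff, mean over only those residues.
    ipsae_per_res = np.zeros(len1)
    for i, mask in enumerate(pae_block < pae_cutoff):
        n_valid = int(mask.sum())
        if n_valid > 0:
            d0_i = d0_from_length(n_valid)
            ipsae_per_res[i] = ptm_term(pae_block[i, mask], d0_i).mean()

    # Assumption: report the best aligned residue for each score.
    return float(iptm_per_res.max()), float(ipsae_per_res.max())
```

Because the second score ignores residues above the cutoff entirely, a chain pair with uniformly high PAE scores zero there while still receiving a small nonzero value from the first average.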
- get_max_values()[source]¶
Extract maximum scores for undirected chain pairs.
Because some scores like ipSAE are asymmetric (A->B != B->A), takes the maximum score for either direction as the undirected score.
- Return type:
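A minimal sketch of collapsing directed scores into undirected ones, assuming the directed scores are held in a plain dictionary keyed by `(chain1, chain2)`. The function name `undirected_max` and that dictionary layout are hypothetical; the module itself stores scores in a DataFrame.

```python
def undirected_max(directed_scores):
    """Map {(chain1, chain2): score} to {(a, b): max of both directions}."""
    best = {}
    for (chain1, chain2), score in directed_scores.items():
        # Sort the pair so A->B and B->A share one key.
        key = tuple(sorted((chain1, chain2)))
        best[key] = max(score, best.get(key, float("-inf")))
    return best


scores = {("A", "B"): 0.62, ("B", "A"): 0.48, ("A", "C"): 0.1, ("C", "A"): 0.3}
undirected = undirected_max(scores)
```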
- permute_chains()[source]¶
Generate all permutations of chain pairs.
Creates all unique ordered pairs of chains, excluding self-pairs.
- Return type:
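Ordered pairs without self-pairs are what `itertools.permutations` yields for length 2, so the idea can be sketched as below. The function name `ordered_chain_pairs` is hypothetical, and the sketch assumes the input is a per-residue sequence of chain IDs that must be de-duplicated first.

```python
from itertools import permutations


def ordered_chain_pairs(chain_ids):
    """Return every ordered pair of distinct chains, e.g. (A, B) and (B, A)."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(chain_ids))
    return list(permutations(unique, 2))


pairs = ordered_chain_pairs(["A", "A", "B", "C"])
```

For n chains this produces n * (n - 1) pairs, which is why asymmetric scores are computed once per direction.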
- static pDockQ_score(x)[source]¶
Compute pDockQ score.
Formula: pDockQ = 0.724 / (1 + exp(-0.052 * (x - 152.611))) + 0.018
Reference: https://doi.org/10.1038/s41467-022-28865-w
- static pDockQ2_score(x)[source]¶
Compute pDockQ2 score.
Formula: pDockQ2 = 1.31 / (1 + exp(-0.075 * (x - 84.733))) + 0.005
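Both pDockQ and pDockQ2 are logistic curves with different constants, so they can be sketched with one shared helper. The helper `logistic` and the lowercase function names are hypothetical; the constants are copied from the two formulas as documented, and `x` is whatever interface feature the caller supplies.

```python
import math


def logistic(x, scale, rate, midpoint, offset):
    """scale / (1 + exp(-rate * (x - midpoint))) + offset"""
    return scale / (1.0 + math.exp(-rate * (x - midpoint))) + offset


def pdockq_score(x):
    return logistic(x, scale=0.724, rate=0.052, midpoint=152.611, offset=0.018)


def pdockq2_score(x):
    return logistic(x, scale=1.31, rate=0.075, midpoint=84.733, offset=0.005)


# At the midpoint the logistic term is exactly half of `scale`.
half_pdockq = pdockq_score(152.611)
half_pdockq2 = pdockq2_score(84.733)
```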
- static compute_d0(L, pair_type)[source]¶
Compute d0 parameter for pTM calculation.
Formula: d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8)
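A minimal sketch of the d0 formula. The documented method takes `pair_type` and derives `min_value` from it; that mapping is not given on this page, so the sketch takes `min_value` directly, with 1.0 as a placeholder default. The function name `d0_from_length` and the explicit guard for L <= 15 are assumptions: the guard avoids raising a negative number to a fractional power, which yields a complex value in Python.

```python
def d0_from_length(L, min_value=1.0):
    """d0 = max(min_value, 1.24 * (L - 15)^(1/3) - 1.8)."""
    if L <= 15:
        # The cube-root term is not positive, so the expression falls below
        # any positive min_value and the floor applies.
        return min_value
    return max(min_value, 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8)


short_chain = d0_from_length(16)    # formula gives -0.56, floor applies
long_chain = d0_from_length(1015)   # 1.24 * 10 - 1.8, up to rounding
```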
- class molecular_simulations.analysis.ipSAE.ModelParser(structure, skip_chains=None)[source]¶
Bases: object
Parse structure files to extract residue and atom information.
Handles both PDB and CIF format files, extracting C-alpha, C-beta, and nucleic acid backbone atom coordinates.
- structure¶
Path to the structure file.
- protein_token_indices¶
pLDDT/PAE indices for kept-chain anchor tokens; populated by
build_protein_token_indices().
- residues¶
List of dictionaries containing residue information.
- chains¶
List of chain IDs for each residue.
- chain_types¶
Dictionary mapping chain ID to type after classify_chains().
Example
>>> parser = ModelParser("model.pdb")
>>> parser.parse_structure_file()
>>> parser.classify_chains()
Initialize the ModelParser.
- Parameters:
- parse_structure_file()[source]¶
Parse the structure file and extract atom/residue data.
Atoms in skip_chains are ignored. Chain order is preserved so that build_protein_token_indices() can later infer the token span of skipped chains by arithmetic.
- Return type:
- build_protein_token_indices(total_tokens)[source]¶
Derive pLDDT/PAE indices for kept-chain anchor tokens.
Each kept chain contributes one token per residue; any run of consecutive skipped chains occupies the token span left over between kept chains. The indices are solvable when skipped chains form at most one contiguous block in chain_order, which is the default layout emitted by Chai, Boltz, and AlphaFold (polymers first, then non-polymer chains).
- Parameters:
total_tokens (int) – Length of the pLDDT array (== PAE dim).
- Raises:
ValueError – If skipped chains appear in multiple non-contiguous runs, making per-run sizes ambiguous.
- Return type:
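The arithmetic can be sketched as a standalone function. This is an illustration, not the method's code: the name `kept_token_indices` and its arguments (`chain_order`, a `kept_lengths` dictionary of residue counts, and `skip_chains`) are hypothetical stand-ins for state the parser holds internally.

```python
def kept_token_indices(chain_order, kept_lengths, skip_chains, total_tokens):
    """Return pLDDT/PAE indices for residues of kept chains (illustrative)."""
    skip = set(skip_chains)

    # Count contiguous runs of skipped chains in file order.
    runs, previous_skipped = 0, False
    for chain in chain_order:
        skipped = chain in skip
        if skipped and not previous_skipped:
            runs += 1
        previous_skipped = skipped
    if runs > 1:
        raise ValueError("skipped chains form multiple runs; sizes are ambiguous")

    # Tokens not claimed by kept residues must belong to the single skipped run.
    gap = total_tokens - sum(kept_lengths[c] for c in chain_order if c not in skip)
    if gap < 0:
        raise ValueError("kept chains have more residues than total_tokens")

    indices, offset, gap_used = [], 0, False
    for chain in chain_order:
        if chain in skip:
            if not gap_used:
                offset += gap  # jump over the whole skipped block once
                gap_used = True
            continue
        n = kept_lengths[chain]
        indices.extend(range(offset, offset + n))
        offset += n
    return indices


idx = kept_token_indices(["A", "L", "B"], {"A": 3, "B": 2}, {"L"}, total_tokens=9)
```

In the example, chain L sits between A and B and must account for the four tokens that A and B do not, so B's residues land at indices 7 and 8.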
- classify_chains()[source]¶
Classify chains as protein or nucleic acid.
Reads through residue data to assign chain identity based on whether nucleic acid residues are detected.
- Return type:
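A minimal sketch of the classification rule, assuming each residue is a dictionary with `chain_id` and `residue_name` keys (hypothetical key names) and that one nucleic acid residue is enough to label a chain. The set mirrors the `NUCLEIC_ACIDS` attribute of this class, and the labels match the chain types documented for `ScoreCalculator`.

```python
NUCLEIC_ACIDS = frozenset({"A", "C", "G", "U", "DA", "DC", "DG", "DT"})


def classify_chains_sketch(residues):
    """Map each chain ID to 'protein' or 'nucleic_acid' (illustrative)."""
    chain_types = {}
    for residue in residues:
        chain = residue["chain_id"]
        # Default to protein until a nucleic acid residue is seen.
        chain_types.setdefault(chain, "protein")
        if residue["residue_name"] in NUCLEIC_ACIDS:
            chain_types[chain] = "nucleic_acid"
    return chain_types


example = classify_chains_sketch([
    {"chain_id": "A", "residue_name": "ALA"},
    {"chain_id": "A", "residue_name": "GLY"},
    {"chain_id": "B", "residue_name": "DA"},
])
```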
- NUCLEIC_ACIDS = frozenset({'A', 'C', 'DA', 'DC', 'DG', 'DT', 'G', 'U'})¶
- STANDARD_RESIDUES = frozenset({'A', 'ALA', 'ARG', 'ASN', 'ASP', 'C', 'CYS', 'DA', 'DC', 'DG', 'DT', 'G', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'U', 'VAL'})¶
- static parse_cif_line(line, fields, allow_missing_seq_id=False)[source]¶
Parse a single line of a CIF file.
- Parameters:
- Return type:
- Returns:
Dictionary with atom/residue information, or None if residue_id is missing and fallback is disabled.
- static package_line(atom_num, atom_name, residue_name, chain_id, residue_id, x, y, z)[source]¶
Package parsed line data into a dictionary.
- Parameters:
atom_num (str) – Atom index.
atom_name (str) – Atom name (e.g., ‘CA’, ‘CB’).
residue_name (str) – Residue name (e.g., ‘ALA’).
chain_id (str) – Chain identifier.
residue_id (str) – Residue sequence number.
x (str) – X coordinate as string.
y (str) – Y coordinate as string.
z (str) – Z coordinate as string.
- Return type:
- Returns:
Dictionary containing parsed atom/residue data.