HPC Deployment with Parsl
=========================

This tutorial covers deploying simulations on high-performance computing
clusters using Parsl for workflow management.

Overview
--------

molecular-simulations integrates with Parsl to enable:

* Running multiple simulation replicas in parallel
* Automatic job submission to PBS/SLURM schedulers
* GPU allocation across nodes
* Fault tolerance and checkpointing

Configuration Files
-------------------

Create a YAML configuration file for your cluster:

.. code-block:: yaml
   :caption: parsl_config.yaml

   # Local workstation with multiple GPUs
   executor: ThreadPoolExecutor
   max_workers: 4

   # PBS cluster example
   # executor: HighThroughputExecutor
   # provider: PBSProProvider
   # account: "myproject"
   # queue: "prod"
   # walltime: "24:00:00"
   # nodes_per_block: 1
   # available_accelerators: 4

Using LocalSettings
-------------------

For local workstations or simple clusters:

.. code-block:: python

   import parsl
   from molecular_simulations.simulate import LocalSettings
   from pathlib import Path

   # Load configuration
   settings = LocalSettings.from_yaml("parsl_config.yaml")
   config = settings.config_factory("/path/to/run_dir")
   parsl.load(config)

   # Define the simulation app
   @parsl.python_app
   def run_md(path: str, steps: int = 25_000_000):
       # Import inside the app so the name resolves on the worker
       from molecular_simulations.simulate import Simulator

       Simulator(path, prod_steps=steps).run()
       return path

   # Submit jobs for all replicas
   replica_dirs = list(Path("./").glob("replica_*"))
   futures = [run_md(str(p)) for p in replica_dirs]

   # Wait for completion
   results = [f.result() for f in futures]
   print(f"Completed {len(results)} simulations")

Using PolarisSettings
---------------------

For the ALCF Polaris supercomputer:

.. code-block:: python

   from molecular_simulations.simulate import PolarisSettings

   settings = PolarisSettings(
       account="myproject",
       queue="prod",
       walltime="12:00:00",
       nodes_per_block=10,
       worker_init="module load cudatoolkit; source activate molsim",
   )
   config = settings.config_factory("/path/to/run_dir")

Best Practices
--------------

**Organize replica directories**
   Use a consistent naming scheme like ``replica_001/``, ``replica_002/``, etc.

**Set appropriate walltime**
   Estimate based on system size and simulation length, and add a buffer for
   equilibration and I/O. Do not request more than your resource allows, as
   this will crash Parsl with no discernible error.

**Use checkpointing**
   For long simulations, configure periodic checkpoint saving so that runs
   can restart after a failure.

Troubleshooting
---------------

**Jobs fail immediately**
   Check that the ``worker_init`` script correctly loads all required modules
   and activates the conda/virtual environment. Make sure every path is
   either absolute or, if relative, correct. Remember that the ``run_dir``
   argument to the config factory is where Parsl will sit in the job, so
   relative paths are resolved against that location.

**OpenBLAS threading errors**
   Set ``OMP_NUM_THREADS=1`` in ``worker_init`` to avoid conflicts with
   MMPBSA's internal threading.

**Errors do not propagate into runtime logs**
   Check the Parsl output logs. Sometimes errors appear in ``parsl.log``, but
   more often you will need to look at the error stream located at
   ``00*/submit_scripts/parsl.*.sh.err``. Most Python tracebacks end up here.

**I have tried everything and now I hate Parsl**
   Yes, Parsl can be challenging to debug at times. My best advice if all
   else fails is to try running the code in serial to see what bugs arise.
   Nearly every time I am convinced Parsl is to blame, it is actually just a
   bug in the code, an unstable system, etc.
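
When tracking down errors that never reach the runtime logs, it can help to
collect the tail of every error stream in one pass. The following is a minimal
sketch, not part of molecular-simulations or Parsl: ``collect_error_tails`` is
a hypothetical helper, and it assumes the ``00*/submit_scripts/parsl.*.sh.err``
pattern mentioned above resolves relative to the directory you pass in (adjust
the root if your Parsl logs live elsewhere).

.. code-block:: python

   from pathlib import Path


   def collect_error_tails(log_root, n_lines=20):
       """Map each non-empty Parsl error stream to its last ``n_lines`` lines.

       ``log_root`` is assumed to be the directory that contains the
       ``00*/submit_scripts`` folders written by Parsl.
       """
       tails = {}
       pattern = "00*/submit_scripts/parsl.*.sh.err"
       for err_file in sorted(Path(log_root).glob(pattern)):
           # Replace undecodable bytes so a corrupt log cannot stop the scan
           text = err_file.read_text(errors="replace")
           if text.strip():
               tails[str(err_file)] = "\n".join(text.splitlines()[-n_lines:])
       return tails

Printing each entry of ``collect_error_tails("/path/to/run_dir")`` shows the
most recent traceback from every submit script; empty error files are skipped.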