arXiv:2601.11628v1 Announce Type: new
Abstract: Machine learning interatomic potentials (MLIPs) enable efficient modeling of molecular interactions with quantum mechanical (QM) accuracy. However, constructing robust and representative training datasets that capture subtle, system-specific interaction motifs remains challenging. We introduce PANIP (PAirwise Non-covalent Interaction Potential), an ensemble MLIP model built upon the NequIP framework and trained on non-covalent interactions (NCIs) between protein-derived fragments. PANIP is trained using an automated multi-fidelity active learning (MFAL) workflow, which distills a representative training subset, termed PDB-FRAGID (PDB Fragment Interaction Dataset), from an otherwise prohibitively large pool of fragment dimers extracted from the Protein Data Bank (PDB). PANIP retains ωB97X-D3BJ/def2-TZVPP-level accuracy and achieves mean absolute errors below 0.2 kcal/mol on out-of-distribution systems, demonstrating excellent transferability across diverse NCI motifs. Compared to the widely used ANI-2x potential, PANIP delivers substantially lower errors, particularly for charged and strongly interacting dimers. Coupled with a fragmentation-based energy decomposition scheme, PANIP estimates protein-ligand binding energies with QM-level accuracy at near force-field computational cost, enabling its use as a fragment-based scoring function that rivals specialized docking scoring functions.
