Supporting data: Domain-agnostic predictions of nanoscale interactions in proteins and nanoparticles

Saldinger, Jacob; Raymond, Matt; Elvati, Paolo; Violi, Angela

Work Description

Title: Supporting data: Domain-agnostic predictions of nanoscale interactions in proteins and nanoparticles Open Access Deposited

Attribute	Value
Methodology	A detailed explanation of the methods used to generate and process the data can be found in the article “From proteins to nanoparticles: domain-agnostic predictions of nanoscale interactions”, for which a preprint is available at https://www.biorxiv.org/content/10.1101/2022.08.09.503361v2 The data in this archive cover (a) all-atom and coarse-grained molecular dynamics simulations, (b) data used to train the machine learning models, and (c) the predictions for the systems shown in the paper.
Description	The accurate and rapid prediction of generic nanoscale interactions is a challenging problem with broad applications. Much of biology functions at the nanoscale, and our ability to manipulate materials and purposefully engage biological machinery requires knowledge of nano-bio interfaces. While several protein-protein interaction models are available, they leverage protein-specific information, limiting their abstraction to other structures. Here, we present NeCLAS, a general and rapid machine learning pipeline that predicts the location of nanoscale interactions, providing human-intelligible predictions. Two key aspects distinguish NeCLAS: coarse-grained representations, and the use of environmental features to encode the chemical neighborhood. We showcase NeCLAS with challenges for protein-protein, protein-nanoparticle and nanoparticle-nanoparticle systems, demonstrating that NeCLAS replicates computationally- and experimentally-observed interactions. NeCLAS outperforms current nanoscale prediction models, and it shows cross-domain validity, qualifying as a tool for basic research, rapid prototyping, and design of nanostructures. Software: - To reproduce the all-atom molecular dynamics (MD) simulations, NAMD is required (version 2.14 or later is suggested). NAMD software and documentation can be found at https://www.ks.uiuc.edu/Research/namd/ - To reproduce the coarse-grained MD simulations, LAMMPS is required (version 29 Sep 2021 - Update 2 or later is suggested). LAMMPS software and documentation can be found at https://www.lammps.org - To rebuild the free energy profiles, the PLUMED plugin (version 2.6) was used. PLUMED software and documentation can be found at https://www.plumed.org/ - To generate the force matching potentials, the OpenMSCG software was used. OpenMSCG software and documentation can be found at https://software.rcc.uchicago.edu/mscg/
Creator	Saldinger, Jacob; Raymond, Matt; Elvati, Paolo; Violi, Angela
Depositor	[email protected]
Contact information	[email protected]
Discipline	Science
Funding agency	Other Funding Agency
Other Funding agency	University of Michigan - Bluesky initiative
Keyword	Neural Networks; Proteins; Dimensionality Reduction; Nanoparticles; Coarse-Graining
Citations to related material	https://www.biorxiv.org/content/10.1101/2022.08.09.503361v2
Resource type	Dataset
Last modified	03/06/2025
Published	02/08/2023
Language	English; Python
DOI	https://doi.org/10.7302/58q6-0q88
License	http://creativecommons.org/licenses/by-nc/4.0/

To Cite this Work:
Saldinger, J., Raymond, M., Elvati, P., Violi, A. (2023). Supporting data: Domain-agnostic predictions of nanoscale interactions in proteins and nanoparticles [Data set], University of Michigan - Deep Blue Data. https://doi.org/10.7302/58q6-0q88

Relationships


This work is not a member of any user collections.

Files (Count: 16; Size: 423 MB)

Title	Original Upload	Last Modified	File Size	Access
README.txt	2023-02-08	2023-02-08	10.1 KB	Open Access
Dataset_pni_aa_cpxs.tar.gz	2023-02-02	2023-02-02	2.36 MB	Open Access
Dataset_pni_cg_cpxs.tar.gz	2023-02-08	2023-02-08	2.37 MB	Open Access
Dataset_pni_np_labels.tar.gz	2023-02-02	2023-02-02	1.46 MB	Open Access
Dataset_ppi_aa_cpxs.tar.gz	2023-02-02	2023-02-02	67.7 MB	Open Access
Dataset_ppi_cg_cpxs.tar.gz	2023-02-02	2023-02-02	66.5 MB	Open Access
Dataset_props_mol.tar.gz	2023-02-02	2023-02-02	6.82 MB	Open Access
Dataset_props_mol_aa.tar.gz	2023-02-02	2023-02-02	6.76 MB	Open Access
Dataset_metadata.zip	2023-02-02	2023-02-02	3.97 KB	Open Access
MD_AA_g3CHO.zip	2023-02-02	2023-02-02	3.17 MB	Open Access
MD_AA_g3OH.zip	2023-02-02	2023-02-02	29.7 MB	Open Access
MD_CG_6C-g3OH.zip	2023-02-02	2023-02-09	42.4 KB	Open Access
MD_CG_g3CHO.zip	2023-02-02	2023-02-02	83.9 KB	Open Access
MD_CG_g3OH.zip	2023-02-02	2023-02-02	88.6 KB	Open Access
Predictions_pni.zip	2023-02-02	2023-02-02	60.8 KB	Open Access
Predictions_ppi.zip	2023-02-02	2023-02-02	236 MB	Open Access

# Overview
Last Update: 2023/02/03

## Title
From proteins to nanoparticles: domain-agnostic predictions of nanoscale interactions

## Contributors
- Jacob Saldinger
- Matt Raymond
- Paolo Elvati ([email protected], preferred contact)
- Angela Violi ([email protected], alternate contact)

## Funding and Support
- This work was supported by the BlueSky Initiative ("Accelerating the response to biothreats: Machine learning as screening for antimicrobial", University of Michigan College of Engineering, PI: A. Violi).
- We acknowledge Advanced Research Computing, a division of Information and Technology Services at the University of Michigan, for computational resources and services provided for the research.

## Research Overview
The accurate and rapid prediction of generic nanoscale interactions is a challenging problem with broad applications. Much of biology functions at the nanoscale, and our ability to manipulate materials and engage biological machinery purposefully requires knowledge of nano/bio interfaces. While several protein-protein interaction models are available, they leverage protein-specific information, limiting their abstraction to other structures.
We present NeCLAS (Neural Coarse-graining with Location Agnostic Sets), a general and rapid machine learning pipeline that predicts the location of nanoscale interactions, providing human-intelligible predictions. Two key aspects distinguish NeCLAS: coarse-grained representations, and the use of environmental features to encode the chemical neighborhood. We tested NeCLAS predictions with challenges for protein-protein, protein-nanoparticle, and nanoparticle-nanoparticle systems, demonstrating that it replicates computationally- and experimentally-observed interactions. Tested on a curated dataset, NeCLAS outperforms current nanoscale prediction models for nanoparticles up to 10-20 nm and shows cross-domain validity.
These results show that our framework can contribute to both basic research and rapid prototyping/design of diverse nanostructures in nanobiotechnology.

## Links
This work is described in further detail in the following articles:
- (preprint) DOI: 10.1101/2022.08.09.503361 (bioRxiv)
- (full) DOI: TBD (Nature Computational Science)

The code is available on CodeOcean at the following links:
- https://codeocean.com/capsule/2149375/tree
- https://codeocean.com/capsule/8157811/tree
Both links point to the same code.

---
# Methods
A variety of techniques were used in this work.
Here we provide additional files that are not already available in the accompanying article or code repositories, specifically:
- Molecular Dynamics simulations
- Nanoparticle structures and properties
- NeCLAS predictions

## Methodology
- All-atom MD simulations were performed using NAMD, version 2.14. NAMD software and documentation can be found at https://www.ks.uiuc.edu/Research/namd/

- Coarse-grained simulations were performed using LAMMPS, version 29 Sep 2021 - Update 2. LAMMPS software and documentation can be found at https://www.lammps.org

- Enhanced sampling and free energy calculations were computed using the PLUMED plugin version 2.6. PLUMED software and documentation can be found at https://www.plumed.org/

- Force matching potentials were generated using OpenMSCG. OpenMSCG software and documentation can be found at https://software.rcc.uchicago.edu/mscg/

- Data analysis and data prediction were performed with NeCLAS.

---
# Files
The files are organized in 4 groups of archives:

**Datasets**: Contains the data used to train the ML models, as well as the pre-processed data for NeCLAS. There are 4 subgroups:

(A) **METADATA** Information about how the datasets are split, namely:
- **pipgcn_split.yaml**: Contains the train-test split detailed in the PIPGCN paper. The `train` and `test` keys indicate the train and test sets, respectively. Their values are the names of the complexes associated with the given split.
- **ppi_homo.yaml**: Contains a list of SCOP-homologous proteins for every protein in the dataset. Used in leave-one-homology-out tests. Each key is a protein-complex name, and each set of values is a list of proteins that are SCOP-homologous to the given protein.
- **pni_homo.csv**: Contains a list of proteins that are homologous to the proteins in the protein-nanoparticle complexes. There is only one column, so no column name is provided.
- **cpx_classes.csv**: Groups proteins into different classes, which are used to ensure a representative testing dataset. `prot` contains the protein names, `difficulty` ranks the proteins on a difficulty scale from 1 to 3 (as defined by the DBD database), and `family` bins each protein into one of three groups (enzyme, antibody, and other interactions).
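
The leave-one-homology-out use of **ppi_homo.yaml** can be sketched as follows. This is only an illustration, not the NeCLAS code: the function name and the complex names are hypothetical, and it assumes that the homolog lists use the same names as the keys. The mapping itself could be loaded with a YAML parser (for example PyYAML's `yaml.safe_load`).

```python
def leave_one_homology_out(homologs, test_complex):
    """Return the training complexes for one leave-one-homology-out fold.

    `homologs` mirrors the layout described for ppi_homo.yaml: each key is a
    complex name and each value lists the names homologous to it.
    (Assumption: values use the same naming scheme as the keys.)
    """
    # Drop the held-out complex together with everything homologous to it.
    excluded = {test_complex, *homologs.get(test_complex, [])}
    return sorted(name for name in homologs if name not in excluded)


# Placeholder names, not taken from the dataset.
example = {
    "cpxA": ["cpxB"],
    "cpxB": ["cpxA"],
    "cpxC": ["cpxD"],
    "cpxD": ["cpxC"],
}
print(leave_one_homology_out(example, "cpxA"))  # ['cpxC', 'cpxD']
```

Excluding homologs from the training set keeps the test complex from being predicted through a close relative seen during training.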

(B) **PNI** Protein-nanoparticle complexes. Divided into:
(1) **aa_cpxs** complexes that have been coarse-grained based on the amino acids,
(2) **cg_cpxs** complexes that have been coarse-grained using the neural gas method,
(3) **np_labels** labels that indicate whether a coarse-grained structure is part of the interaction interface.
In all cases, structures are labeled by the complex ID, the "l(eft)"/"r(ight)" nanoparticle (an arbitrary distinction to separate the two components), and whether the geometry is for the b(ound) or u(nbound) configuration.

> E.g., Dataset_pni_aa_cpxs.tar.gz/3CYU_r_b.pqr is the bound structure of the right nanoparticle in the 3CYU protein-nanoparticle complex, using the amino-acid-based coarse-graining method.
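
The naming convention above can be parsed with a short helper. This is a minimal sketch, assuming every structure file follows the `<complex ID>_<l|r>_<b|u>.<extension>` pattern suggested by the example; the class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class StructureName(NamedTuple):
    complex_id: str
    side: str   # "l" (left) or "r" (right)
    state: str  # "b" (bound) or "u" (unbound)


# Assumed pattern: <complex ID>_<l|r>_<b|u>.<extension>
_NAME = re.compile(r"^(?P<cid>[^_]+)_(?P<side>[lr])_(?P<state>[bu])\.\w+$")


def parse_structure_name(filename: str) -> StructureName:
    """Split a structure file name into complex ID, side, and state."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return StructureName(match["cid"], match["side"], match["state"])


print(parse_structure_name("3CYU_r_b.pqr"))
# StructureName(complex_id='3CYU', side='r', state='b')
```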

(C) **PPI** Protein-protein complexes. Divided into:
(1) **aa_cpxs** complexes that have been coarse-grained based on the amino acids,
(2) **cg_cpxs** complexes that have been coarse-grained using the neural gas method.
In all cases, structures are labeled by the complex ID, the "l(eft)"/"r(ight)" structure (an arbitrary distinction to separate the two components), and whether the geometry is for the b(ound) or u(nbound) configuration.

(D) **PROPS** Properties used to compute the coarse-grained features for NeCLAS. Divided into:
(1) **mol** Maps the pocket descriptors and hydrogen bond donor/acceptor. Note that this archive also contains properties for 1 PSM protein-protein complex and 2 nanoparticle-nanoparticle complexes.
(2) **mol_aa** Maps the pocket descriptors and hydrogen bond donor/acceptor.
In all cases, structures are labeled by the complex ID and the "l(eft)"/"r(ight)" nanoparticle (an arbitrary distinction to separate the two components), and the geometry is for the u(nbound) configuration.

**MD_AA**: Contains all-atom molecular dynamics simulation inputs and selected outputs. Documentation for the input and output formats can be found here:
(NAMD) https://www.ks.uiuc.edu/Research/namd/2.14/ug/node10.html
(PLUMED) https://www.plumed.org/doc-v2.6/user-doc/html/_syntax.html
(OpenMSCG) https://software.rcc.uchicago.edu/mscg/tutorials/lesson-01/README.html
The data is subdivided based on the simulated molecule:

(A) **g3CHO** g3 graphene quantum dot (GQD), decorated with aldehyde groups. The files allow performing a canonical simulation of a cluster of 4 GQDs using NAMD.

(B) **g3OH** g3 graphene quantum dot (GQD), decorated with hydroxyl groups. Contains simulation files for:
(1) **Cluster** files needed to perform a canonical simulation of a cluster of 5 GQDs using NAMD.
(2) **Force Matching** post-processing of the cluster simulation to obtain potentials via the force matching method.
(3) **Free Energy** reconstruction simulation using NAMD with the PLUMED plugin.

**MD_CG**: Contains coarse-grained (CG) molecular dynamics simulation inputs and selected outputs.
Documentation for the input and output formats can be found here:
(LAMMPS) https://docs.lammps.org/Run_head.html
The files are subdivided based on the simulated molecule:

(A) **6C-g3OH** g3 graphene quantum dot (GQD), decorated with hydroxyl groups and 6 cysteine groups, one for each edge.
(B) **g3CHO** g3 graphene quantum dot (GQD), decorated with aldehyde groups.
(C) **g3OH** g3 graphene quantum dot (GQD), decorated with hydroxyl groups.
All archives contain the molecular topology file and the input and log files for a minimization (min) and an equilibration (rel) simulation.

**Predictions**: Contains the NeCLAS interaction predictions for:

(A) **PNI** Protein-nanoparticle complexes. Divided in:
- **pw.csv**: Random seeds and AUC^comp for the pairwise NeCLAS model over 250 train-test iterations. `split_seed` indicates the random seed used to select the validation training set and downsampled samples. `tf_seed_i` indicates the seed used for training the i-th NeCLAS model (out of 10), and the model with the highest validation AUC is selected. The remaining columns each indicate the AUC for each complex, using the NeCLAS model that achieved the highest validation AUC.
- **npw.csv**: Random seeds and AUC^comp for the non-pairwise NeCLAS model over 250 train-test iterations. `split_seed` indicates the random seed used to select the validation training set and downsampled samples. `tf_seed_i` indicates the seed used for training the i-th NeCLAS model (out of 10), and the model with the highest validation AUC is selected. The remaining columns each indicate the AUC for each complex, using the NeCLAS model that achieved the highest validation AUC.
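
As an illustration of how these tables might be summarized, the sketch below averages the per-complex AUC over all iterations. It assumes that the seed columns are named `split_seed` and `tf_seed_<i>` and that every other column holds the AUC of one complex; the function name, the sample rows, and the complex names are made up for the example.

```python
import csv
import io
from statistics import mean


def mean_auc_per_complex(handle):
    """Average each per-complex AUC column over all train-test iterations."""
    per_complex = {}
    for row in csv.DictReader(handle):
        for column, value in row.items():
            # Skip the seed columns (assumed naming); keep the AUC columns.
            if column == "split_seed" or column.startswith("tf_seed_"):
                continue
            per_complex.setdefault(column, []).append(float(value))
    return {name: mean(values) for name, values in per_complex.items()}


# Two made-up rows with hypothetical column names, only to show the layout.
sample = io.StringIO(
    "split_seed,tf_seed_0,tf_seed_1,cpxA,cpxB\n"
    "1,11,12,0.50,1.00\n"
    "2,21,22,1.00,0.50\n"
)
print(mean_auc_per_complex(sample))  # {'cpxA': 0.75, 'cpxB': 0.75}
```

For the real files, the `io.StringIO` object would be replaced by an open file handle, e.g. `open("pw.csv", newline="")`.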

(B) **PPI** Protein-protein complexes, divided in:
- **pw_loocv.tar.gz**: Complex predictions for pairwise leave-one-out cross validation.
- **.csv**: Each CSV file contains the pairwise predictions and ground-truth labels for a single protein-protein interaction pair. The `l` and `r` columns denote which pair of coarse-grained subunits (from the left and right structures) are under consideration. `pred` indicates NeCLAS's prediction (between 0 and 1), and `y_true` indicates whether the two subunits interact.
- **pw_split.tar.gz**: Complex predictions for pairwise predictions using PIPGCN's split.
- **.csv**: Each CSV file contains the pairwise predictions and ground-truth labels for a single protein-protein interaction pair. The `l` and `r` columns denote which pair of coarse-grained subunits (from the left and right structures) are under consideration. `pred` indicates NeCLAS's prediction (between 0 and 1), and `y_true` indicates whether the two subunits interact.
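
A per-complex ROC AUC can be recomputed from one of these CSV files along the following lines. This is a sketch rather than the evaluation code used in the paper: it assumes that `y_true` is stored as 0/1, it uses the standard rank-sum (Mann-Whitney) formulation of the AUC, and the function names and sample rows are made up.

```python
import csv
import io


def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formula, ties averaged."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by tied scores
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_from_csv(handle):
    """Read the `pred` and `y_true` columns and return the ROC AUC."""
    rows = list(csv.DictReader(handle))
    labels = [int(float(row["y_true"])) for row in rows]  # assumes 0/1 labels
    scores = [float(row["pred"]) for row in rows]
    return roc_auc(labels, scores)


# Made-up rows that only mimic the documented column layout.
sample = io.StringIO(
    "l,r,pred,y_true\n"
    "0,0,0.9,1\n"
    "0,1,0.2,0\n"
    "1,0,0.6,1\n"
    "1,1,0.7,0\n"
)
print(auc_from_csv(sample))  # 0.75
```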
