# Overview
Last Update: 2023/02/03

## Title
From proteins to nanoparticles: domain-agnostic predictions of nanoscale interactions

## Contributors
- Jacob Saldinger
- Matt Raymond
- Paolo Elvati (elvati@umich.edu, preferred contact)
- Angela Violi (avioli@umich.edu, alternate contact)

## Funding and Support
- This work was supported by the BlueSky Initiative ("Accelerating the response to biothreats: Machine learning as screening for antimicrobial", University of Michigan College of Engineering, PI: A. Violi).
- We acknowledge Advanced Research Computing, a division of Information and Technology Services at the University of Michigan, for computational resources and services provided for the research.

## Research Overview:
The accurate and rapid prediction of generic nanoscale interactions is a challenging problem with broad applications. Much of biology functions at the nanoscale, and our ability to manipulate materials and engage biological machinery purposefully requires knowledge of nano/bio interfaces. While several protein-protein interaction models are available, they leverage protein-specific information, limiting their abstraction to other structures.
We present NeCLAS (Neural Coarse-graining with Location Agnostic Sets), a general and rapid machine learning pipeline that predicts the location of nanoscale interactions, providing human-intelligible predictions. Two key aspects distinguish NeCLAS: coarse-grained representations and the use of environmental features to encode the chemical neighborhood. We tested NeCLAS predictions with challenges for protein-protein, protein-nanoparticle, and nanoparticle-nanoparticle systems, demonstrating that it replicates computationally and experimentally observed interactions. Tested on a curated dataset, NeCLAS outperforms current nanoscale prediction models for nanoparticles up to 10-20 nm and shows cross-domain validity.
These results show that our framework can contribute to both basic research and rapid prototyping/design of diverse nanostructures in nanobiotechnology.

## Links
This work is described in further detail in the following articles:
- (preprint) DOI: 10.1101/2022.08.09.503361 (bioRxiv)
- (full) DOI: TBD (Nature Computational Science)

The code is available on CodeOcean at the following links:
- https://codeocean.com/capsule/2149375/tree
- https://codeocean.com/capsule/8157811/tree
Both links point to the same code.



---
# Methods
In this work, a variety of techniques were used.
Here we provide additional files that are not already available in the accompanying article or code repositories, specifically:
- Molecular Dynamics simulations
- Nanoparticle structures and properties
- NeCLAS predictions



## Methodology
- All-atom MD simulations were performed using NAMD, version 2.14. NAMD software and documentation can be found at https://www.ks.uiuc.edu/Research/namd/

- Coarse-grained simulations were performed using LAMMPS, version 29 Sep 2021 - Update 2. LAMMPS software and documentation can be found at https://www.lammps.org

- Enhanced sampling and free energy calculations were computed using the PLUMED plugin version 2.6. PLUMED software and documentation can be found at https://www.plumed.org/

- Force matching potentials were generated using OpenMSCG. OpenMSCG software and documentation can be found at https://software.rcc.uchicago.edu/mscg/ 

- Data analysis and data prediction were performed with NeCLAS



---
# Files
The files are organized in 4 groups of archives:


**Datasets**: Contains the data used to train the ML models, as well as the pre-processed data for NeCLAS. There are 4 subgroups:

(A) **METADATA** Information about how the datasets are split, namely:
- **pipgcn_split.yaml**: Contains the train-test split detailed in the PIPGCN paper. The `train` and `test` keys indicate the train and test sets, respectively. Their values are the names of the complexes associated with the given split.
- **ppi_homo.yaml**: Contains a list of SCOP-homologous proteins for every protein in the dataset. Used in leave-one-homology-out tests. Each key is a protein-complex name, and each set of values is a list of proteins that are SCOP-homologous to the given protein.
- **pni_homo.csv**: Contains a list of proteins that are homologous to the proteins in the protein-nanoparticle complexes. There is only one column, so no column name is provided.
- **cpx_classes.csv**: Groups proteins into different classes, which are used to ensure a representative testing dataset. `prot` contains the protein names, `difficulty` ranks the proteins on a difficulty scale from 1 to 3 (as defined by the DBD database), and `family` bins each protein into one of three groups (enzyme, antibody, and other interactions).
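
As an illustration of how the class information could be used, the following minimal sketch groups proteins by `family` and `difficulty`. It assumes that `cpx_classes.csv` has a header row with the three column names listed above; the rows in `SAMPLE` and the spelling of the family values are hypothetical placeholders, not data from the archive.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; real values are in cpx_classes.csv.
SAMPLE = """prot,difficulty,family
cpx_a,1,enzyme
cpx_b,1,enzyme
cpx_c,3,antibody
cpx_d,2,other
"""


def group_by_class(handle):
    """Group protein names by their (family, difficulty) class.

    Assumes a header row with the columns `prot`, `difficulty`, `family`.
    Values are kept as strings exactly as they appear in the file.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[(row["family"], row["difficulty"])].append(row["prot"])
    return dict(groups)


groups = group_by_class(io.StringIO(SAMPLE))
for key in sorted(groups):
    print(key, groups[key])
```

To read the actual file, pass `open("cpx_classes.csv", newline="")` instead of the `io.StringIO` object.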

(B) **PNI** Protein-nanoparticle complexes. Divided in:
(1) **aa_cpxs** complexes that have been coarse-grained based on the amino acids,
(2) **cg_cpxs** complexes that have been coarse-grained using the neural gas method,
(3) **np_labels** labels that indicate if a coarse-grained structure is part of the interaction interface.
In all cases, structures are labeled by the complex ID, the "l(eft)"/"r(ight)" nanoparticle (an arbitrary distinction to separate the two components), and whether the geometry is for the b(ound) or u(nbound) configuration.

> E.g., Dataset_pni_aa_cpxs.tar.gz/3CYU_r_b.pqr is the bound structure of the right nanoparticle in the 3CYU protein-nanoparticle complex, coarse-grained with the amino-acid-based method.
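
The naming convention above can be parsed programmatically. The sketch below is one possible way to do it, assuming every structure file follows the `<complex ID>_<l|r>_<b|u>.<extension>` pattern; the `StructureName` type and the `parse_structure_name` helper are hypothetical names introduced here for illustration.

```python
import re
from typing import NamedTuple


class StructureName(NamedTuple):
    complex_id: str  # e.g. "3CYU"
    side: str        # "l" (left) or "r" (right)
    state: str       # "b" (bound) or "u" (unbound)


# Assumed pattern: <complex ID>_<l|r>_<b|u>.<extension>
_PATTERN = re.compile(r"^(?P<cid>.+)_(?P<side>[lr])_(?P<state>[bu])\.\w+$")


def parse_structure_name(filename: str) -> StructureName:
    """Split a structure file name into complex ID, side, and state."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected structure file name: {filename!r}")
    return StructureName(match["cid"], match["side"], match["state"])


print(parse_structure_name("3CYU_r_b.pqr"))
```

File names that do not match the assumed pattern raise a `ValueError` rather than being silently mislabeled.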

(C) **PPI** Protein-protein complexes, divided in:
(1) **aa_cpxs** complexes that have been coarse-grained based on the amino acids,
(2) **cg_cpxs** complexes that have been coarse-grained using the neural gas method,
In all cases, structures are labeled by the complex ID, the "l(eft)"/"r(ight)" nanoparticle (an arbitrary distinction to separate the two components), and whether the geometry is for the b(ound) or u(nbound) configuration.

(D) **PROPS** Properties used to compute the coarse-grained features for NeCLAS. Divided in:
(1) **mol** Maps the pocket descriptors and hydrogen bond donors/acceptors. Note that this archive also contains properties for 1 PSM protein-protein complex and 2 nanoparticle-nanoparticle complexes.
(2) **mol_aa** Maps the pocket descriptors and hydrogen bond donors/acceptors.
In all cases, structures are labeled by the complex ID and the "l(eft)"/"r(ight)" nanoparticle (an arbitrary distinction to separate the two components); the geometry is for the u(nbound) configuration.



**MD_AA**: Contains all-atom molecular dynamics simulation inputs and selected outputs. Documentation for the input and output formats can be found here:
(NAMD) https://www.ks.uiuc.edu/Research/namd/2.14/ug/node10.html
(PLUMED) https://www.plumed.org/doc-v2.6/user-doc/html/_syntax.html
(OpenMSCG) https://software.rcc.uchicago.edu/mscg/tutorials/lesson-01/README.html
The data is subdivided based on the simulated molecule:

(A) **g3CHO** g3 graphene quantum dot (GQD), decorated with aldehyde groups. The files allow performing a canonical simulation of a cluster of 4 GQDs using NAMD.

(B) **g3OH** g3 graphene quantum dot (GQD), decorated with hydroxyl groups. Contains simulation files for:
(1) **Cluster** Files needed to perform a canonical simulation of a cluster of 5 GQDs using NAMD.
(2) **Force Matching** Post-processing of the cluster simulation to obtain potentials via the force matching method.
(3) **Free Energy** Free energy reconstruction simulation using NAMD with the PLUMED plugin.



**MD_CG**: Contains coarse-grained (CG) molecular dynamics simulation inputs and selected outputs.
Documentation for the input and output formats can be found here: 
(LAMMPS) https://docs.lammps.org/Run_head.html
The files are subdivided based on the simulated molecule:

(A) **6C-g3OH** g3 graphene quantum dot (GQD), decorated with hydroxyl groups and 6 cysteine groups, one for each edge.
(B) **g3CHO** g3 graphene quantum dot (GQD), decorated with aldehyde groups.
(C) **g3OH** g3 graphene quantum dot (GQD), decorated with hydroxyl groups.
All archives contain the molecular topology file and the input and log files for a minimization (min) and an equilibration (rel) simulation.



**Predictions**: Contains the NeCLAS interaction predictions for:

(A) **PNI** Protein-nanoparticle complexes. Divided in:
- **pw.csv**: Random seeds and AUC^comp for the pairwise NeCLAS model over 250 train-test iterations. `split_seed` indicates the random seed used to select the validation training set and downsampled samples. `tf_seed_i` indicates the seed used for training the i-th NeCLAS model (out of 10), and the model with the highest validation AUC is selected. Each remaining column reports the AUC for one complex, using the NeCLAS model that achieved the highest validation AUC.
- **npw.csv**: Random seeds and AUC^comp for the non-pairwise NeCLAS model over 250 train-test iterations. `split_seed` indicates the random seed used to select the validation training set and downsampled samples. `tf_seed_i` indicates the seed used for training the i-th NeCLAS model (out of 10), and the model with the highest validation AUC is selected. Each remaining column reports the AUC for one complex, using the NeCLAS model that achieved the highest validation AUC.
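
The per-complex AUC columns can be summarized across the train-test iterations. The following is a minimal sketch that averages each complex's AUC, assuming a header row in which every column other than `split_seed` and those beginning with `tf_seed_` is a complex name. The rows in `SAMPLE` and the exact seed column names are hypothetical.

```python
import csv
import io
import statistics

# Hypothetical rows for illustration only; real values are in pw.csv / npw.csv.
SAMPLE = """split_seed,tf_seed_0,tf_seed_1,cpx_a,cpx_b
1,11,12,0.80,0.60
2,21,22,0.90,0.70
"""


def mean_auc_per_complex(handle):
    """Average the AUC of each complex over all train-test iterations (rows)."""
    reader = csv.DictReader(handle)
    # Assumption: any column that is not a seed column holds a per-complex AUC.
    complexes = [
        name for name in reader.fieldnames
        if name != "split_seed" and not name.startswith("tf_seed_")
    ]
    values = {name: [] for name in complexes}
    for row in reader:
        for name in complexes:
            values[name].append(float(row[name]))
    return {name: statistics.mean(aucs) for name, aucs in values.items()}


means = mean_auc_per_complex(io.StringIO(SAMPLE))
for name, auc in means.items():
    print(f"{name}: {auc:.3f}")
```

The same function applies to both files, since `pw.csv` and `npw.csv` are described with the same layout.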

(B) **PPI** Protein-protein complexes, divided in:
- **pw_loocv.tar.gz**: Complex predictions for pairwise leave-one-out cross-validation.
  - **<complex_id>.csv**: Each CSV file contains the pairwise predictions and ground-truth labels for a single protein-protein interaction pair. The `l` and `r` columns denote which pair of coarse-grained subunits (from the left and right structures) are under consideration. `pred` indicates NeCLAS's prediction (between 0 and 1), and `y_true` indicates whether the two subunits interact.
- **pw_split.tar.gz**: Complex predictions for pairwise predictions using PIPGCN's split.
  - **<complex_id>.csv**: Each CSV file contains the pairwise predictions and ground-truth labels for a single protein-protein interaction pair. The `l` and `r` columns denote which pair of coarse-grained subunits (from the left and right structures) are under consideration. `pred` indicates NeCLAS's prediction (between 0 and 1), and `y_true` indicates whether the two subunits interact.
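
A per-complex prediction file can be scored directly from its `pred` and `y_true` columns. The sketch below computes a plain ROC AUC with the standard library only; it is not necessarily the exact AUC^comp procedure used in the article. It assumes a header row with the four columns described above and that `y_true` is encoded as 0/1. The rows in `SAMPLE` are hypothetical.

```python
import csv
import io

# Hypothetical rows for illustration only; real values are in <complex_id>.csv.
SAMPLE = """l,r,pred,y_true
0,0,0.9,1
0,1,0.2,0
1,0,0.6,1
1,1,0.6,0
"""


def load_predictions(handle):
    """Return (scores, labels) from a per-complex prediction CSV."""
    scores, labels = [], []
    for row in csv.DictReader(handle):
        scores.append(float(row["pred"]))
        # Assumption: y_true is stored as 0/1 (possibly written as 0.0/1.0).
        labels.append(int(float(row["y_true"])))
    return scores, labels


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise form is quadratic in the number of
    samples, so a rank-based implementation is preferable for large files.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


scores, labels = load_predictions(io.StringIO(SAMPLE))
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```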