Atlas

The American Heart Association Protein Binding Atlas Portal is a database of in-silico calculations of protein-ligand interactions, as well as supplemental information for proteins and ligands (for example, “molecular descriptors”).

By providing a large set of predicted human protein-ligand interactions – with a focus on heart-related proteins – this Atlas is intended to provide insight into drug-candidate molecules with regard to both efficacy (on-target interactions) and safety (off-target interactions).

This initial version of the Atlas is based on approximately 11,200 human proteins represented by approximately 13,000 structural models. These structural models are those for which high-quality structures are either available or able to be created by homology modeling and can be optimized to a minimum-energy state through molecular dynamics computer simulation methods.

Interaction calculations – which are in progress, and thus are available with varying degrees of completeness – are provided for an initial set of ligands: Federal Drug Administration-approved drugs, clinical trial and pre-clinical drugs, and compounds from the ChEMBL, Marine metabolite and Foodome datasets.

The protein binding pockets were clustered to identify closely related protein targets and compound cluster associations. See Clustering Protein Binding Pockets and Identifying Protein Drug Interactions: A Novel Ligand-based Featurization Method.

Molecular docking

We use our in-house ConveyorLC program toolchain to automate the docking and rescoring of compounds against each binding site. The ConveyorLC toolchain comprises four parallel programs for protein preparation (CDT1Receptor), ligand preparation (CDT2Ligand), molecular docking (CDT3Docking), and Molecular Mechanics/Generalized Born-Solvent Accessible Surface Area (MM/GBSA) rescoring (CDT4mmgbsa).

The toolchain depends on a number of external libraries, including the Message Passing Interface library, the C++ Boost library, the Conduit library, the HDF5 library and several molecular simulation packages, including Autodock Vina , the AMBER molecular simulation package and MGLTOOLS. Computational results are aggregated and saved in a series of HDF5 files. A few auxiliary tools are included in the toolchain to query and extract data in the HDF5 files.

Over 2 million compounds were selected from two publicly available compound libraries for docking. FDA-approved, clinical trial, and pre-clinical drugs were assembled from the Broad Drug Repurposing Hub.. And about 2 million unique compounds (for example, not including counterions) were used from ChEMBL. Marine and Foodome compounds were evaluated.

The CDT3Docking in ConveyorLC toolchain is based on Autodock Vina (Version 1.1.2) and uses MPI and multithreading hybrid parallel scheme. The docking grids of the binding sites were determined by the protein preparation program in the toolchain. Compounds were prepared for docking in the following manner. SMILES strings and 2D SDF structures were imported into the Molecular Operating Environment (MOE) for removal of salts and metal-containing ligands, protonation states were set to the dominant form at pH 7, 3D structures were created and minimized, and relevant MOE descriptors were calculated.

The final structures were exported from MOE as SDF files. These structures were then further processed by the ligand preparation in the toolchain by utilizing antechamber and the GAFF force field from the AMBER13 simulation package.

Human Proteome Structures

Docking and GBSA energy calculations depend on 3D structural models of the proteins involved. Each of the Atlas’s protein models represents exactly the amino-acid sequence provided in UniProt’s “Reference proteome”. Homology models are created to match each reference sequence using LLNL’s PDBspheres and AS2TS software packages.

Even in cases where a complete structure for a particular protein is available in the Protein Data Bank, there may be amino-acid substitutions between the protein structure and the UniProt reference sequence. In such a case the homology model makes the necessary substitutions and structural refinements to match the UniProt reference sequence. The PDBbind data has been pulled from the 2017 version.

In many cases, no structural model is available currently for a human protein. In these cases the AS2TS software is used to search PDB for proteins having sequence and structural similarities to each human protein of interest. These are used as “templates” to construct a homology model for the human protein.

In fact, many such models, using different templates, are constructed; the best such models are selected based on various quality metrics such as sequence similarity between the UniProt reference sequence and the template sequence, the accuracy (“resolution”) of the template structure, and the fraction of the reference sequence that has sequence similarity to the template sequence (“coverage”).

Homology model structures are given in “biological assembly” form – that is, the monomeric or multimeric conformation that is known or believed to be the functional form of a given molecule. The homology model biological assembly derives from the biological assembly of its structural templates in the PDB biological assembly coordinate files.

The initial homology modeling effort produced 11,681 "high-quality" structural models, where, for a particular UniProt reference protein sequence, up to two structural models (based on different templates) are selected.

The protein homology models are then processed in an energy-minimization calculation. The energy-minimization calculation (1) improves the docking calculation results and (2) is required for GBSA energy calculations. The energy-minimization calculation refines the homology model structure, using atomistic molecular dynamics force-fields and, in most cases, an “implicit solvent” calculation (where water is represented by a continuous medium with a dielectric constant rather than by individual molecules).

The calculation “relaxes” the protein’s conformation to a low-energy state. In some cases – where the implicit-solvent calculation failed to produce a result – an explicit-solvent calculation proved successful. The initial effort successfully completed these calculations for approximately 11,353 of the 11,681 structural models.

Single-point GBSA Rescoring

Protein-compound complexes were rescored using CDT4mmgbsa in the ConveyorLC toolchain. A total of ~95 million poses were rescored for binding sites. Each complex typically has 10 docking poses. CDT4mmgbsa employs a master worker parallel scheme in which the master is in charge of job dispatching to workers. Then each worker performs an MM/GBSA calculation using the AMBER sander program.

The AMBER force field (amberff14SB) was used for the proteins; the apo proteins’ MM/GBSA energies were previously determined in the CDT1Receptor step. Partial atomic charges for the compounds were computed by antechamber using the AM1-BCC method; each compound’s charges were previously calculated by the CDT2Ligand step. An energy minimization – 1,000 steps of steepest descent and 1,000 additional steps of conjugate gradient – was performed on each docked compound-protein complex using the modified generalized Born model of Onufriev, Bashford, and Case with model 2 radii (igb=5) with a nonbonded cutoff of 25 Å.

The MM/GBSA energy of the minimized complex structure was calculated using an infinite cutoff (999 Å) and a protein dielectric constant of 4. The binding affinity was computed by MM/GBSA energy of the complex less the sum of the MM/GBSA energies of the apo protein and the compound.

“Fusion” machine learning model

The Fusion models for Atomic and molecular STructures (FAST) predict the binding affinity of a protein and ligand by combining the predictions of a 3D convolution neural network model (3D-CNN) and a spatial-graph convolutional neural network (SG-CNN) model of the protein-ligand complex.

These FAST models were trained on the PDBbind database of protein-ligand solved structures accompanied by experimentally-determined binding affinity values. See Improved Protein-ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference. SC21 Technical Program High-Throughput Virtual Screening of Small Molecule Inhibitors for SARS-CoV-2 Protein Targets with Deep Fusion Models.

This modeling approach learns binding interaction rules directly from atomic structure representations without the need for hand curation of features intended to capture the mechanism of binding. The 3D-CNN models atoms and their features in a 3D voxel representation. The 3D-CNN representation implicitly models pairwise relationships between atoms through the relative positioning of atoms in a 3D voxel grid but does not predetermine which atomic interactions to represent other than defining a minimum atomic resolution.

This comes at the cost of having to learn a larger number of parameters to represent the grid. In contrast the SG-CNN uses an explicit distance threshold to determine which pairs of atoms to consider in pairwise interactions, with the potential benefit of using fewer parameters in the model.

Fusion models combine different input modalities and learn their feature representations together. Several fusion models have been proposed for the task of video activity recognition by bridging the dimension difference among multiple input modalities (e.g., visual and temporal data). Our fusion model combines the independently trained 3D-CNN and Spatial Graph CNN models.

Data pre-processing and feature extraction. Protein-ligand complex charges and protonation states are solved using UCSF Chimera with AMBER ff14SB for standard residues and AM1-BCC for non-standard residues.

Feature extraction. A common atomic representation was used for input to the structure-based deep learning models. The representation captures the properties of a typical organic molecule (proteins and ligands) and includes:

Element type: one-hot encoding of B, C, N, O, P, S, Se, halogen or metal (9 bits)
Atom Hybridization (1, 2, or 3) (1 integer)
Number of heavy atom bonds (i.e. heavy valence) (1 integer)
Number of bonds with other heteroatoms (i.e. hetero valence) (1 integer)
Structural properties: bit vector (1 where present) encoding of hydrophobic, aromatic, acceptor, donor, ring (5 bits)
Partial Charge (1 float)
Molecule type to indicate protein atom versus ligand atom (-1 for protein, 1 for ligand) (1 integer)
Van der Waals radius (1 float)

The OpenBabel cheminformatics tool (version 2.4.1) was used to extract the features for all binding complexes. Finally, all coordinates are centered by each ligand to produce the spatial representation of the binding complex.

Data. The PDBbind 2016 “general” set (13,308 complexes), including the 4057 “refined” set (4,057 complexes), was used as training date. The “core” set (290 complexes) was held out for testing. The training set was split for a 90/10 training/validation regimen.

3D-CNN implementation. The input volume dimension is N × N × N × C where N is the voxel grid size in each axis (48 in our experiment) and C is the number of atomic features. The volume length in each dimension is approximately 48Å where each voxel size is 1Å, which is sufficient to cover the entire pocket region while minimizing the collisions between atoms.

Each atom is assigned to at least one voxel or more, depending on its Van der Waals radius. In the case of collisions between atoms, we apply element-wise addition to the atom features. Once all atoms are voxelized, Gaussian blur with σ = 1 is applied in order to populate the atom features into neighboring voxels.

SG-CNN implementation. A straightforward definition of a molecular graph considers atoms as nodes and bonds as edges between the nodes. Such a 2D representation may be suitable for the modeling of chemical graphs; however it fails to capture non-covalent interactions (e.g., electrostatic interactions between the ligand and the protein binding pocket) that are necessary to model more complex biological structures.

Atomic Convolutional Neural Networks (ACNNs) allow for the specification of a “local” neighborhood for each atom that is based on Euclidean distance, effectively relaxing the covalency requirement to form edges in the graph representation of a protein-ligand binding complex. We use the PotentialNet architecture, which refines this idea by allowing for an arbitrary number of edge types and applies specialized update rules that are a generalization of Gated Graph Sequence Neural Networks.

Fusion model. Models to combine multiple input sources or different feature representations have been applied to a number of computer vision applications, especially in the presence of multi-modal images or different image sensors. These fusion models benefit from multiple feature representations which are considered complementary to each other.

We use a separate fusion neural network to combine feature representations from two independently trained models (3D-CNN and SG-CNN), each of which has its own strengths and weaknesses. The heterogeneous feature representations that the two models capture can enrich the fusion model’s features.

Among several ways to fuse models, we adopt midlevel to late fusion approaches and coherent fusion. The midlevel approach uses intermediate-layer output activations from the 3D-CNN and SG-CNN models. These are treated as input features passed to a fully connected layer, and the outputs are concatenated with the original input features before the next fully connected layer of the fusion model.

The late-fusion approach averages the final predictions of the CNN models. This approach is simple but effective to combine multiple model predictions.

Whereas the midlevel fusion model trains each of the three constituent models 3D-CNN, SG-CNN and fused layers separately, the coherent fusion trains all layers together. In addition, the coherent-fusion model was trained on PDBBind 2019 (not 2016).

MD-trajectory-average GBSA

For a selected set of Broad lead molecules – with docking scores better than -7.5 kcal/mol – the pose that ranked best by the single-point MM/GBSA calculation was used as a starting point for a molecular dynamics (MD) simulation. Overall average and window average GBSA scores were calculated over a 200-ns simulation trajectory.

The MD simulations were performed using the pmemd_cuda program in AMBER. The General Amber Force Field (GAFF) was used to model the ligands. The ligand-protein complex was solvated into a truncated octahedron of TIP3 water; 50 Na+ ions with a neutralizing number of Cl- ions were added to the solution. The system was energy minimized with 500 steps of steepest descents and 1500 steps of conjugate gradients.

Initial equilibration was performed with NVT dynamics at 300K for 200 ps with positional constraints (K = 1 kcal/mol•Å2) on the CA atoms in residues. Electrostatic interactions were treated using Particle Mesh Ewald (PME) summation. The nonbonded interactions were cut off at 8 Å. Further equilibration was performed with NPT dynamics for 4.8 ns. The pressure was set at 1 atm using a Monte Carlo barostat. The positional constraints were reduced to 0.5 kcal/mol•Å2). Production dynamics was performed for 200 ns without positional constraints. The MM/GBSA energies were calculated using MMPBSA.py.MPI utilizing the Generalized Born model of Onufriev, Bashford, and Case (igb=5) with a dielectric constant of 4.0 on coordinates saved every 20 ps.

Safety and pharmacokinetic property predictions

For each molecule safety and pharmacokinetic property values have been predicted with models generated with the ATOM Modeling Pipeline (AMPL) and Maestro workflow manager. The AMPL software framework automates the machine learning process, including a number of molecular featurization and machine learning tools.

Models for a number of different safety indicators and pharmacokinetic properties were developed based on experimentally determined values from both human and animal assays. Results from 23 of those models are available here. Details describing the BSEP model were published in JCIM.

The following table describes the models/predictions available. The “Model/features” column indicates the type of data used in each model to represent the molecule; “moe” indicates molecular descriptors calculated by either the Molecular Operating Environment software suite, or by RDKit; “ecfp" indicates representation by an extended-connectivity fingerprint calculated from each molecule's chemical structure using RDKit; “graphconv" indicates a spatial-graph convolutional neural network representation derived from each molecule's chemical structure. The “Model type” column indicates whether the model predicts a continuous value (a real number) – “regression” – or a discontinuous (binary classification) value – “classification”.

Category	Model data/assay	Description	Model/features	Model type
Developability	RDKit QED	qualitative estimate of druglikeness	smiles	regression
Developability	RDKit SAS	synthetic accessibility score	smiles	regression
Efficacy	protease	mm/gbsa for SARS-CoV-2 main protease (Mpro)	moe	regression
Efficacy	protease2	mm/gbsa for SARS-CoV-2 main protease (Mpro)	moe	regression
Efficacy	spike1	mm/gbsa for SARS-CoV-2 spike	moe	regression
PK	Human microsomal clearance model 2	log10 human microsomal clearance rate, alternate model	graphconv	regression
PK	Human VDss model 2	log10 steady state volume of distribution in humans	moe	regression
Safety	Aurora kinase A	pIC50 for aurora kinase A inhibition	graphconv	regression
Safety	Aurora kinase B	pIC50 for aurora kinase B inhibition	ecfp	regression
Safety	Bile Salt Export Pump model 2	pIC50 for bile salt export pump inhibition	moe	regression
Safety	Caspase 1	pIC50 for caspase 1 inhibition	graphconv	regression
Safety	Caspase 3	pIC50 for caspase 3 inhibition	graphconv	regression
Safety	Cathepsin K	pIC50 for cathepsin K inhibition	graphconv	regression
Safety	Cathepsin L1K	pIC50 for cathepsin L1 inhibition	graphconv	regression
Safety	CYP3A4_pIC50	pIC50 for cytochrome p450 3A4 inhibition	smiles	regression
Safety	hERG_pIC50	pIC50 for hERG potassium channel inhibition	moe	regression
Safety	Histamine receptor H1	pIC50 for histamine receptor H1 inhibition	graphconv	regression
Safety	KCNA5	pIC50 for Kv1.5 potassium channel inhibition	graphconv	regression
Safety	KCNH2 graph model 2	pIC50 for hERG potassium channel inhibition	graphconv	regression
Safety	Muscarinic acetylcholine receptor M1	pIC50 for muscarinic acetylcholine receptor M1 inhibition	graphconv	regression
Safety	Muscarinic acetylcholine receptor M2	pIC50 for muscarinic acetylcholine receptor M2 inhibition	graphconv	regression
Safety	Muscarinic acetylcholine receptor M3	pIC50 for muscarinic acetylcholine receptor M3 inhibition	graphconv	regression
Safety	SCN5A NaV1.5 model 2	pIC50 for sodium channel NaV1.5 inhibition, alternate model	graphconv	regression

Molecular descriptors

Molecular descriptors are numerical calculations based on the structure and physicochemical properties of a molecule, including elements, formal charges, and bonds. Results from 81 of those descriptors are available here. A number of descriptors have been calculated with RDKit.

For detailed definitions see QuaSar-Descriptor. Brief summaries are provided here.

PEOE - partial charges, Gasteiger method PC+ - total positive
RPC+ - relative (largest single-atom partial charge compared to PC+)
VSA - van der Waals surface area
POS - over positively-charged atoms
PPOS - over atoms with charge > 0.2
FHYD - hydrophobic (atoms with abs(charge) < 0.2)
POL - polar (atoms with abs(charge) > 0.2)

SMR - molar refractivity (polarizability)
SMR_VSAn - sums of surface area for atoms with SMR in various ranges

SlogP - Log of the octanol/water partition coefficient.
SlogP_VSAn - sums of surface area for atom with SlogP in various ranges

TPSA - total polar surface area

Downloads

Results available in the American Heart Association's Protein Binding Atlas Portal include protein structural models, protein-ligand complexes showing docking poses, and protein-ligand complexes showing refined GBSA binding poses (as mentioned above, in the GBSA calculations an additional step of energy minimization is applied to the docking poses). These can be downloaded in PDB-file format (“.pdb”).

Protein Binding Prediction Count Summary

There were 4.3K out of 10K that met the cutoff for efficacy predicted bindings between compounds and their expected protein targets. There were 34K out of 119K that met the potential safety concern cutoff for predicted bindings between compounds and AMPL safety protein targets.

	Compound Count	Protein Count	Protein Binding Prediction Count	Compound Cutoff Count	Protein Cutoff Count	Protein Binding Prediction Cutoff Count	Protein Binding Prediction Cutoff Percentage
Efficacy: Drug On-Target	3,930	1,704	10,222	2,107	922	4,306	42.12%
AMPL Potential Safety Concern and AMPL Safety Protein	26,430	12	119,111	12,726	12	33,989	28.54%

Data Release

1.5 Feb 2024

Compound Group: Broad, ChEMBL

1.4 Nov 2023

Compound Group: ChEMBL, Foodome

1.3 Mar 2023

Compound Group: Foodome

1.2 Dec 2022

Protein Group: EC Number, Reactome and SMPDB pathways, UniProt Disease

1.1 Jun 2022

Compound Group: ChEMBL
Protein Group: UniProt pathway
Protein Binding Prediction Type: Docking, GBSA, Coherent-Fusion

1.0 Apr 2022

Compound Group: Broad, ChEMBL, Marine
Compound Feature: Functional Groups, Chemical Descriptors
Protein Binding Prediction Type: Docking, Coherent-Fusion
Homology model protein structures
Safety and pharmacokinetic property predictions

American Heart Association Protein Binding Atlas Portal for Accelerated Drug Discovery