Building a Global PDB Repository

Building a Global PDB Repository

In this post, we’ll discuss how we manage and process PDBs, a critical chemical data filetype that encodes the 3-D structure of proteins.

What is a PDB?

At Reverie, we use 3D protein structural data for many applications - medicinal chemists evaluating structure-activity relationships, computational chemists running molecular dynamics simulations, machine learning engineers training models with 3D structural data, and numerous other use-cases. This 3D structural information is often stored in .pdb file format. PDB files can contain many types of metadata about protein ligand structures, but we are typically most concerned with the HETATM and ATOM record lines that annotate ligand and protein atoms respectively, which look like:

In order, the first 9 columns specify an atom's type, sequential id, element, residue name, chain, residue id, and x, y, z coordinates. These PDB structures are typically produced using x-ray crystallography, a method in which X-rays are passed through a crystallized protein and the resulting diffraction patterns are used to determine the 3D structure of the underlying protein. However, some protein targets are difficult to crystallize. In these instances, computer modeling techniques are used to hypothesize the protein structure and generate homology models. The largest open-source collection of PDBs is the Protein Data Bank (the .pdb file's namesake) which stores more than 170,000 structures available for public use. Our teams at Reverie leverage a combination of open-source PDB files, in-house homology models, and proprietary crystal structures. For more information on the structure of a PDB file, including explanations of PDB chains, conformations, and residues, RCSB provides a great quickstart guide:

Why store all PDBs centrally?

PDBs in the Protein Data Bank are uniquely identified by a four-digit alphanumeric ID. We'll take 6VN8 as an example. In practice, team members don't always use the PDB versions downloaded directly from RCSB. Some chemists and engineers prefer to download structures from the KLIFS database (an aligned kinase PDB database that we'll talk more about later), and some team members perform other methods of pre-processing that alter the atom coordinates of PDB structures. Ultimately, this can lead to structural data mismatch between projects - the PDB that one engineer refers to as 6VN8 is not guaranteed to be identical to what another engineer calls 6VN8. If you view the Protein Data Bank and KLIFS versions of 6VN8 in PyMOL, despite the structures being identical, it is clear that they are positioned in different areas of 3D space.

Here are a few example cases when it is essential that cross-disciplinary teams are working from identical sources of PDB data:

  • Generating input features for a machine learning model by combining molecular dynamics outputs generated by our computational chemistry team with interaction fingerprints developed by our software engineers.
  • Comparing docked poses from medicinal chemists manually running docking programs to docked poses from our scalable docking processes created by software engineers.
  • Visually analyzing 3D results from any of our machine learning or computational chemistry methods side-by-side.

These situations motivated us to build an internal Reverie PDB cache that contained open-source RCSB PDBs, homology models, and internal crystal structures, and could be uniquely identified by a Reverie PDB ID.

PDB files come with their own set of quirks, making them difficult to automatically process. Different methods of dealing with these quirks contributed to the structural data mismatch between teams. When we set out to build a Reverie PDB cache, we wanted to ensure that cleaning and preprocessing PDBs only had to happen once, automatically, at upload time.  For all of our kinase PDBs, we additionally align the binding pockets to a single reference frame, allowing us to easily compare binding pockets of different structures. In this post, we’ll introduce some of the open source and commercial tools that we use to align and prepare PDBs before uploading them to the cache.

Design goals for the PDB Cache

We had three high-level design objectives for the PDB cache:

  1. The binding sites of all kinase PDBs should be aligned to a single reference frame.
  2. PDBs should be cleaned and processed automatically when uploaded (residue gaps filled in, multiple copies of identical chains removed, C and N termini capped by NME and ACE residues etc.)
  3. Orthosteric ligands should be extracted and saved to separate PDB and SDF files.

Binding Site Alignment

Alignment of all kinase PDB structures to a common global reference frame is a key component of the PDB cache. With aligned structures, we can do things like:

  • easily compare the binding pockets of our target molecules with important off-target structures to form strategies for creating selective molecules
  • look at differences between multiple structures of the same target to understand how different ligands induce conformational changes
  • perform PDB clustering based on protein conformation

We handle the alignment of publicly available v.s. internal PDBs differently.

Publicly Available Structures

For publicly available kinase PDB structures, we retrieve aligned PDB files from the KLIFS database. The KLIFS database contains an up-to-date collection of publicly available kinase PDBs, which have all been structurally aligned based on the 85 residues in their binding site. More information about the KLIFS alignment process and a searchable database of aligned PDB structures can be found on their website:

Proprietary Crystal Structures

At Reverie, we also have access to proprietary kinase crystal structures and custom homology models built by our computational chemists. Since these structures are not available for download from the KLIFS database, we use the biopython (bioPDB) library to align these structures to similar pre-existing PDBs in the cache. Users specify the reference PDB to which their PDB should be aligned, typically a kinase within the same family.

In order to superimpose PDB structures with BioPDB, we must identify which pairs of residues in the two PDBs should be aligned. We use the BioPDB sequence alignment module to align the residue sequences of our new PDB structure to the reference structure.

In PDB structures, amino acid residues are specified by their 3-letter identifier. However, the sequence alignment module works on single-letter identifiers. E.g. ALA-TYR-ALA would become A-Y-A when converted to its single letter modifiers. We define a helper function that takes in a bioPDB structure and returns the single-letter amino acid sequence of the structure and a list of the bioPDB residue objects for each of those amino acids. We'll use this utility in the next section.

import Bio.PDB as biopdb
from Bio import pairwise2
from Bio.PDB.Polypeptide import three_to_one

def _get_amino_acid_sequence(biopdb_structure, chain):
    From a biopdb structure:
    Build a single-letter amino acid residue sequence
    and a list of their corresponding residue objects.
    sequence = []
    residues = []
    for residue in biopdb_structure[0][chain].get_residues():
    	letter_code = three_to_one(residue.get_resname())
    return (sequence, residues)

Next, we use our utility to get the amino acid sequence and align it using the bioPDB pairwise2 alignment module. This will return the two sequences with their amino acids aligned. For example, if our input sequences were ACCGT and ACG, the pairwise2 alignment function would return ACCGT and A_CG_. The last step is to run get_residues_to_align, which simply iterates through the two aligned sequences and, wherever the two amino acids match across aligned sequences, grabs the corresponding bioPDB residue object from each structure.

def get_residues_to_align(ref_structure, align_structure, chain = "A"):
    Given two biopdb structures,
    return ref_residues and align_residues:
    lists of the corresponding biopdb residue objects from
    each structure to align to eachother.
    i.e. align_residues[0] will be aligned to ref_residues[0]
    #get single-letter amino acid sequence of both structures
    ref_sequence, ref_residues = _get_amino_acid_sequence(ref_structure, chain)
    align_sequence, align_residues = _get_amino_acid_sequence(align_structure, chain)
    #use biopdb to get a sequence alignment of the two structures
    alignment = pairwise2.align.globalxx("".join(ref_sequence), "".join(align_sequence))[0]
    alignment_ref = alignment.seqA
    alignment_align = alignment.seqB
    #get corresponding residue objects for identical amino acids in the two structures
    ref_residues, align_residues = get_residues_to_align(alignment_ref, alignment_align, ref_residues, align_residues)
    return ref_residues, align_residues

Now that we can get the lists of corresponding residues to align from two pdb structures, we can load both structures into pdb_parser objects and superimpose one structure onto the other using the bioPDB Superimposer module. The superimposed structure is saved to a new output pdb.

def align_pdbs(reference_pdb: str, pdb_to_align: str, chain: str, output_path: str) -> None:
    Aligns one PDB to another using RMSD minimization on the CA atoms of     conserved residues between the two structures.
    	str reference_pdb: path to PDB that is used as alignment      								reference
        str pdb_to_align: path to PDB that will be aligned
        str chain: pdb chain to align ("A", "B", "C", etc)
        str output_path: path to output aligned pdb
    #Initialize the BioPDB structure objects for both PDBs
    pdb_parser = biopdb.PDBParser()
    ref_structure = pdb_parser.get_structure("reference", reference_pdb)
    align_structure = pdb_parser.get_structure("pdb_to_align", pdb_to_align)
    #find residues that are conserved between the sequence alignment of the two PDBs
    ref_residues, align_residues = get_residues_to_align(ref_structure, align_structure)
    #get list of CA atoms from matching residues
    ref_atoms = [residue["CA"] for residue in ref_residues]
    align_atoms = [residue["CA"] for residue in align_residues]
    #initialize Superimposer object
    super_imposer = biopdb.Superimposer()
    super_imposer.set_atoms(ref_atoms, align_atoms)
    all_atoms_disordered = []
    for residue in align_structure.get_residues():
    	#get all atoms including disordered residues
        for atom in residue.get_unpacked_list():
    #apply the superimposer transformation to all atoms in the align 		 structure
    #save aligned structure to PDB file
    io = biopdb.PDBIO()

The figure below shows the RCSB structures for CAMKII (2V7O and 6AYW) before and after alignment.

Before (left) and after (right) alignment


The aligned PDBs are then passed to the cleaning phase. In this phase, we use a variety of proprietary and commercial tools:

  • enumeration of alternate locations
  • modeling missing loops and residues
  • capping chain breaks
  • resolving disordered and clashing residues
  • ligand and cofactor tautomer enumeration

By doing this PDB preparation at PDB cache upload time, we avoid having to deal with these issues in individual ML and computational data pipelines.

Extracting ligands

The final step of our PDB prep process is identification and extraction of the orthosteric ligand. This occurs in two main steps. First, we use the BioPDB library to filter based on simple heuristics like residue type, size, number of atoms, and molecular weight. If no residues remain after this filtering, we return None, and the PDB is considered an apostructure. Otherwise, we classify the largest remaining residue name as the orthosteric ligand.

import Bio.PDB as biopdb

def get_orthosteric_ligand(path_to_pdb: str) -> str:
    Given the path to a kinase PDB file, determines the orthosteric 		ligand and returns the residue code and id.
    	path_to_pdb str: path to pdb file
    	tuple: Residue name and ID for orthosteric ligand in PDB
    #  Filtering based on simple heuristics. #
    #load pdb into biopdb structure instance
    parser = biopdb.PDBParser()
    structure = parser.get_structure(path_to_pdb, path_to_pdb)
    residues = structure.get_residues()
    #keep hetatm residues
    residues = [residue for residue in residues if				   		(_valid_hetatm(residue.get_id()[0]))]
    #filter out residues with less than 8 heavy atoms
    residues = [residue for residue in residues if sum(1 for _ in  		residue.get_atoms())>8]
    #filter based on molecular weight
    residues = [residue for residue in residues if 50 < _get_mol_weight(residue) < 1500]
    sorted_residues = sorted(residues, key = _get_mol_weight)
    if not sorted_residues:
    	return None
    orthosteric_residue = sorted_residues[-1].get_resname()

Sometimes, multiple ligands with the same residue name exist within a PDB. In order to filter for the true orthosteric ligand, we iterate through the residues with that name and find the one closest to the center of the kinase binding site. Because the binding sites of all of our kinase structures are aligned, we created a function called get_ligand_distance that computes the distance from our universal binding site location to a specified residue.

    # Filtering based on distance to binding site #
    #finds all residues with the name of the orthosteric_residue
    with open(path_to_pdb, "r") as f:
    	multiple_ligand_ids = []
        for line in f:
        	if get_res_name(line) == orthosteric_residue:
    #if multiple ligands with this residue name exist, find closest to 		 binding site
    multiple_ligand_ids_unique = set(multiple_ligand_ids)
    if len(multiple_ligand_ids_unique) > 1:
        multiple_ligand_distances = {res_id: 			_get_ligand_distance(path_to_pdb, res_id) for res_id in 		multiple_ligand_ids_unique}
        best_res_id = min(multiple_ligand_distances, key=multiple_ligand_distances.get)
    	(best_res_id,) = multiple_ligand_ids_unique
    return (orthosteric_residue, best_res_id)

After identifying the ligand that is closest to the binding site, that ligand is extracted to its own PDB file and saved separately in the cache. At the time of download, users can also specify whether to include ligand, water, and other cofactor HETATM records in the PDB, at which point they are added back to the PDB. Since different PDB applications have different assumptions, this is a great way to add customization while maintaining rigidity within the PDB version uploaded to the cache.

UI for PDB Cache Access

We've created a variety of ways for users to download PDBs from the cache:

  • Python Utils
    When building large ML datasets with Python, it is often most convenient for our ML Engineers and Data Scientists to call a function that downloads the correct version of the PDB, with arguments that allow the user to specify which HETATM records to include.
  • Command Line Interface
    We built a command line interface around the python utilities.
  • Streamlit App
    Computational and medicinal chemists doing one-off downloads of specific PDB files can use our simple PDB-download web app. This simple application from our internal suite of Reverie apps was built using Streamlit.

    In an upcoming blog post, we'll explain more about how we use Streamlit to quickly make simple python functionality accessible to our medicinal chemists!


In this post we explained just a few of the tools that we use for PDB processing.  The cleaned and aligned repository of PDBs has made our chemistry and machine learning workflows move quicker by allowing us to spend time focusing on modelling instead of data processing.

We’re hiring!

We’re actively hiring engineers across our tech stack and chemistry team, including Full Stack Engineers, Machine Learning Engineers, and Senior Data Scientists to work on exciting challenges that are critical to our approach to developing life-saving cancer drugs. You will work with a deeply technical YC-backed team that is growing in size and scope. You can read more about us at, and please reach out if you’re interested in learning more.