Scaling Drug Development with Containerized Machine Learning

This blog post was originally published on the AWS Startups blog here.

At Reverie Labs, we use computation to drive the development of therapeutics for cancer. To do this, we have built substantial cloud-based infrastructure to train machine learning models, deploy models to production, and build and ship internal-facing applications for our chemistry teams. Throughout, we rely heavily on an indispensable tool for modern software engineering workflows: Docker. This blog post sheds some light on how we use Docker in various parts of our production workflows.

What is Docker?

Read more about the details here, but Docker is essentially a tool for building immutable images and running them as containers on the host's Linux kernel. This enables us to build images whose filesystem has all of the dependencies pre-installed, in the Linux distribution of our choice, regardless of the configuration of the host machine. In other words, if the application works in a Docker container on an engineer’s machine, it should work on any Linux machine with Docker installed. Here are some of the areas where we use Docker and why:

ML Training Containerization

One of the most challenging aspects of modeling is reproducibility: models are often sensitive to small changes in code and hyperparameter configurations. Being able to recreate a fixed snapshot of the ML code and its environment is critical to making experiments repeatable (especially months in the future, when we may want to retrain with new data and compare outcomes). Docker enables us to achieve this. At Reverie, all production machine learning pipelines run in a Docker container with the following:

An Ubuntu 16 base image
NVIDIA GPU drivers installed
All cheminformatics dependencies (e.g. RDKit) installed
A Python 3.6 conda environment, with packages specified using a version-controlled environment.yml file
For machine learning: TensorFlow, PyTorch, and scikit-learn installed

For an engineer to begin training locally, all she has to do is pull the image and run a Python script pointed at the AWS S3 path with the training data. Need a fellow engineer to help debug? Have her pull the same image. They now have identical dev environments. Want to make sure the code will run in production the same way it does in dev? Just run it in the container before launching the training job.
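As a rough illustration of that workflow, the sketch below assembles the `docker pull` and `docker run` commands an engineer might use. The image name, script name, `--data` flag, and S3 path are hypothetical placeholders rather than Reverie's actual names, and `--gpus all` assumes Docker 19.03 or later with the NVIDIA container toolkit installed on the host.

```python
import shlex
from typing import List


def training_commands(
    image: str, script: str, s3_data_path: str, use_gpu: bool = True
) -> List[List[str]]:
    """Return argument lists for pulling an image and running a training script in it."""
    pull = ["docker", "pull", image]
    run = ["docker", "run", "--rm"]
    if use_gpu:
        # Assumption: the host has the NVIDIA container toolkit installed.
        run += ["--gpus", "all"]
    # "--data" is a hypothetical flag of the hypothetical training script.
    run += [image, "python", script, "--data", s3_data_path]
    return [pull, run]


if __name__ == "__main__":
    # All names below are placeholders for illustration only.
    for cmd in training_commands(
        image="example.registry/ml-train:latest",
        script="train.py",
        s3_data_path="s3://example-bucket/training-data/",
    ):
        print(shlex.join(cmd))
```

In practice the container would also need AWS credentials to read from S3, which this sketch leaves out.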

ML Training Parallelization

Perhaps the biggest benefit we get from using Docker for ML training is the ability to leverage batch queueing tools on cloud providers to parallelize training jobs. To do this, we use AWS Batch and Amazon Elastic Container Service, which can run arbitrary batch jobs in Docker containers. Here is a summary of these tools:

Amazon Elastic Container Service (via AWS Fargate): Enables us to run arbitrary commands in containers, without having to manage the underlying instances. We just specify a container along with how many CPUs and how much memory we want, and AWS runs the container in our AWS VPC with the requested resources.
AWS Batch: A queuing and container management tool that runs containers on specified instance types, enabling us to manage complex workflows like running preprocessing workflows on large CPU instances, and then training workflows on nodes with NVIDIA V100 GPUs.
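To make the preprocessing-then-training pattern concrete, here is a minimal sketch that builds the keyword arguments for the AWS Batch SubmitJob call, with the training job depending on the preprocessing job. The queue names, job definition names, and commands are hypothetical, and the sketch only constructs the requests; actually submitting them would require boto3 and AWS credentials.

```python
from typing import Any, Dict, List, Optional


def batch_job_request(
    name: str,
    queue: str,
    definition: str,
    command: List[str],
    depends_on: Optional[List[str]] = None,
    gpus: int = 0,
) -> Dict[str, Any]:
    """Build keyword arguments for the AWS Batch SubmitJob API call."""
    overrides: Dict[str, Any] = {"command": command}
    if gpus > 0:
        # Batch expects resource values as strings.
        overrides["resourceRequirements"] = [{"type": "GPU", "value": str(gpus)}]
    request: Dict[str, Any] = {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "containerOverrides": overrides,
    }
    if depends_on:
        # The job starts only after the listed jobs have succeeded.
        request["dependsOn"] = [{"jobId": job_id} for job_id in depends_on]
    return request


# Hypothetical names throughout. With boto3 one would submit these as:
#   batch = boto3.client("batch")
#   prep_id = batch.submit_job(**preprocess)["jobId"]
preprocess = batch_job_request(
    "preprocess", "cpu-queue", "preprocess-def", ["python", "preprocess.py"]
)
train = batch_job_request(
    "train", "gpu-queue", "train-def", ["python", "train.py"],
    depends_on=["<jobId returned for preprocess>"], gpus=1,
)
```

Keeping request construction separate from submission makes the job graph easy to inspect and unit test without touching AWS.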

To train deep learning models, we use a method called Bayesian Hyperparameter Optimization (HPO). We developed an in-house HPO tool that determines parameters for jobs, launches jobs on AWS Batch, and then uses the results from those jobs to determine the next set of job hyperparameters. Our HPO server runs on AWS Fargate and has IAM permissions to launch jobs on AWS Batch. Each worker on AWS Batch reports scores back to the HPO server, which uses that information to pick the next set of parameters. After running the pre-specified maximum number of jobs, the HPO server uploads a summary of the experiment to S3 for an ML engineer to analyze further, and then exits.
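The control flow of such a server can be sketched in a few lines. This is not Reverie's in-house tool: the Bayesian proposal step is replaced here by a much simpler stand-in (uniform exploration followed by random perturbation of the best result so far), jobs run one at a time rather than in parallel, and `launch_job` is a hypothetical callable standing in for an AWS Batch job that reports a score.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Params = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]
History = List[Tuple[Params, float]]


def suggest(history: History, bounds: Bounds, rng: random.Random) -> Params:
    """Pick the next hyperparameters (simple stand-in for a Bayesian step)."""
    if len(history) < 5 or rng.random() < 0.3:
        # Explore: sample uniformly inside the bounds.
        return {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}
    # Exploit: perturb the best configuration seen so far, clamped to bounds.
    best_params, _ = max(history, key=lambda item: item[1])
    proposal = {}
    for k, (lo, hi) in bounds.items():
        step = rng.gauss(0.0, 0.1 * (hi - lo))
        proposal[k] = min(hi, max(lo, best_params[k] + step))
    return proposal


def run_hpo(
    launch_job: Callable[[Params], float],
    bounds: Bounds,
    max_jobs: int,
    seed: int = 0,
) -> Dict[str, Any]:
    """Run max_jobs (at least 1) jobs sequentially and return a summary."""
    rng = random.Random(seed)
    history: History = []
    for _ in range(max_jobs):
        params = suggest(history, bounds, rng)
        score = launch_job(params)  # stands in for a job on AWS Batch
        history.append((params, score))
    best_params, best_score = max(history, key=lambda item: item[1])
    return {
        "num_jobs": len(history),
        "best_params": best_params,
        "best_score": best_score,
        "history": history,
    }
```

A real server would write the returned summary to S3 before exiting; here it is simply returned to the caller.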

Schematic of our Bayesian HPO-based model training system built on AWS ECS and AWS Batch.

ML Inference at Scale

At Reverie, training ML models is only Step 1. Our company is focused on developing our own cancer drugs, and as such we are constantly scoring potential molecules using models, often at enormous scale (millions to billions of prospective compounds). To do this, we use Amazon Elastic Kubernetes Service (EKS) to serve billions of predictions scalably.

Kubernetes is a container-orchestration system that is particularly well-suited for orchestrating long-running web applications (as opposed to AWS Batch, which is better suited for batch jobs). EKS is easy to set up, with helpful guides released by the AWS team. Kubernetes requires a user to specify one or more Docker containers that can be used to set up a scalable application. To serve a model, we create the following Kubernetes entities:

Service: We use a Kubernetes Service of type LoadBalancer, which is implemented on AWS as a private Elastic Load Balancer. This gives each model a single, elastically scalable URL that is only accessible inside our AWS environment.
Deployment: Our deployments consist of 1 pod with 3 containers: an nginx reverse proxy, a preprocessing server written using Flask and run with Gunicorn, and a tensorflow_model_server. It is easy to customize the containers in each pod to enable scalability.
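The two entities above can be sketched as plain Kubernetes manifests built in Python. The model name, image names, and port numbers are hypothetical, and the internal load balancer annotation is an assumption about how a private Elastic Load Balancer might be requested (it is the legacy in-tree AWS annotation; newer controllers use different ones).

```python
import json
from typing import Any, Dict


def model_service(name: str, port: int = 80) -> Dict[str, Any]:
    """Service of type LoadBalancer that routes to pods labeled app=<name>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {
                # Assumption: request a private (internal) load balancer on AWS.
                "service.beta.kubernetes.io/aws-load-balancer-internal": "true",
            },
        },
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": 80}],
        },
    }


def model_deployment(
    name: str, images: Dict[str, str], replicas: int = 1
) -> Dict[str, Any]:
    """Deployment whose pod template holds one container per entry in images."""
    containers = [{"name": role, "image": image} for role, image in images.items()]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": containers},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image names for the three containers described above.
    deployment = model_deployment(
        "mymodel",
        {
            "nginx": "example.registry/nginx-proxy:latest",
            "preprocessing": "example.registry/preprocess:latest",
            "tf-serving": "example.registry/tf-serving:latest",
        },
    )
    print(json.dumps([model_service("mymodel"), deployment], indent=2))
```

Scaling out then amounts to raising `replicas` (or attaching an autoscaler), since every replica is an identical copy of the three-container pod.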

This deployment can be autoscaled elastically: during periods of high load, we can scale up the deployment to hundreds or thousands of replicas, and EKS will automatically provision more worker nodes to handle the load. This enables us to screen molecules at billion-scale.

Schematic of our internal Kubernetes cluster, built using AWS EKS.

A Bonus: We use a private hosted zone on AWS Route 53 to resolve DNS names for any Kubernetes service. This means that users can provide DNS names (like mymodel.mycompany.com) when setting up their model for inference, and they will resolve to the load balancer’s IP address. How do we do this using Kubernetes? Again, using Docker! We have a simple deployment of the external-dns container that one can find and deploy from here in 5 minutes, and it works out of the box.
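With external-dns running in the cluster, the DNS name is typically requested through an annotation on the Service. The helper below is a small sketch of that idea: it adds the external-dns hostname annotation to a Service manifest represented as a Python dictionary. The example manifest and hostname are hypothetical.

```python
import copy
from typing import Any, Dict


def with_dns_name(service: Dict[str, Any], hostname: str) -> Dict[str, Any]:
    """Return a copy of a Service manifest annotated with an external-dns hostname."""
    annotated = copy.deepcopy(service)  # leave the caller's manifest untouched
    metadata = annotated.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations["external-dns.alpha.kubernetes.io/hostname"] = hostname
    return annotated


# Hypothetical minimal Service manifest and hostname.
example_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "mymodel"},
    "spec": {"type": "LoadBalancer", "selector": {"app": "mymodel"}},
}
example_annotated = with_dns_name(example_service, "mymodel.mycompany.com")
```

external-dns watches for this annotation and creates the matching record in the hosted zone it has been configured to manage.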

At this point, we have hopefully made the case that Docker is a great tool for developing ML software, debugging, training, and running inference at scale. However, we have gotten a variety of other benefits from using this toolkit. Here are a few examples:

Continuous Integration / Continuous Deployment (CI/CD)

We use GitHub Actions and CircleCI to orchestrate our CI/CD workflows. Both of these CI/CD tools are Docker-native, making them easy to configure in a Docker-driven environment. Here is a small sampling of the workflows we have configured:

On every pull request, a suite of unit tests runs against our software in the same container that will be used in production.
Upon merging the pull request, the development and production environment containers are rebuilt and tagged, allowing users to seamlessly use them in training and inference workflows.
We can run arbitrary code on every pull request in a container. Our internal reveriebot takes advantage of this feature to automate rebase validation, style checking, pull request review assignment, and a variety of other Git processes.
We can manage any of this AWS tooling using infrastructure-as-code software like Terraform, which lets us integrate changes to the infrastructure itself into our CI/CD pipelines using Terraform's official Docker image.

Internal Application Development using Django, deployed on Kubernetes

At Reverie, we are continuously building a suite of internal applications that our engineers, chemists, and computational chemists use to drive the drug development process forward. These tools enable us to view dashboards of our model performance, run computational chemistry workflows at massive scale, get predictions from our models, and generate ideas for new compounds.

These applications are developed using libraries like Django, Flask, Dash, and a variety of others. To manage the complex set of dependencies associated with these tools, we use Docker to simplify deployment. Specifically, for each set of applications, we have a Docker image that specifies the applications and all of their dependencies. We can use EKS to provision an internal service and deployment just like before, which enables Reverie employees to securely access the tools while connected to our AWS environment. Deploying a new application or updating an existing one is as simple as building an image, giving it a DNS name (like mytool.mycompany.com), and deploying it to EKS. We even do these updates directly via our CI/CD pipelines.

We’re hiring!

We’re actively hiring engineers across our tech stack, including Full Stack Engineers, DevOps Engineers, Cloud Architects, and Infrastructure Engineers to work on exciting challenges that are critical to our approach to developing life-saving cancer drugs. You will work with a deeply technical (all engineer and scientist!) YC-backed team that is growing in size and scope. You can read more about us at www.reverielabs.com, and please reach out if you’re interested in learning more.
