
What I wish I knew about ML infrastructure when I was a researcher

By Carlos Ledezma, Product Manager at nPlan: how Docker can make ML code portable, reproducible, and easy to deploy, with practical examples for ML engineering.

Machine Learning in research looks a whole lot different from Machine Learning in industry. However, there is a problem common to both: scalability. Nowadays we have ML models and datasets so large that it would be almost unthinkable to do ML research without a GPU, a high-performance computing cluster or cloud infrastructure. As ML practitioners, we have little choice about which infrastructure we will have access to; normally our companies or research groups decide that for us. However, we do have full control over how we structure our code for deployment. This post is about how we can make our code infrastructure-agnostic, so we can deploy it on whatever resources we have available and scale it adequately.

This post starts with the motivation and what I see as the main issues that prevent scaling. I will then present Docker as a solution to many of those problems. Finally, I will give a short introduction to putting training code in a Docker image that can be deployed nearly anywhere. If you are only interested in the Docker primer, feel free to skip to the end.

This post talks about how we can use Docker to alleviate many of the infrastructure pains in ML

The non-ML pains of training an ML model

One of the first challenges that ML researchers and engineers will face is training infrastructure. A single GPU will reach maximum capacity quickly and can’t keep up with the amount of training that needs doing. For example, a typical hyper-parameter search needs to train hundreds of models before finding the right one. In order to do a thorough search, we may need several different machines with GPUs and our code needs to run well in all of them, even if the GPUs are different.

Some of us will be lucky enough to have access to cloud computing services or a high performance computing cluster. In this case the challenge changes. Because the training environment is different from the local development environment, it is easy to end up with training code that works locally but fails remotely. This can lead to many hours of complicated debugging.

Finally, even when we get our code to run on our GPUs or cloud infrastructure, we face the challenge of reliably reproducing our results. Even if we have set all our random seeds, a change in library versions that updates an optimiser's implementation can lead to different results. Hence, as ML practitioners, reproducibility often means reproducing the whole training environment, not just knowing which code and hyperparameters were used. This challenge extends to inference time, as we would like our inference environment to be as close as possible to the training environment so we can avoid training-inference drift.


What is Docker and how does it help in ML?

Docker is a set of tools for packaging software, together with everything it needs to run, into units called images that run in isolated containers. It is quite comprehensive, but just learning how to use its basic unit, the image, already helps alleviate the challenges mentioned above.

In order to make ML code infrastructure-agnostic, we package it in a Docker image. This image can then be deployed in Docker containers regardless of the underlying hardware (GPU or CPU), which means our code will run on nearly any computer. Docker also enables deployment to cloud services, thus making it more scalable.


A Docker image packages an application together with all of its dependencies. Docker can then use that image to create containers that run the application. Images are truly minimal: you usually start from something like a bare-bones Linux distribution and declare all the software that needs to be installed to run your application; Docker then builds your image. Once built, the image can be deployed in a container on virtually any Linux, Windows or Mac computer (warning: Apple Silicon chips can cause trouble). The image doesn't need to be re-built, which means the codebase is effectively "frozen" as it was when the image was built.

Right now you may be thinking that, even though an image only needs to be built once, the process of figuring out everything you need to get a blank OS to run a training application may be too much. Luckily, there is an answer to that as well. As Docker has become more popular, more people have been building base images and making them available. All the officially supported images can be found on Docker Hub. You can use a base image as the starting point for your own, which gives you a head start in declaring everything you need. For example, you can find images with virtually any Python distribution installed, as well as images with CUDA and cuDNN pre-installed.
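As a rough illustration, pulling a ready-made Python or CUDA base image is a one-liner; the tags below are examples and are worth checking on Docker Hub before you rely on them:

docker pull python:3.8-slim
docker pull nvidia/cuda:11.2.0-cudnn8-runtime-ubuntu20.04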

Docker images alleviate the non-ML challenges mentioned above. First, by using third-party built and tested base images, you reduce the maintenance burden of setting up the underlying environment yourself; you can trust that NVIDIA knows how to properly install their drivers. Second, by building and deploying training code through images, you can guarantee that the code runs in the same environment no matter where it is deployed. Finally, reproducibility is made easier (albeit not guaranteed) because you can be sure the code runs with the same package versions every time, regardless of what is happening on the host machines. Moreover, by storing versions of your images you will always be able to reproduce old models without having to do git magic to return your code to a previous state.


Deploying ML training in a Docker container

By this point, I hope you are as excited as I was when I first found out about all the cool things you can do with Docker. So, to wrap up this post, I will go over a simple example of how to set up a Docker container for training with Python and TensorFlow. Before you start, make sure that you have Docker installed. You can also get the code that I will be referencing from this GitHub repo.

The Dockerfiles (image declarations) in the accompanying code were created so that the application can be deployed on Google's AI Platform (AIP). Some things can be skipped if you don't need that capability. There are two Dockerfiles, one for training on CPU and another for training on GPU. The latter also works on CPU, but requires more setup, so I will use it for reference.

The first step in the process is the ML training code itself. The Docker image is agnostic to the training code in the same way that Ubuntu would be agnostic to your code. So, for the purposes of this example, the training code is fairly simple, but you can be confident that the rest of the setup works with more complicated training regimes. As you write your code, keep a detailed list of the packages you are using (ideally in a requirements.txt or a setup.cfg file) so you don't have to trial-and-error them later. Also, keep in mind that your application needs to run standalone, so don't assume you will have access to everything on your own computer. You can copy data in and out of the container (I will cover that in a moment), but this is a good point to start thinking about storing your results in the cloud so they are more easily accessible. One final thing to bear in mind is that the simplest way to run a container is through the command line; hence, it is convenient to include an argument parser so you can run your training container with all the parameters you want. The arguments used in this example are defined in the get_args function.
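As a rough sketch (not the exact code from the accompanying repo), a minimal trainer with a get_args-style parser could look like the following; the argument names, dataset and model are placeholders:

import argparse
import tensorflow as tf


def get_args():
    # Expose hyperparameters on the command line so the container can be configured at run time
    parser = argparse.ArgumentParser(description="Example training entrypoint")
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=1e-3)
    return parser.parse_args()


def main():
    args = get_args()
    # Toy data and model; the Docker image is agnostic to what happens here
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(args.learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(x_train, y_train, epochs=args.epochs, batch_size=args.batch_size)


if __name__ == "__main__":
    main()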

Next, it's time to create an image. You can look at the code snippet, without the Google AIP requirements, below this paragraph. Before we go into installing anything, we need to decide which base image to use. For this example I decided to go with TensorFlow 2.7.0; a quick look at the TF compatibility table tells me that I need CUDA 11.2 and cuDNN 8. So I can find an image in the nvidia/cuda image repository and use that as a base. Next, we need to figure out which Python version to use. Again, from the compatibility table we know that we can use Python 3.7–3.9; we will use 3.8 for this example. Next, we need to copy the required code into the image, which is a Docker one-liner. Then, we move into the directory we just copied and install everything as an editable package using pip; this also installs all the required Python dependencies. Finally, we define the entrypoint for our container as the command python3 aip_trainer/trainer.py. The entrypoint is the command that will be executed when the container is run.

# Get the base image from nvidia/cuda and set up the image as root
FROM nvidia/cuda:11.2.0-cudnn8-runtime-ubuntu20.04
WORKDIR /root
USER root

# Install Python (curl is added here in case the base image does not already include it)
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3.8-full curl && \
    # Install pip
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3 get-pip.py && \
    python3 -m pip install --upgrade pip

# Copy the required package into the image and install it with pip
COPY . aip-trainer
WORKDIR aip-trainer
RUN pip install -e . --user

ENTRYPOINT ["python3", "aip_trainer/trainer.py"]
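One assumption hiding in the pip install -e . step is that the repository contains packaging metadata. If yours doesn't yet, a minimal setup.cfg along these lines, plus a stub setup.py containing from setuptools import setup; setup(), is enough; the package name and pinned versions below are illustrative:

# setup.cfg (illustrative; pin the versions you actually trained with)
[metadata]
name = aip-trainer
version = 0.1.0

[options]
packages = find:
install_requires =
    tensorflow==2.7.0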

That is pretty much it. You have now defined a Docker image which will run your training code on nearly any machine you can think of. The next few steps are fairly simple. First, you need to build your image. To do this, from the base folder containing the code, run the following command (the trailing dot tells Docker to use the current directory as the build context):

docker build -f dockerfiles/Dockerfile-gpu -t <image_name>:<label> .

You can then run your image in a container by doing:

docker run -t <image_name>:<label> --arg1 val1 --arg2 val2 ...

If you want to run your code on a GPU machine, that machine will need to have the NVIDIA Container Toolkit installed. You will then be able to run your container with GPU access by doing:

docker run --runtime=nvidia -t <image_name>:<label> --arg1 val1 --arg2 val2 ...


Going the extra mile

This is only the tip of the iceberg. Docker images are generally useful tools to develop and scale software. Familiarising yourself with this technology can open up many possibilities that will make you a better ML engineer or researcher.

You can supercharge your MLOps using Docker images. If you save your images in a container registry of your choice and push images with different labels, you can keep track of how all your models have been trained. Cloud training services offer the possibility of running training jobs directly from your images, so by becoming familiar with images you could scale your training up by several orders of magnitude.
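For example, with a hypothetical registry path and label, tagging and pushing a build so it can be pulled by a cloud training service could look like:

docker build -f dockerfiles/Dockerfile-gpu -t my-registry.example.com/aip-trainer:v1 .
docker push my-registry.example.com/aip-trainer:v1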

Once you are familiar enough with Docker, you can expand what you know to other areas of machine learning. For example, images are also used to define custom components for Kubeflow Pipelines, a tool that is widely used to create ML pipelines with MLOps in mind. With it, you can use custom components for data ETL, preprocessing, model training and inference.
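As a sketch of what that can look like (assuming the KFP v1 SDK; the image path and argument names are hypothetical), a pipeline step can simply run the training image we built earlier:

# Minimal Kubeflow Pipelines sketch; image path and argument names are placeholders
import kfp
from kfp import dsl


@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(learning_rate: float = 0.001, epochs: int = 5):
    # Each step is just a container created from an image we built and pushed ourselves
    dsl.ContainerOp(
        name="train",
        image="my-registry.example.com/aip-trainer:v1",
        arguments=["--learning-rate", learning_rate, "--epochs", epochs],
    )


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")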

These are just two examples of how an ML engineer can benefit from learning to use Docker. I hope that you found this post inspiring and that it helps you supercharge your career in Machine Learning.