
Key Insights From MLOps In Production


MLOps has become an increasingly popular buzzword in recent years, promising to address the hidden challenges involved with running Machine Learning (ML) systems in production. As Google pointed out in their 2015 paper “Hidden Technical Debt in Machine Learning Systems”, and as practitioners would agree, ML code plays a relatively small part in successfully running ML systems in production.

Figure from Sculley et al. 2015, Hidden Technical Debt in Machine Learning Systems.

MLOps serves as an adaptation of DevOps for ML, helping us manage the complexity of developing and operating these systems. Attempting to cover all of MLOps in one article (or in one tool) is a grandiose task that we will leave to others. Instead, we will focus on some more tangible takeaways that might be immediately applicable for (aspiring) practitioners.

At LanguageWire, we have been running ML models in production for a few years, with a focus on Machine Translation, translating hundreds of millions of words in 2021. Along the way we have faced a plethora of challenges, from which we offer a few concrete and digestible insights. Importantly, we have found that you need not invest in one of the huge “MLOps-as-a-Service”-inspired tools out there to start taking advantage of these practices.

1. Treat your ML code as you would all your other code

ML code is notorious for being written by individuals in isolation and, as a result, for being difficult to read, understand, and maintain. Naturally, data scientists should be free to conduct experiments locally with messy code in Jupyter Notebooks. However, before this code is used in production, it should follow the same processes a mature organization uses for all its other projects: it should be reviewed and approved by peers and conform to whatever code standards are set for the team, as validated by Continuous Integration (CI). Additionally, it should be deployed through Continuous Delivery (CD) to infrastructure provisioned with Infrastructure as Code (IaC). This enables short release cycles and testing in production-like environments as early as possible, without the “environment drift” that can lead to nasty surprises when moving from staging to production.

2. Continuous Training: Keeping models up-to-date, automatically

Beyond CI/CD, in MLOps we also have the concept of Continuous Training (CT). As time goes by, ML models may experience “data drift”, where the data on which they base predictions in production no longer aligns with the data on which they were trained. Because of this, it is important to retrain your models, and the appropriate frequency varies widely from system to system. Given a high enough frequency and/or enough models, this retraining can eventually become a full-time occupation for one or more engineers. CT addresses this by automatically retraining models at certain intervals or in response to certain events.

While there are a vast number of ways to tackle CT, we already had a lot of experience with Apache Airflow and thus reached for that. In our case, we wanted to automatically retrain, register, and deploy models whenever a certain amount of new data was available in the dataset used to train them, and we designed a workflow orchestrated by Airflow to do exactly that, as sketched below.
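
To make this concrete, the sketch below shows roughly what such a workflow could look like as an Airflow DAG. It is a minimal illustration, not our actual pipeline: the dag_id, the daily schedule, and the task callables are placeholders, and the check/train/deploy logic is left as stubs.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def enough_new_data() -> bool:
    # Hypothetical check: compare the amount of new data in the dataset against
    # a threshold; returning False short-circuits (skips) the downstream tasks.
    ...


def train_and_register() -> None:
    # Train the model, log metrics to MLFlow and register a new model version.
    ...


def deploy_latest() -> None:
    # Roll out the newly registered model version to the serving deployment.
    ...


with DAG(
    dag_id="continuous_training",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    gate = ShortCircuitOperator(task_id="enough_new_data", python_callable=enough_new_data)
    train = PythonOperator(task_id="train_and_register", python_callable=train_and_register)
    deploy = PythonOperator(task_id="deploy", python_callable=deploy_latest)

    gate >> train >> deploy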

In our case, the other central tool for this approach, besides Airflow, was MLFlow, one of the more ubiquitous tools within ML. We use it for all logging related to training and as a model registry, keeping track of the different models, their versions, and their validation/test set performance.
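
For illustration, a training run’s interaction with MLFlow might look roughly like the sketch below; the parameter, the metric, and the “translation-model” registry name are hypothetical, not our actual setup.

import mlflow

with mlflow.start_run() as run:
    mlflow.log_param("learning_rate", 3e-4)     # training configuration
    mlflow.log_metric("validation_bleu", 41.2)  # validation/test performance
    # The trained model itself would be logged here with a framework-specific
    # flavor, e.g. mlflow.pytorch.log_model(model, artifact_path="model").

# Register the logged model as a new version in the model registry.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "translation-model")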

A crucial factor to consider in the context of CT is how to retain comparability between model versions, given a growing dataset. It is important that the test set grows along with the training data, so that test scores reflect changes in the underlying data distribution. At the same time, test data must stay disjoint from training data over time. To achieve this, we use a hash-based approach to data splitting, which ensures that the different splits both grow and stay disjoint over time.
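
A minimal sketch of what such hash-based splitting can look like, assuming every example has a stable, unique key (e.g. a segment ID); because the assignment depends only on the key, an example never moves between splits as the dataset grows:

import hashlib


def assign_split(key: str, test_pct: int = 1, valid_pct: int = 1) -> str:
    """Deterministically map an example's key to a split bucket."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "validation"
    return "train"


# The split for a given example never changes as new data arrives.
assert assign_split("segment-42") == assign_split("segment-42")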

The approach to CT described above has allowed us to keep all our ML models up to date with minimal manual work, with a workflow that we can easily change and adapt to new projects.

Key advantages of Continuous Training:
– Minimizes the impact of data drift
– Avoids repetitive engineering tasks, e.g., manually having to retrain/evaluate/deploy
– Requires, and thus enforces, good practices surrounding automation and validation

3. Serving ML models with Kubernetes

While many cloud services exist solely for the purpose of serving ML models, we found that most were impractical to use for text-to-text models that rely on tree- or graph-based search algorithms at inference time, like beam search. Additionally, lots of organizations already know how to deploy their applications in pure Kubernetes, so why introduce yet another tool or wrapper (which is what several of the other offerings are)?

If your models perform adequately on CPU, you will typically have few issues following one of the countless tutorials on model serving with your ML framework in combination with the latest Python web framework. Pair this with your organization’s (hopefully) existing experience deploying to Kubernetes, and you will have little trouble running your models in production.
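
As an illustration of that kind of setup, the sketch below serves a model over HTTP with FastAPI; my_ml_package, load_model() and its translate() method are hypothetical stand-ins for whatever your ML framework provides.

from fastapi import FastAPI
from pydantic import BaseModel

from my_ml_package import load_model  # hypothetical: your framework's loading code

app = FastAPI()
model = load_model("models/translation")  # load the model once, at startup


class TranslationRequest(BaseModel):
    text: str
    target_language: str


@app.post("/translate")
def translate(request: TranslationRequest) -> dict:
    # Run inference (e.g. beam search) and return the translation.
    return {"translation": model.translate(request.text, request.target_language)}

Containerized, a service like this deploys to Kubernetes like any other stateless web application.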

Should your models, however, require accelerators like GPUs, there are a few extra considerations. First off, beware of both cost and availability with your public cloud provider. Secondly, you will want to add a node pool of VMs with the appropriate drivers and accelerators to your cluster, as well as the corresponding device plugin. This can require some effort, but it is a well-documented endeavor.

Additionally, if you then want to only schedule GPU-accelerated workloads in this new node pool, you can configure it with appropriate taints. The taints, in combination with appropriate tolerations and node affinities for your deployment, can ensure both that your node pool is used only for your ML workload, and that your ML workload only gets scheduled in said node pool.

Assuming a node pool tainted with “kubectl taint nodes node1 purpose=ml:NoSchedule” (or, even better, the equivalent in your preferred IaC tool), you would modify the deployment associated with your ML workload with the toleration and node affinity shown below:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      # Tolerate the taint so the pods may be scheduled in the ML node pool
      tolerations:
        - key: purpose
          operator: Equal
          value: ml
          effect: NoSchedule
      # ...and require that they are only scheduled onto nodes in that pool
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: purpose
                    operator: In
                    values:
                      - ml
...

With the node pool and deployments configured, we can deploy all our accelerated ML workloads in our cluster the same way we would any other application, including zero-downtime rolling updates (provided liveness/readiness/startup probes are appropriately configured).

4. Horizontal pod autoscaling in the presence of batch jobs

A substantial part of the predictions we serve at LanguageWire come from batch jobs exposed through webhooks. These batch jobs typically rely on a message broker like RabbitMQ to keep track of all the enqueued jobs.

For scalability in Kubernetes, we often use Horizontal Pod Autoscaling (HPA). The most common target metrics for HPA, and the ones supported out of the box in Kubernetes, are CPU and memory utilization. While these can work in the context of a queue-based batch job architecture, we find that they are not ideal indicators for triggering horizontal scaling.

What we instead wanted to base the scaling on was the length of the job queue associated with a particular deployment. Luckily, Kubernetes supports custom metrics for HPAs, and there are great tools such as Kubernetes Event-driven Autoscaling (KEDA) that can be installed in your cluster to easily integrate custom metrics like RabbitMQ queue length.
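
As a rough illustration, a KEDA ScaledObject targeting a deployment of ML workers could look like the snippet below; the names, queue, thresholds, and the environment variable holding the RabbitMQ connection string are all placeholders, not our actual configuration.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-worker-scaler
spec:
  scaleTargetRef:
    name: ml-worker            # the deployment consuming jobs from the queue
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: ml-jobs
        mode: QueueLength      # scale on the number of messages waiting
        value: "20"            # target number of messages per replica
        hostFromEnv: RABBITMQ_CONNECTION_STRING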

Queue-based horizontal scaling, paired with automatic node provisioning (cluster autoscaler), which is readily available in the fully managed Kubernetes services offered by the larger public clouds, can result in a cost-efficient, yet highly scalable deployment of your inference runtime.

KISS

In summary, we have found that a large share of MLOps practices can be tackled without introducing huge “platform-like” services. Such services often add an immense amount of complexity to your stack, risk vendor lock-in, and may require an approach to CI/CD that is completely disjoint from how you deploy the rest of your applications. Certainly, there are situations where the pros of adopting them outweigh the cons. Nonetheless, we have found that the simplicity with which we have approached MLOps has allowed incremental adoption, reuse of existing CI/CD, and the flexibility to easily reshape our approach to the rapidly evolving domain of Machine Learning.


Author: Emil Lynegaard, Machine Learning Team Lead, LanguageWire
Bio: Emil leads the machine learning efforts at LanguageWire and has specialized in natural language processing. He enjoys applying best practices from software engineering to machine learning projects, and is very fond of Python and Elixir.