Reliably releasing new versions of software to production is a central concern for DevOps and SRE teams. When a new version becomes available, the release process needs to tackle the following basic questions.
1) Does the new version satisfy service-level objectives (SLOs)?
2) Will it maximize business value?
3) If the new version meets the release criteria, how can it be safely rolled out to end users? If not, how can it be safely rolled back?
Mature (commercial) solutions exist for release automation in the context of web and mobile apps. However, in the cloud-native context, the goal of releasing reliably and consistently is made harder due to several challenges, including:
- a) heterogeneity in application frameworks, metrics/observability backends, and ingress/service-mesh technologies for traffic management,
- b) the need to incorporate both application and business metrics while evaluating new versions in a principled manner, and
- c) the need to comply with organization-specific mandates and practices, including CI/CD/GitOps practices during releases.
In this article, we describe Iter8, an open-source AI-driven release engineering platform for Kubernetes-based applications. The key innovation in Iter8 is the notion of an experiment, a Kubernetes custom resource that is declaratively specified and used for orchestrating the testing and release of a new app version. Iter8 experiments break up a release task into cleanly decoupled subproblems, namely, evaluating app versions using well-defined metrics-based criteria, determining the best version (winner) using statistically rigorous algorithms, progressively rolling out the winner to end users, and promoting the winner at the end of the experiment as the current stable version.
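To make the experiment concept concrete, here is a sketch of what a declaratively specified Iter8 experiment resource can look like. This is an illustrative fragment, not an authoritative schema: the metric and experiment names are hypothetical, and the exact fields depend on the Iter8 API version you install, so consult the Iter8 documentation for the current format.

```yaml
apiVersion: iter8.tools/v2alpha2
kind: Experiment
metadata:
  name: checkout-canary   # hypothetical experiment name
spec:
  # the app under experimentation (namespace/name)
  target: default/checkout
  strategy:
    # compare a candidate against the stable version, and
    # progressively shift traffic toward the winner
    testingPattern: Canary
    deploymentPattern: Progressive
  criteria:
    # SLOs: metrics along with acceptable limits on them
    objectives:
    - metric: iter8-system/mean-latency
      upperLimit: 50
    - metric: iter8-system/error-rate
      upperLimit: "0.01"
  duration:
    intervalSeconds: 10
    iterationsPerLoop: 10
```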
Iter8 experiments offer unparalleled flexibility in terms of how these subproblems are addressed, and how their solutions are mixed and matched. This flexibility in turn enables SREs to define Iter8-embedded CI/CD/GitOps pipelines for release automation that best fit their organizational needs with minimal disruption. In the rest of this article, we examine the core challenges associated with the release process, and how Iter8 approaches them.
Core Challenges & Iter8 Approach
Diverse Kubernetes app frameworks: Kubernetes is remarkable for its extensibility. Developers have the freedom to package and deploy apps in Kubernetes using a wide variety of mechanisms, including its core resources such as Deployments and StatefulSets, or serverless frameworks such as Knative, Kubeless, and Fission, or ML model-serving frameworks such as KFServing and Seldon, or as a complex distributed application in the form of a Helm release consisting of multiple templates and sub-charts. A release engineering tool that is tightly coupled to a single mechanism, e.g., Deployments, will not work across different organizations or even within a single organization that needs to deploy using a mix of different frameworks.
In Iter8, an app is defined very broadly: an app is any entity that can be deployed on Kubernetes, can be versioned, and for which version-specific metrics can be collected. Enabling Iter8 to work with a new app framework is as simple as adding a few RBAC rules, and does not require changes to Iter8’s core code. This loose coupling enables Iter8 to easily integrate and experiment with any app framework including Kubernetes services and deployments, Knative serverless applications, and KFServing ML model deployments.
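For example, enabling Iter8 to experiment with Knative services might amount to an RBAC rule along the following lines. This is a sketch under assumptions: the ClusterRole name is hypothetical, and the exact API groups, resources, and verbs needed depend on the app framework and the Iter8 release.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: iter8-knative   # hypothetical name
rules:
# allow Iter8 to observe and update Knative services during experiments
- apiGroups: ["serving.knative.dev"]
  resources: ["services"]
  verbs: ["get", "list", "watch", "patch", "update"]
```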
Version evaluation: The best version of your app is the version that should be considered stable and promoted to serve end-user requests. But how do you define the best version (winner) of your app?
In Iter8 experiments, SLOs are specified in the form of metrics along with acceptable limits on them. In a pure SLO validation experiment, if the latest version satisfies all the SLOs, it is deemed the winner. Iter8 also enables A/B testing experiments with a reward metric. In these experiments, the version that maximizes the reward (typically, a business metric such as user engagement, conversion rate, or revenue) is deemed the winner. Further, in Iter8’s A/B/n experiments, it is possible to compare three or more versions and pick a winner. Iter8 also enables hybrid testing where both reward and SLOs can be specified. In hybrid testing, among the versions that satisfy objectives, the version that maximizes the reward is declared as the winner.
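The hybrid winner-selection logic described above can be sketched in a few lines of Python. This is a simplified illustration, not Iter8's actual implementation: Iter8 uses statistically rigorous assessments over observed metric distributions, whereas this sketch compares point estimates.

```python
def pick_winner(versions, slos, reward_metric):
    """Hybrid testing sketch: among versions satisfying all SLOs,
    pick the one that maximizes the reward metric.

    versions: list of dicts, each with a 'metrics' dict for that version
    slos: dict mapping metric name -> acceptable upper limit
    Returns the winning version dict, or None if no version satisfies the SLOs.
    """
    def satisfies_slos(v):
        # an SLO is satisfied when the observed metric is within its limit
        return all(v["metrics"][m] <= limit for m, limit in slos.items())

    valid = [v for v in versions if satisfies_slos(v)]
    if not valid:
        return None  # no version meets the release criteria
    # among SLO-satisfying versions, maximize the (business) reward metric
    return max(valid, key=lambda v: v["metrics"][reward_metric])
```

With three versions where the highest-reward version violates a latency SLO, the sketch picks the best version among those that satisfy all SLOs, mirroring A/B/n plus hybrid testing.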
Metrics collection: Service meshes like Istio and Linkerd, and app frameworks like Knative, KFServing, and Seldon can often be configured with a Prometheus add-on, which enables the collection of system metrics like request counts, error counts, and latency. Despite the ubiquity of Prometheus, metrics collection during experiments can be challenging due to the following reasons:
1) dev, test, and staging clusters may not run a Prometheus instance
2) the app under question may not receive enough end-user traffic, causing metrics to be unavailable or sparse (i.e., statistically insignificant)
3) business metrics are generally not collected using time-series DBs like Prometheus; organizations use diverse backends such as Elastic, New Relic, Sysdig, Hive, or Google Analytics (for full-stack apps) to collect them.
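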
Iter8 offers a built-in metrics-collection mechanism for dealing with the first two challenges. This mechanism sends (synthetic) requests to the app and collects version-specific performance metrics that can be used as part of Iter8’s experiments. Thus, there is no need to configure an external DB such as Prometheus to use Iter8. Iter8’s custom metrics framework solves the third challenge. It enables users to integrate any RESTful metrics backend with Iter8. This is a powerful feature that works across query languages or storage formats, and can be used to connect Iter8 with a variety of RESTful metric backends like Prometheus, Elastic, New Relic, Sysdig, Hive, MongoDB, and Google Analytics. For learning and testing purposes, Iter8 also provides the ability to mock metrics.
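As a flavor of what integrating a RESTful metrics backend involves, the sketch below queries the standard Prometheus HTTP API and extracts a scalar metric value from its JSON response. This is an illustrative stand-alone helper, not Iter8's custom metrics framework; the base URL and query are assumptions, and other backends would differ only in how the request is built and the response is parsed.

```python
import json
import urllib.parse
import urllib.request

def query_metric(base_url, promql):
    """Query a Prometheus-compatible HTTP API and return one sample value."""
    url = base_url.rstrip("/") + "/api/v1/query?" + \
        urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return extract_value(payload)

def extract_value(payload):
    """Pull a scalar metric value out of a Prometheus query response."""
    if payload.get("status") != "success":
        raise ValueError("metrics query failed")
    result = payload["data"]["result"]
    if not result:
        return None  # sparse/unavailable metrics: no samples yet
    # each instant-vector sample is [timestamp, "value-as-string"]
    return float(result[0]["value"][1])
```

Separating the HTTP call from the response parsing keeps the backend-specific part small, which is the essence of supporting many RESTful backends behind one interface.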
Traffic engineering: Service meshes like Istio and Linkerd, and ingress controllers like NGINX or Traefik enable a number of features for routing application traffic. The release engineering tool should be loosely coupled with the networking infrastructure, so that the former can leverage any traffic engineering features made available by the latter during the release process.
In Iter8, Kubernetes resource specs such as a horizontal pod autoscaler (HPA), an Istio virtual service, or a Linkerd traffic split co-exist alongside the Iter8 experiment spec. This enables Iter8 to take full advantage of the underlying auto-scaling, networking, ingress, and service-mesh infrastructure. For example, Iter8 experiments can involve mirrored/shadowed traffic sent to a dark-launched version of an app, or a canary release with traffic segmented across stable and canary versions using request attributes, cookies, or a fixed-percentage traffic split.
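As an illustration, a fixed-percentage canary split can be expressed as an Istio virtual service that co-exists with the experiment spec and whose weights are adjusted during the release. The host and subset names here are hypothetical.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout   # hypothetical app name
spec:
  hosts:
  - checkout.example.com
  http:
  - route:
    # stable version receives most of the traffic
    - destination:
        host: checkout
        subset: stable
      weight: 90
    # canary version receives the remainder
    - destination:
        host: checkout
        subset: canary
      weight: 10
```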
Progressive traffic shifting: A popular strategy for progressive rollouts is to incrementally shift traffic towards the winning version during the course of the release experiment. Iter8 enables this strategy through a statistically rigorous AI-driven approach called multi-armed bandit. This approach optimally trades off the need to explore each version in order to collect metrics, and the need to exploit the winning version by aggressively shifting traffic towards it. As part of the experiment, the user can further control this traffic shifting behavior by limiting the maximum amount of traffic a candidate version is allowed to receive at any point during the experiment, and limiting the maximum possible increment in traffic for the candidate version within a given time interval.
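The two user-controlled constraints described above amount to a clamping rule on each iteration's traffic split. The following is a simplified sketch of that rule only; Iter8's actual algorithm derives the recommended split from a multi-armed bandit model, which is not reproduced here.

```python
def next_candidate_weight(recommended, current, max_weight, max_increment):
    """Clamp the recommended candidate traffic percentage so that it
    never exceeds the configured cap (max_weight), and never grows by
    more than max_increment per iteration. The stable version receives
    the remaining traffic."""
    w = min(recommended, max_weight, current + max_increment)
    return max(w, 0)

# candidate capped at 50%, growing at most 20 points per iteration:
# even if the bandit recommends 80%, the candidate moves 10% -> 30%
next_candidate_weight(recommended=80, current=10, max_weight=50, max_increment=20)
```

The cap bounds the blast radius of a bad candidate, while the per-iteration increment keeps traffic shifts gradual even when the bandit is confident.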
CI/CD/GitOps: What is the best way to promote the winning version at the end of an experiment? Should Iter8 directly manipulate Kubernetes resources in the cluster? Should there be a slack notification so that version promotion can be managed manually? Should there be a GitHub pull request, so that GitOps operators like ArgoCD or FluxCD, or GitHub Action workflows, or other Git-triggered workflows like Tekton pipelines can take control, after a human operator approves and merges the pull request?
There is no single best answer. Different teams may prefer different approaches for version promotion based on organizational requirements and the mix of CI/CD/GitOps tooling that they are familiar with. Iter8 accommodates all these approaches through its powerful task runner framework, which enables running a variety of notification, metrics collection, readiness checking, and scripting tasks during various stages of the experiment.
DevOps and SRE teams that wish to take their Kubernetes apps, CI/CD/GitOps, and Day 2 operations to the next level are faced with a pressing need for principled release engineering tools that
- i) evaluate app versions, identify the winning version, and progressively shift traffic towards the winning version using statistically rigorous algorithms
- ii) incorporate well-defined SLOs and business reward criteria as part of their evaluation
- iii) safely and reliably promote the winner to production
- iv) integrate with diverse app frameworks, metric/observability backends, and traffic engineering (ingress/service-mesh) technologies
- v) play well with the constantly evolving landscape of CI/CD/GitOps tooling and organization-specific requirements.
Iter8 addresses this need.
Try your first Iter8 experiment using this 5-minute quick start tutorial.
Alan Cha, Software Engineer, IBM Research
Hai Huang, Research Scientist, IBM Research
Michael Kalantar, Senior Software Engineer, IBM Research
Fabio Oliveira, Research Scientist, IBM Research
Srinivasan Parthasarathy, Research Scientist, IBM Research
Sushma Ravichandran, Software Engineer, IBM Research
Alan is a Staff Software Engineer at the IBM T. J. Watson Research Center, New York, USA. He is a core contributor to the Iter8 open-source project and is passionate about API management.
Hai is a Research Scientist at IBM T. J. Watson Research Center, New York, USA. His research interests include operating systems, energy and power management, large-scale systems management, software testing and anomaly detection. He is also interested in new systems challenges in the field of Cloud Computing.
Michael is a Senior Software Engineer at the IBM T. J. Watson Research Center, New York, USA. He has been a contributor to many cloud computing initiatives and is interested in DevOps and cloud-native computing.
Fabio is a Research Scientist and Manager at the IBM T. J. Watson Research Center, New York, USA, where he leads the Cloud-native Computing and Analytics team in the Hybrid Cloud Platform Research Department. Fabio co-founded the Iter8 open-source project and has worked on several projects related to cloud computing and microservices, both internally at IBM and externally. Fabio has presented his work at open-source conferences such as KubeCon and OSCON, at community meetups, and at academic conferences.
Sri is a Research Scientist at IBM T. J. Watson Research Center, New York, USA. He is passionate about topics at the intersection of Cloud-Native Computing and AI. A co-founder of Iter8, Sri has presented multiple talks & demos at prestigious venues including KubeCon, open-source community meetups, and top peer-reviewed academic conferences.
Sushma is an Advisory Software Engineer at the IBM T. J. Watson Research Center, New York, USA. She is a core contributor to the Iter8 open-source project and is passionate about cloud-native computing.