At CircleCI, our approach to Kubernetes (k8s) deployments may seem familiar. We initiate the workflow, build the image, construct the Helm chart, and deliver the chart to k8s. From there, k8s takes over with its rolling update. Although this process works just fine, we wanted to introduce progressive deployments internally. The challenge was how to do that with minimal disruption to our teams.
Upon reviewing several options, we decided to move forward with Argo Rollouts. In this post, I’ll discuss what we considered when making our decision and why Rollouts stood out to us.
1. Canary Strategy
Argo Rollouts provides a basic traffic routing solution without requiring us to add a service mesh to our cluster. Without a mesh, traffic percentages are approximated by scaling the canary and stable pod counts, so the weights are coarse, but that wasn't a deal-breaker. We could always introduce a service mesh later for finer-grained control of traffic weights.
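As a minimal sketch of what this looks like, here is a Rollout with a canary strategy (the service name and image are hypothetical, not our actual workloads):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service            # hypothetical service name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:latest  # hypothetical image
  strategy:
    canary:
      # Without a service mesh, setWeight is approximated by scaling
      # the canary vs. stable ReplicaSets, hence the coarse percentages.
      steps:
        - setWeight: 20
        - pause: {}           # wait indefinitely for manual promotion
```

With 10 replicas, `setWeight: 20` means roughly 2 canary pods behind the same Service, which is exactly the kind of coarse-but-workable routing described above.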
2. Flexible & Powerful Analysis Runs
Our goal is to eliminate toil around releasing and validating changes. Analysis Runs, a feature that automates metric assessment during a rollout, help us achieve this. They let us make the release process more efficient and less error-prone, bolstering our deployment practices.
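Analysis Runs are driven by an AnalysisTemplate. A hedged sketch, assuming a Prometheus backend (the template name, address, and metric names are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check      # hypothetical template name
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 1m            # re-evaluate every minute
      failureLimit: 3         # abort after 3 failed measurements
      # Pass while fewer than 5% of requests return 5xx
      successCondition: result[0] <= 0.05
      provider:
        prometheus:
          address: http://prometheus.example.local:9090   # hypothetical address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

Referencing a template like this from a Rollout turns validation into an automated gate rather than an engineer watching dashboards.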
3. Experimentation Without Live Traffic Exposure
This capability was very appealing from a risk-management standpoint. Argo Rollouts Experiments let us run a candidate version alongside the stable one without exposing it to live production traffic. That safety net mitigates fallout from a bad change and leads to better, more reliable deployments.
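A rough sketch of an Experiment resource that runs a candidate for a fixed window without routing production traffic to it (names, labels, and the referenced analysis template are all hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: candidate-burn-in     # hypothetical experiment name
spec:
  duration: 30m               # tear everything down after 30 minutes
  templates:
    - name: candidate
      replicas: 1
      selector:
        matchLabels:
          app: my-service-experiment
      template:
        metadata:
          labels:
            app: my-service-experiment   # distinct labels keep it out of
        spec:                            # the production Service selector
          containers:
            - name: my-service
              image: registry.example.com/my-service:candidate  # hypothetical
  analyses:
    - name: restart-check
      templateName: container-restart-check   # hypothetical AnalysisTemplate
```

Because the pods carry their own labels, the stable Service never selects them: the candidate runs, gets measured, and is cleaned up without ever serving a user request.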
4. Open-source Contribution
Argo Rollouts being open-source is a significant factor. We appreciate the opportunity to contribute and improve the tool for the community at large. Prioritizing open-source solutions allows us to share our expertise with the community and facilitates easier adoption of industry best practices.
5. Releasing Changes in Steps
Argo Rollouts lets us introduce changes incrementally. For instance, we can route 10% of traffic to a new version, let it soak for 15 minutes while we assess, increase to 50%, wait another 20 minutes, and finally shift 100% of traffic. Concurrently, background analysis runs check for container restarts or error logs. If anything crosses our redline, the release is aborted, traffic reverts to our stable pods, and a notification is dispatched via Slack. This systematic approach has bolstered our confidence in deployments, reducing the strain on engineers and enhancing our operational stability.
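The sequence above maps almost one-to-one onto a canary `steps` block; a sketch, with the background analysis attached to the whole rollout (the AnalysisTemplate name is hypothetical):

```yaml
strategy:
  canary:
    # Background analysis runs for the entire duration of the update;
    # a failure here aborts the rollout and reverts to the stable pods.
    analysis:
      templates:
        - templateName: restart-and-error-check   # hypothetical template
    steps:
      - setWeight: 10
      - pause: { duration: 15m }   # soak and assess
      - setWeight: 50
      - pause: { duration: 20m }
      - setWeight: 100             # full cutover
```

Keeping the analysis at the strategy level rather than inside an individual step is what makes it a continuous safety check instead of a one-off gate.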