Context
The next revolution in application deployment after virtualization was containerization, and Kubernetes was instrumental in orchestrating containers and making them accessible to the everyday application developer. To make life easier for service owners, applications needed common functionality in line with the Twelve-Factor App principles. The Sidecar pattern became the de facto way to provide common functions such as secrets management and logging.
Dynamic injection of sidecars is a common pattern in Kubernetes, but there's no established method to continuously monitor the injection process. We discovered this gap when critical workloads in one of our clusters suffered significant downtime due to a bug in a mutating webhook. We couldn't find an available open source testing solution to handle this scenario, so we created a generic synthetic mutation testing framework to verify the continuous availability of webhooks. In this blog, we discuss the implementation and our journey to open-sourcing this framework.
Problem
Sidecars and init containers are injected by webhooks at pod creation time, so a failing webhook often goes unnoticed until the next pod is created. Such failures surface only when a service owner deploys their services to a different environment, or during events that trigger pod creation, such as patching or eviction. If the webhook is down during these events, service availability can be severely reduced.
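For context, a mutating webhook performs this injection by returning a JSON patch in its admission response. Below is a minimal sketch of such a handler; the sidecar name and image are hypothetical, and this is not the webhook from our incident:

package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
)

// handleMutate receives an AdmissionReview for a pod being created and
// responds with a JSON patch that appends a sidecar container.
func handleMutate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Append a (hypothetical) logging sidecar to the pod's container list.
	patch, _ := json.Marshal([]map[string]interface{}{{
		"op":    "add",
		"path":  "/spec/containers/-",
		"value": map[string]string{"name": "log-shipper", "image": "example/log-shipper:v1"},
	}})

	patchType := admissionv1.PatchTypeJSONPatch
	review.Response = &admissionv1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patch,
		PatchType: &patchType,
	}
	json.NewEncoder(w).Encode(review)
}

If the webhook process is down or buggy at any of the pod-creation events above, this patch never gets applied, and the pod comes up without its sidecar.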
Solution
The Kube Synthetic Scaler scales deployments that have opted in down to 0 and back up to the original number of replicas at a regular cadence, and checks whether the deployment is healthy after scaling up. This post-scale-up health check ensures that the injected init/sidecar containers aren't breaking anything and that the mutation succeeded.
Usage: We recommend that webhook owners create a test deployment that uses the webhook, add the opt-in annotation, and deploy it alongside the actual webhook in every environment. This way, you can catch issues in each environment without scaling critical workloads.
Components:
- Controller – The Synthetic Scaler, built using the kubebuilder framework, is a control loop that watches deployments carrying the opt-in annotation. It regularly scales these deployments (usually test deployments) down to 0 and back up to the original number of replicas at a specified interval.
- The controller calls the Kubernetes API and scales the deployment down by patching the replica count to zero (code snippet link)
- err := r.scaleDeployment(ctx, log, deployment, replicaZero, &replicaCount)
- The controller then calls the API and performs a health check by looking at the pod ready status in the deployment object (code snippet link); a condensed sketch of both operations follows this list
- // check if deployment is available after scale up within maxScaleUpTime
- availability, err := r.getDeploymentStatusWithDeadline(ctx, log, req.NamespacedName.Namespace, req.NamespacedName.Name)
- Metrics Provider – The scaler emits Prometheus-style metrics on scaling operations and on the health of deployments after each scaling operation.
- Alert Manager – We rely on Prometheus Alertmanager to configure alerts on deployment failures after the scale-up operation.
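Putting the two calls together, here's a condensed sketch of the scale and health check steps, assuming a kubebuilder-scaffolded reconciler with a controller-runtime client (the helper names mirror the snippets above, but the bodies are illustrative rather than the exact repo code):

package main

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DeploymentReconciler embeds a controller-runtime client, as scaffolded by kubebuilder.
type DeploymentReconciler struct {
	client.Client
}

// scaleDeployment patches the deployment's replica count (illustrative
// version of the helper referenced above).
func (r *DeploymentReconciler) scaleDeployment(ctx context.Context, d *appsv1.Deployment, replicas int32) error {
	patch := client.MergeFrom(d.DeepCopy())
	d.Spec.Replicas = &replicas
	return r.Patch(ctx, d, patch)
}

// getDeploymentStatusWithDeadline polls the deployment's status until all
// replicas are available or maxScaleUpTime elapses.
func (r *DeploymentReconciler) getDeploymentStatusWithDeadline(ctx context.Context, key client.ObjectKey, maxScaleUpTime time.Duration) (bool, error) {
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, maxScaleUpTime, true,
		func(ctx context.Context) (bool, error) {
			var d appsv1.Deployment
			if err := r.Get(ctx, key, &d); err != nil {
				return false, err
			}
			return d.Spec.Replicas != nil && d.Status.AvailableReplicas == *d.Spec.Replicas, nil
		})
	return err == nil, err
}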
How to use the framework
- Deploy the Synthetic Scaler in your cluster. A public Docker image and a Helm 3 chart are available from the code repo.
- Run the command below to install the deployment and related helper RBAC in the namespace you select.
- helm upgrade --install kube-synthetic-scaler helm/kube-synthetic-scaler --namespace <namespace>
- Set up your target test deployments and insert these annotations (an illustrative snippet follows the list).
- sfdc.salesforce.com/enable: Signals to the scaler that this deployment is to be scaled
- sfdc.salesforce.com/duration: Tells the scaler how often to scale up and down
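For illustration, the opt-in annotations could be attached to the test deployment like this; a minimal Go sketch, where the annotation values are assumptions on our part (check the repo for the exact expected formats):

package main

import (
	appsv1 "k8s.io/api/apps/v1"
)

// annotateForScaler opts a test deployment in to synthetic scaling.
// The annotation values below are illustrative, not authoritative.
func annotateForScaler(d *appsv1.Deployment) {
	if d.Annotations == nil {
		d.Annotations = map[string]string{}
	}
	d.Annotations["sfdc.salesforce.com/enable"] = "true"
	d.Annotations["sfdc.salesforce.com/duration"] = "10m"
}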
- Track your deployment availability using metrics.
DeploymentAvailability = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "deployment_availability",
	Help: "Specifies whether a namespaced deployment is available or not after scale up",
}, []string{"ns", "deployment"})
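After each scale-up check, the controller can record the outcome on this gauge. Here's a minimal sketch using the Prometheus Go client (the helper name is ours, for illustration):

// recordAvailability publishes the post-scale-up health check result:
// 1 means the deployment came back healthy, 0 means it did not.
func recordAvailability(ns, deployment string, available bool) {
	value := 0.0
	if available {
		value = 1.0
	}
	DeploymentAvailability.WithLabelValues(ns, deployment).Set(value)
}

An alert rule on an expression such as deployment_availability == 0 then surfaces failed mutations through the Alert Manager component described above.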
See it in action
GIF: https://drive.google.com/file/d/1bJdxf2VsHYNfLQO1-Yx6aSqjMB85899w/view?usp=sharing
How the synthetic scaler and testing framework helped
The Synthetic Scaler and our generic synthetic testing framework for continuous monitoring have helped us detect and triage issues immediately in both development and production environments. For example, we detected an issue with a webhook service that's used for validating the integrity of Docker images: a bad code change caused the mutation to fail, which blocked new nodes from coming up and triggered autoscaler alerts.
Extending the framework
The current framework ensures that the deployment is available after webhook mutation, but it doesn't check whether the webhook is functionally working as expected. The framework can be extended by creating a parallel deployment that performs a functional check of the mutated pod. For example, if the mutation injected a cert, the helper pod can run a curl command to check the validity of that cert.
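A helper pod could also implement that check natively. Here's a minimal Go sketch that verifies the certificate served by the mutated pod's TLS endpoint; the address and the expiry threshold are assumptions for illustration:

package main

import (
	"crypto/tls"
	"fmt"
	"time"
)

// checkInjectedCert dials the mutated pod's TLS endpoint and fails if the
// served certificate is missing or close to expiry.
func checkInjectedCert(addr string) error {
	// Skip chain verification; we only want to inspect the served cert.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return fmt.Errorf("TLS handshake failed: %w", err)
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return fmt.Errorf("no certificate served by %s", addr)
	}
	if time.Until(certs[0].NotAfter) < 24*time.Hour {
		return fmt.Errorf("certificate expires soon: %s", certs[0].NotAfter)
	}
	return nil
}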
Conclusion
With Kubernetes established as a mainstream orchestration platform, using mutating webhooks to provide out-of-the-box functionality has become common practice. This framework lets you test any webhook's availability in a generic way, and cluster administrators can use it to increase trust in their webhook infrastructure and improve service availability.
We encourage you to go to the GitHub repo, try out our tool, add feature suggestions, and give us your feedback.
Prabh Simran Singh, Director of Software Engineering, Salesforce
Sanya Nijhawan, Software Engineer, Salesforce
Andy Chen, Software Engineer, Salesforce