Argo CD now manages 60% of all Kubernetes deployments globally, but as enterprises scale to thousands of clusters and millions of deployments, a brutal operational reality emerges: the bottleneck isn’t deployment—it’s day-two operations, environment promotion, and managing infrastructure chaos at scale. AI-generated code has made engineers 3x more productive, but that productivity translates into 3x more releases hitting production, and platform teams are drowning in alerts.
The Guest: Hong Wang, Co-Founder and CEO at Akuity
Key Takeaways
- Argo CD captured 60% adoption across Kubernetes clusters; Kargo now handles multi-environment promotion at scale
- AI infrastructure workloads drove 10x deployment frequency growth in 2024—platform teams need new automation strategies
- Akuity hit 43 million deployments across 100+ customers by embedding AI-powered SRE capabilities into GitOps workflows
- Edge computing at scale: Kubernetes clusters now run in paint stores, baseball parks, and commercial airplanes
- AI makes engineers 3x more productive, but human-defined runbooks ensure deterministic SRE automation
***
In a recent TFiR interview, Swapnil Bhartiya spoke with Hong Wang, Co-Founder and CEO at Akuity, about the evolution of GitOps adoption, the operational challenges of managing AI infrastructure workloads at scale, and how AI-powered SRE capabilities are shifting platform engineering workflows.
The Origin Story: From Intuit to 60% Kubernetes Adoption
Wang co-created the Argo CD open source project ten years ago to solve a practical problem at Intuit: deploying applications to Kubernetes clusters at scale. The project introduced GitOps—treating configuration as source code stored in Git—and became the de facto standard for Kubernetes deployment automation.
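The GitOps model Wang describes reduces to a reconcile loop: the desired state lives in Git, an agent continuously compares it against the live cluster, and any drift is converged back toward Git. A minimal sketch of that loop (illustrative only—the dictionaries and function below are hypothetical, not Argo CD’s actual implementation):

```python
# Minimal sketch of a GitOps reconcile loop: desired state comes from Git,
# live state from the cluster, and the controller converges live -> desired.
# All names here are illustrative, not Argo CD's real API.

def reconcile(desired: dict, live: dict) -> dict:
    """Return the actions needed to make `live` match `desired`."""
    actions = {"create": [], "update": [], "delete": []}
    for name, manifest in desired.items():
        if name not in live:
            actions["create"].append(name)   # in Git, not in the cluster
        elif live[name] != manifest:
            actions["update"].append(name)   # drifted from what Git declares
    for name in live:
        if name not in desired:
            actions["delete"].append(name)   # in the cluster, not in Git (pruning)
    return actions

desired = {"checkout": {"image": "shop/checkout:v2"}, "ads": {"image": "shop/ads:v1"}}
live    = {"checkout": {"image": "shop/checkout:v1"}, "legacy": {"image": "shop/old:v9"}}

print(reconcile(desired, live))
# {'create': ['ads'], 'update': ['checkout'], 'delete': ['legacy']}
```

The key property is that the loop is idempotent: running it again after the actions are applied produces an empty diff, which is what makes Git the single source of truth.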
Q: What led you to create Akuity?
Hong Wang: “We created the open source project to solve a very practical problem as a GitOps solution to deploy to Kubernetes clusters at Intuit. As the community grew larger, we had more enterprise customers coming to us, saying, ‘We want you guys to help us because we want to have exactly the Intuit experience at a large scale to manage many clusters.’ That’s why we decided there was a business opportunity. We could help others. Let’s create this company called Akuity to be an enterprise software delivery platform for everyone.”
Akuity reached five years in operation with over 100 customers, 43 million deployments, and AI-powered SRE capabilities built directly into the platform. The company’s growth reflects the broader Kubernetes adoption curve: according to CNCF surveys, Argo CD now powers 60% of all Kubernetes deployments globally.
GitOps Won—Now What? The Multi-Environment Promotion Challenge
Argo CD solved single-environment deployment, but enterprises managing hundreds of microservices across dev, staging, and production environments faced a new bottleneck: orchestrating promotions across those environments. Akuity introduced Kargo, an open source promotion engine, to address this gap.
Q: What’s the difference between Argo CD and Kargo?
Hong Wang: “Argo CD is focused on single-environment deployment. It doesn’t care about multiple environments—how they’re promoted or orchestrated together. Kargo is focused on orchestration. If you deploy something to dev, how do you move the same thing to staging and then into production? Kargo is always doing that. And now we’re also supporting not just Kubernetes—we’re supporting Terraform, VMs, and serverless environments. We’re providing a more generic promotion engine for people to manage their infrastructure.”
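The promotion pattern Wang describes can be sketched as an ordered pipeline where the same artifact advances only after the previous environment verifies. This is a toy model of the concept, not Kargo’s actual API—stage names and the verification gate are illustrative:

```python
# Illustrative sketch of multi-environment promotion: the same artifact
# moves dev -> staging -> production, and each stage is reached only after
# its predecessor verifies. Not Kargo's actual implementation or API.

STAGES = ["dev", "staging", "production"]

def promote(artifact: str, verified: dict) -> list:
    """Return the stages `artifact` reaches, stopping at the first stage
    whose predecessor has not verified successfully."""
    reached = []
    for i, stage in enumerate(STAGES):
        if i > 0 and not verified.get(STAGES[i - 1], False):
            break  # gate: the previous environment must be healthy first
        reached.append(stage)
    return reached

# dev has verified, staging has not -> the artifact stops before production
print(promote("app:v2", {"dev": True, "staging": False}))
# ['dev', 'staging']
```

The point of a promotion engine is exactly this gating: production never receives an artifact that hasn’t already proven itself one environment upstream.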
Wang noted that Kargo adoption has accelerated rapidly, with attendees at KubeCon Europe 2026 consistently mentioning they’re actively using the platform for multi-environment workflows. The focus on governance and controlled rollouts resonates with enterprises scaling cloud-native infrastructure.
AI Infrastructure Workloads: The 10x Deployment Frequency Surge
Wang identified AI infrastructure as a primary growth driver for Akuity’s deployment volume. CoreWeave, one of the largest AI hyperscalers, uses Akuity to manage customer clusters, and Akuity saw 10x growth in deployment frequency in 2024 alone.
Q: What’s driving the 43 million deployment milestone?
Hong Wang: “Kubernetes is getting everywhere. You can hear a lot of stories about OpenAI—what is used to manage the infrastructure is actually Kubernetes, managing thousands of nodes. Our biggest customer is CoreWeave, an AI hyperscaler, and they’re using us to deploy to their customer clusters. As they’re scaling up, and as AI has more and more applications to be deployed, we saw a 10x growth last year in deployment frequency. There are so many clusters we’re managing, so many apps we’re managing, and every app is being deployed more frequently.”
He contrasted traditional enterprise workloads with AI inferencing loads, which require GPU-attached clusters and introduce unique deployment patterns. Platform teams are adjusting their delivery pipelines to accommodate both traditional microservices and AI model deployment workflows.
Edge Computing at Scale: Kubernetes in Paint Stores, Airplanes, and Baseball Parks
Wang shared several edge computing use cases that illustrate how far Kubernetes adoption has spread beyond traditional cloud environments.
Q: Can you share some unexpected use cases?
Hong Wang: “We have a customer that’s a paint store with franchises—1,000 stores across the United States. They run a small Kubernetes cluster in each store to power the checkout system, advertisement system, ordering system. They want to centrally manage those applications across the continent. So if today is Black Friday and I want to give a discount to all the US West states, rather than calling every store, they push a configuration to all the clusters in the West region. The clerk or staff doesn’t even do anything—automatically, they just scan. Oh, it’s 50% off today. Great.”
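The fleet-wide rollout Wang describes—one change targeting every cluster in a region—can be sketched as label-based selection from a central control plane. The cluster inventory and field names below are hypothetical; in a real GitOps setup the “push” would be a Git commit that each store’s in-cluster agent pulls:

```python
# Sketch of the paint-store scenario: one config change, targeted at every
# cluster matching a region label. Inventory and field names are hypothetical.

clusters = [
    {"name": "store-0001", "region": "us-west"},
    {"name": "store-0002", "region": "us-east"},
    {"name": "store-0003", "region": "us-west"},
]

def push_config(clusters: list, region: str, config: dict) -> list:
    """Select every cluster in `region` and return the targeted names.
    In GitOps terms, 'pushing' means committing `config` to Git so that
    each targeted cluster's agent reconciles it on its next sync."""
    return [c["name"] for c in clusters if c["region"] == region]

# Black Friday: a 50%-off banner for all US West stores in one operation
print(push_config(clusters, "us-west", {"discount": "50%"}))
# ['store-0001', 'store-0003']
```

The operational win is that the number of stores is irrelevant: targeting 2 clusters or 1,000 is the same single change.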
Wang also highlighted Major League Baseball as a customer, running Kubernetes in every ballpark, and commercial airlines running clusters on newer airplanes for entertainment systems, ordering systems, and non-mission-critical applications.
Q: Kubernetes in commercial airplanes?
Hong Wang: “In commercial airplanes, newer ones are now running Kubernetes clusters. We have customers using our software to manage applications inside commercial airplanes. It’s the same story—it’s a flying edge location. On the airplane, you have entertainment systems and ordering systems—non-mission-critical systems. Those systems need to be maintained, upgraded, and monitored. They’re running Kubernetes; they’re using our software to manage the fleet.”
AI Makes Engineers 3x More Productive—But Who’s Managing the Infrastructure?
Wang observed that AI-powered coding tools are fundamentally reshaping platform engineering workflows. Engineers are now 3x more productive, which means 3x more releases hitting production—and 3x more operational burden on SRE and platform teams.
Q: How are platform teams adapting to AI-generated code?
Hong Wang: “AI is making every engineer more efficient. Originally, if you had five engineers, they could fix five issues a week or deliver 20 features a week. Right now, AI is making every engineer three times more efficient. So we see more applications being built, more changes being released to the cluster, and to the runtime. We see a lot of pressure shifting—there is more demand for automation, more demand for deployment, and teams are deploying things to production more frequently. That’s why we see the growth.”
He emphasized that AI is creating both challenges and opportunities: while it increases operational burden, it also enables new automation capabilities that can help SRE teams manage the increased workload.
AI-Powered SRE: Agentic Automation with Human-Defined Runbooks
Akuity embedded AI capabilities directly into Argo CD and Kargo to help platform teams manage the increased operational complexity. The system uses human-defined runbooks to ensure deterministic results rather than allowing AI to improvise solutions.
Q: How does AI-powered SRE work in practice?
Hong Wang: “We built an agentic experience to help you with troubleshooting, remediation, and verification. It’s a closed-loop process. We know when something has happened, we know what to look at to get to the bottom of it, we know what the right solution is to fix something, we can take action automatically, and we can verify the result.”
Wang shared a real-world example: during an AWS DynamoDB/DNS incident, Akuity’s infrastructure was affected. An engineer woke up at 1 AM to fix an image pull backoff error by switching Docker registries. Rather than staying up all night to fix subsequent incidents, the engineer wrote a two-line human-readable runbook: “Symptom: image pull backoff error from Docker registry A. Solution: override registry A to registry B.” The AI system autonomously resolved 25 additional incidents overnight using that runbook—no further human intervention required.
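The pattern in that story—match a known symptom against a human-written runbook and execute only the prescribed fix—can be sketched in a few lines. This is a toy illustration of the deterministic-runbook idea, not Akuity’s implementation; the runbook entries and function names are hypothetical:

```python
# Sketch of the deterministic runbook pattern Wang describes: the system
# matches an incident's symptom against human-authored runbooks and executes
# only the prescribed remediation, never an improvised one.

RUNBOOKS = [
    {
        "symptom": "image pull backoff from registry-a",
        "remediation": "override registry-a with registry-b",
    },
]

def remediate(incident: str):
    """Return the human-authored remediation for a known symptom,
    or None (escalate to a human) when no runbook matches."""
    for rb in RUNBOOKS:
        if rb["symptom"] in incident.lower():
            return rb["remediation"]
    return None  # novel problem: keep the human in the loop

print(remediate("Pod failed: Image pull backoff from registry-a"))
# override registry-a with registry-b
print(remediate("Unseen error: disk pressure on node-7"))
# None
```

Note the deliberate asymmetry: a matched symptom is resolved autonomously (the 25 overnight incidents), while anything unmatched falls through to a human rather than letting the AI invent an action.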
Q: Why use human-defined runbooks instead of letting AI improvise?
Hong Wang: “We’re not seeing that AI will take over all the work. The action to take is not AI coming up with some random idea; it’s coming from the runbook, which is written by humans or certified by them. We want AI to give us more deterministic results. That’s why, when we design our AI SRE capability, we want that runbook. We want humans to get involved—to provide their input and preferences, which are then passed as additional context to the AI to make things happen.”
The Future: AI as Autopilot for Well-Understood Problems
Wang distinguished between AI handling well-documented, repetitive issues versus novel troubleshooting scenarios. The philosophy at Akuity is to use AI for autopilot on known problems while keeping humans in the loop for complex or unfamiliar situations.
Q: Where do you see AI fitting into SRE workflows long-term?
Hong Wang: “In this particular case I shared, it’s a well-known, well-documented issue. You’re not learning anything new from resolving it. If you already understand the problem well, why not allow AI to operate on autopilot and fix it autonomously? On the other hand, we believe the human provides substantial value here—by determining the action to take. We want more deterministic results from AI, not random ideas.”
He also noted that his own engineering team is using AI daily, with engineers reporting that they feel more empowered and less blocked: “Every day they talk about AI. They’re so impressed. It makes their lives much better and easier. They feel they are no longer blocked by needing to code something themselves. It’s easier to run through all ideas, get a prototype, build a POC, and eventually get to the finish line. It’s getting much smoother now.”





