Author: Richard Hartmann, Office of the CTO, Grafana Labs
Bio: Richard “RichiH” Hartmann is a member of the Office of the CTO at Grafana Labs, a Prometheus maintainer, OpenMetrics founder, OpenTelemetry member, and member of the CNCF governing board and various committees.
When it comes to historical events, the tech industry tends to have a short memory.
But certainly much will be made of the 10-year anniversary of Kubernetes, whether it is recognized in June 2024 to honor the date it was first released, or perhaps sooner, at this year’s KubeCon Chicago, to honor its genesis at Google.
Let’s take a closer look at the arrival of Kubernetes, how it challenged prior mental models for human reasoning with distributed systems, and how new requirements are likely to keep pushing the boundaries of observability as a discipline.
Right Time, Right Place for a New Distributed Operating Model
Cloud computing was in a much different place in 2013 when Brendan Burns, Craig McLuckie and Joe Beda dreamed up Kubernetes. As Burns wrote: “The notion of orchestration, and certainly container orchestration existed in a few internet scale companies, but not in cloud and certainly not in the enterprise.”
Kubernetes wasn’t just created in a Petri dish at Google. Some really interesting things were already happening in the industry at the time, and they were effectively the agar that made K8s such a right-time, right-place phenomenon.
Enterprises were really getting comfortable with virtual machine deployment models, but the overhead of running a bunch of VMs was becoming more and more of a pain. There was intense interest in the potential of containers, but teams were struggling to capture their full value at scale. Apache Mesos had created a basic familiarity with the concept of container orchestration, but was too complex for the average sysadmin or developer. And the one feature every Linux user envied FreeBSD for, FreeBSD jails, was starting to be replicable through kernel namespaces and cgroups.
Kubernetes built on top of these trends, which were all rowing in a similar direction: effectively a new mental model for how distributed teams can work off a shared infrastructure through APIs, with far less need for team coordination. Horizontal scalability was often the shiny object that led enterprises towards Kubernetes (the same basic technology that, in the form of Borg, runs the world’s most popular services like Google Search, YouTube, and Gmail at Google). But I think the largest benefit of cloud native is really how it frees up individual teams to implement against each other’s microservice APIs and thus talk less to each other, with less synchronization required.
The New Model Which Unhid Complexity (by Design)
For distributed teams, Kubernetes was a way to evolve out of the anti-patterns that bogged down their pace of software delivery.
Kubernetes really changed how people thought about how to design and deploy services. It provided the operating model that made it possible to coordinate all of these smaller functions that could run independently on their own, with standard interfaces that distributed teams could share to access common services.
But this benefit didn’t come for free.
When you had to run full VMs, the smallest viable size of a service was far larger than it is today. But those VMs, and by extension the ability to run classic Unix workloads and even monoliths, also abstracted away a lot of complexity. Every server had its own name, every firewall had its own name, every web server had its own name. HTTP servers were designed to have someone sitting in front of them to change the configurations on a specific machine. With cloud native and Kubernetes, the idea is to automate as much as possible, and deploy to thousands of nodes with no human supervision.
Before Kubernetes, nuanced troubleshooting domains like networking and databases had people dedicated to them. They were the subject matter experts at understanding the gory details of failure domains like ingress configurations for networking, or sharding and tuning for databases.
If there’s one criticism of Kubernetes, it’s that the abstraction layer is a bit too low. All of a sudden, software engineers were expected to know how to deploy and operate workloads, and operators needed more software engineering knowledge than before.
A New Mental Model
The previous generation of monitoring tooling relied on a model of hosts, physical or virtual. The time from needing a new machine to having a new server deployed was usually in the range of a few days to a week. Someone needed to approve it, order it, mount it into a rack, cable it, create the VLAN and routing, and install an operating system, and only then did the real work start. Many of the interfaces were physical and most of the config was bespoke, and thus functionally impossible to fully automate. Put differently, there was less pressure to automate even the reasonably automatable steps, as there could not be a holistic solution anyway.
With Kubernetes, if you need to (re)start 1,000 pods, you can do so within minutes. Outages, redeployments, etc. are dealt with automagically as long as the cluster has enough resources. But it also means more of the inner workings of scheduling are exposed, and you can’t deal with them manually anymore. No matter if you’re used to a GUI or CLI, you won’t change the configuration of a thousand machines by hand.
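To make the "automagic" part a little more concrete, here is a toy sketch of the declare-and-reconcile idea: you state how many pods you want, and a loop keeps nudging the system towards that number for as long as capacity allows. Everything in it (`ToyCluster`, `reconcile`, the pod names) is hypothetical and invented for illustration. It is not Kubernetes code and does not use any Kubernetes API.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for a cluster (not a Kubernetes API)."""
    capacity: int                                   # max pods the cluster can hold
    running: set = field(default_factory=set)       # names of running pods
    _ids: count = field(default_factory=count)      # source of unique pod numbers

    def start_pod(self) -> bool:
        """Start one pod if resources allow; report whether it worked."""
        if len(self.running) >= self.capacity:
            return False
        self.running.add(f"pod-{next(self._ids)}")
        return True

    def stop_pod(self) -> None:
        """Stop an arbitrary running pod (also used to simulate an outage)."""
        self.running.pop()


def reconcile(cluster: ToyCluster, desired: int) -> int:
    """Drive the observed pod count towards the desired count.

    Returns the number of pods running afterwards, which is lower than
    `desired` when the cluster runs out of capacity.
    """
    while len(cluster.running) < desired:
        if not cluster.start_pod():
            break  # out of resources; nothing more the loop can do
    while len(cluster.running) > desired:
        cluster.stop_pod()
    return len(cluster.running)


if __name__ == "__main__":
    cluster = ToyCluster(capacity=1500)
    print("after first reconcile:", reconcile(cluster, desired=1000))
    for _ in range(200):          # simulate an outage that takes out 200 pods
        cluster.stop_pod()
    print("after outage:", len(cluster.running))
    print("after second reconcile:", reconcile(cluster, desired=1000))
```

The point of the sketch is the shape of the interaction: nobody touches an individual machine, and the only manual input is the desired number.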
The modern observability movement is focused on the ability to ask questions on the fly, to acquire more knowledge about the services, to understand what’s happening in the system deeply enough to diagnose what’s wrong quickly.
In the prior monolithic and Unix/Windows world, much of the system complexity was hidden behind existing interfaces. Cloud native deliberately broke up those old service boundaries to enable new operating models, but also exposed this previously well-contained complexity.
Prometheus is the default open source monitoring and alerting project in CNCF and the wider cloud native ecosystem. It was built independently of Kubernetes, but the two projects had each other in mind in their fundamental design. Prometheus was born out of the same Borg and Borgmon lineage as Kubernetes itself, and has had a fantastic 10-year run enabling this new mental model. For the first time, it made it feasible to emit and parse telemetry data across cloud native Kubernetes environments outside of hyperscaler in-house projects. Arguably, it also made it easy to work with observability data at scale for the first time.
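As a rough illustration of what "emitting telemetry" looks like in practice, the sketch below renders a counter in the Prometheus text exposition format using only the Python standard library. The metric name, labels, and helper functions are made up for this example, and the help text is assumed to contain no backslashes or newlines. Real services would normally rely on an official Prometheus client library rather than hand-rolling this.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_counter(name: str, help_text: str, samples) -> str:
    """Render one counter metric as exposition-format text.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict
    of label names to string values. Labels are sorted by name so that the
    output is deterministic.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical samples: one time series per pod and status code.
    print(
        render_counter(
            "http_requests_total",
            "Total HTTP requests.",
            [
                ({"pod": "web-1", "code": "200"}, 1027),
                ({"pod": "web-2", "code": "500"}, 3),
            ],
        ),
        end="",
    )
```

The labels are what matter for the new mental model: instead of one check per named host, every pod reports the same metric name, and questions are asked by filtering and aggregating over labels.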
On top of all of this sits Grafana. When you have everyone using the same tools and dashboards, every investment in this shared tooling is an investment in your whole organization’s ability to understand and reason about what makes and costs money. And because these two bedrock observability tools are both open source, the institutional and individual knowledge that developers gain from working with Prometheus and Grafana is highly transferable to most other companies. Simply look at recent job postings by pretty much everyone except competing vendors.
What’s Next in the Evolution of Observability?
It will be interesting to see how today’s mandates for cost and efficiency will translate into the future of observability. The days of functionally zero interest rates and free money are over, and so too are the days of grabbing market share at any cost.
Companies care a lot more about both CapEx and OpEx, and are taking a much closer look at their costs, cloud and otherwise, and opportunities for optimizations. Observability has long been focused on operational data, but there is a rich surface area for new, evolved insights into systems, and an opportunity to answer not only the question of how but also of why in new ways at all levels of any given company.