Thinking about my lessons learned over a career working with infrastructure and the twists and turns the technology has taken, I’ve come to a stark realization: operations will never escape its current limitations without a seismic reimagining. The current Ops model that hinges on the hard work and expertise of cluster or infrastructure admins cannot scale. Therefore, Ops will always be a hindrance to innovation, unless we meaningfully rethink it and adopt a new approach.
The “Ops Lock-in” Challenge
It’s a fact that developers’ demands will always surpass what Ops is able to provide. This leaves organizations with two bad options: invest as much money as possible in recruiting and retaining a sprawling Ops headcount to try to keep up with developers’ needs, or accept severe limits on the pace of innovation and solution delivery.
Ops teams may hold a lofty conceptual position as the masters of production, but in practice most Ops teams are overwhelmed and playing from behind, putting out fires as issues arise and just trying to keep up. Also, because Ops teams are responsible for everything, they often lack the specific insights needed to fix issues on their own. This forces Ops to function as a human incident router, working to determine which person or team is equipped to resolve a particular issue. This “Ops lock-in”, where the Ops team is simply unable to match the pace required for innovation and therefore acts as a drag on development, is the key reason why Ops demands a reimagining.
Ops and Kubernetes complexity
Ask developers how Kubernetes ultimately impacts their ability to deliver software and innovate, and you’ll hear two different opinions. Some developers will champion Kubernetes as a generic and reliable platform for deploying software. However, another group of developers – often those who have faced production problems, and especially those whose issues were of their own making – strongly believes that simplicity is essential to workability. Complexity is absolutely a major risk factor for development processes, and for all its many virtues, Kubernetes is complex. The nuances of Kubernetes require major resources to navigate, and success requires a dedicated platform team if value is to be derived before stakeholders’ goodwill is exhausted.
Ask Ops, and you’ll get another, more grounded perspective. Ops team members are the ones who face the fallout when issues are deferred by architecture and delivery teams: they’re where the buck stops. It’s the Ops team that works in the middle of the night when those issues come to a breaking point, and there isn’t much of a feedback loop that incentivizes other teams to build for workability and make life easier for the Ops team.
Ops personnel with this experience tend to enjoy the fact that Kubernetes divides workloads from infrastructure, more clearly separating out the root causes of issues and enabling Ops to push back on the responsible parties. The standardization around how workloads in Kubernetes are packaged, run, and monitored also reduces key pain points. That said, Kubernetes complexity has downsides for Ops as well: the environment is challenging to maintain, and opaque in a way that can both obscure major security risks and make issues maddening to troubleshoot. The strain of managing Kubernetes is exacerbated by the weight of Ops’ full duties, as Ops is currently structured.
Distributed Ops: placing Ops duties within development teams
I envision a future in which Ops realizes the full potential of a combined engineering approach. Following the model used for QA, Ops capabilities ought to be embedded within development teams. Today’s software engineers certainly possess Ops skillsets to some degree, as those capabilities are now all but required to be effective. Organizations should equip developers with a continuous operations platform that offers self-service options, so that they can deploy and operate their services effectively with as little Ops team intervention as possible.
This fundamental shift would reinvent the Ops team to serve not as a roadblock to innovation defined by its personnel and resource limitations, but as a force multiplier that enhances efforts wherever it can. Ops ceases to be the master of production, spread thin reacting to issues and trying to live up to its responsibility for everything. Instead, development teams take on the burden of responsibility for the services they develop. In this structure, development teams are equipped with the tools required to fully own and command their systems in an end-to-end manner, from the code level up to operating solutions in production.
With developers empowered by self-service Ops capabilities, Ops itself is positioned to serve in the same capacity as a product team. Ops plays the role of delivering products that enable development teams to fully own their services. This means a focus on directly providing infrastructure automation, deployment automation, configuration management, logging, monitoring, and production tools. In this proposed structure, development teams are now the acknowledged experts in their services, and the responsible parties when issues arise. Meanwhile, Ops makes sure that those teams have all the tools they need to diagnose and resolve those issues successfully.
In this way, both application services and infrastructural platforms like Kubernetes are put into the hands of the experts best equipped to deal with issues and most responsible for outcomes in their areas of knowledge. The result is a more effective model that greatly reduces friction between Ops and development teams, while accelerating development and innovation.
To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon NA Virtual, November 17-20.