What is Data Orchestration and Why is it Important?


This article defines the relatively new concept of data orchestration and explains why it matters to Kubernetes. We will outline four concepts that are foundational to data orchestration. But first, let us take a step back and look at why data orchestration is meaningful to Kubernetes.

Albert Einstein once said that there is nothing difficult about the General Theory of Relativity in itself. What is difficult is that it requires an entirely new way of thinking. With those words, he captured the essence of a paradigm shift, one that changed physics forever. Kubernetes follows the same course, in the sense that it requires an entirely new way of thinking: a new way of looking at how applications are developed and delivered. Kubernetes has untethered applications from the limitations of their underlying infrastructure, allowing them to become portable, scale automatically, and release valuable system resources once they have finished their tasks. We no longer have to worry about runtime library conflicts between development, staging, and production environments.

Businesses have lost millions in revenue, and some have gone out of business, because statically provisioned applications could not scale fast enough to handle incoming requests. Thanks to the paradigm shift represented by Kubernetes, applications can be delivered wherever they are needed and can easily scale to handle almost any onslaught of traffic. Kubernetes has had a profound impact on the entire information technology chain, from development to deployment. Old and slow waterfall development has yielded to the speed and flexibility of agile development. Slow and reactive application delivery has given way to orchestration that allows instant delivery of applications at planet scale. Kubernetes allows us to develop and deliver applications faster, with greater savings and quality. Kubernetes has profoundly changed how we conduct business.

The Missing Piece in Orchestration: Data

Kubernetes delivers applications at planet scale that can easily be moved, can run anywhere, and live only as long as they are producing value. This ephemeral quality of applications is a key concept that has made Kubernetes so successful. Yet as it has matured, Kubernetes has become the most popular container orchestration platform for both ephemeral and stateful applications, which consume and generate data that must persist.

Applications cannot perform their mission-critical tasks without data. Whether the applications contained within Kubernetes Pods are ephemeral or stateful does not change this equation. To state the obvious, data cannot be ephemeral if it is to be valuable. If it is valuable, then it must persist and be protected, performant, secured, and available to others.

As is now clear, provisioning data is a distinctly different discipline from provisioning compute for applications. Kubernetes adopted the Container Storage Interface (CSI), a standard interface for storage plug-ins, to make it simple and straightforward for applications to store and retrieve data. This has made it possible for third parties, storage vendors as well as open-source solutions, to provision storage to Kubernetes applications. Consistent with its vision, the Kubernetes project also developed the notion of Persistent Volumes (PV) and Persistent Volume Claims (PVC) to simplify the provisioning of storage. This created mechanisms that decouple the what from the how or, in other words, the consumption of data storage from the underlying infrastructure. It is a highly useful disaggregation that turns infrastructure into reusable resource pools for applications, and it has opened the door for storage solutions to join the paradigm shift. How have storage vendors and open-source solutions responded to this challenge? For the most part, the response has been to bolt on existing legacy storage approaches. Why would anyone want to introduce to Kubernetes the unsolved ills that have plagued legacy storage approaches for years?
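To make the decoupling of the what from the how concrete, here is a minimal sketch of a Persistent Volume Claim built as a Python dictionary and printed as JSON, which Kubernetes accepts alongside YAML. The claim name `orders-db-data` and the storage class `fast-ssd` are hypothetical and chosen only for illustration; the helper `make_pvc` is likewise our own and not part of any Kubernetes library.

```python
import json


def make_pvc(name: str, size: str, storage_class: str) -> dict:
    """Build a PersistentVolumeClaim manifest.

    The claim states *what* the application needs (capacity, access mode,
    a class of storage) and says nothing about *how* it is provided.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names, used only for illustration.
pvc = make_pvc("orders-db-data", "10Gi", "fast-ssd")
print(json.dumps(pvc, indent=2))
```

Nothing in the claim names a vendor, an array, or a disk. Which CSI driver ultimately satisfies the request is decided by the storage class, so the same claim could be reused unchanged on different infrastructure that offers a class of the same name.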

Four unsolved data management challenges remain:

  • Decoupling data infrastructure
  • Reducing data gravity
  • Declarative objectives
  • Planet-scale portability

The Kubernetes paradigm shift deserves a new approach to managing data, one that goes beyond traditional storage, orchestrating data by decoupling it from the limitations of the underlying infrastructure. This approach would reduce data gravity by assimilating its smallest constituent parts (metadata) and providing planet-scale portability. This approach to delivering persistent data to applications is called data orchestration. Let us take a closer look at how data orchestration would solve these four challenges.


The first step in the data orchestration journey is the decoupling of data from the limitations of the underlying infrastructure. Decoupling infrastructure removes legacy storage ills such as silos, out-of-control copies creating data sprawl, the lack of business-level controls, and downtime due to forklift and software upgrades. To orchestrate data, we must first liberate data from its infrastructure limitations.


Data gravity is another major obstacle in data orchestration. The topic often comes up and is generally assumed to be an intractable problem constrained by the laws of physics, but what if there were a more elegant solution? A data orchestration solution should be able to assimilate the metadata of unstructured storage silos, bypassing data gravity. This would effectively tear down the iron-clad walls between disparate storage solutions. By first decoupling the silos and then assimilating their metadata, the walls between them would come down and the silos could be treated as a single pool of resources.
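The idea of metadata assimilation can be sketched in a few lines. The example below is an illustration under our own assumptions, not a description of any particular product: the silo names, paths, and the `assimilate` function are all hypothetical. It merges per-silo file listings into one catalog while leaving the data itself where it is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    silo: str        # storage system that holds the bytes
    path: str        # path inside that silo
    size_bytes: int


def assimilate(silos: dict[str, list[tuple[str, int]]]) -> dict[str, FileMeta]:
    """Merge per-silo listings into one catalog keyed by a global path.

    Only metadata is read; no file contents are moved or copied.
    """
    catalog: dict[str, FileMeta] = {}
    for silo_name, listing in silos.items():
        for path, size in listing:
            global_path = f"/{silo_name}{path}"
            catalog[global_path] = FileMeta(silo_name, path, size)
    return catalog


# Hypothetical silos and listings, used only for illustration.
silos = {
    "nas-east": [("/projects/a.csv", 1200), ("/projects/b.csv", 800)],
    "cloud-bucket": [("/archive/2019.tar", 50_000)],
}
catalog = assimilate(silos)
total = sum(meta.size_bytes for meta in catalog.values())
print(len(catalog), "files,", total, "bytes across", len(silos), "silos")
```

Because the catalog holds only small metadata records, it can be built and queried without moving the bulk data, which is the sense in which assimilation sidesteps data gravity.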


Epistemology, the theory of knowledge, distinguishes two kinds of knowing: the what and the how, that is, knowing what something is versus knowing how to accomplish it. This important distinction is not confined to a classroom discussion of symbolic logic. It pertains directly to how we leverage technology to accomplish business objectives. Objectives are declarative statements that define the desired end-state of data through its metadata, without requiring infrastructure changes.

Declaring the intent of data (i.e. its desired end-state) is vastly simpler than having to define every single step to be taken (imperative policies) to accomplish the desired end-state. By removing silo boundaries and data gravity in previous steps, it would be possible to leverage the smallest constituent parts of data, namely metadata, to institute declarative policies that replace cumbersome and error-prone imperative policies. The result is greater agility, control, and efficiency.
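The contrast between declarative and imperative policies can be illustrated with a small sketch. Everything here is assumed for the sake of the example: the tags, the tier names, and the `plan` function are hypothetical. The objective is stated once as a mapping from metadata tag to desired placement, and the steps needed to reach that end-state are derived rather than scripted by hand.

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    name: str
    tags: set[str]
    tier: str  # where the data currently lives


# Declarative objectives: "data tagged X should live on tier Y".
# Tag and tier names are hypothetical.
OBJECTIVES = {"hot": "nvme", "archive": "object-store"}


def plan(items: list[DataItem], objectives: dict[str, str]) -> list[tuple[str, str, str]]:
    """Compare the current state with the desired end-state and return
    the moves (name, from_tier, to_tier) needed to reconcile them."""
    moves = []
    for item in items:
        # The first objective whose tag matches decides the desired tier.
        desired = next(
            (tier for tag, tier in objectives.items() if tag in item.tags),
            None,
        )
        if desired is not None and item.tier != desired:
            moves.append((item.name, item.tier, desired))
    return moves


items = [
    DataItem("q3-report.parquet", {"hot"}, "object-store"),
    DataItem("2018-logs.tar", {"archive"}, "object-store"),
    DataItem("scratch.tmp", set(), "nvme"),
]
moves = plan(items, OBJECTIVES)
print(moves)
```

Changing the business intent means editing one entry in `OBJECTIVES`; the list of moves is recomputed from metadata, and no per-file procedure has to be rewritten.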


Planet-scale portability is the crowning jewel of data orchestration. The preceding steps of removing silo boundaries, overcoming data gravity, and declaring desired business outcomes pave the way for the final challenge in data orchestration: how to make data portable. A true global file system allows the underlying infrastructure to be abstracted and consumed as a universal resource pool for block, file, and shared storage at planet scale.


Data orchestration relies on four successive steps: decoupling data from infrastructure, assimilating metadata, using declarative statements to accelerate business outcomes, and mobilizing data so that it is available to all applications and workloads. Decoupling data from infrastructure frees data from legacy storage limitations and unites disparate storage silos into cohesive resource pools. Metadata assimilation reduces data gravity, which ultimately makes data easier to manage and distribute according to business intent. Objectives, in turn, allow you to declare the desired end-state without having to work out every single step to be taken.

A complete data orchestration solution should have rich custom metadata options, such as tagging, descriptors, and classification. We should expect a data orchestration solution to deliver non-disruptive data mobility at a fine-grained level from one vendor to another, from one data center to another, from data center to public cloud, or between clouds. Kubernetes deserves a data orchestration solution that mirrors its vision: one that frees data from infrastructure silos, reduces data gravity, and accelerates business outcomes.

To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon NA Virtual, November 17-20.