
Why GitOps at Scale Needs Hydration

Author: Artem Lajko, Head of Platform Engineering at iits-consulting
Bio: Artem Lajko, certified CNCF Kubestronaut and Head of Platform Engineering, specializes in Kubernetes scalability and GitOps-driven workflows. He is the author of Implementing GitOps with Kubernetes and an IT freelancer writing for various publishers. As a Platform Engineering Ambassador, he supports companies and the community in adopting Platform Engineering, Internal Developer Platforms, and related technologies. Passionate about Open Source, he helps organizations choose the right tools, driving tech adoption and innovation.

When we talk about GitOps at scale, we are not referring to managing a handful of clusters. Instead, we are talking about managing 100, 1,000, or even more clusters.

GitOps makes it possible to manage an entire fleet of data plane or workload clusters in a consistent way. New applications can be deployed, kept up to date, and operated over time, including day-2 operations. Because most things are defined declaratively, many teams aim for “everything as code,” including infrastructure provisioning and cluster creation. This also enables self-service models for internal Kubernetes-based platforms.

The obvious question is how this level of scale can be achieved with relatively small teams, something that was hard or even impossible five to eight years ago despite heavy automation. The answer has two parts: tooling has improved significantly, and GitOps, since around 2017, has introduced operational patterns that reduce manual effort.

Typical GitOps-at-Scale Setups

In practice, many platform teams operate a central configuration hub, typically a Git repository. This hub contains definitions for third-party tools that are deployed to workload clusters and can be customized per cluster.

Technically, this is often implemented as a catalog built with Helm umbrella charts. These umbrella charts act as wrappers around third-party Helm charts, such as cert-manager, and provide default values and best-practice configuration.
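Such an umbrella chart is, at its core, a thin Helm chart that declares the upstream chart as a dependency and layers defaults on top. A minimal sketch of its Chart.yaml might look like this (the version and repository URL are illustrative):

```yaml
# Umbrella chart wrapping the upstream cert-manager chart.
# Best-practice defaults live in this chart's values.yaml.
apiVersion: v2
name: cert-manager
version: 0.1.0
dependencies:
  - name: cert-manager
    version: v1.16.1              # pinned upstream chart version (illustrative)
    repository: https://charts.jetstack.io
```

The umbrella chart's own values.yaml then overrides the upstream defaults under the `cert-manager:` key, and cluster-specific values files override both.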

This catalog is commonly connected to a GitOps engine such as Argo CD using ApplicationSets. Cluster generators and labels are used to drive deployments, a pattern often called “label-based deployments”.

When a new cluster is created, it is registered with a central Argo CD instance that acts as a hub or control plane. A Secret containing the cluster credentials is created. Labels are attached to this Secret. Based on these labels, Argo CD generates Applications from ApplicationSets and deploys the matching components — for example cert-manager — to all clusters with the corresponding labels.
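This wiring can be sketched as an ApplicationSet with a cluster generator that matches on labels attached to the cluster Secret; the names, label keys, and repository URL below are illustrative, not a specific production setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cert-manager
  namespace: argocd
spec:
  generators:
    # Matches every cluster Secret registered in Argo CD
    # that carries this label.
    - clusters:
        selector:
          matchLabels:
            addons.example.com/cert-manager: "enabled"
  template:
    metadata:
      name: 'cert-manager-{{name}}'   # {{name}} comes from the cluster Secret
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/managed-service-catalog.git
        targetRevision: main
        path: helm/cert-manager       # the umbrella chart in the catalog
      destination:
        server: '{{server}}'          # target cluster from the generator
        namespace: cert-manager
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding the label to a new cluster Secret is then enough for Argo CD to generate an Application and roll the add-on out to that cluster.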

Importantly, the deployed applications are not fully static. Values can be overridden per cluster and per application. This allows a high degree of flexibility while keeping a centralized control model.

Using this approach, 10, 100, or even 1,000 clusters can be managed by simply adding labels and providing cluster-specific values for every application or add-on. It supports both large-scale rollouts and day-2 operations, including adding new applications later. Logical labeling and sharding help keep the system manageable by mapping it to your organization’s domain architecture.

Where Complexity Creeps In

The same setup that enables scale also introduces complexity.

Umbrella Helm charts follow the “Don’t Repeat Yourself” (DRY) principle, but they do so by cascading configuration through multiple layers. In a simple case, configuration can be overridden in at least three places:

  • the upstream provider Helm chart (for example cert-manager)
  • default or best-practice values in the umbrella chart
  • customer- or cluster-specific values

On top of that, ApplicationSets can override values inline as well.
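The layering can be made concrete in the `helm` section of an ApplicationSet template. A sketch, with illustrative file names and keys, might combine a defaults file, a per-cluster file, and an inline override:

```yaml
# Fragment of an ApplicationSet template's source (illustrative):
source:
  repoURL: https://git.example.com/platform/managed-service-catalog.git
  targetRevision: main
  path: helm/cert-manager
  helm:
    valueFiles:
      - values.yaml               # umbrella defaults, on top of upstream chart defaults
      - values-{{name}}.yaml      # per-cluster overrides shipped alongside the chart
    valuesObject:                 # inline override defined in the ApplicationSet itself
      cert-manager:
        installCRDs: true
```

Each layer wins over the one before it, which is exactly why tracing a single effective value back to its origin requires checking all of them.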

This makes failures harder to reason about. During upgrades, problems may occur because only Helm dependency versions are compared, not the fully rendered manifests. Argo CD may then report an error that is unrelated or poorly surfaced, due to limited error handling or GUI constraints, which makes debugging difficult.

In such cases, teams often have to reproduce Argo CD’s behavior locally by manually running the same templating steps, for example:

helm template . \
--name-template cert-manager \
--namespace cert-manager \
--kube-version 1.34 \
--values <path>/managed-service-catalog/helm/cert-manager/values.yaml \
--values <path>/customer-service-catalog/helm/pe-org/cert-manager/values.yaml \
--include-crds

Although the error eventually shows up in the rendered manifests, finding where it originates requires understanding which layer introduced it.

In theory, one could fix the issue directly in the cluster using tools like kubectl edit. In a GitOps setup, however, such changes are short-lived and will be overwritten by the next reconciliation.

Why Hydration Became Necessary

These challenges led to the emergence of tools like Argo CD’s Source Hydrator and patterns such as source hydration. The idea is to fully render the manifests ahead of time and store the generated output, for example in a dedicated directory or branch. The GitOps engine then applies exactly what it sees, following a “What You See Is What You Get” (WYSIWYG) model.
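In Argo CD’s Source Hydrator (still an early-stage feature at the time of writing, so field names may evolve), an Application points at a “dry” source of templated manifests and a “hydrated” sync source that holds the rendered output. A sketch, with illustrative URLs and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: default
  sourceHydrator:
    drySource:                    # where the un-rendered sources live
      repoURL: https://git.example.com/platform/deployments.git
      targetRevision: main
      path: apps/cert-manager
    syncSource:                   # where the rendered manifests are written and synced from
      targetBranch: environments/prod
      path: apps/cert-manager
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
```

The controller renders the dry source, commits the result to the sync source, and then reconciles only the static output, so what is applied to the cluster is always reviewable as plain manifests.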

This approach improves debuggability because the effective manifests are explicit. It also helps with GitOps at scale. At the same time, hydration shifts some of the complexity from runtime to the deployment workflow. If one global change regenerates output across many clusters, pull requests can become very large and hard to review. Hydration can also increase repository size significantly, which may create Git performance issues in CI pipelines if the generated output is stored in the same repo without careful structure and tooling.

Git is a system of record for human intent, optimized for collaboration and versioning of source files over time. Its underlying model and protocols are file-based and history-oriented, which makes Git a poor fit for storing and distributing large volumes of generated, highly redundant artifacts. When hydration produces thousands of static manifests, it becomes necessary to reconsider where these outputs are stored and how they are delivered. In such cases, artifact-focused backends such as OCI registries can be a better fit, as they are designed for immutable, content-addressed distribution at scale rather than human-centric change tracking.
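As one concrete option, Flux’s CLI can package a directory of hydrated manifests as an OCI artifact and push it to a registry; the registry URL, tag, and paths below are illustrative:

```shell
# Sketch: publish hydrated manifests as an immutable OCI artifact
# instead of committing them back into Git.
flux push artifact oci://registry.example.com/manifests/cert-manager:v1.2.3 \
  --path=./hydrated/cert-manager \
  --source="$(git config --get remote.origin.url)" \
  --revision="$(git rev-parse HEAD)"
```

The Git repository then keeps only human-authored sources and a pointer to the artifact, while the registry handles content-addressed storage and distribution of the generated output.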

Without careful tuning, a central GitOps hub that manages many applications across many clusters — often in a high-availability setup — has natural performance limits. According to Codefresh, Argo CD in high-availability mode can comfortably support, without additional tuning, approximately:

  • 1,500 applications
  • 14,000 Kubernetes objects
  • 50 clusters
  • 200 developers

From customer projects, we have seen that simply switching from on-the-fly templating to pre-hydrated manifests can increase the number of manageable applications by an order of magnitude. Numbers around 15,000 applications are achievable when manifests are hydrated, because the repo-server no longer has to render complex templates for every reconciliation cycle. This is a key reason why hydration becomes an important tool for GitOps at scale, even though it comes with trade-offs that must be managed.

Lessons Learned

These insights come from real-world experience at iits while building and operating our platform, KumoOps. The goal is not only to manage numerous clusters, but to do so across multiple cloud providers, with many customizations and combinations of GitOps topologies.

At the same time, the platform must meet operational excellence requirements and comply with the regulatory constraints common in Europe. Scaling GitOps successfully in such environments requires more than just adding clusters — it requires reducing the amount of work the GitOps engine must do at reconciliation time, while also keeping repository growth, observability, and governance under control.
