Guest: Rob Hirschfeld
Company: RackN
Show Name: An Eye on AI
Topic: AI Infrastructure
Kubernetes has become the backbone of modern infrastructure, powering everything from microservices to edge applications. But when it comes to AI, the story changes dramatically. In this clip, Rob Hirschfeld, CEO and Co-Founder of RackN, unpacks how Kubernetes is being used for AI workloads — and why these deployments differ so profoundly from traditional cloud-native systems.
Many organizations assume Kubernetes offers universal flexibility: add nodes, scale up, tear down. But in AI training and inference, that model breaks down. Each GPU-driven machine is tied to a specific configuration, networking path, and data storage structure. As Hirschfeld explains, “In AI, you can’t just throw out a node and spin up another one — these are dedicated clusters with specific purposes.”
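In Kubernetes terms, that dedication shows up as hardware identity that has to be recorded and respected rather than abstracted away. Here is a minimal sketch of the idea using the official Python `kubernetes` client; the node name and label keys are hypothetical, not Hirschfeld's or RackN's:

```python
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Record the node's physical identity as labels so operators and schedulers
# treat the hardware as the fixed resource it is. Keys/values are examples.
topology_labels = {
    "example.com/gpu-model": "h100",
    "example.com/ib-fabric": "fabric-a",      # InfiniBand rail the node is cabled to
    "example.com/storage-pool": "pool-east",  # storage fabric the node mounts
}
v1.patch_node("gpu-node-07", {"metadata": {"labels": topology_labels}})

# A training pod then selects exactly this class of hardware rather than
# "any available node" -- the cloud-style assumption that breaks down here.
pod_spec_fragment = {"nodeSelector": topology_labels}
```

The point of the sketch is the inversion: instead of the scheduler picking any interchangeable node, the workload is pinned to the one machine whose cabling and storage paths match the job.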
Unlike ephemeral workloads, AI clusters are physical, interconnected systems where every node matters. They mix InfiniBand, Fibre Channel, and Ethernet networks, each handling a different class of data movement. Training jobs generate massive peer-to-peer traffic, and storage demands require frequent checkpointing and consistent performance across all nodes. The result: a highly sensitive topology where even small configuration changes can cause major downtime.
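The checkpointing pressure is easy to picture. Below is a minimal sketch of the kind of periodic snapshot loop training jobs run, assuming PyTorch; the interval, paths, and function names are illustrative assumptions:

```python
import torch

def train_with_checkpoints(model, optimizer, data_loader,
                           ckpt_dir="/mnt/shared/ckpts"):
    """Periodically snapshot state so a cluster reset does not lose the run.

    Every node writing multi-gigabyte state at the same step is what makes
    consistent storage performance across the cluster non-negotiable.
    """
    for step, (x, y) in enumerate(data_loader):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()

        if step % 500 == 0:  # checkpoint interval is workload-specific
            torch.save(
                {"step": step,
                 "model": model.state_dict(),
                 "optimizer": optimizer.state_dict()},
                f"{ckpt_dir}/step_{step}.pt",
            )
```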
This is where automation becomes essential. RackN’s Digital Rebar automates the wiring, tagging, and lifecycle management of each node, ensuring that clusters can be rebuilt or reset quickly without losing critical configurations. In essence, it bridges the gap between Kubernetes orchestration and the physical infrastructure layer beneath it.
Hirschfeld points out that AI clusters can’t afford the inefficiencies of cloud-style resource pooling. In a typical public cloud model, you might maintain an idle pool of compute nodes for elasticity. But with data-center GPUs costing tens of thousands of dollars each, idle hardware is wasted capital. “It’s simple math,” says Hirschfeld. “The gear is way too expensive for you to have an idle pool of AI machines waiting to be allocated.”
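The math really is simple, even with back-of-the-envelope numbers. The prices below are illustrative assumptions, not quoted figures:

```python
# Illustrative assumptions: an 8-GPU node at $250k, amortized over 4 years.
node_cost = 250_000          # USD per GPU node (assumed)
amortization_years = 4
idle_pool = 16               # nodes held back "for elasticity", cloud-style

hourly_cost = node_cost / (amortization_years * 365 * 24)
idle_burn = idle_pool * hourly_cost
print(f"${idle_burn:,.2f}/hour burned on idle capacity")   # ~$114/hour
print(f"${idle_burn * 24 * 365:,.0f}/year")                # $1,000,000/year
```

Under these assumptions, a modest sixteen-node standby pool burns a million dollars a year doing nothing, which is why the cloud elasticity pattern doesn’t transfer.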
Instead, AI environments follow a “whole-cluster” deployment approach: build the cluster, train the model, reset the environment, and redeploy. Each node remains tightly managed throughout its lifecycle, often re-imaged and patched in place rather than replaced. This operational model demands precision and automation that most DevOps pipelines weren’t built to handle.
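Sketched as a control loop, the approach looks like this. It is a hedged outline only: the functions are placeholder stubs standing in for provisioning tooling, not RackN or Kubernetes APIs:

```python
def provision_cluster(spec):
    """Image nodes, wire networks, apply firmware and driver config (stub)."""
    ...

def run_training_job(job):
    """Launch a training workload across the whole cluster (stub)."""
    ...

def reset_in_place(spec):
    """Re-image and patch the same nodes without re-racking hardware (stub)."""
    ...

def whole_cluster_lifecycle(spec, jobs):
    provision_cluster(spec)
    for job in jobs:
        run_training_job(job)
        # Unlike cloud-style "replace the node", the same hardware is
        # reset to a known-good state and redeployed for the next run.
        reset_in_place(spec)
```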
What makes this shift significant is how it redefines the boundary between cloud-native and bare metal. Kubernetes may orchestrate the jobs, but tools like Digital Rebar orchestrate the infrastructure itself — bringing physical systems into the same automation loop. The result is an architecture that combines the agility of cloud workflows with the determinism of data center control.
As AI adoption grows, this distinction becomes critical. Teams that approach AI infrastructure as “just another Kubernetes deployment” risk hitting scaling walls and cost overruns. Those that adapt — embracing infrastructure-aware automation — will gain speed, consistency, and far better utilization of their expensive hardware.
Hirschfeld’s insights reveal the next phase of platform engineering: uniting software orchestration and infrastructure automation under a single operational model. The future of AI operations isn’t about replacing Kubernetes; it’s about making it aware of the physical reality beneath it.