AI Infrastructure

Why AI Factory Deployments Keep Failing — And How RackN Is Fixing That

Guest: Rob Hirschfeld (LinkedIn)
Company: RackN
Show: The Agentic Enterprise
Topic: Agentic AI

Enterprises are spending billions on AI infrastructure, but the hardest part isn’t the models or the GPUs — it’s getting the bare metal right.



Rob Hirschfeld has seen it up close. As CEO and Co-Founder of RackN, he’s worked with some of the most ambitious AI factory deployments in the industry, and the pattern is consistent: urgency creates chaos, silos create blind spots, and teams that can’t replicate their own success are just getting lucky.

The term “AI factory” is gaining traction for a reason. It’s not just a cluster or a workload — it’s the entire physical plant purpose-built to produce AI results. Whether that’s racks of training gear for model builders or inference engines for production workloads, an AI factory represents a standardized, reproducible footprint. “Instead of an AI cluster or AI workstation or AI workload, the going term seems to be AI factory,” Hirschfeld explained. “It talks about the mechanical physical plant you need to accomplish AI, versus running a model or doing inference or having an API.”

What makes AI factories fundamentally different from traditional data center deployments is the density of complexity. Multiple GPUs per server, multiple smart NICs, highly tuned networking topologies — every component is a potential point of failure. Hirschfeld described a situation where his team spent over a week troubleshooting an issue that turned out to be smart NICs independently grabbing DHCP addresses and interfering with the provisioning network. “People insisted they knew all of the components,” he said. “But when you have this many pieces and parts in a server, you have a lot more potential conflicts, a lot more places for conflicting configuration.”
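The DHCP conflict Hirschfeld describes is a good example of a problem that is easy to catch with a simple audit once you know to look for it. The sketch below is a hypothetical illustration (not RackN's tooling): it flags DHCP leases on a provisioning network whose MAC vendor prefix doesn't match an allowlist of expected server NICs, which is one way smart NICs independently requesting addresses would show up. The OUIs and lease records are invented example values.

```python
# Hypothetical sketch: flag DHCP leases granted to unexpected devices on a
# provisioning network. Vendor OUIs and lease data below are made-up examples.

KNOWN_SERVER_OUIS = {"b0:7b:25", "d0:94:66"}  # vendor prefixes of expected NICs

def oui(mac: str) -> str:
    """Return the first three octets (vendor prefix) of a MAC address."""
    return mac.lower()[:8]

def unexpected_leases(leases):
    """Return leases whose MAC vendor prefix is not on the allowlist."""
    return [lease for lease in leases if oui(lease["mac"]) not in KNOWN_SERVER_OUIS]

leases = [
    {"ip": "10.0.0.21", "mac": "B0:7B:25:11:22:33"},  # expected server NIC
    {"ip": "10.0.0.42", "mac": "94:6D:AE:44:55:66"},  # e.g. a smart NIC grabbing an address
]

for lease in unexpected_leases(leases):
    print(f"unexpected lease: {lease['ip']} -> {lease['mac']}")
```

In practice the lease list would come from the DHCP server's lease file or logs; the point is that an automated check surfaces in seconds what took Hirschfeld's customer over a week to find by hand.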

The pressure to move fast makes all of this worse. Hirschfeld invoked a Marine Corps maxim that cuts to the heart of the problem: “Slow is smooth, smooth is fast.” His teams see the opposite play out constantly — customers taking delivery of racks without knowing in advance what hardware is coming, racing to get systems online, and skipping the process discipline that would actually accelerate delivery long-term. “The urgency to get these systems up and running actually undermines teams’ ability to troubleshoot, take step-wise approaches, and put in systems and processes that then speed them up.”

One of the most persistent self-inflicted wounds is organizational: platform teams — the people running Kubernetes, managing AI ops — assume they can also handle bare metal provisioning. They can’t, at least not without paying a steep tuition in time and delays. “We’ve seen a lot of derailed pilots where teams are taking months and months to get started because they’re learning bare metal at the same time they’re trying to get their pilot running,” Hirschfeld said. The same dynamic is playing out in VMware migration projects. Teams that resist bringing in bare metal expertise watch their timelines blow out. Teams that lean in and adopt repeatable processes are the ones that actually win.

The core advice Hirschfeld offers to enterprises about to deploy AI infrastructure is deceptively simple: make sure you can replicate success. “If you can’t replicate what your successful outcome is — flush everything, start over, and go through those steps again — then you actually don’t know why you were successful. You just got lucky.” He even offered a test for executives: ask your team to reset the system to zero and walk through the entire process from scratch. If they look nervous, you don’t have a repeatable deployment. You have a demo.

RackN’s answer to this complexity is Digital Rebar — battle-tested bare metal automation workflows that handle the full spectrum of layer-zero infrastructure work: inventory, network topology configuration, firmware updates, OS installation, security bootstrapping, and more. The platform is designed to handle whatever hardware shows up — Dell, Supermicro, Quanta — without requiring custom code or heroic effort on the fly. “Our customers don’t even know what gear is going to show up,” Hirschfeld noted. “You don’t have time to say, wait, I’m not ready for the Dell gear yet.”

Looking ahead, Hirschfeld sees agentic AI as critical — not as something RackN builds into its own product, but as a layer that enterprises will manage themselves. RackN’s role is to ensure those agents can interact with data center infrastructure reliably. “Enterprises are going to live or die on the quality of their agent responses and the guardrails they put together,” he said. That means building MCP integrations and prompt dictionaries so that enterprise AI agents can issue high-level commands to Digital Rebar and get deterministic, reliable results — not stochastic guesses.

The AI infrastructure race is real. But the winners won’t be the teams that move fastest. They’ll be the ones that build the processes to move repeatably.
