Guest: Rob Hirschfeld (LinkedIn)
Company: RackN
Show Name: An Eye on AI
Topic: AI Infrastructure
As organizations race to deploy large AI models, they’re discovering that the biggest bottleneck isn’t GPU availability—it’s operational complexity. In this conversation, Rob Hirschfeld, CEO and Co-Founder of RackN, joins Swapnil Bhartiya to break down how AI infrastructure introduces new levels of scale, cost, and unpredictability to operations—and what can be done about it.
RackN’s recent case study illustrates the point. A major service provider needed to offer AI training clusters as a service to internal teams. Each cluster consisted of 64 machines with complex storage and network configurations, roughly a quarter-billion-dollar investment. Before automation, resetting these systems between training runs could take a week, incurring about $150,000 in idle depreciation per reset. With RackN’s Digital Rebar, that reset now takes just 90 minutes, including a full wipe, reconfiguration, and patching cycle.
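As a rough illustration of what such a reset cycle involves, here is a minimal Python sketch of the wipe, reconfigure, and patch stages. The node model and function names are hypothetical for illustration and do not reflect Digital Rebar's actual API.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """Hypothetical model of one bare-metal machine in the cluster."""
    name: str
    wiped: bool = False
    configured: bool = False
    patched: bool = False

def wipe(node: Node) -> None:
    # Scrub disks so the next tenant sees no residual training data
    node.wiped = True

def reconfigure(node: Node) -> None:
    # Reapply BIOS, NIC, and storage settings for this node's exact
    # position in the cluster topology; ordering matters
    assert node.wiped, "never reconfigure before the wipe completes"
    node.configured = True

def patch(node: Node) -> None:
    # Bring firmware and OS up to the current baseline
    assert node.configured, "patch only a fully reconfigured node"
    node.patched = True

def reset_cluster(nodes: list[Node]) -> list[Node]:
    # Run every node through the full wipe -> reconfigure -> patch cycle
    for node in nodes:
        wipe(node)
        reconfigure(node)
        patch(node)
    return [n for n in nodes if n.patched]

cluster = [Node(f"gpu-{i:02d}") for i in range(64)]
ready = reset_cluster(cluster)
print(len(ready))  # prints 64
```

The point of the strict ordering assertions is the one Hirschfeld makes below: each node must be rebuilt into its exact position, so the pipeline cannot skip or reorder steps.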
Hirschfeld explains that while AI operations share DNA with traditional DevOps, the difference lies in scale and sensitivity. GPU clusters combine InfiniBand, Fibre Channel, and Ethernet networks in ways that make topology management far more demanding. BIOS settings, storage, and network interconnects all need precision to ensure data security and performance. Every node must be rebuilt into its exact position within the cluster, making automation not optional but essential.
Kubernetes also plays a growing role in AI operations—but not in the way most expect. Instead of disposable nodes, AI workloads rely on tightly coupled clusters where machine location, networking, and storage must remain fixed. In Hirschfeld’s words, “You can’t just ask for a new training node. These are purpose-built clusters that need to run continuously.”
The conversation goes beyond technology into the economics of automation. Time to value is now the true metric of ROI. The ability to patch, reset, and repurpose hardware quickly directly translates to financial efficiency. Vendor flexibility—supporting both NVIDIA and AMD, InfiniBand and Ethernet—has become another key cost lever.
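To make the time-to-value point concrete, here is a hedged back-of-the-envelope sketch in Python. The cluster cost and five-year straight-line depreciation schedule are assumptions for illustration; the actual $150,000 figure quoted above depends on the provider's own accounting, which this simple model does not attempt to reproduce.

```python
def idle_depreciation(cluster_cost: float, life_years: float, idle_hours: float) -> float:
    """Straight-line depreciation accrued while hardware sits idle.

    Assumes the cluster depreciates evenly over its useful life,
    so every idle hour burns the same share of its value.
    """
    hourly_rate = cluster_cost / (life_years * 365 * 24)
    return hourly_rate * idle_hours

# Assumed figures: a $250M cluster depreciated over 5 years
week_long_reset = idle_depreciation(250e6, 5, 7 * 24)
fast_reset = idle_depreciation(250e6, 5, 1.5)
print(f"week-long reset: ${week_long_reset:,.0f} idle depreciation")
print(f"90-minute reset: ${fast_reset:,.0f} idle depreciation")
```

Whatever the exact accounting, compressing a week-long reset to 90 minutes shrinks the idle window by more than 99 percent, and that is the lever the automation pulls.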
Hirschfeld concludes with advice for companies starting their AI journey: stop worrying about buying the “perfect” hardware. Focus on repeatable automation and iterative processes. In a world where systems evolve every quarter, agility is the only lasting advantage. He also foresees a merging of AI ops with platform engineering—where infrastructure automation and AI system management become inseparable disciplines.
RackN’s mission, as Hirschfeld puts it, is to democratize AI infrastructure management, making it as efficient and accessible as cloud-native operations. As AI scales from hyperscalers to enterprises, automation will determine who can compete effectively—and sustainably.
Here is the edited Q&A of the interview:
Swapnil Bhartiya: AI is transforming how we build and deliver software, but behind every model and every pipeline is an operational challenge unlike anything we’ve seen before. From GPU clusters to data orchestration, the operational side of AI is forcing teams to rethink infrastructure automation, cost, and scalability. I’m joined once again by Rob Hirschfeld, CEO of RackN, to unpack what makes AI operations unique and how organizations can prepare for this new wave of complexity. I saw your case study about AI Clusters as a Service. Can you recap what that’s about and what problem you were solving for the customer?
Rob Hirschfeld: It really is a stunning study in how expensive and complex AI infrastructure has become. We were pulled in to help a major service provider offering internal training clusters as a service. Each 64-machine cluster represented a quarter-billion-dollar investment. We reduced their full reset time—from scrubbing disks and patching systems to rebuilding networks—from a week to 90 minutes using Digital Rebar automation. That’s about $150,000 of idle time saved per reset, not even counting lost productivity.
Swapnil Bhartiya: RackN works with customers who deal with complex operations. But if we just look at AI, what makes AI operations so uniquely challenging compared to traditional DevOps or cloud-native workloads?
Rob Hirschfeld: Fundamentally, it’s still bare-metal operations—but the complexity and cost are amplified. AI systems mix InfiniBand, Fibre Channel, and Ethernet in hybrid topologies that are far more sensitive. Every node has strict BIOS and NIC configurations that must be precisely rebuilt after resets. It’s not different work, but the pressure, price, and pace are dramatically higher.
Swapnil Bhartiya: Is it more about scale, unpredictability, hardware, data pipelines, privacy, or compliance?
Rob Hirschfeld: All of the above. The AI builders we work with are moving at breakneck speed. One customer can now onboard new clusters in four hours, fully tested and ready for training. That level of velocity is crucial when every idle minute of GPU time costs thousands.
Swapnil Bhartiya: You mentioned Kubernetes being used in AI but not in the way people expect. Can you explain how AI workloads are different from traditional Kubernetes use cases?
Rob Hirschfeld: In AI, Kubernetes orchestrates training jobs, but unlike cloud-native apps, these nodes aren’t interchangeable. They’re tightly bound to specific storage and networking configurations. You can’t just spin up a replacement node because every machine’s placement and topology are critical.
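One way this shows up in practice is explicit node pinning. The fragment below is a hypothetical Kubernetes Pod spec (the names and image are invented for illustration) that uses the standard `kubernetes.io/hostname` label to bind a training worker to one specific machine, rather than letting the scheduler place it on any available node:

```yaml
# Hypothetical Pod spec: pin a training worker to one exact machine
apiVersion: v1
kind: Pod
metadata:
  name: trainer-rank-0                    # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/hostname: gpu-node-07   # this exact machine, not "any GPU node"
  containers:
  - name: trainer
    image: example.com/trainer:latest     # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 8                 # all GPUs on the node
```

This inverts the usual cloud-native pattern: instead of treating nodes as interchangeable cattle, the scheduler is told that placement, and therefore network topology, is part of the workload's identity.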
Swapnil Bhartiya: Let’s talk about the economics of automation. How does it impact total cost of ownership and ROI?
Rob Hirschfeld: The biggest cost isn’t hardware—it’s opportunity loss. Automation reduces downtime, increases utilization, and allows organizations to mix vendors and technologies for cost optimization. Being vendor-agnostic across GPU and network layers is becoming a competitive advantage.
Swapnil Bhartiya: What advice do you have for companies starting to build AI infrastructure?
Rob Hirschfeld: Don’t get stuck on hardware decisions. Invest in automation and repeatable cluster builds. The tech evolves so fast that agility matters more than any single purchase. Build resilient processes, not static systems.
Swapnil Bhartiya: How do you see AI ops evolving in the next few years?
Rob Hirschfeld: It will merge with platform engineering. But there’s also a missing conversation—AI infrastructure operations. We need to normalize how to automate, patch, and manage AI clusters. We’re also exploring how AI can assist in operations through LLMs and agentic workflows.
Swapnil Bhartiya: Finally, what role do you see for RackN in this AI-driven world?
Rob Hirschfeld: We’re helping democratize AI infrastructure. Our automation tools shorten deployment cycles, save energy, and help teams move faster. AI will eventually be everywhere, and our goal is to make running AI infrastructure as accessible as running traditional systems.