As AI workloads become the new normal across cloud infrastructure, platform engineering must evolve to meet the moment. Observability tools, security practices, and infrastructure models built for traditional apps are proving insufficient — especially when AI workloads involve GPU-based processing and opaque models. In a conversation with TFiR, Saiyam Pathak, Principal Developer Advocate at vCluster Labs, outlined where the cracks are showing — and what’s being done to fix them.
AI Observability: Tracing the Untraceable
“AI is a black box,” Pathak said. “A request goes into the model, and you get an output — but what happens in between is hard to trace.”
To address this, the OpenTelemetry community has created a new special interest group (SIG) focused on AI-specific observability. These efforts aim to standardize how teams trace and measure LLM interactions, model performance, and AI pipeline behavior.
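For teams experimenting ahead of that standard, the building blocks already exist in the OpenTelemetry SDK. The sketch below (Python, standard opentelemetry-sdk packages) wraps one LLM call in a span carrying draft gen_ai.* attributes. The attribute names follow the SIG's evolving GenAI semantic conventions and may still change; call_llm and the model name are hypothetical placeholders, not anything Pathak or the SIG prescribes.

```python
# Minimal sketch: tracing one LLM call with the OpenTelemetry Python SDK.
# gen_ai.* attribute names follow the draft GenAI semantic conventions and
# may change; call_llm() and the model name are illustrative placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints spans to stdout for demonstration purposes.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai.pipeline.demo")


def call_llm(prompt: str) -> dict:
    # Hypothetical model client; swap in a real SDK call here.
    return {"text": "stub response", "input_tokens": 42, "output_tokens": 128}


def traced_chat(prompt: str) -> str:
    # One span per model invocation, annotated with request and usage data.
    with tracer.start_as_current_span("chat demo-model") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "demo-model")
        response = call_llm(prompt)
        span.set_attribute("gen_ai.usage.input_tokens", response["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["output_tokens"])
        return response["text"]


if __name__ == "__main__":
    traced_chat("Summarize today's GPU utilization report.")
```

Even this small amount of structure turns the "black box" Pathak describes into something a platform team can query: which model was called, with what token cost, inside which request.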
“We’re still developing the foundations,” Pathak noted. “But platform teams are already building infrastructure with future AI workloads in mind.”
GPU Vulnerabilities and the Return of the VM Debate
One of the more concerning trends: rising GPU-level vulnerabilities. Pathak referenced “NVIDIA Escape,” a vulnerability tied to the NVIDIA Container Toolkit that allowed attackers to break out of containers and access host nodes.
“These aren’t theoretical,” he said. “It’s happened before. It will happen again. And every time, people panic — and some even suggest moving back to VMs.”
These escape vectors are particularly concerning in AI/ML pipelines, where privileged access to GPUs is common, and they are forcing teams to rethink their trust boundaries.
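One concrete way teams narrow those boundaries is to stop running GPU pods as privileged and request GPUs through the device plugin's resource limits instead. The sketch below, using the official Kubernetes Python client, builds such a pod spec; the image, namespace, and pod names are placeholders, and the cluster is assumed to already expose GPUs via the NVIDIA device plugin.

```python
# Minimal sketch (official Kubernetes Python client): a GPU workload that
# requests a GPU via resource limits rather than privileged mode.
# Image, namespace, and pod names are illustrative placeholders; the cluster
# is assumed to already run the NVIDIA device plugin.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="inference-worker", namespace="ml-team"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="inference",
                image="registry.example.com/inference:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    # GPU access via the device plugin, no privileged flag needed.
                    limits={"nvidia.com/gpu": "1"}
                ),
                security_context=client.V1SecurityContext(
                    privileged=False,
                    allow_privilege_escalation=False,
                    capabilities=client.V1Capabilities(drop=["ALL"]),
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-team", body=pod)
```

Running GPU workloads this way does not eliminate container-escape risk, but it shrinks the attack surface an exploit of the kind Pathak described can leverage.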
A Mitigation Strategy: Virtual Node Isolation
vCluster Labs offers a countermeasure: vNode.
“Even if there’s a container escape,” Pathak explained, “you only get access to a virtual node — not the real host. It reduces the blast radius.”
This type of compartmentalization is increasingly necessary as AI workloads scale and run in shared environments. vNode assumes failure will happen — and builds for resilience.
A Community Still Finding Its Standards
Observability and security standards for AI systems are still in flux. Pathak noted that various CNCF and OpenTelemetry projects are still working toward consensus.
“There are new SIGs forming just to figure out how to trace an AI request from a user, through an LLM, to the output,” he said.
The industry is still early in defining how platform teams should manage and observe AI-native systems.
Multi-Tenancy Still Front and Center
While AI observability and GPU security are top of mind, Pathak reaffirmed that multi-tenancy remains the underlying concern.
“Everyone wants multi-tenancy — but they also want flexibility. That’s what we’re building with vCluster,” he said.
By combining flexible tenancy models with strong security primitives like vNode, vCluster Labs is attempting to future-proof Kubernetes infrastructure for what’s next.
Conclusion: Platform Engineering Has a New Mandate
What was once a DevOps function is now turning into a forward-looking discipline. Today’s platform engineering teams must secure infrastructure that supports opaque models, expensive GPUs, and unpredictable LLM behavior — while making it observable, scalable, and cost-efficient.
As Pathak put it, “Things will break. The question is: what happens when they do?”