Running AI training jobs, LLM inference workloads, and bursty AI agent sessions on the same Kubernetes cluster is a financial trap. The problem isn’t deployment—it’s wasted GPU capacity, fragmented resource allocation, and inefficient scheduling that treats every workload the same. Enterprises are paying for idle compute while simultaneously struggling with latency spikes and resource contention.
With version 1.14, Volcano is evolving from a batch scheduling tool into an AI-native unified scheduling platform designed to handle the full AI lifecycle without burning through cloud budgets. With its new multi-scheduler architecture, topology-aware scheduling, and intelligent routing for inference workloads, Volcano addresses the operational and financial pain points that standard Kubernetes schedulers can’t solve.
The Guest: Jesse Stutler, Maintainer at Volcano
Key Takeaways
- Volcano 1.14 introduces multi-scheduler architecture with dynamic sharding for batch and latency-sensitive AI agent workloads
- GPU cost reduction comes from higher utilization through topology-aware scheduling and colocation strategies
- AgentCube provides Kubernetes-native infrastructure for bursty, short-lived AI agent sessions with warm pools and session-aware routing
- Katana delivers production-ready LLM inference with KV cache awareness, prefix caching, and speculative decoding support
***
In a recent TFiR interview, Swapnil Bhartiya spoke with Jesse Stutler, Maintainer at Volcano, about the evolution of Volcano from a batch scheduling tool to an AI-native unified scheduling platform capable of handling training, inference, and agent workloads on the same Kubernetes cluster.
What Is Volcano?
Volcano is an open-source scheduling platform for Kubernetes that originated in batch workload management, including AI training, high-performance computing (HPC), and big data processing. As AI workloads have grown more complex, Volcano is moving beyond batch scheduling to become a unified platform that handles AI training, inference, and agent workloads simultaneously.
Q: What is Volcano?
Jesse Stutler: “Volcano is an open scheduling platform for Kubernetes. It originated in batch workloads, including AI training, HPC, and big data. But AI workloads are becoming more complex today: not just AI training, but also AI inference and agent workloads, all running together. So Volcano is moving beyond batch to become a unified scheduling platform.”
Volcano 1.14: Architectural Evolution for Unified AI Scheduling
The release of Volcano 1.14 marks a significant architectural shift, introducing a multi-scheduler architecture designed to handle both batch workloads and latency-sensitive AI agent sessions without compromising performance or resource efficiency.
Q: What are the biggest changes in Volcano 1.14?
Jesse Stutler: “The biggest change in Volcano 1.14 is the architectural shift. Volcano is moving beyond batch to a unified scheduling platform for AI agent and AI workloads. First, we introduce a multi-scheduler architecture: not just a scheduling controller, but also a dedicated agent scheduler for latency-sensitive workloads. Second, we enhance topology-aware scheduling with hyper-node-level bin packing and subgroup-level support, which improves the performance of AI training and inference. We have also enhanced the colocation features to support generic OSes, cgroup v2, and CPU throttling. Together, all these features help Volcano support scheduling for the full AI lifecycle.”
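For readers who want to see what this looks like in practice, here is a minimal, illustrative Volcano Job manifest that uses the Volcano scheduler and requests topology-constrained placement. The `networkTopology` block reflects Volcano’s hyper-node topology-aware scheduling, but exact field names can vary between releases; the queue name and container image are placeholders.

```yaml
# Illustrative Volcano Job: 4 GPU workers scheduled by Volcano,
# constrained to a single network tier (hyper-node) when possible.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: demo-training-job
spec:
  schedulerName: volcano
  minAvailable: 4          # gang scheduling: all 4 workers or none
  queue: default           # placeholder queue
  networkTopology:
    mode: hard             # require placement within the allowed tier
    highestTierAllowed: 1
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: my-training-image:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```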
GPU Cost Reduction Through Intelligent Scheduling
GPU costs represent the largest line item in AI infrastructure budgets, and much of that spend comes from wasted capacity. Volcano addresses this by increasing cluster utilization through topology-aware placement and colocation strategies that reduce idle time and fragmentation.
Q: How does Volcano help organizations save money on cloud costs, especially GPU workloads?
Jesse Stutler: “Volcano helps reduce costs by increasing cluster utilization. In AI clusters, a lot of the cost comes from wasted capacity: idle GPUs, fragmented placement, and inefficient communication. First, Volcano places communication-heavy workloads in the same network domain, which improves training and inference efficiency. Second, colocation increases deployment density so different workloads run together in the same cluster. The result is higher utilization, less idle time, and better use of infrastructure.”
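The placement idea Stutler describes can be reduced to a toy algorithm: keep a communication-heavy job inside a single network domain, and prefer the tightest-fitting domain so leftover GPUs are not fragmented. This is an illustrative sketch with made-up node data, not Volcano’s actual scheduler code.

```python
# Toy domain-aware placement: fit the whole job into one network domain,
# choosing the domain with the least spare capacity (tightest fit), then
# greedily bin-pack onto the fullest nodes inside it.

def place_job(gpus_needed, domains):
    """domains: dict mapping domain name -> list of free GPU counts per node.

    Returns (domain, {node index: GPUs taken}) or None if no domain fits.
    """
    best = None
    for name, free in domains.items():
        if sum(free) < gpus_needed:
            continue  # job would have to span domains; skip this one
        spare = sum(free) - gpus_needed
        if best is None or spare < best[1]:
            best = (name, spare, free)
    if best is None:
        return None
    name, _, free = best
    # Greedy bin packing inside the chosen domain: fill fullest nodes first.
    alloc, remaining = {}, gpus_needed
    for i, f in sorted(enumerate(free), key=lambda x: -x[1]):
        take = min(f, remaining)
        if take:
            alloc[i] = take
            remaining -= take
        if remaining == 0:
            break
    return name, alloc

domains = {"rack-a": [8, 8], "rack-b": [4, 2, 2]}
print(place_job(8, domains))  # rack-b fits exactly, so it is preferred
```

Choosing the tightest fit leaves rack-a’s 16 free GPUs contiguous for a later, larger job, which is the fragmentation effect the quote refers to.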
Katana: AI-Native Inference Routing for Kubernetes
Deploying and serving LLM inference workloads on Kubernetes has historically required stitching together multiple tools and custom routing logic. Volcano’s Katana platform provides a unified inference layer with built-in support for KV cache awareness, prefix caching, and advanced inference patterns like disaggregated prefill and decode.
Q: How does Volcano make it easier to deploy and use AI models, especially LLM inferencing?
Jesse Stutler: “Volcano makes AI models easier to use because AI today needs not just simple scheduling but intelligent routing, inference serving, KV cache awareness, and support for patterns like prefill-decode disaggregation. Volcano provides the scheduling foundation for this, and Katana is built on top of it as a Kubernetes-native AI inference platform. It includes KV-cache-aware routing and prefix routing, and it supports mainstream inference frameworks such as vLLM and SGLang, in both standard inference and prefill-decode disaggregation modes. Together with Volcano’s topology-aware scheduling, this helps users run AI models on Kubernetes more efficiently and in a production-ready way.”
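Prefix-aware routing can be illustrated with a toy router that sends each request to the replica whose KV cache already holds the longest matching prompt prefix, so fewer prompt tokens have to be recomputed. The replica names and cache contents here are hypothetical; this is a sketch of the idea, not the Katana API.

```python
# Toy KV-cache-aware routing: score each replica by how much of the
# incoming prompt it has already cached, and route to the best match.

def shared_prefix_len(a, b):
    """Length of the common character prefix of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt, replicas):
    """replicas: dict mapping replica name -> list of cached prompt prefixes."""
    def best_match(cached):
        return max((shared_prefix_len(prompt, p) for p in cached), default=0)
    # Pick the replica whose cache overlaps the prompt the most.
    return max(replicas, key=lambda r: best_match(replicas[r]))

replicas = {
    "pod-a": ["You are a helpful assistant."],
    "pod-b": ["Translate to French:"],
}
print(route("Translate to French: hello", replicas))  # -> pod-b
```

A production router would score token sequences and cached KV blocks rather than characters, but the routing decision has the same shape.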
AgentCube: Kubernetes-Native Infrastructure for AI Agents
AI agent workloads present a unique challenge—they are bursty, short-lived, and latency-sensitive, which standard Kubernetes schedulers struggle to handle efficiently. AgentCube introduces session-aware routing, warm pools, and pre-provisioned environments to deliver fast startup times and efficient resource allocation for AI agents.
Q: Can you talk about AgentCube and how it helps AI developers manage these bursty, short-lived agent workloads?
Jesse Stutler: “AgentCube is a Kubernetes-native platform for AI agent workloads. It is designed for short, interactive, session-based workloads, which standard Kubernetes cannot currently handle efficiently. AgentCube improves this with session-aware routing and reusable warm pools of pre-provisioned environments for much faster startup. It also provides out-of-the-box SDKs, CLIs, and richer ecosystem integration, so AI developers can more easily run AI agent workloads on Kubernetes.”
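The warm-pool idea can be sketched in a few lines: agent sessions draw from a set of pre-provisioned sandboxes on the fast path, and only a drained pool forces a slow cold start. The class and method names are illustrative, not the AgentCube SDK.

```python
# Toy warm pool for agent sessions: hand out pre-provisioned sandboxes
# instantly, fall back to a cold start only when the pool is empty, and
# recycle released sandboxes for the next session.
from collections import deque

class WarmPool:
    def __init__(self, size, provision):
        self.provision = provision                   # slow cold-start factory
        self.pool = deque(provision() for _ in range(size))

    def acquire(self):
        if self.pool:
            return self.pool.popleft(), "warm"       # fast path
        return self.provision(), "cold"              # fallback: cold start

    def release(self, sandbox):
        self.pool.append(sandbox)                    # recycle the environment

pool = WarmPool(2, provision=lambda: object())
sbx, kind = pool.acquire()
print(kind)                  # -> warm
pool.acquire()               # drains the pool
print(pool.acquire()[1])     # -> cold
pool.release(sbx)
print(pool.acquire()[1])     # -> warm (recycled)
```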
Multi-Scheduler Architecture with Dynamic Sharding
Running multiple schedulers on the same Kubernetes cluster introduces coordination challenges. Volcano’s dynamic node sharding mechanism allows the batch scheduler and the agent scheduler to coexist without resource conflicts, allocating nodes dynamically based on workload characteristics and CPU utilization patterns.
Q: You’ve introduced a multi-scheduler architecture with dynamic sharding. What problem does that solve?
Jesse Stutler: “The multiple schedulers need to cooperate. Volcano has built a new scheduler, the agent scheduler, specifically for fast scheduling of AI agent workloads, while the existing batch scheduler handles batch workloads. Because they run together on the same platform, we use a dedicated dynamic node sharding mechanism to let them cooperate. It monitors CPU utilization: if a node’s utilization is higher than 18%, AI agent workloads are preferred because they use fragmented resources, so we allocate that node to the agent scheduler. If utilization is below 18%, we allocate the node to the batch scheduler. This allows them to run together.”
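The sharding rule Stutler describes reduces to a simple classification, sketched below with the 18% threshold from the interview. Volcano’s actual mechanism monitors utilization continuously and re-shards dynamically; this sketch only shows the decision itself.

```python
# Toy dynamic node sharding: busy nodes (above the threshold) go to the
# agent scheduler, which fills fragmented spare capacity with small,
# latency-sensitive sessions; emptier nodes go to the batch scheduler,
# which needs large contiguous allocations.

AGENT_THRESHOLD = 0.18  # 18%, as described in the interview

def shard_nodes(utilization):
    """utilization: dict mapping node name -> CPU utilization in [0, 1]."""
    shards = {"agent": [], "batch": []}
    for node, util in utilization.items():
        shard = "agent" if util > AGENT_THRESHOLD else "batch"
        shards[shard].append(node)
    return shards

print(shard_nodes({"n1": 0.05, "n2": 0.40, "n3": 0.18}))
# -> {'agent': ['n2'], 'batch': ['n1', 'n3']}
```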