Guest: Jonathan Bryce
Organization: CNCF
Show: 2026 Predictions
Topic: AI Infrastructure
The AI industry has spent the last two years obsessing over training — building ever-larger models, assembling massive data sets, and racing toward benchmark supremacy. But according to Jonathan Bryce, Executive Director of the Cloud Native Computing Foundation (CNCF), that era is giving way to something more operationally demanding: inference at scale.
“Training is the lab where AI models are created,” Bryce explains. “But once you have a model, you have to actually serve it—make predictions, answer questions, and make that intelligence available to your team, your applications, and your integrations.” For Bryce, 2026 is the year inference becomes the most critical area of AI investment for most companies around the world, and the CNCF ecosystem is at the center of solving it.
From the Lab to the Factory
The analogy Bryce keeps returning to is a simple one: training is the lab, inference is the factory. The challenge is that most organizations are still equipped for the lab. They know how to train a model; they do not yet know how to run one reliably in production at the scale that real business value demands.
“This is not something that people are as comfortable and knowledgeable with as they are with running databases or application servers,” Bryce says. “We have many years of experience with those technologies. Inference is still catching up.”
That gap is precisely where cloud native infrastructure becomes essential. The CNCF community has spent a decade solving the hard problems of production systems — how to deploy, secure, observe, and scale complex distributed workloads. Inference, Bryce argues, is just the latest and most demanding version of that problem.
Which CNCF Projects Are Stepping Up
CNCF now hosts over 200 projects, and Bryce sees a clear convergence happening between the existing toolset and the new demands of AI workloads. Kubernetes remains the foundation, but it is evolving — Dynamic Resource Allocation (DRA) is a key example, enabling more sophisticated management of GPUs and accelerator hardware that does not fit the traditional server model.
OpenTelemetry is becoming indispensable for understanding the behavior of inference systems. “How much am I actually utilizing these GPUs that I invested millions of dollars into? How efficient are my queries against this model?” Bryce asks. These are the questions observability must answer, and OpenTelemetry is the project addressing them.
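Once that telemetry is flowing, the questions reduce to simple ratios. A minimal sketch of the two calculations, in plain Python with made-up sample numbers (a real deployment would emit these as OpenTelemetry metrics fed from NVML or DCGM samples):

```python
# Illustrative only: the two efficiency ratios behind the questions above.
# All input values here are hypothetical samples, not real telemetry.

def gpu_utilization(busy_ms: float, window_ms: float) -> float:
    """Fraction of a sampling window the GPU spent doing real work."""
    return busy_ms / window_ms

def cost_per_query(gpu_dollars_per_hour: float, queries_per_hour: float) -> float:
    """Dollars of GPU time attributed to each answered query."""
    return gpu_dollars_per_hour / queries_per_hour

# e.g. 21.6 s of kernel time in a 60 s window; a $4/hr GPU serving 18k queries/hr
print(f"utilization: {gpu_utilization(21_600, 60_000):.0%}")
print(f"cost/query:  ${cost_per_query(4.00, 18_000):.4f}")
```

Low utilization on the first ratio is what turns the second one ugly: idle GPU hours are still billed, so every served query silently absorbs their cost.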
On the inference engine side, Bryce highlights vLLM as a project gaining serious traction. “It’s the one that Hugging Face has really started to standardize on.” He also points to llm-d as a key project for orchestrating inference engines at scale, and to HAMi — a CNCF sandbox project — as an innovative solution that allows organizations to slice GPUs, split a single card between training and inference workloads, and mix different GPU architectures within a single cluster.
The Scaling Opportunity Is in Software
One of the most striking observations Bryce offers is that the biggest performance gains in inference over the past year have not come from better hardware — they have come from deployment architecture. “We’ve seen improvements in performance that are many multiples on the same hardware architecture,” he says. “This is how you get the most out of your GPUs and make sure your AI investments don’t put you underwater.”
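A toy model shows how architecture alone can yield those multiples. Assuming, hypothetically, a fixed launch overhead per forward pass and a small marginal cost per extra request in a batch (typical of memory-bound LLM decoding), batching requests multiplies throughput on identical hardware:

```python
# Toy model, not a benchmark: both constants are assumed values chosen
# to illustrate why deployment architecture (here, request batching)
# can multiply requests per second on the same GPU.
LAUNCH_OVERHEAD_MS = 10.0  # assumed fixed cost per forward pass
PER_REQUEST_MS = 0.5       # assumed marginal cost per request in a batch

def throughput(batch_size: int) -> float:
    """Requests completed per second at a given batch size."""
    latency_ms = LAUNCH_OVERHEAD_MS + PER_REQUEST_MS * batch_size
    return batch_size / (latency_ms / 1000.0)

for b in (1, 8, 32):
    print(f"batch={b:>2}: ~{throughput(b):.0f} req/s")
```

With these assumed constants, going from a batch of 1 to a batch of 32 improves throughput by more than an order of magnitude, with no change to the hardware at all.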
That insight reframes the infrastructure conversation. Organizations that want to compete on AI in 2026 do not necessarily need to buy more GPUs — they need smarter software that extracts more value from the ones they have.
Specialized Models and the Operations Bridge
Bryce also sees a shift in the model landscape itself. The race for the single largest, most general model is giving way to a proliferation of specialized models — smaller, cheaper, faster, trained on domain-specific data sets. “The specialized models that have been trained on custom data sets are dramatically cheaper and faster to answer questions that they know about,” he says. Workflow platforms like Kubeflow will play an increasingly important role as organizations continuously retrain and refine these models as new data becomes available.
This trend also creates a new kind of professional gap — a need to bridge AI expertise and operations expertise. Bryce envisions an emerging discipline where engineers who understand inference engines and engineers who understand production reliability systems move toward each other. “People from the AI world are going to come this way a little bit, and we’re going to get patterns that are much more rich around how to take something like vLLM, scale it horizontally, route really intelligently — so common queries can be cached and usage cut down.”
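The cache-then-route pattern Bryce describes can be sketched in a few lines. This is a minimal illustration, not a production router: the replica endpoints are hypothetical names, the inference call is a placeholder, and a real router would also weigh replica load and KV-cache locality rather than round-robin blindly:

```python
# Hypothetical sketch of cache-then-route: repeated prompts are served
# from cache, everything else is spread across inference replicas.
from functools import lru_cache
from itertools import cycle

REPLICAS = ["vllm-0:8000", "vllm-1:8000", "vllm-2:8000"]  # assumed endpoints
_next_replica = cycle(REPLICAS)

def _infer(prompt: str) -> str:
    # Placeholder for an HTTP call to the chosen vLLM replica.
    replica = next(_next_replica)  # plain round-robin, for illustration only
    return f"[{replica}] answer to: {prompt!r}"

@lru_cache(maxsize=10_000)
def answer(prompt: str) -> str:
    # Identical prompts never reach a GPU twice.
    return _infer(prompt)

print(answer("What is Kubernetes?"))  # routed to the first replica
print(answer("What is Kubernetes?"))  # cache hit: same string, no GPU call
print(answer("What is vLLM?"))        # routed to the second replica
```

The caching layer is where “usage cut down” comes from: every cache hit is a query that consumes zero GPU time.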
Open Source as the Infrastructure of Sovereignty
For Bryce, the question of whether AI will remain open is settled — at least at the model layer. Open weight models now perform close to proprietary ones and are widely available for enterprise use. But he argues that operational knowledge remains concentrated inside the large labs, and that the open source community is the mechanism by which that knowledge gets democratized.
The example he offers from KubeCon Atlanta 2025 makes the point concretely: an OpenAI engineer discovered that small changes to Fluentd — a CNCF project — could cut a major performance bottleneck in ChatGPT’s infrastructure in half. They committed the fix upstream. Everyone benefits.