Enterprises are spending billions on AI infrastructure based on assumptions about which models, hardware, and architectures are winning in production. Most of those assumptions are wrong. The gap between the AI narrative circulating in boardrooms, analyst reports, and media coverage, and the actual workloads running on distributed GPU infrastructure across the globe, has widened into a chasm. Companies operating on bad assumptions are losing ground to those that aren’t.
The signal is clearest at the infrastructure layer. AI cloud platforms running real production workloads across hundreds of thousands of developers have a view of actual model adoption, GPU utilization, inference patterns, and application categories that no survey can replicate. That ground-level data is now being made public — and it tells a story that few executives expected to hear.
Qwen, Alibaba’s open-source model family, has overtaken Meta’s Llama 3.1 as the most widely deployed self-hosted large language model. NVIDIA’s Blackwell B200 GPU, widely written off as sitting cold and in weak demand as recently as mid-2025, saw utilization skyrocket after firmware patches and software fixes unlocked time-to-first-token speeds ten times faster than the previous Hopper generation. And the most sophisticated enterprises are not betting on a single AI model — they are building intelligent model routing architectures that blend closed-source and open-source models in real time based on task context, latency requirements, and cost targets.
This is the story that Runpod’s inaugural 2026 State of AI Report was designed to tell. Built from anonymized platform traffic and GPU utilization data across 183 countries, the report moves beyond hype to document the infrastructure patterns defining the current era of AI deployment. Runpod, one of the leading AI-native cloud platforms — sometimes called a NeoCloud — serves a developer base approaching 750,000 users, ranging from academic researchers at Stanford and Berkeley to enterprise customers including Zillow, whose virtual home staging runs on Runpod infrastructure.
The implications for technology buyers, infrastructure architects, and AI product teams are significant. What models you deploy, what hardware you choose, and how you structure your inference architecture will determine whether you are building on the right foundation — or spending the next 12 months rebuilding.
The Guest: Brennen Smith, CTO at Runpod
Key Takeaways
- Qwen has overtaken Llama 3.1 as the most deployed self-hosted LLM on Runpod; Kimi K2 is rising rapidly as enterprises optimize for token cost and fine-tuning control
- NVIDIA Blackwell B200 demand surged after firmware patches helped push time-to-first-token 10x faster than Hopper, under the roughly 100 ms threshold the human brain perceives as instantaneous
- The most successful AI deployments are not single-model — they use AI-powered model routing architectures that fan out to multiple specialized models in parallel
- Agent-driven compute is now a measurable workload category on Runpod; the platform’s own supply agent manages on-call infrastructure operations autonomously
- Small AI models making micro-decisions at high frequency — on CPU or lightweight GPU — represent a major underexplored frontier for production engineering teams
***
In this exclusive interview with Swapnil Bhartiya at TFiR, Brennen Smith, CTO at Runpod, discusses the findings from Runpod’s 2026 State of AI Report, the ground-level realities of open-source model adoption across modalities, the unexpected Blackwell demand surge, how leading enterprises are building model routing architectures, the rise of agent-driven compute, and where AI infrastructure is heading over the next two quarters.
What Runpod Is and Why Its Data Matters
Runpod occupies a category of infrastructure providers increasingly referred to as NeoCloud or AI cloud — platforms purpose-built for AI workloads rather than adapted from general-purpose cloud architectures. Understanding what Runpod does and who it serves is essential context for interpreting the data it collects.
Q: Can you tell us a bit about Runpod and what you folks do?
Brennen Smith: “Runpod is one of the leading NeoCloud or AI cloud providers. Our focus is really on building the best AI experience for developers. This ranges from our Pods products all the way out to production-level inference. We’ve built many custom engines that allow us to scale AI workloads from a small single GPU all the way out to thousands of GPUs around the globe. We have customers ranging from small hobbyists and education facilities — we work very closely with Stanford and Berkeley — but we also work with many large companies. A good example is Zillow; a lot of their virtual staging that you see on their website runs through Runpod. Overall, what we focus on is providing high-quality AI compute for cutting-edge research and production workloads around the globe.”
The 2026 State of AI Report: Purpose and Methodology
The decision to publish a ground-level report on AI workloads came directly from market demand. Enterprises, researchers, and policymakers were all asking the same question — what is actually being used in production? — and very few companies had the visibility to answer it with data rather than speculation.
Q: What was the idea behind the State of AI Report?
Brennen Smith: “This is something we get asked about every single day: how are people taking AI to production? Businesses are wondering what exactly is the status quo — what are their competitors doing, what is the benchmark in the industry? Researchers want to know what models are performing best, and by performing I mean market adoption, not just benchmark scores. We realized that we kept getting these questions, and there aren’t that many companies in the world who can answer them. Most can talk at a high level, but very few are actually running these types of workloads. So we decided to put this report together. We have an in-house data team — an incredibly talented group of PhDs and data scientists. They went through our data in aggregate, analyzed it, and found some really exciting insights. We plan to do this on a semi-annual basis. That’s what I’m really excited about — this now sets the baseline, and tracking the changes over time is when things will get really interesting.”
Open-Source Model Adoption by Modality: What the Data Shows
The report covers model adoption across four primary modalities: image generation, video generation, text-to-speech, and large language models. Each tells a different story about the current state of the open-source AI ecosystem.
Q: The headline of your report is that the market looks nothing like the narrative. What does your ground-level data actually tell us about how the open-source model ecosystem is evolving?
Brennen Smith: “It all depends on the modality. For image models, Stable Diffusion remains one of the preeminent models being selected. Flux is definitely up and coming — it’s emerged as a challenger — but even still, people get good results out of Stable Diffusion and the tooling is much more proven and stable. So Stable Diffusion remains king, but I do expect Flux to start moving higher and higher up the stack. I would not be surprised if it someday overtakes Stable Diffusion. It’s a similar story with video generation, except WAN is the primary driver — that’s the gold standard, the reference. I don’t expect it to change necessarily. There might be something new that comes up, but at this point WAN is by far the strongest one. With the recent investment announcements, the state of funding in that space may change, so I would see that as maybe an opportunity for a new entity to come in and put something together. But for now, WAN remains the most popular one on our platform.”
Brennen Smith: “Text-to-speech is actually a really fascinating one. We see quite a bit of it, especially in the Middle East. We have a data center in the UAE, and the primary workload being operated there is text-to-speech models. XTTS is the primary model of choice. Many of these workloads are either translation services or call center-type applications — Dubai is a hub for innovation, and a lot of tech companies are centralized around communication-layer use cases. XTTS is still by far the most common one we see. And finally, let’s get to LLMs. Qwen is the most popular one — that was a really interesting finding for us. We assumed Llama 3.1 would probably be the most popular. It wasn’t. Our top model is Qwen. Since this report came out, I’ve personally seen Kimi K2 going up rapidly in our data. I expect that to be on the leaderboards next time we release this report. That’s been interesting, especially with the revelation that Cursor was heavily reliant on Kimi K2. I’ve received a lot of requests from CIOs and CTOs who are looking to reduce their token costs — they realize they may be able to get around 80% of the capability of Cursor at a fraction of the cost, or they can take that model, fine-tune it themselves, and actually get a better outcome because it has better context on their specific company and use case.”
Brennen Smith: “The most important detail to take away is that these models are being used in production. They are serving millions of requests every single day. They are load-bearing for actual business workloads. If there’s anything exciting here — everyone has been asking how soon will money actually be made from AI — for a lot of companies, that answer was quite a while ago. We’re well past the innovation stage. People should be monetizing these models at this point.”
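For the text-to-speech workloads Smith describes, XTTS can be invoked in a few lines. The sketch below assumes the open-source Coqui TTS package and the XTTS v2 checkpoint; the reference clip, text, and file paths are placeholders, and this illustrates the model itself, not any specific customer pipeline.

```python
# Minimal XTTS v2 inference via the Coqui TTS package (pip install TTS).
# XTTS clones a voice from a short reference clip, which is part of what
# makes it a fit for translation and call-center use cases.
# All paths and the text below are placeholders.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Your call is being transferred to an agent.",
    speaker_wav="reference_voice.wav",  # a few seconds of the target voice
    language="en",
    file_path="out.wav",
)
```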
Closed-Source vs. Open-Source: The Real Competitive Dynamic
The debate between open-source and closed-source AI models is often framed as a binary choice. The production reality is far more nuanced — and the most successful companies have moved past the either/or framing entirely.
Q: You only track open-source. How far behind are open-source models versus closed-source models like Claude, GPT, or Gemini?
Brennen Smith: “I would look at it in terms of what overall outcome you’re trying to achieve. For many business cases, a model like Claude Opus is actually too heavy for the need. Take speed and time-to-first-token, for example — token velocity is incredibly important. You don’t necessarily need a heavy closed-source model. You might be able to get away with a much lighter open-source model, do some fine-tuning on top of it, and get your time-to-first-token down to whatever call center requirements demand. I wouldn’t look at it as brute-force capability. Where we see people doing really interesting things is blending models together. We have the ability at Runpod to use both closed-source and open-source models — obviously closed-source running on separate infrastructure — but what companies are doing is building a dual-context pipeline. They’ll pull in information from the closed-source model to synthesize a high-level response or to identify user intent, and then feed that into a fine-tuned open-source model running on Runpod.”
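Smith doesn’t spell out an implementation, but the dual-context pattern he describes reduces to a two-stage pipeline: a closed-source model extracts intent, and a fine-tuned open-source model produces the response. Here is a minimal sketch, assuming the openai Python client, an OpenAI-compatible self-hosted endpoint such as a vLLM server, and hypothetical URLs, keys, and model names throughout.

```python
# Hypothetical sketch of the dual-context pattern: a closed-source model
# identifies user intent; a fine-tuned open-source model (self-hosted,
# e.g. on a GPU pod) generates the actual response. Endpoints, keys, and
# model names are placeholders, not Runpod APIs.
from openai import OpenAI

closed = OpenAI()  # hosted closed-source model; reads OPENAI_API_KEY
open_src = OpenAI(base_url="http://my-pod:8000/v1", api_key="EMPTY")

def answer(user_message: str) -> str:
    # Stage 1: spend a small closed-source call on intent only.
    intent = closed.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the user's intent in one short phrase."},
            {"role": "user", "content": user_message},
        ],
    ).choices[0].message.content

    # Stage 2: feed the intent into a fine-tuned open model for the response.
    resp = open_src.chat.completions.create(
        model="my-org/qwen2.5-7b-callcenter-ft",  # hypothetical fine-tune
        messages=[
            {"role": "system", "content": f"User intent: {intent}. Respond per company policy."},
            {"role": "user", "content": user_message},
        ],
    )
    return resp.choices[0].message.content
```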
Brennen Smith: “So I don’t see everyone looking for the silver bullet model. In reality, the most successful companies are building their own model routing architectures, and they’re building AI intelligence into that model routing architecture — fanning out to a number of different models, each tuned to whatever the particular context or case might be. It’s not a one-size-fits-all approach. The clever players are using many, many different flavors of models.”
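Runpod hasn’t published a reference router, so the sketch below is only an illustration of the fan-out idea: query several specialized backends in parallel and collect the replies. The backends and model names are hypothetical, and a production router would add the intent scoring, latency budgets, and cost targets Smith alludes to.

```python
# Hypothetical fan-out router: send one prompt to several specialized,
# OpenAI-compatible backends concurrently and gather all replies.
# Backends and model names are placeholders, not Runpod's router.
import asyncio
from openai import AsyncOpenAI

BACKENDS = {
    "code": (AsyncOpenAI(base_url="http://code-pod:8000/v1", api_key="EMPTY"),
             "qwen2.5-coder-7b-instruct"),
    "chat": (AsyncOpenAI(base_url="http://chat-pod:8000/v1", api_key="EMPTY"),
             "qwen2.5-7b-instruct"),
}

async def ask(name: str, prompt: str) -> str:
    client, model = BACKENDS[name]
    resp = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def fan_out(prompt: str) -> dict:
    # A real router would score, merge, or short-circuit on latency
    # and cost instead of simply gathering every answer.
    replies = await asyncio.gather(*(ask(n, prompt) for n in BACKENDS))
    return dict(zip(BACKENDS, replies))

if __name__ == "__main__":
    print(asyncio.run(fan_out("Review this error log and suggest a fix.")))
```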
How Enterprises Are Actually Deploying Open-Source Models
A persistent question in enterprise AI is whether organizations are running models themselves or consuming them through APIs. The answer depends on where the company wants to invest its differentiation — and Runpod’s infrastructure gives it a unique window into how that choice plays out in practice.
Q: When it comes to open-source models, are most enterprises running them themselves or consuming them via APIs?
Brennen Smith: “That comes down to where you want to do the orchestration — where you put your special secret sauce. In every business, you have to decide where to invest, where to put your team’s effort and your dollars. Some companies are going the true API-only route: I have a model hosted somewhere, I call against it, I get outputs and responses. That’s fine. What we are seeing though, especially in our Instant Clusters product — much larger thousand-GPU clusters — is companies actually fine-tuning these models. That’s where they want to bring in their business context. I wouldn’t say the two approaches are at odds with each other. Many people pick Runpod because it gives that flexibility. Some are doing their own custom vLLM tweaks, bringing in their own inference engines with very cleverly designed proprietary optimizations. Others already have systems running internally and want to lift and shift, or do dev and test locally and ship to the same stack in production.”
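For teams on the self-hosted end of that spectrum, the starting point is usually just an inference engine pointed at an open checkpoint. A minimal vLLM example, assuming a single GPU and the Qwen2.5-7B-Instruct weights from Hugging Face (the model choice is illustrative):

```python
# Minimal self-hosted inference with vLLM on a single GPU.
# Model and sampling settings are illustrative, not prescriptive.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize our on-call runbook in three bullets."], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can be exposed as an OpenAI-compatible HTTP server with `vllm serve Qwen/Qwen2.5-7B-Instruct`, and the custom engine work Smith mentions typically begins from a baseline like this.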
Brennen Smith: “Think about AWS — they offer everything from bare metal via EC2 all the way to completely abstracted APIs. The ability to offer all of them is where I see the most value. Rather than taking a prescriptive approach, we give businesses the right tools to decide where they sit on the spectrum — and of course, we’ll be happy to help them move up or down that spectrum as they need.”
The Report’s Biggest Surprise: Blackwell Demand Explodes
One of the most commercially significant findings in the report was a dataset that, if publicly visible earlier, could have informed major infrastructure investment decisions. The story of NVIDIA’s B200 Blackwell GPU demand surge is a case study in how quickly infrastructure fundamentals can shift.
Q: Was there anything in the State of AI Report that surprised you or your team?
Brennen Smith: “The biggest one — and for anyone who read this report in time, you could have made a lot of money — is B200s. If you rewind, B200s were coming online around May or June of last year. If you look at their utilization profiles, it was not great. They were sitting very cold. Contract prices were extremely low, and frankly, everyone was really concerned about whether they’d be getting a return on investment. Then we saw in our data that all of a sudden demand was skyrocketing. It was due to a number of factors: there were software compatibility issues that were fixed, there were a couple of key firmware patches that NVIDIA released, and then finally, people were able to see that on Blackwell, time-to-first-token is ten times faster than Hopper. That’s a material difference. For the human mind, anything that’s a hundred milliseconds or less is, roughly speaking, instantaneous. Hopper was not able to achieve that for time-to-first-token. Blackwell is. Once you unlock those three capabilities, demand skyrocketed. We were able to leverage that and secure quite a bit of B200 capacity quickly. That was the biggest surprise for us — everyone was saying Blackwell is cold, B200s are not in demand. But the data said absolutely the opposite.”
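Time-to-first-token is easy to verify from the client side against any OpenAI-compatible streaming endpoint. The sketch below is a generic measurement, not Runpod’s internal benchmark; the endpoint URL and model name are placeholders.

```python
# Client-side time-to-first-token (TTFT) measurement over a streaming,
# OpenAI-compatible endpoint. URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://my-endpoint:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # First content token has arrived; stop the clock.
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```

On Smith’s framing, the number that matters is whether the printed value lands under roughly 100 ms, the point at which a response feels instantaneous.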
What Runpod Wants Developers to Understand
With nearly 750,000 developers on the platform, Runpod has a vantage point on where developer behavior is heading — and where there is meaningful untapped leverage that most teams are not yet taking advantage of.
Q: What are the things you wish your users and developers would understand so they could get more advantage out of AI?
Brennen Smith: “Let’s talk about layers of the cake. We passed 500,000 developers and we’re actually approaching 750,000 on the platform now — hats off to our marketing and growth teams. In my mind, the most important thing — and I’m sure everyone knows the infamous Steve Ballmer developer speech — every single day I ask: what can we do to make the developer experience better? What can we do to make it easier for developers to get online and innovate faster? Pace of innovation is going to be the primary gating factor right now. In any industrial evolution, the first part of the curve belongs to whoever can innovate fastest and gain an edge there. There will come a point where the focus shifts to efficiency — squeezing edge out of optimization rather than raw innovation. We’re not there yet.”
Brennen Smith: “We’re seeing a tectonic shift from humans writing code to agents writing code. We are seeing a major uptick in agent-driven usage on Runpod. I’ve had the team invest very heavily in making Runpod the best place for agents to operate. How do you market to an LLM itself? From a developer’s perspective, as you’re building tools and setting up your own toolchains: think about how to actually make agents flow together with the larger ecosystem. It’s not just about having an agent bang out some code. How do you have it start operating your stack? At Runpod, a large portion of our production stack is operated by an agent we call Supply Agent. That allows us to punch way above our weight in terms of humans needed for on-call operations. I would really recommend taking a holistic perspective — how do you leverage agentic tools not just to generate code, not just to run things, but actually as part of the operational ecosystem? Those who figure that out are set for great success.”
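Runpod hasn’t published how Supply Agent is built, so treat the following as a generic sketch of the pattern Smith describes: an agent that observes the stack and can only execute actions from a validated whitelist. The metrics, thresholds, and action names are all hypothetical.

```python
# Generic sketch of an operations agent loop, in the spirit of what
# Smith describes. This is NOT Supply Agent's implementation; the
# metrics, thresholds, and action names here are hypothetical.
import time

ALLOWED_ACTIONS = {"scale_up", "scale_down", "noop", "page_human"}

def decide(metrics: dict) -> str:
    """Stand-in for an LLM call that proposes one action. A real agent
    would prompt a model with metrics and recent incident history."""
    if metrics["queue_depth"] > 100:
        return "scale_up"
    if metrics["gpu_util"] < 0.2:
        return "scale_down"
    return "noop"

def run_agent(read_metrics, execute, interval_s: int = 30):
    while True:
        action = decide(read_metrics())
        if action not in ALLOWED_ACTIONS:
            action = "page_human"  # never act on unvalidated output
        execute(action)
        time.sleep(interval_s)
```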
Regulation, Geopolitics, and the Hardware Layer
The conversation moved to the macro forces shaping AI infrastructure — from regulatory frameworks in the EU to hardware embargoes targeting China, and the broader question of whether fragmentation at the model and infrastructure level is a problem or an opportunity.
Q: Where do you see things heading across all the layers of AI — hardware, inference, frameworks, models — and what would the ideal state look like for developers?
Brennen Smith: “Let’s start with the geopolitical side — it’s the fun one. Regulation is required. Without regulation, civilization does not function. But there are two ways to do it: you try to slow things down and gatekeep, or you are innovative in how you regulate a new emerging opportunity. The countries aligning their government organizations around positive outcomes, driving those positive outcomes through regulation, will have a much better result than those who try to gatekeep. AI is here. We’re not going to stop that tide. The matter now is making sure regulations are guardrails and bumpers rather than roadblocks.”
Brennen Smith: “On hardware embargoes — yes, they cause a temporary delay, but there’s one company I would never bet against, and that’s Huawei. There are a number of companies in China that are incredibly capable, with the mental power, the capital, and the ability to do vertical integration that is unmatched. Those regulations are actually forcing ingenuity. It’s fascinating to follow. Even without the regulations, you run into the same dynamic with economics. Amazon acquired Annapurna Labs — an incredibly talented team. Apple has Apple Silicon deployed in data centers with Apple Private Cloud Compute. GCP has their own system. Constraints bring ingenuity. When you have unlimited resources, you end up in the world of the Pentium 4 Prescott era — very inefficient, just throwing more cores and more wattage at problems. That’s when AMD was able to swoop in with a better product. I’m not saying NVIDIA is resting on their laurels — they’re absolutely moving fast and doing incredible things — but they’re being pushed by labs across the world.”
Brennen Smith: “To your question about fragmentation: with fragmentation comes opportunity. For Runpod, it’s enormous opportunity to build the best tools for our customers. We support AMD. I have good connections with the leadership team at Qualcomm. We have deep connections with Cerebras. We’ll always keep our North Star on building the best developer tools for engineers, developers, and business — and who knows what accelerators that might bring?”
Forecasting the Next Two Quarters of AI Development
Asking for AI infrastructure forecasts in years is outdated. The pace of change has compressed the planning horizon to quarters — and sometimes months. Brennen Smith shared where he sees the most significant momentum building.
Q: What forecasts do you have for the next couple of months — where are things heading?
Brennen Smith: “The most telling part of that question is that you asked me to forecast the next couple of months. Normally in this scenario, you’d ask about the next couple of years. In our industry, we talk in months and quarters. That tells you everything — it’s happening so fast. DeepSeek was a watershed moment for Runpod. The board always asks me to forecast and plan. I can’t plan for DeepSeek moments. They happen, and they’re becoming more and more common every single day. What I foresee is a greater convergence — a reduction in fragmentation. Capability that’s sitting over here and capability that’s sitting over there start to get brought closer and closer together.”
Brennen Smith: “There are going to be two layers of innovation. The first is in the models themselves — those will continue to improve. But if OpenCall proved anything, it’s that some of the biggest greenfield, blue-ocean opportunity is in the tooling and the glue. The models are getting to the point where they’re good enough; how you bring them together is now the most important thing. I’d almost guarantee someone’s going to come up with a new standard tool. Curl is the obvious example — it became the foundation of all communication on the internet. I could see that happening with AI: foundational glue tools, not AI themselves necessarily, but orchestrators of AI tools and agents. How do you blend that together? That’s the exciting space.”
Brennen Smith: “The other one I’m really excited about is small AI. You think about large AI — LLMs — as the tool for complicated problems. But how do you use small AI models to make small decisions? Small decisions are still just as important. You might be making a thousand of them an hour, or a thousand per second in certain infrastructure scenarios. It could run on a CPU, a small GPU, a sidecar accelerator. Using that as part of the normal flow of code — just like a ternary operator or a Boolean, a small call out to a small LLM — and then tying those small models back to big mega-models. We use small models for traffic routing and management at Runpod. That’s just one idea, and an obvious one. The glue code, the frameworks, how things are brought together — that’s going to be the really interesting space for the next two quarters.”
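The ternary-operator analogy translates almost literally into code. Below is a minimal sketch assuming a small Hugging Face zero-shot classifier running on CPU; the model, labels, and pool names are illustrative, and this is not Runpod’s traffic-routing system.

```python
# A small-model micro-decision inline in a request path: a CPU-sized
# classifier decides, per request, which pool should serve it.
# Model, labels, and pool names are illustrative only.
from transformers import pipeline

# A small encoder model runs on CPU (device=-1) without a GPU in the loop.
clf = pipeline("zero-shot-classification",
               model="facebook/bart-large-mnli", device=-1)

def pick_pool(prompt: str) -> str:
    result = clf(prompt, candidate_labels=["simple", "complex"])
    # The "ternary operator" moment: one tiny model call folded
    # into ordinary control flow.
    return "light-pool" if result["labels"][0] == "simple" else "b200-pool"

print(pick_pool("What's your refund policy?"))
```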
The Excitement vs. Production Gap: What Developers Are Actually Deploying
One of the most persistent dynamics in enterprise AI is the gap between what organizations say they are building and what actually reaches production. Smith offered a perspective that reframes the question entirely — and argues that the real unlock in AI is not technical but human.
Q: What is the biggest gap you see between AI excitement and actual production deployment?
Brennen Smith: “A mentor of mine said a long time ago: money can’t buy brilliance. We’re seeing that to a huge extent now. The best success people are having — one of the parts I love most about AI — is that you no longer need to be a software engineer to get an exceptional technical outcome. AI is one of the most democratizing forces in the world. The classic view is that to build a successful business you need the hacker, the hipster, and the hustler. AI now covers the hacker and the hipster. You can do design in AI. You can do the code in AI. What it really puts forward now is the human spirit. How much drive do you have? How strong is your idea? Giving someone a huge cluster of B200s or early access to a new chip no longer guarantees a good outcome. It’s now a challenge of the mind — human ideas and the grit and perseverance to make them happen.”
Brennen Smith: “I can’t remember the exact company name, but it was an individual — I think he and his brother — who created a $200 million run-rate Ozempic company, just the two of them and a suite of AI tools. That’s a perfect example of the human mind being unlocked by AI. Every time someone says AI is going to destroy civilization or destroy humanity, what I would actually say is it unlocks humanity. It makes it so that we are now able to unconstrain the goals and ideas we’ve had for so long. Patrick Kennedy wrote a fascinating piece on the use of AI in a hardware store — very generic brick-and-mortar retail — using it for shoplifting detection, popularity analysis, dwell-time mapping in the store, and store layout optimization, and it was helping a small mom-and-pop store achieve better revenue margins. There’s a lot of doom and gloom in this industry, but I think that’s the wrong approach. The right approach is: I can now do things I’ve always wanted to but have never been able to. Sky’s the limit. What do you do with it? What do you harness it for? That’s the best part about AI — and frankly what I love to see and support every single day at Runpod.”