Why Centralized Cloud Fails for AI Inference and How to Fix It | Ari Weil, Akamai | TFiR

Production AI inference is a distributed, bursty, latency-sensitive workload. Centralized cloud architectures designed for training cannot serve it effectively. As GPU utilization climbs in a single region, queuing delays grow exponentially and batching decisions that worked at low load begin adding hundreds of milliseconds to time to first token.

In this interview on TFiR, Ari Weil, VP of Product Marketing at Akamai, breaks down the findings of Akamai’s 2026 State of AI Inference Survey and walks through how Akamai is addressing the inference gap through distributed GPU orchestration, proximity-based routing, and the AI grid initiative built in partnership with Nvidia.

Guest: Ari Weil, VP of Product Marketing at Akamai
Show: TFiR

Here is what every platform engineer and AI infrastructure architect needs to know.

Technical Deep Dive

Q: What is the 2026 Akamai State of AI Inference Survey and what did it set out to measure?

Ari Weil, VP of Product Marketing at Akamai, explains that the survey was designed to capture how technologists across industries and geographies were assessing their own readiness for AI, what challenges they anticipated, and where distributed infrastructure could help. The goal was to understand what kept practitioners up at night as AI adoption moved from experimentation into production across multiple sectors. The survey surfaced a significant gap between where inference workloads are running today and where practitioners believe they need to run.

“We wanted to get a better feeling from technologists across the globe and across industries how they were really thinking about the readiness that they felt themselves, how their teams and their organizations felt about AI, what sort of challenges they were going to be facing.” — Ari Weil, VP of Product Marketing, Akamai

Q: Why does moving from AI pilots to production inference require a completely different infrastructure approach?

Weil draws a sharp distinction between training and inference: training is a concentrated, predictable, throughput-optimized activity, while inference is distributed, bursty, and latency-sensitive. Enterprises that assume production inference will behave like training but faster are building on a flawed premise. Every inference request is a real person or agent waiting for a response, which means any added latency directly degrades the user or agent experience in a way that training jobs never expose.

“Training is a concentrated activity. It’s predictable. But in the case of inference, it’s exactly the opposite. You’re dealing with a distributed, bursty and really sensitive workload that is completely dependent on the user asking a question and expecting a response.” — Ari Weil, VP of Product Marketing, Akamai

Q: Why is model size selection a critical infrastructure decision for inference workloads?

Weil argues that large frontier models with hundreds of millions of weights are often the wrong choice for domain-specific inference tasks. Smaller, purpose-built models reduce hallucination risk because they carry less irrelevant information, consume fewer compute resources, and respond faster. Matching model size to the scope of the question being answered is an architectural decision with direct infrastructure cost and latency implications.

“You’d be much better off having fewer weights, smaller models, something that you can ask a domain-specific question to and not worry so much about hallucinations because too much information was inside of the model.” — Ari Weil, VP of Product Marketing, Akamai
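
As a rough illustration of why model size is an infrastructure decision, the sketch below estimates the memory needed just to hold a model's weights. The parameter counts and the helper name `weight_memory_gb` are assumptions made for this example and do not come from the interview; the bytes-per-parameter values are the standard sizes of the named numeric formats.

```python
# Weight-only memory estimate. Real deployments also need memory for the
# KV cache, activations, and runtime overhead, which this sketch ignores.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes chosen only to show the scaling.
    for label, params in [("1B", 1e9), ("8B", 8e9), ("70B", 70e9)]:
        print(f"{label:>4} params at fp16: {weight_memory_gb(params):6.1f} GB of weights")
```

Because the estimate is linear in parameter count, a model with a tenth of the weights needs roughly a tenth of the accelerator memory, which is part of what makes smaller models a fit for smaller, distributed sites.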

Q: Why is latency a structural architecture problem for AI inference rather than a tuning problem?

Weil explains that latency in AI inference is the sum of two components: round-trip network time from user to inference endpoint, and token streaming time for the response. Neither component can be tuned away after the architecture is chosen. If the infrastructure is geographically distant from the target user population, physics sets a floor on round-trip time that no software optimization can overcome. Treating latency as a post-deployment tuning problem rather than an architectural constraint is one of the most common and costly mistakes enterprises make.

“It’s not a tuning problem to make something that might be too slow go faster. It might be a fundamental characteristic of the architecture you’re building. You can’t architect your way out of physics.” — Ari Weil, VP of Product Marketing, Akamai
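
The two components described above can be put into a small back-of-the-envelope calculation. This is a sketch under stated assumptions: the distances, token counts, and throughput are invented for illustration, and the function names are hypothetical. The one physical constant used is that light in optical fiber covers roughly 200 km per millisecond, which gives a lower bound on round-trip time that real networks will exceed.

```python
# Latency budget = network round trip + time to stream the response tokens.
# All workload numbers below are assumptions for the example, not survey data.

SPEED_IN_FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Physical lower bound on round-trip time for a one-way fiber distance."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS


def response_latency_ms(distance_km: float, output_tokens: int,
                        tokens_per_second: float) -> float:
    """Round-trip floor plus the time needed to stream all output tokens."""
    streaming_ms = output_tokens / tokens_per_second * 1000
    return min_round_trip_ms(distance_km) + streaming_ms


if __name__ == "__main__":
    for label, km in [("same metro", 50), ("cross-continent", 4000),
                      ("intercontinental", 10000)]:
        print(f"{label:>16}: round-trip floor {min_round_trip_ms(km):6.1f} ms")
```

The round-trip term depends only on where the endpoint sits relative to the user, which is why it can be measured before any model is deployed and cannot be reduced by tuning the model afterwards.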

Q: Why do enterprises default to single centralized cloud regions for inference and what does the survey say about that gap?

Weil points to inertia as the primary driver: teams default to familiar deployment targets like AWS US-East because that is where existing application code lives. The survey quantifies the resulting gap: 60% of practitioners say proximity to the user is critical for inference, yet 46% of inference workloads are still running in a single centralized cloud region. The view shifts with maturity: only about 30% of teams in early experimentation identify proximity as critical, compared with 77% of companies running core business workloads. Among the workloads where proximity is recognized as critical, Weil says fewer than 14% remain in a single region and more than 85% are distributed.

“When it’s your core business workload, 77% of companies said that proximity is critical. And I think that really speaks to when you start figuring out the business impact of the workload that you’re building, that’s where people are starting to figure out, unfortunately, too late in the process.” — Ari Weil, VP of Product Marketing, Akamai

Q: How does GPU saturation cause non-linear latency degradation in centralized inference deployments?

Weil describes a compounding failure mode: centralized inference architectures assume latency remains roughly constant as load increases, but this assumption breaks down under GPU saturation. As utilization climbs, queuing delays grow exponentially, and batching strategies that performed well at low load begin adding tens or hundreds of milliseconds to time to first token. Geographic routing delays are then layered on top of queuing delays, creating a dual constraint that reveals the limits of a centralized architecture under real production load.

“Your queuing delays start to get exponentially larger and batching decisions that used to work at low loads start adding tens or hundreds of milliseconds to that time to first token. Your load is going to reveal the sort of architecture that you chose.” — Ari Weil, VP of Product Marketing, Akamai
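
The non-linear behavior described above can be illustrated with a textbook single-server queue (M/M/1). This is a simplification chosen for the example, not a model taken from the interview or the survey: real GPU serving involves batching and multiple devices, and in this model the wait grows as 1 / (1 - utilization) rather than strictly exponentially. The function name and the 50 ms service time are assumptions.

```python
# M/M/1 queue as a stand-in for a single saturating GPU pool.
# Mean wait before service starts: Wq = rho / (1 - rho) * service_time.


def mean_queue_wait_ms(utilization: float, service_time_ms: float) -> float:
    """Mean queueing delay for utilization rho in [0, 1)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization) * service_time_ms


if __name__ == "__main__":
    service_ms = 50.0  # assumed mean GPU time per request
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        wait = mean_queue_wait_ms(rho, service_ms)
        print(f"utilization {rho:.2f}: mean queue wait {wait:7.1f} ms")
```

Under these assumptions, going from 50% to 90% utilization multiplies the mean wait by nine, and each further step toward saturation costs more than the last, before any geographic round-trip time is added on top.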

Q: What is the Akamai AI inference orchestrator and what five routing axes does it optimize across?

Weil describes Akamai’s inference orchestrator as a load-balancer-like system built specifically for the multi-dimensional requirements of distributed inference workloads. The orchestrator routes requests across five axes: time to first token, minimizing latency to the fastest available GPU resource; cost per token, routing to the most economical available capacity for the workload; model affinity, pinning workloads to the right model and switching when a closer or better-fit model would serve better; available capacity and power headroom; and pure geographic proximity or routing around infrastructure problems. The orchestrator was demoed at GTC San Jose and is planned for demonstration again at an event in Berlin in October.

“Those load balancing characteristics are all going into the Orchestrator and we think that software and the design we are building now will help to really take advantage of all of the different carrier networks out there to start scaling AI.” — Ari Weil, VP of Product Marketing, Akamai
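
One way to picture routing across the five axes is a weighted score per candidate site. The sketch below is purely illustrative and is not Akamai's orchestrator: the `Site` fields, the weight names, and the scoring formula are all hypothetical, chosen only to show how changing the business objective changes the routing decision.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A candidate GPU location; every field is a made-up example input."""
    name: str
    ttft_ms: float             # estimated time to first token from this site
    cost_per_1k_tokens: float  # price of serving from this site
    has_model: bool            # requested model already loaded (model affinity)
    free_capacity: float       # available capacity/power headroom, 0..1
    distance_km: float         # distance from the requester
    healthy: bool = True       # False means the site should be routed around


def score(site: Site, weights: dict[str, float]) -> float:
    """Weighted sum over the five axes; lower is better."""
    return (
        weights["ttft"] * site.ttft_ms
        + weights["cost"] * site.cost_per_1k_tokens
        + weights["affinity"] * (0.0 if site.has_model else 1.0)
        + weights["capacity"] * (1.0 - site.free_capacity)
        + weights["proximity"] * site.distance_km
    )


def pick_site(sites: list[Site], weights: dict[str, float]) -> Site:
    """Drop unhealthy or full sites, then choose the lowest score."""
    candidates = [s for s in sites if s.healthy and s.free_capacity > 0]
    if not candidates:
        raise RuntimeError("no site available")
    return min(candidates, key=lambda s: score(s, weights))


if __name__ == "__main__":
    sites = [
        Site("near", 80, 0.9, False, 0.2, 100),
        Site("far", 200, 0.3, True, 0.7, 3000),
        Site("down", 50, 0.1, True, 0.9, 10, healthy=False),
    ]
    latency_first = {"ttft": 1.0, "cost": 0.0, "affinity": 50.0,
                     "capacity": 10.0, "proximity": 0.01}
    cost_first = {"ttft": 0.1, "cost": 100.0, "affinity": 50.0,
                  "capacity": 10.0, "proximity": 0.0}
    print("latency-first ->", pick_site(sites, latency_first).name)
    print("cost-first    ->", pick_site(sites, cost_first).name)
```

With these example weights the latency-first policy picks the nearby site even though the model is not yet loaded there, while the cost-first policy accepts the longer path to a cheaper site that already holds the model. A production router would also need live telemetry and session pinning, which this sketch leaves out.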

Q: What is the Nvidia AI grid and how does Akamai fit into it?

Weil explains that the AI grid is a concept Akamai and Nvidia developed together, premised on the idea that realizing the full potential of the third scaling wave of AI requires distributed compute backed by telco-scale global network distribution. Akamai’s contribution is its 28-year-old geographically distributed network of smaller air-cooled data centers, which are well suited to inference workloads that do not require the same power, cooling, or cabinet density as training clusters. The partnership involves Nvidia’s Bluefield DPUs for fast switching, the RTX Pro 6000 GPU line, and forward planning around the Vera Rubin architecture. Weil names T-Mobile as another telco partner in the initiative and calls for additional carriers to join.

“If we can start routing the workloads and have this version of an orchestration capability available across multiple telco networks, then we have the ability to really create that grid at a global scale.” — Ari Weil, VP of Product Marketing, Akamai

Q: How does Nvidia’s NeMo and open-source AI stack fit into the distributed inference architecture?

Weil references Nvidia’s NeMo as an attempt to harness the momentum around open-model frameworks while adding enterprise-grade security guardrails and governance capabilities. The broader Nvidia hardware and software stack creates a layered architecture where some activities execute in software and others execute directly on hardware, with the Bluefield DPU enabling fast switching that complements software-layer orchestration. This layered approach is central to how Akamai and Nvidia are thinking about efficient inference routing at scale.

“There’s really this incredible ability to do certain amounts of your activities in software and certain amounts in the hardware, and in some cases to have that synergy between the really quick switching times you can achieve on the Bluefield architecture.” — Ari Weil, VP of Product Marketing, Akamai

Q: Where does the centralized cloud model break down for inference-heavy workloads, and how does AI sovereignty factor in?

Weil argues that the centralized cloud was architected for an era when compute was a destination you traveled to, driven by the concentration of GPU cards, power, and storage. The agentic era inverts this: compute needs to come to the user. He notes that while sovereignty and privacy concerns in EMEA are a logical assumption for driving distributed deployments, the survey data shows the majority of businesses demanding distributed inference are actually located in North America and Asia-Pacific, specifically the United States, China, India, and Japan, driven primarily by scale and latency requirements rather than regulatory mandates.

“In the agentic era, we need something different. We need companies that are going to bring the compute closer to you.” — Ari Weil, VP of Product Marketing, Akamai

Q: How will training and inference workloads separate across different cloud infrastructure types by 2030?

Weil describes a bifurcating cloud ecosystem where hyperscalers and providers like Oracle OCI continue building centralized AI factories with heavy GPU hardware such as H100s, H200s, and L40s optimized for batch training workloads, while broadly distributed networks like Akamai’s build out globally distributed inference infrastructure suited to air-cooled, smaller data center footprints. He frames the emerging challenge as a capacity planning and vendor roadmap alignment problem: businesses need to match their inference scaling requirements to providers whose geographic build-out roadmaps align with where their users are. By 2030, Weil expects this dynamic to have fundamentally reshaped the cloud ecosystem, with specialized distributed inference providers emerging from the current field.

“It’s going to be what is the specialization you need, who’s got the network and the ability to deploy that specialization based on where you need it, and how can you partner with them early enough to ensure their delivery roadmap aligns to yours.” — Ari Weil, VP of Product Marketing, Akamai

Q: How should enterprises approach governance when AI inference runs across distributed environments?

Weil frames governance as inherently situational, varying by industry maturity, regulatory environment, and risk tolerance. Heavily regulated industries including banking, insurance, trading, government, and education face specific compliance burdens around audit trails, logging, and non-deterministic outputs that make autonomous AI pipelines particularly challenging. He identifies two distinct governance dimensions: compliance guardrails tied to regulatory attestation requirements, and risk management guardrails governing how much autonomy AI is allowed in a concept-to-production pipeline. Weil references NIST frameworks and MITRE ATT&CK mapping as starting points and notes Akamai’s global services team works with customers on architecture, testing scenarios, scaling, and penetration testing as they move to production.

“I don’t believe that we have well-established frameworks for this, but there have been frameworks like using NIST, like mapping things to MITRE ATT&CK, to really understand where you feel like you need a given set of checks and balances.” — Ari Weil, VP of Product Marketing, Akamai

Q: Which industries are early adopters of distributed AI inference and why?

Weil identifies media and technology as the clearest early adopters, driven by the need to process high-fidelity video streams at 4K and 8K, perform transcoding, anomaly detection, speech-to-text and text-to-speech across multiple languages, and distribute derivative content globally with minimal latency. Gaming is a second sector, where techniques developed for neural-network-driven immersive environments and early metaverse work are being repurposed for physical AI and robotics use cases in healthcare, manufacturing, and defense. Retail and commerce round out the early adopter set, where agentic web experiences and personalized recommendation engines are driving rapid adoption of edge computing and serverless-style inference patterns.

“People that have cut their teeth on creating early versions of the metaverse and immersive gaming are now starting to think about how they can optimize assembly lines and healthcare, and thinking about even modern warfare and the way that AR and VR are being brought into the real world.” — Ari Weil, VP of Product Marketing, Akamai

Resources & Documentation

  • Akamai, distributed cloud platform for AI inference, security, and content delivery
  • Nvidia Bluefield DPUs, data processing units enabling fast hardware-layer switching for AI workloads
  • Nvidia RTX Pro 6000, GPU line referenced for distributed inference deployments
  • Nvidia Vera Rubin architecture, next-generation GPU architecture referenced in AI grid roadmap planning
  • Nvidia NeMo, framework for building enterprise AI applications with security guardrails and governance capabilities
  • NIST AI frameworks, referenced for AI governance and risk management mapping
  • MITRE ATT&CK, adversarial tactics framework referenced for AI security governance and checks and balances

***

Full Raw Transcript

Swapnil Bhartiya: Akamai just released their 2026 State of AI Inference Survey that reveals a looming challenge. Modern enterprise infrastructure is lagging behind the demand for AI inference. As AI evolves beyond pilots and chatbots, and we are seeing it in the market already, inference becomes a critical infrastructure hurdle. Centralized cloud architectures are a modern restraint. They’re pushing the enterprise landscape towards a hybrid approach. So how can businesses bridge this gap? To talk about that, we have with us today Ari Weil, VP of Product Marketing at Akamai. Ari, it’s great to have you back on the show.

Ari Weil: It’s great to be back on the show. It’s good to see you, Swapnil.

Swapnil Bhartiya: Yeah, it’s my pleasure. Before we deep dive into some of the challenges that enterprises face, let’s talk about this survey. Talk a bit about the survey and some of the major findings.

Ari Weil: The industry is coming to grips with AI and we think about what you need to build, how you need to think about scaling it, securing it, and ultimately how enterprises are going to engage a company like Akamai to ask us how we can help fit into their overall architecture. We wanted to get a better feeling from technologists across the globe and across industries how they were really thinking about the readiness that they felt themselves, how their teams and their organizations felt about AI, what sort of challenges they were going to be facing. And basically, you know, as we think about the adoption of this critical capability across a number of different sectors, what were the things that basically kept them up at night? What were they the most excited about? And ultimately where could distribution help them? And we got some really, really interesting answers.

Swapnil Bhartiya: Of course we can go through some of your findings, but I have one query before we go into them. These days, most of the time pilots fail, but when AI moves from pilots and chatbots into real production applications, what changes, where enterprises have to start thinking about inference architecture differently than just doing some pilot projects?

Ari Weil: Well, I think the biggest thing that we deal with a lot is as the industry is thinking about moving away from just purely leveraging AI factories. Because a lot of the early practitioners, a lot of the early scaling of AI, and something that Jensen Huang talks about even at his GTC conferences for Nvidia, is the centralization of compute so that you could get access to the cards that you needed. You had access to the power and the space and the ultimate ability that you needed to scale to the token consumption that a lot of the frontier LLMs have been leading the industry with. People just assumed that as they started to move to production, it was going to be like training, but faster. And that’s actually not true. It’s not true that if you can just train your model, then you’ll be able to serve it to people at scale. And there are a number of different reasons that we’ve discovered this. One reason is the training of a large language model and having, you know, hundreds of millions of weights, for example, that are inside of this model is that many times people don’t need all of that wealth of knowledge to be readily available inside of the model. In some cases, you’d be much better off having fewer weights, smaller models. Something that you can ask a domain-specific question to and not worry so much about the sorts of things like how long it would take the query to return or if there would be hallucinations because too much information was inside of the model that you’re accessing. But that’s not really the main problem. The main problem when we think about going from training to actually leveraging something for inference is that you’re solving for a completely different infrastructure problem. And when you treat them the same way, your AI program is likely to stall or fail. And the reason is training is a concentrated activity. It’s predictable. You have a certain latency tolerance that you typically will build into a project.
When you are training a centralized model, you create the dense GPU cluster you need, you let the queries run, you let the training algorithms run, and you optimize them for throughput for a certain amount of time or to reach a certain capability that you’re trying to train for. But in the case of inference, it’s exactly the opposite. You’re dealing with a distributed, bursty and really sensitive sort of a workload that is completely dependent on the user asking a question and expecting a response. So every response today is either a real person or a real agent that’s waiting for an answer. And every time that you add latency to that equation, you can see that, you know, businesses are experiencing the sort of challenges that they experienced in client-server interactions, because the way that you’re interacting with an inference-based chat experience (in most cases it is still a chat medium) is expecting that instantaneous result, and you just can’t get it when you plan for that centralized model.

Swapnil Bhartiya: So in other words, latency becomes kind of defining. Now you also mentioned weights, and that is true. For example, this is totally unrelated, but if you look at applications, you really don’t want a heavy monolith application to do a small task, right? It consumes a lot of system resources. So I think that same idea applies. Even I personally, for a lot of automated things, you just need a small model. Actually it will be much more efficient, with less hallucination, than, you know, a 120B. I mean, I run a lot of things locally also. They’re powerful. But you don’t need to strap a V12 engine to a shopping cart, right? You just need a tiny thing like that. So I think sometimes people assume that bigger models are more powerful models, but that is wrong. Now let’s talk about latency. Earlier I was asking about some of the major findings of the survey; let’s talk about latency first, because this is also something which is in the ballpark of Akamai, what you focus on with your footprint and presence. Why is latency becoming such a defining issue for enterprises when it comes to AI inference? And how can Akamai address that challenge?

Ari Weil: I think answering that question really comes down to asking yourself what use case and what set of users are you expecting to serve with your AI application? And I think once you can answer that, you know, distinctly and uniquely well, then you really have to start treating this challenge like any other sort of an infrastructure and architectural challenge. So, for example, if when you build your proof of concept, you are attempting to serve an answer to people in a given region, then you should probably be thinking, if that is going to be my target demographic, then I need to make sure that I have infrastructure and failover capabilities to uniquely serve that region. However, what we find is still because of the pervasiveness of deployments and the way that people tend to think about a couple of centralized locations to deploy their models on hyperscalers in the US and even in places around the globe. Amazon US East is a very heavily used data center. It’s a default for many people to think about. It’s where they deploy a lot of their code. The problem is, before you start thinking about privacy and sovereignty and how your data needs to be routed and even latency concerns, you have to ask yourself, are the people that I’m serving going to be served successfully and effectively from US East or Virginia in the United States? And if the answer is no, you need to stop right there and then think about how am I going to be architecting for the constituency that I’m trying to reach? Beyond that, though, I do think that people wait too long to think about scaling events, whether it’s scaling up infrastructure capacity for additional CPU or memory headroom, for example, or GPU headroom. Or if it’s scaling out when you need to get closer to users, people have a tendency to do this in the wrong order. They think they’ll build the proof of concept, engage their target audience, get to a certain critical mass of users and then start to scale out.
And the problem is you might never hit that scaling opportunity if you can’t provide the sort of experience that people expect. And so I think the thing is you’re not really thinking about latency in the right way. It’s not a tuning problem to make something that might be too slow go faster. It might be a fundamental characteristic of the architecture you’re building, in which case you have to think you can’t architect your way out of physics. If I have to get a round trip time from a location to a destination, you can test that without your full AI proof of concept. Then you add in how long does it take the question that I’m answering, the tokens that I have to stream back to get a response for the query that I’m building and add that to whatever your round trip latency time is. And in many cases that is taking some of the lessons that we’ve learned from building the last 15 to 20 years of the Internet and paying those forward to think about again client server architectures and networking times. Because from the latency perspective, it really is milliseconds that matter and the speed of light is going to get in the way of those milliseconds.

Swapnil Bhartiya: So very true. When it comes to AI, we can use the term observability here if possible. I think it may open a new set of tools also to measure that. So what is happening? What kind of tools, what kind of resources are available, not only to measure latency, but other factors? Because I was talking to somebody, I think Edu company, and they are coming out with a whole new modern networking switch. They have a lot of investment there because, the thing is, you have all the GPUs and Akamai servers sitting there, you have powerful machines, but the networking is not designed for this kind of workload. So we need new kinds of switches. So talk a bit about what kind of resources are available, and as AI workloads become more demanding when it comes to latency and other factors, how do you see this sector evolving?

Ari Weil: We have been working closely with Nvidia to help push forward this concept of an AI grid, something that, you know, Nvidia believes very strongly. Something that we talked about together at the GTC San Jose conference earlier this year was this notion that to really achieve the full benefits and to realize the potential of this third scaling wave of AI, AI was going to need to be distributed and we were going to need the collective intelligence and network distributions of the telco providers globally to help to provide that scaling. And what we’ve been working on with them is how to take things like their Bluefield DPUs, how to take their RTX Pro 6000 line of GPUs, how we’re thinking forward to the future of what they’re enabling with their Vera Rubin architecture. And how do you distribute across a hybrid cloud environment the sorts of orchestration resources that are required, one, to just make good with all of the existing network capacity that exists out there to get from a core location out to a GPU, wherever it might be deployed. You don’t need specialized Nvidia hardware to do that when you just need raw compute. And sometimes that’s all companies are looking for today, is just that raw access to compute power. The other thing that we think about is within the Nvidia ecosystem, they’ve done a great job of building out both open source and open capabilities in their architecture. If we think about, as an example, Nemo Claw is something that they’re trying to do to seize that wave or harness that wave of enthusiasm and sort of excitement around OpenClaw, but to add things like security guardrails and other governance capabilities to make it safer to scale that capability, especially for enterprise use cases.
But then if you think about what they’re building into their hardware and software stack, there’s really this incredible ability to do certain amounts of your activities in software and certain amounts in the hardware, and in some cases to have that synergy between the really, really quick switching times that you can achieve by doing things, for example, on the Bluefield architecture and the DPUs, to what we’re able to do from an orchestration capability in the software perspective. And so where Akamai is excited and where we’re bringing a capability now to market is with this notion of an orchestrator for the AI grid. And that orchestrator is designed to basically do load-balancer-like things, but across the spectrum of what you might need to run an inference workload. So if you think about that, it could be how do I get you to the fastest time to first token for the workload that you’re running. If your business objective is that you need to start streaming tokens as quickly as possible, then part of the Orchestrator code is to route you to available GPU resources on the network that you’re accessing as quickly as possible to maintain time to first token at the lowest possible latency possible. I said possible too many times. The other things that we think about are how much do you have for your overall workload? So what is the cost of the overall token workload that you are going to be streaming and can we route you based on again the business logic for the query to an area where you will receive ultimately the best value or cost per token for what you’re attempting to run? In some cases you might be willing to trade off latency and cost per token in your business equation and so we can route you to that. Another one is model affinity. We talked a little bit about not necessarily always needing the largest model. You don’t always need quote unquote the smartest model. You need the model that’s right for your workload.
And Akamai is building out a capability that we actually demoed at the prior GTC and we’ll be demoing again in Berlin in October to show how we can route you to the best model and then keep you pinned to that model when the use case calls for it, but then switch you if either going to another model or going to a closer model might serve the workload better. And then from there we also just think about pure proximity or the overall scaling capability or the scaling limit that you need to reach for your workload. So if you think across those five axes, is it the time to first token? Is it the cost for the token? Is it the amount of available capacity and power that I need to route to? Do I need to get you and keep you on the right model or am I ultimately just trying to optimize this for routing around problems? Those load balancing characteristics are all going into the Orchestrator and we think that that software and the design that we are building now is something that will help to really take advantage of all of the different carrier networks out there to start scaling AI. And we’re hoping that more telco partners like ourselves and T-Mobile will join the AI grid sort of movement with Nvidia and start making this available across more multi-cloud and hybrid architectures. Because at the end of the day, if we’re deploying our architectures in containerized environments, those containerized environments hold the promise of multi-cloud scale. And if we can start routing the workloads and have this version of an orchestration capability that I outlined at a very high level, available across multiple telco networks, then we have the ability to really create that grid at a global scale.

Swapnil Bhartiya: The interesting thing is that with all this AI, whether we are looking at inferencing, I mean that’s what we focus on, we are not training models as companies, cloud is going to play a role. It is actually the foundation of it. Can you talk about where the traditional centralized cloud model starts to break down for inference-heavy workloads, where we also need to look at a decentralized approach? We also hear a lot about neo clouds these days, especially in Europe, because of the whole AI sovereignty, digital sovereignty push that is going on in Europe a lot, but it will also become a global phenomenon.

Ari Weil: Yeah, well, I think look, going back to the survey results that we saw, 60% of the practitioners that we surveyed said that proximity to the user is critical. But we also saw that in 46% of cases, their inference workloads are still running in a single centralized cloud region. And I think that gap is where we are starting to see people really learning some tough lessons about how to architect across neo clouds, hyperscalers, alternative clouds, and the rest of the folks who are competing in the ecosystem. And I think this is also an area where a lot of the tier one analysts are trying to really catch up and adjust their taxonomies. Because if we look at data by maturity stage, then people in their early experimentation, only about 30% really identify proximity as critical. But when it’s your core business workload, 77% of companies said that that proximity is critical. And I think that really speaks to when you start figuring out the business impact of the workload that you’re building and where you actually rely on that sort of round trip time. That’s where people are starting to figure out, unfortunately, too late in the process. I mentioned before, you know, when you do your first proof of concept or proof of value or whatever you consider it, that’s when you need to start building in the infrastructure and the architectural relevance of where you are serving that pocket. Because if we look at the people who are running single region deployments, and I said 77% of those workloads, they realized that proximity was important. Under 14% of those workloads are deployed in a centralized region or a single region. And I think that just shows you when you get down to over 85% of workloads being distributed. When they realize that latency matters and proximity is an important part of that latency equation, then you have a much clearer picture on trajectory. And then you start asking yourself, where are those businesses located?
If you thought that they were located in areas where, for example, sovereignty and privacy are top-billed items, like across EMEA, that would be a logical conclusion to draw, but it’s not the case. The majority of those businesses are located in North America, specifically the United States, or they’re in Asia Pacific, primarily in places like China, India and Japan, where we see a lot more scale and more aggressive scaling of these inference workloads. And so I think what we’re realizing is that the centralized cloud was built for an era when compute was something that you went to. You had to go where people had cards, horsepower, storage, et cetera. In the agentic era, we need something different. We need companies that are going to bring the compute closer to you. And more and more, not just users, but also people who are architecting their next applications are going to start being a lot more selective about picking the right card and the right infrastructure for their workloads and putting that in the right place for the users that they have to deploy to. That’s a completely different calculus than you used to apply in a centralized cloud region. Because now it’s not so much about just reserved instance capacity and committed revenue. Now it’s really thinking about where and when do I have to scale? Which is going to move us, I think, to a lot more of a just-in-time sort of deployment architecture than we’ve had in the last 20 years.

Swapnil Bhartiya: As AI goes more and more into production, and it is already in production, do you see AI infrastructure becoming more hybrid, where training will be centralized but inference will be more distributed? As you also mentioned, practitioners do want inference to be closer to the user, but it’s still centralized, far from them. So how do you see where training versus inference will run?

Ari Weil: Well, I think the interesting thing is where we look at tokenomics and some of the companies that are really starting to push that idea forward and to give us a lot more insight into how a workload needs to align to a certain spec of machine, to a certain amount of, for example, RAM that’s available to GPU and CPU processing. The thing that’s happening is we need to better understand that scaling isn’t always linear. Centralized inference architectures are always going to assume that latency will roughly stay the same as the load increases, and that doesn’t actually occur. As GPU utilization climbs and you start thinking about saturating the available GPUs in a given location, your queuing delays start to get exponentially larger, and batching decisions that used to work at low loads start adding tens or hundreds of milliseconds to that time to first token. Because your response time is just going to start shooting up the more that you have in queue, and the more that you’re going to be sensitive to either adding additional GPUs to the footprint that you have, or having GPUs that might not be close enough to one another because you didn’t plan for the appropriate size of GPU cluster and you’re now routing away from where some of your requests are coming from. So you start thinking about, I’ve got a scaling dimension for how much I can scale up my GPUs, and I have a challenge around geographic dimensions. So you think about round-trip times now being exacerbated by queuing times where I have to get access to resources in my centralized deployments, and you start realizing that your load is going to reveal the sort of architecture that you chose. And so I think if we put all of this stuff together, the really interesting challenge is going to be how different cloud providers.
Whether you’re a neocloud that’s primarily focusing on a lot of the tensor core architectures, the GPUs that you need to drive AI workloads, classically it would have been more of the training and less of the inference workloads. And then we see everybody diversifying and starting to add more of the cards that you need for inference. And then we start seeing these sort of hybrid or alternative types of clouds pop up. And you can’t use either word because they already mean something in the cloud space. But you have people that don’t have as many locations. They might not have as much hardware, but they’re very specialized for a given domain. Now you’ve got this really challenging architectural problem of how do I start planning my capacity based on available providers and help my providers understand how they should be building out their next infrastructure buys. Because if I’m a company like Akamai, I’ve spent the last 28 years building out a very broadly geographically distributed network of relatively small data centers that I can now use primarily for inference workloads because they lend themselves to it. They’re an air-cooled architecture. I don’t need the same size cabinets, I don’t need as much power, I don’t need as much cooling. And so I will be able to build out a lot of globally distributed inference architecture. If I’m a hyperscaler, somebody like an OCI for example, or Oracle, they are now investing and really doubling and tripling down on a centralized infrastructure where they want to build out AI factories with very heavyweight GPUs like H100s, H200s, L40s and things of that nature that are specifically tuned for batch types of workloads. Well, how do I start going from my centralized batch to my distributed inference when I’m a business that’s trying to scale my AI?
And the answer is you start to really diversify your vendor pipeline and think about who’s going to be providing you with access to the compute and GPU, and maybe in the future TPU, architectures that you’re going to need for your workload, and where those workloads are being deployed. And then you start to really think about your roadmap and your partners based on where they’re going versus where they are, because you need the ability to steer them in the direction that is going to be consistent with the workload and the architecture that you need. And at this point there are some very large and very strategic companies that are steering this path for a lot of us out in the marketplace because they have the capital, the forethought and the maturity to already be steering people in a given direction. But I think that is going to be the new arms race. It’s not going to be centralized data centers, it’s not going to be the raw power or cooling resources that we’ve been seeing for, call it, the last three years very intensively. Now it’s going to be: what is the specialization you need, who’s got the network and the ability to deploy that specialization based on where you need it, and how can you partner with them early enough and be meaningful enough in their roadmaps to ensure that their delivery roadmap aligns to yours? That is a really interesting challenge that we’re currently facing in 2026, going into 2027 and beyond. But I think by the time we get to 2030, it will have completely reshaped what the cloud ecosystem looks like, with some new specialized providers really emerging out of the current set because they will have evolved to treat these specialized needs of distributed low-latency inference workloads.
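Weil’s point that latency does not stay flat as GPU utilization climbs, and that round-trip time then stacks on top of queuing time, can be illustrated with a textbook queuing model. The sketch below is not from the interview or the survey: it assumes a single-server M/M/1 queue, and the function names, the 50 ms service time, and the round-trip and utilization figures are all hypothetical values chosen for illustration. In this model the wait grows as ρ/(1 − ρ), which is steeply nonlinear near saturation rather than literally exponential, and it ignores batching entirely.

```python
def mm1_wait_ms(service_ms: float, utilization: float) -> float:
    """Mean time a request waits in queue before service starts.

    Assumes an M/M/1 queue (an illustrative simplification):
    Wq = S * rho / (1 - rho), where S is the mean service time
    and rho is the utilization of the server.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms * utilization / (1.0 - utilization)


def time_to_first_token_ms(rtt_ms: float, service_ms: float,
                           utilization: float) -> float:
    """Rough end-to-end estimate: network round trip + queue wait + service."""
    return rtt_ms + mm1_wait_ms(service_ms, utilization) + service_ms


if __name__ == "__main__":
    SERVICE_MS = 50.0  # hypothetical time to produce the first token

    # Queue wait at increasing utilization: flat at first, then steep.
    for rho in (0.3, 0.5, 0.7, 0.9, 0.95):
        print(f"utilization {rho:.2f}: wait {mm1_wait_ms(SERVICE_MS, rho):7.1f} ms")

    # Hypothetical comparison: a distant, saturated central region
    # versus a nearby, lightly loaded one.
    central = time_to_first_token_ms(rtt_ms=120.0, service_ms=SERVICE_MS,
                                     utilization=0.9)
    nearby = time_to_first_token_ms(rtt_ms=20.0, service_ms=SERVICE_MS,
                                    utilization=0.5)
    print(f"central region: {central:.0f} ms, nearby region: {nearby:.0f} ms")
```

Under these assumed numbers, the gap between the two regions comes mostly from queuing rather than from the network, which is one way to read the remark that load reveals the architecture you chose.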

Swapnil Bhartiya: Let’s now also talk about governance. How should enterprises think about governance when inference is running across more distributed environments?

Ari Weil: So the interesting thing about governance is that it’s coming together iteratively, like so much of the architecture and so many of the workloads are as well. It’s easy to say that you should have a human in the loop. It’s easy to say that you should buy things like API gateways and AI gateways and start thinking about testing harness evolutions and how you should be evaluating Claude code and having people review that before it goes into production environments. But I think the reality is that across the industry, the governance question is really going to be very situational and specific to the business, to their maturity and to the environment that they exist in or are selling into. So, for example, if I’m part of a traditionally heavily regulated industry, like the education industry, government, or financial services like banking, insurance or trading, there is a given set of rules that I have to abide by. There’s a certain burden that I have of what I can log and how I provide audit trails and recoverability and things of that nature for all of my data. And the challenge that I’m facing right now in the AI era is, when something is non-deterministic, how much scope do I allow it to have inside of that very scoped and audited workflow that I am responsible for attesting compliance to? And I think that’s one set of guardrails, compliance guardrails. Another one is just purely from a risk management basis. How much am I going to allow people to write code and run code from conception into production, with whatever level of human guardrails I have or without? I mean, some companies are famously going on podcasts right now talking about how everything is Claude coded, and they’re less and less reviewing it with human reviews. And they’re allowing AI to build its own test harnesses to evaluate the code that it’s written. There are many businesses that think about that and see a horrifying scenario.
And other businesses that look at that and say that’s just the cost of doing business today because of the need to ship and ship so frequently. I think when it comes to governance, the same sort of rules apply to how I think about quantifying risk for my business, how I measure that risk for my business, and how I start to think about the introduction of new technologies, especially when they’re autonomous. I don’t believe that we have well-established frameworks for this, but there have been frameworks, like using NIST, like mapping things to MITRE ATT&CK, to really understand where you feel like you need a given set of checks and balances. Akamai works with a lot of our customers, for example, through our global services team on this architecture, on testing scenarios, on how they can think about scaling and penetration testing and other things like that that they have to evaluate as they move to production and evolve their applications. But when it comes to governance, I don’t think that there’s a broad-brush approach to what you need to do, other than most people saying we are not ready for a fully autonomous and self-contained AI sort of concept-to-production deployment right now without human interaction. The amount of human interaction, what the humans are actually doing, what you have people auditing, reviewing and reporting on, I think is very specific to the business. And regulators, as we’ve seen, are just now starting to get their ideas together, and they’re leaving a lot of this to the individual companies to define.

Swapnil Bhartiya: Are there any specific industries where you see this shift towards decentralized inference gaining momentum or making more sense, or that are the early adopters of this idea?

Ari Weil: We definitely see some of our typical early adopters, from a media and technology perspective, adopting things quicker. And I think that’s part of the overall industry and their approach to technology throughout the years. Media has famously had to continuously reinvent itself as it has kept pace with all sorts of digital disruptions. And we’re seeing this now in the way that video needs to be captured in very high fidelity, 4K or even 8K streams, processed very quickly, turned into a derivative work and then streamed back out to people at a global scale without incurring too much latency. That is a huge driver of how you can use different types of compute infrastructure to actually do the transcoding, to do the anomaly detection, to put together the social clips, the streaming clips, and the downloads that people all expect, and to be able to take speech to text and text to speech and really have that working across a number of different languages and mediums very quickly. That is driving a fair amount of innovation. We’re seeing the same sort of thing around gaming, and we’re even seeing a quick evolution of things that were originally conceived of through neural networks, using physical AI and robotics to support gaming use cases, now evolving into other sorts of robotics and physical AI use cases. So people that have cut their teeth on creating early versions of the metaverse and immersive gaming are now starting to think about how they can optimize assembly lines and healthcare, and thinking about even modern warfare and the way that AR and VR are currently being brought into the real world. That’s a different technology or a different purpose than what they were originally conceived for.
And then, from just a rote sort of recommendations engine, being able to mine data and come out with tailored recommendations, we’re seeing new sorts of form factors of the work or the work product being produced. A lot of effort is being put in by the retail and commerce segment, where the web is being redefined to be more agentic, to now take the technology that we’ve built through edge computing and serverless functions and start to apply it more to inference use cases. And we’re seeing those reach production very quickly and then scale up as people are realizing that the end consumer really does, you know, they’ve always sort of looked for this human-like personalization and recommendation, and the early versions of it are really starting to show people that there’s a big there there and a big competitive opportunity to be seized.

Swapnil Bhartiya: Ari, thank you so much for sharing your insights. And of course, folks who are watching, please go and check out Akamai’s work in the space. And Ari, I look forward to chatting with you again. Thank you.

Ari Weil: Likewise. Thanks so much, love. See you everybody.
