How to Route AI Inference Across Latency, Cost, and Model Fit Simultaneously | Ari Weil, Akamai | TFiR

Inference workloads do not have a single optimization target. Engineering teams must balance minimizing time to first token, controlling cost per token, selecting the right model, and routing to available GPU capacity, often with no unified layer to manage those tradeoffs in real time. As AI moves out of centralized data centers and into distributed hybrid and multi-cloud environments, that routing problem becomes considerably harder.

In this interview on TFiR, Ari Weil, VP Product Marketing at Akamai, breaks down how Akamai’s AI grid orchestrator addresses distributed inference routing across five operational axes, and why the company’s partnership with Nvidia and telco carriers is central to scaling that capability globally.

Guest: Ari Weil, VP Product Marketing at Akamai
Show: TFiR

Here is what every platform engineer and AI infrastructure architect needs to know.

Technical Deep Dive

Q: What is the AI grid concept and why does Akamai believe distributed AI is required for the next wave of scaling?

Ari Weil, VP Product Marketing at Akamai, explains that Akamai has been working with Nvidia on the AI grid concept, which the two companies discussed together at the GTC San Jose conference earlier in 2026. The premise is that the third scaling wave of AI cannot be achieved through centralized infrastructure alone. It requires distributed intelligence across the collective carrier networks of global telecom providers to deliver the compute reach and redundancy that AI workloads demand at scale.

“To really achieve the full benefits and to realize the potential of this third scaling wave of AI, AI was going to need to be distributed and we were going to need the collective intelligence and network distributions of the telco providers globally to help to provide that scaling.” — Ari Weil, VP Product Marketing, Akamai

Q: What Nvidia hardware is Akamai integrating into its AI grid architecture?

Weil identifies two primary hardware components from Nvidia being incorporated into the AI grid work: Bluefield DPUs, which enable fast hardware-level switching for orchestration tasks, and the RTX Pro 6000 line of GPUs for compute. Akamai is also working with Nvidia’s forward-looking Vera Rubin architecture, which shapes how the company is planning distributed orchestration for future hybrid cloud deployments. The design intentionally separates activities best handled in software from those best handled in hardware, pairing the fast switching available on Bluefield DPUs with orchestration logic that runs in software.

“There’s really this incredible ability to do certain amounts of your activities in software and certain amounts in the hardware.” — Ari Weil, VP Product Marketing, Akamai

Q: What is Nvidia Nemo Guardrails and how does it factor into enterprise AI scaling?

Weil describes Nvidia Nemo Guardrails as Nvidia’s response to enterprise demand for safer, governed AI scaling built on top of the open-source model ecosystem. It adds security guardrails and governance capabilities to open model deployments, making them more viable for enterprise use cases where uncontrolled model behavior is a compliance or risk concern. Weil frames it as Nvidia’s effort to capture and structure the momentum around open AI models rather than cede that space to ungoverned deployments.

“Nemo Guardrails is something that they’re trying to do to harness that wave of enthusiasm and excitement around open models, but to add things like security guardrails and other governance capabilities to make it safer to scale that capability, especially for enterprise use cases.” — Ari Weil, VP Product Marketing, Akamai

Q: What five routing axes does the Akamai AI grid orchestrator optimize across?

Weil outlines five axes that the Akamai AI grid orchestrator uses to route inference workloads. First is time to first token: routing to the fastest available GPU so that tokens begin streaming with minimum latency. Second is cost per token: routing to resources that deliver the best value for the overall token workload. Third is model affinity: routing the workload to the model best suited to the use case and keeping it pinned there, while allowing a switch when a closer or better-fit model becomes available. Fourth is available capacity and compute power. Fifth is proximity, including routing around network problems. These axes can be weighted against each other based on the business logic of the query.

“Those load balancing characteristics are all going into the Orchestrator and we think that that software and the design that we are building now is something that will help to really take advantage of all of the different carrier networks out there to start scaling AI.” — Ari Weil, VP Product Marketing, Akamai
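
One way to picture weighting the five axes against each other is a simple weighted sum. The sketch below is illustrative only and is not Akamai's implementation: the axis keys, the `Endpoint` type, the `route` function, and all scores and weights are hypothetical, and it assumes each axis has already been normalized to a score between 0 and 1 where higher is better.

```python
from dataclasses import dataclass

# Hypothetical keys for the five axes described in the interview.
AXES = ("ttft", "cost", "affinity", "capacity", "proximity")


@dataclass(frozen=True)
class Endpoint:
    name: str
    # Assumed: each axis score is normalized to [0, 1], where 1.0 is best.
    scores: dict


def route(endpoints, weights):
    """Return the endpoint with the highest weighted average across the axes."""
    total = sum(weights.get(axis, 0.0) for axis in AXES)
    if total <= 0:
        raise ValueError("at least one axis needs a positive weight")

    def weighted(ep):
        return sum(weights.get(a, 0.0) * ep.scores.get(a, 0.0) for a in AXES) / total

    return max(endpoints, key=weighted)


# Made-up candidates: a nearby but pricier GPU and a distant but cheaper one.
ENDPOINTS = [
    Endpoint("edge-gpu", {"ttft": 0.9, "cost": 0.3, "affinity": 0.5,
                          "capacity": 0.4, "proximity": 0.9}),
    Endpoint("core-gpu", {"ttft": 0.4, "cost": 0.9, "affinity": 0.5,
                          "capacity": 0.9, "proximity": 0.3}),
]

# Made-up weight profiles standing in for the business logic of a query.
INTERACTIVE = {"ttft": 0.6, "cost": 0.1, "affinity": 0.1,
               "capacity": 0.1, "proximity": 0.1}
BATCH = {"ttft": 0.05, "cost": 0.6, "affinity": 0.1,
         "capacity": 0.2, "proximity": 0.05}

print(route(ENDPOINTS, INTERACTIVE).name)
print(route(ENDPOINTS, BATCH).name)
```

With these made-up numbers, the latency-weighted profile selects the nearby endpoint and the cost-weighted profile selects the cheaper one. A real orchestrator would presumably derive its scores from live telemetry rather than constants.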

Q: How does model affinity routing work and when should teams switch models mid-workload?

Weil makes a deliberate point that teams do not always need the largest or most capable model. The right model is the one suited to the specific workload, not the most powerful one available. Akamai’s orchestrator includes a model affinity capability, first demoed at GTC and scheduled for a further demo at an October Berlin event, that routes workloads to the best-fit model and holds the connection to that model when consistency is required. The orchestrator can then switch the workload to a different model or a geographically closer instance of the same model if that switch improves performance or cost for the active workload.

“You don’t always need the smartest model. You need the model that’s right for your workload.” — Ari Weil, VP Product Marketing, Akamai
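
The pin-and-switch behavior described above can be sketched as a small piece of session state. This is a simplified illustration under stated assumptions, not the orchestrator's actual logic: `Session`, `choose_model`, `SWITCH_MARGIN`, and the instance names are hypothetical. It also assumes a pinned session moves only when its current instance is no longer offered, which is stricter than what the interview describes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical margin: a candidate must beat the current instance's fit score
# by at least this much before an unpinned session is moved.
SWITCH_MARGIN = 0.1


@dataclass
class Session:
    pinned: bool                   # True when the use case calls for a consistent model
    current: Optional[str] = None  # model instance currently serving the session


def choose_model(session, fit):
    """Pick a model instance for the next request.

    `fit` maps instance name to a score in [0, 1], higher is better. The score
    is assumed to combine suitability for the workload with proximity.
    """
    best = max(fit, key=fit.get)
    if session.current is None or session.current not in fit:
        # First request, or the current instance is no longer available.
        session.current = best
    elif not session.pinned and fit[best] - fit[session.current] >= SWITCH_MARGIN:
        # Unpinned sessions switch only for a clear improvement.
        session.current = best
    return session.current


session = Session(pinned=False)
print(choose_model(session, {"small-fra": 0.7, "large-fra": 0.6}))
print(choose_model(session, {"small-fra": 0.7, "small-ber": 0.9}))
```

The margin is there to avoid bouncing a workload between two near-equal instances. Whether a production system uses a margin, a timer, or some other rule is not stated in the interview.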

Q: Can organizations trade off latency against cost in AI inference routing and how does the orchestrator handle that?

Weil confirms that the orchestrator is explicitly designed to support latency-cost tradeoffs based on business logic at the query level. Not every workload requires minimum latency. Batch processing, background inference tasks, or cost-sensitive applications may be better served by routing to less expensive resources even if that adds latency. The orchestrator applies the routing decision based on the business objective encoded for that workload, meaning teams can define tolerance thresholds and let the orchestrator route accordingly rather than applying a single global policy to all inference traffic.

“In some cases, you might be willing to trade off latency and cost per token in your business equation and so we can route you to that.” — Ari Weil, VP Product Marketing, Akamai
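
A per-workload tolerance threshold of the kind described above could look like the following sketch. Everything here is hypothetical and for illustration only: the `Resource` and `Policy` types, the `cheapest_within_tolerance` function, and the latency and price figures are assumptions, not values from Akamai.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    ttft_ms: float        # expected time to first token, in milliseconds
    usd_per_mtok: float   # price per million tokens, in US dollars


@dataclass(frozen=True)
class Policy:
    max_ttft_ms: float    # latency the workload is willing to tolerate


def cheapest_within_tolerance(resources, policy):
    """Cheapest resource that meets the latency tolerance.

    Falls back to the fastest resource when nothing meets the tolerance.
    """
    ok = [r for r in resources if r.ttft_ms <= policy.max_ttft_ms]
    if ok:
        return min(ok, key=lambda r: (r.usd_per_mtok, r.ttft_ms))
    return min(resources, key=lambda r: r.ttft_ms)


# Made-up resources: faster ones cost more per token.
RESOURCES = [
    Resource("edge", ttft_ms=80, usd_per_mtok=4.00),
    Resource("regional", ttft_ms=250, usd_per_mtok=2.00),
    Resource("batch-pool", ttft_ms=2000, usd_per_mtok=0.50),
]

CHAT = Policy(max_ttft_ms=150)         # interactive: latency matters most
BACKGROUND = Policy(max_ttft_ms=5000)  # batch: cost matters most

print(cheapest_within_tolerance(RESOURCES, CHAT).name)
print(cheapest_within_tolerance(RESOURCES, BACKGROUND).name)
```

Encoding the tolerance per workload, rather than as one global setting, is what lets a chat request and a background job land on different resources from the same pool.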

Q: Why do containerized environments matter for achieving global AI grid scale?

Weil points to containerization as the architectural foundation that makes multi-cloud AI grid scale achievable. Because GPU resources and orchestration layers are being deployed in containerized environments, they carry the inherent portability and scheduling flexibility required to operate across multiple cloud providers and carrier networks simultaneously. If the AI grid orchestrator can be deployed consistently across those containerized environments, it can route workloads across any participating telecom network without being locked to a single infrastructure provider or geographic region.

“If we’re deploying our architectures in containerized environments, those containerized environments hold the promise of multi cloud scale.” — Ari Weil, VP Product Marketing, Akamai

Q: Which telecom partners are involved in the AI grid today and what is Akamai’s vision for broader participation?

Weil names T-Mobile alongside Akamai as an existing participant in the AI grid initiative with Nvidia. He frames current participation as the beginning of a broader movement and calls for more telecom providers to join. The strategic argument is that telco carriers collectively hold the network infrastructure required to achieve true global AI distribution. The more carriers that participate and expose their network capacity to the orchestration layer, the more routing options and scaling headroom the AI grid gains across hybrid and multi-cloud architectures.

“We’re hoping that more telco partners like ourselves and T-Mobile will join the AI grid sort of movement with Nvidia and start making this available across more multi cloud and hybrid architectures.” — Ari Weil, VP Product Marketing, Akamai

***

Full Raw Transcript

Swapnil Bhartiya: When it comes to AI, we can use the term observability here if possible; I think it may open a new set of tools also to measure that. So what is happening? What kind of tools, what kind of resources are available not only to kind of measure latency, but other factors? Because sometimes what happens is that I was talking to somebody, I think Irudu company and they are coming with a whole new modern networking switch. They have a lot of investment there because they’re like the thing is you have all the GPUs and Kamai server sitting there. You know, you have powerful machines that you’re. But networking, it’s not designed for this kind of, you know, workload. So we need a new kind of switches, new kind of thing. So talk a bit about resources available and where do you also see as AI workloads will be more demanding when it comes to latency and other factors, how you see this sector evolving itself.

Ari Weil: We have been working closely with Nvidia to help push forward this concept of an AI grid, something that Nvidia believes very strongly. Something that we talked about together at the GTC San Jose conference earlier this year was this notion that to really achieve the full benefits and to realize the potential of this third scaling wave of AI, AI was going to need to be distributed and we were going to need the collective intelligence and network distributions of the telco providers globally to help to provide that scaling. And what we’ve been working on with them is how to take things like their Bluefield DPUs, how to take their RTX Pro 6000 line of GPUs, how we’re thinking forward to the future of what they’re enabling with their Vera Rubin architecture and how do you distribute across a hybrid cloud environment the sorts of orchestration resources that are required one to just make good with all of the existing network capacity that exists out there to get from a core location out to a GPU wherever it might be deployed. You don’t need specialized Nvidia hardware to do that when you just need raw cloud compute. And sometimes that’s all companies are looking for today is just that raw access to compute power. The other thing that we think about is within the Nvidia ecosystem they’ve done a great job of building out both open source and open capabilities in their architecture. If we think about like as an example, Nemo Claw is something that they’re trying to do to seize that wave or harness that wave of enthusiasm and sort of excitement around openclaw. But to add things like security guardrails and other governance capabilities to make it safer to scale that capability, especially for enterprise use cases. But then if you think about what they’re building into their hardware and software stack, there’s really this incredible ability to do certain amounts of your activities in software and certain amounts in the hardware.
And in some cases to have that synergy between the really, really quick switching times that you can achieve by doing things, for example, on the Bluefield architecture and the DPUs, to what we’re able to do from an orchestration capability in the software perspective. And so where Akamai is excited and where we’re bringing a capability now to market is with this notion of an orchestrator for the AI grid. And that orchestrator is designed to basically do load balancer like things, but across the spectrum of what you might need to run an inference workload. So if you think about that, it could be how do I get you to the fastest time to first token for the workload that you’re running? If your business objective is that you need to start streaming tokens as quickly as possible, then part of the orchestrator code is to route you to available GPU resources on the network that you’re accessing as quickly as possible to maintain time to first token at the lowest possible latency possible. I said possible too many times. The other things that we think about are how much do you have for your overall workload? So what is the cost of the overall token workload that you are going to be streaming? And can we route you based on again the business logic for the query to an area where you will receive ultimately the best value or cost per token for what you’re attempting to run. In some cases, you might be willing to trade off latency and cost per token in your business equation and so we can route you to that. Another one is model affinity. We talked a little bit about not necessarily always needing the largest model. You don’t always need, quote unquote, the smartest model. You need the model that’s right for your workload. 
And Akamai is building out a capability that we actually demoed at the prior GTC and we’ll be demoing again in Berlin in October to show how we can route you to the best model and then keep you pinned to that model when the use case calls for it, but then switch you if either going to another model or going to a closer model might serve the workload better. And then from there we also just think about pure proximity or the overall scaling capability, or the scaling limit that you need to reach for your workload. So if you think across those five axes, is it the time to first token? Is it the cost for the token? Is it the amount of available capacity and power that I need to route to? Do I need to get to and keep you on the right model? Or am I ultimately just trying to optimize this for routing around problems? Those load balancing characteristics are all going into the orchestrator and we think that that software and the design that we are building now is something that will help to really take advantage of all of the different carrier networks out there to start scaling AI. And we’re hoping that more telco partners like ourselves and T-Mobile will join the AI grid sort of movement with Nvidia and start making this available across more multi cloud and hybrid architectures. Because at the end of the day, if we’re deploying our architectures in containerized environments, those containerized environments hold the promise of multi cloud scale. And if we can start routing the workloads and have this version of an orchestration capability that I outlined at a very high level available across multiple telecom networks, then we have the ability to really create that grid at a global scale.
