Why AI Inference Latency Is an Architecture Problem, Not a Tuning Problem | Ari Weil, Akamai | TFiR

AI inference latency is an architecture constraint, not a tuning problem. Ari Weil of Akamai explains why centralized deployments fail global users and how to fix it.

By Monika Chauhan 5 days ago

Enterprises building AI inference pipelines on centralized cloud regions are discovering that global users cannot get acceptable response times, and the cause is not model performance. Round-trip network time plus token streaming latency compounds into an experience gap that no amount of model optimization can close. The problem is locked into the architecture before the first user ever connects.

In this interview on TFiR, Ari Weil, VP Product Marketing at Akamai, breaks down why latency is a fundamental architectural constraint in AI inference deployment, how default hyperscaler region choices create silent failure conditions, and what enterprises need to do differently when designing for real-world global users.

Guest: Ari Weil, VP Product Marketing at Akamai
Show: TFiR

Here is what every platform engineer and AI infrastructure architect needs to know.

Technical Deep Dive

Q: Why is latency becoming a defining issue for enterprises deploying AI inference?

Ari Weil, VP Product Marketing at Akamai, argues that latency in AI inference is often not a performance tuning problem but a fundamental characteristic of the architecture being built. Whether a deployment can deliver acceptable latency depends on which use case, which users, and which regions the application is meant to serve. Once that is defined, the infrastructure, including failover capacity, must be designed to match that geography from the start, not retrofitted after scale is reached.

“It’s not a tuning problem to make something that might be too slow go faster. It might be a fundamental characteristic of the architecture you’re building.” — Ari Weil, VP Product Marketing, Akamai

Q: Why do default hyperscaler regions like AWS US East fail global AI inference workloads?

Weil notes that AWS US East, the region in Northern Virginia, is the default deployment target for a large share of cloud workloads simply because it is familiar and heavily used. The problem is that this default ignores whether the actual end users are located anywhere near that region. Before privacy, sovereignty, and data routing concerns are even considered, teams need to ask whether users in their target demographic can be served effectively from that single location.

“Amazon US East is a very heavily used data center. It’s a default for many people to think about. The problem is, before you start thinking about privacy and sovereignty, you have to ask yourself, are the people that I’m serving going to be served successfully and effectively from US East or Virginia in the United States?” — Ari Weil, VP Product Marketing, Akamai

Q: What is the compounded latency problem in AI inference and how does token streaming make it worse?

Weil describes the total latency an AI application delivers to a user as the sum of two components: the round-trip network time between the client and the inference endpoint, and the time required to process the query and stream tokens back as a response. Improving one does not reduce the other: a faster model does not shorten the network path, and a closer endpoint does not speed up token generation. The round-trip time can be measured independently without any AI model in the loop, which means teams can identify geographic latency problems before committing to a full architecture.

“You can test that without your full AI proof of concept. Then you add in how long does it take the question that I’m answering, the tokens that I have to stream back, and add that to whatever your round trip latency time is.” — Ari Weil, VP Product Marketing, Akamai
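The two-part sum Weil describes can be sketched in Python. This is a minimal illustration, not Akamai's method: the function names `measure_connect_rtt_ms` and `estimate_latency_ms` are hypothetical, TCP connect time is used as a rough stand-in for network round-trip time, and the streaming model (one round trip before the first token, then generation-bound tokens) is a deliberate simplification that ignores TLS handshakes, queuing, and retransmits.

```python
import socket
import statistics
import time


def measure_connect_rtt_ms(host: str, port: int = 443, samples: int = 5,
                           timeout: float = 3.0) -> float:
    """Median TCP connect time in ms, a rough proxy for network round-trip time.

    No AI model is involved: any reachable endpoint in the candidate region works.
    When `host` is a name rather than an IP address, each sample also
    includes the time spent resolving it.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # connect() returns after the TCP handshake, which takes about one round trip.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def estimate_latency_ms(rtt_ms: float, time_to_first_token_ms: float,
                        output_tokens: int, ms_per_token: float) -> dict:
    """Add network round-trip time to inference time (simplified model).

    Assumption: the request and the first streamed token together cost one
    round trip, and later tokens arrive at the model's generation rate.
    """
    first_token_ms = rtt_ms + time_to_first_token_ms
    remaining_tokens = max(output_tokens - 1, 0)
    full_response_ms = first_token_ms + remaining_tokens * ms_per_token
    return {"first_token_ms": first_token_ms, "full_response_ms": full_response_ms}


# Example usage (performs real network calls, so it is left commented out):
# rtt = measure_connect_rtt_ms("example.com")
# print(estimate_latency_ms(rtt, time_to_first_token_ms=200.0,
#                           output_tokens=100, ms_per_token=20.0))
```

Running the measurement from machines located where the target users are, rather than from a developer laptop, is what makes the number meaningful. The inference timings passed to `estimate_latency_ms` are inputs the team has to measure for its own model.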

Q: Why do enterprises wait too long to think about scaling and geographic distribution in AI deployments?

Weil identifies a common sequencing mistake: teams build a proof of concept, engage early users, wait for a critical mass of adoption, and only then consider scaling out infrastructure or getting closer to users geographically. The flaw in this approach is that poor latency prevents reaching that critical mass in the first place. Scaling decisions around CPU, memory, and GPU headroom, as well as geographic distribution, need to be made before user growth is expected, not in response to it.

“You might never hit that scaling opportunity if you can’t provide the sort of experience that people expect.” — Ari Weil, VP Product Marketing, Akamai

Q: How should enterprises apply CDN and client-server networking lessons to AI inference architecture?

Weil draws a direct line between 15 to 20 years of internet infrastructure lessons and the current challenge of AI inference deployment. The same principles that drove the development of content delivery networks, specifically getting compute and content physically closer to users to reduce round-trip time, apply directly to where inference endpoints should be placed. The speed of light sets a hard floor on round-trip time over any given distance, and that constraint does not change because the workload is now AI.

“You can’t architect your way out of physics. From the latency perspective, it really is milliseconds that matter, and the speed of light is going to get in the way of those milliseconds.” — Ari Weil, VP Product Marketing, Akamai
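The physics floor can be estimated with a few lines of Python. This is a rough sketch under stated assumptions: signals are taken to travel through optical fiber at about two thirds of the vacuum speed of light, the path is assumed to follow the great circle between the two points, and the helper names `great_circle_km` and `min_rtt_ms` are hypothetical. Real routes are longer and add switching and queuing delay, so measured round-trip times will be higher than this bound.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: light in optical fiber travels at roughly 2/3 of its vacuum speed.
FIBER_SPEED_KM_PER_S = 299_792.458 * 2 / 3


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points (degrees) via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def min_rtt_ms(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Lower bound on round-trip time in ms over an ideal straight fiber path."""
    one_way_s = great_circle_km(lat1, lon1, lat2, lon2) / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * 1000.0


# Approximate coordinates, used only for illustration.
ASHBURN_VA = (39.04, -77.49)
SYDNEY = (-33.87, 151.21)

if __name__ == "__main__":
    print(f"{min_rtt_ms(*ASHBURN_VA, *SYDNEY):.0f} ms minimum round trip")
```

For a user in Sydney reaching an endpoint in Northern Virginia, this bound comes out at roughly 157 ms per round trip before any inference work starts, which is the kind of cost that model tuning cannot recover.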

Resources & Documentation

Akamai Cloud Computing, distributed cloud infrastructure for compute, GPU, and AI inference workloads at the edge

***

Full Raw Transcript

Swapnil Bhartiya: Let’s talk about latency first, because this is also something which is in the ballpark of Akamai, what you have focused on with your footprint and presence. Why is latency becoming such a defining issue for enterprises when it comes to AI inference? And how can Akamai address that challenge?

Ari Weil: I think answering that question really comes down to asking yourself, what use case and what set of users are you expecting to serve with your AI application? And I think once you can answer that, you know, distinctly and uniquely well, then you really have to start treating this challenge like any other sort of an infrastructure and architectural challenge. So, for example, if when you build your proof of concept, you are attempting to serve an answer to people in a given region, then you should probably be thinking, if that is going to be my target demographic, then I need to make sure that I have infrastructure and failover capabilities to uniquely serve that region. However, what we find is still because of the pervasiveness of, you know, deployments and the way that people tend to think about a couple of centralized locations to deploy their models on hyperscalers in the US and even in places around the globe, Amazon US East is a very heavily used data center. It’s a default for many people to think about. It’s where they deploy a lot of their code. The problem is, before you start thinking about privacy and sovereignty and how your data needs to be routed, and even latency concerns, you have to ask yourself, are the people that I’m serving going to be served successfully and effectively from US East or Virginia in the United States? And if the question is no, you need to stop right there and then think about how am I going to be architecting for the constituency that I’m trying to reach.

Beyond that, though, I do think that people wait too long to think about scaling events. Whether it’s scaling up infrastructure capacity for additional CPU or memory headroom, for example, or GPU headroom, or if it’s scaling out when you need to get closer to users. People have a tendency to do this in the wrong order. They think they’ll build the proof of concept, engage their target audience, get to a certain number of critical mass of users, and then start to scale out. And the problem is, you might never hit that scaling opportunity if you can’t provide the sort of experience that people expect.

And so I think the thing is, you’re not really thinking about latency in the right way. It’s not a tuning problem to make something that might be too slow go faster. It might be a fundamental characteristic of the architecture you’re building, in which case you have to think, you can’t architect your way out of physics. If I have to get a round trip time from a location to a destination, you can test that without your full AI proof of concept. Then you add in how long does it take the question that I’m answering, the tokens that I have to stream back to get a response for the query that I’m building, and add that to whatever your round trip latency time is. And in many cases that is taking some of the lessons that we’ve learned from building the last 15 to 20 years of the Internet and paying those forward to think about, again, client-server architectures and networking times. Because from the latency perspective, it really is milliseconds that matter, and the speed of light is going to get in the way of those milliseconds.
