Enterprises are discovering that the infrastructure decisions that work for training large language models actively sabotage production inference. Centralized compute clusters optimized for throughput introduce latency that breaks real-time user and agent interactions. The problem is not the model. The problem is assuming the two workloads share the same infrastructure requirements.
In this interview on TFiR, Ari Weil, VP Product Marketing at Akamai, breaks down why moving AI from pilots to production requires a fundamentally different inferencing architecture, what a global technologist survey revealed about AI readiness challenges, and where distributed infrastructure fits into the production scaling equation.
Guest: Ari Weil, VP Product Marketing at Akamai
Show: TFiR
Here is what every platform engineer and AI infrastructure practitioner needs to know.
Technical Deep Dive
Q: What was the purpose of Akamai’s global AI survey and what were the key areas it examined?
Ari Weil, VP Product Marketing at Akamai, explains that the survey was designed to understand how technologists across industries and geographies were thinking about AI readiness, team and organizational preparedness, and the challenges they expected to face during adoption. The research focused on what practitioners found most exciting, what concerned them most, and where distributed infrastructure could fit into their broader architecture. The goal was to get a grounded view of how enterprises were genuinely approaching AI as a production capability rather than an experimental one.
“We wanted to get a better feeling from technologists across the globe and across industries how they were really thinking about the readiness that they felt themselves, how their teams and their organizations felt about AI.” — Ari Weil, VP Product Marketing, Akamai
Q: What fundamentally changes when AI moves from pilot projects and chatbots into real production inference?
Weil explains that the shift from pilot to production exposes a critical infrastructure mismatch. Early AI adoption leaned heavily on centralized AI factories because they provided access to the GPU density, power, and space needed to handle the token consumption demands of frontier LLMs during training. The widespread assumption was that serving a trained model would work the same way as training it, only faster. That assumption is incorrect, and acting on it is a primary reason AI programs stall or fail before reaching production scale.
“People just assumed that as they started to move to production, it was going to be like training, but faster. And that’s actually not true.” — Ari Weil, VP Product Marketing, Akamai
Q: How does model size and weight density affect inference performance for domain-specific use cases?
Weil notes that large language models with hundreds of millions of weights contain far more knowledge than most production use cases require. For domain-specific queries, that excess information can increase response latency and raise the risk of hallucination because the model is reaching across a much wider knowledge surface than necessary. A smaller, purpose-built model scoped to a specific domain can return faster, more reliable answers without the overhead of a general-purpose frontier model. Selecting the right model size is an inference optimization decision, not just a cost decision.
“In some cases you’d be much better off having fewer weights, a smaller model, something that you can ask a domain specific question to and not worry so much about how long it would take the query to return or if there would be hallucinations.” — Ari Weil, VP Product Marketing, Akamai
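To make the latency side of this point concrete, here is a minimal back-of-envelope sketch. It is not from the interview. It assumes single-request decoding is limited by memory bandwidth, so every weight is read once per generated token, and it ignores the KV cache, batching, and compute time. The parameter counts, the 2-byte weight size, and the bandwidth figure are illustrative assumptions, not measurements of any real model or accelerator.

```python
def decode_ms_per_token(params: float, bytes_per_param: float,
                        mem_bandwidth_bytes_per_s: float) -> float:
    """Rough lower bound on per-token decode time, in milliseconds.

    Assumes every weight is read from memory once per generated token.
    """
    seconds = params * bytes_per_param / mem_bandwidth_bytes_per_s
    return seconds * 1000.0


# Hypothetical accelerator memory bandwidth (2 TB/s), for illustration only.
BANDWIDTH_BYTES_PER_S = 2.0e12

# Hypothetical model sizes with 16-bit (2-byte) weights.
large_ms = decode_ms_per_token(70e9, 2, BANDWIDTH_BYTES_PER_S)
small_ms = decode_ms_per_token(7e9, 2, BANDWIDTH_BYTES_PER_S)

print(f"large model: ~{large_ms:.0f} ms per token")
print(f"small model: ~{small_ms:.0f} ms per token")
```

Under these assumptions the per-token floor scales linearly with weight count, which is one reason a smaller domain model can answer faster. The sketch says nothing about hallucination risk, which depends on the model and its training data rather than on arithmetic like this.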
Q: What is the core infrastructure difference between training workloads and inference workloads?
Weil draws a direct contrast between the two workload types. Training is a concentrated, predictable activity with defined latency tolerance. Engineers build a dense GPU cluster, run training algorithms for a set duration or until a capability threshold is reached, and optimize for throughput. Inference is the structural opposite. It is distributed, bursty, and latency-sensitive, entirely dependent on a user or agent issuing a query and expecting a near-immediate response. The infrastructure requirements are not variations on a theme. They are different problems that require different architectural decisions.
“Training is a concentrated activity. It’s predictable. But in the case of inference, it’s exactly the opposite. You’re dealing with a distributed, bursty and really sensitive sort of a workload that is completely dependent on the user asking a question and expecting a response.” — Ari Weil, VP Product Marketing, Akamai
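The contrast between predictable and bursty load can be illustrated with a toy queueing sketch. This is an assumption-laden model, not Akamai's method: one FIFO server, a fixed 0.5 second service time, and two hypothetical arrival patterns with the same average rate of one request per second.

```python
def latencies(arrivals, service_s):
    """Per-request latency (queueing delay plus service time) for one FIFO server."""
    free_at = 0.0  # time at which the server next becomes idle
    result = []
    for t in sorted(arrivals):
        start = max(t, free_at)       # wait if the server is still busy
        free_at = start + service_s
        result.append(free_at - t)
    return result


SERVICE_S = 0.5  # hypothetical time to serve one request

# Same average rate (60 requests in 60 seconds), different shape.
steady = [float(i) for i in range(60)]               # one request per second
bursty = [float(10 * (i // 10)) for i in range(60)]  # 10 at once, every 10 seconds

for name, arrivals in (("steady", steady), ("bursty", bursty)):
    lat = latencies(arrivals, SERVICE_S)
    print(f"{name}: mean {sum(lat) / len(lat):.2f} s, worst {max(lat):.2f} s")
```

With evenly spaced arrivals no request ever waits behind another. With bursts, the same server at the same average utilization makes the last request in each burst wait for the nine ahead of it. Capacity sized for average throughput, which suits training, therefore does not bound the latency of a bursty inference workload.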
Q: How does inference latency affect real business applications and user experiences in production?
Weil frames the latency problem in terms of what is actually waiting on the other end of an inference request: a real person or a real agent. Every increment of added latency degrades that interaction, and Weil compares the result to the challenges businesses faced with client-server interactions. Because most production AI interactions still occur through a chat medium, users expect near-instantaneous responses. A centralized infrastructure model cannot consistently deliver that, and the resulting user experience failure directly undermines the business case for the AI application.
“Every response today is either a real person or a real agent that’s waiting for an answer. And every time that you add latency to that equation, you can see that businesses are experiencing the sort of challenges that they experienced in client server interactions.” — Ari Weil, VP Product Marketing, Akamai
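One way to see why distance to a centralized cluster matters is to estimate the network floor that no amount of GPU capacity can remove. The sketch below is illustrative and not from the interview. It assumes signals travel through optical fiber at roughly 200,000 km/s, ignores routing, queuing, and TLS handshakes, and uses hypothetical distances and call counts.

```python
# Approximate signal speed in optical fiber (about two thirds of c).
FIBER_KM_PER_S = 200_000.0


def min_rtt_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay over fiber, in milliseconds."""
    return 2.0 * distance_km / FIBER_KM_PER_S * 1000.0


def network_floor_ms(distance_km: float, sequential_calls: int) -> float:
    """Propagation delay accumulated by calls that must run one after another."""
    return min_rtt_ms(distance_km) * sequential_calls


# Hypothetical scenario: an agent makes 10 dependent inference calls.
CALLS = 10
for label, km in (("distant central cluster", 8000.0), ("nearby location", 200.0)):
    total = network_floor_ms(km, CALLS)
    print(f"{label}: at least {total:.0f} ms of network delay")
```

The point of the comparison is that the penalty compounds when an agent chains dependent requests, so a delay that is tolerable for a single call becomes visible to the person or agent waiting on the final answer.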
Resources & Documentation
- Akamai, distributed cloud platform for AI inference, security, and content delivery at scale
***
Full Raw Transcript
Swapnil Bhartiya: Let’s talk about this survey. Talk a bit about the survey and some of the major findings.
Ari Weil: The industry is coming to grips with AI and we think about what you need to build, how you need to think about scaling it, securing it, and ultimately how enterprises are going to engage a company like Akamai to ask us how we can help fit into their overall architecture. We wanted to get a better feeling from technologists across the globe and across industries how they were really thinking about the readiness that they felt themselves, how their teams and their organizations felt about AI, what sort of challenges they were going to be facing. And basically, you know, as we think about the adoption of this critical capability across a number of different sectors, what were the things that basically kept them up at night? What were they the most excited about? And ultimately where could distribution help them? And we got some really, really interesting answers.
Swapnil Bhartiya: We can go into some of your findings, but I have one query before we do. These days, most of the time pilots fail. But when AI moves from pilots and chatbots into real production applications, what changes, such that enterprises have to start thinking about inferencing architecture differently than when they were just doing pilot projects?
Ari Weil: Well, I think the biggest thing that we deal with a lot is as the industry is thinking about moving away from just purely leveraging AI factories. Because a lot of the early practitioners, a lot of the early scaling of AI and something that Jensen Huang talks about even at his GTC conferences for Nvidia is that the centralization of compute so that you could get access to the cards that you needed, you had access to the power and the space and the ultimate ability that you needed to scale to the token consumption that a lot of the frontier LLMs have been leading the industry with, people just assumed that as they started to move to production, it was going to be like training, but faster. And that’s actually not true. It’s not true that if you can just train your model, then you’ll be able to serve it to people at scale. And there are a number of different reasons that we’ve discovered this. One reason is the training of a large language model and having, you know, hundreds of millions of weights, for example, that are inside of this model is that many times people don’t need all of that wealth of knowledge to be readily available inside of the model. In some cases you’d be much better off having fewer weights, a smaller model, something that you can ask a domain specific question to and not worry so much about the sorts of things like how long it would take the query to return or if there would be hallucinations because too much information was inside of the model that you’re accessing. But that’s not really the main problem. The main problem when we think about going from training to actually leveraging something for inference is that you’re solving for a completely different infrastructure problem. And when you treat them the same way, your AI program is likely to stall or fail. And the reason is training is a concentrated activity. It’s predictable. You have a certain latency tolerance that you typically will build into a project. 
When you are training a centralized model, you create the dense GPU cluster you need, you let the queries run, you let the training algorithms run, and you optimize them for throughput for a certain amount of time or to reach a certain capability that you’re trying to train for. But in the case of inference, it’s exactly the opposite. You’re dealing with a distributed, bursty and really sensitive sort of a workload that is completely dependent on the user asking a question and expecting a response. So every response today is either a real person or a real agent that’s waiting for an answer. And every time that you add latency to that equation, you can see that businesses are experiencing the sort of challenges that they experienced in client server interactions. Because the way that you’re interacting with an inference-based chat experience (in most cases it is still a chat medium) is expecting that instantaneous result, and you just can’t get it when you plan for that centralized model.





