Akamai warns of an agentic AI 'latency crisis'

Akamai Technologies
Akamai's Jon Alexander says agentic AI's sprawl demands low latency that can only be achieved with distributed infrastructure.
  • Up to roughly 90% of the latency in agentic workloads comes from CPU-side processing such as tool calling, not GPU inference
  • Akamai is pitching its 4,000-plus edge sites and AI Grid Orchestrator as the cure for latency issues
  • Skeptics, including AT&T's CTO, doubt agents really need compute pushed to the far edge to shave off a millisecond or two

AI agents need to act at machine speed. The cloud they run on was built for human patience. And that mismatch is creating a "latency crisis" for agentic AI, Akamai argued.

The public cloud is tuned for human use, comfortably absorbing the roughly 100-millisecond delays of intercontinental data transit, wrote Jon Alexander, SVP cloud technology, in a blog post this week. Chatbots fit that world fine: a person types a question, waits a few seconds and reads a streamed answer.

But autonomous agents do not operate on a human clock. Agents need latency measured in hundreds of milliseconds, not full seconds, Alexander told Fierce in an interview. Some 82% of organizations have critical use cases that need end-to-end response times of 500 milliseconds or less, and 64% are already targeting 250 milliseconds or less, according to Akamai's "State of AI Inference 2026" report.

Why the bottleneck moved from the GPU to the CPU

While the GPU is, of course, critical for AI implementation, the CPU also performs a crucial role, Alexander said. Software on the CPU serves as a controller or harness, sending prompts to models running on GPUs, then looping: calling tools, querying APIs, reading files, executing generated code and deciding what to do next. It can go around that loop many times before it finishes a task.

The result is that the GPU often sits idle. CPU-side processing accounts for up to 90.6% of total latency in agentic workloads, Alexander said.

"The majority of the time isn't even on the GPU, it's in that CPU," Alexander told Fierce. "The controller loop has an outsized impact on the overall time to deliver the outcome."

Delays stack up fast. A single workflow with 50 sequential calls "quickly incurs seconds of transport latency," Alexander wrote. That's enough to push a production agent past the half-second budget most enterprises say they need.
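
As a back-of-the-envelope sketch of that arithmetic: assume, for illustration only, that each sequential call waits for the previous one and pays one full round trip. The 100-millisecond figure below borrows the intercontinental delay cited earlier; the 5-millisecond "nearby" figure is a hypothetical for comparison, not an Akamai number.

```python
def transport_latency_ms(sequential_calls, round_trip_ms):
    """Total transport delay when every call waits for the one before it."""
    return sequential_calls * round_trip_ms

BUDGET_MS = 500  # end-to-end target cited in Akamai's report

# Assumption: ~100 ms per call, matching the intercontinental transit figure
distant = transport_latency_ms(50, 100)
# Assumption: a hypothetical 5 ms per call from a nearby node
nearby = transport_latency_ms(50, 5)

print(distant, distant > BUDGET_MS)  # 5000 True
print(nearby, nearby > BUDGET_MS)    # 250 False
```

Under those assumptions, transport alone costs five seconds from a distant cluster, ten times the half-second budget, before any model or tool time is counted.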

Split the brains from the hands

Akamai's prescription is to stop treating AI as a monolith and distribute it. Heavy reasoning stays in a centralized core of big GPUs; localized inference runs on regional GPU clusters; and the CPU-heavy orchestration work — tool calls, code execution, data access — runs on high-performance CPUs at the edge, close to wherever the data lives.

"Data has a lot of gravity. Moving data is expensive and slow, so you want to have execution sit near where the data is," Alexander said. Calling a CRM, a finance system or an internal document store from an edge node nearby, rather than from a distant cluster, is where he sees the latency savings.

Tying it together is what Akamai calls its AI Grid Orchestrator, a semantic router the company demonstrated at Nvidia's GTC conference in March. It decides where each piece of a workload should run based on the request's intent, the model required, latency targets, cost and sovereignty rules. Akamai has deployed Nvidia RTX Pro 6000 GPUs across more than 20 data centers — its "core" — and says it can run agent orchestration across the more than 4,000 edge locations it inherited from its content-delivery business. 
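
Akamai has not published how the orchestrator makes its decisions, so the following is only a toy sketch of the general idea of placing work by type, latency target, cost and data-residency rules. Every name in it (Request, Site, TIER_FOR_KIND, place) and every number is hypothetical and does not reflect Akamai's actual API or implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    kind: str               # hypothetical labels: "reasoning", "inference", "tool_call"
    latency_target_ms: int  # the caller's latency budget
    data_region: str        # region the data must stay in (sovereignty rule)

@dataclass
class Site:
    name: str
    tier: str               # "core", "regional" or "edge"
    region: str
    est_latency_ms: int
    cost: float

# Assumed mapping of work type to tier, following the split described in the article
TIER_FOR_KIND = {"reasoning": "core", "inference": "regional", "tool_call": "edge"}

def place(request, sites):
    """Return the cheapest site that satisfies tier, region and latency, or None."""
    candidates = [
        s for s in sites
        if s.tier == TIER_FOR_KIND[request.kind]
        and s.region == request.data_region
        and s.est_latency_ms <= request.latency_target_ms
    ]
    return min(candidates, key=lambda s: s.cost, default=None)

sites = [
    Site("core-1", "core", "us", 80, 3.0),
    Site("edge-fra", "edge", "eu", 8, 1.2),
    Site("edge-par", "edge", "eu", 12, 1.0),
    Site("edge-nyc", "edge", "us", 5, 0.8),
]

chosen = place(Request("tool_call", 20, "eu"), sites)
print(chosen.name)  # edge-par
```

Here the cheapest site overall, edge-nyc, is ruled out by the residency constraint, so the sketch picks the cheaper of the two European edge sites that meet the 20-millisecond target.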

The company plans to expand its GPU footprint from roughly 20 sites toward 100 over the next year, and has signed a $200 million, four-year inference-cloud deal with an unnamed customer. Alexander said Akamai is weighing whether to open-source parts of the orchestrator, though it has not set a date.

Vultr circling

Akamai is not alone in touting the importance of CPUs to reduce latency. Vultr CMO Kevin Cochrane made the same point at GTC, telling Fierce that "the CPU becomes more important than ever, because the CPU is what orchestrates and saturates the GPU."

Fierce Network Research found the same pattern across enterprise deployments. In an independent Fierce Network Research report sponsored by Vultr, leaders at five enterprises, including Eli Lilly and Nasdaq, said the important infrastructure for agents is the virtual private clouds, gateways and CPUs companies already run, not just new GPU clusters. Inference increasingly needs to be distributed close to users and data rather than centralized.

For telcos, edge infrastructure offers a second chance to win back value ceded to hyperscalers — though telcos will have to overcome cultural and technical barriers to seize that chance, analyst Sid Nag, president and chief research officer at Tekonyx, said in an interview for another Fierce Network Research report: "AI and the automated network: Designing telco infrastructure for the age of inference."

But at least one operator is openly skeptical about the supreme importance of the far edge. AT&T CTO Igal Elbaz questions whether there is much value in pushing compute "all the way to the far edge just to save another millisecond or two of latency," given how much high-performing compute already sits in metros nationwide. An Omdia survey found only 15% of telcos ranked the network far edge as the top spot for AI inferencing.

Just this week, TM Forum CEO Nik Willetts warned telcos against overinvesting in the edge opportunity. “Our position right now is definitely don’t bet the farm on it, and we’re skeptical,” he told Fierce at the TM Forum conference in Copenhagen. “Until we see the use cases, we think everything needs to be based on actual demand.”