- AI chip and infrastructure orders continue to flood in
- But memory bandwidth could be a limiting factor in putting all that new compute power to efficient use
- Tools can help improve available memory bandwidth, but there's no cure-all yet
Orders for AI gear have been pouring in, and not just at chip giant Nvidia. AMD bagged deals with Meta, Microsoft, OpenAI and Oracle. Cisco already has $2 billion worth of AI orders this year. And just this week, Broadcom revealed it landed a $10 billion contract for AI racks based on its XPU chips. But the AI boom is about to hit a serious speed bump.
We’ve already written about the power, cooling and networking challenges AI data centers face. But there’s another major problem that could hinder AI growth: memory.
While GPUs get all the love, AI requires more than pure processing power. As we recently pointed out, CPUs will play a vital role in orchestrating workloads and data pipelines for AI applications. But it's memory bandwidth that keeps data flowing to all that silicon, and right now there's not nearly enough of it.
“Memory is very much a limiting factor in AI scale-out and performance,” J. Gold Associates founder Jack Gold explained. “GPUs are often restricted in performance by the need to connect to external memory over interconnects that slow things down. So, anything that can bring memory closer/faster to the GPU has a big performance improvement.”
How did this happen?
JB Baker, VP of Products at ScaleFlux, told Fierce that while both processor and memory capabilities have grown exponentially in recent years, memory hasn't grown at the same rate as compute.
The result is a gap between how many calculations a chip can crunch in a second and how much memory bandwidth is available to feed it data. In other words, memory has become a bottleneck. (There's a nifty little chart you can check out here, and an IEEE paper on the Memory Wall here if you're feeling extra nerdy.)
Baker said it’s a bit like having a sprawling crop field next to a massive lake full of water but only having a tiny garden hose to get the water to the field.
“There’s a lot of potential compute capacity that is going unutilized and at the same time it’s burning power,” he said. “So, it’s not only that I lost out on things that those processors could have done, but they burn power sitting idle.”
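To put rough numbers on Baker's analogy, here's a minimal back-of-envelope sketch in Python. The peak-compute and bandwidth figures are illustrative assumptions rather than specs for any particular chip; the point is the ratio: unless a workload reuses each byte it loads hundreds of times, the processor stalls waiting on memory and, as Baker notes, keeps burning power while it waits.

```python
# Back-of-envelope "memory wall" arithmetic (illustrative numbers, not real chip specs)

peak_compute_flops = 1000e12    # assumed accelerator peak: 1,000 TFLOP/s
memory_bandwidth_bps = 3.5e12   # assumed memory bandwidth: 3.5 TB/s

# Arithmetic intensity the chip needs to stay busy, in FLOPs per byte moved
required_intensity = peak_compute_flops / memory_bandwidth_bps
print(f"Needs ~{required_intensity:.0f} FLOPs per byte to keep the compute units fed")

# A memory-bound workload that only performs, say, 10 FLOPs per byte it loads
# is capped by bandwidth, no matter how fast the processor is
workload_intensity = 10  # assumed FLOPs per byte
achievable_flops = min(peak_compute_flops, workload_intensity * memory_bandwidth_bps)
utilization = achievable_flops / peak_compute_flops
print(f"Achievable throughput: {achievable_flops / 1e12:.0f} TFLOP/s "
      f"({utilization:.0%} of peak); the rest of the chip sits idle, still drawing power")
```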
Gold noted the issue isn't unique to GPUs – CPUs have run up against this wall as well. But the problem is coming to a head in the AI era, particularly in light of power constraints and the desire to massively scale AI deployments.
Can it be fixed?
Baker said right now there’s no cure-all solution, but things are moving in the right direction.
He pointed to Compute Express Link (CXL) technology and solid-state drive (SSD) advancements as tools that can help scale memory bandwidth. (Of course, both are products ScaleFlux offers, but others are working on the problem as well.)
Nvidia is tackling the issue through its NVLink and Storage-Next initiatives, and Gold noted other players – including Intel spin-off Cornelis – are trying to speed things up in this space as well.
In terms of what it all means for telcos and enterprises, Baker argued scaling AI will require a rebalancing of the capex equation to focus on more than just GPUs for on-prem deployments.
“If you don’t put the right amount of dollars into your memory, into your storage and your networking, you’re going to waste the ones that you spend on GPUs,” he concluded.