Podcast: Data Center 2.0—Compute comes knocking

If you’re deploying AI, your decisions around what compute to use shouldn’t be based on which three-letter acronym you remember the best. We’re breaking down how to think about compute for AI; what CPUs, GPUs, and newer accelerators are best at; where organizations can get tripped up; and what constraints matter more than the chip itself.

Welcome to The Five Nine miniseries digging into what's changing in the infrastructure behind modern businesses. 

Catch the video at top, listen to the audio edition and read our transcript below, or watch this and future episodes on YouTube.


This podcast is written and hosted by Diana Goovaerts. It is edited by Diana Goovaerts and Matt Rickman. Liz Coyne is our executive producer. Special thanks to guests Matt Kimball, Brendan Burke, Shar Narasimhan and Robert Hormuth.


Diana Goovaerts, Fierce Network: If you’re trying to deploy AI right now, you’ve probably heard a lot of three-letter recommendations: CPU, GPU, maybe TPU and lately, even LPU and XPU.

But the real challenge isn’t memorizing acronyms. It’s figuring out what kind of compute matches your workload and how to avoid buying the wrong thing when workloads are changing fast. 

Matt Kimball, Moor Insights and Strategy: There are three considerations to think about when you think about not aligning the right silicon to the right workload. The first is cost.

Part two is performance. If I am, you know, trying to use a CPU where a GPU is required, obviously my AI performance is gonna be horrible and there's a lot of downstream effects to that.

And the third is power. So if I'm over-provisioning my data center with the wrong kind of silicon, I am unnecessarily using power and it prevents me from expanding my footprint effectively. 

Diana Goovaerts: Welcome to The Five Nine miniseries Data Center 2.0—the show where we break down what’s changing in the infrastructure that powers modern business. I’m Diana Goovaerts.

Today we will talk about how to think about compute for AI—what CPUs, GPUs and newer accelerators are best at, where organizations get tripped up, and what constraints matter more than the chip itself.

Let’s start simple. CPUs are the generalists. GPUs bring massive parallelism. And LPUs and other specialized accelerators tend to focus on specific AI tasks.

Brendan Burke, Futurum Group: The reality of the modern data center is that we need multiple types of compute for AI, including CPUs, GPUs, XPUs and a range of other supporting processors. Key differences for these processors include how they handle data.

So conventionally CPUs do simple read-write cycles where memory is requested, an instruction is carried out and the answer is passed back to memory. GPUs exploded that by expanding processing across thousands of cores and using dynamic scheduling to pick which of those cores were needed for a particular operation.

Now with XPUs, we've pared that back using systolic arrays that mimic how the circulatory system of the body works, passing data fluidly through combinations of memory and arithmetic cores to deterministically lay out how memory should create a token.

And all of these are necessary for different stages of the AI lifecycle. 

Shar Narasimhan, Nvidia: For the majority of use cases, a combination of GPUs and CPUs meets the needs of most AI factories. Now, LPUs, on the other hand, excel at low-latency inference, and they leverage a synchronous, deterministic, software-defined architecture.

Diana Goovaerts: In other words: general purpose to more tailored, often trading flexibility for efficiency in one job.

Matt Kimball: It is fair to say that CPUs, GPUs, LPUs can all be used for AI in different ways, actually, even to some degree in the same way, but just less effectively, right? I mean, there's nothing stopping you from using a farm of CPUs to train a model. It would be a really silly decision. But it can be done. 

Robert Hormuth, AMD: So, if I kind of think about it from an end-to-end, CPU’s most general purpose, you know, can do almost anything. GPUs are – actually compared to an LPU, a GPU is much more programmable and then LPUs get even more, you know, optimized more towards an ASIC than the programmability of a GPU.

Shar Narasimhan: CPUs and GPUs are programmable. ASICs or fixed-function chips are not really as flexible and can struggle when you have these architectural or use case changes that inevitably seem to happen in this industry.

Diana Goovaerts: One reason the compute question gets confusing is because “AI” is not one workload. Training looks different from inference. And even inside inference, there are stages – like building context, then generating tokens – where the best hardware can change.

Matt Kimball: What's the risk of not aligning the right silicon to the right workload? If I have an overpowered GPU that is supporting an AI function that could be served by CPUs, I'm obviously paying too much. My cost of operations, my cost of tokens is going to be a lot higher than it needs to be. 

Brendan Burke: So the question is no longer which chip to use. It's which part of the lifecycle are you in, and which processor is best tailored for that step. 

Diana Goovaerts: So what is the practical framework? Well, everyone we talked to came back with a similar idea: Start with the outcome – latency, throughput, utilization, or, you know, cost – and then choose the mix of compute that gets you there.

Matt Kimball: CPU comes into play in a lot of traditional AI workloads that have been going on for some time. Right. Um, vision recognition. If I have a manufacturing plant and I'm doing visual inspections, you know, on the assembly line, CPUs typically do a very good job of that, call it the discrete AI pattern recognition.

When you get into generative where you're building a lot more context or agentic where you've got this multi-step reasoning that's going on more and more, that's where your GPU comes into play. And as that generative or agentic reaches scale that's where you would look at an LPU assist as well.

Brendan Burke: When evaluating cost versus performance, organizations need to move past the headline price of GPUs to estimate the total cost of ownership for a cluster for a given application. And this is rapidly changing in terms of how the economics of AI clusters pay off. The upfront GPU prices often hide the fact that latency as well as memory availability often become the determining factor of whether a model performs well in production. 

So it can be the case where paying more upfront for the balance of system around a chip can actually reduce the total cost of ownership in the long run by providing better uptime and higher output for users.

Diana Goovaerts: Cost is where a lot of teams still use old metrics. So, FLOPS per dollar used to be a shorthand for great performance. But for generative AI, “delivered tokens” and “total cost to serve” are usually closer to the truth.

Shar Narasimhan: It used to be that the industry always looked at FLOPS per dollar. Basically, they took the total number of FLOPS, or compute FLOPS, in a chip, and they divided it by the purchase price when they were factoring out the total cost of ownership, also known as TCO.

But this gives you a very inaccurate picture. What you want to look at is the delivered token cost: basically the cost per million tokens once all operating and capital expenditures have been fully factored in. A cheap chip that has anemic output will be very expensive in terms of cost per million tokens. A very advanced platform that has high performance, consumes very little energy and has minimal installation and setup costs will deliver a lot of tokens for relatively low cost.
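Editor's sketch (not part of the recorded conversation): the contrast between FLOPS per dollar and delivered token cost can be worked through with a little arithmetic. The Python below is a minimal illustration under stated assumptions. The `Platform` fields, the four-year service life, the electricity price and every number for the two example platforms are hypothetical, chosen only to show how the two metrics can rank the same hardware in opposite order. Real total cost of ownership would also include cooling, networking, staffing and other items left out here.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class Platform:
    """Hypothetical accelerator platform; every figure is an illustrative assumption."""
    name: str
    purchase_price_usd: float   # capital cost of the system
    setup_cost_usd: float       # installation and integration
    peak_tflops: float          # headline compute figure
    power_kw: float             # assumed average draw while powered on
    tokens_per_second: float    # delivered throughput while serving
    utilization: float          # fraction of hours spent serving tokens


def flops_per_dollar(p: Platform) -> float:
    """Old shorthand: peak TFLOPS divided by purchase price only."""
    return p.peak_tflops / p.purchase_price_usd


def cost_per_million_tokens(p: Platform, years: float = 4.0,
                            usd_per_kwh: float = 0.10) -> float:
    """Capital plus energy cost over the service life, per million delivered tokens."""
    hours = years * HOURS_PER_YEAR
    capex = p.purchase_price_usd + p.setup_cost_usd
    # Simplification: the system draws power for every hour, busy or idle.
    energy = p.power_kw * hours * usd_per_kwh
    tokens = p.tokens_per_second * 3600 * hours * p.utilization
    return (capex + energy) / tokens * 1_000_000


# Two made-up platforms: a cheap chip with weak output and a pricier,
# higher-throughput system.
cheap = Platform("cheap", 10_000, 2_000, 400, 1.0, 500, 0.5)
advanced = Platform("advanced", 40_000, 2_000, 1_000, 1.5, 8_000, 0.7)

for p in (cheap, advanced):
    print(f"{p.name}: {flops_per_dollar(p):.3f} TFLOPS per dollar, "
          f"${cost_per_million_tokens(p):.3f} per million tokens")
```

With these invented inputs, the cheap platform looks better on TFLOPS per dollar while the advanced platform comes out cheaper per million delivered tokens, which is the reversal described above.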

Diana Goovaerts: But here's a tension. The industry is producing more specialized chips for specific AI tasks, and that can improve efficiency, but it can also create lock-in or leave you with hardware that's great at last year's model and a little bit awkward for next year’s.

Robert Hormuth: So to me, the real trick is finding the best-balanced architecture for the business outcomes that you want, that give you the flexibility. Because you don't want to choose something that just narrows you into a focus. And the next thing you know, the new OpenClaw comes out and you realize, ‘Oh, darn. I picked the wrong infrastructure and I can't go do the next big move.’

Diana Goovaerts: There's also the practical reality that not every organization has hyperscale problems. For a lot of enterprise teams, the right answer is often smaller and more intentional than the loudest recommendation. 

Matt Kimball: It's important to remember that there are certainly reasons where this specialized silicon is super important and necessary to the AI equation.

It's equally important to understand that that might not always apply to you, right? If you are a commercial enterprise IT leader, you have a couple thousand servers, a few thousand employees, you need to do this agentic AI thing you hear the kids talking about, right? If that is you, there's a really good chance you don't need LPUs, that there's a really good chance you don't need the beefiest GPUs out there, and that the number of GPUs is far less than what you might be being told by folks in the industry.

Think about what your needs are, what you're going to do realistically and if you build the right underlying foundation, you can always build from there. 

Diana Goovaerts: At some point, the chip itself stops being the whole story. The limiting factors become, can you feed it data? Can you cool it? Can you power it, and can you keep the whole system running when demand spikes? 

Robert Hormuth: If you kind of look at the big barriers that we have today that are kind of hitting us front and center with the rise of AI, there's a huge supply-demand issue right now that is taxing the industry. There's a huge memory supply issue that is taxing everybody. And then there's power and cooling. I think those are the four big kind of walls right now.

Brendan Burke: So that goes to, you know, three primary areas that hardware engineers have to optimize for. The first is memory. The availability of memory close to where compute is carried out dictates whether a model can utilize prior sessions as part of a current request. 

The second would be interconnect speed and power usage. The limitations of copper are coming into focus as the connections between chips are accelerating, and the use of optical interconnects is becoming an increasing necessity to future-proof these chip-based designs for advanced processing, using advanced packaging.

And the last topic I'd point to is thermal management and power distribution. As these systems get larger, with frontier racks drawing 200 kilowatts per rack, rapidly progressing towards megawatt scale, you know, just within a given cluster, the usage of high-voltage power as well as the thermal dissipation of those systems is becoming a first-order constraint.

Diana Goovaerts: And then there is the part that is easy to forget when you're shopping for hardware. Software can actually change the economics after the purchase by improving utilization, speed and operational simplicity.

Shar Narasimhan: Software optimizations allow you to take your current generation of infrastructure that you've already purchased and then by constantly improving libraries, kernels, making optimizations in terms of how you fetch and rewrite data, you can actually make the same model on the exact same hardware that you have today much faster as you go forward. 

Diana Goovaerts: If there is one reason this conversation stays urgent, it is because workloads are still moving. Agentic systems, continuous learning and rising token demand are all changing the shape of the infrastructure problem. And that may leave you wondering what's next.

Shar Narasimhan: Agentic AI is going to have a profound impact on compute and drive much more demand for tokens. We used to see three scaling laws that drove higher and higher demands for compute as you went through the different progressions and phases of AI. Now we see a fourth scaling law that's being added, driven by the rise of agentic AI.

It requires far more tokens, has multiple agents that can both read and write tokens much faster than humans could, by an order of magnitude, and that is driving the demand for token generation and consumption far faster than we've seen in the past. It's this rise of agentic AI that's going to put much more pressure on compute.

Robert Hormuth: There's a lot of interesting, I would say, fodder right now, especially around SMT, [simultaneous] multithreading, you know, one core runs as two threads.

Because when you get into the world of agentic AI especially, you're moving from the LLM world where maybe you have six to 10 software stacks. You move to the world of agentic and you may have hundreds of software stacks, thousands of threads communicating, talking to different agents, and there's a lot of cache misses or stalls in the pipeline. So having more threads to launch other worker threads to take advantage of the resources you have deployed is a huge benefit.

SMT fills those stalls with work that can be used on the resources that you've paid for. If you don't have SMT and you hit a stall, you're just waiting, doing nothing, and you have a core and all those transistors just waiting for memory or waiting for an IO. SMT allows more useful work to progress versus just stalling.

Yeah, I'm a big believer in SMT in the world of agentic.
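Editor's sketch (not part of the recorded conversation): the stall-filling argument can be illustrated with a toy cycle-by-cycle model in Python. This is a deliberately simplified assumption, not a model of any real processor. Each hardware thread runs a fixed burst of work and then waits a fixed number of cycles on memory, and the core serves at most one ready thread per cycle. That makes it closer to switch-on-stall multithreading than to true SMT, where threads can share execution resources within the same cycle. The function name and all parameters are hypothetical.

```python
def core_utilization(hw_threads: int, work_cycles: int, stall_cycles: int,
                     total_cycles: int = 10_000) -> float:
    """Fraction of cycles in which the core does useful work (toy model)."""
    work_left = [work_cycles] * hw_threads   # cycles of work before the next stall
    stall_left = [0] * hw_threads            # cycles left waiting on memory or I/O
    busy = 0
    for _ in range(total_cycles):
        issued = False
        for t in range(hw_threads):
            if stall_left[t] > 0:
                # A stalled thread keeps waiting; when the wait ends it has
                # a fresh burst of work ready for a later cycle.
                stall_left[t] -= 1
                if stall_left[t] == 0:
                    work_left[t] = work_cycles
            elif not issued:
                # The first ready thread gets the core for this cycle.
                work_left[t] -= 1
                issued = True
                if work_left[t] == 0:
                    stall_left[t] = stall_cycles
        if issued:
            busy += 1
    return busy / total_cycles


# Made-up workload: 2 cycles of work, then a 6-cycle stall.
one_thread = core_utilization(hw_threads=1, work_cycles=2, stall_cycles=6)
two_threads = core_utilization(hw_threads=2, work_cycles=2, stall_cycles=6)
print(f"1 thread per core:  {one_thread:.2f}")
print(f"2 threads per core: {two_threads:.2f}")
```

In this toy setup a single thread leaves the core idle for most cycles, and a second hardware thread fills part of that idle time. How much real hardware gains depends on the workload and on contention for shared caches and execution units, which this model ignores.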

Brendan Burke: The challenge of the industry will be to find an architecture that is suited for agentic inference going forward because the way that the field is progressing is from significant pre-training including generation of model weights through a backward pass and progressing towards continuous learning where feed forward steps are used to carry out instructions, gain verification and then update model weights from a smaller base. 

And so I think we're still progressing towards what that ideal architecture will look like. And I expect the next wave of processor design to set that as the target rather than the peak FLOPs for training. 

Diana Goovaerts: That's our episode. If you want more on the infrastructure stack, including storage, software and sustainability, make sure you like and subscribe so you can catch the other episodes in this series. But that is all for now, so we will catch you next time.