Today’s data centers require extreme makeovers to meet AI requirements, says Marvell

Today’s data centers will require drastic hardware transformation to meet artificial intelligence (AI) needs, according to Marvell. By extraordinary coincidence, the company also says it sells the components needed to make data center equipment fit for purpose.

AI places unprecedented demands on processors, clustering, switching and storage—all of which Marvell provides, CTO Noam Mizrahi said in an interview.

Solving connectivity and infrastructure problems is as difficult as any other compute problem in the system because of bandwidth, latency and other requirements, Mizrahi said.

The current architecture of the data center hasn’t changed in about 15 years. “No matter how many new use cases and applications we threw at it, it was always flexible enough to adapt,” Mizrahi said.

But AI has completely different requirements: massive GPU hardware with different types of clustering and networking. Data center hardware will also need to be refreshed more frequently to keep pace with the growth and requirements of language models. Large language models (LLMs) are growing tenfold yearly, and infrastructure needs to keep up.

AI is costly to operate, requiring enormous power to train models over days and weeks. “Training a model like GPT-3 with 175 billion parameters can already consume 1,287 megawatt hours, enough to power around 120 U.S. homes for a year. A 10x increase in model performance—something that will occur—could translate to 10,000x increases in computational and energy needs,” Mizrahi said in an article on the company blog.
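The homes comparison checks out against the roughly 10.7 MWh of electricity an average U.S. household consumes per year (an outside benchmark, not a Marvell figure):

\[
\frac{1{,}287\ \text{MWh}}{10.7\ \text{MWh per home per year}} \approx 120\ \text{homes for a year}
\]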

AI needs hundreds of thousands of GPUs, requiring the entire data center to be interconnected into a single cluster with terabits per second of low-latency connectivity. Optical networking will be needed to meet those demands, using either chiplets connected directly to the compute platforms or pluggable modules.
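For a rough sense of what those numbers imply, assume a hypothetical cluster of 100,000 GPUs, each with a 400 Gb/s network interface (typical of current AI fabrics, not a figure from Marvell):

\[
8\ \text{GPUs/server} \times 400\ \text{Gb/s} = 3.2\ \text{Tb/s per server}, \qquad 100{,}000 \times 400\ \text{Gb/s} = 40\ \text{Pb/s aggregate}
\]

Even the per-server figure already sits in the terabits-per-second range Mizrahi describes.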

Hardware must be disaggregated, so CPUs, GPUs, DPUs, memory, storage, networking and other components can scale independently.

In many cases, GPUs will give way to customized ASICs that are fine-tuned to AI requirements. “You’ll need to push the boundaries in every aspect of silicon design to make monstrous chips efficient to your needs,” Mizrahi said. These will require larger-scale chiplets and connectivity, with advanced packaging to connect all the components.

Networking will also need to evolve. The size of AI networks grew by 10x per year over the past five years, and by 2027, one in five Ethernet switch ports in data centers will be dedicated to AI, ML and accelerated computing, Mizrahi said on the company blog.
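Taken at face value, five consecutive years of 10x growth compounds to five orders of magnitude:

\[
10^{5} = 100{,}000\times\ \text{growth in AI network size over five years}
\]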

Storage will need to improve capacity, speed and throughput as well.

Marvell, of course, sees business opportunities in meeting those infrastructure needs, providing optical networking, switching, storage and custom ASICs.

All that new hardware is costly. Generative AI data center infrastructure and operating costs will exceed $76 billion by 2028, according to Tirias Research. For comparison, that’s more than twice the estimated annual operating cost of Amazon Web Services (AWS), which today holds about a third of the cloud infrastructure services market.

Data centers will benefit from a 4X improvement in hardware compute performance, but a 50X increase in processing workloads will swamp that gain, Tirias Research warned in its report. Similarly, while software will be optimized for efficiency, that optimization will be countered by greater demand.
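Those two figures imply a simple arithmetic gap:

\[
\frac{50\times\ \text{workload growth}}{4\times\ \text{hardware improvement}} = 12.5\times\ \text{shortfall to be covered by more hardware}
\]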