Software tweaks are Nvidia’s secret AI sauce

  • Nvidia used software tweaks to boost performance of its H100 GPUs
  • It notably achieved linear scaling of performance when training a GPT-3 175B model with 11,616 H100 GPUs
  • Execs said 100,000+ GPU-scale deployments are just around the corner

It feels like a long time has passed since Nvidia unveiled its H100 chip – based on its Hopper architecture – in March 2022. Since then, the company has launched its H200 chip and announced plans for two more high-power GPUs for artificial intelligence (AI), Blackwell Ultra and Rubin, due out in 2025 and 2026, respectively. But the company hasn’t abandoned the H100. Far from it.

This week Nvidia announced that it’s managed to make the H100 perform faster than ever courtesy of a series of software tweaks. The result? A 512-GPU Hopper cluster performed 27% better in MLPerf Training benchmark tests this year than the same number of GPUs did last year, and delivered sustained performance of 900 teraflops per GPU during training.

So, how did Nvidia do it? Dave Salvator, director of Accelerated Computing Products at Nvidia, said that, among other things, it involved optimizing the use of the FP8 data format for training, which uses less memory than, for instance, the FP16 format. And that means data can be moved in and out of the GPU more quickly.
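The memory argument is simple arithmetic: an FP8 value occupies one byte versus two for FP16 and four for FP32. A rough back-of-the-envelope sketch (not Nvidia's implementation, and counting only model weights, not activations or optimizer state):

```python
# Illustrative only: memory footprint of model weights at different
# precisions. Byte widths are the standard sizes for each format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory needed just for the weights, in gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

params = 175e9  # GPT-3 175B parameter count
for p in ("fp32", "fp16", "fp8"):
    print(f"{p}: {weight_memory_gb(params, p):.0f} GB")
```

Halving the bytes per value also halves the bandwidth needed to stream those values through the memory hierarchy, which is where the speedup Salvator describes comes from.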

“It’s important to note that we used something called our transformer engine which is a feature we’ve had in the Hopper architecture which basically allows us to make intelligent use of FP8,” he explained. “What that means is we go layer by layer through the model, analyzing each layer and basically asking what sounds like a simple question: ‘can we run this layer with FP8 precision and not harm accuracy?’ If the answer is yes, we run that layer in FP8.”
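The per-layer decision Salvator describes can be sketched as a simple selection loop. This is a hypothetical illustration, not the transformer engine's actual API; the layer names, the accuracy estimates, and the tolerance threshold are all invented for the example:

```python
# Hypothetical sketch of the layer-by-layer precision decision: run a
# layer in FP8 only if the estimated accuracy cost is acceptable.
def choose_precision(layers, accuracy_loss_at_fp8, tolerance=1e-3):
    """Assign FP8 to each layer whose estimated accuracy cost is
    within tolerance; fall back to FP16 otherwise."""
    plan = {}
    for layer in layers:
        loss = accuracy_loss_at_fp8[layer]
        plan[layer] = "fp8" if loss <= tolerance else "fp16"
    return plan

plan = choose_precision(
    ["attention", "mlp", "layernorm"],
    {"attention": 0.0005, "mlp": 0.0002, "layernorm": 0.01},
)
print(plan)  # layernorm stays in FP16; the other layers drop to FP8
```

The point of the per-layer granularity is that precision-sensitive layers keep FP16 while the bulk of the model runs in the cheaper format.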

Nvidia also implemented an FP8-aware distributed optimizer and brought FlashAttention implementations into its cuDNN library to minimize and speed up back-and-forth data transfers between memory types. It also implemented intelligent GPU power allocation and overlapped its math and communications operations.

The latter refers to the balance between calculations and the conversations GPUs in a cluster need to have with one another. By allowing those two operations to overlap, Nvidia was able to reduce its overall execution time. Salvator likened it to “buttering your toast while it’s still in the toaster.”
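The toast analogy can be made concrete with a toy timing model (the numbers here are made up for illustration): if math and communication run sequentially, a training step costs their sum; if they overlap, it costs roughly the longer of the two.

```python
# Toy model of overlapping compute and communication. Timings are
# invented; real overlap depends on scheduling and interconnect limits.
def step_time(math_ms: float, comm_ms: float, overlap: bool) -> float:
    """Per-step wall time under a fully-sequential or fully-overlapped
    schedule."""
    return max(math_ms, comm_ms) if overlap else math_ms + comm_ms

math_ms, comm_ms = 30.0, 20.0
print(step_time(math_ms, comm_ms, overlap=False))  # 50.0 ms per step
print(step_time(math_ms, comm_ms, overlap=True))   # 30.0 ms per step
```

In the ideal case the communication cost is hidden entirely behind the math, which is why overlap shrinks overall execution time rather than just rearranging it.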


Nvidia also highlighted another major milestone: Salvator said the company achieved linear scaling of performance when training a GPT-3 175B model with 11,616 H100 GPUs in 2024, delivering 3.2x the performance it achieved when it tackled the same task with 3,584 H100 GPUs in 2023.

“A lot of times with workloads as you go to much larger scales, if you can get 65, 70% scaling efficiency, you’re pretty happy. With 80% scaling efficiency, you’re thrilled. But what we’ve been able to do through a combination of more hardware but also a lot of software tuning work is to get linear scaling on this workload. It’s very rare for this to happen,” he said. “If you just throw hardware at the problem, it’s not a given that you’re going to get particularly good scaling.”
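The article's own numbers bear this out. Scaling efficiency is the achieved speedup divided by the increase in hardware, and working it through for the figures quoted above:

```python
# Checking the scaling claim with the numbers from the article:
# 3,584 GPUs in 2023 vs 11,616 GPUs in 2024, with 3.2x the performance.
gpus_2023, gpus_2024 = 3_584, 11_616
speedup = 3.2

hardware_ratio = gpus_2024 / gpus_2023  # ~3.24x the GPUs
efficiency = speedup / hardware_ratio   # fraction of ideal linear scaling
print(f"{hardware_ratio:.2f}x hardware, {efficiency:.0%} scaling efficiency")
```

Roughly 99% efficiency, well above the 65–80% range Salvator cites as the usual ceiling for large-scale workloads, which is what justifies calling the result "linear scaling."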

Why does this matter?

Well, Salvator noted several massive scale GPU deployments are “just around the corner.”

“There are commercial AI factories being built as we speak that have been talked about briefly publicly that are going to be at scales of over 100,000 GPUs,” he said. “We expect one of those to come online this year. The other one will be based on Blackwell [and] should come online sometime in 2025.”

The Blackwell-based deployment will include some 300,000 GPUs.

These AI factories, essentially data centers with 100,000+ GPUs, won’t just be used to train models but also retrain them to ensure that they remain accurate, he added. The frequency of retraining can range from every few hours to every few months, Salvator said.

“This scale will enable that next round of innovation,” he concluded.