- AI models are getting smaller, but it takes quite a few different techniques to get them there
- In model development, quantization and distillation are key tools to shrink LLMs
- But hardware vendors also have a role to play
What does “small” mean, really? If you’re Meta, it apparently means a 109 billion-parameter large language model capable of running on a single Nvidia H100 GPU. But that chip is notoriously pricey. So, to make AI affordable for the masses, AI leaders are pushing to create smaller, more efficient models that can run on cheaper, less powerful hardware. It’s just that getting there is easier said than done.
Though the public hasn’t necessarily realized it, IBM VP for AI Models David Cox noted there’s been a “radical packing down of capability into smaller and smaller models, and that’s great news for businesses and consumers because it means that you can do more and more. This is us, collectively as a field, sort of sharpening our tools and getting better and better at this stuff.”
What are those tools, you ask? Well, there are a lot of them.
Looking inside the tiny AI toolbox
The first step is distillation, in which a large “teacher” model’s outputs are used to train smaller, specialist “student” models that are very good at a handful of key tasks. But oftentimes, that alone isn’t enough to get models as small as they need to be, AMD’s Senior Director of AI Product Application Engineering Nick Ni said.
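To make the idea concrete, here’s a minimal sketch of the classic distillation loss in PyTorch. The `teacher` and `student` models referenced in the comments are hypothetical placeholders, assumed to share an output vocabulary; real pipelines add task data, hard-label loss terms, and much more.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions so the teacher's knowledge about
    # near-miss answers survives into the training signal
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence; the T^2 factor keeps gradients on a comparable scale
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Inside a training step (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits)
```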
That’s where quantization comes in. Quantization, which we touched on briefly here, involves storing less data for each parameter, cutting the memory the model needs and the cost to run it while preserving accuracy. Think starting from 32-bit representations for each parameter and shrinking those down to 16, 8 or 4 bits each.
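As a simplified sketch of what that means in practice, here is symmetric 8-bit quantization of a single weight tensor in NumPy. Production toolchains layer calibration data, per-channel scales, and fused low-precision kernels on top of this basic idea.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Map float32 weights onto 255 integer levels plus one float scale
    scale = np.abs(weights).max() / 127.0  # symmetric range around zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # small rounding error, 4x less memory
```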
Lately, Ni said, those looking to run AI on edge devices have been pushing the limits of this technique with 2-bit and 1-bit quantization, looking to see “how much data can you reduce while still maintaining an acceptable range of smartness.”
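Taken to the extreme, 1-bit quantization keeps only each weight’s sign plus a single floating-point scale, roughly along the lines of BinaryConnect/XNOR-Net-style binarization. A minimal sketch, not a production kernel:

```python
import numpy as np

def binarize(weights: np.ndarray):
    # One float scale per tensor (mean absolute value), one bit per weight
    scale = np.abs(weights).mean()
    return np.where(weights >= 0, 1, -1).astype(np.int8), scale

w = np.random.randn(256, 256).astype(np.float32)
bits, scale = binarize(w)
w_approx = bits.astype(np.float32) * scale  # ~32x smaller, much coarser
```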
Hardware meets tiny AI
But it’s not just about refining the models in isolation. Hardware also matters.
As Cox pointed out, if a customer can avoid running their AI applications on GPUs and get the job done with CPUs, that’ll save a ton of money. In other instances, such as applications at the edge, the AI needs to be able to run with even less compute power on sensors in the field.
In the case of AI for sensors, AIZip CEO Yubei Chen noted that the company not only uses quantization, but also engages in hardware-based training to optimize the way a model runs on a sensor’s co-processor.
“Training the model for specific hardware allows us to leverage the hardware better,” he explained.
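AIZip hasn’t published the specifics, but a common form of hardware-aware training is quantization-aware training, where the forward pass simulates the target chip’s low-precision arithmetic so the model learns weights that survive quantization. A minimal PyTorch sketch using a straight-through estimator:

```python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, bits=8):
        # Simulate the target hardware's integer grid in the forward pass
        levels = 2 ** (bits - 1) - 1
        scale = w.abs().max() / levels
        return torch.round(w / scale).clamp(-levels, levels) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend rounding was the identity
        return grad_output, None

# In a layer's forward pass, weights are quantized on the fly, so the
# optimizer steers them toward values that round well:
# y = x @ FakeQuantize.apply(self.weight, 8).t()
```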
AMD’s Ni added that it’s not just about matching the model to a chip’s technical capabilities. Vendors like AMD, he said, also bring hardware-specific software to the table that can help models run more effectively. Think things like tools to support quantization and compilers that essentially translate the AI instructions into code that can be efficiently executed on a given chip.
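Vendor stacks differ, but a typical first step in that pipeline is exporting the trained network to a portable graph format such as ONNX, which a chip-specific compiler can then lower into native instructions. A minimal PyTorch sketch (the two-layer model here is just a stand-in):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())
model.eval()
dummy_input = torch.randn(1, 128)  # example input pins down the graph's shapes

torch.onnx.export(model, dummy_input, "tiny_model.onnx")
# From here, a vendor's compiler (for a GPU, NPU or FPGA) takes the ONNX
# graph and emits code tuned to that chip's instruction set and memory layout.
```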
Room for improvement
But even with all these tools on the table, Ni said there’s still room for improvement and more that can be done to make AI models even smaller, smarter and more efficient.
The key? “A lot more co-design,” Ni said. Today, model size and hardware optimization are largely afterthoughts. Companies aim to develop gigantic LLMs or build the biggest, baddest data center GPUs to impress investors before even thinking about things like distillation and quantization.
But “if you really think ‘small model first,’ there’s a fundamentally different way of approaching algorithm development from even the way you’re training,” Ni noted. “I don’t think that’s happening as much.”
In a few years, Ni predicted, there will likely be a shift in thinking toward co-design as issues like privacy and a need for edge deployments come into sharper focus.
“I think the distillation will run out of steam and there will be more native thinking on how to develop models and the software and hardware altogether for those small models,” he concluded.