Published: Apr 22, 2025

Maximizing GPU Efficiency: Model Sizing Insights from Zyphra’s Quentin Anthony

At the Beyond CUDA Summit, Quentin Anthony, Model Training Lead at Zyphra, delivered a highly technical and deeply practical talk on how model sizing choices can dramatically impact GPU efficiency—especially on non-NVIDIA hardware like the AMD Instinct™ MI300X and MI325X.

If you’re developing large language models (LLMs) or optimizing for inference on AMD GPUs, this talk is a must-watch. Here’s a distilled breakdown of the key insights, techniques, and takeaways.

🔧 Why Model Sizing Matters More Than You Think

Anthony opened by breaking down how GPU hardware prefers certain model sizes—not just at a kernel level, but in terms of architectural harmony. Drawing from a recent Zyphra research paper, he revealed that small adjustments in hidden layer dimensions, attention heads, or vocab sizes can yield massive performance gains—sometimes more than major software-level optimizations.

💡 Example: Padding a model’s vocabulary size to a power of two (e.g., 1024 instead of 1023) can lead to significant throughput gains without affecting accuracy.
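As a back-of-the-envelope illustration (the helper below is ours, not from the talk), padding is just a rounding step on the vocabulary size; the padded token IDs are never produced by the tokenizer, so only the embedding and output-projection GEMM shapes change:

```python
import math

def pad_vocab_size(vocab_size: int) -> int:
    """Round a vocabulary size up to the next power of two.

    The padding tokens are never emitted by the tokenizer, so accuracy is
    unchanged; only the embedding / output-projection GEMM shapes improve.
    """
    return 2 ** math.ceil(math.log2(vocab_size))

print(pad_vocab_size(1023))   # 1024
# For large vocabularies, padding to a multiple of 64 or 128 (e.g. 50257 -> 50304)
# is a common, cheaper alternative to a full power of two.
print(pad_vocab_size(50257))  # 65536
```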

📊 Transformer Models = Chains of GPU Kernels

Zyphra treats every layer of a transformer as a chain of GPU kernels, primarily:

  • GEMMs (General Matrix Multiplications) for attention and MLPs
  • LayerNorm and softmax operations

Anthony emphasized that most latency comes from MLP and attention GEMMs, meaning optimizing those kernel sizes has the biggest impact on throughput.
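To see why, here is a rough per-token FLOP tally for one layer of a standard transformer (4x MLP expansion, 2·m·n·k FLOPs per GEMM). The exact accounting is ours, not from the talk, but it shows how the MLP and attention GEMMs dominate:

```python
def layer_flops_per_token(hidden: int, seq_len: int) -> dict:
    """Approximate FLOPs per token for one transformer layer, assuming a
    standard architecture with a 4x MLP expansion (2*m*n*k per GEMM, biases ignored)."""
    return {
        "qkv_projections": 3 * 2 * hidden * hidden,      # Q, K, V projections
        "attention_output": 2 * hidden * hidden,         # attention output projection
        "attention_scores": 2 * 2 * seq_len * hidden,    # QK^T and scores @ V
        "mlp": 2 * (2 * hidden * 4 * hidden),            # H -> 4H and 4H -> H
        "layernorm_softmax": 10 * hidden + 5 * seq_len,  # elementwise, tiny by comparison
    }

flops = layer_flops_per_token(hidden=4096, seq_len=4096)
total = sum(flops.values())
for name, f in sorted(flops.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {100 * f / total:5.1f}%")
```

With these (illustrative) settings, the MLP GEMMs account for well over half of the layer's FLOPs, which is why Zyphra focuses its sizing effort there first.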

🧠 Optimal MLP Kernel Sizing = Roofline Model Wins

MLP kernel performance is governed by predictable patterns:

  • The MLP expands the hidden dimension H to 4H and projects back (4H → H), so its GEMM shapes follow directly from H
  • Those dimensions can easily be aligned to multiples of 64 to saturate the GPU

Zyphra uses lookup tables for each hardware target (Snapdragon, MI300X, etc.) to identify the most efficient shapes per model size.

💡 “We pre-train models from scratch sized specifically for MI300X and other AMD GPUs so developers don’t have to.”
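To make the lookup-table idea above concrete, here is a hypothetical sketch. Zyphra's actual tables are not public, so the hardware targets and alignment values below are placeholders; only the round-up-to-a-multiple idea comes from the talk:

```python
# Placeholder alignments per hardware target; illustrative only, not Zyphra's values.
GEMM_ALIGNMENT = {
    "mi300x": 256,
    "mi325x": 256,
    "snapdragon": 64,
}

def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

def mlp_gemm_shapes(hidden: int, target: str, expansion: int = 4):
    """Snap H to the target's preferred multiple and return the two MLP GEMM shapes."""
    h = round_up(hidden, GEMM_ALIGNMENT[target])
    return (h, expansion * h), (expansion * h, h)   # H -> 4H, then 4H -> H

print(mlp_gemm_shapes(4090, "mi300x"))  # ((4096, 16384), (16384, 4096))
```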

🌀 Attention Kernel Performance: Pre vs. Post Flash Attention

Before Flash Attention, the attention mechanism had erratic performance curves based on:

  • H = hidden dimension
  • A = number of heads
  • T = tensor parallel degree

Key lesson: Keep the per-head dimension H/A divisible by 64, or at least a power of two, for peak GPU utilization. Zyphra visualized this with a performance “wave” showing optimal and suboptimal kernel behaviors.
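A quick way to apply that rule when choosing H and A (a small helper of ours, not from the talk; note that tensor parallelism further splits these dimensions across ranks):

```python
def head_dim_report(hidden: int, num_heads: int) -> str:
    """Check whether the per-head dimension H/A hits the friendly sizes from the talk."""
    if hidden % num_heads != 0:
        return f"H={hidden} is not divisible by A={num_heads}"
    head_dim = hidden // num_heads
    if head_dim % 64 == 0:
        return f"head_dim={head_dim}: divisible by 64 (best case)"
    if (head_dim & (head_dim - 1)) == 0:
        return f"head_dim={head_dim}: power of two (acceptable)"
    return f"head_dim={head_dim}: likely to land on a slow part of the performance wave"

print(head_dim_report(4096, 32))  # head_dim=128: divisible by 64 (best case)
print(head_dim_report(4096, 40))  # H not divisible by A
print(head_dim_report(2048, 64))  # head_dim=32: power of two (acceptable)
```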

⚡ Flash Attention: Avoiding the Wave Quantization Trap

Modern GPUs are composed of Streaming Multiprocessors (SMs; Compute Units on AMD), and they need evenly distributed work to stay fully occupied.

Flash Attention v2 introduced a dual-stream design to mitigate the “wave quantization” issue, which occurs when GPU SMs idle due to imbalanced thread block scheduling.

Takeaway: Align your attention block sizes with SM count to maintain top-of-roofline throughput.
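A simplified model of the effect (our sketch, not Zyphra's code): assume one thread block per SM per wave and equal block runtimes, and a single block beyond the SM count roughly halves utilization. The count below is the MI300X's 304 compute units; swap in your own target's SM/CU count:

```python
import math

def sm_utilization(num_blocks: int, num_sms: int) -> float:
    """Average SM utilization when num_blocks equal-cost thread blocks are
    scheduled one per SM per wave: the last, partially filled wave leaves SMs idle."""
    waves = math.ceil(num_blocks / num_sms)
    return num_blocks / (waves * num_sms)

NUM_CUS = 304  # compute units on an AMD Instinct MI300X; use your GPU's SM/CU count
for blocks in (303, 304, 305, 608, 700):
    print(f"{blocks:4d} blocks -> {sm_utilization(blocks, NUM_CUS):6.1%} average utilization")
```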

🔁 Training vs. Inference: Pick Your Poison

One standout insight: training-optimized models aren’t always inference-optimized—especially across different hardware. For example:

  • Zyphra trained a 1B-parameter model optimized for mobile inference
  • It outperformed smaller models on latency despite being larger, simply because it was hardware-aware from day one

📌 Model developers must now consider where the model will run—in the cloud, on edge devices, or across hybrid stacks.

🔍 Final Thoughts + Future Directions

Quentin closed with a clear thesis:

“Small model sizing tweaks can unlock massive efficiency gains. But only if you know your hardware target in advance.”

Zyphra’s approach—treating transformer model sizing like a classical HPC problem—is becoming essential as AI compute diversifies across MI300X, MI325X, Snapdragon, and non-NVIDIA environments.

They’ll soon release a broader paper covering sizing strategies across architectures like RWKV, Mamba, and others. Watch this space.

📺 Watch the Full Talk

👉 Maximizing GPU Efficiency w/ Quentin Anthony | Beyond CUDA Summit

🚀 Run Efficient Models on AMD GPUs

Deploy your optimized models on TensorWave’s AMD-powered AI cloud—built for training, inference, and experimentation at scale on MI300X and MI325X GPUs.

About TensorWave

TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.

Ready to get started? Connect with a Sales Engineer.