Published: Apr 22, 2025

GPU Kernel Optimization: Mako’s Plan to Outrun CUDA

At Beyond CUDA Summit 2025, Waleed Atallah, Co-founder and CEO of Mako, delivered one of the most electric talks of the event:
What if we could out-optimize CUDA itself?

Through AI-powered kernel generation and autotuning, Mako is reshaping the future of GPU performance — for NVIDIA, AMD, and beyond.
Here’s the full story.👇

🦈 Why Mako? Fast, Agile, and Coming for CUDA

Waleed kicked off by comparing Mako to its namesake — the fastest shark in the ocean and a natural predator of the barracuda.
(Yes, that CUDA reference.)

Mako Labs is focused on one mission:

  • Optimizing GPU kernels using AI
  • Breaking the CUDA performance moat
  • Unlocking new levels of efficiency for AI models

🔎 Three Core Beliefs Driving Mako’s Vision

  1. Yes, we still need more GPU kernels
  2. Yes, autotuning at massive scale is the future
  3. Yes, AI will soon generate 100,000+ kernels automatically

In short: we're just getting started optimizing for the GPU era.

💡 Why More GPU Kernels Are Essential

Even today, kernel fusion and algorithmic innovations (like Flash Attention) show how small changes unlock huge efficiency gains.

Waleed shared examples like:

  • Flash Attention enabling long-context LLMs
  • Quantization innovations needing brand-new, custom kernels
  • DeepSeek’s groundbreaking low-rank factorization kernels

Bottom line?
Each new model and each new hardware innovation demands new, smarter kernels — and the need is accelerating.
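The payoff of kernel fusion can be sketched in miniature. The toy below is purely illustrative (plain Python lists, not GPU code, and not Mako's or PyTorch's actual fusion machinery): an unfused pipeline launches three "kernels" and materializes an intermediate buffer between each, while the fused version does the same math in a single pass — one read, one write, no intermediates. On a real GPU this is the difference between being memory-bandwidth-bound and not.

```python
# Toy illustration of kernel fusion. Each list comprehension stands in for
# one kernel launch; the intermediates stand in for extra memory traffic.

def unfused(xs, w, b):
    scaled = [x * w for x in xs]             # kernel 1: writes an intermediate
    shifted = [s + b for s in scaled]        # kernel 2: reads and writes again
    return [max(t, 0.0) for t in shifted]    # kernel 3: ReLU

def fused(xs, w, b):
    # One pass over the data: all three ops applied per element.
    return [max(x * w + b, 0.0) for x in xs]

data = [-2.0, -1.0, 0.5, 3.0]
assert unfused(data, 2.0, 1.0) == fused(data, 2.0, 1.0)
```

Flash Attention applies the same principle to attention itself: recomputing and tiling so the full attention matrix is never materialized in slow memory.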

🎯 The New Bottleneck: How Do You Choose the Best Kernel?

As the number of kernels grows into the thousands, manual selection becomes impossible.

Mako’s answer:

  • AI-powered autotuning that intercepts and extends the TorchInductor pipeline
  • Smart search algorithms using reinforcement learning and zero-cost proxies
  • Caching best kernels automatically for fast, repeatable deployment

Early results:

  • 20–30% speed gains on NVIDIA GPUs
  • 50–100%+ gains on AMD MI300X and MI325X GPUs

That’s serious performance unlocked — with no human hand-tuning needed.

🧠 The Future: AI-Generated GPU Kernels at Scale

A new era is beginning:

  • LLM-generated kernels at massive scale
  • Recursive self-improving agents benchmarking, profiling, and evolving kernels in real time
  • Continuous offline optimization baked directly into compilers

Mako is building its own GPU Kernel Agent, fine-tuned to:

  • Learn hardware internals (NVIDIA, AMD, more)
  • Autonomously compile, benchmark, and improve
  • Optimize beyond what human engineers could scale manually
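The compile-benchmark-improve loop can be caricatured as a hill climb. Everything here is a stand-in (Mako's agent uses LLM generation and real GPU profiling, not this toy): a "kernel" is reduced to a single block-size parameter, and the "benchmark" is a synthetic cost function whose minimum sits at block = 64.

```python
import random

def benchmark(block):
    # Stand-in for measured latency; lower is better, best at block = 64.
    return (block - 64) ** 2

def evolve(generations=200, seed=0):
    rng = random.Random(seed)
    best = 256  # deliberately bad starting configuration
    for _ in range(generations):
        # Propose a mutated candidate, keep it only if it measures faster.
        candidate = max(1, best + rng.choice([-16, -8, 8, 16]))
        if benchmark(candidate) < benchmark(best):
            best = candidate
    return best

tuned = evolve()
assert benchmark(tuned) <= benchmark(256)  # never worse than the start
```

The real version swaps the mutation step for LLM-driven code generation and the cost function for on-device profiling, but the accept-only-improvements skeleton is the same.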

This is not just code generation — it’s autonomous performance evolution.

🏁 Why It Matters: Beyond CUDA, Toward True AI Performance Freedom

Waleed’s vision isn’t just about being faster than CUDA.
It’s about freeing AI development from hardware lock-in:

  • Kernel optimization that adapts to any hardware
  • Enabling new models, bigger architectures, and faster training
  • Democratizing performance for the entire AI industry

Mako isn’t just rewriting kernels.
They’re rewriting the rules for AI performance itself. 🚀

📺 Watch the Full Talk 👉 GPU Kernel Optimization with Waleed Atallah | Beyond CUDA Summit 2025

🚀 Run Efficient Models on AMD GPUs

Deploy your optimized models on TensorWave’s AMD-powered AI cloud—built for training, inference, and experimentation at scale on MI300X and MI325X GPUs.

About TensorWave

TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.

Ready to get started? Connect with a Sales Engineer.