Published: Apr 22, 2025
GPU Kernel Optimization: Mako’s Plan to Outrun CUDA

At Beyond CUDA Summit 2025, Waleed Atallah, Co-founder and CEO of Mako, delivered one of the most electric talks of the event:
What if we could out-optimize CUDA itself?
Through AI-powered kernel generation and autotuning, Mako is reshaping the future of GPU performance — for NVIDIA, AMD, and beyond.
Here’s the full story.👇
🦈 Why Mako? Fast, Agile, and Coming for CUDA
Waleed kicked off by comparing Mako to its namesake — the fastest shark in the ocean and a natural predator of the barracuda.
(Yes, that CUDA reference.)
Mako Labs is focused on one mission:
- Optimizing GPU kernels using AI
- Breaking the CUDA performance moat
- Unlocking new levels of efficiency for AI models
🔎 Three Core Beliefs Driving Mako’s Vision
- Yes, we still need more GPU kernels
- Yes, autotuning at massive scale is the future
- Yes, AI will soon generate 100,000+ kernels automatically
In short: the era of GPU optimization is just getting started.
💡 Why More GPU Kernels Are Essential
Even today, kernel fusion and algorithmic innovations (like Flash Attention) show how small changes unlock huge efficiency gains.
Waleed shared examples like:
- Flash Attention enabling long-context LLMs
- Quantization innovations needing brand-new, custom kernels
- DeepSeek’s groundbreaking low-rank factorization kernels
Bottom line?
Each new model and each new hardware innovation demands new, smarter kernels — and the need is accelerating.
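The efficiency argument behind kernel fusion can be made concrete. The sketch below is purely illustrative (it is not Mako's or Flash Attention's code): the "unfused" version runs three separate passes, each writing a full intermediate array to memory, while the "fused" version computes the same result in a single pass. On a GPU, that difference translates into fewer kernel launches and far less memory traffic.

```python
def unfused(x, w, b):
    # Three separate "kernels": each pass reads and writes a full array,
    # so two intermediate arrays round-trip through memory.
    t1 = [xi * wi for xi, wi in zip(x, w)]       # kernel 1: multiply
    t2 = [t + bi for t, bi in zip(t1, b)]        # kernel 2: add bias
    return [max(t, 0.0) for t in t2]             # kernel 3: ReLU

def fused(x, w, b):
    # One "kernel": the same math in a single pass, no intermediates
    # ever materialized in memory.
    return [max(xi * wi + bi, 0.0) for xi, wi, bi in zip(x, w, b)]
```

Both functions return identical results; the fused form simply does less memory work — the same principle, applied to attention, is what makes Flash Attention fast.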
🎯 The New Bottleneck: How Do You Choose the Best Kernel?
As the number of kernels grows into the thousands, manual selection becomes impossible.
Mako’s answer:
- AI-powered autotuning that intercepts and extends the TorchInductor pipeline
- Smart search algorithms using reinforcement learning and zero-cost proxies
- Caching best kernels automatically for fast, repeatable deployment
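At its core, the benchmark-and-cache loop is simple to sketch. The toy example below is an assumption-laden illustration, not Mako's implementation: two candidate "kernels" compute the same reduction, an autotuner times each, picks the fastest, and caches the choice per input size so deployment never re-pays the search cost.

```python
import time

# Two hypothetical "kernel" variants computing the same reduction.
def sum_loop(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

_cache = {}  # (op name, input size) -> fastest variant found so far

def autotune(op, variants, args, repeats=5):
    """Benchmark each candidate kernel and cache the fastest per input size."""
    key = (op, len(args[0]))
    if key in _cache:                 # cached: skip the search entirely
        return _cache[key]
    timings = []
    for fn in variants:
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        timings.append((time.perf_counter() - start, fn))
    best = min(timings, key=lambda t: t[0])[1]   # pick the fastest variant
    _cache[key] = best
    return best

data = [float(i) for i in range(10_000)]
kernel = autotune("sum", [sum_loop, sum_builtin], (data,))
result = kernel(data)
```

A production autotuner replaces the timing loop with profiling, reinforcement-learning search, and zero-cost proxies over thousands of real GPU kernels, but the select-and-cache structure is the same.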
Early results:
- 20–30% speed gains on NVIDIA GPUs
- 50–100%+ gains on AMD MI300X and MI325X GPUs
That’s serious performance unlocked — with no human hand-tuning needed.
🧠 The Future: AI-Generated GPU Kernels at Scale
A new era is beginning:
- LLM-generated kernels at massive scale
- Recursive self-improving agents benchmarking, profiling, and evolving kernels in real time
- Continuous offline optimization baked directly into compilers
Mako is building its own GPU Kernel Agent, fine-tuned to:
- Learn hardware internals (NVIDIA, AMD, more)
- Autonomously compile, benchmark, and improve
- Optimize at a scale no team of human engineers could match by hand
This is not just code generation — it’s autonomous performance evolution.
🏁 Why It Matters: Beyond CUDA, Toward True AI Performance Freedom
Waleed’s vision isn’t just about being faster than CUDA.
It’s about freeing AI development from hardware lock-in:
- Kernel optimization that adapts to any hardware
- Enabling new models, bigger architectures, and faster training
- Democratizing performance for the entire AI industry
Mako isn’t just rewriting kernels.
They’re rewriting the rules for AI performance itself. 🚀
📺 Watch the Full Talk 👉 GPU Kernel Optimization with Waleed Atallah | Beyond CUDA Summit 2025
🚀 Run Efficient Models on AMD GPUs
Deploy your optimized models on TensorWave’s AMD-powered AI cloud—built for training, inference, and experimentation at scale on MI300X and MI325X GPUs.
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
Ready to get started? Connect with a Sales Engineer.