Published: Apr 22, 2025
GPU Kernel Optimization: Mako’s Plan to Outrun CUDA

At Beyond CUDA Summit 2025, Waleed Atallah, Co-founder and CEO of Mako, delivered one of the most electric talks of the event:
What if we could out-optimize CUDA itself?
Through AI-powered kernel generation and autotuning, Mako is reshaping the future of GPU performance — for NVIDIA, AMD, and beyond.
Here’s the full story.👇
🦈 Why Mako? Fast, Agile, and Coming for CUDA
Waleed kicked off by comparing Mako to its namesake — the fastest shark in the ocean and a natural predator of the barracuda.
(Yes, that CUDA reference.)
Mako Labs is focused on one mission:
- Optimizing GPU kernels using AI
- Breaking the CUDA performance moat
- Unlocking new levels of efficiency for AI models
🔎 Three Core Beliefs Driving Mako’s Vision
- Yes, we still need more GPU kernels
- Yes, autotuning at massive scale is the future
- Yes, AI will soon generate 100,000+ kernels automatically
In short: the era of GPU optimization is just getting started.
💡 Why More GPU Kernels Are Essential
Even today, kernel fusion and algorithmic innovations (like Flash Attention) show how small changes unlock huge efficiency gains.
Waleed shared examples like:
- Flash Attention enabling long-context LLMs
- Quantization innovations needing brand-new, custom kernels
- DeepSeek’s groundbreaking low-rank factorization kernels
Bottom line?
Each new model and each new hardware innovation demands new, smarter kernels — and the need is accelerating.
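The efficiency argument behind kernel fusion can be made concrete. The sketch below is purely illustrative (it is not Mako's or Flash Attention's code): the "unfused" version runs three separate passes, each writing a full intermediate array to memory, while the "fused" version computes the same result in a single pass. On a GPU, that difference translates into fewer kernel launches and far less memory traffic.

```python
def unfused(x, w, b):
    # Three separate "kernels": each pass reads and writes a full array,
    # so two intermediate arrays round-trip through memory.
    t1 = [xi * wi for xi, wi in zip(x, w)]       # kernel 1: multiply
    t2 = [t + bi for t, bi in zip(t1, b)]        # kernel 2: add bias
    return [max(t, 0.0) for t in t2]             # kernel 3: ReLU

def fused(x, w, b):
    # One "kernel": the same math in a single pass, no intermediates
    # ever materialized in memory.
    return [max(xi * wi + bi, 0.0) for xi, wi, bi in zip(x, w, b)]
```

Both functions return identical results; the fused form simply does less memory work — the same principle, applied to attention, is what makes Flash Attention fast.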
🎯 The New Bottleneck: How Do You Choose the Best Kernel?
As the number of kernels grows into the thousands, manual selection becomes impossible.
Mako’s answer:
- AI-powered autotuning that intercepts and extends the TorchInductor pipeline
- Smart search algorithms using reinforcement learning and zero-cost proxies
- Caching best kernels automatically for fast, repeatable deployment
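At its core, the benchmark-and-cache loop is simple to sketch. The toy example below is an assumption-laden illustration, not Mako's implementation: two candidate "kernels" compute the same reduction, an autotuner times each, picks the fastest, and caches the choice per input size so deployment never re-pays the search cost.

```python
import time

# Two hypothetical "kernel" variants computing the same reduction.
def sum_loop(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

_cache = {}  # (op name, input size) -> fastest variant found so far

def autotune(op, variants, args, repeats=5):
    """Benchmark each candidate kernel and cache the fastest per input size."""
    key = (op, len(args[0]))
    if key in _cache:                 # cached: skip the search entirely
        return _cache[key]
    timings = []
    for fn in variants:
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        timings.append((time.perf_counter() - start, fn))
    best = min(timings, key=lambda t: t[0])[1]   # pick the fastest variant
    _cache[key] = best
    return best

data = [float(i) for i in range(10_000)]
kernel = autotune("sum", [sum_loop, sum_builtin], (data,))
result = kernel(data)
```

A production autotuner replaces the timing loop with profiling, reinforcement-learning search, and zero-cost proxies over thousands of real GPU kernels, but the select-and-cache structure is the same.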
Early results:
- 20–30% speed gains on NVIDIA GPUs
- 50–100%+ gains on AMD MI300X and MI325X GPUs
That’s serious performance unlocked — with no human hand-tuning needed.
🧠 The Future: AI-Generated GPU Kernels at Scale
A new era is beginning:
- LLM-generated kernels at massive scale
- Recursive self-improving agents benchmarking, profiling, and evolving kernels in real time
- Continuous offline optimization baked directly into compilers
Mako is building its own GPU Kernel Agent, fine-tuned to:
- Learn hardware internals (NVIDIA, AMD, more)
- Autonomously compile, benchmark, and improve
- Optimize at a scale no team of human engineers could match by hand
This is not just code generation — it’s autonomous performance evolution.
🏁 Why It Matters: Beyond CUDA, Toward True AI Performance Freedom
Waleed’s vision isn’t just about being faster than CUDA.
It’s about freeing AI development from hardware lock-in:
- Kernel optimization that adapts to any hardware
- Enabling new models, bigger architectures, and faster training
- Democratizing performance for the entire AI industry
Mako isn’t just rewriting kernels.
They’re rewriting the rules for AI performance itself. 🚀
📺 Watch the Full Talk 👉 GPU Kernel Optimization with Waleed Atallah | Beyond CUDA Summit 2025
🚀 Run Efficient Models on AMD GPUs
Deploy your optimized models on TensorWave’s AMD-powered AI cloud—built for training, inference, and experimentation at scale on MI300X and MI325X GPUs.
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
Ready to get started? Connect with a Sales Engineer.