Published: Apr 22, 2025
Scaling AI Inference on AMD: Insights from Chai, TensorWave, and MK1

At the Beyond CUDA Summit 2025, leaders from Chai, TensorWave, and MK1 took the stage to share real-world lessons on scaling AI inference to millions of users — and why AMD MI300X and MI325X cloud platforms are changing the game for cost, performance, and flexibility.
Here’s the full breakdown of this powerhouse panel with Will Beauchamp, Kyle Bell, and Paul Merolla.👇
🔗 Quick Backgrounds: Meet the Panelists
- Will Beauchamp, Founder of Chai
➔ Built one of the world’s biggest consumer AI platforms, generating 60 trillion tokens per month.
- Kyle Bell, VP of AI at TensorWave
➔ Leads AI infrastructure and MLOps on AMD’s MI300X and MI325X cloud.
- Paul Merolla, CEO of MK1
➔ Ex-Neuralink founding engineer, now building one of the fastest inference platforms in the world.
🌎 The Birth of Chai: Democratizing AI Creation
Will shared how Chai began — before ChatGPT went viral — as a mission to make AI accessible to everyone, not just coders.
Instead of gatekeeping AI behind APIs, Chai built an open social platform where users create and interact with AIs as easily as they upload videos to YouTube.
Today, Chai drives:
- 5M+ active users
- 60 trillion tokens processed monthly
- Teens spending over 90 minutes a day talking to AIs
⚡ Scaling Inference: Why AMD is Winning for Cost & Performance
As Chai scaled to massive traffic, they faced a critical decision: stick with expensive NVIDIA GPUs or find a more cost-efficient alternative.
After rigorous benchmarks:
- AMD MI300X outperformed H100 across key scenarios
- Performance per dollar nearly doubled
- Massive memory on AMD GPUs allowed multi-model hosting on a single chip (see the sketch below)
Result? Chai cut compute costs in half, saving $10M+ per year, without degrading user experience.
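To make the multi-model-hosting point concrete, here is a minimal sketch of serving two models from a single large-memory GPU. It assumes a ROCm build of PyTorch (AMD GPUs appear under the "cuda" device namespace) plus Hugging Face transformers, and the model names are placeholders rather than Chai's actual models.

```python
# Minimal sketch: two models co-resident on one large-memory GPU (e.g., a 192 GB MI300X).
# Assumes ROCm PyTorch and Hugging Face transformers; model names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda:0"  # a single MI300X under ROCm

MODEL_A = "your-org/chat-model-13b"   # placeholder
MODEL_B = "your-org/rerank-model-7b"  # placeholder

# Both models fit in one GPU's HBM, so requests can be routed between them
# without cross-device transfers or extra hosts.
tok_a = AutoTokenizer.from_pretrained(MODEL_A)
model_a = AutoModelForCausalLM.from_pretrained(MODEL_A, torch_dtype=torch.float16).to(device)

tok_b = AutoTokenizer.from_pretrained(MODEL_B)
model_b = AutoModelForCausalLM.from_pretrained(MODEL_B, torch_dtype=torch.float16).to(device)

def generate(model, tok, prompt, max_new_tokens=64):
    inputs = tok(prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)

print(generate(model_a, tok_a, "Hello!"))
print(f"{torch.cuda.memory_allocated(device) / 1e9:.1f} GB allocated on the GPU")
```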
🛠️ How MK1 Optimized Inference Workloads on AMD
Paul Merolla explained how MK1 fine-tuned performance for Chai:
- Advanced quantization and cache optimizations (a quantization sketch follows below)
- Tailored vectorized operations for AMD’s architecture
- Continuous A/B testing to optimize user experience and retention
MK1’s stack delivered 2x gains over standard inference engines — proving AMD GPUs could not just match but beat legacy setups.
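As a rough illustration of the quantization idea, here is a minimal per-channel int8 weight quantization sketch in plain PyTorch. It is a generic example, not MK1's proprietary implementation, and the layer size is arbitrary.

```python
# Minimal sketch of per-channel int8 weight quantization, one of the broad
# techniques mentioned above; generic illustration, not MK1's stack.
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Quantize a [out_features, in_features] weight to int8 with one scale per output channel."""
    max_abs = weight.abs().amax(dim=1, keepdim=True)   # per-output-channel range
    scale = max_abs.clamp(min=1e-8) / 127.0             # map that range onto [-127, 127]
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_per_channel_int8(w)
print("mean abs error:", (dequantize(q, scale) - w).abs().mean().item())
# int8 storage halves memory vs fp16, freeing HBM for more KV cache or extra models.
```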
🧠 Inside the MI300X: Why It's a Game-Changer for AI
Kyle Bell (TensorWave) highlighted why AMD's new chips dominate inference:
- More memory means larger models and concurrent workloads
- Lower total cost of ownership vs H100
- Chiplet architecture allows flexible GPU partitioning
- CAG (Cache Augmented Generation) techniques unlock faster, cheaper long-context handling (see the sketch below)
👉 More memory = more efficient RAG, longer contexts, persistent caching, and future-ready AI pipelines.
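Here is a minimal sketch of the cache-augmented generation idea: pay for the long shared context once, keep its KV cache in the GPU's large memory, and reuse it for many short queries. It uses Hugging Face transformers with a placeholder model name; production serving engines do this inside the scheduler rather than in user code.

```python
# Minimal sketch of cache-augmented generation: compute the KV cache for a long
# shared context once, then reuse it for many short queries. Illustrative only;
# the model name is a placeholder.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda:0"
MODEL = "your-org/chat-model-7b"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to(device).eval()

long_context = "...many thousands of tokens of shared documents..."
ctx_ids = tok(long_context, return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    ctx_cache = model(ctx_ids, use_cache=True).past_key_values  # paid once, kept in HBM

def answer(question, max_new_tokens=32):
    cache = copy.deepcopy(ctx_cache)  # keep the shared context cache pristine
    ids = tok(question, return_tensors="pt").input_ids.to(device)
    out_tokens = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=cache, use_cache=True)
            cache = out.past_key_values
            ids = out.logits[:, -1:].argmax(dim=-1)  # greedy next token
            out_tokens.append(ids.item())
    return tok.decode(out_tokens, skip_special_tokens=True)

print(answer("What does the document say about pricing?"))
```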
🔍 Quant Trading Meets AI: Lessons in Fast Iteration
Will drew parallels between his early algorithmic trading days and Chai’s approach to AI:
- Focus on pipelines, not just models
- 100+ LLMs trained and evaluated daily
- Human preference A/B tests, rather than synthetic benchmarks, to optimize real user satisfaction (see the sketch below)
Small 1% improvements stacked over time, a mindset key to scaling AI at hypergrowth speed.
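As a toy illustration of reading out such a human-preference A/B test (not Chai's actual pipeline), here is a minimal sketch that checks whether a candidate model's head-to-head win rate is really above 50%; the counts are made-up placeholders.

```python
# Minimal sketch of a human-preference A/B readout: users pick the reply they
# prefer between two model variants, and a binomial test asks whether the
# candidate's win rate exceeds 50%. Counts below are placeholders.
from scipy.stats import binomtest

wins_candidate = 5320      # placeholder: votes preferring the candidate model
total_comparisons = 10000  # placeholder: total head-to-head votes

result = binomtest(wins_candidate, total_comparisons, p=0.5, alternative="greater")
win_rate = wins_candidate / total_comparisons

print(f"win rate: {win_rate:.1%}, p-value: {result.pvalue:.4f}")
# Ship the candidate only if the win rate clears a pre-registered threshold;
# small 1% wins like this compound over time.
```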
📢 MK1’s Big Announcement: New Open Source Optimization Library
Paul closed the panel by unveiling an exciting surprise:
MK1 has open-sourced a library that compresses and optimizes multi-GPU inference bandwidth, achieving up to 2x lower latency.
📺 Watch the Full Panel 👉 Scaling AI Inference: Chai, TensorWave & MK1 | Beyond CUDA Summit 2025
🚀 Run Efficient Models on AMD GPUs
Deploy your optimized models on TensorWave’s AMD-powered AI cloud—built for training, inference, and experimentation at scale on MI300X and MI325X GPUs.
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.