Slash Inference Costs 100x on AMD GPUs: Featherless CEO Eugene Cheah on RWKV’s Breakthrough

Apr 04, 2025

At TensorWave’s Beyond CUDA Summit 2025, Eugene Cheah — co-lead of RWKV and CEO of Featherless AI — dropped a masterclass on how his team is pushing the boundaries of inference efficiency, challenging the Transformer status quo, and making personal AGI not just possible, but practical.

Here’s a fast breakdown of the big ideas from his talk, “Slash Inference Costs 100x on AMD GPUs.”

The Transformer Problem

Cheah didn’t waste time: “Transformers work — but they hit a wall.”

In Transformer-based models, the cost of attention grows quadratically with context length: every new token must attend to every token before it. That means massive VRAM requirements, slower inference, and ballooning costs.

“If your brain stored every second of your life in perfect detail,” Eugene explained, “it would explode by age 10. That’s basically how Transformers behave.”
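
To make that scaling concrete, here’s a back-of-the-envelope comparison (our illustration, not numbers from the talk): attention’s cost grows with the square of context length, while a recurrent update grows linearly with it.

```python
# Back-of-the-envelope scaling comparison (illustrative only, not
# figures from the talk). Attention touches every pair of tokens;
# a recurrent update touches each token once.

def attention_flops(n_tokens: int, d_model: int) -> int:
    """Rough attention cost: O(n^2 * d)."""
    return n_tokens * n_tokens * d_model

def recurrent_flops(n_tokens: int, d_model: int) -> int:
    """Rough recurrent cost: O(n * d)."""
    return n_tokens * d_model

for n in (1_000, 10_000, 100_000):
    ratio = attention_flops(n, 4096) / recurrent_flops(n, 4096)
    print(f"context {n:>7,} tokens: attention is {ratio:,.0f}x the work")
```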

Enter RWKV: Recurrent Power, Linear Cost

RWKV is the first post-Transformer architecture hosted under the Linux Foundation, and it flips the scaling problem on its head. By compressing context into a fixed-size recurrent state as the model runs, RWKV reduces inference costs by over 100x.

“Quadratic curves are great when they’re making you money,” Eugene said, “but horrible when you’re paying GPU bills.”

RWKV mimics how humans remember: you retain key context, not every word ever said.
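
Mechanically, that means maintaining a fixed-size state instead of an ever-growing KV cache. Here’s a toy NumPy sketch of the idea (our simplification of a linear-attention-style recurrence, not the actual RWKV time-mixing equations):

```python
import numpy as np

# Toy fixed-size "memory" in the spirit of RWKV-style recurrence
# (our simplification, NOT the real RWKV formula). The state stays
# the same size no matter how long the sequence gets.

d = 8                       # tiny hidden size, for illustration
decay = 0.9                 # how quickly old context fades
state = np.zeros((d, d))    # fixed-size memory

rng = np.random.default_rng(0)
for _ in range(1_000):      # process 1,000 tokens...
    k, v = rng.normal(size=(2, d))
    state = decay * state + np.outer(k, v)   # fold token into memory

q = rng.normal(size=d)
out = q @ state             # read out against the compressed memory
print(state.shape)          # (8, 8), unchanged after 1,000 tokens
```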

Qwerky 32B & 72B: Proof You Don’t Need Attention

Featherless just released Qwerky-32B and Qwerky-72B: non-transformer, non-attention, non-QKV models. They were trained on AMD MI300X GPUs. The 72B model? Trained on just 16 GPUs.

“They told us recurrent architectures don’t scale. They said the same about CNNs before T1. And they were wrong.”

Even more shocking: the models beat Transformers on some benchmarks — and they weren’t even trained on superclusters.

Transformers In, RWKV Out

Featherless developed a way to convert Transformer models into RWKV. How?

  1. Strip out the attention layer
  2. Replace it with RWKV’s recurrent attention
  3. Fine-tune with ~200M tokens
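
In PyTorch-style pseudocode, the recipe might look roughly like this. Everything below is a hedged sketch of ours: the module names, attribute layout, and toy mixer are assumptions, not Featherless’s actual code or the real RWKV implementation.

```python
import torch
import torch.nn as nn

class ToyRecurrentMix(nn.Module):
    """Stand-in for an RWKV-style recurrent token mixer (our toy,
    not the real RWKV time-mixing module)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))

    def forward(self, x):                        # x: (batch, seq, d)
        outs, state = [], torch.zeros_like(x[:, 0])
        for t in range(x.size(1)):               # fixed-size state
            state = self.decay * state + x[:, t]
            outs.append(self.proj(state))
        return torch.stack(outs, dim=1)

def convert_block(block: nn.Module, d_model: int) -> None:
    # Steps 1 and 2: strip out attention, swap in a recurrent mixer.
    block.attn = ToyRecurrentMix(d_model)        # assumed attribute name
    # The FFN (e.g. block.mlp) is deliberately left untouched.

# Step 3 would be a short fine-tune (~200M tokens, per the talk) so
# the new mixers learn to stand in for the attention they replaced.
```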

That’s how Qwerky-72B was trained on just 16 GPUs.

“This might mean we’ve misunderstood where AI’s intelligence really comes from.”

Eugene argues the real intelligence lies not in attention, but in the feed-forward network (FFN) — the part Featherless didn’t touch during conversion.
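
For reference, the FFN he’s pointing at is the textbook two-layer MLP inside every Transformer block (a generic sketch, not any particular model’s implementation):

```python
import torch.nn as nn

# The standard Transformer feed-forward block, i.e. the component the
# Qwerky conversion leaves untouched. Textbook form, not any specific
# model's code.
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),   # expand
            nn.GELU(),                      # nonlinearity
            nn.Linear(d_hidden, d_model),   # project back down
        )

    def forward(self, x):
        return self.net(x)
```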

Open, Accessible, Cheap

All models are open source and available now.

Much of this is already running on AMD MI300 GPUs, unlocking incredible performance-per-dollar.

The Real Frontier? Reliability

Cheah isn’t chasing “the smartest brain.” He’s building reliable assistants that work.

“Today’s best models can calculate orbital physics to Mars — but can’t reliably run a college cafeteria register.”

To close that gap, Featherless is focused on memory tuning, a capability unique to RWKV’s recurrent design. The goal: personalized, consistent behavior, not just raw IQ.

And it doesn’t take a supercluster to do this. Just two MI300X nodes per researcher are enough to iterate and fine-tune reliably.

The Vision: Personal AGI for Everyone

The future? Smaller, smarter, more helpful models — not bigger, bloated ones.

Featherless is shifting its mission from just open access to something bolder:

“Lightweight, personalized AGI — made accessible for everyone.”

TL;DR

  • RWKV cuts inference costs by 100x vs. Transformers
  • Qwerky-72B trained on just 16 AMD MI300X GPUs
  • Models are open source, available now, and attention-free
  • Focus is shifting from raw capability to reliability and personalization
  • You don’t need a supercluster to build great AI

About TensorWave

TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.