Published: Apr 22, 2025
Data-Efficient Training on AMD GPUs

At the Beyond CUDA Summit, Neha Prakriya, PhD student at UCLA and AMD intern, gave a compelling presentation on how her lab is tackling one of AI’s biggest bottlenecks: the memory wall.
As models grow, training costs are skyrocketing, with compute demands increasing as much as 750x every two years, while memory capacity and bandwidth improvements lag far behind. Neha's team is flipping the script with software-driven data-efficiency methods, making training on AMD GPUs like the MI300X smarter, faster, and leaner.
🚧 The AI Memory Wall: A Hardware-Software Mismatch
Despite rapid model scaling, hardware improvements in memory bandwidth and interconnects haven't kept up. This gap, dubbed the memory wall, increasingly caps real-world GPU performance. While some address it with better hardware, UCLA's team is attacking the problem from the data side.
Their approach: optimize what goes into the model, not just how it’s trained.
🔁 Smart Data Selection: Learn, Focus, Review
Inspired by human learning techniques like spaced repetition, Neha's team developed an iterative method for selecting the most impactful data during training. Instead of blindly feeding massive datasets into the GPU, they cycle through three phases (sketched in code after the list):
- Learn: Sample broadly from the dataset.
- Focus: Identify hard examples based on loss trajectories and prioritize them.
- Review: Reintroduce easier examples to avoid forgetting.
💡 Result: Better models trained on fewer tokens, saving time and compute.
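To make the loop concrete, here's a minimal Python sketch of what a learn-focus-review selection policy could look like. The hardness score (recent average loss), the focus/review split, and every name in it are illustrative assumptions, not the team's published implementation:

```python
# Illustrative learn / focus / review selection policy (not the team's code).
# Assumption: "hardness" is approximated by an example's recent average loss,
# so examples whose loss stays high keep getting prioritized.
import random

def select_round(loss_history, pool_size, focus_frac=0.8, window=3):
    """Pick example IDs for the next training round.

    loss_history: {example_id: [loss at round 1, round 2, ...]}
    pool_size:    number of examples to train on this round
    focus_frac:   share of the pool reserved for hard examples
    window:       how many recent rounds to average when scoring hardness
    """
    scores = {
        ex_id: sum(losses[-window:]) / len(losses[-window:])
        for ex_id, losses in loss_history.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)

    # Focus: the hardest examples fill most of the pool.
    n_focus = int(pool_size * focus_frac)
    focus = ranked[:n_focus]

    # Review: mix a few easier examples back in to avoid forgetting.
    easy = ranked[n_focus:]
    review = random.sample(easy, min(pool_size - n_focus, len(easy)))
    return focus + review

# Toy run: "a" stays hard and lands in the focus set; "b" was learned
# quickly and becomes a review candidate.
history = {
    "a": [2.5, 2.4, 2.4],
    "b": [2.0, 1.2, 0.6],
    "c": [1.8, 1.5, 1.3],
}
print(select_round(history, pool_size=2))
```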
📉 Results: 130B Tokens > 1T Tokens
Using this method, they trained OpenLLaMA-style models with just 130B carefully selected tokens, outperforming baselines trained on 1 trillion tokens.
They also:
- Trained models from 300M to 7B parameters.
- Outperformed random sampling and static selection baselines.
- Open-sourced models and code on GitHub (link placeholder).
🧪 Domain Adaptation: Fine-Tuning, Smarter and Faster
When adapting pre-trained models to specialized domains like math, code, or medicine, Neha’s team used proxy models to identify and cluster impactful training samples. This:
- Cut fine-tuning time in half (a 2x speedup)
- Maintained or improved accuracy across both in-domain and out-of-domain tasks
Crucially, the cluster assignments computed with small proxy models transferred reliably to larger ones, enabling scalable, smart fine-tuning.
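As a rough illustration of the proxy-model idea, the sketch below embeds samples with a stand-in for a small proxy model, clusters them with k-means, and draws a selection budget proportionally from each cluster. The embedding stub, the k-means choice, and the allocation rule are all assumptions for illustration, not the team's actual pipeline:

```python
# Illustrative proxy-guided sample selection (names and heuristics assumed).
import numpy as np
from sklearn.cluster import KMeans

def proxy_embed(samples):
    # Stand-in for a forward pass through a small proxy model that would
    # return one feature vector per training sample.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(samples), 64))

def select_by_cluster(samples, budget, n_clusters=8):
    """Pick `budget` samples, spreading the picks across clusters."""
    X = proxy_embed(samples)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # Each cluster contributes in proportion to its share of the data;
        # a real selector would rank members by estimated impact instead
        # of taking the first few.
        take = max(1, round(budget * len(members) / len(samples)))
        chosen.extend(members[:take])
    return [samples[i] for i in chosen[:budget]]

corpus = [f"sample-{i}" for i in range(200)]
print(len(select_by_cluster(corpus, budget=40)))
```

Because the cluster structure transfers from small models to large ones, the expensive embedding and clustering step only has to run once on a cheap proxy.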
💻 Real-World Impact: FPGA Coding Assistant on AMD MI300X
Thanks to AMD’s HPC grant program, the team gained access to AMD Instinct™ MI300X GPUs for applied research. Their current project:
- Training an HLS (high-level synthesis) coding assistant to help developers optimize code for FPGAs
- Tackling a solution space with 3M+ possible design points
- Leveraging a dataset of 40,000+ circuit designs, built over a decade
This tool aims to reduce the friction of FPGA adoption by automating pragma selection and resource prediction.
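To get a feel for why the space is so large, consider a toy enumeration of per-loop pragma knobs. The knob names and value ranges below are hypothetical, but they show how a handful of options per loop multiply into millions of design points:

```python
# Hypothetical HLS pragma knobs; real tools expose many more options.
from itertools import product

knobs = {
    "unroll_factor":   [1, 2, 4, 8, 16, 32],
    "pipeline_ii":     [1, 2, 4, 8],
    "array_partition": ["none", "cyclic", "block", "complete"],
}

# One loop nest already yields 6 * 4 * 4 = 96 variants; a kernel with
# several independent loops multiplies these together, which is how
# real designs reach millions of candidate configurations.
per_loop = list(product(*knobs.values()))
print(len(per_loop), "configurations for one loop")
print(len(per_loop) ** 4, "configurations for four independent loops")
```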
🤝 Open to Collaboration
Neha wrapped her talk with a call for collaboration—especially from industry partners interested in scaling these innovations. As her team looks to bring data-efficient training to a broader set of use cases and hardware targets, the door is open.
📺 Watch the Full Talk
👉 Data-Efficient Training Methods with Neha Prakriya | Beyond CUDA Summit
⚡ Run AI Workloads on AMD GPUs
Explore data-efficient training and inference on TensorWave’s AMD-powered AI cloud featuring the MI300X and MI325X.
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
Ready to get started? Connect with a Sales Engineer.