Published: Apr 22, 2025
Data-Efficient Training on AMD GPUs

At the Beyond CUDA Summit, Neha Prakriya, PhD student at UCLA and AMD intern, gave a compelling presentation on how her lab is tackling one of AI’s biggest bottlenecks: the memory wall.
As models grow, training costs are skyrocketing, with compute demands increasing as much as 750x every two years, while memory capacity and bandwidth improvements lag far behind. Neha's team is flipping the script with software-driven data-efficiency methods, making training on AMD GPUs like the MI300X smarter, faster, and leaner.
🚧 The AI Memory Wall: A Hardware-Software Mismatch
Despite rapid model scaling, hardware improvements in memory bandwidth and interconnects haven't kept up. This gap, dubbed the memory wall, increasingly caps real-world GPU performance. While some address it with better hardware, UCLA's team is attacking the problem from the data side.
Their approach: optimize what goes into the model, not just how it’s trained.
🔁 Smart Data Selection: Learn, Focus, Review
Inspired by human learning techniques like spaced repetition, Neha's team developed an iterative method for selecting the most impactful data during training. Instead of blindly feeding massive datasets into the GPU, they cycle through three phases (sketched in code after the list):
- Learn: Sample broadly from the dataset.
- Focus: Identify hard examples based on loss trajectories and prioritize them.
- Review: Reintroduce easier examples to avoid forgetting.
💡 Result: Better models trained on fewer tokens, saving time and compute.
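To make the loop concrete, here's a minimal Python sketch of what a learn-focus-review selection policy could look like. The hardness score (recent average loss), the focus/review split, and every name in it are illustrative assumptions, not the team's published implementation:

```python
# Illustrative learn / focus / review selection policy (not the team's code).
# Assumption: "hardness" is approximated by an example's recent average loss,
# so examples whose loss stays high keep getting prioritized.
import random

def select_round(loss_history, pool_size, focus_frac=0.8, window=3):
    """Pick example IDs for the next training round.

    loss_history: {example_id: [loss at round 1, round 2, ...]}
    pool_size:    number of examples to train on this round
    focus_frac:   share of the pool reserved for hard examples
    window:       how many recent rounds to average when scoring hardness
    """
    scores = {
        ex_id: sum(losses[-window:]) / len(losses[-window:])
        for ex_id, losses in loss_history.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)

    # Focus: the hardest examples fill most of the pool.
    n_focus = int(pool_size * focus_frac)
    focus = ranked[:n_focus]

    # Review: mix a few easier examples back in to avoid forgetting.
    easy = ranked[n_focus:]
    review = random.sample(easy, min(pool_size - n_focus, len(easy)))
    return focus + review

# Toy run: "a" stays hard and lands in the focus set; "b" was learned
# quickly and becomes a review candidate.
history = {
    "a": [2.5, 2.4, 2.4],
    "b": [2.0, 1.2, 0.6],
    "c": [1.8, 1.5, 1.3],
}
print(select_round(history, pool_size=2))
```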
📉 Results: 130B Tokens > 1T Tokens
Using this method, they trained OpenLLaMA-style models with just 130B carefully selected tokens, outperforming baselines trained on 1 trillion tokens.
They also:
- Trained models from 300M to 7B parameters.
- Outperformed random sampling and static selection baselines.
- Open-sourced models and code on GitHub (link placeholder).
🧪 Domain Adaptation: Fine-Tuning, Smarter and Faster
When adapting pre-trained models to specialized domains like math, code, or medicine, Neha’s team used proxy models to identify and cluster impactful training samples. This:
- Cut fine-tuning time in half (a 2x speedup)
- Maintained or improved accuracy across both in-domain and out-of-domain tasks
Crucially, the cluster assignments computed with small proxy models transferred reliably to larger ones, enabling scalable, smart fine-tuning.
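As a rough illustration of the proxy-model idea, the sketch below embeds samples with a stand-in for a small proxy model, clusters them with k-means, and draws a selection budget proportionally from each cluster. The embedding stub, the k-means choice, and the allocation rule are all assumptions for illustration, not the team's actual pipeline:

```python
# Illustrative proxy-guided sample selection (names and heuristics assumed).
import numpy as np
from sklearn.cluster import KMeans

def proxy_embed(samples):
    # Stand-in for a forward pass through a small proxy model that would
    # return one feature vector per training sample.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(samples), 64))

def select_by_cluster(samples, budget, n_clusters=8):
    """Pick `budget` samples, spreading the picks across clusters."""
    X = proxy_embed(samples)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # Each cluster contributes in proportion to its share of the data;
        # a real selector would rank members by estimated impact instead
        # of taking the first few.
        take = max(1, round(budget * len(members) / len(samples)))
        chosen.extend(members[:take])
    return [samples[i] for i in chosen[:budget]]

corpus = [f"sample-{i}" for i in range(200)]
print(len(select_by_cluster(corpus, budget=40)))
```

Because the cluster structure transfers from small models to large ones, the expensive embedding and clustering step only has to run once on a cheap proxy.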
💻 Real-World Impact: FPGA Coding Assistant on AMD MI300X
Thanks to AMD’s HPC grant program, the team gained access to AMD Instinct™ MI300X GPUs for applied research. Their current project:
- Training an HLS (high-level synthesis) coding assistant to help developers optimize code for FPGAs
- Tackling a solution space with 3M+ possible design points
- Leveraging a dataset of 40,000+ circuit designs, built over a decade
This tool aims to reduce the friction of FPGA adoption by automating pragma selection and resource prediction.
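To get a feel for why the space is so large, consider a toy enumeration of per-loop pragma knobs. The knob names and value ranges below are hypothetical, but they show how a handful of options per loop multiply into millions of design points:

```python
# Hypothetical HLS pragma knobs; real tools expose many more options.
from itertools import product

knobs = {
    "unroll_factor":   [1, 2, 4, 8, 16, 32],
    "pipeline_ii":     [1, 2, 4, 8],
    "array_partition": ["none", "cyclic", "block", "complete"],
}

# One loop nest already yields 6 * 4 * 4 = 96 variants; a kernel with
# several independent loops multiplies these together, which is how
# real designs reach millions of candidate configurations.
per_loop = list(product(*knobs.values()))
print(len(per_loop), "configurations for one loop")
print(len(per_loop) ** 4, "configurations for four independent loops")
```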
🤝 Open to Collaboration
Neha wrapped her talk with a call for collaboration—especially from industry partners interested in scaling these innovations. As her team looks to bring data-efficient training to a broader set of use cases and hardware targets, the door is open.
📺 Watch the Full Talk
👉 Data-Efficient Training Methods with Neha Prakriya | Beyond CUDA Summit
⚡ Run AI Workloads on AMD GPUs
Explore data-efficient training and inference on TensorWave’s AMD-powered AI cloud featuring the MI300X and MI325X.
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
Ready to get started? Connect with a Sales Engineer.