Published: Apr 08, 2025
ScalarLM: Open-Source LLM Training & Inference on AMD ROCm

At the 2025 Beyond CUDA Summit, Gregory Diamos, co-founder of ScalarLM, delivered a forward-looking talk that hit right at the intersection of performance, scaling, and open-source AI infrastructure—running on AMD’s ROCm stack and deployed on a TensorWave cluster.
Here’s the recap, distilled for builders:
🚀 Introducing ScalarLM: A Unified Stack for LLM Training + Inference
Over the past three years, Gregory has been hands-on building LLM software stacks on AMD ROCm. ScalarLM is the culmination of that work: a fully open-source framework that combines Megatron-Core (training), the Hugging Face model library, and the vLLM inference engine, all optimized for AMD MI300X GPUs. It runs out of the box on ROCm with no custom builds or workarounds.
Live on TensorWave: ScalarLM is actively training a 70B parameter model on an 8x MI300X server—and it’s built to scale to 400B+.
🧠 Beyond Scaling Laws: Toward Faster-Than-Scaling Growth
Diamos goes beyond traditional scaling laws. His thesis: while scaling laws have driven model progress (more compute = better models), the future is systems that scale faster than the laws themselves.
Modern AMD GPUs are starting to show that added memory and capacity can unlock new algorithmic growth curves.
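For context, the scaling-law relationship referenced here is typically written as a power law; the form below is a generic illustration (the constants are placeholders, not figures from the talk or from the 2017 paper):

```latex
% Generic power-law form of a neural scaling law (illustrative only):
% generalization error falls predictably as a power of dataset size N (or, analogously, compute C).
\epsilon(N) \approx \alpha \, N^{-\beta} + \epsilon_{\infty}
% \alpha and \beta are task-dependent constants fit empirically; \epsilon_{\infty} is the irreducible error.
% "Scaling faster than scaling laws" means bending this curve: improving faster than the fitted exponent \beta predicts.
```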
🛠 Superalignment & DeepSeek R1: What’s Next
He introduces a next-generation modeling approach he calls “superalignment,” where inference happens before training. Think: the model reasons through a question, generates “reasoning tokens”, then backpropagates based on whether its answer was correct. It’s a feedback-looped architecture designed to improve faster than scaling laws alone.
These hybrid workflows require frameworks like ScalarLM that unify inference + training as a single, integrated loop.
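To make the loop concrete, here is a minimal sketch of the inference-then-train pattern in plain PyTorch and Hugging Face code. It is illustrative only and does not use ScalarLM’s actual API; the model name and the correctness check are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def is_correct(completion: str, reference: str) -> bool:
    # Placeholder correctness check; a real system would use a verifier or reward model.
    return reference.strip() in completion

def superalignment_step(question: str, reference: str) -> None:
    # 1) Inference first: let the model "think", emitting reasoning tokens and an answer.
    prompt = f"Question: {question}\nThink step by step, then answer.\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    completion = tokenizer.decode(
        generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

    # 2) Training second: if the reasoning trajectory reached the right answer,
    #    backpropagate on the prompt plus the generated completion.
    if is_correct(completion, reference):
        batch = tokenizer(prompt + completion, return_tensors="pt")
        loss = model(**batch, labels=batch.input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In ScalarLM the same pattern is meant to run as a service, with the training stack and the inference engine sharing one deployment rather than a single in-process model.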
🔄 Fully Open Source, CC0 Licensed, and Commercial-Ready
ScalarLM is open-sourced under CC0. That’s no strings, no paywalls, no license headaches. It’s ready to drop into your stack—whether you’re building agents, RAG pipelines, autopilot systems, or high-throughput drug discovery pipelines.
Version 0.5 is live; a performance-optimized release is coming soon.
⚙️ Real-Time Looping for RLHF, Agents, and RAG
ScalarLM isn’t just static infra. You can run training/inference loops in real time: generate a reasoning trajectory, update the weights, and serve the updated model, all on the same ROCm-powered stack. It works with OpenAI-style inference APIs and supports modern agent-style architectures (think Llama agents, memory-augmented tools, and more).
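Because the inference side speaks an OpenAI-style API, a standard client can point at a ScalarLM deployment; in the sketch below, the base URL, API key, and model name are placeholders for your own cluster:

```python
# Sketch of calling a ScalarLM/vLLM deployment through its OpenAI-compatible API.
# The base_url, api_key, and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-scalarlm-host:8000/v1",  # placeholder endpoint
    api_key="not-needed-for-local-deployments",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",     # placeholder model name
    messages=[{"role": "user", "content": "Plan the next step for this agent task."}],
)
print(response.choices[0].message.content)
```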
🎯 TL;DR
- ScalarLM: Unified LLM framework (train + infer) optimized for AMD ROCm
- Runs today on TensorWave clusters with MI300X
- Built for superalignment + faster-than-scaling growth
- CC0 license, no lock-in, fully transparent
- The future of LLM infra is open, fast, and ready to loop
🔁 LLMs that learn during inference. Frameworks that train in the loop. That’s ScalarLM.
Got feedback? Contributions are welcome on GitHub. And if you’re ready to get hands-on with AMD-powered AI, TensorWave’s MI300X clusters are ready to run.
💬 Full Transcript
Introduction to ScalarLM & AMD ROCm Stack
Thanks everyone for powering through all of these. So I'm going to tell you about ScalarLM today. Over the last three years I've spent a lot of time building multiple iterations of software stacks on top of AMD ROCm, and I want to share a complete stack for training and inference for LLMs. So if you want to know what it takes to run a complete modern application on top of an AMD ROCm cluster, this is running on top of a
Open Source Repo & TensorWave Integration
TensorWave cluster today. You can check out this repo; it's fully open source and fully commercially available, so you can give it a try whenever you want, read through the source code, and see exactly how this works. Beyond the existing applications, we've also started looking at what comes beyond LLMs and beyond the current generation of models. So I'm going to talk about that a little bit, tell you about some of the design decisions that went into this framework, and then tell you a little bit about the
Embracing Scaling Laws for Smarter Models
framework. Okay, so my belief for a long time, based on scaling laws, has been that by training these models on larger amounts of data, they're going to keep getting smarter. This was our paper from Baidu in 2017 where we introduced scaling laws, and these have been the driver of all of the progress that we've seen today. This is the NeurIPS keynote slide about OpenAI's progress in scaling laws. The idea is that by following a simple recipe of making the computers faster, the models keep getting
Beyond LLMs: The Path to Faster Growth
smarter, and they can keep solving additional tasks. And my belief is that the future beyond scaling laws is actually scaling faster than scaling laws. You'll probably notice that the modern generation of GPUs, especially the AMD GPUs, have started to add additional capacity. So my belief is that systems with additional capacity will be able to take advantage of much larger neural networks that can scale and learn even faster than scaling laws. So this is an example of what an LLM can do today. This was written by Claude 3.5. This is a full React app, a maps app that was generated from data. It has tabular data, it has drop-down menus, it has four different panels, and this was rendered entirely by a web browser running this 2,000-line React app. So this is an example of what LLMs can do today.
LLM Capabilities Today: Example React App
But if we extrapolate forward, whether we scale faster or just follow scaling laws for the next 10 years, we should be able to solve much harder problems that have a much higher level of complexity. And these are going to take us from these medium-value use cases to very high-value engineering use cases: things like writing complete system software, developing new drugs, building entire database pipelines (not just 100-line queries but several-thousand-line data pipelines), and building AI applications themselves, like autopilots. So my belief is these things are going to be enabled,
Future Use Cases: High-Value Engineering Solutions
and in order to do that we're actually going to need a new type of algorithm. The new algorithms that are going to scale faster than scaling laws are going to be kind of like DeepSeek R1. The previous generation of models was heavily oriented around chat: they would look at what the user typed in and then complete the sentence. They were autoregressive models that would complete the sentence. DeepSeek R1 is a step toward superalignment, which is what I'm calling a new kind of model that can learn faster than scaling laws. These models do reasoning. When the user asks a question, the model, before it answers, will first spend some time doing inference. It will think about the answer; it'll generate reasoning tokens where it tries to figure out the right way of answering the question, and then it answers the question. So from a
DeepSeek R1 & Super Alignment: Next-Gen Algorithms
software infrastructure perspective, these models are starting to require not just training for foundation models and not just inference for evaluating our models. They're starting to do inference first, to think about what the right thing to do is, and then, based on what the correct answer actually is, whether the model got it right or not, they're doing backpropagation. So they're doing training. We're seeing emerging algorithms that scale faster start to take advantage of both training and inference, and that's why we built ScalarLM, which is the first machine learning framework that includes high-performance training and inference together in a single framework. It uses the training stack from Megatron-Core, which normally doesn't run very easily on AMD but has been ported to AMD; it uses the model library from Hugging Face; and it uses the inference engine from vLLM, and it unifies them all together into a single framework that runs out of the box on ROCm. So this is an example of a snapshot on TensorWave, on an 8x MI300X GPU server that's training a 70 billion
Introducing Scalar LM: Training & Inference Unified
parameter model using a fraction of the memory, so it can fit about a 400 billion parameter model. This can train seriously big models today. This is just a comparison of how ScalarLM fits into the ecosystem. The biggest thing is that it has both training and inference together; we don't think that this exists even in the Nvidia ecosystem. And secondly, it has mainline ROCm support, so out of the box you don't have to have a separate build. You just build the normal containers and it'll run on MI300. Okay, this is the Llama Stack.
Harnessing Llama Stack: Agents, RAG, & Reinforcement
This is Meta's perspective on what you can build on top of this layer, so it's just another way of thinking about how this fits into the ecosystem. These are some of the types of capabilities you can build on top of a framework like this: you can build agents, obviously; you can build inference; you can build post-training; you can build evals; you can connect it to memory-augmented systems like RAG systems. So this layer really sits at the lower layer, providing a train endpoint, the ability to train models. As soon as models are trained, they're immediately deployed. So
Full Open Source Launch & Call for Community Feedback
if you've taken a reasoning trajectory and you've generated a reasoning example, you can immediately send that to backprop, and backprop will produce a new model which can immediately be served from the inference engine. So you can write loops that call training and inference in a loop, for example to perform reinforcement learning. We can do training, and we can also do inference using the normal OpenAI client interfaces for inference. Okay, and this is just giving you a little more detail of how the different pieces are put together under the hood. As I mentioned, the entire thing is completely open source. This is a CC0 license, so it should be available for absolutely any commercial use you can think of. This is our v0.5 release, so it's fully functional; you can run it today, as I showed, running on an MI300 cluster. We're going to do a performance-oriented release in a few months, and we're very interested in community feedback. Thanks.
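As a rough illustration of the train-then-serve loop described above, the sketch below submits a successful reasoning trajectory for training and waits for the updated model, using the OpenAI-compatible interface for inference. The endpoint paths, payload fields, and model name are assumptions for illustration, not ScalarLM's documented API.

```python
# Hypothetical sketch of the loop described above: generate a reasoning trajectory,
# send it back for training, then wait for the new checkpoint to become servable.
# Endpoint paths, payload fields, and the model name are illustrative assumptions.
import time
import requests
from openai import OpenAI

BASE_URL = "http://your-scalarlm-host:8000"           # placeholder deployment
inference = OpenAI(base_url=f"{BASE_URL}/v1", api_key="unused")

def reinforcement_step(question: str, reference_answer: str, model: str) -> str:
    # 1) Inference: let the current model produce a reasoning trajectory.
    trajectory = inference.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # 2) If the trajectory reached the right answer, submit it as a training example
    #    (hypothetical /train endpoint).
    if reference_answer in trajectory:
        job = requests.post(f"{BASE_URL}/train", json={
            "examples": [{"prompt": question, "completion": trajectory}],
        }).json()

        # 3) Poll until training finishes; the new checkpoint can then be served immediately.
        while requests.get(f"{BASE_URL}/train/{job['id']}").json()["status"] != "done":
            time.sleep(30)

    return trajectory
```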
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
Ready to get started? Connect with a Sales Engineer.