PostHeaderSection

Symbol

Global NavBar

PostContentSection

Global Footer

Blog Detail Page

Staff Writer

Today we are excited to announce the release ScalarLM v0.5.

ScalarLM is a fully open source, CC-0 Licensed (unrestricted commercial use), integrated LLM inference and training platform.

We created ScalarLM to simplify the development of reinforcement learning agents with advanced reasoning and memory capabilities, similar to those of DeepSeek R1. By integrating inference and training engines into a single platform, ScalarLM enables the seamless generation and utilization of reasoning trajector

Introducing ScalarLM v0.5: Unifying LLM Inference and Training for RL Agents

Today we are excited to announce the release ScalarLM v0.5.

ScalarLM is a fully open source, CC-0 Licensed (unrestricted commercial use), integrated LLM inference and training platform.

We created ScalarLM to simplify the development of reinforcement learning agents with advanced reasoning and memory capabilities, similar to those of DeepSeek R1. By integrating inference and training engines into a single platform, ScalarLM enables the seamless generation and utilization of reasoning trajectories for training updates, streamlining the development process.

 * Contact us for help deploying ScalarLM on your GPU supercomputer: Contact
 * Access the source code at: Github
 * Read the docs at https://docs.scalarlm.com
 * Join the community on Discord

ScalarLM builds on top of the vLLM inference engine, the Megatron-LM training framework, and the HuggingFace model hub. It unifies the capabilities of these tools into a single platform, enabling users to easily perform LLM inference and training, and build higher level applications such as LLM-Agents.

ScalarLM is designed for high performance, with out of the box support for AMD ROCm GPUs and NVIDIA CUDA GPUs. It includes development mode support for x86 CPUs and ARM CPUs. ScalarLM is particularly helpful for AMD GPU platforms, which currently lack a robust, performant, open-source training stack. It inherits the distributed training capabilities of Megatron-LM and the optimized inference engine of vLLM. ScalarLM is also designed to be easy to use. It provides an OpenAI-compatible server and a simple command line interface for users to interact with the platform. ScalarLM is inspired by the work of Seymour Roger Cray, an American electrical engineer and supercomputer architect who designed a series of computers that were the fastest in the world for decades, and founded Cray Research, which built many of these machines. Called “the father of supercomputing”, Cray has been credited with creating the supercomputer industry.

We hope that ScalarLM will help to democratize access to large language models, and enable more people to build and deploy AI systems that push the boundaries of accuracy to benefit society. We are excited to see what you will build with ScalarLM!

If you are interested in contributing to ScalarLM, please reach out to us at Contact.


v0.5 - Functionality and Feedback

ScalarLM v0.5 is a fully functional prototype that demonstrates the core capabilities of the platform. It is intended to show the potential of the platform and gather feedback from the community. We are actively seeking feedback on the platform, including suggestions for new features, bug reports, and ideas for how to improve the platform.We expect to release v1.0 in the coming months, which will include additional features, performance optimizations, benchmarks, and bug fixes. We are also planning to release a series of blog posts and tutorials that will help users get started with ScalarLM and build their own LLMs with advanced reasoning capabilities.


Lessons Learned

Over the last ten years, we have built seven generations of training and inference stacks at Baidu, Apple, AI Fund, MLCommons, several startups including Lamini, and multiple enterprise partners. We have learned a lot about what works and what doesn’t. ScalarLM is a clean slate rewrite of a unified inference and training stack incorporating the lessons learned. Here are some of the key lessons:

 * Training and Inference are both essential, but most frameworks focus on one or the other.
 * SLURM and Kubernetes have different worldviews, but pieces of both are useful.
 * A model exchange format has not yet emerged, and the best we have is the huggingface model hub.
 * The main dependency is the GPU, and it is possible to achieve a simpler and more secure design by cutting out the rest.

These infrastructure issues are so difficult that many organizations focus on either training or inference, leaving significant accuracy on the table. Our vision for ScalarLM is to enable organizations to build LLM systems that leverage backpropagation in addition to in-context learning, just like humans, enabling them to solve more challenging problems.


Unified Training and Inference

Training is essential because it enables models to learn. Training creates flywheels. A machine learning model without training could never be updated and could not learn.

Inference is essential for models to have practical value. Inference is used when a model is deployed in production to serve predictions to users.

ScalarLM unifies training and inference into a single platform. This enables users to simultaneously train and deploy models using the same GPU cluster, and to easily switch between the two modes without changing their code.

ScalarLM builds on pytorch models stored in the huggingface model hub, allowing it to support training most models that are supported by huggingface. It is possible to train new models from scratch, or to continue training from existing models.

A ScalarLM cluster will spin up inference and training workers as needed. Training jobs are submitted to the SLURM queue and pulled out by the training workers as batch jobs. Inference requests are submitted to a persistent queue and pulled out by the inference workers as needed. As checkpoints are produced by the training workers, they are picked up by the inference workers and new inference requests can immediately use the new model.

ScalarLM follows the API design of the Llama Stack, which is a set of APIs designed to facilitate simple deployment of AI applications on top of open source LLMs. ScalarLM is designed to be compatible with the Llama Stack, and to work well with benchmark suites like AILuminate.


SLURM and Kubernetes are Better Together

Many software developers are familiar with Kubernetes for cloud and enterprise apps. So the natural temptation is to use Kubernetes as the job queue, e.g. with the MPI Operator to get the network to work. This leaves network performance on the table and forces the user to set up the RDMA network stack manually.

Kubernetes without SLURM can work, but it forces the infra team to reinvent the wheel.

Some HPC clusters run SLURM on bare metal. This also works and simplifies the network configuration, but deployments without containers create version and dependency management hell.

In ScalarLM, we run a virtual SLURM cluster inside of a kubernetes application, providing the best of both worlds. We are closely monitoring development on the SLINKY project and expect to fold in better approaches as they become available.

Let’s look at SLURM and Kubernetes in more detail.

SLURM is a batch job scheduler that is widely used in high performance computing environments. It is designed to run large numbers of jobs on a cluster of machines with a high performance network. The high performance network is essential for training large models, because it allows the GPUs to communicate with each other quickly using RDMA primitive operations that are composed into collective operations like allreduce and allgather. Superficially it seems like SLURM only handles job scheduling, but it also sets up the MPI environment that brings up the high performance network that distributed training frameworks like Megatron-LM tap into.

Kubernetes is a container orchestration system that is widely used in cloud computing environments. It handles the problem of deploying containers onto a cluster of machines. Containers are essential to manage the complex software dependencies. If you want to upgrade the software on the cluster, you can just upgrade the container image and restart the container. Kubernetes allows you to do this without affecting the other containers running on the cluster.

Putting them together may seem difficult, and indeed, there are many subtle details that need to be taken care of including how SLURM daemons manage hostnames and handle cgroups which make conflicting assumptions with the lower layers of kubernetes networking, containerd, and accelerator management. However, SLURM is essentially a set of lightweight C programs that coordinate over TCP sockets. They tap into RDMA interfaces exposed by linux network drivers. It is possible to entirely encapsulate these SLURM daemons in containers, and run those containers on Kubernetes. ScalarLM does this.

When the ScalarLM worker containers start, they have access to a virtual SLURM network and control plane. We build the orchestration, distributed training framework, and collectives on top of these interfaces.

So a user can helm install ScalarLM and get a full functional SLURM cluster.


Huggingface is the model exchange format

New models come up almost daily, and it is important to be able to train and deploy them with minimal porting effort.

The huggingface model hub is the closest thing we have to a model exchange format. The transformers library has pytorch source code implementations of hundreds of classes of models. Even though most LLMs are based on transformers, there is no standard implementation of a transformer. Different models might choose different position embeddings or layer shapes. One of the biggest points of friction between training and inference teams in large companies is resolving these differences. Craylm modifies Megatron-LM to use the huggingface model hub as the source of truth for model implementations, which makes it possible to immediately load most trained models directly into the inference engine.


Security by cutting out everything except the GPU

Multiple previous generations of ScalarLM used a variety of cloud services including parallel distributed object storage such as Azure Blob or GCS, database services like BigQuery, cloud load balancers, etc. These lead to significant complexity and reduced portability between supercomputers, which are often optimized for efficiency and have different cloud stacks. The current generation of ScalarLM requires GPU compute nodes, a high performance interconnect, and nothing else.

This allows ScalarLMto be deployed in a secure enclave, with no external dependencies.

Several design issues had to be addressed to make this possible.


Training State Versioning without a Database or Object Store

Training jobs need to produce a model that is passed to the inference engine, and state needs to be tracked between the training and inference workers. We uniquely identify each model by a hash of the training data, the model code, and the hyperparameters. This state is saved to a shared filesystem that is accessible by both the training and inference workers. The training system picks up this training state and launches a SLURM job to train the model. As it trains, it produces checkpoints that are saved to the shared filesystem. The inference system picks up these checkpoints as they are made available. This enables fault tolerance, because if a training worker fails, the training system can restart from a checkpoint. It also avoids the need for a database to track the state of the training jobs.


Massive Inference without Cloud Queues or Load Balancers

LLMs are big and expensive, even on GPUs. Generating thousands of tokens can take minutes. A user who submits a few requests from their laptop in a loop can quickly overwhelm a GPU cluster that costs millions of dollars. Submitting these requests as HTTP REST requests leads to load imbalance and timeouts. Submitting too few requests leads to underutilization. vLLM uses continuous batching to handle load imbalance, but that can still lead to overwhelming the workers. ScalarLM pushes inference requests into a persistent queue, which can be very large in size. The requests are pulled out by vLLM workers as they have capacity. The user client polls to get the results from the queue. This allows efficient and lossless processing of very high numbers of inference requests, common to inference pipelines.


Datasets without an Object Store

Training requires training data. Typically this requires uploading a training dataset. The natural place to put it is in a cloud object store. It initially seems hard to get around this. But how hard is it really?

How big is each data item that an LLM is trained on? A few hundred tokens.

How many bytes does that take up? Let’s say one byte per token compressed, so a few hundred bytes per item.

How many training items are there for a dataset like Alpaca? 52K.

So how big is that? 52K (Q&A pairs) * ~300 (bytes per pair) = 15.6 MB

How long does it take to upload 15.6MB over a cloud internet connection, e.g. 1 Gbs? = 0.124 seconds

How long does it take to train a 70B parameter model on that data on an H100 at 40% MFU? = 6.5 hours

So the upload perf isn’t nearly as big of a practical problem as it seems.

We built a fast file uploader in ScalarLM that can handle 10s to 100s of GBs of data over common cloud connections. It hashes the data to avoid uploading duplicates. No need to rely solely on object store.

Our main insight is as follows. If you are running ScalarLM, you built a supercomputer capable of training an LLM. That computer can handle moving around a few GB, so could most laptops today. An object store managing thousands of parallel disks is not always needed.


Future

Our target for v1.0 release is to include performance optimizations and kernel level benchmarks on modern GPUs including MI300X and H200.

We plan to keep ScalarLM up to date with the latest advances in LLM research, including new models, new training stack optimizations, and new inference stack optimizations.

We are excited to build a new generation of LLM supercomputers that democratize access to training language models. We are in particular interested in applications that build on top of ScalarLM to support LLM agents that train and deploy themselves.


Contributing

ScalarLM is an open source project developed by a small team at TensorWave.

 * Greg Diamos, PhD, is a computer architect and deep learning researcher. He has worked at NVIDIA.
 * Sudnya Diamos is a 3x startup founding engineer, and former NVIDIA GPU architect.
 * Naila Farooqui, PhD, is a deep learning researcher. She is the co-author of NVIDIA Cutlass.
 * Suhabe Bugrara, PhD, is a 3x CTO.

Contact Us to explore opportunities to work with us.

We welcome community contributions. Some areas that could use help:

 1. Help completing the Llama Stack client
 2. Running AILuminate evals
 3. Performance optimizations in the training and inference engines
 4. New deployment targets, benchmarks, and regression tests
 5. Hosted ScalarLM clusters on your cloud
 6. A refactor of the unified vLLM build that makes it easier to accept upstream changes
 7. Integration with SLINKY


Connect

Connect With Us

Introducing ScalarLM v0.5: Unifying LLM Inference and Training for RL Agents

$350M Series B Announcement

Product

Solutions

Resources

Company

Product

Resources

Solutions

Company