Published: Jul 21, 2025

How to Run Multi-Node Training with Pyxis + AMD GPUs

Coordinating resource scheduling for multi-node training across AI research teams becomes increasingly difficult as you scale out.

  • Nodes must be scheduled topologically close together, with proper RDMA networking between GPUs, to optimize training bandwidth.
  • To make effective use of resources, teams often switch between small allocations while designing models and training runs and large allocations for production runs. Each time nodes are added to a training run, the RDMA networking may need to be readjusted.
  • Teams that share nodes have different dependencies for their projects, which can corrupt the host nodes if their applications are not containerized.

Slurm solves many of these issues, allowing teams to coordinate resource allocation for jobs, while containerizing these workloads is handled through SPANK plugins like Pyxis.

Slurm and Pyxis integration does not work out of the box and requires specific configuration changes to enable topology-aware scheduling. Further, since Pyxis is responsible for handling containerization of distributed workloads, the infrastructure needs heavy modification for Slurm/Pyxis to properly mount accelerator hardware, route RDMA and management network traffic, and manage containers and images.
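For reference, topology-aware scheduling in Slurm is typically enabled by selecting the tree topology plugin in slurm.conf and describing the switch layout in topology.conf. A minimal illustrative sketch (switch and node names are made up, not our cluster's actual configuration):

# slurm.conf: use the tree topology plugin
TopologyPlugin=topology/tree

# topology.conf: which nodes hang off which leaf switch
SwitchName=leaf1 Nodes=node[117-118]
SwitchName=spine1 Switches=leaf1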

Thankfully, Slurm and Pyxis integration is already solved by a few cloud providers like TensorWave, so researchers can focus on developing models.

Slurm + Pyxis on 3 Nodes

Let's say we have a 3-node cluster: 2 nodes for compute workloads and 1 head node to carve up jobs and distribute them across the resources.

We log into our console and check the resources as usual via Slurm's sinfo.

From there we can see which jobs are running and which are queued. Below we show that there's one job running on node 117, so we cancel it.
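On our cluster that boils down to standard Slurm commands; the job ID below is a placeholder:

# List running and pending jobs
squeue

# Cancel the job occupying node 117 by its job ID
scancel <jobid>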

Additionally, we can view the resources available to us in the cluster by querying with sinfo.
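One convenient way to do this is to ask sinfo for each node's state and GRES (GPU) column:

# Show each node's state and its configured GPUs (GRES)
sinfo -N -o "%N %t %G"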

So far, so good… we have 8 GPUs per node on 117 and 118. If you’ve used Slurm, this is all pretty basic. However, when it comes to running distributed workloads on Slurm, data syncing does not come out of the box. For example, if we were to download a model onto node 118, Slurm will not automatically pick it up and sync it to the other nodes running the job.

Thankfully, we have shared Weka storage on our machines. We simply make a file on the head node and the data automatically syncs to 117 and 118.

So for the remainder of our tutorial we will change into the shared work directory to run our distributed training jobs.
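Concretely, the shared-storage check and the directory change look something like this; the node name is a placeholder, and the shared path is reused from the TRL example later in this post, so adjust both for your cluster:

# On the head node: create a file in the shared Weka-backed directory
touch /opt/manifest/trl-example/hello.txt

# Verify it is visible from one of the compute nodes
srun -w <node-117-hostname> ls /opt/manifest/trl-example/

# Work from the shared directory for the rest of the tutorial
cd /opt/manifest/trl-example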

Interactive Shell in a Pyxis Container

We want to run a rocm/pytorch container via Slurm on one of the nodes. This container will allow us to install any dependency we want without corrupting the host file system. This is a good approach for designing runtime environments for your training jobs.
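A typical interactive launch looks roughly like the following; each flag is explained below, the GRES specifier is borrowed from the batch script later in this post, and --container-remap-root (also described below) can be appended when you need root inside the container:

srun -N 1 --gres=gpu:amd:8 --pty \
  --container-image=rocm/pytorch \
  --container-writable \
  --container-name=interactive-tut \
  --container-mounts=$(pwd):/src \
  --container-workdir=/src \
  bash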

As you can see, we've dropped into a shell inside the container on node 117. We added the following flags:

--container-writable This allows us to write to the container file system; the default is read-only.

--container-name=interactive-tut This lets Pyxis cache the container. The initial image download and container setup take a while; with the named container cached, subsequent starts are nearly instant rather than taking minutes.

--container-mounts=$(pwd):/src This mounts the current directory into the container at /src.

--container-workdir=/src This sets the working directory inside the container, in this case /src, where our scripts and work will live.

--container-remap-root This is an optional flag to add if you need to apt install system packages in your container.

Here we can check that we have access to our AMD GPUs.
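A quick sanity check from inside the container is rocm-smi; PyTorch's ROCm build should report the same GPU count through the torch.cuda API:

# List the AMD GPUs visible inside the container
rocm-smi

# PyTorch on ROCm exposes the GPUs through the torch.cuda API
python3 -c "import torch; print(torch.cuda.device_count())"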

We can also see that we have access to the host network and RDMA interfaces:
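One way to confirm this is to list the host network interfaces and, assuming the ibverbs utilities are available in the image, the RDMA devices:

# Host network interfaces are visible from inside the container
ip addr show

# List RDMA devices (requires the ibverbs utilities in the container)
ibv_devices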

Since we’ve re-ran our command with the --container-remap-root flag, we were able to apt install -y rocm-validation-suite and run rvs as a final validation that the GPU’s are functioning within the Pyxis container.

Distributed Training on Two Nodes

Now that we can properly use Pyxis to interact with containers via Slurm without messing up our compute hosts, let’s see how this applies to a small training job.

For this example we are using TRL to perform GRPO fine-tuning and will make use of a pre-built image from TensorWave. This example requires a shared storage drive between your head node and compute nodes. For us, that directory is: /opt/manifest/trl-example

In this directory we've created two files: run-accelerator.sh and run-me.sh.

Our file run-accelerator.sh is:

#!/bin/bash
# PMIX_RANK is set per task by Slurm's PMIx plugin; we use it as the machine rank.
echo "my rank: $PMIX_RANK"

# 2 machines x 8 GPUs each = 16 processes; node 117 (10.21.8.117) hosts the rendezvous.
accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
 --num_machines=2 \
 --num_processes=16 \
 --main_process_ip=10.21.8.117 \
 --main_process_port=1234 \
 --machine_rank=$PMIX_RANK \
 --rdzv_backend=c10d \
 src/open_r1/grpo.py \
 --config recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml \
 --use_vllm false \
 --push_to_hub false \
 --report_to none \
 --num_generations 16

Our file run-me.sh is:

#!/bin/bash

# Launch one task per node on 2 nodes, each with 8 AMD GPUs, using the PMIx plugin.
srun -N 2 --gres=gpu:amd:8 --mpi=pmix --ntasks-per-node=1 \
  --container-remap-root --container-name=trl-deepspeed \
  --container-mounts=$(pwd)/run-accelerator.sh:/home/root/open-r1/run-accelerator.sh \
  --container-image=tensorwavehq/training_r1 \
  --container-writable run-accelerator.sh

Since these two files live in the shared directory, they are automatically available on the two compute nodes that will run the training. We use a simple bash script to kick off the srun job so that jobs are easier to edit and rerun.
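From the head node, kicking the job off is just a matter of running the wrapper script from the shared directory:

# Run from the shared directory on the head node
cd /opt/manifest/trl-example
bash run-me.sh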

This particular job can take some time, but eventually we see the generation and training start up.

Conclusion

With the right infrastructure setup, deploying multi-node training with Pyxis can be a simple process.

Authors

Taylor Kaplan — Pyxis & Infrastructure Setup
https://www.linkedin.com/in/taylor-kaplan-a4732731

Nikhil Gupta — Preparation of the Docker Image and Training Example
https://www.linkedin.com/in/nikhil-gupta-cal

About TensorWave

TensorWave is the AMD GPU cloud purpose-built for performance. Powered exclusively by Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.

Ready to get started? Connect with a Sales Engineer.