AMD’s MI300X Outperforms NVIDIA’s H100 for LLM Inference
Jun 12, 2024

There has been much anticipation around AMD's flagship MI300X accelerator. With unmatched raw specs, the pressing question remains: Can it outperform NVIDIA's Hopper architecture in real-world AI workloads? We have some exciting early results to share.
For the past month, TensorWave and MK1 have worked closely to unlock the performance of AMD hardware for AI inference. To start, we focused on Mixture of Experts (MoE) architectures due to their compute efficiency and popularity: MoE models are notably used by Mistral, Meta, Databricks, and X.ai for their most powerful open-source LLMs.
The initial results are impressive: using MK1's inference software, the MI300X achieves 33% higher throughput compared to the H100 SXM running vLLM on Mixtral 8x7B for a real-world chat use case. Despite NVIDIA’s software ecosystem being more mature, it is clear that AMD is already a formidable competitor in the AI market. When hardware availability and cost are factored in, the MI300X proves to be an attractive option for enterprises running large-scale inference in the cloud.
We expect AMD’s performance advantage to climb even higher after further optimization, so stay tuned for more updates!
-Darrick Horton, CEO TensorWave
We invite you to experience the MI300X firsthand on TensorWave, where it comes prepackaged with MK1’s inference software. Contact us to find out more.
Inference Benchmarks
We conducted extensive offline and online inference tests comparing the MI300X and H100 SXM5 accelerators using the Mixtral 8x7B model.
- Offline Tests: These are standardized and provide insights into the performance of the forward pass across different setups.
- Online Tests: These are more sophisticated and estimate system performance in a real-world setting where multiple users are serviced asynchronously.
Benchmark Setup
AMD
- Hardware: TensorWave node equipped with 8 MI300X accelerators, 2 AMD EPYC CPU Processors (192 cores), and 2.3 TB of DDR5 RAM.
- MI300X Accelerator: 192 GB VRAM, 5.3 TB/s memory bandwidth, ~1300 TFLOPS FP16
- Drivers: ROCm 6.1.2
- Inference Stack: MK1’s inference engine (Flywheel) v0.9.2 and AMD’s ROCm-optimized fork of vLLM (rocm/vllm) v0.4.0.
- Configuration: Tensor parallelism set to 1 (tp=1), since the entire Mixtral 8x7B model fits within a single MI300X’s 192 GB of VRAM.
NVIDIA
- Hardware: Baremetal node with 8 H100 SXM5 accelerators with NVLink, 160 CPU cores, and 1.2 TB of DDR5 RAM.
- H100 SXM5 Accelerator: 80 GB VRAM, 3.35 TB/s memory bandwidth, ~986 TFLOPS FP16
- Drivers: CUDA 12.2
- Inference Stack: vLLM v0.4.3
- Configuration: Tensor parallelism set to 2 (tp=2), which is required to fit Mixtral 8x7B across two H100s with 80 GB of VRAM each.
Notes
- All benchmarks are performed using the Mixtral 8x7B model.
- All inference frameworks are configured to use FP16 compute paths. Enabling FP8 compute is left for future work.
- To make a fair comparison between systems with different tensor-parallelism settings, we scale the measured MI300X (tp=1) throughput by a factor of 2 so that both configurations reflect two accelerators (see the sketch below).
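As a concrete illustration, the snippet below sketches how the two tensor-parallel configurations could be set up with vLLM’s Python API and how the tp=1 MI300X throughput is normalized for comparison. This is not the exact harness used for our benchmarks: the model name and dtype are assumptions, and MK1’s Flywheel engine has its own interface that is not shown here.

```python
# Minimal sketch (not the authors' actual harness): configuring vLLM with the
# tensor-parallel settings described above. Model name and dtype are assumptions;
# MK1 Flywheel has its own proprietary API and is not shown.
from vllm import LLM


def build_engine(tensor_parallel_size: int) -> LLM:
    # tp=1 on the MI300X node (the model fits in 192 GB); tp=2 on the H100 node
    # (sharded across two 80 GB accelerators). FP16 compute path in both cases.
    return LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        tensor_parallel_size=tensor_parallel_size,
        dtype="float16",
    )


# To compare equal accelerator counts, the measured tp=1 MI300X throughput is
# doubled (two independent replicas) before comparing against a tp=2 H100 pair.
def normalized_mi300x_throughput(measured_tok_per_s: float) -> float:
    return 2 * measured_tok_per_s
```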
Offline Results
To measure peak throughput for each inference solution, we generate prompts of a fixed size and feed them directly to the model. This method, known as offline batching, improves hardware efficiency by processing multiple prompts simultaneously. Although larger batch sizes boost throughput, they also increase latency because more requests are in flight. Following standard practice, all requests in a batch share the same input size and the same output size.
We assess the throughput of each system as a function of batch size, using a modified version of the `benchmark_throughput.py` script from the vLLM repository, refactored to include Flywheel as a backend. Prompts within a batch are randomly generated to rule out any caching effects. The performance metrics are detailed in the table below.

Notably, our results show that the MI300X running MK1 Flywheel outperforms the H100 running vLLM at every batch size, with performance gains ranging from 1.22x to 2.94x.
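For readers who want to reproduce a similar offline measurement, here is a minimal sketch of the measurement loop against the vLLM 0.4.x `generate()` API: random fixed-length prompts are batched and total tokens per second is recorded for each batch size. The input/output lengths, batch sizes, and model name are illustrative placeholders, not the exact values from our runs.

```python
# Rough sketch of an offline throughput sweep (not the modified
# benchmark_throughput.py itself). Lengths and batch sizes are assumptions.
import random
import time

from vllm import LLM, SamplingParams

INPUT_LEN, OUTPUT_LEN = 1024, 128  # assumed fixed sizes per request

# tp=1 on the MI300X node; use tensor_parallel_size=2 on the H100 node.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=1,
    dtype="float16",
)
vocab_size = llm.get_tokenizer().vocab_size

for batch_size in (1, 4, 16, 64, 256):
    # Random token ids per prompt defeat any caching between requests.
    prompt_token_ids = [
        [random.randrange(vocab_size) for _ in range(INPUT_LEN)]
        for _ in range(batch_size)
    ]
    params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

    start = time.perf_counter()
    llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=params)
    elapsed = time.perf_counter() - start

    total_tokens = batch_size * (INPUT_LEN + OUTPUT_LEN)
    print(f"batch={batch_size:4d}  {total_tokens / elapsed:,.0f} tok/s")
```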
Online Results for Chat Data Distribution
Moving beyond offline metrics, we designed a series of online benchmarks to simulate a typical real-world chat application. This involves generating responses to user inputs that closely mirror actual usage patterns.
Specifically, we simulate chat traffic by spawning independent workers to send requests to an endpoint. We then sweep the number of workers to increase the number of concurrent requests.
In these experiments, requests were generated using a standard text chat distribution with an average of 573 input tokens and 50 output tokens. Note that our benchmarking tool supports arbitrary data distributions; please reach out if you have a specific use case you’d like to test.
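A simplified version of such a load generator is sketched below: independent asyncio workers post chat requests to an OpenAI-compatible endpoint, and the worker count is swept to raise concurrency. The endpoint URL, payload fields, durations, and sweep values are illustrative assumptions, not our actual benchmarking tool.

```python
# Hedged sketch of an online load generator: independent async workers send
# requests to a chat endpoint; the worker count is swept to raise concurrency.
import asyncio
import time

import aiohttp

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical server
PAYLOAD = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "..."}],  # sampled from the chat distribution (~573 input tokens)
    "max_tokens": 50,                                  # ~50 output tokens on average
}


async def worker(session: aiohttp.ClientSession, duration_s: float, latencies: list[float]) -> None:
    # Each worker sends requests back-to-back until the run duration elapses.
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        start = time.perf_counter()
        async with session.post(ENDPOINT, json=PAYLOAD) as resp:
            await resp.json()
        latencies.append(time.perf_counter() - start)


async def run(num_workers: int, duration_s: float = 120.0) -> None:
    latencies: list[float] = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(worker(session, duration_s, latencies) for _ in range(num_workers)))
    rps = len(latencies) / duration_s
    avg_latency = sum(latencies) / max(len(latencies), 1)
    print(f"workers={num_workers:3d}  {rps:6.2f} req/s  avg latency {avg_latency:5.2f} s")


if __name__ == "__main__":
    for n in (1, 4, 16, 64):  # sweep the number of concurrent workers
        asyncio.run(run(n))
```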
The key metrics of interest are:
- Throughput (Requests per Second): The number of requests the system can handle per second for a given workload.
- Average Latency (Seconds): The average time taken to generate a full response for each request.
- Time Per Output Token (TPOT): The average time to generate each token after the first, which determines how quickly long responses are produced.
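For clarity, the sketch below shows one way these three metrics can be derived from per-request timing records; the field names and structure are illustrative rather than taken from our tooling.

```python
# Illustrative sketch of computing throughput, average latency, and TPOT from
# per-request timestamps collected during a streaming run.
from dataclasses import dataclass


@dataclass
class RequestTiming:
    sent_at: float          # when the request was issued
    first_token_at: float   # when the first output token arrived
    finished_at: float      # when the full response completed
    output_tokens: int      # number of generated tokens


def summarize(timings: list[RequestTiming], wall_time_s: float) -> dict[str, float]:
    throughput = len(timings) / wall_time_s  # requests per second
    avg_latency = sum(t.finished_at - t.sent_at for t in timings) / len(timings)
    # TPOT: time per token after the first one, averaged over requests.
    tpot = sum(
        (t.finished_at - t.first_token_at) / max(t.output_tokens - 1, 1)
        for t in timings
    ) / len(timings)
    return {"req_per_s": throughput, "avg_latency_s": avg_latency, "tpot_s": tpot}
```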
For the first benchmark, we tested a non-streaming use case where throughput and latency are measured for servicing the full response.

At a target average latency of 5 seconds, two MI300Xs with tp=1 service 33% more requests per second than two H100s with tp=2. This means you can serve the same number of users at a similar quality of service using fewer accelerators!
For the second benchmark, we enable streaming and measure throughput and TPOT for individual tokens as they are streamed out.

Here we observe that the MI300X delivers higher throughput than the H100 at every TPOT level. This means the MI300X can generate text faster at higher traffic volumes, which is crucial for any LLM application.
Conclusion
Our benchmarks demonstrate that AMD's MI300X outperforms NVIDIA's H100 in both offline and online inference tasks for MoE architectures like Mixtral 8x7B. The MI300X not only offers higher throughput but also excels in real-world scenarios requiring fast response times.
Given its impressive performance, competitive cost, and hardware availability, the MI300X with MK1 software is an excellent choice for enterprises looking to scale their AI inference capabilities. We encourage you to explore the capabilities of MI300X at TensorWave and experience these benefits first-hand. Contact us to learn more and schedule a test drive of this powerful accelerator!
