Empowering AI: A Detailed Comparison of AMD Instinct MI300X and NVIDIA H100 GPUs for Large-Scale Clusters
Jul 19, 2024

Graphics processing units (GPUs), once a niche product for doing the number crunching required for high-performance video rendering, have found a new purpose in the last few years: doing the number crunching required to develop and train artificial intelligence (AI) models. In particular, GPUs are well suited to a matrix arithmetic operation called “multiply-accumulate” (MAC), along with other floating-point operations, which are executed in enormous numbers during model training.
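To make the MAC operation concrete, here is a minimal Python sketch of the multiply-accumulate loop at the heart of a plain matrix multiply. It is purely illustrative: a GPU performs vast numbers of these operations in parallel in dedicated hardware, not in a Python loop.

```python
# Illustrative only: the multiply-accumulate (MAC) inside a naive matrix multiply.
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n) using explicit MACs."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-accumulate
            c[i][j] = acc
    return c

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```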
The explosive recent growth of AI research and development, particularly in the area of large language models (LLMs), has driven soaring demand for advanced GPUs. The newest GPUs, in fact, have little or nothing to do with graphics; they are designed specifically for AI work.
The current go-to provider of GPUs, NVIDIA, has a long history of developing graphics accelerators and related hardware, so adding a line of AI-focused GPUs was not a great leap. Its flagship AI GPU, the H100, is in such high demand that customers must wait a year or more for their orders to be filled.
Meanwhile, Advanced Micro Devices (AMD), better known as a competitor to Intel in the PC and server CPU market, has introduced its own GPU product line, called Instinct. The Instinct MI300X, introduced in late 2023, is causing a stir in the AI development community.
In this article, we look at these offerings from NVIDIA and AMD and how they compare.
Technical Specifications Comparison
Architecture
The H100 and MI300X have quite different architectures. The H100 is implemented on a single large (814 square millimeters) chip of silicon, with all the components in the same plane. This architecture is the same tried-and-true approach used in almost all integrated circuits. The advantage is that the manufacturing process is mature, although the large size pushes the limits of what can be manufactured using standard processes.
The MI300X, in contrast, is assembled as a three-dimensional stack: eight separate GPU integrated circuits, surrounded by high-bandwidth memory, form one layer, which is placed on top of a layer of input-output circuitry. This approach packs more transistors into a smaller area with shorter distances between the computing modules and memory. However, the manufacturing process is much newer and more complex: the layers must line up with nanometer precision for the device to work.
Memory
The H100 comes with 80 GB of GPU memory, whereas the MI300X has 192 GB. Memory bandwidth, the speed at which the chip can move data between memory and the computing modules and an important contributor to overall performance, is also greater for the MI300X (5.2 TB/s vs. 3.35 TB/s).
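To illustrate why memory capacity and bandwidth matter so much for LLM work, here is a rough back-of-envelope sketch in Python. It assumes, purely for illustration, a 70-billion-parameter model stored in 16-bit precision whose weights must be streamed from GPU memory once per generated token; real token rates depend on many other factors.

```python
# Back-of-envelope sketch (illustrative assumptions, not a benchmark):
# if a decoder-style LLM must stream all of its weights from GPU memory
# once per generated token, memory bandwidth caps the token rate.

def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_tb_per_s):
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return (bandwidth_tb_per_s * 1e12) / weight_bytes

# A 70B-parameter model in 16-bit (2-byte) precision holds ~140 GB of weights,
# which also exceeds a single H100's 80 GB and would have to be split across GPUs.
for name, bw in [("MI300X", 5.2), ("H100", 3.35)]:
    print(f"{name}: ~{max_tokens_per_second(70, 2, bw):.0f} tokens/s upper bound")
```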
Performance Benchmarks
Overview of Performance Claims
Upon introducing the MI300X, AMD claimed that it is 20% faster than the H100 in single-GPU setups and 60% faster when deployed as an eight-GPU cluster. NVIDIA responded that AMD did not use a true apples-to-apples comparison to make these claims, suggesting that the H100 does, in fact, outperform the MI300X. The war of words escalated from there as each side accused the other of tipping the scales in some way.
At this writing, independent comparisons are not yet available, so all we have are the performance claims each side has published, without knowing the exact environments in which those numbers were produced.
Inference Performance
In any case, AMD claims a 20% advantage over the H100 in inference performance (that is, using a trained AI model to perform tasks) on the Llama 2 LLM with 13 billion parameters.
Floating-Point Operations (FLOPS)
Another important performance measure is floating-point operations per second (FLOPS), which can vary depending on the precision level of the numbers being calculated. AI training often requires lower precision than, say, running simulations to predict the weather. For eight-bit floating-point precision (known as FP8), AMD claims 2,614.9 trillion FLOPS (TFLOPS) for the MI300X vs. 1,978.9 TFLOPS for the H100.
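As a simple illustration of what these numbers mean, the sketch below converts each vendor's claimed FP8 figure into the time needed for a fixed amount of floating-point work, under the unrealistic assumption that the claimed rate is fully sustained.

```python
# Illustrative arithmetic only: time to perform a fixed amount of FP8 work
# at each vendor's claimed throughput, assuming 100% utilization.

CLAIMED_FP8_TFLOPS = {"MI300X": 2614.9, "H100": 1978.9}  # vendor-claimed figures

work_flops = 1e18  # one exaFLOP of work, chosen arbitrarily for the example

for gpu, tflops in CLAIMED_FP8_TFLOPS.items():
    seconds = work_flops / (tflops * 1e12)
    print(f"{gpu}: ~{seconds:.0f} s per exaFLOP of FP8 work at the claimed rate")
```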
Latency
Latency—essentially a measure of how long it takes the system to generate a response to a given input—is an important factor in overall system speed. Here, too, AMD claims a 40% advantage over the H100 in inference latency on Llama 2 with 70 billion parameters. The higher memory bandwidth of the MI300X has a strong influence on this performance metric.
Practical Considerations for Businesses
If your business aims to build a large-scale GPU environment for AI work, how do you choose the GPU? How well do these comparisons help you make a decision?
As mentioned, most of the claims from both AMD and NVIDIA should be taken with a grain of salt. It's also important to remember that some of each side's numbers may be theoretical peaks, which are almost always higher than what can be achieved in real life. Another variable to consider is the software stack used in the performance tests: AMD's software stack is optimized for its own hardware and therefore differs from NVIDIA's optimized stack. Your mileage, as they say, may vary.
Raw computing performance, by whatever measure or benchmark, is only one factor to consider. Another important one is total cost of ownership (TCO)—how much does each GPU cost to purchase and operate?
A major factor here is power consumption. GPUs are among the most power-hungry computing components on the market. However, you can’t judge by wattage alone (the H100 consumes 700 W to the MI300X’s 750 W); it’s more important to compare performance per watt to see how much energy will be needed for a given task.
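As a rough illustration, the sketch below divides each GPU's claimed FP8 throughput by its board power to get a performance-per-watt figure; real-world efficiency depends on how much of that throughput your workload actually sustains.

```python
# Rough performance-per-watt comparison using the figures quoted above
# (claimed FP8 TFLOPS and board power); illustrative only.

specs = {
    "MI300X": {"fp8_tflops": 2614.9, "watts": 750},
    "H100":   {"fp8_tflops": 1978.9, "watts": 700},
}

for gpu, s in specs.items():
    print(f"{gpu}: {s['fp8_tflops'] / s['watts']:.2f} claimed FP8 TFLOPS per watt")
```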
In the end, you need to choose on the basis of how well the hardware performs on your specific tasks, how much you’re willing to spend, and how long you’re willing to wait to receive and deploy the hardware.
The Case for AMD MI300X
At TensorWave, we believe the AMD MI300X GPU offers distinct performance advantages over NVIDIA's H100 in memory capacity and bandwidth, both of which are crucial for large-scale AI workloads such as LLM training and inference. Furthermore, the significant performance increase enabled by its unique architecture, with only a small increase in power consumption, makes the MI300X more efficient overall, leading to a lower TCO.
Contact TensorWave for a Proof-of-Concept Demo
The best way to determine which platform is right for you is to try it with tasks that are relevant to the problems you need to solve. TensorWave is deploying MI300X GPUs in a scalable, easy-to-use cloud-based environment, which gives you an opportunity to “try before you buy.”
For more information and to discuss your specific requirements, contact TensorWave today.
Conclusion
It’s still early days for the MI300X, and most performance claims and counterclaims from both AMD and NVIDIA should be viewed for what they are: high-level marketing. That said, the MI300X does have a clear advantage over the H100 in memory, memory bandwidth, and power efficiency.
The choice of GPU platform should be based on a comprehensive evaluation that includes both the technical specifications and performance against your particular requirements in a real-world setting, whether for AI applications or general-purpose high-performance computing. TensorWave’s cloud-based MI300X environment can help with this evaluation.
About TensorWave
TensorWave is a cutting-edge cloud platform designed specifically for AI workloads. Offering AMD MI300X accelerators and a best-in-class inference engine, TensorWave is a top choice for training, fine-tuning, and inference. Visit tensorwave.com to learn more.