405 Billion Parameters. 1 Node.

Jul 23, 2024

At TensorWave, we are thrilled about the release of Llama 3.1 405B, an impressive leap forward by Meta. This model outperforms GPT4o on multiple evaluations, marking the first time an open-source model can effectively compete with proprietary state-of-the-art models.

However, big models require a significant amount of computational resources for inference. Typically, this would necessitate multiple nodes or severe quantization. Fortunately, our AMD MI300X accelerators have an enormous amount of memory, allowing us to run the model on a single node without quantization, in FP8 on just four accelerators, or in FP8 across all eight accelerators for faster performance (details below).

Let’s break down the math to understand the requirements.

FP16 Precision

The AMD MI300X is designed with large language models (LLMs) in mind. Each GPU is equipped with 192 GB of HBM3 memory, providing a combined 1,536 GB (1.5 TB) of memory across an 8-GPU server node.

Running a 405 billion parameter model requires substantial memory. At FP16 precision, each parameter occupies 2 bytes of memory. The total memory requirement for the model parameters and activations (including a modest KV cache) can be approximated as follows:

Total Memory Required = parameters x bytes per parameter x 1.25 (activations)

For 405B parameters at FP16:

Total Memory Required = 405,000,000,000 x 2 x 1.25
Total Memory Required = 1,012,500,000,000 bytes
Total Memory Required = 1,012.5 GB

This requirement is well within the 1,536 GB available in an 8x GPU node. With 192 GB of memory per chip, the model's footprint spans just over 5 GPUs, leaving plenty of memory to spare for other operations.
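As a sanity check, here is a minimal Python sketch of the same arithmetic, using the 1.25x rule-of-thumb multiplier for activations and a modest KV cache described above:

# Rough estimate of inference memory for a 405B-parameter model at FP16.
PARAMS = 405e9          # model parameters
BYTES_PER_PARAM = 2     # FP16 = 2 bytes per parameter
OVERHEAD = 1.25         # activations + modest KV cache (rule of thumb)

total_gb = PARAMS * BYTES_PER_PARAM * OVERHEAD / 1e9
print(f"FP16 total: {total_gb:.1f} GB")                    # ~1012.5 GB

MI300X_GB = 192         # HBM3 per MI300X
print(f"MI300X GPUs needed: {total_gb / MI300X_GB:.2f}")   # ~5.27 GPUs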

FP8 Precision

FP8, an 8-bit floating-point format, uses just 1 byte per parameter, significantly reducing the memory requirements for models. The total memory requirement can be calculated as follows:

For 405B parameters at FP8:

Total Memory Required = 405,000,000,000 x 1 x 1.25
Total Memory Required = 506,250,000,000 bytes
Total Memory Required = 506.25 GB

This reduced memory footprint means that a single node can support two replicas of the model with tensor parallelism of 4, providing the flexibility to optimize the node for either latency (1xTP8) or throughput (2xTP4). At TP8, we also have over a terabyte of memory to spare, which can be used for additional optimizations such as a larger KV cache.
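The same back-of-the-envelope math, extended to FP8 and the two layouts mentioned above, might look like the sketch below; the per-GPU figures are estimates, not measured allocations:

# FP8 memory estimate and per-GPU footprint for two layouts on an
# 8x MI300X node (1,536 GB total). Estimates only, not measured values.
PARAMS = 405e9
OVERHEAD = 1.25
NODE_GB = 8 * 192

fp8_gb = PARAMS * 1 * OVERHEAD / 1e9      # FP8 = 1 byte per parameter
print(f"FP8 total: {fp8_gb:.2f} GB")      # ~506.25 GB

# One replica across all 8 GPUs (TP8): lowest latency, most headroom.
print(f"TP8 per GPU: {fp8_gb / 8:.1f} GB, spare on node: {NODE_GB - fp8_gb:.1f} GB")

# Two replicas, each across 4 GPUs (2xTP4): higher aggregate throughput.
print(f"TP4 per GPU: {fp8_gb / 4:.1f} GB, spare on node: {NODE_GB - 2 * fp8_gb:.1f} GB")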

NVIDIA H100

When running these numbers on NVIDIA’s flagship chip, we see a different story. Each H100 accelerator comes with 80 GB of VRAM, less than half of the MI300X. This means a single 8x H100 node has a combined memory capacity of just 640 GB.

This falls far short of the 1,012.5 GB required for FP16 and only barely exceeds the 506.25 GB required for FP8, leaving no room for additional replicas.
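Plugging the H100’s 80 GB per GPU into the same rule-of-thumb estimate (not a benchmark) shows where the node runs out of room:

# Compare 8x H100 node capacity against the FP16/FP8 estimates above.
H100_NODE_GB = 8 * 80                   # 640 GB per 8x H100 node
FP16_GB, FP8_GB = 1012.5, 506.25

print(f"FP16 fits on 8x H100: {FP16_GB <= H100_NODE_GB}")      # False
print(f"FP8 fits on 8x H100:  {FP8_GB <= H100_NODE_GB}")       # True, ~134 GB to spare
print(f"FP8 replicas per node: {int(H100_NODE_GB // FP8_GB)}") # 1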

Conclusion

TensorWave’s MI300X accelerators demonstrate superior memory capacity, enabling the hosting of the 405 billion-parameter Llama 3.1 model on a single node. In contrast, NVIDIA’s H100 falls short with FP16 and is limited to quantized models.

These findings highlight the critical role of memory capacity in running large language models for inference. As TensorWave continues to innovate and support advancements in the open source community, we are committed to providing cutting-edge solutions for AI and machine learning applications.

Stay tuned for more updates and insights from TensorWave.

About TensorWave and MK1

TensorWave
TensorWave is a cutting-edge cloud platform designed specifically for AI workloads. Offering AMD MI300X accelerators and a best-in-class inference engine, TensorWave is a top choice for training, fine-tuning, and inference. Visit tensorwave.com to learn more.

MK1
Engines for the AI Economy. Visit us online at mk1.ai