Overcoming the GPU Shortage: AMD's MI300X and TensorWave's Solution for AI's Growing Compute Needs

Feb 28, 2024

In case you’ve been hiding in a cave lately: There’s a global shortage of GPUs.

The advent of large-language models (LLMs) has led to an insatiable appetite for GPUs as myriad companies race to develop and train LLMs and other large AI models, for internal use or as products for sale. The major cloud providers (Amazon, Microsoft, and Google) are deploying GPUs as part of their cloud service offerings, and enterprises and AI start-ups alike are contributing to large backorders of GPUs on a global scale.  As Elon Musk observed, “It seems like everyone and their dog is buying GPUs at this point.”

And unless your dog has billions of dollars to spend on GPUs, as Facebook parent Meta plans to do in 2024, you might be out of luck. Meta isn’t alone: Many large firms that can afford premium prices are hoovering up all the GPUs they can find.

NVIDIA’s flagship H100 GPUs are the most sought-after chips for LLM training thanks to their efficiency, which puts them squarely at the heart of the supply shortage: their limited availability is significantly hampering LLM development. That’s good news for NVIDIA, whose CEO Jensen Huang stated in the company’s Q4 2023 earnings call that he sees no decrease in demand for its GPUs through 2025 and beyond.

It’s not so nice for most other businesses. A recent survey of engineers by market analysts Guidepoint Global found that 88% were using NVIDIA GPUs for their AI projects, but 87% of those were unable to secure sufficient GPU capacity. To remain competitive, companies need equally effective alternatives, and one now exists: AMD's MI300X.

Background: Impact of the GPU Shortage on AI Development

As you might expect, the GPU shortage has put a crimp on the development of large AI models. With demand for computing power far outstripping supply, developers have to make do with fewer GPUs, which stretches out the time needed to train and test models. As a result, they cannot bring their AI products to market as fast as they anticipated. For startups, that can mean running out of funding before the product is ready to sell.

Even deep-pocketed enterprises may have to put their internal AI projects on hold if they can’t muster enough dedicated GPUs or lease them from cloud service providers.

Sam Altman, co-founder and CEO of OpenAI (the company behind ChatGPT), had this to say about the current GPU supply shortage: “We’re so short on GPUs the less people use our products the better… We’d love it if they use it less because we don’t have enough GPUs.” You know the situation is serious when CEOs want less business, not more.

Exploring Alternatives to NVIDIA GPUs

Developing alternatives to NVIDIA’s GPUs, however, requires fundamental research and development, which takes time and money. Meanwhile, numerous other providers, large and small, are building alternative GPU products and application-specific integrated circuits (ASICs) to fill the gap.

Some of these firms are household names, such as Microsoft, Amazon (in collaboration with Broadcom), and Qualcomm. Others are less well-known firms or startups, such as Cerebras, Groq, SambaNova, Syntiant, and Mythic.

One recent entrant in the AI accelerator fray is Advanced Micro Devices (AMD), better known as Intel’s main competitor in the PC processor market. In mid-2023, AMD introduced its Instinct MI300 line of AI accelerator chips as a direct competitor to NVIDIA’s offerings.

The Rising Stars: AMD’s Instinct MI300 Series

From an architectural standpoint, the Instinct MI300 series represents a radical departure from the traditional, single planar chip. The MI300 devices are among the first three-dimensional chips on the market, featuring a multi-layer stack with modular compute chiplets on top of input-output dies, all surrounded by a ring of high-bandwidth DRAM.

AMD is shipping two flavors of the Instinct MI300: the MI300A, which has six GPU chiplets and three CPU chiplets (24 cores total), and the MI300X, which has eight GPU chiplets and no CPU chiplets. The memory space in the MI300A is shared between the CPU and GPU chiplets, which improves energy efficiency by eliminating copies between separate CPU and GPU memories.

The result of the MI300 architecture is a significant increase in performance compared with NVIDIA’s H100 products. AMD claims the following performance specifications:

  • High memory capacity: 128 GB (MI300A) and 192 GB (MI300X), a 2.4-fold increase over the 80 GB NVIDIA H100
  • Memory bandwidth: 5.3 TB/s, 1.6 times that of the NVIDIA H100

In a product announcement event in San Francisco in June 2023, AMD CEO Lisa Su said that a "single MI300X can run models up to approximately 80 billion parameters" in memory. This is significant because keeping an entire model on one accelerator reduces the I/O bottleneck, increasing both speed and energy efficiency.
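To make that figure concrete, here is a rough back-of-the-envelope sketch in Python. It assumes 16-bit weights (2 bytes per parameter) and ignores activations, KV cache, and other overhead, so treat it as an approximation rather than a sizing guide:

```python
import math

# Back-of-the-envelope check: can a single 192 GB accelerator hold the
# weights of an ~80-billion-parameter model? Assumes 16-bit (2-byte)
# weights and ignores activations, KV cache, and framework overhead.

BYTES_PER_PARAM_FP16 = 2      # assumption: weights stored as fp16/bf16
MI300X_MEMORY_GB = 192        # MI300X capacity quoted above
H100_MEMORY_GB = 80           # NVIDIA H100 (80 GB variant), for comparison

def weight_footprint_gb(num_params: float) -> float:
    """Approximate memory needed for the model weights alone, in GB."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9

for params in (40e9, 65e9, 80e9, 175e9):
    gb = weight_footprint_gb(params)
    print(f"{params / 1e9:>4.0f}B params -> ~{gb:.0f} GB of weights | "
          f"fits on one MI300X: {gb <= MI300X_MEMORY_GB} | "
          f"H100s needed for weights alone: {math.ceil(gb / H100_MEMORY_GB)}")
```

At 2 bytes per parameter, an 80-billion-parameter model needs roughly 160 GB for its weights alone, which is why it can sit entirely on a single 192 GB MI300X but would have to be split across multiple 80 GB H100s.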

Real-World GPU Usage in AI Model Training

Until recently, AI models were deployed to address narrow, specific problems, such as identifying certain objects in photographs, interpreting handwriting, and generating recommendations on streaming services. That all changed when LLMs burst onto the scene with their ability to interact with users in natural language, carrying on conversations and generating text and visual content. LLMs have rapidly become the “killer app” that had previously eluded the AI space.

An AI model’s “size” is measured in part by the number of parameters it contains. By this measure, the largest AI models are LLMs, such as Falcon-40B (40 billion parameters), Meta’s LLaMA (several versions ranging from 7 billion to 65 billion parameters), Google’s LaMDA (137 billion parameters), and OpenAI’s GPT-3 (175 billion parameters) and GPT-4 (estimated 1 trillion parameters).

More parameters mean more computing resources are required to train the models. In Su's words, "The generative AI, large language models have changed the landscape. The need for more compute is growing exponentially, whether you're talking about training or inference." For example, it is estimated that complete training of GPT-3 would take eight days using NVIDIA’s Eos supercomputer, which sports over 10,000 H100 GPUs. A more reasonably sized cluster of 512 GPUs would take roughly four months.
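To put those two data points on the same footing, here is a simple scaling sketch in Python. It assumes perfectly linear scaling from the Eos figure above, which real clusters never achieve, so the smaller-cluster numbers are optimistic lower bounds:

```python
# Rough scaling sketch: if roughly 10,000 H100s can train GPT-3 in about
# 8 days, how long would smaller clusters take under ideal (linear)
# scaling? Real-world efficiency losses make the actual times longer.

REFERENCE_GPUS = 10_000   # approximate Eos GPU count cited above
REFERENCE_DAYS = 8        # estimated full GPT-3 training time on Eos

def ideal_training_days(num_gpus: int) -> float:
    """Training time assuming throughput scales linearly with GPU count."""
    return REFERENCE_DAYS * REFERENCE_GPUS / num_gpus

for gpus in (10_000, 2_048, 512, 64):
    print(f"{gpus:>6} GPUs -> ~{ideal_training_days(gpus):,.0f} days")
```

Even under ideal scaling, 512 GPUs comes out to roughly 150 days, the same order of magnitude as the estimate above, and 64 GPUs stretches past three years. That arithmetic is why large GPU allocations are so fiercely contested.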

How TensorWave Addresses the GPU Compute Constraint

TensorWave is building a hyperscale AI compute cloud based on award-winning composable infrastructure technology, including a novel advanced memory fabric architecture and AMD’s Instinct MI300X GPUs.

According to Su, “What's important about this is it will actually make it easier for developers and AI startups to get access to MI300X GPUs as soon as possible with a proven set of providers.” Darrick Horton, co-founder and CEO of TensorWave, echoed Su's sentiments, stating, "We are eager to leverage the AMD Instinct MI300X accelerator, as it not only offers leadership performance, but it represents our strategic alignment with AMD moving forward. AMD has shown a commitment to open standards and a history of innovation that we are proud to be a part of."

Partnering with TensorWave to power your AI training requirements provides numerous benefits, including:

  • Access to thousands of best-in-class GPU accelerators
  • Easy scaling from 8 to 80 GPUs on a single node without coding, eliminating the need to distribute workloads across nodes
  • Native support for the PyTorch, JAX, and TensorFlow frameworks: “It just works” (see the brief sketch after this list)
  • A 50% boost in effective bandwidth and a 90% reduction in latency compared to other GPU cloud offerings
  • And best of all, a lower total cost of ownership (TCO)
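As a small illustration of the “it just works” point, here is a minimal sketch of ordinary PyTorch training code. The model and tensor sizes are hypothetical placeholders; the point is that PyTorch's ROCm build exposes AMD accelerators such as the MI300X through the familiar torch.cuda interface, so CUDA-style code runs without modification:

```python
import torch
import torch.nn as nn

# On a ROCm build of PyTorch, AMD GPUs are surfaced through the familiar
# torch.cuda API, so ordinary CUDA-style code runs without modification.
device = "cuda" if torch.cuda.is_available() else "cpu"
name = torch.cuda.get_device_name(0) if device == "cuda" else "CPU"
print(f"Training on: {name}")

# Hypothetical toy model and data, just to show the unchanged workflow.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 1024, device=device)
labels = torch.randint(0, 10, (32,), device=device)

for step in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```

A similar story holds for the ROCm builds of JAX and TensorFlow, which likewise pick up AMD accelerators without changes to model code.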

Book a free TensorWave demo today to test the performance and ease of use for yourself.

Conclusion

With the emergence of LLMs and their application as potential business tools, demand for compute resources for model training is far outstripping supply. NVIDIA, the GPU market leader so far, is unable to meet this demand, and even cloud providers with thousands of GPUs at their disposal can’t keep up.

Without alternative solutions, progress in AI development and deployment is slowing to a crawl.

If your business has big plans for leveraging generative AI but is struggling to secure the computing resources to support them, contact TensorWave today to learn how a partnership with us can help you meet your goals.