Using MI300X for Real-Time Inference Applications
Aug 21, 2024

When we talk about AI GPUs, the main use case we have in mind is AI model training. After all, training requires significant computing resources because of the sheer volume of data to process and calculations to perform. The more GPUs you can throw at the problem (assuming they are designed to work together in parallel), the faster your model is trained and the sooner it can be tested.
But what happens when your model is trained, tested, and ready to deploy for solving actual business problems?
When an AI model is applied to do something useful, that activity is called inference. The model takes data it has never seen before and applies what it “learned” to produce (ideally) the right answer. It turns out that inference also requires significant compute resources, although in most cases not as much as training does.
The main reason we built the TensorWave AI cloud platform around the AMD MI300X GPU is that the MI300X accelerates AI training very well. But it’s also a strong fit for real-time inference, the use case we discuss below.
Training vs. Inference
Most AI models these days take the form of an artificial neural network (ANN). Various types of ANNs exist, but most of those in use today are called “deep learning” ANNs. This means that the “neurons” are arranged in multiple layers: an input layer that takes in the data to be processed, an output layer that presents the answer, and one or more “hidden layers” in between. Each neuron in a layer has a connection with the neurons in the next layer, and each connection has a numeric weight value associated with it.
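As a rough illustration of that structure, here is a minimal sketch assuming a PyTorch-style framework; the layer sizes are arbitrary and chosen only for the example:

```python
import torch
import torch.nn as nn

# A minimal feed-forward "deep learning" network: each nn.Linear layer holds
# the numeric weights for the connections between one layer of neurons and the next.
model = nn.Sequential(
    nn.Linear(784, 128),  # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(128, 64),   # first hidden layer -> second hidden layer
    nn.ReLU(),
    nn.Linear(64, 10),    # second hidden layer -> output layer (10 possible answers)
)
```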
To train a model, large amounts of “known” data (such as labeled images of objects) are fed into it, and a training algorithm gradually adjusts the weight value of every neuron-to-neuron connection until the model produces the right answers for that input data. In this way, the system “learns” the patterns of input data that produce the right answers.
One reason training is so computationally intensive is that it is iterative. The same type of calculation must be performed over and over again, and the whole process can take days, depending on the size of the model, the amount of training data, and the number of GPUs available.
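A minimal sketch of that iterative loop, again assuming PyTorch and using random stand-in data in place of a real training set, might look like this:

```python
import torch
import torch.nn as nn

# Stand-in model and random "training data" purely for illustration.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
inputs = torch.randn(64, 784)          # a batch of 64 fake samples
labels = torch.randint(0, 10, (64,))   # their (fake) correct answers

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# The iterative part: forward pass, measure the error, adjust the weights, repeat.
for step in range(1000):
    optimizer.zero_grad()
    outputs = model(inputs)            # forward pass through all layers
    loss = loss_fn(outputs, labels)    # how wrong were the answers?
    loss.backward()                    # compute weight adjustments (backpropagation)
    optimizer.step()                   # apply the adjustments
```

A real training run repeats this over many batches and many passes through the data, which is where the days of GPU time go.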
Inference, by comparison, is straightforward: Take one set of data (for example, a single image) that the model has not seen in its training, run it through the model, and get an answer. Does the new data contain a pattern close to what it was trained to recognize, or doesn’t it?
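In code, that amounts to a single forward pass with the trained weights left untouched; a minimal sketch, again assuming PyTorch and a stand-in model, looks like this:

```python
import torch
import torch.nn as nn

# Stand-in for a model whose weights have already been trained.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()                       # switch to inference mode

new_sample = torch.randn(1, 784)   # one input the model has never seen

with torch.no_grad():              # no weight updates during inference
    scores = model(new_sample)
    prediction = scores.argmax(dim=1)

print(prediction.item())           # the model's answer for this one input
```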
Inference Hardware Requirements
And yet, inference can still require robust hardware resources. In particular, you need the extra horsepower for real-time AI applications, such as:
- Online meeting transcriptions
- Real-time translation of spoken language
- Industrial process monitoring and alerts
- Cybersecurity monitoring
- Detection of fraudulent financial transactions
- Generating text and images
In these applications, you need answers immediately, not 30 minutes or a day from now.
For small models trained to perform narrow, specific tasks, it’s possible to perform inference on a PC, laptop, or mobile device. To process data from distributed sources, such as internet-of-things (IoT) sensors, you might be able to perform inference on “edge computers,” which are placed close to the IoT devices to optimize network bandwidth.
However, for the large language models (LLMs) and other gigantic AI models that are in vogue these days, you need an array of servers, either in a local data center or in the cloud, for real-time inference.
Hardware used for inference needs high throughput and low latency, particularly when the data to be processed arrives in a steady stream, such as audio for transcription or IoT sensor readings from an industrial process.
A couple of years ago, before LLMs burst onto the scene and changed everything, the common thinking was that training and inference were best performed on separate processing chips, each optimized for its own task. GPU manufacturers have since shown, however, that their GPU products handle inference quite well, in particular inference involving LLMs.
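To make the throughput and latency requirement concrete, here is a minimal sketch of a streaming inference loop, assuming PyTorch and a hypothetical incoming_stream() generator standing in for a real audio or sensor feed; each chunk is processed as it arrives and its per-request latency is measured:

```python
import time
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the trained model loaded onto the GPU.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

def incoming_stream(n_chunks=100):
    """Hypothetical stand-in for a steady stream of audio frames or sensor readings."""
    for _ in range(n_chunks):
        yield torch.randn(1, 784)

# Process each chunk as it arrives and track how long each answer takes.
with torch.no_grad():
    for chunk in incoming_stream():
        start = time.perf_counter()
        answer = model(chunk).argmax(dim=1)
        latency_ms = (time.perf_counter() - start) * 1000
        print(f"answer={answer.item()} latency={latency_ms:.2f} ms")
```

In a real-time application, that per-request latency budget is exactly what the inference hardware has to meet, chunk after chunk, without falling behind the stream.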
MI300X Inference Performance
How well does AMD’s MI300X GPU perform on inference tasks? Both AMD’s in-house testing and various independent evaluations comparing the MI300X with NVIDIA’s flagship H100 GPU on LLMs such as Llama 2 70B found superior performance with the MI300X, in some cases by a wide margin. The better performance can be attributed in part to the MI300X’s much larger memory capacity (192 GB vs. 80 GB for the H100) and its higher memory bandwidth.
TensorWave’s Inference Support
As the leading AI cloud provider using MI300X hardware, we at TensorWave designed our platform to support both training and real-time inference. If your AI aspirations include large models or LLMs, and you prefer not to purchase and maintain your own GPUs, you need a cloud-based solution for both workloads. Our consultants can help you leverage the hardware platform and AMD’s ROCm software to get the best out of each use case.
About TensorWave
TensorWave is a cutting-edge cloud platform designed specifically for AI workloads. Offering AMD MI300X accelerators and a best-in-class inference engine, TensorWave is a top choice for training, fine-tuning, and inference. Visit tensorwave.com to learn more.