A Comprehensive Guide: Switching from CUDA to ROCm
Mar 19, 2024

CUDA and ROCm are two frameworks that implement general-purpose computing on graphics processing units (GPGPU). What are the differences between these two systems, and why would an organization choose one over the other?
GPGPU basics
The graphics processing unit (GPU) takes on the complex work of rendering graphics to the screen, which frees up the central processing unit (CPU) to run software. The presence of this second processor in a computer provides a ready-made architecture for parallel computing.
Parallel processing creates procedural difficulties. If two parallel processes never regroup, they are not really parallel; they are simply adjacent. And creating a parallel process saves no time if the main process must sit in a wait state until the subprocess feeds its results back.
GPGPU is not suitable for all tasks. Does a program really need to spread its load onto the GPU, or does it just need more processing capacity? The requirements should genuinely call for parallel processing, not just more CPU multi-threading.
Tasks for which parallel processing and GPGPU are well suited include big data applications, encryption or password cracking, complex modeling systems for scientific research, artificial intelligence, and CGI/3D animation.
GPGPU frameworks
The main issues of GPGPU revolve around the ways that instructions and data are delivered to a GPU and how results are returned. These are not issues of programming languages. They are organizational and architectural matters. Therefore, the implementation of GPGPU doesn’t require a new programming language but, rather, a library of methods to manage processors and coordinate their activities.
CUDA and ROCm are frameworks, and in both, the actual code can be implemented by pre-existing programming languages. Typically, both CUDA and ROCm systems are written in C, C++, Fortran, Julia, and Python.
The core of each framework is its interaction with the hardware components of a processor, specifically, memory usage and bus and register management. The frameworks also provide access to the communication pipeline between a CPU and a GPU.
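To make those responsibilities concrete, here is a minimal sketch (not drawn from either vendor's documentation) of the pattern a GPGPU framework supports: allocate device memory, push data across the CPU-GPU pipeline, launch a kernel across thousands of parallel threads, and pull the results back. It uses plain CUDA C++ and the standard runtime API.

```cpp
// vector_add.cu -- compile with: nvcc vector_add.cu -o vector_add
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// The kernel runs once per element, spread across GPU threads in parallel.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host-side (CPU) data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // The framework manages GPU memory and host-device transfers.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copying the results back also synchronizes with the kernel.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]); // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```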
The differences between CUDA and ROCm
When considering whether to use CUDA or ROCm for GPGPU development, the choice is dictated by one factor: whether your GPU was made by Nvidia or by AMD. An experienced programmer can work on either a CUDA or a ROCm implementation, and the design issues are the same for both.
These systems interpret programs into hardware instructions so that the coder can create relatively hardware-agnostic programs. However, there are some framework-specific functions supplied by the SDK libraries. The more a program uses the extra libraries of the framework, the more work will be needed to convert code from one system to the other.
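As an illustration of that conversion burden, here is a minimal, hypothetical sketch of a framework-specific library call: a small matrix multiply through ROCm's hipBLAS library (header path as in recent ROCm releases). The equivalent CUDA program calls cuBLAS instead, where each name below carries a cublas/CUBLAS prefix (hipblasSgemm becomes cublasSgemm, HIPBLAS_OP_N becomes CUBLAS_OP_N), and every one of those calls has to be found and renamed during a port.

```cpp
// sgemm_demo.cpp -- compile with: hipcc sgemm_demo.cpp -lhipblas
#include <cstdio>
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>

int main() {
    // Column-major 2x2 matrices: C = alpha * A * B + beta * C.
    float hA[4] = {1, 2, 3, 4}, hB[4] = {5, 6, 7, 8}, hC[4] = {0, 0, 0, 0};
    float alpha = 1.0f, beta = 0.0f;

    // Device buffers and transfers, via the HIP runtime.
    float *dA, *dB, *dC;
    hipMalloc(&dA, sizeof(hA));
    hipMalloc(&dB, sizeof(hB));
    hipMalloc(&dC, sizeof(hC));
    hipMemcpy(dA, hA, sizeof(hA), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB, sizeof(hB), hipMemcpyHostToDevice);

    hipblasHandle_t handle;
    hipblasCreate(&handle);                  // cuBLAS: cublasCreate
    hipblasSgemm(handle,                     // cuBLAS: cublasSgemm
                 HIPBLAS_OP_N, HIPBLAS_OP_N, // cuBLAS: CUBLAS_OP_N
                 2, 2, 2, &alpha, dA, 2, dB, 2, &beta, dC, 2);

    hipMemcpy(hC, dC, sizeof(hC), hipMemcpyDeviceToHost);
    printf("C = [%g %g; %g %g]\n", hC[0], hC[2], hC[1], hC[3]);

    hipblasDestroy(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```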
About CUDA
Officially, CUDA isn’t short for anything. The system was created by Nvidia, and while it originally stood for Compute Unified Device Architecture, that full name is no longer used. Nvidia now refers to the framework only as CUDA. The definitive source for the CUDA Toolkit and all related packages is the CUDA Zone on the Nvidia Developer website.
Nvidia Corporation produced the first GPU in 1999 and released CUDA for GPGPU in 2006. The framework is implemented as a software development kit (SDK), which interprets programs into low-level instructions for Nvidia GPUs. Only Nvidia GPUs are compatible with CUDA; you can't use it on AMD GPUs.
The SDK is now in its 12th version. Each release adds more capabilities, and an ever-increasing list of libraries extends the platform. The CUDA Toolkit is available for free and installs on Windows and Linux; the last version that could run on macOS was 10.2. On Linux, the Toolkit runs on Ubuntu, Debian, Fedora, CentOS, and OpenSUSE.
The core package of CUDA includes compilers for C/C++ and Fortran. Third parties provide free wrappers for Python, Perl, Java, Ruby, Golang, Lua, Common Lisp, Haskell, R, MATLAB, IDL, and Julia.
About ROCm
ROCm stands for Radeon Open Compute platform; the "m" comes from the final letter of "platform." This framework was created by Advanced Micro Devices (AMD) to give the buyers of its GPU products the same level of programming that Nvidia owners enjoy with CUDA.
This framework is newer than its rival. It was first released in 2016 and is currently in version 6.0.2, which was released on 31 January 2024. The SDK is free to use, and its definitive source is the AMD ROCm Developer Hub. The software will run on Windows and Linux. It is compatible with RHEL, SLES, and Ubuntu Linux.
This is an open-source project. AMD maintains the core code and decides which improvements submitted by outside contributors are included in the next release.
Like CUDA, ROCm includes compilers for C, C++, and Fortran. The Fortran implementations of CUDA and ROCm do not have the same exact syntax and, therefore, are not directly portable. Extensions allow Julia, Java/JavaScript/Node.js, Ruby, PHP, .NET, Golang, and Python programs to run on the platform. Neither Lua nor MATLAB will run on ROCm, and the AI programming languages Lisp, Haskell, and Prolog are also not available.
ROCm also emphasizes containerized deployment, with support for Docker containers and Kubernetes for container orchestration. The system provides OpenCL compatibility as well, with an Installable Client Driver (ICD) loader to get parallel-processing ROCm programs installed on different platforms.
The ROCm package also integrates the C++ Heterogeneous-Compute Interface for Portability (HIP). This is a dialect of C++ that enables a single source code to be compiled for use on AMD or Nvidia hardware. However, the talk of cross-vendor portability comes from the ROCm community, while Nvidia has asserted its determination to block attempts at cross-compatibility. So, don't pin your plans on HIP-generated code running on an Nvidia GPU.
Spotlight on ROCm: Working with TensorWave's AMD GPUs
If you are going to implement parallel processing on a GPU, and that GPU was made by AMD, you must build your program on the ROCm framework.
TensorWave is an AI-compute cloud platform that pushes data processing speed limits with a proprietary architecture called a “memory fabric.” The underlying hardware of this system includes the AMD Instinct MI300X platform, which uses its own 4th-Gen Infinity Fabric links to connect eight GPUs in one unit. The TensorWave system groups these accelerators to provide a mass of up to 80 GPUs.
The advantage of the proximity of the eight GPUs in the MI300X is reduced power loss over connections and faster response times. The architecture lends itself well to parallel processing, so developers who want to exploit this path will need to use the ROCm platform.
Ecosystem and Compatibility
ROCm supports programming languages other than its core C/C++ compiler through interpreters. It is also compatible with several deep learning frameworks. In most cases, these frameworks can be set up to run in Docker containers or installed directly with Python wheels.
TensorWave stresses its system's support for PyTorch and TensorFlow. However, its ROCm service can add other frameworks to an account on the platform. For example, PyTorch compatibility also gives you the option of using MAGMA, and ROCm allows you to install Inception V3 with PyTorch, which is a good option for image processing with convolutional neural networks.
ROCm can exploit the MIOpen library, AMD's open-source library of deep learning primitives published under its GPUOpen initiative and comparable to Nvidia's cuDNN. This is where TensorFlow on ROCm gets its GPU acceleration. The MIOpen system also gives you access to Theano, Caffe, MXNet, Microsoft Cognitive Toolkit, Torch, and Chainer. Programming with the MIOpen libraries can be implemented with OpenCL (a C-based parallel programming standard) or with Python.
Making the Switch: From CUDA to ROCm
There are several issues to address if you want to move your system from an Nvidia-based environment to an AMD environment, such as the TensorWave platform. The first is that code written in Fortran must be adjusted for the syntax differences between the two services. Other issues include the different names for functions in the memory-access libraries of the CUDA and ROCm platforms.
Both Nvidia and AMD are members of technology-sharing consortiums, such as the Khronos Group. However, Nvidia sees compatibility as an erosion of its market leadership in advanced microchips. Therefore, many compatibility projects falter because of a lack of cooperation or outright legal action.
However, the recent release of a translation layer called ZLUDA has thrust compatibility into the headlines. What's more, ZLUDA is free to use and available from a GitHub repository. The tool runs on Windows and Linux.
The ZLUDA project had been under development for two years when it was released in February 2024. The service mediates between a CUDA binary and the AMD environment. This means you don't have to recompile your programs; you simply move the binaries over and run them on top of ZLUDA.
Early performance benchmarks show the ZLUDA layer running CUDA workloads on AMD GPUs faster than OpenCL-based versions and HIP-converted, recompiled code of the same programs.
Converting Code Through HIP
HIP has already been mentioned above. It is a vendor-neutral dialect of C++ that can be compiled for use on Nvidia or AMD systems. That neutrality offers a path to convert CUDA C++ code to be AMD-compatible so that it can be recompiled.
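To show how close the two dialects are, here is the earlier vector-add sketch rewritten in HIP (again a minimal sketch, not a definitive port). The runtime calls are renamed from cuda* to hip*, the header changes, and the kernel code and launch syntax carry over unchanged.

```cpp
// vector_add.cpp -- compile with: hipcc vector_add.cpp -o vector_add
#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

// The kernel is identical to the CUDA version.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Same memory-management pattern as CUDA, with a hip prefix.
    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, bytes);
    hipMalloc(&d_b, bytes);
    hipMalloc(&d_c, bytes);
    hipMemcpy(d_a, h_a, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b, bytes, hipMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    hipMemcpy(h_c, d_c, bytes, hipMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]); // expect 3.0

    hipFree(d_a); hipFree(d_b); hipFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```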
Step-by-Step Guide to Transitioning
You can transition your CUDA programs for use on AMD GPUs using the HIPIFY package, which is available on GitHub. You should also download the HIPCC package.
The following steps give brief details on the conversion process. They summarize the guidance given by the AMD HIP Porting Guide:
- Change to the CUDA source code directory.
- Run hipexamine-perl.sh, which scans all the files and reports which ones contain CUDA code that can be converted. Then run hipconvertinplace-perl.sh to perform the conversion in place, renaming the CUDA runtime calls to their hip* equivalents as in the HIP sketch shown earlier; converted source files that carried the .cu extension should be renamed to .cpp.
- Compile with HIPCC. See the AMD HIPCC Guide for details on setting the environment variables for this tool and how to run the command.
For additional support, see Admin Magazine: Porting CUDA to HIP and the video AMD: Porting CUDA to HIP.
Conclusion
ROCm gives you the ability to run processor-intensive programs on AMD GPUs and gives AI developers access to the high-speed TensorWave AI Cloud platform.
The conversion tips in the previous sections concentrate on the process of moving C++ code from CUDA to ROCm; they don't cover programs that were written in Fortran. In all cases, try the ZLUDA translation layer first. If that method gives you an efficient way to run your CUDA systems on AMD, you will at least buy some time to review all of your programs and decide whether they can simply be converted or whether they are due for a rewrite.
Additional Resources
- AMD ROCm Blog - overview of the ROCm blog
- Applications and models - for the latest posts on applications and models
- Software tools and optimizations - for the latest posts on software tools and optimizations
- AMD ROCm Developer Hub
- The Guru of 3D: AMD ROCm Solution Enables Native Execution of NVIDIA CUDA Binaries on Radeon GPUs
- Medium: AMD, ROCM, PyTorch, and AI on Ubuntu: The Rules of the Jungle
- Dell Technologies Info Hub: Is AMD ROCm Ready to Deploy Leading AI Workloads?
- AMD Community ROCm Forum
- GitHub ROCm Community Discussions
- Reddit ROCm Forum
- ROCm Developer Tools
- AMD Infinity Hub
- AMD Instinct Accelerators
- TensorWave AI Memory Fabric