
A100 GPU: Accelerate Numerical Computing in C++ with a Python-like Syntax in NVIDIA MatX


It is said that "you can write faster code in C++, but code faster in Python." Thanks to CUDA, C and C++ programmers have been able to get the most out of NVIDIA GPUs for more than a decade.

More recently, libraries like CuPy and PyTorch have made it easier for developers working in interpreted languages to tap into the speed of the CUDA libraries. These interpreted languages offer many great features: easy-to-read syntax, automatic memory management, and common types across all functions.

However, these features sometimes come with a performance cost, due to how memory is managed and other factors outside your control. That trade-off is often worth the reduced development time, but it may still force parts of the application to be rewritten later when performance becomes an issue.

What if you could keep using C++ for the best performance while still getting all the benefits of the interpreted languages?

All about MatX

MatX is an experimental C++ library that aims to bridge the gap between users who want the best performance and those who want the same easy syntax and types across all CUDA libraries. Using the C++17 support added in CUDA 11.0, MatX lets you write the same natural algebraic expressions that you would in a high-level language like Python, without paying for it in performance.

Tensor types

MatX provides interfaces to many of the most popular math libraries, such as cuBLAS, CUTLASS, cuFFT, and CUB, all through a common data type called tensor_t. This greatly simplifies the API to these libraries: MatX infers what it can from the tensor type and calls the right library APIs based on that information.

An FFT-based resampler is shown in the code below, first in NumPy and then in MatX.

Python

import numpy as np

N = min(ns, ns_resamp)
nyq = N // 2 + 1

# Create an empty vector
sv = np.empty(ns)

# Real to complex FFT
svc = np.fft.rfft(sv)

# Slice
sv = svc[0:nyq]

# Complex to real IFFT
rsv = np.fft.irfft(sv, ns_resamp)

MatX

uint32_t N = std::min(ns, ns_resamp); 
uint32_t nyq = N / 2 + 1;

auto sv = make_tensor<float>({ns}); 
auto svc = make_tensor<complex>({ns / 2 + 1}); 
auto rsv = make_tensor<float>({ns_resamp});

// Real to complex FFT
fft(svc, sv, stream);

// Slice the first nyq elements (zero-copy view)
auto svs = svc.Slice({0}, {nyq});

// Complex to real IFFT
ifft(rsv, svs, stream);

The MatX version runs about 2100 times faster than NumPy on an A100, with roughly the same code length and readability. The MatX version also has many hidden benefits over using the CUDA libraries directly, such as type checking, input and output size checking, and slicing a tensor without any pointer manipulation.

The tensor types aren’t limited to FFTs, though. They can also be used in other libraries and expressions, and the same variables can be reused across all of them. Suppose you wanted to do a GEMM with CUTLASS on the resampler output. You could write:

matmul(resampOut, resampView, B, stream);

In this code, resampOut and B are appropriately sized tensors for the GEMM operation. As in the preceding FFT example, the types, sizes, batches, and strides are all inferred from the tensor metadata. Using a strongly typed C++ API also means that many runtime and compile-time errors can be caught without extra debugging, which saves time.
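As a rough sketch of the shapes involved (M, K, and N and the construction of B and resampOut below are illustrative assumptions, not part of the original example):

// resampView is assumed to be a 2D {M, K} view of the resampler output.
// B and resampOut are sized so that the matrix product is well defined.
auto B         = make_tensor<float>({K, N});
auto resampOut = make_tensor<float>({M, N});

matmul(resampOut, resampView, B, stream);   // {M, N} = {M, K} x {K, N}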

These same tensor types can also be used in algebraic expressions to perform operations on each individual element:

(C = A * B + (D / 5.0) + cos(E)).run(stream);

Lazy Evaluation

MatX uses lazy evaluation to create a GPU kernel at compile time that matches the expression in parentheses. The expression does not execute on the GPU until you call the run function.
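As a minimal sketch of this pattern (the tensor names, sizes, and the 2.0f constant are illustrative, reusing the make_tensor and run calls shown earlier):

auto A = make_tensor<float>({N});
auto B = make_tensor<float>({N});
auto C = make_tensor<float>({N});

// Building the expression only creates an operator object; no GPU work happens yet.
auto expr = (C = A * 2.0f + B);

// The fused element-wise kernel is launched only here.
expr.run(stream);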

More than 40 operators are supported, and they can be mixed and matched across tensors of different sizes and types with compatible parameters. Here is how the expression above would look if it were written directly in CUDA:

__global__ void Expression( float *C, 
                            const float *A,
                            const float *B,
                            const float *D,
                            const float *E,
                            int length)
{
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; 
         idx < length; 
         idx += blockDim.x * gridDim.x) 
    {
        C[idx] = A[idx] * B[idx] + (D[idx] / 5.0) + cosf(E[idx]);    
    }
}

There are several problems hidden in the earlier code:

  • The data types are hard-coded as floats. Changing to another type means changing the kernel signature. An astute reader would suggest using templates and letting the compiler deduce the types.
  • While that works for some types, it won't work for all of them. For example, cosf cannot be used with half-precision values, so you must maintain compile-time conditionals to handle different types.
  • Any change to the function signature requires an entirely new function. Suppose you wanted to add an F tensor in some cases but keep the existing signature as well. That would leave two nearly identical functions to maintain.
  • While a grid-stride loop is good practice, you still need launch code to ensure enough threads are launched to keep the GPU busy; a minimal launch sketch is shown after this list.
  • All of the inputs are assumed to be 1D vectors; non-unity strides in a higher dimension could break the code.

There are many other problems not listed here, including no ability to broadcast differently sized tensors, no checking on sizes, requiring contiguous memory layouts, and more.
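For the launch-code point above, a minimal sketch of what you would otherwise have to write yourself might look like this (the thread count is an illustrative choice):

int threads = 256;
int blocks  = (length + threads - 1) / threads;   // enough blocks to cover every element
Expression<<<blocks, threads, 0, stream>>>(C, A, B, D, E, length);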

This handwritten code covers only a narrow set of situations, while the MatX version solves all of these problems and more, typically with the same performance as writing the kernel directly.

Additional MatX Features

  • Zero-copy views created by slicing, copying, and permuting existing tensors (see the sketch after this list).
  • Support for tensors of any dimension.
  • Generators that produce data on the fly without storing it in memory. Common examples include a linearly spaced vector, a Hamming window, or a diagonal matrix.
  • Support for half-precision numbers (both FP16 and BF16) and for complex numbers (both full and half precision).
  • Linear solvers through cuSolver, sorting and scanning through CUB, random number generation through cuRAND, reductions, and more.
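As a small sketch of the zero-copy views in the first bullet, reusing the Slice call from the resampler example (the tensor name and 2D sizes are illustrative assumptions):

auto t   = make_tensor<float>({M, N});
// A view of the first M/2 rows; no data is copied, only the view metadata changes.
auto top = t.Slice({0, 0}, {M / 2, N});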
