Maximizing GPU Acceleration with NVIDIA cuBLAS CU11 Library

Introduction to NVIDIA cuBLAS CU11

The NVIDIA cuBLAS CU11 library is part of the CUDA Toolkit and delivers GPU-accelerated implementations of linear algebra primitives, such as matrix and vector operations. cuBLAS is widely used in machine learning, scientific computing, and high-performance applications where optimized math computations are critical.

cuBLAS CU11 provides highly optimized APIs for operations like matrix multiplication, vector scaling, and more, executed directly on NVIDIA GPUs. In this blog, we explore cuBLAS CU11, walk through five of its most useful APIs, and illustrate how to integrate them into a real-world GPU-accelerated application.
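One housekeeping note before the examples: every cuBLAS call returns a cublasStatus_t, and real applications should check it. Here is a minimal sketch of such a check; the CHECK_CUBLAS macro name is our own convention, not part of the library:

  #include <cublas_v2.h>
  #include <stdio.h>
  #include <stdlib.h>

  // Hypothetical helper: abort with a message if a cuBLAS call fails.
  #define CHECK_CUBLAS(call)                                      \
      do {                                                        \
          cublasStatus_t status_ = (call);                        \
          if (status_ != CUBLAS_STATUS_SUCCESS) {                 \
              fprintf(stderr, "cuBLAS error %d at %s:%d\n",       \
                      (int)status_, __FILE__, __LINE__);          \
              exit(EXIT_FAILURE);                                 \
          }                                                       \
      } while (0)

The examples below omit this check for brevity, but any call shown (cublasCreate, cublasSgemm, and so on) can be wrapped in it.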

Key cuBLAS APIs and Examples

1. cublasSgemm: Single-Precision Matrix Multiplication

The cublasSgemm API performs single-precision general matrix multiplication (GEMM). Here’s an example:

  #include <cublas_v2.h>
  #include <cuda_runtime.h>

  void matrixMultiply(float* A, float* B, float* C, int M, int N, int K) {
      cublasHandle_t handle;
      float alpha = 1.0f, beta = 0.0f;

      cublasCreate(&handle);
      // A, B, and C are device pointers to column-major matrices:
      // A is M x K, B is K x N, and C = alpha * A * B + beta * C is M x N.
      // The leading dimensions (M, K, M) are the row counts of each matrix.
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                  &alpha, A, M, B, K, &beta, C, M);
      cublasDestroy(handle);
  }
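A minimal call site might look like the following sketch. The buffer names are illustrative; note that cuBLAS expects column-major storage, so the leading dimensions above are the row counts.

  // Multiply a 2x3 matrix A by a 3x2 matrix B into a 2x2 matrix C.
  float *d_A, *d_B, *d_C;
  cudaMalloc(&d_A, 2 * 3 * sizeof(float));
  cudaMalloc(&d_B, 3 * 2 * sizeof(float));
  cudaMalloc(&d_C, 2 * 2 * sizeof(float));
  // ... fill d_A and d_B with cudaMemcpy from host data ...
  matrixMultiply(d_A, d_B, d_C, 2, 2, 3);  // M = 2, N = 2, K = 3
  // ... copy d_C back to the host, then cudaFree all three buffers ...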

2. cublasDaxpy: Double-Precision Vector Addition

Use cublasDaxpy to add a scaled vector to another, computing y = alpha * x + y in place:

  void vectorAddition(cublasHandle_t handle, double* x, double* y, int n) {
      double alpha = 2.0;
      // x and y are device pointers; the strides of 1 mean both vectors are contiguous.
      cublasDaxpy(handle, n, &alpha, x, 1, y, 1);
  }
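To exercise it, both vectors must live in device memory. A sketch (handle creation and error checks omitted):

  int n = 4;
  double hx[4] = {1, 2, 3, 4}, hy[4] = {10, 20, 30, 40};
  double *dx, *dy;
  cudaMalloc(&dx, n * sizeof(double));
  cudaMalloc(&dy, n * sizeof(double));
  cudaMemcpy(dx, hx, n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, hy, n * sizeof(double), cudaMemcpyHostToDevice);
  vectorAddition(handle, dx, dy, n);  // with alpha = 2.0, dy becomes {12, 24, 36, 48}
  cudaMemcpy(hy, dy, n * sizeof(double), cudaMemcpyDeviceToHost);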

3. cublasSetVector and cublasGetVector: Memory Transfers

Transfer data between the host and the device:

  const int N = 10;
  float hostData[N] = {0};
  float* deviceData;
  cudaMalloc(&deviceData, N * sizeof(float));

  // Copy N floats host -> device, then back, with a stride of 1 on both sides.
  cublasSetVector(N, sizeof(float), hostData, 1, deviceData, 1);
  cublasGetVector(N, sizeof(float), deviceData, 1, hostData, 1);
  cudaFree(deviceData);
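cuBLAS also ships two-dimensional counterparts, cublasSetMatrix and cublasGetMatrix, which take row and column counts plus the leading dimension of each matrix. A quick sketch for a 4x3 column-major matrix:

  const int rows = 4, cols = 3;
  float hostMat[12] = {0};
  float* deviceMat;
  cudaMalloc(&deviceMat, rows * cols * sizeof(float));

  // The leading dimension equals the row count for a densely packed matrix.
  cublasSetMatrix(rows, cols, sizeof(float), hostMat, rows, deviceMat, rows);
  cublasGetMatrix(rows, cols, sizeof(float), deviceMat, rows, hostMat, rows);
  cudaFree(deviceMat);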

4. cublasSscal: Scale Vectors

cublasSscal scales a vector by a scalar:

  void scaleVector(cublasHandle_t handle, float* x, int n) {
      float alpha = 3.0f;
      // x is a device pointer; computes x = alpha * x in place.
      cublasSscal(handle, n, &alpha, x, 1);
  }
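All of the examples in this post pass alpha through a host pointer, which is the cuBLAS default. If a scalar instead lives in device memory (for example, because a kernel computed it), switch the pointer mode before the call:

  // Tell cuBLAS that scalar arguments (alpha, beta) are device pointers.
  cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
  // ... cuBLAS calls whose alpha/beta arguments point to device memory ...
  cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);  // restore the default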

5. cublasIdamax: Find the Index of the Max Magnitude Element

Use cublasIdamax to find the index of the element with the maximum absolute value in a double-precision vector (the single-precision counterpart is cublasIsamax). Note that cuBLAS returns a 1-based index:

  int findMaxIndex(cublasHandle_t handle, double* x, int n) {
      int index;
      // x is a device pointer; cuBLAS writes the 1-based index of the
      // max-magnitude element into 'index'.
      cublasIdamax(handle, n, x, 1, &index);
      return index;
  }
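Because the result follows BLAS's Fortran heritage, subtract one before indexing a C array. A sketch, reusing the device vector dx from the cublasDaxpy example:

  int idx = findMaxIndex(handle, dx, n);  // 1-based position of the max-magnitude element
  double maxVal;
  cudaMemcpy(&maxVal, dx + (idx - 1), sizeof(double), cudaMemcpyDeviceToHost);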

Real-World Application: Matrix Multiplication App for Neural Networks

Let’s build an example of a neural network layer that leverages the optimized cublasSgemm for matrix multiplication. This fully connected layer computes output = weights × input for a whole batch of samples at once:

  #include <cublas_v2.h>
  #include <cuda_runtime.h>
  #include <stdio.h>

  void neuralNetworkForwardPass(float* input, float* weights, float* output, int batchSize, int inputSize, int outputSize) {
      cublasHandle_t handle;
      float alpha = 1.0f, beta = 0.0f;

      cublasCreate(&handle);
      // Column-major GEMM: output (outputSize x batchSize) =
      // weights (outputSize x inputSize) * input (inputSize x batchSize).
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, outputSize, batchSize, inputSize,
                  &alpha, weights, outputSize, input, inputSize, &beta, output, outputSize);
      cublasDestroy(handle);
  }

  int main() {
      int batchSize = 2, inputSize = 3, outputSize = 4;
      // Column-major layouts: input is inputSize x batchSize (one sample per
      // column), weights is outputSize x inputSize (one column per input unit).
      float input[6] = {1, 2, 3, 4, 5, 6};
      float weights[12] = {0.1f, 0.2f, 0.3f, 0.4f,
                           0.5f, 0.6f, 0.7f, 0.8f,
                           0.9f, 1.0f, 1.1f, 1.2f};
      float output[8] = {0};

      float *d_input, *d_weights, *d_output;
      cudaMalloc(&d_input, batchSize * inputSize * sizeof(float));
      cudaMalloc(&d_weights, inputSize * outputSize * sizeof(float));
      cudaMalloc(&d_output, batchSize * outputSize * sizeof(float));

      cudaMemcpy(d_input, input, batchSize * inputSize * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(d_weights, weights, inputSize * outputSize * sizeof(float), cudaMemcpyHostToDevice);

      neuralNetworkForwardPass(d_input, d_weights, d_output, batchSize, inputSize, outputSize);

      cudaMemcpy(output, d_output, batchSize * outputSize * sizeof(float), cudaMemcpyDeviceToHost);

      for (int i = 0; i < batchSize * outputSize; i++) {
          printf("%.2f ", output[i]);
      }
      printf("\n");

      cudaFree(d_input);
      cudaFree(d_weights);
      cudaFree(d_output);
      return 0;
  }

In this example, cublasSgemm is used for a single forward pass of a fully connected layer, multiplying input values with weights and summing the results.
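As a sanity check, working the first column by hand: the first sample's outputs are 0.1·1 + 0.5·2 + 0.9·3 = 3.80, then 4.40, 5.00, and 5.60, and the second sample yields 8.30, 9.80, 11.30, and 12.80, which is what the program should print.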

Conclusion

Using cuBLAS CU11, developers can dramatically improve performance while simplifying GPU programming. Optimized operations such as matrix multiplication, vector scaling, and other linear algebra primitives can accelerate applications like neural networks, physical simulations, and more.
