Unleashing GPU Power with NVIDIA cuBLAS CU11 API

Introduction to NVIDIA cuBLAS CU11

The NVIDIA cuBLAS library is a GPU-accelerated implementation of the standard BLAS (Basic Linear Algebra Subprograms) API, providing fast routines for vector and matrix operations. In this post, we will walk through several core cuBLAS CU11 API calls with code snippets and then build a complete matrix multiplication example.

Basic cuBLAS API Examples

1. Initialization and Cleanup

  #include <cublas_v2.h>

  cublasHandle_t handle;
  cublasStatus_t status = cublasCreate(&handle);
  if (status != CUBLAS_STATUS_SUCCESS) {
      // Handle initialization failure
  }

  // Perform operations using the handle

  cublasDestroy(handle);

2. Vector Addition (SAXPY)

  int n = 100;
  float alpha = 1.0f;
  // d_x and d_y are device pointers to n floats; computes y = alpha*x + y
  cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

3. Dot Product

  float result;
  // With the default CUBLAS_POINTER_MODE_HOST, the result is written to
  // host memory and the call blocks until it is available
  cublasSdot(handle, n, d_x, 1, d_y, 1, &result);

4. Matrix-Matrix Multiplication

  float alpha = 1.0f;
  float beta = 0.0f;
  // Computes C = alpha*A*B + beta*C; matrices are column-major, and the
  // leading dimensions (m, k, m) are the row counts of A, B, and C
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);

5. Matrix Inversion

  // LU-factorize, then invert, a batch of n x n matrices. d_Aarray and
  // d_output are device arrays of batchSize device pointers; the pivot
  // and info arrays must also reside in device memory
  int *d_P, *d_info;
  cudaMalloc((void**)&d_P, n * batchSize * sizeof(int));
  cudaMalloc((void**)&d_info, batchSize * sizeof(int));
  cublasSgetrfBatched(handle, n, d_Aarray, n, d_P, d_info, batchSize);
  cublasSgetriBatched(handle, n, (const float**)d_Aarray, n, d_P, d_output, n, d_info, batchSize);

Application Example: Matrix Multiplication

Let’s create a simple application that utilizes the cuBLAS API to perform matrix multiplication:

  #include <cublas_v2.h>
  #include <cuda_runtime.h>
  #include <stdio.h>
  
  // Print a column-major matrix (cuBLAS convention); element (i, j)
  // is stored at A[j * rows + i]
  void printMatrix(const float* A, int rows, int cols) {
      for (int i = 0; i < rows; i++) {
          for (int j = 0; j < cols; j++) {
              printf("%0.2f ", A[j * rows + i]);
          }
          printf("\n");
      }
  }
  
  int main() {
      cublasHandle_t handle;
      cublasCreate(&handle);
      
      // Dimensions must be compile-time constants so the array
      // initializers below are legal C++
      const int m = 2, n = 2, k = 2;
      // Column-major storage: A and B both represent [1 3; 2 4]
      float A[m*k] = {1, 2, 3, 4};
      float B[k*n] = {1, 2, 3, 4};
      float C[m*n];
      
      float *d_A, *d_B, *d_C;
      cudaMalloc((void**)&d_A, m*k * sizeof(float));
      cudaMalloc((void**)&d_B, k*n * sizeof(float));
      cudaMalloc((void**)&d_C, m*n * sizeof(float));
      
      cudaMemcpy(d_A, A, m*k * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(d_B, B, k*n * sizeof(float), cudaMemcpyHostToDevice);
      
      float alpha = 1.0f;
      float beta = 0.0f;
      
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);
      
      cudaMemcpy(C, d_C, m*n * sizeof(float), cudaMemcpyDeviceToHost);
      
      printf("Result matrix C:\n");
      printMatrix(C, m, n);
      
      cudaFree(d_A);
      cudaFree(d_B);
      cudaFree(d_C);
      cublasDestroy(handle);
      
      return 0;
  }

This example multiplies two 2x2 matrices on an NVIDIA GPU: the inputs are copied to device memory, `cublasSgemm` computes the product there, and the result is copied back to the host and printed. Compile it with nvcc and link against cuBLAS (the `-lcublas` flag).

With the growing complexity of scientific computations, leveraging GPU acceleration using cuBLAS can drastically reduce computation time and improve efficiency.

