Comprehensive Guide to NVIDIA CUDA Runtime CU11 for GPU Acceleration

Introduction to NVIDIA CUDA Runtime CU11

The NVIDIA CUDA Runtime CU11 provides the runtime API that applications use to offload compute-intensive work to the GPU. This guide walks through commonly used CUDA Runtime API calls with short code snippets, then combines several of them in a sample vector-addition application. The snippets build on one another, so variables such as err, size, and device carry over from earlier examples.

Useful API Examples

1. cudaMalloc

Allocate memory on the GPU.

    float* devPtr;
    size_t size = 1024 * sizeof(float);
    cudaError_t err = cudaMalloc((void**)&devPtr, size);

2. cudaMemcpy

Copy data between host and device. The last argument selects the direction of the copy.

    float* hostPtr = (float*)malloc(size);
    err = cudaMemcpy(devPtr, hostPtr, size, cudaMemcpyHostToDevice);

3. cudaFree

Free allocated GPU memory.

 err = cudaFree(devPtr); 

4. cudaMemGetInfo

Get free and total memory available on the GPU.

    size_t freeMem, totalMem;
    err = cudaMemGetInfo(&freeMem, &totalMem);

5. cudaDeviceSynchronize

Wait for the device to finish all preceding requested tasks.

 err = cudaDeviceSynchronize(); 

6. cudaGetDeviceCount

Get the number of CUDA-capable devices.

    int deviceCount;
    err = cudaGetDeviceCount(&deviceCount);

7. cudaSetDevice

Set the active CUDA device.

    int device = 0;
    err = cudaSetDevice(device);

8. cudaGetDeviceProperties

Get the properties of the specified CUDA device.

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);

9. cudaEventCreate

Create an event to synchronize activities on the GPU.

    cudaEvent_t event;
    err = cudaEventCreate(&event);

10. cudaEventDestroy

Destroy an event created earlier.

 err = cudaEventDestroy(event); 

11. cudaEventRecord

Record an event in a stream. Passing 0 as the second argument selects the default stream.

 err = cudaEventRecord(event, 0); 

12. cudaEventSynchronize

Wait for an event to complete.

 err = cudaEventSynchronize(event); 

13. cudaEventElapsedTime

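Calculate the elapsed time, in milliseconds, between two recorded events.

    // startEvent and endEvent are cudaEvent_t handles created and recorded earlier
    float elapsedTime;
    err = cudaEventElapsedTime(&elapsedTime, startEvent, endEvent);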

Example Application

Here’s a simple example application that uses some of these APIs to perform vector addition.

#include <cmath>
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>
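// Each thread adds one pair of elements.
__global__ void vectorAdd(const float* A, const float* B, float* C, size_t N) {
    // Global index of this thread across the whole grid
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    // Guard: the last block may contain threads past the end of the vectors
    if (i < N) C[i] = A[i] + B[i];
}

int main() {
  size_t N = 1024;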
  size_t size = N * sizeof(float);

  // Allocate memory on host
  float *h_A = (float *)malloc(size);
  float *h_B = (float *)malloc(size);
  float *h_C = (float *)malloc(size);

  // Initialize vectors
  for (size_t i = 0; i < N; i++) {
      h_A[i] = static_cast<float>(i);
      h_B[i] = static_cast<float>(2 * i);
  }

  // Allocate memory on device
  float *d_A, *d_B, *d_C;
  cudaMalloc((void **)&d_A, size);
  cudaMalloc((void **)&d_B, size);
  cudaMalloc((void **)&d_C, size);

  // Copy vectors from host to device
  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

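  // Launch the kernel with enough blocks to cover all N elements (ceiling division)
  int threadsPerBlock = 256;
  int blocksPerGrid = static_cast<int>((N + threadsPerBlock - 1) / threadsPerBlock);
  vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);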

  // Copy result back to host; this blocking cudaMemcpy also waits for the kernel to finish
  cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

  // Validate results
  for (size_t i = 0; i < N; i++) {
      if (std::abs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
          std::cerr << "Result verification failed at element " << i << "!" << std::endl;
          return -1;
      }
  }

  std::cout << "Test PASSED" << std::endl;

  // Free device memory
  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);

  // Free host memory
  free(h_A);
  free(h_B);
  free(h_C);

  return 0;
} 

This example demonstrates how to use CUDA Runtime APIs to manage device memory, transfer data, and launch a kernel. It does not call cudaDeviceSynchronize explicitly; the blocking cudaMemcpy back to the host serves as the synchronization point. You can use these building blocks to develop more complex applications that leverage GPU acceleration.
