Introduction to NVIDIA CUDA Runtime CU11
The NVIDIA CUDA Runtime CU11 is the CUDA 11.x runtime library for leveraging GPU acceleration in applications. It helps developers improve performance by offloading compute-intensive work to the GPU. In this guide, we’ll walk through useful APIs provided by CUDA Runtime CU11 with code snippets and a sample application.
Useful API Examples
1. cudaMalloc
Allocate memory on the GPU.
float* devPtr;
size_t size = 1024 * sizeof(float);
cudaError_t err = cudaMalloc((void**)&devPtr, size);
2. cudaMemcpy
Copy data between host and device.
float* hostPtr = (float*)malloc(size);
err = cudaMemcpy(devPtr, hostPtr, size, cudaMemcpyHostToDevice);
3. cudaFree
Free allocated GPU memory.
err = cudaFree(devPtr);
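The three calls above form the basic device-memory lifecycle: allocate, copy, free. A minimal sketch tying them together, with every return code checked (the `check` helper is an illustrative pattern, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with a readable message if a CUDA call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    const size_t size = 1024 * sizeof(float);
    float hostBuf[1024] = {0};

    float* devPtr = nullptr;
    check(cudaMalloc((void**)&devPtr, size), "cudaMalloc");
    check(cudaMemcpy(devPtr, hostBuf, size, cudaMemcpyHostToDevice), "cudaMemcpy H2D");
    check(cudaMemcpy(hostBuf, devPtr, size, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");
    check(cudaFree(devPtr), "cudaFree");
    return 0;
}
```

Checking every runtime call like this is tedious but catches failures (e.g. out-of-memory) at the point they occur; the sample application at the end of this guide omits the checks for brevity.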
4. cudaMemGetInfo
Get free and total memory available on the GPU.
size_t freeMem, totalMem;
err = cudaMemGetInfo(&freeMem, &totalMem);
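Both values are reported in bytes. For example, converting them to megabytes for display:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    // Byte counts converted to MB for readability.
    std::printf("GPU memory: %.1f MB free of %.1f MB total\n",
                freeMem / (1024.0 * 1024.0), totalMem / (1024.0 * 1024.0));
    return 0;
}
```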
5. cudaDeviceSynchronize
Wait for the device to finish all preceding requested tasks.
err = cudaDeviceSynchronize();
6. cudaGetDeviceCount
Get the number of CUDA capable devices.
int deviceCount;
err = cudaGetDeviceCount(&deviceCount);
7. cudaSetDevice
Set the active CUDA device.
int device = 0;
err = cudaSetDevice(device);
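cudaGetDeviceCount and cudaSetDevice are typically used together: enumerate the available devices, then make one of them the active device for all subsequent runtime calls on this thread. A short sketch, assuming at least one CUDA-capable GPU is present:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA devices found\n");
        return 1;
    }
    std::printf("Found %d CUDA device(s)\n", deviceCount);
    // Make device 0 the active device for subsequent runtime calls.
    cudaSetDevice(0);
    return 0;
}
```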
8. cudaGetDeviceProperties
Get the properties of the specified CUDA device.
cudaDeviceProp prop;
err = cudaGetDeviceProperties(&prop, device);
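The cudaDeviceProp struct exposes many fields; a sketch printing a few commonly inspected ones for device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    std::printf("Name: %s\n", prop.name);
    std::printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    std::printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
    std::printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

These fields are useful for choosing launch configurations at runtime, e.g. capping threadsPerBlock at prop.maxThreadsPerBlock.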
9. cudaEventCreate
Create an event to synchronize activities on the GPU.
cudaEvent_t event;
err = cudaEventCreate(&event);
10. cudaEventDestroy
Destroy an event created earlier.
err = cudaEventDestroy(event);
11. cudaEventRecord
Record an event in a stream (0 denotes the default stream).
err = cudaEventRecord(event, 0);
12. cudaEventSynchronize
Wait for an event to complete.
err = cudaEventSynchronize(event);
13. cudaEventElapsedTime
Calculate the elapsed time between two events.
float elapsedTime;
err = cudaEventElapsedTime(&elapsedTime, startEvent, endEvent);
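Putting the event APIs together, the usual GPU timing pattern is: create two events, record one before and one after the work, synchronize on the second, then read the elapsed time in milliseconds. A sketch (timingKernel is a placeholder for whatever kernel you want to time):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void timingKernel() { /* placeholder work */ }

int main() {
    cudaEvent_t startEvent, endEvent;
    cudaEventCreate(&startEvent);
    cudaEventCreate(&endEvent);

    cudaEventRecord(startEvent, 0);   // mark the start in the default stream
    timingKernel<<<1, 1>>>();         // the work being timed
    cudaEventRecord(endEvent, 0);     // mark the end
    cudaEventSynchronize(endEvent);   // wait until endEvent has actually occurred

    float elapsedTime = 0.0f;
    cudaEventElapsedTime(&elapsedTime, startEvent, endEvent);  // milliseconds
    std::printf("Kernel took %.3f ms\n", elapsedTime);

    cudaEventDestroy(startEvent);
    cudaEventDestroy(endEvent);
    return 0;
}
```

Event-based timing measures time on the GPU itself, so it is more accurate for kernels than wrapping the launch in host-side clock calls, which would only measure the (asynchronous) launch.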
Example Application
Here’s a simple example application that uses some of these APIs to perform vector addition.
#include <iostream>
#include <cmath>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float* A, const float* B, float* C, size_t N) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

int main() {
    size_t N = 1024;
    size_t size = N * sizeof(float);

    // Allocate memory on host
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // Initialize vectors
    for (size_t i = 0; i < N; i++) {
        h_A[i] = static_cast<float>(i);
        h_B[i] = static_cast<float>(2 * i);
    }

    // Allocate memory on device
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy vectors from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result back to host (this copy synchronizes with the kernel)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Validate results
    for (size_t i = 0; i < N; i++) {
        if (std::abs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
            std::cerr << "Result verification failed at element " << i << "!" << std::endl;
            return -1;
        }
    }
    std::cout << "Test PASSED" << std::endl;

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
This example demonstrates how to use CUDA Runtime APIs to manage memory, launch kernels, and handle device synchronization. You can use these building blocks to develop complex, high-performance applications that leverage GPU acceleration.