Introduction to NVIDIA CUDA NVRTC CU11
The NVIDIA CUDA NVRTC CU11 library (the CUDA 11 build of NVRTC, NVIDIA's Runtime Compilation library) lets developers compile CUDA C++ device code dynamically. It adds flexibility in scenarios where kernels must be created, compiled, and executed at runtime. NVRTC ships as part of the CUDA toolkit and is designed to be lightweight and efficient.
In this blog post, we'll explore the core features of NVRTC CU11, walk through its key API functions with usage examples, and build a practical app that demonstrates runtime kernel compilation. Let's dive in!
Key APIs and Their Usage
1. nvrtcCreateProgram
Creates an NVRTC program object.
#include <nvrtc.h>

nvrtcProgram prog;
const char* program_source =
    "__global__ void add(int* a, int* b, int* c) {\n"
    "    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];\n"
    "}\n";

// No headers, no include names; the third argument is a name used in log messages
nvrtcCreateProgram(&prog, program_source, "add_kernel.cu", 0, NULL, NULL);
This initializes a program object for the given kernel source code.
2. nvrtcCompileProgram
Compiles the CUDA program.
const char* options[] = {"--gpu-architecture=compute_75"};
nvrtcResult compile_result = nvrtcCompileProgram(prog, 1, options);
if (compile_result != NVRTC_SUCCESS) {
    // The log buffer must be allocated by the caller:
    // query its size first, then retrieve the log
    size_t log_size;
    nvrtcGetProgramLogSize(prog, &log_size);
    char* log = (char*)malloc(log_size);
    nvrtcGetProgramLog(prog, log);
    printf("Compilation failed: %s\n", log);
    free(log);
}
Compile options mirror a subset of nvcc's flags. Here `--gpu-architecture` selects the virtual architecture the generated PTX targets; pick a value supported by the GPU you intend to run on.
3. nvrtcGetPTXSize
Retrieves the size of the compiled PTX (Parallel Thread Execution) code.
size_t ptx_size;
nvrtcGetPTXSize(prog, &ptx_size);
printf("PTX size: %zu bytes\n", ptx_size);  // size includes the trailing NUL
4. nvrtcGetPTX
Retrieves the compiled PTX code.
char* ptx = (char*)malloc(ptx_size);
nvrtcGetPTX(prog, ptx);
printf("Compiled PTX code:\n%s\n", ptx);
free(ptx);
5. nvrtcDestroyProgram
Destroys the previously created program object.
nvrtcDestroyProgram(&prog);
Use this to free memory allocated for the program object after compilation is complete.
Building a Practical Application
Here, we’ll create a simple app that dynamically compiles and runs a CUDA kernel at runtime using NVRTC CU11. The application demonstrates vector addition.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>          // driver API: cuInit, cuModuleLoadDataEx, cuLaunchKernel
#include <cuda_runtime.h>  // runtime API: cudaMalloc, cudaMemcpy
#include <nvrtc.h>

// extern "C" prevents C++ name mangling, so the kernel can later
// be looked up by name with cuModuleGetFunction.
const char* source_code =
    "extern \"C\" __global__ void vectorAdd(const int* a, const int* b, int* c, int n) {\n"
    "    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "    if (idx < n) {\n"
    "        c[idx] = a[idx] + b[idx];\n"
    "    }\n"
    "}\n";

int main() {
    // Compile the CUDA kernel with NVRTC
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source_code, "vectorAdd.cu", 0, NULL, NULL);
    const char* options[] = {"--gpu-architecture=compute_75"};
    if (nvrtcCompileProgram(prog, 1, options) != NVRTC_SUCCESS) {
        size_t log_size;
        nvrtcGetProgramLogSize(prog, &log_size);
        char* log = (char*)malloc(log_size);
        nvrtcGetProgramLog(prog, log);
        fprintf(stderr, "Compilation failed:\n%s\n", log);
        free(log);
        return 1;
    }
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    char* ptx = (char*)malloc(ptx_size);
    nvrtcGetPTX(prog, ptx);
    nvrtcDestroyProgram(&prog);

    // Initialize the driver API, then let the runtime create and
    // activate the primary context so the driver calls below have one
    cuInit(0);
    cudaFree(0);

    // Load the compiled PTX and look up the kernel
    CUmodule module;
    CUfunction kernel;
    cuModuleLoadDataEx(&module, ptx, 0, NULL, NULL);
    cuModuleGetFunction(&kernel, module, "vectorAdd");

    const int N = 512;
    int h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; ++i) {
        h_a[i] = i;
        h_b[i] = N - i;
    }

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMalloc(&d_b, N * sizeof(int));
    cudaMalloc(&d_c, N * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

    // 32 blocks x 16 threads = 512 threads, exactly covering N elements
    int n = N;
    void* args[] = {&d_a, &d_b, &d_c, &n};
    cuLaunchKernel(kernel, 32, 1, 1, 16, 1, 1, 0, NULL, args, NULL);

    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; ++i) {
        printf("c[%d] = %d\n", i, h_c[i]);  // every element should equal N
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cuModuleUnload(module);
    free(ptx);
    return 0;
}

Build this with the CUDA toolkit installed, linking against NVRTC and the driver API (for example, nvcc app.cpp -lnvrtc -lcuda).
Conclusion
The NVIDIA CUDA NVRTC CU11 library makes runtime kernel compilation practical, offering real flexibility for dynamic CUDA programming. By understanding these essential APIs and building small applications like the one in this guide, developers can generate, compile, and launch kernels on the fly in their CUDA-enabled applications.