Comprehensive Guide to NVIDIA CUDA NVRTC CU11 APIs and Practical Examples

Introduction to NVIDIA CUDA NVRTC CU11

The NVIDIA CUDA NVRTC CU11 library provides a runtime compilation toolkit that allows developers to dynamically compile CUDA C++ device code. It enables flexibility in scenarios where kernels need to be created, compiled, and executed at runtime. NVRTC, or NVIDIA Runtime Compilation, is part of the CUDA toolkit and is designed to be lightweight and efficient.

In this blog post, we’ll explore the core features of NVRTC CU11, walk through its key API methods with usage examples, and build a practical app demonstrating how to leverage this powerful library for runtime kernel compilation. Let’s dive in!

Key APIs and Their Usage

1. nvrtcCreateProgram

Creates an NVRTC program object.

  #include <nvrtc.h>
  nvrtcProgram prog;
  const char* program_source = 
    "__global__ void add(int* a, int* b, int* c) {"
    "  c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];"
    "}";
  nvrtcCreateProgram(&prog, program_source, "add_kernel.cu", 0, NULL, NULL);

This initializes a program object for the given kernel source code.
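The last three arguments let you supply headers from memory so the kernel source can #include them. Here is a minimal sketch of that, reusing prog and program_source from above; the header name my_utils.h and its contents are hypothetical:

  // Hypothetical in-memory header; the kernel source could then
  // contain #include "my_utils.h".
  const char* header_src = "__device__ int square(int x) { return x * x; }";
  const char* headers[] = {header_src};
  const char* include_names[] = {"my_utils.h"};
  nvrtcCreateProgram(&prog, program_source, "add_kernel.cu",
                     1, headers, include_names);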

2. nvrtcCompileProgram

Compiles the CUDA program.

  const char* options[] = {"--gpu-architecture=compute_75"};
  nvrtcResult compile_result = nvrtcCompileProgram(prog, 1, options);

  if (compile_result != NVRTC_SUCCESS) {
    // Query the log size first, then allocate a buffer for the log text.
    size_t log_size;
    nvrtcGetProgramLogSize(prog, &log_size);
    char* log = (char*)malloc(log_size);
    nvrtcGetProgramLog(prog, log);
    printf("Compilation failed:\n%s\n", log);
    free(log);
  }

Pass compiler options, such as the target GPU architecture, to control how the kernel is compiled.
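Several options can be passed at once. A sketch using flags documented for NVRTC (adjust the architecture to match your GPU):

  const char* opts[] = {
    "--gpu-architecture=compute_75",  // target virtual architecture
    "--std=c++14",                    // C++ dialect for the device code
    "--fmad=true"                     // permit fused multiply-add contraction
  };
  nvrtcResult result = nvrtcCompileProgram(prog, 3, opts);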

3. nvrtcGetPTXSize

Retrieves the size of the compiled PTX (Parallel Thread Execution) code.

  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  printf("PTX Size: %ld bytes\n", ptx_size);

4. nvrtcGetPTX

Retrieves the compiled PTX code.

  char* ptx = (char*)malloc(ptx_size);
  nvrtcGetPTX(prog, ptx);
  printf("Compiled PTX code:\n%s\n", ptx);
  free(ptx);

5. nvrtcDestroyProgram

Destroys the previously created program object.

  nvrtcDestroyProgram(&prog);

Call this after retrieving the PTX to free the memory associated with the program object.
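Before moving on: in real code, every NVRTC call returns an nvrtcResult that should be checked. One common pattern (a helper macro of our own, not part of NVRTC itself) builds on nvrtcGetErrorString:

  #include <stdio.h>
  #include <stdlib.h>
  #include <nvrtc.h>

  // Aborts with a readable message if an NVRTC call fails.
  #define NVRTC_CHECK(call)                                        \
    do {                                                           \
      nvrtcResult res_ = (call);                                   \
      if (res_ != NVRTC_SUCCESS) {                                 \
        fprintf(stderr, "NVRTC error: %s (%s:%d)\n",               \
                nvrtcGetErrorString(res_), __FILE__, __LINE__);    \
        exit(EXIT_FAILURE);                                        \
      }                                                            \
    } while (0)

  // Usage: NVRTC_CHECK(nvrtcCompileProgram(prog, 1, options));

The full application below omits these checks for brevity.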

Building a Practical Application

Here, we’ll create a simple app that dynamically compiles and runs a CUDA kernel at runtime using NVRTC CU11. The application demonstrates vector addition.

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda.h>          // driver API: cuModuleLoadDataEx, cuLaunchKernel
  #include <cuda_runtime.h>  // runtime API: cudaMalloc, cudaMemcpy
  #include <nvrtc.h>

  // extern "C" disables C++ name mangling so the kernel can be looked up
  // by name with cuModuleGetFunction below.
  const char* source_code =
    "extern \"C\" __global__ void vectorAdd(const int* a, const int* b, int* c, int n) {"
    "  int idx = blockIdx.x * blockDim.x + threadIdx.x;"
    "  if (idx < n) {"
    "    c[idx] = a[idx] + b[idx];"
    "  }"
    "}";

  int main() {
    // Initialize the driver API and force creation of the primary context,
    // which the driver-API calls below require.
    cuInit(0);
    cudaSetDevice(0);
    cudaFree(0);

    // Compile the CUDA kernel with NVRTC
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source_code, "vectorAdd.cu", 0, NULL, NULL);

    const char* options[] = {"--gpu-architecture=compute_75"};
    nvrtcCompileProgram(prog, 1, options);

    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    char* ptx = (char*)malloc(ptx_size);
    nvrtcGetPTX(prog, ptx);

    nvrtcDestroyProgram(&prog);

    // Load the compiled PTX and execute the kernel
    CUmodule module;
    CUfunction kernel;
    cuModuleLoadDataEx(&module, ptx, 0, 0, 0);
    cuModuleGetFunction(&kernel, module, "vectorAdd");

    const int N = 512;
    int h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; ++i) {
      h_a[i] = i;
      h_b[i] = N - i;
    }

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMalloc(&d_b, N * sizeof(int));
    cudaMalloc(&d_c, N * sizeof(int));

    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Kernel parameters are passed by address; &N needs a cast since N is const.
    void* args[] = {&d_a, &d_b, &d_c, (void*)&N};
    // 32 blocks x 16 threads = 512 threads, one per vector element.
    cuLaunchKernel(kernel, 32, 1, 1, 16, 1, 1, 0, NULL, args, NULL);

    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 10; ++i) {
      printf("c[%d] = %d\n", i, h_c[i]);
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cuModuleUnload(module);
    free(ptx);
    return 0;
  }
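Assuming a standard CUDA toolkit installation, the program can be built with nvcc by linking against the NVRTC and driver libraries (the source file name here is just an example):

  nvcc vector_add_nvrtc.cpp -o vector_add_nvrtc -lnvrtc -lcuda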

Conclusion

The NVIDIA CUDA NVRTC CU11 toolkit is a powerful tool that enables runtime kernel compilation, offering flexibility and performance for dynamic CUDA programming. By understanding these essential APIs and implementing practical applications like the one in this guide, developers can add runtime code generation and kernel specialization to their CUDA-enabled applications.
