
Nvidia CUDA Compiler

The NVIDIA CUDA Compiler Driver (nvcc) is the primary compiler used for building CUDA applications targeting NVIDIA GPUs. It is part of the CUDA Toolkit and is responsible for compiling GPU kernels and coordinating compilation of host-side code.

The compiler is based on LLVM infrastructure and internally splits the program into two components:

  • Host code – compiled with a standard CPU compiler such as GCC or the Intel C++ Compiler
  • Device code – compiled for the GPU architecture using the CUDA compilation pipeline

This design allows CUDA programs to integrate with standard C/C++ applications while offloading parallel workloads to GPUs.
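
To make this split concrete, the short sketch below mixes device and host code in a single .cu file: nvcc compiles the __global__ kernel through the device pipeline and hands main() to the host compiler. The file is illustrative only and is not taken from any toolkit sample.

CUDA host/device split sketch
#include <cstdio>

// Device code: compiled by the CUDA device pipeline for the GPU
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Host code: forwarded to the host C++ compiler (e.g. GCC)
int main(void) {
    const int n = 1024;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The <<<blocks, threads>>> launch syntax is translated by nvcc
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    printf("done\n");
    return 0;
}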


Available CUDA Versions

CUDA toolkits are available through the module system.

CUDA modules can be loaded directly:

module load CUDA/<version>

Example:

module load CUDA/12.0.0

To see all available CUDA modules:

module avail CUDA



Verifying CUDA Installation

After loading the CUDA module, verify that the CUDA compiler is available:

nvcc --version

Typical output includes the CUDA toolkit release version and the build of the CUDA compilation tools.

You may also verify GPU visibility on compute nodes using:

nvidia-smi

Compiling with CUDA

CUDA code can be compiled directly on the login nodes; users do not need GPU-accelerated compute nodes for the compilation. CUDA code is compiled with the NVCC compiler. A thorough description of the NVCC command-line options can be found on the NVIDIA website dedicated to the CUDA compiler.

nvcc --version
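
A typical compilation of a single source file might look as follows; the source and output names are placeholders, and the -arch=sm_80 flag (targeting the compute capability 8.0 of the A100) is an assumption that can be omitted to fall back to the compiler's default architecture:

nvcc -O2 -arch=sm_80 -o my_program my_program.cu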

The CUDA installation comes with a large set of examples that cover both elementary and intermediate compilation cases. To test the compilation, users should copy these examples to their home directory, change into the directory of the particular example, and run the make command to start the compilation.

mkdir nvcc-tests
cp -r /storage-apps/easybuild/software/CUDA/samples/ nvcc-tests/.

cd nvcc-tests/Samples/1_Utilities/deviceQuery
make

Running CUDA Programs

CUDA programs must be executed on GPU nodes. An interactive Slurm session can be started with:

srun --job-name=nvcc-tests --partition=gpu -G 1 --pty bash # run on the login node

ml CUDA # run on the GPU-accelerated node

./deviceQuery # run on the GPU-accelerated node

exit # run on the GPU-accelerated node

The expected output of the deviceQuery example executed on a node with an NVIDIA A100-SXM4-40GB GPU is:

Devana CUDA output
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A100-SXM4-40GB"
  CUDA Driver Version / Runtime Version          12.0 / 11.4
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40370 MBytes (42331013120 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 23 / 0
  Compute Mode:
   < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
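
The same workflow can also be wrapped in a non-interactive batch job. The script below is only a sketch that reuses the partition and GPU request from the interactive example above; the time limit is an assumed value and the script expects to be submitted with sbatch from the directory containing the compiled deviceQuery binary.

#!/bin/bash
#SBATCH --job-name=nvcc-tests
#SBATCH --partition=gpu
#SBATCH -G 1
#SBATCH --time=00:10:00

ml CUDA
./deviceQuery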

Code Example

The following is a CUDA-based code for the addition of two vectors. To test it, copy the code into cuda-vector-addition.cu and compile it with nvcc.

CUDA vector addition code
#define N (2048*2048)
#define THREADS_PER_BLOCK 512

#include <stdio.h>
#include <stdlib.h>

// Add vectors on GPU
__global__ void add_gpu(int *x, int *y, int *z, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        z[index] = x[index] + y[index];
}

// Add vectors on CPU
void add_cpu(int *x, int *y, int *z, int n) {
    for (int j = 0; j < n; j++)
        z[j] = x[j] + y[j];
}

// Generate vectors 
void random_ints(int *x, int n) {
    for (int j = 0; j < n; j++)
        x[j] = rand() % 10000; // random number between 0 and 9999
}

// Compare vectors on CPU 
int compare_ints(int *x, int *y, int n) {
    int pass = 0;
    for (int j = 0; j < n; j++) {
        if (x[j] != y[j]) {
            printf("Values differ at location %d, with values of %d and %d\n", j, x[j], y[j]);
            pass = 1;
        }
    }
    if (pass == 0)
        printf("Success\n");
    else
        printf("Fail\n");
    return pass;
}

int main(void) {

    int *x, *y, *z; // CPU copies x, y, z
    int *dev_x, *dev_y, *dev_z; // GPU copies x, y, z
    int size = N * sizeof(int); // Create N dimensional space for N integers

    // Allocate GPU copies of dev_x, dev_y, dev_z
    cudaMalloc((void**)&dev_x, size);
    cudaMalloc((void**)&dev_y, size);
    cudaMalloc((void**)&dev_z, size);

    // Allocate CPU copies of x, y, z
    x = (int*)malloc(size);
    y = (int*)malloc(size);
    z = (int*)malloc(size);

    // Create input vector with random integer numbers
    random_ints(x, N);
    random_ints(y, N);

    // Copy inputs to GPU
    cudaMemcpy(dev_x, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_y, y, size, cudaMemcpyHostToDevice);

    // Launch add_gpu() kernel with blocks and threads
    add_gpu<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_x, dev_y, dev_z, N);

    // Copy GPU result back to CPU copy of z
    cudaMemcpy(z, dev_z, size, cudaMemcpyDeviceToHost);

    // Compare with CPU results
    int *z_h;
    z_h = (int*)malloc(size);
    add_cpu(x, y, z_h, N);
    compare_ints(z, z_h, N);

    // Clean CPU memory
    free(x);
    free(y);
    free(z);
    free(z_h);

    // Clean GPU memory
    cudaFree(dev_x);
    cudaFree(dev_y);
    cudaFree(dev_z);

    return 0;
}
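
Note that N (2048*2048) is an exact multiple of THREADS_PER_BLOCK, so the integer division in the kernel launch covers every element; the index < n guard in add_gpu is what keeps the kernel safe in general. For a vector length that does not divide evenly, the block count should be rounded up, for example:

int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK; // ceiling division
add_gpu<<<blocks, THREADS_PER_BLOCK>>>(dev_x, dev_y, dev_z, N);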

To execute the code, first compile it on the login node and then open an interactive Slurm session on a GPU-accelerated node:

ml CUDA/12.0.0

nvcc cuda-vector-addition.cu -o cuda-vector-addition

srun --job-name=nvcc-tests --partition=gpu -G 1 --pty bash # run on the login node

Execute the program on the GPU-accelerated node:

ml CUDA/12.0.0 # run on the GPU-accelerated node

./cuda-vector-addition # run on the GPU-accelerated node

The expected output message is Success.

When developing CUDA applications on HPC systems:

  • Always compile using the CUDA module
  • Run GPU programs only on GPU partitions
  • Match CUDA version with driver compatibility
  • Use MPI + CUDA for multi-GPU applications (see the device-selection sketch after this list)
  • Test with small inputs before large-scale runs
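
For the MPI + CUDA point, a common pattern is to bind each MPI rank to one GPU before any other CUDA calls are made. The sketch below is an illustration only (the node-local communicator approach and the file name mpi_cuda.cu are assumptions; adapt it to the MPI stack available on the system):

MPI + CUDA device selection sketch
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Determine this rank's position on its node via a shared-memory communicator
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Map the local rank onto one of the GPUs visible on this node
    int num_devices;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

    printf("local rank %d is using GPU %d of %d\n",
           local_rank, local_rank % num_devices, num_devices);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

A file like this can be compiled by pointing nvcc at the MPI compiler wrapper, for example nvcc -ccbin=mpicxx mpi_cuda.cu -o mpi_cuda (assuming an MPI module is loaded alongside CUDA).
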
Created by: Marek Štekláč