Nvidia CUDA Compiler¶
The NVIDIA CUDA Compiler Driver (nvcc) is the primary compiler
used for building CUDA applications targeting NVIDIA GPUs. It is part
of the CUDA Toolkit and is responsible for compiling GPU kernels and
coordinating compilation of host-side code.
The compiler is based on LLVM infrastructure and internally splits the program into two components:
- Host code – compiled with a standard CPU compiler such as GCC or the Intel C++ Compiler
- Device code – compiled for the GPU architecture using the CUDA compilation pipeline
This design allows CUDA programs to integrate with standard C/C++ applications while offloading parallel workloads to GPUs.
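A minimal sketch of this split (the kernel and function names here are illustrative placeholders, not part of the toolkit):

```cuda
#include <cstdio>

// Device code: compiled by nvcc for the GPU architecture
__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= a;
}

// Host code: forwarded to the host compiler (e.g. GCC)
int main(void) {
    // ... allocate with cudaMalloc, copy with cudaMemcpy ...
    // scale<<<blocks, threads>>>(dev_v, 2.0f, n);
    return 0;
}
```

nvcc compiles the `__global__` kernel for the GPU and passes the rest of the file to the host compiler, which is why a single `.cu` source can mix both kinds of code.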
Available CUDA Versions¶
CUDA toolkits are available through the module system.
CUDA modules can be loaded directly with the module (ml) command; the module avail command lists all available CUDA modules.
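For example (the module versions shown are illustrative; the versions actually available depend on the system):

```shell
ml av CUDA        # list all available CUDA modules
ml CUDA/12.0.0    # load a specific CUDA toolkit version
```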
Verifying CUDA Installation¶
After loading the CUDA module, verify that the CUDA compiler is available:
nvcc --version
Typical output should include information about the CUDA toolkit version and the host compiler used by NVCC.
You may also verify GPU visibility on the compute nodes using:
nvidia-smi
Compiling with CUDA¶
CUDA code can be compiled directly on the login nodes; users do not need access to GPU-accelerated compute nodes for compilation. CUDA code is compiled with the NVCC compiler. A thorough description of the NVCC command-line options can be found in NVIDIA's documentation for the CUDA compiler.
NVCC ships with a large set of examples covering both elementary and
intermediate compilation cases. To test the compilation, copy these
examples to your home directory, change into the directory of a
particular example, and run the make command to start the compilation.
mkdir nvcc-tests
cp -r /storage-apps/easybuild/software/CUDA/samples/ nvcc-tests/.
cd nvcc-tests/Samples/1_Utilities/deviceQuery
make
Running CUDA Programs¶
CUDA programs must be executed on GPU nodes. An interactive Slurm session can be started with:
srun --job-name=nvcc-tests --partition=gpu -G 1 --pty bash # run on the login node
ml CUDA # run on the GPU-accelerated node
./deviceQuery # run on the GPU-accelerated node
exit # run on the GPU-accelerated node
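For non-interactive runs, the same steps can be submitted as a batch job. A minimal sketch, assuming the same gpu partition and module names as above:

```shell
#!/bin/bash
#SBATCH --job-name=nvcc-tests
#SBATCH --partition=gpu
#SBATCH -G 1

ml CUDA
./deviceQuery
```

Submit the script with sbatch; the output then appears in the Slurm output file instead of the terminal.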
The expected output of the deviceQuery example executed on a node with an NVIDIA A100-SXM4 40 GB GPU is:
Devana CUDA output
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA A100-SXM4-40GB"
CUDA Driver Version / Runtime Version 12.0 / 11.4
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 40370 MBytes (42331013120 bytes)
(108) Multiprocessors, (064) CUDA Cores/MP: 6912 CUDA Cores
GPU Max Clock rate: 1410 MHz (1.41 GHz)
Memory Clock rate: 1215 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 41943040 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 5 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 23 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
Code Example¶
The following is a CUDA-based code example that adds two vectors. To test it, copy the code into a file named cuda-vector-addition.cu and compile it with nvcc.
CUDA vector addition code
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
#include <stdio.h>
#include <stdlib.h>
// Add vectors on GPU
__global__ void add_gpu(int *x, int *y, int *z, int n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
z[index] = x[index] + y[index];
}
// Add vectors on CPU
void add_cpu(int *x, int *y, int *z, int n) {
for (int j = 0; j < n; j++)
z[j] = x[j] + y[j];
}
// Generate vectors
void random_ints(int *x, int n) {
for (int j = 0; j < n; j++)
x[j] = rand() % 10000; // random number between 0 and 9999
}
// Compare vectors on CPU
int compare_ints(int *x, int *y, int n) {
int pass = 0;
for (int j = 0; j < n; j++) {
if (x[j] != y[j]) {
printf("Value differ at location %d, with values of %d and %d\n", j, x[j], y[j]);
pass = 1;
}
}
if (pass == 0)
printf("Success\n");
else
printf("Fail\n");
return pass;
}
int main(void) {
int *x, *y, *z; // CPU copies x, y, z
int *dev_x, *dev_y, *dev_z; // GPU copies x, y, z
int size = N * sizeof(int); // Create N dimensional space for N integers
// Allocate GPU copies of dev_x, dev_y, dev_z
cudaMalloc((void**)&dev_x, size);
cudaMalloc((void**)&dev_y, size);
cudaMalloc((void**)&dev_z, size);
// Allocate CPU copies of x, y, z
x = (int*)malloc(size);
y = (int*)malloc(size);
z = (int*)malloc(size);
// Create input vector with random integer numbers
random_ints(x, N);
random_ints(y, N);
// Copy inputs to GPU
cudaMemcpy(dev_x, x, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_y, y, size, cudaMemcpyHostToDevice);
// Launch add_gpu() kernel with blocks and threads
add_gpu<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_x, dev_y, dev_z, N);
// Copy GPU result back to CPU copy of z
cudaMemcpy(z, dev_z, size, cudaMemcpyDeviceToHost);
// Compare with CPU results
int *z_h;
z_h = (int*)malloc(size);
add_cpu(x, y, z_h, N);
compare_ints(z, z_h, N);
// Clean CPU memory
free(x);
free(y);
free(z);
free(z_h);
// Clean GPU memory
cudaFree(dev_x);
cudaFree(dev_y);
cudaFree(dev_z);
return 0;
}
To execute the code, compile it on the login node and then open an interactive Slurm session on a GPU-accelerated node:
ml CUDA/12.0.0
nvcc cuda-vector-addition.cu -o cuda-vector-addition
srun --job-name=nvcc-tests --partition=gpu -G 1 --pty bash # run on the login node
Then execute on the GPU-accelerated node:
ml CUDA/12.0.0 # run on the GPU-accelerated node
./cuda-vector-addition # run on the GPU-accelerated node
The expected output message is Success.
Recommended CUDA Development Practices¶
When developing CUDA applications on HPC systems:
- Always compile using the CUDA module
- Run GPU programs only on GPU partitions
- Match CUDA version with driver compatibility
- Use MPI + CUDA for multi-GPU applications
- Test with small inputs before large-scale runs
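When matching the CUDA version to the hardware, it can also help to compile for the specific GPU architecture. For the A100 shown above (compute capability 8.0), that would be, for example:

```shell
nvcc -arch=sm_80 cuda-vector-addition.cu -o cuda-vector-addition
```

Targeting the actual compute capability avoids JIT recompilation at startup and lets the compiler use architecture-specific features.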