Your First Script¶
On HPC clusters, you rarely run commands interactively due to shared resources and queueing policies. Instead, you prepare a batch script to describe your job's requirements and execution steps. This script is submitted to the scheduler (Slurm) which runs it when resources become available.
The most common way to submit a batch job to the scheduler is with the sbatch command. In this example we are submitting a script called my_script.sh:
sbatch my_script.sh
SLURM will try to find suitable resources for the job as defined in the file and then launch the execution on the selected nodes.
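Once the job has been submitted, you can check whether it is still waiting in the queue or already running with the standard squeue command, for example:
squeue -u $USER    # list your pending and running jobs and the nodes assigned to them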
Creating Batch Scripts¶
Job Builder
You can create your own batch script using our interactive Job Builder!
A batch script is typically organized into the following sections:
- interpreter to use for the execution of the script: bash, python, ...
- SLURM directives that define the job options: resources, run time, partitions, ...
- preparation steps: setting up the environment, loading modules, preparing input files, ...
- job execution: running the application(s) with the appropriate command
- epilogue: post-processing and cleaning up the data, ...
As an example, let's look at this simple batch job script:
#!/bin/bash
#SBATCH -J "slurm test" # Job name
#SBATCH -N 1 # Request 1 node
#SBATCH --ntasks-per-node=16 # Run 16 tasks (processes) on that node
#SBATCH -o test.%J.out # File to write standard output (%J = job ID)
#SBATCH -e test.%J.err # File to write standard error
module load intel # Load required environment (compiler, MPI, etc.)
mpirun /bin/hostname # Run hostname command 16 times (once per task)
exit
This script defines only minimal job requirements, such as the number of nodes and tasks and the standard output and error files, and then runs the hostname command (which displays the server name) in parallel.
sbatch my_script.sh
Submitted batch job 38678
ls -ltr
total 8
-rw-rw-r-- 1 user user 198 Sep 21 14:08 my_script.sh
-rw-rw-r-- 1 user user 0 Sep 21 14:08 test.38678.err
-rw-rw-r-- 1 user user 80 Sep 21 14:08 test.38678.out
cat test.38678.out
n079
n079
n079
n079
n079
n079
n079
n079
n079
n079
n079
n079
n079
n079
n079
n079
As you can see, the job's ID was 38678 (displayed right after job submission). As expected, this number was also used in the STDERR and STDOUT file names. Since we requested 16 tasks on one node, the output file contains 16 lines of hostname output (and you can see the script was actually executed on node n079).
Obviously, this is just a demonstration script; if you want to use it for a real HPC application, you will have to modify it accordingly. You can find more examples for specific applications in the Software section of the userdocs portal.
Typical Batch Job Types¶
Serial job:
#!/bin/bash
#SBATCH -J serial_job # Job name
#SBATCH -N 1 # Use 1 node
#SBATCH -n 1 # Run 1 task (single-core)
#SBATCH -t 00:05:00 # Maximum runtime: 5 minutes
#SBATCH -o serial.out # Output file
#SBATCH -e serial.err # Error file
./my_serial_app
OpenMP job:
#!/bin/bash
#SBATCH -J openmp_job # Job name
#SBATCH -N 1 # 1 node
#SBATCH -n 1 # 1 task (OpenMP uses threads, not tasks)
#SBATCH -c 8 # Request 8 CPUs per task (i.e., 8 threads)
#SBATCH -t 00:10:00 # 10-minute wall time
#SBATCH -o openmp.out # Output file
#SBATCH -e openmp.err # Error file
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_openmp_app
MPI job:
#!/bin/bash
#SBATCH -J mpi_job # Job name
#SBATCH -N 2 # Use 2 nodes
#SBATCH --ntasks-per-node=8 # 8 tasks (processes) per node (total: 16)
#SBATCH -t 00:15:00 # 15-minute time limit
#SBATCH -o mpi.out # Output file
#SBATCH -e mpi.err # Error file
module load openmpi # Load OpenMPI environment
mpirun ./my_mpi_app
Hybrid MPI + OpenMP job:
#!/bin/bash
#SBATCH -J hybrid_job # Job name
#SBATCH -N 2 # 2 nodes
#SBATCH --ntasks-per-node=4 # 4 MPI tasks per node
#SBATCH -c 4 # 4 CPUs (threads) per task
#SBATCH -t 00:30:00 # 30-minute limit
#SBATCH -o hybrid.out # Output file
#SBATCH -e hybrid.err # Error file
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} # OpenMP thread count
mpirun ./my_hybrid_app
GPU job:
#!/bin/bash
#SBATCH -J gpu_job # Job name
#SBATCH -N 1 # 1 node
#SBATCH --gres=gpu:1 # Request 1 GPU
#SBATCH -t 00:20:00 # 20-minute limit
#SBATCH -o gpu.out # Output file
#SBATCH -e gpu.err # Error file
module load cuda # Load CUDA environment
./my_gpu_program
Environment Setup Tips¶
Loading Modules¶
module purge # Clean module environment
module load gcc/12.1.0 openmpi/4.1.4
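If you are unsure which modules are available on the cluster or which ones are currently loaded, the standard module commands can help (the module name below is only an example):
module avail openmpi    # search for available OpenMPI builds
module list             # show modules loaded in the current environment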
Using Conda¶
source activate myenv
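Depending on how Conda is provided on the cluster, you may first need to load it as a module; the module name and environment name below are placeholders only:
module load anaconda3    # hypothetical module name; check module avail for the real one
source activate myenv    # activate your Conda environment
python my_script.py      # run your Python workload inside the environment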
Using Containers¶
singularity exec my_container.sif ./my_program
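If the containerized application needs a GPU or access to data outside the container, Singularity's standard flags can be added; the paths below are placeholders:
singularity exec --nv my_container.sif ./my_gpu_program                    # --nv exposes the NVIDIA GPU and driver inside the container
singularity exec -B /projects/mydata:/data my_container.sif ./my_program   # -B bind-mounts a host directory into the container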
Debugging failed jobs¶
- Check the .err file for errors.
- Use sacct -j <jobid> or scontrol show job <jobid> to inspect job status (see the example after this list).
- Use email notifications to stay updated:
#SBATCH --mail-user=your@email.com
#SBATCH --mail-type=END,FAIL
- Submit short test jobs while debugging
#SBATCH -t 00:01:00
- Try it interactively first. Interactive jobs are great for testing or debugging
salloc -N 1 -n 4 --time=00:10:00
# Then inside the allocation:
mpirun ./my_program
- If the job terminated with a non-zero exit code, it can be beneficial to add the -x flag to the script header to enable higher verbosity:
#!/bin/bash -x
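For example, once a job has finished, sacct can show its final state and exit code; the job ID and the selected output fields below are only illustrative:
sacct -j 38678 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS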
Job Submission¶
At a minimum, a job submission script must include the number of nodes, the time limit, and the type of partition and nodes (resource allocation constraints and features). If a script does not specify any of these options, a default may be applied.
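As a minimal sketch, such a script could look as follows; the partition name and node feature are placeholders and must be replaced with values valid on your cluster (sinfo lists the available partitions):
#!/bin/bash
#SBATCH -J minimal_job             # Job name
#SBATCH -N 1                       # Number of nodes
#SBATCH -t 01:00:00                # Time limit (1 hour)
#SBATCH -p short                   # Partition name (placeholder)
#SBATCH --constraint=somefeature   # Node constraint/feature (placeholder)
./my_app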
Task distribution options

| Option | Description |
|---|---|
| `-a, --array=<index>` | Job array specification (sbatch only) |
| `--cpu-bind=<type>` | Bind tasks to specific CPUs (srun only) |
| `-c, --cpus-per-task=<count>` | Number of CPUs required per task |
| `--gpus-per-task=<list>` | Number of GPUs required per task |
| `--mem=<size>[units]` | Memory required per allocated node (e.g., 16GB) |
| `--mem-per-cpu=<size>[units]` | Memory required per allocated CPU (e.g., 2GB) |
| `-N, --nodes=<count>` | Number of nodes to be allocated to the job |
| `-n, --ntasks=<count>` | Maximum number of tasks (MPI ranks) to be launched |
| `--ntasks-per-node=<count>` | Number of tasks to be launched per node |
Within a job, you aim at running a certain number of tasks, and Slurm allows fine-grained control of the resource allocation that must be satisfied for each task.
Beware of Slurm terminology in Multicore Architecture!
- Slurm Node = Physical node, specified with -N <#nodes>
    - Always add the expected number of tasks per node explicitly using --ntasks-per-node <n>. This way you control the node footprint of your job.
- Slurm CPU = Physical CORE
    - Always use -c <threads> or --cpus-per-task <threads> to specify the number of CPUs reserved per task.
    - Hyper-Threading (HT) Technology is disabled on the Devana cluster.
- Assume cores = threads; thus, when using -c <threads>, you can safely set OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} to automatically abstract from the job context.
The total number of tasks defined in a given job is stored in the $SLURM_NTASKS environment variable.
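A quick way to see what a particular allocation provides is to print these variables from inside the batch script, for example (note that SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested, hence the fallback to 1):
echo "Job ${SLURM_JOB_ID}: ${SLURM_NTASKS} tasks, ${SLURM_CPUS_PER_TASK:-1} CPUs per task, nodes: ${SLURM_JOB_NODELIST}"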
The --cpus-per-task option of srun in Slurm 23.11 and later
In the latest versions of Slurm, srun inherits the --cpus-per-task value requested by salloc or sbatch by reading the value of SLURM_CPUS_PER_TASK, as for any other option. This behavior may differ from some older versions, where special handling was required to propagate the --cpus-per-task option to srun.
In case you would like to launch multiple programs in a single allocation/batch script, divide the resources accordingly by requesting them with srun when launching each program:
srun --cpus-per-task <some of the SLURM_CPUS_PER_TASK> --ntasks <some of the SLURM_NTASKS> [...] <program>
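A minimal sketch of this pattern, assuming an allocation of 8 tasks with 2 CPUs each and two placeholder programs; depending on the Slurm version, the srun option --exact may also be needed so that the concurrent steps do not wait for each other:
#!/bin/bash
#SBATCH -N 1                 # 1 node
#SBATCH --ntasks=8           # 8 tasks in total
#SBATCH --cpus-per-task=2    # 2 CPUs per task
#SBATCH -t 00:10:00          # 10-minute limit
srun --ntasks=4 --cpus-per-task=2 ./program_a &   # first job step, runs in the background
srun --ntasks=4 --cpus-per-task=2 ./program_b &   # second job step, runs in the background
wait                                              # wait for both steps to finish before the job ends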
Basic accounting and scheduling options

| Option | Description |
|---|---|
| `-A, --account=<account>` | Charge resources used by this job to the specified user project |
| `-e, --error=<filename>` | File in which to store job error messages (sbatch and srun only) |
| `--exclusive` | Reserve all CPUs and GPUs on allocated nodes |
| `-J, --job-name=<name>` | Job name |
| `--mail-user=<address>` | E-mail address |
| `-o, --output=<filename>` | File in which to store job output (sbatch and srun only) |
| `-p, --partition=<names>` | Partition in which to run the job |
| `-t, --time=<time>` | Limit for job run time |
A full list of options handled by sbatch can be found with the man sbatch command or by visiting the Slurm documentation on sbatch.
Common Tips
- Always request the minimum resources and time needed.
- Use module purge to prevent environment conflicts.
- Use SLURM environment variables: $SLURM_NTASKS, $SLURM_CPUS_PER_TASK.
- Define OMP_NUM_THREADS properly if using OpenMP.
- Check man sbatch or visit the official Slurm documentation.