Job monitoring

Monitoring jobs in Slurm is essential for understanding resource usage and troubleshooting job performance.

Running Job Statistics Metrics

The sstat command allows users to retrieve status information about currently running jobs, including details on CPU usage, task information, node information, memory usage (RSS), and virtual memory (VM).

To check job statistics, use:

sstat --jobs=<jobid>

Showing information about running job

sstat --jobs=<jobid>
  JobID         MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
  ------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- --------------- ------------------- ------------------- --------------- ------------------- ------------------- ---------------
  152295.0          2884M           n143              0   2947336K    253704K       n143          0    253704K       11         n143              0         11   00:06:04       n143          0   00:06:04        1     10.35M       Unknown       Unknown       Unknown              0     29006427            n143               0     29006427     11096661             n143                0     11096661 cpu=00:06:04,+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ energy=0,fs/di+ energy=0,fs/di+ energy=n143,fs/dis+           fs/disk=0 energy=0,fs/di+ energy=n143,fs/dis+           fs/disk=0 energy=0,fs/di+

By default, sstat provides extensive output. To customize the displayed metrics, use the --format flag. An example using some of these variables is shown below:

Showing formatted information about running job

sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 152295
  JobID          NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
  ------------ -------- -------------------- ---------- ---------- ---------- ----------
  152295.0            1                 n143 183574492K 247315988K    118664K    696216K

If you do not run any srun commands, you will not create any job steps and metrics will not be available for your job. Your batch scripts should follow this format:

#!/bin/bash
#SBATCH ...
#SBATCH ...
# set environment up
module load ...

# launch job steps
srun <command to run> # first job step (<jobid>.0)
srun <command to run> # second job step (<jobid>.1)
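
A filled-in sketch of such a script is shown below; the job name, module, program, and resource values are only placeholders for illustration, not site-specific recommendations:

#!/bin/bash
#SBATCH --job-name=myjob        # hypothetical job name
#SBATCH --ntasks=4              # hypothetical number of tasks
#SBATCH --time=01:00:00         # hypothetical walltime limit

# set environment up
module load foss                # placeholder module; load whatever your application needs

# launch job steps
srun ./my_app input1.dat        # creates job step <jobid>.0
srun ./my_app input2.dat        # creates job step <jobid>.1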

The main metrics you may be interested in reviewing are listed below.

Variable    Description
avecpu      Average CPU time of all tasks in the job.
averss      Average resident set size of all tasks in the job.
avevmsize   Average virtual memory size of all tasks in the job.
jobid       The ID of the job.
maxrss      Maximum resident set size of all tasks in the job.
maxvmsize   Maximum virtual memory size of all tasks in the job.
ntasks      Number of tasks in the job.

See the sstat documentation for more information or type sstat --help.

Monitoring CPU Usage of Running Jobs

When a job is submitted, its job number is returned immediately. For example:

sbatch my_script.sh 
  Submitted batch job 38678

In the example above, "38678" is our job number. If you miss recording that number, there are other ways you can find it. One is to use "squeue":

squeue -u user1
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  38678 long          myjob.sh user1 R        3:35      2 n[007-008]

In the example above, we specify the user filter -u to limit the returned jobs to only those that we own. Here we can again see the JOBID 38678. We can also see the column "NODELIST", which shows that we are running on nodes n007 and n008. Knowing which nodes our job is using is essential for monitoring job performance.
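
If you only need the node list for a specific job, squeue can print it directly. A minimal sketch, using the job ID from the example above:

squeue --job 38678 --noheader --format="%N"
  n[007-008]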

Once you know what node(s) your job is running on, you may access it directly using SSH. For example, if your job is running on n007:

ssh n007

The most direct way to monitor a running job is to log in to the respective node using SSH and use various tools to examine key performance indicators, such as load, memory usage, and threading behavior.

Useful Command Line Tools

Many commands can be accessed directly from the shell, and some others are provided as module file installations. Some commonly used commands include:

  • top - shows a list of running processes, with summary of CPU utilization, memory usage, etc.
  • free - shows memory utilization. Add a flag like "-g" to show memory usage in gigabytes.
  • vmstat - display memory usage statistics
  • lsof - "list open files" is useful for showing open files that are being read and/or written to
  • uptime - shows system loads for 1, 5, and 15 minute averages, respectively
  • ps -e - shows actively running processes
  • pstree - shows a schematic process tree

Each of the commands above may be issued as-is, or with various modifying flags. Other commands like "grep" may be useful for filtering command output to isolate items of interest. The ways these commands may be combined are virtually limitless, and this document cannot hope to cover them all. You can review the options for any of them with "man <command>". A few examples that may prove helpful are presented below.
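
For instance, the output of "ps" can be piped through "grep" to isolate your own processes, or sorted to show the heaviest CPU consumers first (the user name below is only an illustration):

ps -eo pid,user,pcpu,pmem,etime,comm | grep user1
ps -eo pid,user,pcpu,pmem,etime,comm --sort=-pcpu | head -n 10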

Uptime command and node load

The uptime command is useful for determining the "load" on a compute node. Devana CPU and GPU nodes have 64 processing cores each. If the processes running on a node match the capacity of one processing core, the load is "1" (or 100%). Therefore, if the processes on a node are fully utilizing all processing cores, we would expect a maximum load of about "64".

ssh n007
uptime
  13:56:51 up 150 days,  22:48,  1 user,  load average: 64.09, 63.72, 63.49

In the example above, the load averages for 1, 5, and 15 minutes are all right around "64", so this node is fully utilized.
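
To know what a "fully utilized" load looks like on a given node, you can check its core count with "nproc" and compare it against the load averages:

ssh n007
nproc
  64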

Top command and CPU usage

The top command is particularly useful when trying to determine how many processors your jobs are using. For instance, suppose you see multiple jobs running on a node (including yours) and your check of the load indicates that the node is being over-utilized. How could you tell whether it was your process or someone else's? The "top" command helps break down CPU utilization by process:

ssh n007
top
  top - 14:01:15 up 150 days, 22:43,  1 user,  load average: 64.10, 63.75, 63.52
  Tasks: 716 total,  65 running, 651 sleeping,   0 stopped,   0 zombie
  %Cpu(s): 99.1 us,  0.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem : 26379443+total, 21490476+free, 36416432 used, 12473228 buff/cache
  KiB Swap:        0 total,        0 free,        0 used. 21900358+avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   2080 user1     20   0   10.2g 427328  85664 R 100.0  0.2  11:55.89 vasp_std
   2081 user1     20   0   10.2g 404920  64316 R 100.0  0.2  11:55.70 vasp_std
   2082 user1     20   0   10.2g 390936  57336 R 100.0  0.1  11:56.08 vasp_std
   2083 user1     20   0   10.2g 415048  78996 R 100.0  0.2  11:55.64 vasp_std
    ...
   2144 user1     20   0   10.2g 388968  84112 R  99.0  0.1  13:19.48 vasp_std
    365 root      20   0       0      0      0 S   0.3  0.0   0:07.53 kworker/8:1
    373 root      20   0       0      0      0 S   0.3  0.0   0:07.10 kworker/5:1
   2718 root      20   0  173692   3012   1624 R   0.3  0.0   0:00.74 top
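
If you only want to see your own processes, top can be restricted to a single user (again, the user name is only an illustration):

top -u user1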

Restricted SSH access

Please note that you can directly access only the nodes where your application is running. When the job finishes, your SSH connection to the node is terminated as well.

Past Job Statistics Metrics

sacct - job accounting information

The sacct command can be used to display status information about a user's historical jobs, selected by user name and/or Slurm job ID. By default, sacct only shows information about the user's jobs from the current day.

sacct --jobs=<jobid> [--format=metric1,...]

Showing information about completed job

 sacct --jobs=<jobid>
   JobID            JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------- ---------- ---------- ---------- ---------- ---------- --------
   <jobid>             name      short  <project>        512    TIMEOUT      0:0
   <jobid>.batch      batch             <project>         64  CANCELLED     0:15
   <jobid>.exte+     extern             <project>        512  COMPLETED      0:0
   <jobid>.0     hydra_bst+             <project>        512     FAILED      5:0

Use -X to aggregate the statistics relevant to the job allocation itself, not taking job steps into consideration.

Showing aggregated information about completed job

 sacct -X --jobs=<jobid>
   JobID            JobName  Partition    Account  AllocCPUS      State ExitCode
   ------------- ---------- ---------- ---------- ---------- ---------- --------
   <jobid>             name      short  <project>        512    TIMEOUT      0:0

With the --starttime (-S) flag, the command looks further back, starting from the given date. This can also be combined with --endtime (-E) to limit the query:

sacct [-X] -u <user>  [-S YYYY-MM-DD] [-E YYYY-MM-DD] [--format=metric1,...] # Specify user and start and end dates
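
For instance, a query restricted to one user and a date range might look like this (the user name and dates are placeholders):

sacct -X -u user1 -S 2024-03-01 -E 2024-03-31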

The --format flag can be used to choose the command output (full list of variables can be found with the --helpformat flag):

sacct [-X] -A <account> [--format=metric1,...] # Display information about account jobs

sacct format variable names
Variable      Description
Account       The account the job ran under.
AveCPU        Average (system + user) CPU time of all tasks in the job.
AveRSS        Average resident set size of all tasks in the job.
AveVMSize     Average virtual memory size of all tasks in the job.
CPUTime       Formatted (Elapsed time * CPU count) used by a job or step.
Elapsed       The job's elapsed time, formatted as DD-HH:MM:SS.
ExitCode      The exit code returned by the job script or salloc.
JobID         The ID of the job.
JobName       The name of the job.
MaxRSS        Maximum resident set size of all tasks in the job.
MaxVMSize     Maximum virtual memory size of all tasks in the job.
MaxDiskRead   Maximum number of bytes read by all tasks in the job.
MaxDiskWrite  Maximum number of bytes written by all tasks in the job.
ReqCPUS       Requested number of CPUs.
ReqMem        Requested amount of memory.
ReqNodes      Requested number of nodes.
NCPUS         The number of CPUs used in the job.
NNodes        The number of nodes used in the job.
User          The user name of the person who ran the job.

A full list of variables handled by sacct can be found with the --helpformat flag, by visiting the sacct documentation, or by typing sacct --help.
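
As an illustration, a few of the variables from the table above can be combined into a compact summary of a finished job (the job ID is a placeholder, as elsewhere in this document):

sacct -j <jobid> --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,ReqMem,ReqCPUS,State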

seff - job accounting information

This command can be used to obtain a job efficiency report for jobs that have completed and exited the queue. If you run it while the job is still in the R (Running) state, it may report incorrect information.

The seff utility helps you track CPU and memory efficiency. The command is invoked as:

seff <jobid>

Jobs with different CPU/Memory efficiency
seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 32
  CPU Utilized: 41-01:38:14
  CPU Efficiency: 99.64% of 41-05:09:44 core-walltime
  Job Wall-clock time: 1-11:19:38
  Memory Utilized: 2.73 GB
  Memory Efficiency: 2.13% of 128.00 GB
seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 16
  CPU Utilized: 14:24:49
  CPU Efficiency: 23.72% of 2-12:46:24 core-walltime
  Job Wall-clock time: 03:47:54
  Memory Utilized: 193.04 GB
  Memory Efficiency: 75.41% of 256.00 GB
seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 64
  CPU Utilized: 87-16:58:22
  CPU Efficiency: 86.58% of 101-07:16:16 core-walltime
  Job Wall-clock time: 1-13:59:19
  Memory Utilized: 212.39 GB
  Memory Efficiency: 82.96% of 256.00 GB

The following example illustrates a very inefficient job in terms of CPU and memory usage (below 4% CPU efficiency): the user essentially wasted about 4 hours of computation while occupying a full node and its 64 cores.

seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 64
  CPU Utilized: 00:08:33
  CPU Efficiency: 3.55% of 04:00:48 core-walltime
  Job Wall-clock time: 00:08:36
  Memory Utilized: 55.84 MB
  Memory Efficiency: 0.05% of 112.00 GB

seff <jobid>
  Job ID: <jobid>
  User/Group: user1/group1
  State: COMPLETED (exit code 0)
  Nodes: 1
  Cores per node: 64
  CPU Utilized: 34-17:07:26
  CPU Efficiency: 95.80% of 36-05:41:20 core-walltime
  Job Wall-clock time: 13:35:20
  Memory Utilized: 5.18 GB
  Memory Efficiency: 0.00% of 0.00 MB
  Nvidia SXM A100 40GB #1:
    GPU Efficiency: 36.90%
    Memory Utilized: 0.00 GB (0.00%)
  Nvidia SXM A100 40GB #2:
    GPU Efficiency: 36.82%
    Memory Utilized: 0.00 GB (0.00%)
  Nvidia SXM A100 40GB #3:
    GPU Efficiency: 36.92%
    Memory Utilized: 0.00 GB (0.00%)
  Nvidia SXM A100 40GB #4:
    GPU Efficiency: 36.74%
    Memory Utilized: 0.00 GB (0.00%)

Created by: Marek Štekláč