Job monitoring¶
Monitoring jobs in Slurm is essential for understanding resource usage and troubleshooting job performance.
Running Job Statistics Metrics¶
The sstat command allows users to retrieve status information about currently running jobs, including details on CPU usage, task information, node information, memory usage (RSS), and virtual memory (VM).
To check job statistics, use:
sstat --jobs=<jobid>
Showing information about a running job
sstat --jobs=<jobid>
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- --------------- ------------------- ------------------- --------------- ------------------- ------------------- ---------------
152295.0 2884M n143 0 2947336K 253704K n143 0 253704K 11 n143 0 11 00:06:04 n143 0 00:06:04 1 10.35M Unknown Unknown Unknown 0 29006427 n143 0 29006427 11096661 n143 0 11096661 cpu=00:06:04,+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ cpu=n143,energy=n+ cpu=00:00:00,fs/d+ cpu=00:06:04,+ energy=0,fs/di+ energy=0,fs/di+ energy=n143,fs/dis+ fs/disk=0 energy=0,fs/di+ energy=n143,fs/dis+ fs/disk=0 energy=0,fs/di+
By default, sstat provides extensive output. To customize the displayed metrics, use the --format flag. An example using some of these variables is shown below:
Showing formatted information about a running job
sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 152295
JobID NTasks Nodelist MaxRSS MaxVMSize AveRSS AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
152295.0 1 n143 183574492K 247315988K 118664K 696216K
If you do not run any srun commands, you will not create any job steps and metrics will not be available for your job. Your batch scripts should follow this format:
#!/bin/bash
#SBATCH ...
#SBATCH ...
# set environment up
module load ...
# launch job steps
srun <command to run> # that would be step 1
srun <command to run> # that would be step 2
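For illustration only, a filled-in version might look like the sketch below; the job name, resource requests, module, and executables are placeholders rather than actual Devana settings:
#!/bin/bash
#SBATCH --job-name=myjob      # placeholder job name
#SBATCH --ntasks=64           # placeholder resource request
#SBATCH --time=01:00:00       # placeholder walltime

# set environment up
module load mymodule          # placeholder module name

# launch job steps
srun ./preprocess input.dat   # first job step (<jobid>.0), hypothetical executable
srun ./solve input.dat        # second job step (<jobid>.1), hypothetical executable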
The main metrics you may be interested in reviewing are listed below.
Variable | Description |
---|---|
avecpu | Average CPU time of all tasks in job. |
averss | Average resident set size of all tasks. |
avevmsize | Average virtual memory size of all tasks in a job. |
jobid | The id of the Job. |
maxrss | Maximum resident set size of all tasks in job. |
maxvmsize | Maximum Virtual Memory size of all tasks in job. |
ntasks | Number of tasks in a job. |
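For example, you could query just those averaged metrics for a running job (the job ID below is a placeholder):
sstat --jobs=<jobid> --format=JobID,AveCPU,AveRSS,AveVMSize,NTasks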
See the sstat documentation for more information or type sstat --help.
Monitoring CPU Usage of Running Jobs¶
Once a job is launched you will be provided with a job number. This is immediately returned after submitting the job. For example:
sbatch my_script.sh
Submitted batch job 38678
In the example above, "38678" is our job number. If you miss recording that number, there are other ways you can find it. One is to use "squeue":
squeue -u user1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
38678 long myjob.sh user1 R 3:35 2 n[007-008]
In the example above, we specify the user filter -u to limit the returned jobs to only those that we own. Here we can again see the JOBID 38678. We can also see the column "NODELIST", which shows us we are running on nodes n007 and n008. Knowing which nodes our job is using is essential for monitoring job performance.
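If you only need the node list for a particular job, squeue can print it directly; a quick sketch using the job ID from the example above:
squeue -j 38678 -h -o "%N"
n[007-008]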
Once you know what node(s) your job is running on, you may access it directly using SSH. For example, if your job is running on n007:
ssh n007
The most direct way to monitor your running job is to log in to the respective node using SSH and use various tools to examine key performance indicators such as load, memory usage, and threading behavior.
Useful Command Line Tools¶
Many commands can be accessed directly from the shell, and some others are provided as module file installations. Some commonly used commands include:
- top - shows a list of running processes, with summary of CPU utilization, memory usage, etc.
- free - shows memory utilization. Add a flag like "-g" to show memory usage in gigabytes.
- vmstat - display memory usage statistics
- lsof - "list open files" is useful for showing open files that are being read and/or written to
- uptime - shows system loads for 1, 5, and 15 minute averages, respectively
- ps -e - shows actively running processes
- pstree - shows a schematic process tree
Each of the commands above may be issued as-is, or with various modifying flags.
Other commands like "grep" may be useful for filtering command output to isolate items of interest.
The options for how these commands may be used are virtually limitless, and this document cannot hope to cover them all. You can find the available options for any of these commands by using the "man" pages, e.g. "man top".
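For example, to narrow the output of these tools to just your own processes, you might use the built-in user filters or pipe the output through grep (a sketch, assuming your username is available in $USER):
top -u $USER                                   # show only your processes in top
ps -e -o pid,user,pcpu,pmem,comm | grep $USER  # filter the process list by user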
Uptime command and node load
The uptime command is useful for determining the "load" on a compute node. Devana CPU and GPU nodes have 64 processing cores each. If the processes running on a node add up to the capacity of one processing core, the load is "1" (or 100%). Therefore, if the processes on a node are fully utilizing all processing cores, we would expect a maximum load of about "64".
In the example below, the load averages for 1, 5, and 15 minutes are all right around "64", so this node is fully utilized.
ssh n007
uptime
13:56:51 up 150 days, 22:48, 1 user, load average: 64.09, 63.72, 63.49
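To check the load on every node of a multi-node job at once, you can expand the job's node list and loop over it; a sketch using the job ID from the squeue example above:
for node in $(scontrol show hostnames $(squeue -j 38678 -h -o "%N")); do
    ssh $node uptime    # print the load averages of each allocated node
done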
Top command and node load
The top command is particularly useful when trying to determine how many processors your jobs are using. For instance, suppose you see multiple jobs running on a node (including yours) and your check of the load indicates that the node is being over-utilized. How could you tell whether it was your processes or someone else's? The "top" command helps break down CPU utilization by process:
ssh n007
top
top - 14:01:15 up 150 days, 22:43, 1 user, load average: 64.10, 63.75, 63.52
Tasks: 716 total, 65 running, 651 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.1 us, 0.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 26379443+total, 21490476+free, 36416432 used, 12473228 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 21900358+avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2080 user1     20   0   10.2g 427328  85664 R 100.0  0.2  11:55.89 vasp_std
 2081 user1     20   0   10.2g 404920  64316 R 100.0  0.2  11:55.70 vasp_std
 2082 user1     20   0   10.2g 390936  57336 R 100.0  0.1  11:56.08 vasp_std
 2083 user1     20   0   10.2g 415048  78996 R 100.0  0.2  11:55.64 vasp_std
  ...
 2144 user1     20   0   10.2g 388968  84112 R  99.0  0.1  13:19.48 vasp_std
  365 root      20   0       0      0      0 S   0.3  0.0   0:07.53 kworker/8:1
  373 root      20   0       0      0      0 S   0.3  0.0   0:07.10 kworker/5:1
 2718 root      20   0  173692   3012   1624 R   0.3  0.0   0:00.74 top
Restricted SSH access
Please note that you can directly access only the nodes where your application is running. When the job finishes, your connection will be terminated as well.
Past Job Statistics Metrics¶
sacct - job accounting information¶
The sacct command can be used to display status information about a user's historical jobs, based on the user name and/or Slurm job ID. By default, sacct will only bring up information about the user's jobs from the current day.
sacct --jobs=<jobid> [--format=metric1,...]
Showing information about a completed job
sacct --jobs=<jobid>
JobID JobName Partition Account AllocCPUS State ExitCode
------------- ---------- ---------- ---------- ---------- ---------- --------
<jobid> name short <project> 512 TIMEOUT 0:0
<jobid>.batch batch <project> 64 CANCELLED 0:15
<jobid>.exte+ extern <project> 512 COMPLETED 0:0
<jobid>.0 hydra_bst+ <project> 512 FAILED 5:0
Use -X to aggregate the statistics relevant to the job allocation itself, not taking job steps into consideration.
Showing aggregated information about a completed job
sacct -X --jobs=<jobid>
JobID JobName Partition Account AllocCPUS State ExitCode
------------- ---------- ---------- ---------- ---------- ---------- --------
<jobid> name short <project> 512 TIMEOUT 0:0
By using the --starttime (-S) flag, the command will look further back to the given date. This can also be combined with --endtime (-E) to limit the query:
sacct [-X] -u <user> [-S YYYY-MM-DD] [-E YYYY-MM-DD] [--format=metric1,...] # Specify user and start and end dates
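For instance, to list all of your jobs (aggregated per job) submitted during a given week, something like the following should work; the username and dates are placeholders:
sacct -X -u user1 -S 2024-03-01 -E 2024-03-08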
The --format flag can be used to choose the command output (a full list of variables can be found with the --helpformat flag):
sacct [-X] -A <account> [--format=metric1,...] # Display information about account jobs
sacct format variable names
Variable | Description |
---|---|
Account | The account the job ran under. |
AveCPU | Average (system + user) CPU time of all tasks in job. |
AveRSS | Average resident set size of all tasks in job. |
AveVMSize | Average Virtual Memory size of all tasks in job. |
CPUTime | Formatted (Elapsed time * CPU) count used by a job or step. |
Elapsed | The job's elapsed time, formatted as DD-HH:MM:SS. |
ExitCode | The exit code returned by the job script or salloc. |
JobID | The id of the Job. |
JobName | The name of the Job. |
MaxRSS | Maximum resident set size of all tasks in job. |
MaxVMSize | Maximum Virtual Memory size of all tasks in job. |
MaxDiskRead | Maximum number of bytes read by all tasks in the job. |
MaxDiskWrite | Maximum number of bytes written by all tasks in the job. |
ReqCPUS | Requested number of CPUs. |
ReqMem | Requested amount of memory. |
ReqNodes | Requested number of nodes. |
NCPUS | The number of CPUs used in a job. |
NNodes | The number of nodes used in a job. |
User | The username of the person who ran the job. |
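As an illustration, several of these variables can be combined into one report for a finished job (the job ID is a placeholder):
sacct -j <jobid> --format=JobID,JobName,ReqCPUS,ReqMem,Elapsed,CPUTime,MaxRSS,MaxVMSize,State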
A full list of variables that specify data handled by sacct can be found with the --helpformat flag, by visiting the sacct documentation, or by typing sacct --help.
seff - job efficiency report¶
This command can be used to find the job efficiency report for jobs that have completed and exited from the queue. If you run this command while the job is still in the R (Running) state, it might report incorrect information.
The seff utility will help you track the CPU/Memory efficiency. The command is invoked as:
seff <jobid>
Jobs with different CPU/Memory efficiency
seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 41-01:38:14
CPU Efficiency: 99.64% of 41-05:09:44 core-walltime
Job Wall-clock time: 1-11:19:38
Memory Utilized: 2.73 GB
Memory Efficiency: 2.13% of 128.00 GB
seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 14:24:49
CPU Efficiency: 23.72% of 2-12:46:24 core-walltime
Job Wall-clock time: 03:47:54
Memory Utilized: 193.04 GB
Memory Efficiency: 75.41% of 256.00 GB
seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 87-16:58:22
CPU Efficiency: 86.58% of 101-07:16:16 core-walltime
Job Wall-clock time: 1-13:59:19
Memory Utilized: 212.39 GB
Memory Efficiency: 82.96% of 256.00 GB
The following illustrates a very bad job in terms of CPU/memory efficiency (below 4%), a case where the user essentially wasted 4 hours of computation while mobilizing a full node and its 64 cores.
seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 00:08:33
CPU Efficiency: 3.55% of 04:00:48 core-walltime
Job Wall-clock time: 00:08:36
Memory Utilized: 55.84 MB
Memory Efficiency: 0.05% of 112.00 GB
seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 34-17:07:26
CPU Efficiency: 95.80% of 36-05:41:20 core-walltime
Job Wall-clock time: 13:35:20
Memory Utilized: 5.18 GB
Memory Efficiency: 0.00% of 0.00 MB
Nvidia SXM A100 40GB #1:
GPU Efficiency: 36.90%
Memory Utilized: 0.00 GB (0.00%)
Nvidia SXM A100 40GB #2:
GPU Efficiency: 36.82%
Memory Utilized: 0.00 GB (0.00%)
Nvidia SXM A100 40GB #3:
GPU Efficiency: 36.92%
Memory Utilized: 0.00 GB (0.00%)
Nvidia SXM A100 40GB #4:
GPU Efficiency: 36.74%
Memory Utilized: 0.00 GB (0.00%)
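If you want to review the efficiency of several recent jobs at once, seff can be combined with sacct in a small loop; a sketch assuming GNU date and that the jobs finished within the last seven days:
# print an efficiency report for each of your jobs from the last 7 days
for jobid in $(sacct -X -n -S $(date -d '7 days ago' +%F) -o JobID | tr -d ' '); do
    seff $jobid
done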