Job monitoring
Computing Service Status
To check the occupancy status of the computing platform, you may run the following command:
$ sjstat | more
Scheduling pool data:
-------------------------------------------------------------------------------
Pool        Memory    Cpus  Total  Usable  Free  Other Traits
-------------------------------------------------------------------------------
cpu_seq     380000Mb    40      4       4     1
cpu_share   188000Mb    40     18      18     8
cpu_share   380000Mb    40     12      12     0
cpu_dist    188000Mb    40     46      46    28
cpu_test    188000Mb    40     54      54    34
cpu_test    380000Mb    40     14      14     1
gpu         152000Mb    40      2       2     2    gpu_ex,gpu_pro,gpu_ex_pro,gpu_def
gpu         188000Mb    56      2       2     2
gpu_v100    152000Mb    40      2       2     2    gpu_ex,gpu_pro,gpu_ex_pro,gpu_def
gpu_a100    188000Mb    56      2       2     2
gpu_a100_   500000Mb    48      1       1     1
[..]
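To focus on a single pool, the output can be filtered with standard shell tools; for example, to list only the GPU pools shown above:
$ sjstat | grep gpu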
Job submission status
The `squeue` command is used to display various information about a job. It gives, among other things, the execution time, the current state (`ST` column, with possible states `R` for running and `PD` for pending), the name of the job, and the partition in which the job is executed:
$ squeue
JOBID  PARTITION  NAME   USER        ST  TIME  NODES  NODELIST(REASON)
  312  gpu        hello  <username>  R   0:01      1  cholesky-gpu01
The main options to `squeue` are:
- `-t [running|pending]`: display only jobs in the running or pending state
- `[[-v] -l] -j <jobid>`: display a specific job; `-l` for a long output, `-v` for a verbose output
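For instance, to list only pending jobs, or to display a long output for the job from the example above:
$ squeue -t pending
$ squeue -l -j 312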
Warning
The `squeue` output trims some job fields to 8 characters. To get the full name, use the `-O` option. Below is an example displaying the job ID and the job name:
$ squeue -O JobID,Name
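If the name is still truncated, `-O` also accepts an explicit field width after the field name; for example (the width of 64 characters is an arbitrary choice):
$ squeue -O JobID,Name:64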
For more information about this command and its outputs, please refer to the help `squeue -h` or the SLURM official documentation.
Job efficiency
The `seff` command displays the resources used by a specific job and calculates their efficiency.
$ seff <jobid>
Job ID: <jobid>
Cluster: cholesky
User/Group: <username>/<groupid>
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:12:50
CPU Efficiency: 98.59% of 00:13:01 core-walltime
Job Wall-clock time: 00:13:01
Memory Utilized: 120.00 KB
Memory Efficiency: 0.00% of 0.00 MB
Warning
The `seff` command samples job activity approximately every 30 seconds. The value of the `Memory Utilized` field is therefore not the absolute maximum of the memory used, but the maximum of the sampled values.
Job hold and alteration
The `scontrol` command allows job management. With the options `hold`, `update` and `release`, it allows respectively to hold a job (take it out of the queue), to modify it, and to put it back in the queue:
$ scontrol [hold|update|release] <job list>
The following attributes can be changed after a job is submitted:
- wall clock limit,
- job name,
- job dependency.
Note
In some cases, these attributes can be updated for pending jobs. The wall clock limit may only be reduced, never increased.
The following job attributes cannot be updated during runtime:
- number of GPUs requested,
- node(s),
- memory.
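As a sketch, holding the job from the `squeue` example above, reducing its wall clock limit and renaming it (the values here are purely illustrative), then releasing it, could look like:
$ scontrol hold 312
$ scontrol update JobId=312 TimeLimit=00:30:00 JobName=hello2
$ scontrol release 312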
For more information about this command, please refer to the help `scontrol -h` or the SLURM official documentation.
Job deletion
The `scancel` command allows you to delete one or more jobs:
$ scancel <jobid>
or all of a specific user’s jobs:
$ scancel -u <username>
or a whole series of jobs having the same name:
$ scancel -n <jobname>
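These filters can be combined; for example, to cancel only a user's pending jobs:
$ scancel -u <username> -t pending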
For more information about this command, please refer to the help `scancel -h`.
Ended job status
The `sacct` command checks and displays the state, the partition and the account of a job:
$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1663005            test    cpu_seq  formation          1  COMPLETED      0:0
1663005.bat+      batch             formation          1  COMPLETED      0:0
1663005.ext+     extern             formation          1  COMPLETED      0:0
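The displayed fields can be selected with the `--format` option; for example, to inspect the elapsed time and the memory high-water mark (`MaxRSS`) of the job above:
$ sacct -j 1663005 --format=JobID,JobName,Elapsed,MaxRSS,State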