Job monitoring
Computing Service Status
To check the occupancy status of the computing platform, you may run the following command:
$ sjstat | more
Scheduling pool data:
-------------------------------------------------------------------------------
Pool        Memory    Cpus  Total  Usable  Free  Other Traits
-------------------------------------------------------------------------------
cpu_seq     380000Mb    40      4       4     1
cpu_share   188000Mb    40     18      18     8
cpu_share   380000Mb    40     12      12     0
cpu_dist    188000Mb    40     46      46    28
cpu_test    188000Mb    40     54      54    34
cpu_test    380000Mb    40     14      14     1
gpu         152000Mb    40      2       2     2    gpu_ex,gpu_pro,gpu_ex_pro,gpu_def
gpu         188000Mb    56      2       2     2
gpu_v100    152000Mb    40      2       2     2    gpu_ex,gpu_pro,gpu_ex_pro,gpu_def
gpu_a100    188000Mb    56      2       2     2
gpu_a100_   500000Mb    48      1       1     1
[..]
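To focus on a single pool, the output can be filtered with standard shell tools; for example, to list only the GPU pools shown above:
$ sjstat | grep gpu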
Job submission status
The `squeue` command is used to display various information about a job. It gives, among other things, the execution time, the current state (`ST` column, with possible states `R` for running and `PD` for pending), the name of the job, and the partition in which the job is executed:
$ squeue
JOBID  PARTITION  NAME   USER        ST  TIME  NODES  NODELIST(REASON)
  312  gpu        hello  <username>  R   0:01      1  cholesky-gpu01
The main options to `squeue` are:
- `-t [running|pending]`: display only jobs in the running or pending state
- `[[-v] -l] -j <jobid>`: display a specific job; `-l` for a long output, `-v` for a verbose output
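For instance, to list only pending jobs, or to display a long output for the job from the example above:
$ squeue -t pending
$ squeue -l -j 312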
Warning
The `squeue` output trims some job fields to 8 characters. To get the full name, use the `-O` option. Below is an example displaying the job ID and the job name:
$ squeue -O JobID,Name
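If the name is still truncated, `-O` also accepts an explicit field width after the field name; for example (the width of 64 characters is an arbitrary choice):
$ squeue -O JobID,Name:64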
For more information about this command and its outputs, please refer to the help `squeue -h` or the SLURM official documentation.
Job efficiency
The `seff` command displays the resources used by a specific job and calculates their efficiency.
$ seff <jobid>
Job ID: <jobid>
Cluster: cholesky
User/Group: <username>/<groupid>
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:12:50
CPU Efficiency: 98.59% of 00:13:01 core-walltime
Job Wall-clock time: 00:13:01
Memory Utilized: 120.00 KB
Memory Efficiency: 0.00% of 0.00 MB
Warning
The `seff` command samples job activity approximately every 30 seconds. The value of the `Memory Utilized` field is therefore not the absolute maximum of the memory used, but the maximum of the sampled values.
Job hold and alteration
The `scontrol` command allows job management. With the options `hold`, `update` and `release`, it allows respectively to hold a job (take it out of the queue), to modify it, and to put it back in the queue:
$ scontrol [hold|update|release] <job list>
The following attributes can be changed after a job is submitted:
- wall clock limit,
- job name,
- job dependency.
Note
In some cases, these attributes can be updated for pending jobs. The wall clock limit may only be reduced, never increased.
The following job attributes cannot be updated during runtime:
- number of GPUs requested,
- node(s),
- memory.
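As a sketch, holding the job from the `squeue` example above, reducing its wall clock limit and renaming it (the values here are purely illustrative), then releasing it, could look like:
$ scontrol hold 312
$ scontrol update JobId=312 TimeLimit=00:30:00 JobName=hello2
$ scontrol release 312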
For more information about this command, please refer to the help `scontrol -h` or the SLURM official documentation.
Job deletion
The `scancel` command allows you to delete one or more jobs:
$ scancel <jobid>
or all of a specific user’s jobs:
$ scancel -u <username>
or a whole series of jobs having the same name:
$ scancel -n <jobname>
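These filters can be combined; for example, to cancel only a user's pending jobs:
$ scancel -u <username> -t pending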
For more information about this command, please refer to the help `scancel -h`.
Ended job status
The `sacct` command checks and displays the state, the partition and the account of a job:
$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1663005            test    cpu_seq  formation          1  COMPLETED      0:0
1663005.bat+      batch             formation          1  COMPLETED      0:0
1663005.ext+     extern             formation          1  COMPLETED      0:0
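The displayed fields can be selected with the `--format` option; for example, to inspect the elapsed time and the memory high-water mark (`MaxRSS`) of the job above:
$ sacct -j 1663005 --format=JobID,JobName,Elapsed,MaxRSS,State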