3. Slurm Usage

3.1. sinfo

Watch the cluster status, refreshing every 5 seconds:

$ sinfo -i 5
Mon May 12 18:09:08 2025
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      4   idle las[0-3]
high         up   infinite      4   idle las[0-3]

Node-centric format:

$ sinfo -N
NODELIST   NODES PARTITION STATE 
las0           1   normal* idle  
las0           1      high idle  
las1           1   normal* idle  
las1           1      high idle  
las2           1   normal* idle  
las2           1      high idle  
las3           1   normal* idle  
las3           1      high idle

Customized format:

$ sinfo -o "%20N %10c %10m %25f %G"
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
las3                 8          7935       (null)                    gpu:nvidia:1(S:0-7)
las[0-2]             16         32090      (null)                    (null)
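
Output in a fixed format like this is easy to post-process. The following is a minimal sketch, not part of the original setup: the function name nodes_with_gres is hypothetical, and it assumes that no column contains whitespace, so that GRES is always the fifth field. The sample input is copied from the output above with the header dropped, as sinfo -h would print it.

```shell
# Print the node list entries that advertise a generic resource (GRES).
# Reads lines in the layout of `sinfo -h -o "%20N %10c %10m %25f %G"` on stdin.
# Assumption: no column contains whitespace, so GRES is the fifth field.
nodes_with_gres() {
    awk '$5 != "" && $5 != "(null)" { print $1, $5 }'
}

# Sample input copied from the output above (header omitted, as with -h).
nodes_with_gres <<'EOF'
las3                 8          7935       (null)                    gpu:nvidia:1(S:0-7)
las[0-2]             16         32090      (null)                    (null)
EOF
```

On a live cluster the same function could be fed with: sinfo -h -o "%20N %10c %10m %25f %G" | nodes_with_gres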

List the reasons why nodes are down or drained (shown here a while after stopping the slurmd service on las2):

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2025-05-13T14:22:26 las2
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      1  down* las2
normal*      up   infinite      3   idle las[0-1,3]
high         up   infinite      1  down* las2
high         up   infinite      3   idle las[0-1,3]

Tip

A * after the node state means the node is not responding.
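
The * suffix can also be picked out mechanically. This is a rough sketch under stated assumptions: the function name unresponsive_nodes is hypothetical, the input is in the default node-centric layout (as printed by sinfo -N -h), and the sample lines are made up to mirror the las2 outage. Note that the * on the partition name (third field) only marks the default partition, so the check looks at the state (fourth field) alone.

```shell
# List nodes whose state carries the "*" (not responding) suffix.
# Reads lines in the layout of `sinfo -N -h` on stdin:
#   NODELIST NODES PARTITION STATE
# A node appears once per partition, so duplicates are removed.
unresponsive_nodes() {
    awk '$4 ~ /\*$/ { print $1 }' | sort -u
}

# Hypothetical sample in the node-centric layout, with las2 not responding.
unresponsive_nodes <<'EOF'
las0           1   normal* idle
las2           1   normal* down*
las2           1      high down*
las3           1      high idle
EOF
```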

3.2. srun

Specify the number of nodes:

$ srun -N4 hostname
las3
las1
las2
las0

Run on a specific node with a given number of tasks:

$ srun -w las2 -n 2 hostname
las2
las2

Defer the start of a job (here by 10 seconds):

$ srun -b now+10 hostname
srun: job 24 queued and waiting for resources
srun: job 24 has been allocated resources
las3

From another console, you can see the job while it is still pending:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                24    normal hostname   ubuntu PD       0:00      1 (BeginTime)
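
To see only why jobs are waiting, the queue listing can be reduced to job ID and reason. A small sketch, assuming input produced with squeue -h -o "%i %t %r" (job ID, compact state, reason); the function name pending_reasons and the sample lines are illustrative only.

```shell
# Print "JOBID: REASON" for every pending job.
# Reads lines in the layout of `squeue -h -o "%i %t %r"` on stdin.
pending_reasons() {
    awk '$2 == "PD" { print $1 ": " $3 }'
}

# Hypothetical sample: job 24 waits for its begin time, job 23 is running.
pending_reasons <<'EOF'
24 PD BeginTime
23 R None
EOF
```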

3.3. REST API

Export the token:

$ export $(scontrol token)

Query the REST API with the token:

$ curl -v -H X-SLURM-USER-NAME:${USER} -H X-SLURM-USER-TOKEN:${SLURM_JWT} 'http://localhost:6820/openapi'
*   Trying 127.0.0.1:6820...
* Connected to localhost (127.0.0.1) port 6820 (#0)
> GET /openapi HTTP/1.1
> Host: localhost:6820
> User-Agent: curl/7.81.0
> Accept: */*
> X-SLURM-USER-NAME:ubuntu
> X-SLURM-USER-TOKEN:eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3NDcxMTkxODcsImlhdCI6MTc0NzExNzM4Nywic3VuIjoidWJ1bnR1In0.zbjEwmJtkDwWL-yUF_-7Kvbzi5JCfExYxvbAE9LAIx0
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 1620152
< Content-Type: application/json

Unset the token environment variable afterwards; otherwise srun will try to authenticate with JWT, which slurmd does not support:

$ unset SLURM_JWT
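
One way to avoid the cleanup step is to never export the token into the interactive shell at all. The sketch below is only a suggestion: the helper name with_slurm_token is hypothetical, and it relies on scontrol token printing a single SLURM_JWT=... assignment, which is what the export command above already depends on.

```shell
# Run a command with the token set in that command's environment only.
# $1 is a KEY=VALUE string as printed by `scontrol token`;
# the remaining arguments are the command to run.
with_slurm_token() {
    token_assignment=$1
    shift
    env "$token_assignment" "$@"
}

# Intended use (assumes slurmrestd listens on localhost:6820 as above):
#   with_slurm_token "$(scontrol token)" sh -c \
#     'curl -s -H "X-SLURM-USER-NAME:$USER" -H "X-SLURM-USER-TOKEN:$SLURM_JWT" http://localhost:6820/openapi'

# Demonstration with a dummy value: only the child process sees the variable.
with_slurm_token SLURM_JWT=dummy sh -c 'echo "child sees: $SLURM_JWT"'
echo "parent sees: ${SLURM_JWT:-<unset>}"
```

Because the variable lives only in the environment of the child process, a later srun in the same shell is not affected.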

3.4. Accounting

List the clusters:

$ sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
       las       127.0.0.1         6817 10752         1                                                                                           normal

Show the job history:

$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1              hostname     normal                     1  COMPLETED      0:0 
1.0            hostname                                1  COMPLETED      0:0 
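
For a quick overview, the history can be condensed into a count per job state. A minimal sketch, assuming input from sacct -X -n -P -o JobID,State (allocations only, no header, fields separated by |); the function name count_job_states and the sample records are made up for illustration.

```shell
# Count jobs per state.
# Reads lines in the layout of `sacct -X -n -P -o JobID,State` on stdin:
#   JobID|State
count_job_states() {
    awk -F'|' '{ count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Hypothetical sample records.
count_job_states <<'EOF'
1|COMPLETED
2|FAILED
3|COMPLETED
EOF
```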

Inspect the accounting database directly:

$ mysql -u slurmdbd -p slurm_acct_db

3.5. scontrol

After changing the Slurm configuration files, apply the changes with:

$ sudo scontrol reconfigure

If nodes end up in an invalid state, they may stay in that state even after the cause of the failure has been fixed. You can bring a node back manually with:

$ sudo scontrol update NodeName=las3 State=idle
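
When several nodes are affected, the same command can be generated per node. The sketch below only prints the commands so they can be reviewed before running them. The function name print_resume_commands is hypothetical, and feeding it from sinfo -N -h -t down -o "%N" is an assumption about how the node list would be obtained.

```shell
# Print (not run) one `scontrol update` command per node name read on stdin.
# Input could come from: sinfo -N -h -t down -o "%N" | sort -u
print_resume_commands() {
    while read -r node; do
        if [ -n "$node" ]; then
            echo "sudo scontrol update NodeName=$node State=idle"
        fi
    done
}

# Dry run with two example node names.
printf 'las2\nlas3\n' | print_resume_commands
```

Once the printed commands look right, they can be executed by piping the output to sh.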