# Slurm Usage

## sinfo

Watch the status of cluster at an interval of 5 seconds:

```console
$ sinfo -i 5
Mon May 12 18:09:08 2025
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      4   idle las[0-3]
high         up   infinite      4   idle las[0-3]
...
```

Node-centric format:

```console
$ sinfo -N
NODELIST   NODES PARTITION STATE 
las0           1   normal* idle  
las0           1      high idle  
las1           1   normal* idle  
las1           1      high idle  
las2           1   normal* idle  
las2           1      high idle  
las3           1   normal* idle  
las3           1      high idle
```

Customized format:

```console
$ sinfo -o "%20N %10c %10m %25f %G"
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
las3                 8          7935       (null)                    gpu:nvidia:1(S:0-7)
las[0-2]             16         32090      (null)                    (null)
```

List the reasons nodes are down or drained (a while after stopping `slurmd` service on `las2`):

```console
$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2025-05-13T14:22:26 las2
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      1  down* las2
normal*      up   infinite      3   idle las[0-1,3]
high         up   infinite      1  down* las2
high         up   infinite      3   idle las[0-1,3]
```

:::{tip}
The `*` after the node state means it is not responding.
:::

## srun

Specify the number of nodes:

```console
$ srun -N4 hostname
las3
las1
las2
las0
```

Run on specified node with specified number of tasks:

```console
$ srun -w las2 -n 2 hostname
las2
las2
```

Specify the begin time of job:

```console
$ srun -b now+10 hostname
srun: job 24 queued and waiting for resources
srun: job 24 has been allocated resources
las3
```

In another console, you can see the queued job (before it is finished) with:

```console
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                24    normal hostname   ubuntu PD       0:00      1 (BeginTime)
```

## REST API

Export the token:

```console
$ export $(scontrol token)
```

Access the site with the token:

```console
$ curl -v -H X-SLURM-USER-NAME:${USER} -H X-SLURM-USER-TOKEN:${SLURM_JWT} 'http://localhost:6820/openapi'
*   Trying 127.0.0.1:6820...
* Connected to localhost (127.0.0.1) port 6820 (#0)
> GET /openapi HTTP/1.1
> Host: localhost:6820
> User-Agent: curl/7.81.0
> Accept: */*
> X-SLURM-USER-NAME:ubuntu
> X-SLURM-USER-TOKEN:eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3NDcxMTkxODcsImlhdCI6MTc0NzExNzM4Nywic3VuIjoidWJ1bnR1In0.zbjEwmJtkDwWL-yUF_-7Kvbzi5JCfExYxvbAE9LAIx0
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 1620152
< Content-Type: application/json
...
```

Unset the token environment variable for it will make `srun` try to use JWT to authenticate (which is not supported by `slurmd`):

```console
$ unset SLURM_JWT
```

## Accounting

List the clusters:

```console
$ sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
       las       127.0.0.1         6817 10752         1                                                                                           normal
```

Show history of jobs:

```console
$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1              hostname     normal                     1  COMPLETED      0:0 
1.0            hostname                                1  COMPLETED      0:0 
...
```

Check the database:

```console
$ mysql -u slurmdbd -p slurm_acct_db
```

## scontrol

After you altered the configuration files of slurm, you can apply it by:

```console
$ sudo scontrol reconfigure
```

If the state of nodes are not valid, they may stay at the state even if the fail reasons were fixed. You can bring it back mannually by:

```console
$ sudo scontrol update NodeName=las3 State=idle
```