A lot of people have recently been asking me for help with the SLURM system at UCSD, so I decided to write a blog post to help people quickly learn how to use it and get their jobs running. I have a simple script, submitted via the sbatch command, that launches jobs as a job array. This lets you run many similar jobs in parallel on the cluster.

So say you have a bunch of scripts you want to run in parallel on a SLURM system. We can put all the script commands into a file (calls.txt), one command per line:

calls.txt:

script1.sh
script2.sh
script3.sh
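
If you have many scripts, typing calls.txt by hand gets tedious. Here is a minimal sketch of one way to generate it, assuming the scripts all sit in one directory and end in .sh. The function name make_calls and the directory layout are my own placeholders, not anything SLURM requires. Each line is prefixed with bash so the scripts need not be executable or on the PATH.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): write one "bash <script>" command
# per line for every *.sh file found in a directory.
make_calls() {
    local script_dir=$1 out_file=$2
    local script
    : > "$out_file"                      # start from an empty output file
    for script in "$script_dir"/*.sh; do
        [ -e "$script" ] || continue     # skip the unexpanded pattern if nothing matches
        echo "bash $script" >> "$out_file"
    done
}
```

For example, make_calls scripts calls.txt would write one line for each .sh file under a (hypothetical) scripts directory.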

Then we can create an sbatch script that defines a job array, in which each array task runs one of these commands in parallel with the others. Here is the job array script for sbatch (runTest.sh):

runTest.sh:

#!/bin/bash
#SBATCH --account=xxxx
#SBATCH --partition=shared
#SBATCH --nodes=1 # Ensure that all cores are on one machine
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH -t 0-03:00 # Runtime in D-HH:MM
#SBATCH -o outputfile%a # File to which STDOUT will be written
#SBATCH -e outputerr%a # File to which STDERR will be written
#SBATCH --job-name=ForestDNMTest
#SBATCH --mail-type=ALL # Type of email notification: BEGIN, END, FAIL, ALL
#SBATCH --mail-user=xxxx@ucsd.edu # Email to which notifications will be sent
#SBATCH --array=1-3%3

# Pick the line of calls.txt whose number matches this array task's ID, then run it
linevar=$(sed "${SLURM_ARRAY_TASK_ID}q;d" calls.txt)
eval "$linevar"

As one can see, the sbatch script has several parameters. Some of the useful ones that one should specify:

  • account: The university PI account on SLURM (different for each lab)
  • partition: shared vs. compute. “shared” is generally what people use for straightforward jobs. “compute” is a more reserved use of the cluster for jobs that need more time (but it also costs more).
  • nodes: specifies how many nodes to use (each node has 24 cores)
  • ntasks-per-node: how many tasks to launch on each node (in this case, each job in the array runs as 1 task).
  • cpus-per-task: how many cores to use per task in the job array.
  • -t: the runtime limit for any 1 job, specified in D-HH:MM.
  • -o specifies the output (STDOUT) files and -e specifies the error (STDERR) files. The %a in each file name is replaced by the array task ID, so every job writes to its own files. You can just keep these as is for now (and change them once you try running it if need be).
  • mail-type and mail-user: SLURM has the option of sending you email notifications. See the comments in the script.
  • array=1-3%3 specifies that the job array contains three jobs (i.e. calls.txt has three lines), and %3 specifies that at most 3 should run at a time (i.e. all at once). If you have 50 jobs that you want to run 10 at a time, use array=1-50%10.
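
The array range has to match the number of lines in calls.txt, which is easy to get wrong when the file changes. Here is a small sketch that builds the value from the file instead. The function name array_spec is a placeholder of mine; it just counts lines and prints a string in the 1-N%M form described above.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): print an --array value that covers
# every line of a calls file, with at most max_parallel tasks running at once.
array_spec() {
    local calls_file=$1 max_parallel=$2
    local n_jobs
    n_jobs=$(grep -c '' "$calls_file")   # number of lines = number of array tasks
    if [ "$n_jobs" -eq 0 ]; then
        echo "no commands found in $calls_file" >&2
        return 1
    fi
    echo "1-${n_jobs}%${max_parallel}"
}
```

You could then submit with sbatch --array="$(array_spec calls.txt 10)" runTest.sh. Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the array line in runTest.sh does not need to be edited each time.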

All the job array script does is have each array task pick the line of calls.txt that matches its task ID ($SLURM_ARRAY_TASK_ID) and run it. SLURM dispatches the tasks to the cluster in batches, as specified by the “array” option.
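
Before submitting anything, it can help to check the line-picking step on your own machine. The sketch below is a dry run outside SLURM: it sets SLURM_ARRAY_TASK_ID by hand in a loop and uses the same sed expression as runTest.sh. The temporary calls file and the run_task function are stand-ins I made up for the test; on the cluster, SLURM sets the variable for you and the tasks run in parallel rather than one after another.

```shell
#!/bin/bash
# Dry run of the line-selection step, outside SLURM.
# Assumption: a throwaway calls file with three harmless commands.
calls_file=$(mktemp)
printf '%s\n' 'echo first' 'echo second' 'echo third' > "$calls_file"

run_task() {
    # Same lookup as in runTest.sh: keep only line number $1 of the file, then run it.
    local task_id=$1 file=$2
    local linevar
    linevar=$(sed "${task_id}q;d" "$file")
    eval "$linevar"
}

# Stand in for SLURM by setting the task ID ourselves, one task at a time.
for SLURM_ARRAY_TASK_ID in 1 2 3; do
    run_task "$SLURM_ARRAY_TASK_ID" "$calls_file"
done

rm -f "$calls_file"
```

This prints first, second and third on separate lines, which shows that task 1 gets line 1, task 2 gets line 2, and so on.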

Then on Comet/TSCC, all one has to do is submit the script:
sbatch runTest.sh

You can use the following command to check the status of your job array in the Comet or TSCC queue:
squeue -u yourusername
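
Once the jobs have left the queue, a quick way to spot trouble is to see which tasks wrote anything to their STDERR file. Here is a rough sketch, assuming the outputerr%a naming from runTest.sh (so task 2 writes to outputerr2). The function name tasks_with_stderr is my own. Keep in mind that a non-empty error file is only a hint: some programs print warnings to STDERR even when they succeed.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): list the array task IDs whose
# STDERR file exists and is non-empty.
tasks_with_stderr() {
    local first=$1 last=$2 prefix=${3:-outputerr}
    local id
    for ((id = first; id <= last; id++)); do
        if [ -s "${prefix}${id}" ]; then   # -s: file exists and has size > 0
            echo "$id"
        fi
    done
}
```

For the three-job example above, tasks_with_stderr 1 3 run from the submission directory lists the task IDs worth a closer look.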

Hopefully this makes use of SLURM a lot simpler for everyone. Cheers!
