Using the Slurm Workload Manager for job scheduling

Last modified: 1st November 2017

To run many jobs automatically (e.g. running ADMIXTURE for different values of K without having to launch each one manually), you will want to use a job scheduling system, and for that you need to have Slurm installed.

I recommend that you install and configure this software first. If it does not work, you can still go on installing the other packages: you will only need Slurm for automated jobs, once you have a grasp of the processes involved.

sudo apt-get install munge
sudo apt-get install slurm-llnl

(usually, all needed dependencies will be installed)

Then it needs to be configured. Configurator files are found in

/usr/share/doc/slurmctld/

or, depending on the installed version, in

/usr/share/doc/slurm-llnl/

They are called

slurm-wlm-configurator.easy.html
slurm-wlm-configurator.html

Or

slurm-llnl-configurator.easy.html
slurm-llnl-configurator.html

It is usually enough to open the easy configurator file for our basic needs.

Open one in your web browser, and fill in the fields as required. In my case, I needed to enter my hostname (you can find yours with the command hostname -s) for ControlMachine, NodeName, and Nodes, and to set the number of processors to 2.

The final lines of the slurm.conf file should look something like this (replace hostname with your own hostname):

# COMPUTE NODES
NodeName=hostname CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=hostname Default=YES MaxTime=INFINITE State=UP
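
As a small sketch of how those two lines depend on your machine, the following shell function prints the same section for a given short hostname and CPU count. The function name compute_nodes_section is made up for this example and is not part of Slurm; the configurator output remains the reference.

```shell
#!/bin/sh
# Print the COMPUTE NODES section of slurm.conf for a single-machine setup.
# compute_nodes_section is a hypothetical helper, not a Slurm command.
#   $1 = short hostname (as given by "hostname -s")
#   $2 = number of CPUs to declare
compute_nodes_section() {
    printf '# COMPUTE NODES\n'
    printf 'NodeName=%s CPUs=%s State=UNKNOWN\n' "$1" "$2"
    printf 'PartitionName=debug Nodes=%s Default=YES MaxTime=INFINITE State=UP\n' "$1"
}

# Use this machine's short hostname and the 2 CPUs used in this guide.
compute_nodes_section "$(hostname -s)" 2
```

You can compare the printed lines with the end of the file produced by the configurator to spot a wrong hostname or CPU count.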

NOTE. In my experience, installing Slurm without much idea of how it works can be very tricky, so you are better off sticking to what is known to work. Some advice, from my (little) experience and the instructions at https://slurm.schedmd.com/quickstart_admin.html:

– Do not add the full domain name. If you have a computer named ubuntu.linux.net , just put ubuntu.

– Because virtual machines sometimes mix up the localhost IP and your hostname IP (127.0.0.1 vs. 127.0.1.1)*, I chose to name ControlMachine and NodeName with my hostname instead of localhost. You might need to make your hostname resolve to 127.0.0.1, but I wasn’t able to tweak these parameters without errors.

* The reason why is documented in the Debian manual here: http://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution. Ultimately, it is a bug workaround; the original report is here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=316099

– I was tempted to delete the number of CPUs to use, so that it would depend dynamically on the number of cores I selected for the virtual machine, but it didn’t work as expected. So I recommend setting a specific number of CPUs or Procs (CPUs=2 or Procs=2 in my case).
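
The first two points can be checked from a terminal. This is only a sketch: the variable names are made up for the example, and the getent check assumes a standard Debian/Ubuntu system where that command is available.

```shell
#!/bin/sh
# Reduce a fully qualified name to the short name that goes in slurm.conf.
fqdn="ubuntu.linux.net"      # example name from the advice above
short_name="${fqdn%%.*}"     # drop everything from the first dot onwards
echo "$short_name"           # prints: ubuntu

# Show which address this machine's own short hostname resolves to
# (typically 127.0.0.1 or 127.0.1.1 on Debian/Ubuntu).
getent hosts "$(hostname -s)" || echo "short hostname does not resolve"
```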

Press submit, and copy the output text into a new text file called slurm.conf, which needs to be saved (as root) in the default Slurm directory:

sudo cp slurm.conf /etc/slurm-llnl/

You should probably start (or restart) Slurm now. You might want to simply restart your system. Or try:

scontrol ping

This will probably show that the Slurm services are down. Start them:

sudo slurmctld start

or

sudo /etc/init.d/slurmd start

or

sudo /etc/init.d/slurm start

Now you should be able to use sbatch commands. Try

sinfo

to check that everything is running correctly.

Every time you submit a job, you can check whether it is still running with the command
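
As an illustration of the ADMIXTURE case mentioned at the start, here is a minimal sketch that writes one batch script per value of K and submits it with sbatch. The input file name mydata.bed, the range of K values, and the script and log names are all assumptions made for the example, so adapt them to your own data. If sbatch is not available, the scripts are only written, not submitted.

```shell
#!/bin/sh
# Write one small batch script per value of K and submit it if sbatch exists.
# "mydata.bed" and the values 2..5 are placeholders, not taken from this guide.
input="mydata.bed"

for K in 2 3 4 5; do
    job="admixture_K${K}.sh"

    # The #SBATCH lines are read by sbatch; to the shell they are comments.
    cat > "$job" <<EOF
#!/bin/sh
#SBATCH --job-name=admix_K${K}
#SBATCH --output=admixture_K${K}.log
#SBATCH --cpus-per-task=1
admixture --cv ${input} ${K}
EOF

    if command -v sbatch >/dev/null 2>&1; then
        sbatch "$job"
    else
        echo "sbatch not found; wrote $job without submitting"
    fi
done
```

Each job writes its output to its own log file, so the runs for different values of K do not overwrite each other.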

squeue

You can kill that job if you want to stop it:

scancel X (where X is the job number)

You can also hold and release that job:

scontrol hold X
scontrol release X

If you encounter any problems, run the controller in the foreground with verbose logging to get a live report while you work with Slurm:

sudo slurmctld -Dvvv

Or you can look into logs to see what has happened:

sudo tail -n 100 /var/log/slurm-llnl/slurmctld.log

and/or

sudo tail -n 100 /var/log/slurm-llnl/slurmd.log
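
If the logs are long, it may help to show only the lines that mention an error. The helper below is a sketch of that idea. The name recent_errors is made up for this example, and it assumes the relevant messages contain the word "error", which may not cover every problem.

```shell
#!/bin/sh
# Read a log on standard input and print the last lines that mention "error".
# recent_errors is a hypothetical helper, not a Slurm command.
#   $1 = how many matching lines to show (default 20)
recent_errors() {
    grep -i 'error' | tail -n "${1:-20}"
}

# Apply it to the controller log if the current user is allowed to read it.
log=/var/log/slurm-llnl/slurmctld.log
if [ -r "$log" ]; then
    recent_errors 20 < "$log"
fi
```

If the log is only readable by root, you can run the same filter by hand with sudo grep -i error on the log file, piped to tail.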
