All the following steps should be done by the root user.
Prepare the environment on all the nodes, and make sure everything is set up correctly before installing munge and Slurm.
Just follow the MariaDB tutorial.
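If you also plan to enable Slurm accounting through slurmdbd, you will need a database and a database user. The sketch below is an assumption on my part: the database name slurm_acct_db and the password are placeholders, so match them to your slurmdbd.conf.
apt install mariadb-server
mysql -u root
-- inside the MariaDB shell; names and password are examples
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'change_me';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;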
Follow this guide to install NFS. Remember to change the subnet to your own subnet. I simply disabled the ufw service on all the nodes (not recommended). I shared the /storage, /opt, and /home directories on all the nodes.
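For reference, the /etc/exports on the master node might look like this (the 192.168.1.0/24 subnet is a placeholder; use your own):
# /etc/exports on the master node; the subnet is an example
/storage 192.168.1.0/24(rw,sync,no_subtree_check)
/opt     192.168.1.0/24(rw,sync,no_subtree_check)
/home    192.168.1.0/24(rw,sync,no_subtree_check)
Apply it with exportfs -ra, and on each compute node mount the shares with, e.g., mount -t nfs master:/home /home (the hostname master is an assumption).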
You can also start the service automatically when the node boots by making this a systemd service (in my case, I let it sleep for 10 seconds after boot to make sure the network is ready).
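A minimal sketch of such a unit, assuming the NFS shares are listed in /etc/fstab and the file is saved as /etc/systemd/system/nfs-mount.service (a name I made up):
[Unit]
Description=Mount NFS shares after boot
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# wait 10 seconds so the network is really up before mounting
ExecStartPre=/bin/sleep 10
# mount every NFS entry from /etc/fstab
ExecStart=/bin/mount -a -t nfs
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
Enable it with systemctl enable nfs-mount.service.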
Install NIS according to the NIS tutorial. One can also use OpenLDAP instead of NIS, but I think NIS is easier to use.
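On the client side, the two files that matter are /etc/yp.conf and /etc/nsswitch.conf; the domain mycluster and the server master below are assumptions, so use the values from your own NIS setup.
# /etc/yp.conf (domain and server names are examples)
domain mycluster server master

# /etc/nsswitch.conf: resolve users and groups through NIS as well
passwd: files nis
group:  files nis
shadow: files nis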
Then run the following commands on the master node to create the users.
adduser slurm
adduser munge
and update the NIS database.
cd /var/yp
make
Remember that each time you change a user, you need to update the NIS database.
Follow section 4.4 of this guide; just remember to create the munge user and group before installing munge!
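One extra check that is easy to miss: /etc/munge/munge.key must be identical on every node. A sketch, assuming a node named node1:
# copy the key from the master and fix ownership/permissions
scp /etc/munge/munge.key node1:/etc/munge/munge.key
ssh node1 "chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
# a credential minted on the master should decode on the node
munge -n | ssh node1 unmunge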
Follow chapters 3 and 4 of this guide; you also need to create the slurm user and group before doing so.
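Once the installation in those chapters is done, enable the daemons so they survive reboots (the unit names below are the stock Slurm systemd units):
# on the master node
systemctl enable --now slurmctld
# on every compute node
systemctl enable --now slurmd
# on the master, if you use MariaDB-backed accounting
systemctl enable --now slurmdbd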
A mistake in the tutorial: the cgroup.conf file should be modified as follows:
# CgroupAutomount=yes # this is not needed
CgroupMountpoint=/sys/fs/cgroup
# CgroupPlugin=
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
# MaxRAMPercent=98
# AllowedRAMSpace=96
Other settings should be the same as in the tutorial.
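After changing cgroup.conf, restart slurmd on every node and check that the new values were picked up:
systemctl restart slurmd
# the cgroup settings appear in the config dump
scontrol show config | grep -i cgroup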
After the installation, you can enjoy Slurm if the following commands run successfully.
# xzgao @ master in ~ [1:02:33]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
i96c256g* up infinite 3 idle node[1-3]
# xzgao @ master in ~ [1:02:49]
$ srun -N3 hostname
node1
node2
node3
Here is an example of a Slurm script, where I run a Julia script on one node with 96 cores (192 threads).
# xzgao @ master in ~/work/julia_test [1:07:31]
$ cat blas.jl
using BLASBenchmarksCPU, CSV, DataFrames
libs = [:Gaius, :Octavian, :OpenBLAS]
threaded = true
benchmark_result = runbench(Float64; libs, threaded)
df = benchmark_result_df(benchmark_result)
CSV.write("/home/xzgao/work/julia_test/blas_results.csv", df)
# xzgao @ master in ~/work/julia_test [1:05:11]
$ cat run.slurm
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=i96c256g
#SBATCH -n 96
#SBATCH --ntasks-per-node=192
#SBATCH --output=%j.out
#SBATCH --error=%j.err
julia --project=/home/xzgao/work/julia_test -t 192 /home/xzgao/work/julia_test/blas.jl
# xzgao @ master in ~/work/julia_test [1:04:17]
$ sbatch run.slurm
Submitted batch job 26
# xzgao @ master in ~/work/julia_test [1:04:31]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
26 i96c256g test xzgao R 0:01 1 node1
# xzgao @ master in ~/work/julia_test [1:04:32]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
i96c256g* up infinite 1 alloc node1
i96c256g* up infinite 2 idle node[2-3]
Since I use NFS to share the /home directory, the program can run on any node once the user installs the software under /home.
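A quick sanity check that the shared /home works as expected is to ask every node to run the same binary (this assumes julia is on the PATH of the submitting shell, which srun propagates by default):
srun -N3 julia --version
All three nodes should report the same version.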