Introduction to High Performance Computing for Supercomputing Wales: Instructor Notes


HPC background

What is a cluster

why use one:

show cluster diagram

explain nodes vs cores

introduce SCW and the RSEs, history with HPCW, the current state, and the application process

Logging in

ssh to the login node (the address depends on which SCW system is in use)




Moving Data

sftp to the login node; show FileZilla as a graphical alternative

Scratch dirs in /scratch/username


Running Jobs

Interactive jobs

Explain account and reservation codes. Export SBATCH_ACCOUNT/SBATCH_RESERVATION or SALLOC_ACCOUNT/SALLOC_RESERVATION.

salloc -n 1 --ntasks-per-node=1 --account=scw1389 --reservation=scw1389_XX
srun --pty -n 1 /bin/bash
squeue

Batch jobs


Copy the example batch job script over. Explain the #! line and the #SBATCH comments.

sbatch <script>; explain the job ID it reports.
cat hostname.out.<jobid>
cat hostname.err.<jobid>

Overriding sbatch options

Add sleep 70 to the script and resubmit; show the time limit error. Then run sbatch --time 0-0:2 <script>.

Cancelling jobs

scancel <jobid>


Listing jobs that have finished running



Running multiple jobs with srun

This runs multiple copies of the same command within a job, which lets us use multiple nodes.

#!/bin/bash --login


#SBATCH --job-name=hostname

#SBATCH --output=hostname.out.%J

#SBATCH --error=hostname.err.%J

#SBATCH --time=0-00:01

#SBATCH --mem-per-cpu=10

#SBATCH --ntasks=2

#SBATCH --ntasks-per-node=1

#SBATCH --nodes=2

#SBATCH --account=scw1389

#SBATCH --reservation=scw1389_XX


srun /bin/hostname

run it with sbatch.


Job Arrays

sbatch --array=0-2 <script>
squeue
Show the output files.

Monitoring jobs

sacct -j JOBID --format=JobID,JobName,ReqMem,MaxRSS,Elapsed

Talk about mem, time, nodes and core allocations.


sinfo shows the different partitions.
sbatch -p <partition name> <script>

Email alerts

#SBATCH --mail-user=<email address>
#SBATCH --mail-type=ALL



python3 gives "command not found"
module avail
module load hpcw python/3.5.1

python modules

pip3 install --user <mod>
pip3 install --user sklearn

from sklearn import datasets
digits = datasets.load_digits()
print(digits.data)


HPC Best Practice

See webpage

Optimising for Parallel Processing

crude way:

#!/bin/bash --login


#job name

#SBATCH --job-name=test
#SBATCH --output=test.out.%J
#SBATCH --error=test.err.%J
#SBATCH --time=0-00:01
#SBATCH --ntasks=3
#SBATCH --account=scw1389
#SBATCH --reservation=scw1389_XX


command1 &

command2 &

command3 &

wait # don't let the script exit until the background commands finish

What if command1/2/3 take different amounts of time to run? We have CPUs allocated but aren't using them all; we want to keep usage near 100%.

What if command4 needs to run after command1/2/3?

GNU Parallel is a powerful program designed to run multiple jobs on a single node. The module called parallel provides it.

parallel can read input from a pipe and apply a command to each line of input

ls | parallel echo {1}

ls | parallel echo

alternate syntax for same thing

parallel echo {1} ::: $(ls)

{1} means the first argument. Separate each list of arguments with another :::.

parallel echo {1} {2} ::: 1 2 3 ::: a b c

Use parallel on Nelle’s pipeline from Unix Shell lesson.

wget <data URL>
unzip <archive>
cd data-shell/north-pacific-gyre/2012-07-03/

We used to process this with a for loop in series. Switch to parallel

ls NENE*[AB].txt | parallel bash goostats {1} stats-{1}

#!/bin/bash --login


#SBATCH --ntasks 4 #Number of processors we will use

#SBATCH --nodes 1 #request everything runs on the same node

#SBATCH -o output.%J #Job output

#SBATCH -t 00:00:05 #Max wall time for entire job

#SBATCH --account=scw1389

#SBATCH --reservation=scw1389_XX


module load hpcw

module load parallel

srun="srun -n1 -N1"

parallel="parallel -j $SLURM_NTASKS --joblog parallel_joblog"

ls NENE*[AB].txt | $parallel "$srun bash goostats {1} stats-{1}"

submit it:


sacct will show 15 subjobs.

parallel_joblog shows how long each took to run.

A more complex example: run hello with every combination of 1/2/3 and a/b/c (1a, 1b, 1c, 2a, …)

parallel echo "hello {1} {2}" ::: 1 2 3 ::: a b c

Treat arguments as pairs (e.g. 1a, 2b, 3c)

parallel echo "hello {1} {2}" ::: 1 2 3 :::+ a b c
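The ::: versus :::+ behaviour maps directly onto Python's itertools.product versus zip, which can help when explaining it (this comparison is my illustration, not part of the lesson):

```python
from itertools import product

nums = ["1", "2", "3"]
letters = ["a", "b", "c"]

# ::: gives the cross product: every combination of the two lists
combos = [f"hello {n} {l}" for n, l in product(nums, letters)]

# :::+ pairs the lists element-wise, like zip
pairs = [f"hello {n} {l}" for n, l in zip(nums, letters)]

print(combos)  # nine combinations: hello 1 a, hello 1 b, ..., hello 3 c
print(pairs)   # hello 1 a, hello 2 b, hello 3 c
```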


Estimation of Pi on a single core

Buffon’s Needle

Monte Carlo method for estimating Pi: drop points randomly in a quadrant of a circle.

Draw a circle and take one quadrant. Drop m random points on the quadrant; n is the number that land inside the circle.

Pi ≈ 4 * n / m

see python implementation of this

x^2 + y^2 < 1 means inside the circle
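The serial approach can be sketched as follows (a minimal sketch: the function names inside_circle and estimate_pi follow the description above, but the lesson's actual script may differ):

```python
import numpy as np

def inside_circle(total_count):
    # Drop total_count random points in the unit quadrant and count
    # how many satisfy x^2 + y^2 < 1 (i.e. fall inside the circle).
    x = np.random.uniform(size=total_count)
    y = np.random.uniform(size=total_count)
    return int(np.sum(x * x + y * y < 1.0))

def estimate_pi(total_count):
    # Pi ~ 4 * n / m, with n of m points landing inside the quarter circle.
    return 4.0 * inside_circle(total_count) / total_count

if __name__ == "__main__":
    print(estimate_pi(1_000_000))
```

With 10^6 points the estimate is typically within about 0.01 of Pi.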


write code which works, measure performance, optimise

profilers tell us how long each line of code takes

python line_profiler is one of these

install with

module load hpcw python/3.5.1

pip3 install --user line_profiler

We have to tell the profiler which function to profile with the @profile decorator. Put this before def main(); refactor the script so there is a main function.

try to find an empty head node to do this, ssl003 is a good bet

run profiler:

~/.local/bin/kernprof -l ./ 50000000

Output is stored in <script>.lprof

view it with

python3 -m line_profiler <script>.lprof

estimate_pi function takes 100% of the time

remove annotation from main and move it to estimate_pi

repeat profiling

inside_circle now shows 100%

move profiling to there and repeat again

generating random numbers takes about 60% of the time. this is our prime target for optimisation.



Parallel estimation of Pi

We showed previously that random number generation was 60-70% of the time; show the profiler output again.

python3 -m line_profiler

X and Y are independent variables (see the first figure in the notes), so we can generate them in parallel.

Random numbers are generated with numpy, similar to:

a = np.random.uniform(size=10)

b = np.random.uniform(size=10)

c = a + b

a and b are NumPy arrays; the final line adds them element-wise (it does not concatenate them).

c = np.empty_like(a)
for i in range(len(a)):
    c[i] = a[i] + b[i]

achieves the same thing, but makes it clearer what is going on

we could generate each pair of X/Y values in parallel (see second figure in notes)


Data independence 1,2,3

Amdahl’s law

What is the overall speedup of a program when some of it is done in parallel?

S = 1 / ((1 - p) + (p / s))

p = portion of the program sped up; s = speedup achieved in that portion

Parallel calculation of x and y occupied 70% of the time, with a speedup of 2 in that part:

S = 1 / ((1 - 0.7) + (0.7 / 2))

S = 1.538
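A one-line helper (illustrative, not part of the lesson code) makes it easy to try other values of p and s:

```python
def amdahl_speedup(p, s):
    # Amdahl's law: overall speedup S when a fraction p of the
    # runtime is accelerated by a factor s.
    return 1.0 / ((1.0 - p) + p / s)

print(round(amdahl_speedup(0.7, 2), 3))  # 1.538
```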


More Amdahl's law

Show Amdahl’s law graph

We can't parallelise infinitely; there is a limit from the number of cores etc., and additional limits from I/O and memory bottlenecks.

in the example Lola splits data into partitions, see figure

PyMP brings OpenMP-style parallel loops to Python; these are simpler than working with raw threads. We can also use the multiprocessing library in Python.

explain shared vs private variables, locking

Run the PyMP version: python3 ./ 1000000000

Time it with time python3 ./ 1000000000 and compare to the serial version with time python3 ./ 1000000000.
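The multiprocessing route mentioned above can be sketched like this (names are illustrative, and the inside_circle counting function is assumed from the serial version). Each worker gets its own private x/y arrays, so no locking is needed; only the per-worker counts come back:

```python
import numpy as np
from multiprocessing import Pool

def inside_circle(count):
    # Private to each worker: draw points, count those inside the circle.
    x = np.random.uniform(size=count)
    y = np.random.uniform(size=count)
    return int(np.sum(x * x + y * y < 1.0))

def estimate_pi_parallel(total_count, workers=4):
    # Split the points across worker processes and merge the counts.
    chunk = total_count // workers
    chunks = [chunk] * workers
    chunks[-1] += total_count - chunk * workers  # remainder to last worker
    with Pool(workers) as pool:
        hits = pool.map(inside_circle, chunks)
    return 4.0 * sum(hits) / total_count

if __name__ == "__main__":
    print(estimate_pi_parallel(1_000_000))
```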




Message Passing Interface

passes messages between cluster nodes

useful when problem is too big for one node

copy example to

sbatch -n 4

repeat with more cores

sbatch -n 16

The order of output might be a bit random. Merging of the output file is done by synchronising on each line.

MPI libraries available for lots of languages including C/C++, Fortran and Python

Install mpi4py

module load mpi

module load hpcw python/3.5.1

pip3 install --user mpi4py

create with example contents

submit with sbatch -n 16


MPI calculation of Pi

MPI size tells us how many instances of the code are running

rank tells us which instance we are. Usually the instance with rank 0 does the coordination.


code to get size/rank:



Every copy of the code runs in parallel in a different MPI process, possibly on a different node.

Rank 0 will often do something different to other ranks. Show hello world and pi example in notes.

MPI’s scatter function will scatter an array in equal parts across all instances. The gather function will gather data from all instances and merge them back together.

In example final computation of Pi done on rank 0 only.
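The Scatter/Gather pattern can be illustrated without MPI using plain NumPy (this mimics the semantics only; the real code uses mpi4py's scatter and gather):

```python
import numpy as np

size = 4                # stands in for the number of MPI ranks
data = np.arange(16.0)  # the full array, held by rank 0

# Scatter: rank 0 splits the data into equal parts, one per rank
chunks = np.array_split(data, size)

# Each rank computes on its own chunk independently
partial = [chunk.sum() for chunk in chunks]

# Gather: rank 0 collects the partial results and merges them
total = sum(partial)

print(total == data.sum())  # True: same answer as the serial sum
```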

Run it: make an sbatch script containing time mpirun python3 <script> 1000000000

run with sbatch -n 48

Investigate time output

Show performance graphs

MPI vs PyMP performance, different nodes. Try PyMP as an sbatch job.