This lesson is in the early stages of development (Alpha version)

Supercomputing for Beginners: Cheatsheets for Queuing System Quick Reference

Key Points

Why Use a Cluster?
  • High Performance Computing (HPC) typically involves connecting to very large computing systems elsewhere in the world.

  • These other systems can be used to do work that would be either impossible or much slower on smaller systems.

  • The standard method of interacting with such systems is via a command line interface, typically the Bash shell.

Working on a remote HPC system
  • An HPC system is a set of networked machines.

  • HPC systems typically provide login nodes and a set of worker nodes.

  • The resources found on independent (worker) nodes can vary in volume and type (amount of RAM, processor architecture, availability of network mounted file systems, etc.).

  • Files saved on one node are available on all nodes.
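
Because worker nodes can differ, it can help to check what the node you are on actually provides. The following is a minimal sketch that assumes a Linux node with the standard uname and nproc utilities; the variable names are illustrative only, and the memory check relies on /proc/meminfo, which exists only on Linux.

```shell
# Name of the node this shell is running on.
node_name="$(uname -n)"
# Processor architecture, for example x86_64 or aarch64.
node_arch="$(uname -m)"
# Number of processor cores available to this shell (GNU coreutils).
node_cores="$(nproc)"

echo "Node: ${node_name}  Architecture: ${node_arch}  Cores: ${node_cores}"

# Total RAM as reported by the Linux kernel (assumes /proc/meminfo is readable).
if [ -r /proc/meminfo ]; then
    grep MemTotal /proc/meminfo
fi
```

Running the same commands on a login node and inside a job is one way to see how the two kinds of node differ.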

Scheduling jobs
  • The scheduler handles how compute resources are shared between users.

  • Everything you do should be run through the scheduler.

  • A job is just a shell script.

  • If in doubt, request more resources than you will need.
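
To illustrate that a job is just a shell script, here is a minimal sketch of one. It assumes the scheduler is Slurm; other schedulers, such as Grid Engine, use different directive lines. The job name and the resource values are placeholders, not recommendations. Lines starting with #SBATCH are ordinary comments to Bash, so the script also runs unchanged outside the scheduler.

```shell
#!/bin/bash
# Scheduler directives (Slurm syntax, assumed): placeholder name and limits.
#SBATCH --job-name=example-job
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# The actual work: report which node the job landed on.
node_name="$(uname -n)"
echo "This job is running on ${node_name}"
```

If the script were saved as example-job.sh (a hypothetical filename), it could be submitted on a Slurm system with sbatch example-job.sh, and its place in the queue checked with squeue -u "$USER".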

Filesystems and Storage
  • The home directory is the default place to store data.

  • The scratch directory is a larger space for temporary files.

  • On Hawk in Cardiff, home is backed up, but it is also on a slower disk.

  • Quotas on home are much smaller than scratch.
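
Since quotas differ between home and scratch, it is worth checking how much space you are using. The sketch below uses only the standard df and du tools; many sites also provide their own quota-reporting command, which is not shown here because its name is site-specific. The variable target_dir is an illustrative name and defaults to the current directory.

```shell
# Directory to measure; defaults to the current directory.
target_dir="${target_dir:-.}"

# Size and free space of the filesystem that holds your home directory.
df -h "$HOME"

# Total space used by everything under target_dir.
usage="$(du -sh "$target_dir" | cut -f1)"
echo "${target_dir} uses ${usage}"
```

Note that df reports on the whole filesystem, which is shared with other users, so it does not show your personal quota.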

Accessing software
  • Discover available software with module avail

  • Load software with module load softwareName

  • Unload all loaded software with module purge

  • The module system handles software versioning and package conflicts for you automatically.
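
The commands above can be sketched as one short session. The name softwareName is a placeholder taken from the list above, not a real package; replace it with a name that module avail reports on your system. The check at the top is an addition so that the snippet does nothing harmful on a machine without a module system.

```shell
# Placeholder module name; replace with one listed by `module avail`.
software="softwareName"

if command -v module >/dev/null 2>&1; then
    module avail                # list the software installed on the system
    module load "$software"     # add the software to the current environment
    module list                 # show which modules are currently loaded
    module purge                # unload every loaded module
    module_status="ran"
else
    module_status="unavailable"
    echo "The module command is not available on this machine." >&2
fi
```

To remove a single package while keeping the others loaded, module unload followed by the package name can be used instead of module purge.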

Transferring files
  • wget downloads a file from the internet.

  • scp transfers files to and from your computer.

  • You can use an SFTP client like FileZilla to transfer files through a GUI.
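
As a rough sketch of the two command-line tools above, the snippet below prints the commands it would run rather than running them, so they can be checked first. The username, host name, URL, and filenames are all hypothetical, and the run helper and DRY_RUN variable are illustrative additions, not part of wget or scp.

```shell
# Hypothetical values: replace with your own username, host, and URL.
remote="yourUsername@cluster.example.org"
url="https://example.org/data.tar.gz"

# Print each command instead of running it unless DRY_RUN is set to 0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download a file from the internet into the current directory.
run wget "$url"
# Copy a local file to your home directory on the remote system.
run scp data.tar.gz "${remote}:"
# Copy a file from your remote home directory back to the current directory.
run scp "${remote}:results.txt" .
```

With scp, a remote path that is not absolute is interpreted relative to your home directory on the remote system, which is why the bare colon in the second command means "home".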

Using resources effectively
  • The smaller your job, the faster it will schedule.

Using shared resources responsibly
  • Be careful how you use the login node.

  • Your data on the system is your responsibility.

  • Plan and test large data transfers.

  • It is often best to convert many files to a single archive file before transferring.

  • Again, don’t run stuff on the login node.

  • Don’t be a bad person and run stuff on the login node.
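
The advice above about converting many files to a single archive can be sketched with tar. This example works entirely inside a temporary directory and creates three small stand-in files; the directory and file names are made up for illustration.

```shell
# Work in a throwaway directory so none of your own files are touched.
work_dir="$(mktemp -d)"
mkdir "${work_dir}/results"

# Create a few small example files (stand-ins for real output files).
for i in 1 2 3; do
    echo "sample ${i}" > "${work_dir}/results/file${i}.txt"
done

# Bundle and compress the whole directory into a single archive.
tar -czf "${work_dir}/results.tar.gz" -C "${work_dir}" results

# List the archive contents to check it before transferring it.
tar -tzf "${work_dir}/results.tar.gz"
```

After the transfer, tar -xzf results.tar.gz unpacks the archive at the destination.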

Cheatsheets for Queuing System Quick Reference

Glossary

The following list captures terms that need to be added to this glossary. This is a great way to contribute.

Accelerator
to be defined
Beowulf cluster
to be defined
Central processing unit
to be defined
Cloud computing
to be defined
Cluster
a collection of computers configured to enable collaboration on a common task by means of purposefully configured hardware (e.g., networking) and software (e.g., workload management).
Distributed memory
to be defined
Grid computing
to be defined
High availability computing
to be defined
High performance computing
to be defined
Interconnect
to be defined
Node
to be defined
Parallel
to be defined
Serial
to be defined
Server
to be defined
Shared memory
to be defined
Slurm
to be defined
Supercomputer
… “a major scientific instrument” …
Workstation
to be defined
Grid Engine
to be defined
Parallel File System
to be defined