
Advanced topics in MPI

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • What is the best way to read and write data to disk?

  • Can MPI optimise communications by itself?

  • How can I use OpenMP and MPI together?

Objectives
  • Use the appropriate reading and writing methods for your data.

  • Understand how the topology of the problem affects communications.

  • Understand the use of hybrid coding and how MPI and OpenMP interact.

Having covered simple point-to-point communication and collective communications, this section covers topics that are not essential but are useful to know about.

MPI-IO

When reading and writing data to disk from a program using MPI there are a number of approaches, such as funnelling all I/O through a single task, having each task write its own file, or using MPI-IO so that tasks cooperate on a shared file. This section focuses on MPI-IO.

All forms of MPI-IO require the use of MPI_File_open. mpi4py uses:

MPI.File.Open(comm, filename, amode)

Where amode is the access mode, such as MPI.MODE_WRONLY, MPI.MODE_RDWR or MPI.MODE_CREATE. These can be combined with the bitwise OR operator |.
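
For example, a minimal sketch (with an illustrative filename) that opens a file for writing, creating it if it does not already exist:

from mpi4py import MPI

comm = MPI.COMM_WORLD

# combine access modes with bitwise OR: write-only, creating the file if needed
amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
fh = MPI.File.Open(comm, "./example.dat", amode)
fh.Close()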

There are two types of I/O: independent and collective. Independent I/O is like standard Unix I/O, whilst collective I/O requires all MPI tasks in the communicator to take part in the operation. Collective I/O increases the opportunity for MPI to take advantage of optimisations such as large-block I/O, which is much more efficient than small-block I/O.

Independent I/O

Just like the Unix operations open, seek, read/write and close, MPI allows a single task to read from and write to a file independently.

mpi4py has its own equivalent methods, e.g. File.Seek (see help(MPI.File)).

For example to write from the leader:

from mpi4py import MPI
import numpy as np

amode = MPI.MODE_WRONLY|MPI.MODE_CREATE
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
fh = MPI.File.Open(comm, "./datafile.mpi", amode)

# each task fills a buffer of 10 C integers with its rank
buffer = np.empty(10, dtype=np.intc)
buffer[:] = rank

# compute this task's offset in the file; only the leader (rank 0) writes
offset = comm.Get_rank() * buffer.nbytes
if rank == 0:
    fh.Write_at(offset, buffer)

fh.Close()

This is useful where collective calls do not fit naturally into the code, or where the overhead of collective calls outweighs the benefit (e.g. small amounts of I/O). You can find an example in mpiio.independent.py and its corresponding mpiio.independent-slurm.sh.

Collective non-contiguous I/O

If a file operation needs to be performed across the whole file but the data is not contiguous (e.g. there are gaps between the data that each task reads), this uses the concept of a file view, set with MPI_File_set_view. mpi4py uses fh.Set_view, where fh is the file handle returned from MPI.File.Open.

fh.Set_view(displacement, filetype=filetype)

displacement is the position in the file (in bytes) and filetype is a description of the data layout for each task.

Key functions are Read_at, Write_at, Read_all, Write_all, Read_at_all and Write_at_all.

The _at_ versions of the functions are preferable to performing a Seek followed by a separate Read or Write. See help(MPI.File)
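
As a rough sketch of the difference (the filename below is just illustrative), the two forms look like this:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
fh = MPI.File.Open(comm, "./datafile.at", amode)

buffer = np.full(10, rank, dtype='i')
offset = rank * buffer.nbytes

# two-step form: move the individual file pointer, then write
fh.Seek(offset)
fh.Write(buffer)

# single-call form: the offset is part of the call, no separate seek needed
# (here it simply overwrites the same bytes with the same data)
fh.Write_at(offset, buffer)

fh.Close()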

For example, using a file view for collective non-contiguous I/O:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

amode = MPI.MODE_WRONLY|MPI.MODE_CREATE
fh = MPI.File.Open(comm, "./datafile.noncontig", amode)

item_count = 10

buffer = np.empty(item_count, dtype='i')
buffer[:] = rank

# a vector type of item_count integers, one per block, with a stride of size
filetype = MPI.INT.Create_vector(item_count, 1, size)
filetype.Commit()

# each task starts one integer further into the file than the previous rank
displacement = MPI.INT.Get_size() * rank
fh.Set_view(displacement, filetype=filetype)

fh.Write_all(buffer)
filetype.Free()
fh.Close()

You can find an example in mpiio.non.contiguous.py and its corresponding mpiio.non.contiguous-slurm.sh.

Contiguous collective I/O

Contiguous collective I/O is where all tasks take part in the I/O operation across the whole of the data.

Contiguous example

Write a file where each MPI task writes 10 elements with the value of its rank, in rank order.

Solution

This is very similar to the non-contiguous I/O but the file view is not required.

offset = comm.Get_rank() * buffer.nbytes
fh.Write_at_all(offset, buffer)
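
Putting the solution together, a complete version might look something like the following sketch (the filename is just illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
fh = MPI.File.Open(comm, "./datafile.contig", amode)

# each task writes 10 integers containing its rank
buffer = np.full(10, rank, dtype='i')

# tasks write one after another in rank order; Write_at_all is collective
offset = comm.Get_rank() * buffer.nbytes
fh.Write_at_all(offset, buffer)

fh.Close()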

You can find an example in mpiio.collective.py and its corresponding mpiio.collective-slurm.sh.

Access patterns

Access patterns can greatly impact the performance of the I/O. They are usually described in terms of four levels, level 0 to level 3.

In summary, MPI-IO requires some special care but is worth a closer look if your code has large data access requirements. Level 3 access generally gives the best performance.

Application topologies

MPI tasks have no favoured orientation or priority; however, it is possible to map tasks onto a virtual topology. The topologies MPI provides are Cartesian (a regular grid) and graph topologies.

The functions used are MPI_Cart_create, MPI_Cart_coords, MPI_Cart_rank and MPI_Cart_shift. In mpi4py, a Cartesian topology is created with:

# axis here is the number of tasks along each dimension of a square grid
cart = comm.Create_cart(dims=(axis, axis),
                        periods=(False, False), reorder=True)

Then the methods are applied to the cart object.

cart.Get_topo(...)
cart.Get_cart_rank(...)
cart.Get_coords(...)
cart.Shift(...)

See: help(MPI.Cartcomm)

There is an example cartesian.py.
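
As an illustrative sketch of how these pieces fit together (not necessarily identical to cartesian.py), a 2D grid could be set up like this:

from mpi4py import MPI

comm = MPI.COMM_WORLD

# let MPI pick a balanced 2D grid for the number of tasks available
dims = MPI.Compute_dims(comm.Get_size(), 2)
cart = comm.Create_cart(dims=dims, periods=(False, False), reorder=True)

rank = cart.Get_rank()
coords = cart.Get_coords(rank)

# ranks of the neighbours one step away along the first dimension;
# MPI.PROC_NULL is returned at a non-periodic boundary
source, dest = cart.Shift(0, 1)

print(f"rank {rank}: coords {coords}, neighbours {source} and {dest}")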

Hybrid MPI and OpenMP

With Python this is trickier (but not impossible). You can create threads in Python with the threading module (or additional processes with the multiprocessing module). The MPI implementation you use needs to support the threading method you want.

Broadly, the approach you choose depends on how the threads need to interact with MPI. MPI describes its level of threading support with the following values:

  • MPI_THREAD_SINGLE - only one thread will execute MPI commands.

  • MPI_THREAD_FUNNELED - the process may be multi-threaded but only the main thread will make MPI calls.

  • MPI_THREAD_SERIALIZED - the process may be multi-threaded and multiple threads may make MPI calls, but only one at a time.

  • MPI_THREAD_MULTIPLE - multiple threads may call MPI, with no restrictions.
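
For example, a funnelled-style sketch in Python, where worker threads only compute and the main thread makes all MPI calls (note that CPython's GIL limits what pure-Python threads can achieve):

from mpi4py import MPI
import threading
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.arange(8, dtype='d') + rank
results = np.empty_like(data)

def work(i):
    # worker threads only compute; they make no MPI calls
    results[i] = data[i] ** 2

threads = [threading.Thread(target=work, args=(i,)) for i in range(len(data))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# only the main thread communicates, so MPI_THREAD_FUNNELED would be enough
total = comm.reduce(results.sum(), op=MPI.SUM, root=0)
if rank == 0:
    print("total:", total)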

mpi4py calls MPI_Init_thread requesting MPI_THREAD_MULTIPLE. The MPI implementation tries to fulfil this, but it may provide a different (lower) level of threading support, as close to the requested level as it can.
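
If you need to know the level that was actually provided, mpi4py exposes MPI.Query_thread(); a minimal check might look like:

from mpi4py import MPI

# the level of thread support the MPI library actually provided
provided = MPI.Query_thread()

if provided < MPI.THREAD_MULTIPLE:
    print("full multi-threaded MPI is not available")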

This really only becomes a concern when writing MPI code in C or Fortran.

Key Points

  • MPI can deliver efficient disk I/O if the I/O is designed appropriately.

  • Providing MPI with some knowledge of topology can make MPI do all the hard work.

  • The different ways threads can work with MPI dictates how you can use MPI and OpenMP together.