Profiling Single Core Performance

Overview

Teaching: 20 min
Exercises: 40 min

Questions

How do I find the portion of a code snipped that consumes the longest time?

Objectives

Perform an estimation of pi using only one CPU core.

Measure the run time of the serial implementation for this estimate of pi.

Find the line of code in a python program that took the longest.

Lola Lazy is a researcher learning to use an HPC system. As an introductory task her supervisor has tasked her with producing a high precision estimate of Pi. Even though calculating Pi can be considered a solved problem, this piece of code is used at the institute to benchmark new hardware. So far, the institute has only acquired larger single machines for each lab to act as work horse per group. But currently, need for distributed computations has arisen and hence a distributed code is needed, that yields both simplicity, efficiency and scalability.

The algorithm was pioneered by Georges-Louis Leclerc de Buffon in 1733.

Estimating Pi with Buffon's needle

Overlay a unit square over a quadrant of a circle. Throw m random number pairs and count how many of the pairs lie inside the circle (the number pairs inside the circle is denoted by n). Pi is then approximated by:

     4*m
Pi = ---
      n

The implementation of this algorithm using total_count random number pairs in a nutshell is given in the program below:

import numpy

np.random.seed(2017)

def inside_circle(total_count):
   
    x = np.float32(np.random.uniform(size=total_count))
    y = np.float32(np.random.uniform(size=total_count))

    radii = np.sqrt(x*x + y*y)

    filtered = np.where(radii<=1.0)
    count = len(radii[filtered])
   
    return count
   
def estimate_pi(total_count):

    count = inside_circle(total_count)
    return (4.0 * count / total_count)

For generating pseudo-random numbers, we sample the uniform probability distribution in the default floating point interval from 0 to 1. The sqrt step is not required directly, but Lola includes it here for clarity. numpy.where is used obtain the list of indices that correspond to radii which are equal or smaller than 1.0. At last, this list of indices is used to filter-out the numbers in the radii array and obtain its length, which is the number Lola are after.

Lola finishes writing the pi estimation and comes up with a small python script, that she can launch from the command line:

$ module load python/3.7.0
$ python3 ./serial_numpi.py 1000000000

[serial version] required memory 11444.092 MB
[serial version] pi is 3.141557 from 1000000000 samples

She must admit that the application takes quite long to finish. Yet another reason to use a cluster or any other remote resource for these kind of applications that take quite a long time. But not everyone has a cluster at his or her disposal. So she decides to parallelize this algorithm first so that it can exploit the number cores that each machine on the cluster or even her laptop has to offer.

Premature Optimisation is the root of all evil!

Before venturing out and trying to accelerate a program, it is utterly important to find the hot spots of it by means of measurements. For the sake of this tutorial, we use the Line_Profiler of python. Your language of choice most likely has similar utilities.

to install the profiler, please issue the following commands. These commands work around a bug in the interaction between the Line Profiler and the Python modules used by Supercomputing Wales. To install the profiler on your own comptuer simply run pip3 install line_profiler.

$ module load python/3.7.0

# make a directory for the python libraries
$ mkdir -p ~/.local/lib/python3.7/site-packages/
$ mkdir -p ~/.local/bin

# install cython module
pip3 install --user cython 
export PATH=$PATH:~/.local/bin

# download the code for line profiler
$ git clone https://github.com/rkern/line_profiler.git

# install line profiler in ~/.local
$ find line_profiler -name '*.pyx' -exec cython {} \;
$ cd line_profiler/
$ python3 setup.py install --prefix=~/.local

The profiler can be started with the command

$ ~/.local/bin/kernprof

Note that on systems where pip can install to system directories the kernprof (or kernprof-3) command will be available without specifying a path.

Profilers

Each programming language typically offers some open-source and/or free tools on the web, with which you can profile your code. Here are some examples of tools. Note though, depending on the nature of the language of choice, the results can be hard or easy to interpret. In the following we will only list open and free tools:

python: line_profiler, prof

java script: firebug

ruby: ruby-prof

C/C++: xray, perf,

R: profvis

Next, you have to annotate your code in order to indicate to the profiler what you want to profile. For this, we add the @profile annotation to a function definition of our choice. If we don’t do this, the profiler will do nothing. So let’s refactor our code a little bit:

def main():

    parser = argparse.ArgumentParser(description='Estimate Pi using a Monte Carlo method.')
    parser.add_argument('n_samples', metavar='N', type=int, nargs=1,
                        default=10000,
                        help='number of times to draw a random number')

    args = parser.parse_args()

    n_samples = args.n_samples[0]
    my_pi = estimate_pi(n_samples)
    sizeof = np.dtype(np.float32).itemsize

    print("[serial version] required memory %.3f MB" % (n_samples*sizeof*3/(1024*1024)))
    print("[serial version] pi is %f from %i samples" % (my_pi,n_samples))

    sys.exit(0)


if __name__=='__main__':
    main()

With this trick, we can make sure that we profile the entire application. Note, that this is a necessity when using line_profiler. We can now carry on, and annotate the main function.

...
@profile
def main():
  ...

Let’s save this to serial_numpi_annotated.py. After this is done, the profiler is run with a reduced input parameter that does take only about 2-3 seconds:

$ ~/.local/bin/kernprof -l ./serial_numpi.py 50000000
[serial version] required memory 572.205 MB
[serial version] pi is 3.141728 from 50000000 samples
Wrote profile results to serial_numpi_annotated.py.lprof

You can see that the profiler just adds one line to the output, i.e. the last line. In order to view, the output we can use the line_profile module in python:

$ python3 -m line_profiler serial_numpi_profiled.py.lprof
Timer unit: 1e-06 s

Total time: 2.07893 s
File: ./serial_numpi_profiled.py
Function: main at line 24

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    24                                           @profile
    25                                           def main():
    26         1            2      2.0      0.0      n_samples = 10000
    27         1            1      1.0      0.0      if len(sys.argv) > 1:
    28         1            3      3.0      0.0          n_samples = int(sys.argv[1])
    29                                          
    30         1      2078840 2078840.0    100.0      my_pi = estimate_pi(n_samples)
    31         1           11     11.0      0.0      sizeof = np.dtype(np.float32).itemsize
    32                                          
    33         1           50     50.0      0.0      print("[serial version] required memory %.3f MB" % (n_samples*sizeof*3/(1024*1024)))
    34         1           23     23.0      0.0      print("[serial version] pi is %f from %i samples" % (my_pi,n_samples)

Aha, as expected the function that consumes 100% of the time is estimate_pi. So let’s remove the annotation from main and move it to estimate_pi:

    return count

@profile
def estimate_pi(total_count):

    count = inside_circle(total_count)
    return (4.0 * count / total_count)

def main():
    n_samples = 10000
    if len(sys.argv) > 1:

And run the same cycle of record and report:

$ kernprof-3 -l ./serial_numpi_annotated.py 50000000
[serial version] required memory 572.205 MB
[serial version] pi is 3.141728 from 50000000 samples
Wrote profile results to serial_numpi_annotated.py.lprof
$ python3 -m line_profiler serial_numpi_profiled.py.lprof
Timer unit: 1e-06 s

Total time: 2.0736 s
File: ./serial_numpi_profiled.py
Function: estimate_pi at line 19

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    19                                           @profile
    20                                           def estimate_pi(total_count):
    21                                          
    22         1      2073595 2073595.0    100.0      count = inside_circle(total_count)
    23         1            5      5.0      0.0      return (4.0 * count / total_count)

Ok, one function to consume it all! So let’s rinse and repeat again and annotate only inside_circle.

@profile
def inside_circle(total_count):

    x = np.float32(np.random.uniform(size=total_count))
    y = np.float32(np.random.uniform(size=total_count))

    radii = np.sqrt(x*x + y*y)

    filtered = np.where(radii<=1.0)
    count = len(radii[filtered])

    return count

And run the profiler again:

$ kernprof-3 -l ./serial_numpi_annotated.py 50000000
[serial version] required memory 572.205 MB
[serial version] pi is 3.141728 from 50000000 samples
Wrote profile results to serial_numpi_annotated.py.lprof
$ python3 -m line_profiler serial_numpi_profiled.py.lprof
Timer unit: 1e-06 s

Total time: 2.04205 s
File: ./serial_numpi_profiled.py
Function: inside_circle at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     7                                           @profile
     8                                           def inside_circle(total_count):
     9                                          
    10         1       749408 749408.0     36.7      x = np.float32(np.random.uniform(size=total_count))
    11         1       743129 743129.0     36.4      y = np.float32(np.random.uniform(size=total_count))
    12                                          
    13         1       261149 261149.0     12.8      radii = np.sqrt(x*x + y*y)
    14                                          
    15         1       195070 195070.0      9.6      filtered = np.where(radii<=1.0)
    16         1        93290  93290.0      4.6      count = len(radii[filtered])
    17                                          
    18         1            2      2.0      0.0      return count

So generating the random numbers appears to be the bottleneck as it accounts for 37+36=73% of the total runtime time. So this is a prime candidate for acceleration.

Word count

Download this python script to your current directory. Run it by executing:

>>>>>> upstream/gh-pages
$ python3 count_pylibs.py
4231827 characters and 418812 words found in standard python libs

Use the line_profile module to find the hot spot in this program!

Solution

Timer unit: 1e-06 s

Total time: 0.334168 s
File: ./count_pylibs_annotated.py
Function: main at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           @profile
    39                                           def main():
    40                                          
    41         1        63994  63994.0     19.2      text = load_text()
    42         1            5      5.0      0.0      nchars = len(text)
    43         1       270108 270108.0     80.8      nwords = word_count(text)
    44         1           53     53.0      0.0      print("%i characters and %i words found in standard python lib" % (nchars, nwords))
    45                                          
    46         1            2      2.0      0.0      if len(text):
    47         1            6      6.0      0.0          sys.exit(0)
    48                                               else:
    49                                                   sys.exit(1)

The word_count function takes the longest time. Inside it, re.split hogs runtime the most.

Faster is always better, right?

Pair up and discuss the implementation of count_pylibs_annotated.py from the previous exercise. Discuss and answer the following points:

Find other ways to implement the word count without parallelizing the code!

For every alternative implementation found in 1., check the output of the program. Did the number of words change? Could such a check be automated?

Compare the runtimes that you achieved throughout this exercise. Was your time worth it?

Solution

One could use the ‘count’ method of strings and just count the words by using the number of spaces and add 1. Based on the assumption that our input text consists of sentences only, this can provide an estimate of the words. In this case, this is far from accurate as source code contains tabs, newlines etc to distinguish words. However, the count method improves runtime by more than one order of magniture in comparison to the re.split approach. Take away: Real life is a balance. But above all, your code should produce the correct result.

Key Points

Each programming language typically provides tools called profilers with which you can analyse the runtime of your code.

The estimate of pi spends most of it’s time while generating random numbers.

The estimation of pi with the Monte Carlo method is a compute bound problem because pseudo-random numbers are just algorithms.

previous episode

Introduction to High Performance Computing for Supercomputing Wales

next episode

Profiling Single Core Performance

Overview

Premature Optimisation is the root of all evil!

Profilers

Word count

Solution

Faster is always better, right?

Solution

Key Points

previous episode

next episode