Intro to Parallel Programming with OpenMP

Used resources:

Tim Mattson, lectures on YouTube
Further reading at openmp.org: intro

OpenMP in a nutshell

OpenMP is a library that supports shared memory multiprocessing. The OpenMP programming model is SMP (symmetric multi-processors, or shared-memory processors): that means when programming with OpenMP all threads share memory and data.

Parallel code with OpenMP marks, through a special directive, sections to be executed in parallel. The part of the code that’s marked to run in parallel will cause threads to form. The main tread is the master thread. The slave threads all run in parallel and run the same code. Each thread executes the parallelized section of the code independently. When a thread finishes, it joins the master. When all threads finished, the master continues with code following the parallel section.

Each thread has an ID attached to it that can be obtained using a runtime library function (called omp_get_thread_num()). The ID of the master thread is 0.

OpenMP supports C, C++ and Fortran.

The OpenMP functions are included in a header file called

The OpenMP parts in the code are specified using #pragmas

OpenMP has directives that allow the programmer to:

specify the parallel region (create threads)
specify how to parallelize loops
specify the scope of the variables in the parallel section (private and shared)
specify if the threads are to be synchronized
specify how the works is divided between threads (scheduling)

OpenMP hides the low-level details and allows the programmer to describe the parallel code with high-level constructs, which is as simple as it can get.

Compiling and running OpenMP code

The public linux machines dover and foxcroft have gcc/g++ installed with OpenMP support. All you need to do is use the -fopenmp flag on the command line:

gcc -fopenmp hellosmp.c  -o  hellosmp

It’s also pretty easy to get OpenMP to work on a Mac. A quick search with google reveals that the native apple compiler clang is installed without openmp support. When you installed gcc it probably got installed without openmp support. To test, go to the terminal and try to compile something:

gcc -fopenmp hellosmp.c  -o  hellosmp

If you get an error message saying that “omp.h” is unknown, that mans your compiler does not have openmp support.

hellosmp.c:12:10: fatal error: 'omp.h' file not found
#include 
         ^
1 error generated.
make: *** [hellosmp.o] Error 1

Here’s what I did: 1. I installed Homebrew, the missing package manager for MacOS, http://brew.sh/index.html

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

2. Then I asked brew to install gcc:

 brew install gcc

3. Then type ‘gcc’ and press tab; it will complete with all the versions of gcc installed:

$gcc
gcc           gcc-6         gcc-ar-6      gcc-nm-6      gcc-ranlib-6  gccmakedep

4. The obvious guess here is that gcc-6 is the latest version, so I use it to compile:

gcc-6  -fopenmp hellosmp.c

Works!

Specifying the parallel region (creating threads)

The basic directive is:

#pragma omp parallel 
{


}

This is used to fork additional threads to carry out the work enclosed in the block following the #pragma construct. The block is executed by all threads in parallel. The original thread will be denoted as master thread with thread-id 0.

Example (C program): Display "Hello, world." using multiple threads.

#include < stdio.h >

int main(void)
{
    #pragma omp parallel
    {
    printf("Hello, world.\n");
    }

  return 0;
}

Use flag -fopenmp to compile using gcc:

$ gcc -fopenmp hello.c -o hello

Output on a computer with two cores, and thus two threads:

Hello, world.
Hello, world.

On dover, I got 24 hellos, for 24 threads. On my desktop I get (only) 8. How many do you get?

Note that the threads are all writing to the standard output, and there is a race to share it. The way the threads are interleaved is completely arbitrary, and you can get garbled output:

Hello, wHello, woorld.
rld.

Private and shared variables

In a parallel section variables can be private (each thread owns a copy of the variable) or shared among all threads. Shared variables must be used with care because they cause race conditions.

shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.

The type of variables is specified following the #pragma omp

Example:

int main (int argc, char *argv[]) {

  int th_id, nthreads;
  
  
#pragma omp parallel private(th_id)
  //  th_id is declared above.  It is is specified as private; so each thread will have its own copy of th_id 
  {
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n", th_id);
  }

Sharing variables is sometimes what you want, other times its not, and can lead to race conditions. Put differently, some variables need to be shared, some need to be private, and you the programmer have to specify what you want.

Synchronization

OpenMP lets you specify how to synchronize the threads. Here’s what’s available:

critical: the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.
atomic: the memory update (write, or read-modify-write) in the next instruction will be performed atomically. It does not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware instructions for better performance than when using critical.
ordered: the structured block is executed in the order in which iterations would be executed in a sequential loop
barrier: each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.
nowait: specifies that threads completing assigned work can proceed without waiting for all threads in the team to finish. In the absence of this clause, threads encounter a barrier synchronization at the end of the work sharing construct.

More on barriers: If we wanted all threads to be at a specific point in their execution before proceeding, we would use a barrier. A barrier basically tells each thread, "wait here until all other threads have reached this point...".

Barrier example:

int main (int argc, char *argv[]) {

  int th_id, nthreads;
  
#pragma omp parallel private(th_id)
    {
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n", th_id);

#pragma omp barrier        <----------- master waits until all threads finish before printing 
    if ( th_id == 0 ) {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n",nthreads);
    }
  }

}//main

Note above the function omp_get_num_threads(). Can you guess what it’s doing? Some other runtime functions are:

omp_get_num_threads
omp_get_num_procs
omp_set_num_threads
omp_get_max_threads

Parallelizing loops

Parallelizing loops with OpenMP is straightforward. One simply denotes the loop to be parallelized and a few parameters, and OpenMP takes care of the rest. Can't be easier!

The directive is called a work-sharing construct:

#pragma omp for 
//specify a for loop to be parallelized; no curly braces

The “#pragma omp for” distributes the loop among the threads. It must be used inside a parallel block:

#pragma omp parallel 
{
…
#pragma omp for 
//for loop to parallelize 
…
}//end of parallel block

Example:

//compute the sum of two arrays in parallel 
#include < stdio.h >
#include < omp.h >
#define N 1000000
int main(void) { 
  float a[N], b[N], c[N];
  int i;

  /* Initialize arrays a and b */
  for (i = 0; i < N; i++) {
    a[i] = i * 2.0;
    b[i] = i * 3.0;
  }

 /* Compute values of array c = a+b in parallel. */
  #pragma omp parallel shared(a, b, c) private(i)
  { 
    #pragma omp for             
    for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
      printf ("%f\n", c[10]);
    }
  }
}

Another example (here): adding all elements in an array.

//example4.c: add all elements in an array in parallel  
#include < stdio.h > 


int main() {

  const int N=100; 
  int a[N]; 

  //initialize 
  for (int i=0; i < N; i++)
    a[i] = i; 

  //compute sum 
  int local_sum, sum; 
#pragma omp parallel private(local_sum) shared(sum) 
  { 
    local_sum =0; 
    
    //the array is distributde statically between threads
#pragma omp for schedule(static,1) 
    for (int i=0; i< N; i++) {
      local_sum += a[i]; 
    }

    //each thread calculated its local_sum. ALl threads have to add to
    //the global sum. It is critical that this operation is atomic.

#pragma omp critical 
    sum += local_sum;
  } 


  printf("sum=%d should be %d\n", sum, N*(N-1)/2);
}

There exists also a “parallel for” directive which combines a parallel and a for (no need to nest a for inside a parallel):

int main(int argc, char **argv)
{
    int a[100000];

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        a[i] = 2 * i;
	printf(“assigning i=%d\n”);
    }

    return 0;
}

Exactly how the iterations are assigned to ecah thread, that is specified by the schedule (see below). Note:Since variable i is declared inside the parallel for, each thread will have its own private version of i.

Loop scheduling

OpenMP lets you control how the threads are scheduled. The type of schedule available are:

static: Each thread is assigned a chunk of iterations in fixed fashion (round robin). The iterations are divided among threads equally. Specifying an integer for the parameter chunk will allocate chunk number of contiguous iterations to a particular thread. Note: is this the default? check.
dynamic: Each thread is initialized with a chunk of threads, then as each thread completes its iterations, it gets assigned the next set of iterations. The parameter chunk defines the number of contiguous iterations that are allocated to a thread at a time.
guided: Iterations are divided into pieces that successively decrease exponentially, with chunk being the smallest size.

This is specified by appending schedule(type, chunk) after the pragma for directive:

#pragma omp for schedule(static, 5)

More complex directives

...which you probably won't need.

can define “sections” inside a parallel block
can request that iterations of a loop are executed in order
specify a block to be executed only by the master thread
specify a block to be executed only by the first thread that reaches it
define a section to be “critical”: will be executed by each thread, but can be executed only by a single thread at a time. This forces threads to take turns, not interrupt each other.
define a section to be “atomic”: this forces threads to write to a shared memory location in a serial manner to avoid race conditions

#include < stdio.h >
#include < omp.h >

int main(void) {
  int count = 0;
  #pragma omp parallel shared(count)
  {
     #pragma omp atomic
     count++; // count is updated by only a single thread at a time
  }
  printf_s("Number of threads: %d\n", count);
}

Performance considerations

Critical sections and atomic sections serialize the execution and eliminate the concurrent execution of threads. If used unwisely, OpenMP code can be worse than serial code because of all the thread overhead.

Some comments

OpenMP is not magic. A loop must be obviously parallelizable in order for OpenMP to unroll it and facilitate the assignment of iterations among threads. If there are any data dependencies from one iteration to the next, then OpenMP can't parallelize it.

The for loop cannot exit early, for example:

// BAD - can;t parallelize with OpenMP
for (int i=0;i < 100; i++) {
  if (i > 50)
      break; <----- breaking when i greater than 50
}

Values of the loop control expressions must be the same for all iterations of the loop. For example:

// BAD - can;t parallelize with OpenMP
for (int i=0;i < 100; i++) {
  if (i == 50)
      i = 0; 
}