PWR056: Consider applying offloading parallelism to scalar reduction loop
Issue
The loop containing the scalar reduction pattern can be sped up by offloading it to an accelerator.
Actions
Implement a version of the scalar reduction loop using an Application Programming Interface (API) that enables offloading to accelerators. Codee assists the programmer by providing source code rewriting capabilities using OpenMP and OpenACC compiler directives.
Relevance
Offloading a loop to an accelerator is one way to speed it up. Accelerators offer huge computational power, but writing code for them is not straightforward. Essentially, the programmer must explicitly manage the data transfers between the host and the accelerator, specify how to execute the loop in parallel on the accelerator, and add the appropriate synchronization to avoid race conditions at runtime.
Typically, minimizing the computational overhead of offloading is the biggest challenge in speeding up code with accelerators.
Offloading scalar reduction loops incurs an overhead due to the synchronization needed to avoid race conditions and ensure the correctness of the code. Note that appropriate data scoping of shared and private variables is still required.
Code example
C
double example(double *A, int n) {
  double sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += A[i];
  }
  return sum;
}
The loop body has a scalar reduction pattern, meaning that each iteration of the loop reduces its computational result to a single value; in this case, sum. Thus, any two iterations of the loop executing concurrently can potentially update the value of the scalar sum at the same time. This creates a potential race condition that must be handled through appropriate synchronization.
The code snippet below shows an implementation that uses OpenACC compiler directives to offload the loop to an accelerator. Note how the reduction clause provides the synchronization needed to avoid race conditions, while the data transfer clauses manage the data movement between the host memory and the accelerator memory:
double example(double *A, int n) {
  double sum = 0;
  #pragma acc data copyin(A[0:n], n) copy(sum)
  #pragma acc parallel
  #pragma acc loop reduction(+: sum)
  for (int i = 0; i < n; ++i) {
    sum += A[i];
  }
  return sum;
}
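Since the Actions section also mentions OpenMP, here is a minimal sketch of an equivalent version using OpenMP target offloading (assuming a compiler with OpenMP 4.5+ offloading support; the exact clauses may need tuning for a given compiler and device). The map clauses handle the host-accelerator data transfers, and the reduction clause provides the synchronization of the partial sums:

double example(double *A, int n) {
  double sum = 0;
  // map(to:) copies A to the device; map(tofrom: sum) copies the
  // reduction result back; reduction(+: sum) synchronizes the updates
  #pragma omp target teams distribute parallel for \
      map(to: A[0:n]) map(tofrom: sum) reduction(+: sum)
  for (int i = 0; i < n; ++i) {
    sum += A[i];
  }
  return sum;
}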
Fortran
function example(A) result(sum)
  implicit none
  real(kind=8), intent(in) :: A(:)
  real(kind=8) :: sum
  integer :: i
  sum = 0.0
  do i = 1, size(A, 1)
    sum = sum + A(i)
  end do
end function example
The loop body has a scalar reduction pattern, meaning that each iteration of the loop reduces its computational result to a single value; in this case, sum. Thus, any two iterations of the loop executing concurrently can potentially update the value of the scalar sum at the same time. This creates a potential race condition that must be handled through appropriate synchronization.
The code snippet below shows an implementation that uses OpenACC compiler directives to offload the loop to an accelerator. Note how the reduction clause provides the synchronization needed to avoid race conditions, while the data transfer clauses manage the data movement between the host memory and the accelerator memory:
function example(A) result(sum)
  implicit none
  real(kind=8), intent(in) :: A(:)
  real(kind=8) :: sum
  integer :: i
  sum = 0.0
  !$acc data copyin(A) copy(sum)
  !$acc parallel
  !$acc loop reduction(+: sum)
  do i = 1, size(A, 1)
    sum = sum + A(i)
  end do
  !$acc end parallel
  !$acc end data
end function example
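As with the C example, the same loop can also be offloaded with OpenMP target directives. A minimal sketch, assuming a compiler with OpenMP 4.5+ offloading support; the map clauses move the data and the reduction clause synchronizes the partial sums:

function example(A) result(sum)
  implicit none
  real(kind=8), intent(in) :: A(:)
  real(kind=8) :: sum
  integer :: i
  sum = 0.0
  ! map(to:) copies A to the device; map(tofrom: sum) copies the
  ! reduction result back; reduction(+: sum) synchronizes the updates
  !$omp target teams distribute parallel do map(to: A) map(tofrom: sum) reduction(+: sum)
  do i = 1, size(A, 1)
    sum = sum + A(i)
  end do
  !$omp end target teams distribute parallel do
end function example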