PWR026: Annotate function for OpenMP offload

Issue

An OpenMP offload version of the function can be generated by the compiler.

Actions

Annotate the function with the OpenMP declare target directive.

Relevance

When a loop is offloaded to the GPU, the compiler translates the relevant code sections into GPU instructions and embeds them in the main CPU program. At run time, the OpenMP runtime launches these sections on the GPU and moves data between CPU and GPU memory.

If an offloaded loop calls functions, GPU versions of those functions must be generated as well. To achieve this, mark the relevant functions with the OpenMP declare target directive. If this isn't done, the CPU version will be called instead. This results in significant performance issues because the computation data must be moved from the GPU to the CPU to execute the function, and then back to the GPU once the function returns.
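One way to observe where a function called from an offloaded region actually executes is the OpenMP runtime routine omp_is_initial_device(), which returns a nonzero value on the host. The following is a minimal sketch, not part of the original check: the helper names runs_on_host and target_region_ran_on_host are hypothetical, and the _OPENMP guards are an assumption added so that the code also builds when OpenMP is disabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: nonzero when executed on the host, 0 on a device.
   It is marked declare target so a device version can be generated. */
#pragma omp declare target
int runs_on_host(void) {
#ifdef _OPENMP
  return omp_is_initial_device();
#else
  return 1; /* Without OpenMP, everything runs on the host */
#endif
}
#pragma omp end declare target

/* Hypothetical helper: runs a target region and reports where it executed. */
int target_region_ran_on_host(void) {
  int on_host = 1;
  /* Map the flag explicitly so the device result is copied back */
  #pragma omp target map(tofrom: on_host)
  {
    on_host = runs_on_host();
  }
  return on_host;
}
```

If no device is available, or the program is built without offloading support, the target region falls back to the host and the helper reports a nonzero value.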

Code example

C

While the loop below is correctly offloaded to the GPU, the function foo() isn't marked for GPU execution. Consequently, the compiler will only create a CPU version of foo(). This means every time foo() is called, the data must be transferred between the GPU and the CPU, negatively impacting performance:

__attribute__((const)) int foo(int a) {
  return 2 * a;
}

void example(int n, int *A) {
  #pragma omp target teams distribute parallel for default(none) shared(A, n)
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}

To prevent the performance loss caused by constant data transfers, add the #pragma omp declare target directive to foo(). This will instruct the compiler to create a GPU version of the function, allowing the loop to run entirely on the device:

#pragma omp declare target
__attribute__((const)) int foo(int a) {
  return 2 * a;
}
#pragma omp end declare target

void example(int n, int *A) {
  #pragma omp target teams distribute parallel for default(none) shared(A, n)
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}
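When the function is declared in one place (for example a header) and defined in another, the list form of the directive can name it after its declaration instead of wrapping the definition. The sketch below illustrates this under stated assumptions: the names scale_by_two and fill_scaled are hypothetical, and the explicit map clause is an addition of this sketch that makes the transfer of the output array visible.

```c
/* Prototype as it might appear in a header (hypothetical name) */
int scale_by_two(int a);

/* List form: marks an already-declared function for device compilation */
#pragma omp declare target(scale_by_two)

int scale_by_two(int a) {
  return 2 * a;
}

/* Hypothetical caller: fills A[0..n-1] with 2 * i inside an offloaded loop */
void fill_scaled(int n, int *A) {
  /* map(from:) copies the n results back to the host after the loop */
  #pragma omp target teams distribute parallel for map(from: A[0:n])
  for (int i = 0; i < n; i++) {
    A[i] = scale_by_two(i);
  }
}
```

Built without OpenMP support, the pragmas are ignored and the same code runs sequentially on the CPU, which is a convenient way to cross-check results.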

Fortran

While the loop below is correctly offloaded to the GPU, the function foo() isn't marked for GPU execution. Consequently, the compiler will only create a CPU version of foo(). This means every time foo() is called, the data must be transferred between the GPU and the CPU, negatively impacting performance:

subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$omp target teams distribute parallel do default(none) private(i) shared(A)
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do

contains

  pure integer function foo(a)
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example

To prevent the performance loss caused by constant data transfers, add the !$omp declare target directive to foo(). This will instruct the compiler to create a GPU version of the function, allowing the loop to run entirely on the device:

subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$omp target teams distribute parallel do default(none) private(i) shared(A)
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do

contains

  pure integer function foo(a)
    !$omp declare target
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example

References