PWR026: Annotate function for OpenMP offload
Issue
An OpenMP offload version of the function can be generated by the compiler.
Actions
Annotate the function with the OpenMP declare target directive.
Relevance
When a loop is offloaded to the GPU, the compiler translates the relevant code sections into GPU instructions, which are embedded into the main CPU program. The runtime executes these code sections on the GPU and handles data movement between the CPU and GPU memories.
If an offloaded loop calls functions, GPU versions of those functions must be
generated as well. To achieve this, mark the relevant functions with the OpenMP
declare target directive. If this isn't done, the CPU version will be called
instead. This results in significant performance issues because the computation
data must be moved from the GPU to the CPU to execute the function, and then
back to the GPU once the function returns.
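The requirement extends to the whole call chain: a function called from the offloaded loop, and any helper it calls in turn, both need device versions. The following is a minimal sketch of that situation, not code from this catalog entry. The names shift(), scale_and_shift() and fill() are hypothetical, and the map clause is an assumption added so that the output array is copied back to the host.

```c
/* Everything between the two directives is compiled for both the host
 * and the device. */
#pragma omp declare target
/* Hypothetical helper, called only from scale_and_shift(). */
static int shift(int x) { return x + 1; }

/* Hypothetical function called directly from the offloaded loop. */
int scale_and_shift(int x) { return shift(2 * x); }
#pragma omp end declare target

/* Fills out[0..n-1] with 2*i + 1. The map clause (an assumption of this
 * sketch) copies the result back to the host when the region ends. */
void fill(int n, int *out) {
  #pragma omp target teams distribute parallel for map(from: out[0:n])
  for (int i = 0; i < n; i++) {
    out[i] = scale_and_shift(i);
  }
}
```

Because the sketch relies only on directives, it also compiles and produces the same values when OpenMP is disabled, which makes it easy to compare host and device results.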
Code example
C
While the loop below is correctly offloaded to the GPU, the function foo()
isn't marked for GPU execution. Consequently, the compiler will only create a
CPU version of foo(). This means every time foo() is called, the data must
be transferred between the GPU and the CPU, negatively impacting performance:
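// foo() has no declare target annotation, so no device version is requested
__attribute__((const)) int foo(int a) {
  return 2 * a;
}

void example(int n, int *A) {
  // map(from: ...) copies the n results back to the host after the loop
  #pragma omp target teams distribute parallel for default(none) shared(A, n) \
      map(from: A[0:n])
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}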
To prevent the performance loss caused by constant data transfers, add the
#pragma omp declare target directive to foo(). This will instruct the
compiler to create a GPU version of the function, allowing the loop to run
entirely on the device:
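// Functions between the two directives are also compiled for the device
#pragma omp declare target
__attribute__((const)) int foo(int a) {
  return 2 * a;
}
#pragma omp end declare target

void example(int n, int *A) {
  // map(from: ...) copies the n results back to the host after the loop
  #pragma omp target teams distribute parallel for default(none) shared(A, n) \
      map(from: A[0:n])
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}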
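When wrapping the definition between a pair of directives is inconvenient, the same annotation can also be written in clause form, naming the function after it has been declared. The sketch below is a variant of the example above under a few assumptions: it uses the to clause from OpenMP 4.5 (newer revisions of the specification spell this clause enter), it drops the compiler-specific attribute for brevity, and it adds a map clause so the results are copied back to the host.

```c
/* Plain definition, with no directive wrapped around it. */
int foo(int a) {
  return 2 * a;
}

/* Clause form: request a device version of the already declared foo().
 * OpenMP 4.5 spelling; later specification revisions use enter(foo). */
#pragma omp declare target to(foo)

void example(int n, int *A) {
  /* map(from: ...) is an assumption of this sketch: it copies the n
   * results back to the host once the offloaded loop finishes. */
  #pragma omp target teams distribute parallel for map(from: A[0:n])
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}
```

Both spellings ask the compiler for the same thing, so the choice is mostly a matter of which one fits the layout of the source file better.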
Fortran
While the loop below is correctly offloaded to the GPU, the function foo()
isn't marked for GPU execution. Consequently, the compiler will only create a
CPU version of foo(). This means every time foo() is called, the data must
be transferred between the GPU and the CPU, negatively impacting performance:
subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$omp target teams distribute parallel do default(none) private(i) shared(A)
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do

contains

  pure integer function foo(a)
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example
To prevent the performance loss caused by constant data transfers, add the
!$omp declare target directive to foo(). This will instruct the compiler to
create a GPU version of the function, allowing the loop to run entirely on the
device:
subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$omp target teams distribute parallel do default(none) private(i) shared(A)
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do

contains

  pure integer function foo(a)
    !$omp declare target
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example