PWR027: Annotate function for OpenACC offload
Issue
An OpenACC offload version of the function can be generated by the compiler.
Actions
Annotate the function with the OpenACC routine directive.
Relevance
When a loop is offloaded to the GPU, the compiler translates the relevant code sections into GPU-friendly instructions, which are embedded into the main CPU program. The runtime is in charge of executing these code sections on the GPU and handles data movement between the CPU and GPU memories.
If an offloaded loop calls functions, GPU versions of those functions must be
generated as well. To achieve this, mark the relevant functions with the
OpenACC routine directive. If this isn't done, the CPU version will be called
instead. This results in significant performance issues because the computation
data must be moved from the GPU to the CPU to execute the function, and then
back to the GPU once the function returns.
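As a rough sketch of how the directive attaches to a function: in C, an unnamed routine directive applies to the function declaration or definition that immediately follows it, so a function that is declared in one place and defined in another is typically annotated at both. The names scale_value and fill_scaled below are hypothetical and used only for illustration. The seq clause is one of the level-of-parallelism clauses that OpenACC defines for routine (gang, worker, vector, seq); choosing it here assumes the helper contains no parallel loops of its own:

```c
/* Hypothetical helper, declared here (for example in a header) and defined
   further down. The directive precedes the declaration so that callers
   know a device version exists. */
#pragma acc routine seq
int scale_value(int x);

/* Hypothetical caller: the loop is offloaded and calls the helper. */
void fill_scaled(int n, int *out) {
  #pragma acc parallel loop copyout(out[0:n])
  for (int i = 0; i < n; i++) {
    out[i] = scale_value(i);
  }
}

/* The definition repeats the directive so that a device version of the
   function body is actually generated. */
#pragma acc routine seq
int scale_value(int x) {
  return 3 * x;
}
```

When built without OpenACC support, the pragmas are ignored and the same code runs on the CPU, which makes it easy to compare results between the two builds.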
Code example
C
While the loop below is correctly offloaded to the GPU, the function foo()
isn't marked for GPU execution. Consequently, the compiler will only create a
CPU version of foo(). This means every time foo() is called, the data must
be transferred between the GPU and the CPU, negatively impacting performance:
__attribute__((const)) int foo(int a) {
  return 2 * a;
}

void example(int n, int *A) {
  #pragma acc kernels
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}
To prevent the performance loss caused by constant data transfers, add the
#pragma acc routine directive to foo(). This will instruct the compiler to
create a GPU version of the function, allowing the loop to run entirely on the
device:
#pragma acc routine
__attribute__((const)) int foo(int a) {
  return 2 * a;
}

void example(int n, int *A) {
  #pragma acc kernels
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}
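OpenACC also defines a named form of the directive, routine(name), which can be placed wherever a function prototype is allowed and must appear before any definition or use of the named function. The following is a minimal sketch of that variant for foo(), not a recommendation over the unnamed form. It assumes that seq is the appropriate level-of-parallelism clause because foo() contains no loops, and it drops the const attribute purely for brevity:

```c
/* Prototype of foo(); the definition follows further down. */
int foo(int a);

/* Named form: applies to the function called foo that is in scope.
   It is placed before any definition or use of foo(). */
#pragma acc routine(foo) seq

int foo(int a) {
  return 2 * a;
}

void example(int n, int *A) {
  #pragma acc kernels
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}
```

This form can be convenient when the prototype comes from a header that should stay free of pragmas, although the translation unit that defines the function still needs to see a routine directive for a device version to be generated.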
Fortran
While the loop below is correctly offloaded to the GPU, the function foo()
isn't marked for GPU execution. Consequently, the compiler will only create a
CPU version of foo(). This means every time foo() is called, the data must
be transferred between the GPU and the CPU, negatively impacting performance:
subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$acc kernels
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do
  !$acc end kernels

contains

  pure integer function foo(a)
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example
To prevent the performance loss caused by constant data transfers, add the
!$acc routine directive to foo(). This will instruct the compiler to create
a GPU version of the function, allowing the loop to run entirely on the device:
subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$acc kernels
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do
  !$acc end kernels

contains

  pure integer function foo(a)
    !$acc routine
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example