
PWR027: Annotate function for OpenACC offload


The compiler can generate an OpenACC offload version of the function.


Annotate the function with the OpenACC routine directive.


When a loop is offloaded to the GPU, the compiler translates the relevant code sections into GPU instructions, which are embedded into the main CPU program. The runtime executes these code sections on the GPU and handles data movement between CPU and GPU memory.
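To make the data movement described above concrete, here is a minimal sketch that spells out the transfer with a data clause instead of leaving it to the compiler. The function name fill_doubled is hypothetical and not part of the original example; copyout(A[0:n]) is a standard OpenACC data clause that allocates the array section on the device and copies it back to the host when the region ends. Built without OpenACC support, the pragma is ignored and the loop simply runs on the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the original text): stores 2*i in A[i]. */
void fill_doubled(int n, int *A) {
  /* Host-side precondition check, performed before the offloaded region. */
  assert(n >= 0 && (n == 0 || A != NULL));

  /* copyout: allocate A[0..n-1] on the device for this region and copy
     the results back to host memory when the region ends. */
  #pragma acc kernels copyout(A[0:n])
  for (int i = 0; i < n; i++) {
    A[i] = 2 * i;
  }
}
```

Because A is a plain pointer, the compiler cannot always infer how many elements to transfer, so stating the array section explicitly is a common choice in this kind of loop.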

If an offloaded loop calls functions, GPU versions of those functions must be generated as well. To achieve this, mark the relevant functions with the OpenACC routine directive. If this isn't done, the CPU version will be called instead. This results in significant performance issues because the computation data must be moved from the GPU to the CPU to execute the function, and then back to the GPU once the function returns.
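The same requirement applies transitively: a function called from a routine needs a device version too. The sketch below illustrates this under some assumptions: the names scale, scale_and_shift, and fill_shifted are hypothetical, the directive is repeated on the prototypes as one would do in a shared header, and the seq clause (one of the standard parallelism clauses for routine) is a choice made here to state that the function body runs sequentially within a single thread.

```c
#include <assert.h>
#include <stddef.h>

/* Prototypes, as they might appear in a header. Annotating them tells
   callers in other translation units that a device version exists. */
#pragma acc routine seq
int scale(int a);

#pragma acc routine seq
int scale_and_shift(int a);

/* Definitions carry the same directive and the same clause. */
#pragma acc routine seq
int scale(int a) {
  return 2 * a;
}

/* Calls scale(), so scale() must also be a routine. */
#pragma acc routine seq
int scale_and_shift(int a) {
  return scale(a) + 1;
}

/* Hypothetical driver: the offloaded loop calls only annotated functions. */
void fill_shifted(int n, int *A) {
  assert(n >= 0 && (n == 0 || A != NULL));

  #pragma acc kernels copyout(A[0:n])
  for (int i = 0; i < n; i++) {
    A[i] = scale_and_shift(i);
  }
}
```

Without OpenACC enabled at compile time the pragmas are ignored and the code behaves as ordinary C, which makes it easy to check results on the host first.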

Code example


While the loop below is correctly offloaded to the GPU, the function foo() isn't marked for GPU execution. Consequently, the compiler will only create a CPU version of foo(). This means every time foo() is called, the data must be transferred between the GPU and the CPU, negatively impacting performance:

// No OpenACC annotation: only a CPU version of foo() is generated
__attribute__((const)) int foo(int a) {
  return 2 * a;
}

void example(int n, int *A) {
  #pragma acc kernels
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}

To prevent the performance loss caused by constant data transfers, add the #pragma acc routine directive to foo(). This will instruct the compiler to create a GPU version of the function, allowing the loop to run entirely on the device:

// The routine directive requests a GPU version of foo()
#pragma acc routine
__attribute__((const)) int foo(int a) {
  return 2 * a;
}

void example(int n, int *A) {
  #pragma acc kernels
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}


The Fortran version has the same problem. The loop below is correctly offloaded to the GPU, but the internal function foo() isn't marked for GPU execution, so the compiler only creates a CPU version of it. Every call to foo() then forces data to be transferred between the GPU and the CPU, negatively impacting performance:

subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$acc kernels
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do
  !$acc end kernels

contains

  ! No OpenACC annotation: only a CPU version of foo() is generated
  pure integer function foo(a)
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example

To prevent the performance loss caused by constant data transfers, add the !$acc routine directive to foo(). This will instruct the compiler to create a GPU version of the function, allowing the loop to run entirely on the device:

subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$acc kernels
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do
  !$acc end kernels

contains

  pure integer function foo(a)
    ! The routine directive requests a GPU version of foo()
    !$acc routine
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example
