
PWR009: Use OpenMP teams to offload work to GPU

Issue

OpenMP teams should be used to distribute work offloaded to the GPU.

Actions

Use the target teams distribute parallel for combined construct to offload work to the GPU using two levels of parallelism.

Relevance

GPUs differ notably from CPUs: they are composed of a large number of processing units rather than a small number of cores. Moreover, these processing units are organized in a hierarchy within the GPU, which requires some specific setup in order to exploit its capabilities well.

The OpenMP parallel construct specifies a parallel region of the code that will be executed by a team of threads. It is normally accompanied by a worksharing construct so that each thread of the team takes care of part of the work (e.g., the for construct assigns a subset of the loop iterations to each thread). This attains a single level of parallelism, since all work is distributed across one team of threads. This works well for multi-core CPUs, but GPUs are composed of a large number of processing units organized into groups that can share memory and synchronize. This must be taken into account in order to get the best performance out of GPUs.
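As a minimal sketch of this single-level pattern, the following function scales an array using a parallel region and a for worksharing construct. The function name scale_single_level and its parameters are illustrative assumptions, not part of the original example. If OpenMP support is not enabled in the compiler, the pragmas are ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiply every element of x by factor.
 * One team of threads is created; the for construct splits the
 * iterations among the threads of that single team. */
void scale_single_level(double *x, size_t n, double factor) {
  assert(x != NULL || n == 0);
  #pragma omp parallel default(none) shared(x, n, factor)
  {
    #pragma omp for schedule(static)
    for (size_t i = 0; i < n; i++) {
      x[i] *= factor;
    }
  } /* end parallel */
}
```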

The OpenMP teams distribute construct can be used to introduce an additional level of parallelism by creating multiple teams of threads and distributing loop iterations across them. Each team forms a contention group, meaning that a thread can only synchronize with other threads in its own team. This lets the distribution of work follow the hierarchical organization of the processing units of GPUs more closely. Additionally, using teams improves performance portability, helping performance stay more predictable across compiler and hardware combinations.
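To make the two levels visible, they can also be written as separate constructs. The sketch below assumes a hypothetical row_sums function over a row-major matrix: target teams distribute spreads the rows across teams, and a nested parallel for splits the columns of each row among the threads of one team. All names and the array layout are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: out[r] receives the sum of row r of a,
 * where a is a rows x cols matrix stored in row-major order. */
void row_sums(const double *a, double *out, size_t rows, size_t cols) {
  assert(a != NULL && out != NULL);
  /* First level: each team takes a subset of the rows. */
  #pragma omp target teams distribute \
      map(to: a[0:rows * cols]) map(from: out[0:rows])
  for (size_t r = 0; r < rows; r++) {
    double sum = 0.0;
    /* Second level: threads of the team share the columns of one row. */
    #pragma omp parallel for reduction(+ : sum)
    for (size_t c = 0; c < cols; c++) {
      sum += a[r * cols + c];
    }
    out[r] = sum;
  }
}
```

The reduction only involves threads of the same team, which matches the rule that synchronization is limited to a team.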

Code example

C

The following code offloads a matrix multiplication computation through the target construct and then creates a parallel region and distributes the work through the for construct (note that the matrices are statically sized arrays):

#pragma omp target map(to: A[0:m][0:p], B[0:p][0:n], m, n, p) \
    map(tofrom: C[0:m][0:n])
{
  #pragma omp parallel default(none) shared(A, B, C, m, n, p)
  {
    #pragma omp for schedule(auto)
    for (size_t i = 0; i < m; i++) {
      for (size_t j = 0; j < n; j++) {
        for (size_t k = 0; k < p; k++) {
          C[i][j] += A[i][k] * B[k][j];
        }
      }
    }
  } // end parallel
} // end target

When offloading to the GPU, it is recommended to use an additional level of parallelism. This can be achieved by using the teams and distribute constructs; in this case, in combination with parallel for:

#pragma omp target teams distribute parallel for \
    map(to: A[0:m][0:p], B[0:p][0:n], m, n, p) shared(A, B, m, n, p) \
    map(tofrom: C[0:m][0:n]) schedule(auto)
for (size_t i = 0; i < m; i++) {
  for (size_t j = 0; j < n; j++) {
    for (size_t k = 0; k < p; k++) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
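For readers who want to try the combined construct in isolation, here is a hedged, self-contained sketch. The sizes M, P and N and the function name matmul_offload are assumptions introduced for illustration; the original fragment does not define them. When no offload device is available, OpenMP implementations typically fall back to running the region on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time sizes for statically sized matrices. */
enum { M = 4, P = 3, N = 2 };

/* Accumulates A * B into C (C is expected to hold initial values,
 * for example zeros) using two levels of parallelism on the i loop. */
void matmul_offload(double A[M][P], double B[P][N], double C[M][N]) {
  assert(A != NULL && B != NULL && C != NULL);
  #pragma omp target teams distribute parallel for \
      map(to: A[0:M][0:P], B[0:P][0:N]) map(tofrom: C[0:M][0:N])
  for (size_t i = 0; i < M; i++) {
    for (size_t j = 0; j < N; j++) {
      for (size_t k = 0; k < P; k++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}
```

Because C is mapped tofrom, its initial contents are copied to the device and the accumulated result is copied back when the region ends.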

Fortran

The following code offloads a matrix multiplication computation through the target construct and then creates a parallel region and distributes the work through the do construct:

!$omp target map(to: A, B) map(tofrom: C)
!$omp parallel default(none) private(i, j, k) shared(A, B, C)
!$omp do
do j = 1, size(C, 2)
  ! k runs over the inner dimension shared by A and B
  do k = 1, size(A, 2)
    do i = 1, size(C, 1)
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    end do
  end do
end do
!$omp end do
!$omp end parallel
!$omp end target

When offloading to the GPU, it is recommended to use an additional level of parallelism. This can be achieved by using the teams and distribute constructs; in this case, in combination with parallel do:

! The combined construct applies teams, distribute and parallel do to the j loop
!$omp target teams distribute parallel do default(none) private(i, j, k) &
!$omp& shared(A, B, C) map(to: A, B) map(tofrom: C)
do j = 1, size(C, 2)
  ! k runs over the inner dimension shared by A and B
  do k = 1, size(A, 2)
    do i = 1, size(C, 1)
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    end do
  end do
end do
!$omp end target teams distribute parallel do

References