PWR062: Consider loop interchange by removing accumulation on array value
Issue
The performance of the nested loops can benefit from loop interchange, but an accumulation on the array values prevents it.
Actions
Perform loop fission to split the accumulation into two parts: (1) a loop that initializes the accumulation, and (2) a second loop that performs the actual computation of the accumulation. This allows for a perfect nesting in the latter loop (performance critical), which in turn enables loop interchange.
Relevance
Inefficient memory access patterns and low locality of reference are among the main reasons for low performance on modern computer systems. Matrices are stored in a row-major order in C/C++, and column-major order in Fortran. Iterating over them column-wise (in C) and row-wise (in Fortran) hinders performance as a result of a suboptimal usage of the memory subsystem.
Nested loops that iterate over matrices inefficiently can be optimized by applying loop interchange. This technique allows to replace the inefficient memory access pattern with a more efficient one. Moreover, loop interchange usually enables vectorization of the innermost loop as well, which can lead to additional performance improvements.
To perform the loop interchange, the loops need to be perfectly nested; i.e., all the statements need to be inside the innermost loop. However, in this scenario, the initialization and computation of the accumulation is in separate statements, preventing the loop interchange.
Code example
C
Have a look at the following matrix multiplication code:
void matmul(int n, const double *A, const double *B, double *C) {
for (int i = 0; i < n; ++i) {
for (int j = 0; j < n; ++j) {
C[i * n + j] = 0.0;
for (int k = 0; k < n; ++k) {
C[i * n + j] += A[i * n + k] * B[k * n + j];
}
}
}
}
Note that, in the innermost loop, the memory access pattern of the matrix B
is strided. Hence, the computation can benefit from loop interchange for a more
efficient memory usage. However, the accumulation initialization C[i * n + j] = 0.0
prevents it.
To make the loop interchangeable, loop fission can be applied to move the initialization to a separate loop:
void matmul(int n, const double *A, const double *B, double *C) {
for (int i = 0; i < n; ++i) {
for (int j = 0; j < n; ++j) {
C[i * n + j] = 0.0;
}
}
for (int i = 0; i < n; ++i) {
for (int j = 0; j < n; ++j) {
for (int k = 0; k < n; ++k) {
C[i * n + j] += A[i * n + k] * B[k * n + j];
}
}
}
}
Now that the performance critical loop (the second one) is perfectly nested, the loop interchange can be applied to leverage an efficient memory access pattern:
void matmul(int n, const double *A, const double *B, double *C) {
for (int i = 0; i < n; ++i) {
for (int j = 0; j < n; ++j) {
C[i * n + j] = 0.0;
}
}
for (int i = 0; i < n; ++i) {
for (int k = 0; k < n; ++k) {
for (int j = 0; j < n; ++j) {
C[i * n + j] += A[i * n + k] * B[k * n + j];
}
}
}
}
Fortran
Have a look at the following matrix multiplication code:
subroutine matmul(n, A, B, C)
implicit none
integer, intent(in) :: n
real, dimension(:, :), intent(in) :: A, B
real, dimension(:, :), intent(out) :: C
integer :: i, j, k
do j = 1, n
do i = 1, n
C(i, j) = 0
do k = 1, n
C(i, j) = C(i, j) + A(i, k) * B(k, j)
end do
end do
end do
end subroutine matmul
Note that, in the innermost loop, the memory access pattern of the matrix A
is strided. Hence, the computation can benefit from loop interchange for a more
efficient memory usage. However, the accumulation initialization C(i, j) = 0
prevents it.
To make the loop interchangeable, loop fission can be applied to move the initialization to a separate loop:
subroutine matmul(n, A, B, C)
implicit none
integer, intent(in) :: n
real, dimension(:, :), intent(in) :: A, B
real, dimension(:, :), intent(out) :: C
integer :: i, j, k
! No need to write an explicit loop to zero the whole matrix
C = 0
do j = 1, n
do i = 1, n
do k = 1, n
C(i, j) = C(i, j) + A(i, k) * B(k, j)
end do
end do
end do
end subroutine matmul
Now that the performance critical loop (the second one) is perfectly nested, the loop interchange can be applied to leverage an efficient memory access pattern:
subroutine matmul(n, A, B, C)
implicit none
integer, intent(in) :: n
real, dimension(:, :), intent(in) :: A, B
real, dimension(:, :), intent(out) :: C
integer :: i, j, k
C = 0
do j = 1, n
do k = 1, n
do i = 1, n
C(i, j) = C(i, j) + A(i, k) * B(k, j)
end do
end do
end do
end subroutine matmul