Sum and assign of array is slower in derived types

I was comparing the performance of a sum followed by an assignment of two arrays, in the form c = a + b, between an intrinsic Fortran type, real, and a derived data type that contains only one array of real. The class is very simple: it contains operators for addition and assignment and a destructor (a final procedure), as follows:

module type_mod

use iso_fortran_env

type :: class_t
  real(8), dimension(:,:), allocatable :: a
contains
  procedure :: assign_type
  generic, public :: assignment(=) => assign_type
  procedure :: sum_type
  generic :: operator(+) => sum_type
  final :: destroy
end type class_t

contains

  subroutine assign_type(lhs, rhs)
    class(class_t), intent(inout) :: lhs
    type(class_t), intent(in) :: rhs
    lhs % a = rhs % a
  end subroutine assign_type

  subroutine destroy(this)
    type(class_t), intent(inout) :: this
    if (allocated(this % a)) deallocate(this % a)
  end subroutine destroy

  function sum_type (lhs, rhs) result(res)
    class(class_t), intent(in) :: lhs
    type(class_t), intent(in) :: rhs
    type(class_t) :: res
    res % a = lhs % a + rhs % a
  end function sum_type

end module type_mod

The assign subroutine contains different modes of operations, just for the sake of benchmarking.

To compare it against the same operations on plain real arrays, I created the following module:

module subroutine_mod

  use type_mod, only: class_t

  contains

  subroutine sum_real(a, b, c)
    real(8), dimension(:,:), intent(inout) :: a, b, c
    c = a + b
  end subroutine sum_real


  subroutine sum_type(a, b, c)
    type(class_t), intent(inout) :: a, b, c
    c = a + b
  end subroutine sum_type

end module subroutine_mod

Everything is executed in the program below, considering arrays of size (10000,10000) and repeating the operation 100 times:

program test

  use subroutine_mod

  integer :: i
  integer :: N = 100 ! Number of times to repeat the assign
  integer :: M = 10000 ! Size of the arrays
  real(8) :: tf, ts
  real(8), dimension(:,:), allocatable :: a, b, c
  type(class_t) :: a2, b2, c2

  allocate(a2%a(M,M), b2%a(M,M), c2%a(M,M))
  a2%a = 1.0d0
  b2%a = 2.0d0
  c2%a = 3.0d0

  allocate(a(M,M), b(M,M), c(M,M))
  a = 1.0d0
  b = 2.0d0
  c = 3.0d0

  ! Benchmark timing with
  call cpu_time(ts)
  do i = 1, N
    call sum_type(a2, b2, c2)
  end do
  call cpu_time(tf)
  write(*,*) "Type : ", tf-ts

  call cpu_time(ts)
  do i = 1, N
    call sum_real(a, b, c)
  end do
  call cpu_time(tf)
  write(*,*) "Real : ", tf-ts

end program test

To my surprise, the operation with my derived datatype consistently underperformed the operation with the Fortran arrays by a factor of 2 with gfortran and a factor of 10 with ifort. For instance, using the CHECK_SIZE mode, which saves allocation time, I got the following timings compiling with the -O2 flag:

gfortran

Data type: 33 s
Real : 13 s

ifort

Data type: 30 s
Real : 3 s

Question

Is this normal behaviour? If so, are there any recommendations to achieve better performance?

Context

To provide some context, the type with a single array will be very useful for a code refactoring task, where we need to keep similar interfaces to a previous type.

Compiler versions

gfortran 9.4.0
ifort 2021.6.0 20220226

Answer

You are worried about allocation time, but you do a lot of allocations of arrays of shape [M,M] for the derived type, and almost none for the intrinsic type.

The only allocations for the intrinsic type are in the main program, for a, b and c. These are outside the timing loop.

For the derived type, you allocate for a2%a, b2%a and c2%a (again outside the timing loop), but also for res%a in the function sum_type, once per call and therefore N times inside the timing loop.

Equally, inside the sum_real subroutine the assignment statement c = a + b involves no allocatable object, but inside sum_type the c in c = a + b is an allocatable array: the compiler checks whether c is allocated and, if so, whether its shape matches the right-hand side expression.
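The allocation and shape checks described above can be observed in a small standalone program. This is only a sketch, assuming the compiler's default Fortran 2003 reallocation-on-assignment semantics are in effect; the program and variable names are made up for illustration and do not come from the question.

```fortran
program realloc_demo
  implicit none
  real(8), allocatable :: c(:,:)
  real(8) :: a(2,3), b(2,3), d(4,4)

  a = 1.0d0
  b = 2.0d0
  d = 5.0d0

  ! c is not allocated yet: the assignment allocates it to the shape of a + b.
  c = a + b
  print *, "after c = a + b: shape =", shape(c)

  ! Shape mismatch: c is deallocated and reallocated to the shape of d.
  c = d
  print *, "after c = d:     shape =", shape(c)

  ! Shapes already match: the check is still made, but nothing is reallocated.
  c = d + d
  print *, "after c = d + d: shape =", shape(c)
end program realloc_demo
```

The reported shapes should be 2 by 3 after the first assignment and 4 by 4 after the other two. None of this bookkeeping is needed when the left-hand side is a non-allocatable dummy argument, as in sum_real.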

In summary: you are not comparing like with like. There's a lot of overhead in wrapping an intrinsic array as an allocatable component of a derived type.
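One possible way to bring the two cases closer together is to avoid the function result altogether and add directly into an existing component. The following is a minimal sketch rather than a drop-in replacement: box_t is a stripped-down stand-in for class_t, and add_into is a hypothetical helper that does not appear in the question.

```fortran
module add_into_mod
  implicit none
  private
  public :: box_t, add_into

  ! Stand-in for class_t from the question, without defined operators.
  type :: box_t
    real(8), allocatable :: a(:,:)
  end type box_t

contains

  ! Hypothetical helper: stores a%a + b%a in c%a without a function result,
  ! so no temporary box_t (and no res%a allocation) is created per call.
  subroutine add_into(c, a, b)
    type(box_t), intent(inout) :: c
    type(box_t), intent(in)    :: a, b
    ! Intrinsic assignment: c%a is (re)allocated only if it is unallocated
    ! or its shape differs from that of the right-hand side.
    c%a = a%a + b%a
  end subroutine add_into

end module add_into_mod

program add_into_demo
  use add_into_mod
  implicit none
  type(box_t) :: a, b, c

  allocate(a%a(3,3), b%a(3,3))
  a%a = 1.0d0
  b%a = 2.0d0

  call add_into(c, a, b)   ! first call allocates c%a
  call add_into(c, a, b)   ! later calls reuse the existing allocation
  print *, all(c%a == 3.0d0)
end program add_into_demo
```

The program should print T. Whether this closes the gap to the plain-array timings depends on the compiler and flags, so it would need to be measured with the benchmark from the question. The cost is that the c = a + b notation is lost, which may matter if the refactoring relies on keeping that interface.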

Tangential to your timing concerns is the "cleverness" of the subroutine assign. It's horrible.

Calling an argument lhs when it's associated with the right-hand side of the assignment statement is a little confusing, but the select case construct is confusing beyond a little.

case (ASSUMED_SIZE)
  this % a = lhs % a

under rules where the rest of the program makes any sense, invokes a couple of checks:

Is this%a allocated? If not, allocate it to the shape of lhs%a.
If it is allocated, check whether its shape matches that of lhs%a. If not, deallocate it and then allocate it to the shape of lhs%a.

In other words, these are the same checks and actions that are done manually in the CHECK_SIZE case.

The final subroutine does nothing of value, so the entire assign subroutine's execution can be replaced by this%a = lhs%a.

(Things would be different if the final subroutine had a substantive effect, if the compiler had been asked to ignore the rules of intrinsic assignment, using -fno-realloc-lhs or -nostandard-realloc-lhs for example, or if this%a(:,:)=lhs%a had been used.)
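The difference between the whole-array form and the array-section form mentioned in the parenthetical can be sketched as follows. The names src and dst are illustrative only, and the comments describe what the language rules require, not what a particular compiler actually emits.

```fortran
program section_assign_demo
  implicit none
  real(8), allocatable :: src(:,:), dst(:,:)

  allocate(src(2,3), dst(2,3))
  src = 1.0d0
  dst = 0.0d0

  ! Whole-array form: dst is an allocatable variable, so the compiler must
  ! be prepared to (re)allocate it to the shape of src.
  dst = src

  ! Section form: dst(:,:) is an array section, not an allocatable variable,
  ! so nothing is reallocated. dst must already be allocated with a shape
  ! that conforms to the right-hand side.
  dst(:,:) = 2.0d0 * src

  print *, all(dst == 2.0d0)
end program section_assign_demo
```

The program should print T. With the section form, a shape mismatch is a programming error, not a trigger for reallocation, so it is only safe when the shapes are known to agree.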