Profiling

Example: Multiplication

Matrix operator*(const Matrix &lhs, const Matrix &rhs) noexcept {
  auto result = Matrix{lhs.rows(), rhs.cols()};

  for (std::size_t k = 0; k < lhs.cols(); ++k)
    for (std::size_t i = 0; i < lhs.rows(); ++i)
      for (std::size_t j = 0; j < rhs.cols(); ++j)
        result(i, j) += lhs(i, k) * rhs(k, j);

  return result;
}
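
The operator assumes a Matrix class with zero-initializing construction and (i, j) element access. A minimal sketch of such a class (names and row-major layout are assumptions, not part of the original):

```cpp
#include <cstddef>
#include <vector>

// Minimal Matrix sketch: row-major storage, zero-initialized.
class Matrix {
public:
  Matrix(std::size_t rows, std::size_t cols)
      : rows_{rows}, cols_{cols}, data_(rows * cols, 0.0) {}

  std::size_t rows() const noexcept { return rows_; }
  std::size_t cols() const noexcept { return cols_; }

  // Element (i, j) lives at offset i * cols_ + j (row-major).
  double &operator()(std::size_t i, std::size_t j) noexcept {
    return data_[i * cols_ + j];
  }
  double operator()(std::size_t i, std::size_t j) const noexcept {
    return data_[i * cols_ + j];
  }

private:
  std::size_t rows_;
  std::size_t cols_;
  std::vector<double> data_;
};
```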

Configuration

  • RelWithDebInfo builds optimized executables with debug symbols (-O2 -g with GCC/Clang), so profiles stay meaningful and can still be mapped back to source lines
module load cmake ninja # on Euler cluster only
cmake -S src \
    -B build \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -G Ninja
cmake --build build
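
The commands above expect a CMakeLists.txt in src/. A minimal sketch of what it might contain (project and target names are assumptions):

```cmake
cmake_minimum_required(VERSION 3.16)
project(profiling CXX)

add_executable(main main.cpp)
```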

Callgrind

  • valgrind --tool=callgrind build/main
  • callgrind_annotate --inclusive=yes callgrind.out*

Multiplication

--------------------------------------------------------------------------------
Ir             file:function
--------------------------------------------------------------------------------
1,964,186,524  ???:0x0000000000001100 [/usr/lib/x86_64-linux-gnu/ld-2.31.so]
1,961,995,291  ???:_start [/home/manuel/Dokumente/profiling/build/main]
1,961,995,279  /build/glibc-eX1tMB/glibc-2.31/csu/../csu/libc-start.c:(below main) [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
1,961,883,340  /home/manuel/Dokumente/profiling/src/main.cpp:main
1,946,522,553  src/main.cpp:main [/home/manuel/Dokumente/profiling/build/main]
1,023,496,630  src/main.cpp:operator*(Matrix const&, Matrix const&) [/home/manuel/Dokumente/profiling/build/main]

Visualization

  • KCacheGrind visualizes the callgrind.out files as an interactive call graph

Investigation

  • perf samples hardware counters instead of simulating every instruction, so it runs much faster than Callgrind
  • perf stat -d -r5 build/main
 Performance counter stats for 'build/main' (5 runs):

          1’265.91 msec task-clock                #    0.998 CPUs utilized            ( +-  1.82% )
                28      context-switches          #    0.022 K/sec                    ( +- 36.06% )
                 1      cpu-migrations            #    0.001 K/sec                    ( +- 36.42% )
            32’912      page-faults               #    0.026 M/sec                    ( +-  0.00% )
     2’725’958’586      cycles                    #    2.153 GHz                      ( +-  1.10% )  (49.80%)
     2’102’075’504      instructions              #    0.77  insn per cycle           ( +-  0.53% )  (62.41%)
       362’906’440      branches                  #  286.677 M/sec                    ( +-  0.37% )  (62.38%)
           146’567      branch-misses             #    0.04% of all branches          ( +-  7.89% )  (62.54%)
       410’411’322      L1-dcache-loads           #  324.203 M/sec                    ( +-  0.28% )  (62.66%)
        41’618’630      L1-dcache-load-misses     #   10.14% of all L1-dcache accesses  ( +-  0.49% )  (62.75%)
        17’904’419      LLC-loads                 #   14.144 M/sec                    ( +-  0.50% )  (50.03%)
        15’112’462      LLC-load-misses           #   84.41% of all LL-cache accesses  ( +-  1.76% )  (49.84%)

            1.2687 +- 0.0240 seconds time elapsed  ( +-  1.89% )

Idea: LLC (Last Level Cache)

  • the LLC is almost always missed (84% of LLC loads miss)

  • Hotspot simplifies using perf

  • record events: cpu-clock,LLC-loads,LLC-load-misses

    perf record -o perf.data \
                -e cpu-clock,LLC-loads,LLC-load-misses \
                --call-graph dwarf build/main

Hotspot

  • operator* produces most of the LLC misses

Alternative: terminal

perf report # select LLC-load-misses

Access

  • odd memory access pattern: with k outermost, every iteration of the outer loop sweeps the entire result matrix, so partial sums are evicted from cache before they are reused
for (std::size_t k = 0; k < lhs.cols(); ++k)
  for (std::size_t i = 0; i < lhs.rows(); ++i)
    for (std::size_t j = 0; j < rhs.cols(); ++j)
      result(i, j) += lhs(i, k) * rhs(k, j);

Solution

  • access in row-major order: with k innermost, result(i, j) is accumulated to completion once and lhs(i, k) is read contiguously
for (std::size_t i = 0; i < lhs.rows(); ++i)
  for (std::size_t j = 0; j < rhs.cols(); ++j)
    for (std::size_t k = 0; k < lhs.cols(); ++k)
      result(i, j) += lhs(i, k) * rhs(k, j);

4x better performance (1.27 s → 0.28 s elapsed)

 Performance counter stats for 'build/main' (5 runs):

            275.27 msec task-clock                #    0.980 CPUs utilized            ( +-  2.46% )
                14      context-switches          #    0.052 K/sec                    ( +- 32.62% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +-100.00% )
            32’912      page-faults               #    0.120 M/sec                    ( +-  0.00% )
     1’089’431’506      cycles                    #    3.958 GHz                      ( +-  0.95% )  (49.11%)
     1’799’930’776      instructions              #    1.65  insn per cycle           ( +-  0.28% )  (62.40%)
       240’095’134      branches                  #  872.222 M/sec                    ( +-  0.42% )  (63.33%)
           149’096      branch-misses             #    0.06% of all branches          ( +- 13.85% )  (63.52%)
       349’218’671      L1-dcache-loads           # 1268.648 M/sec                    ( +-  1.83% )  (63.53%)
         9’999’605      L1-dcache-load-misses     #    2.86% of all L1-dcache accesses  ( +-  1.47% )  (63.05%)
           231’426      LLC-loads                 #    0.841 M/sec                    ( +- 17.77% )  (48.79%)
            62’802      LLC-load-misses           #   27.14% of all LL-cache accesses  ( +-  9.36% )  (48.67%)

           0.28085 +- 0.00556 seconds time elapsed  ( +-  1.98% )