Matrix operator*(const Matrix &lhs, const Matrix &rhs) noexcept {
    // Assumes lhs.cols() == rhs.rows() and that Matrix{r, c} zero-initializes its elements.
    auto result = Matrix{lhs.rows(), rhs.cols()};
    for (std::size_t k = 0; k < lhs.cols(); ++k)
        for (std::size_t i = 0; i < lhs.rows(); ++i)
            for (std::size_t j = 0; j < rhs.cols(); ++j)
                result(i, j) += lhs(i, k) * rhs(k, j);  // accumulate the k-th partial product
    return result;
}
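The operator above relies on a Matrix type that is not shown here. As a minimal sketch, assuming row-major storage of double values in a single std::vector (the member names rows_, cols_ and data_ are hypothetical and not taken from the original code), such a type might look like this:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal Matrix: the real class is not shown in the source.
// Elements are stored row-major in one contiguous buffer.
class Matrix {
public:
    // std::vector value-initializes its elements, so every entry starts at 0.0.
    Matrix(std::size_t rows, std::size_t cols)
        : rows_{rows}, cols_{cols}, data_(rows * cols) {}

    std::size_t rows() const noexcept { return rows_; }
    std::size_t cols() const noexcept { return cols_; }

    // Element access without bounds checking: (i, j) maps to index i * cols_ + j.
    double &operator()(std::size_t i, std::size_t j) noexcept {
        return data_[i * cols_ + j];
    }
    const double &operator()(std::size_t i, std::size_t j) const noexcept {
        return data_[i * cols_ + j];
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};
```

With this layout, neighbouring j values of the same row sit next to each other in memory, which is what makes the order of the three loops relevant for cache behaviour.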
valgrind --tool=callgrind build/main
callgrind_annotate --inclusive=yes callgrind.out*
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
1,964,186,524 ???:0x0000000000001100 [/usr/lib/x86_64-linux-gnu/ld-2.31.so]
1,961,995,291 ???:_start [/home/manuel/Dokumente/profiling/build/main]
1,961,995,279 /build/glibc-eX1tMB/glibc-2.31/csu/../csu/libc-start.c:(below main) [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
1,961,883,340 /home/manuel/Dokumente/profiling/src/main.cpp:main
1,946,522,553 src/main.cpp:main [/home/manuel/Dokumente/profiling/build/main]
1,023,496,630 src/main.cpp:operator*(Matrix const&, Matrix const&) [/home/manuel/Dokumente/profiling/build/main]
KCacheGrind visualizes the output
perf is faster than callgrind
perf stat -d -r5 build/main
Performance counter stats for 'build/main' (5 runs):
1’265.91 msec task-clock # 0.998 CPUs utilized ( +- 1.82% )
28 context-switches # 0.022 K/sec ( +- 36.06% )
1 cpu-migrations # 0.001 K/sec ( +- 36.42% )
32’912 page-faults # 0.026 M/sec ( +- 0.00% )
2’725’958’586 cycles # 2.153 GHz ( +- 1.10% ) (49.80%)
2’102’075’504 instructions # 0.77 insn per cycle ( +- 0.53% ) (62.41%)
362’906’440 branches # 286.677 M/sec ( +- 0.37% ) (62.38%)
146’567 branch-misses # 0.04% of all branches ( +- 7.89% ) (62.54%)
410’411’322 L1-dcache-loads # 324.203 M/sec ( +- 0.28% ) (62.66%)
41’618’630 L1-dcache-load-misses # 10.14% of all L1-dcache accesses ( +- 0.49% ) (62.75%)
17’904’419 LLC-loads # 14.144 M/sec ( +- 0.50% ) (50.03%)
15’112’462 LLC-load-misses # 84.41% of all LL-cache accesses ( +- 1.76% ) (49.84%)
1.2687 +- 0.0240 seconds time elapsed ( +- 1.89% )
the LLC is almost always missed: 84.41% of LLC loads go on to main memory
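The derived figures in the right-hand column of the perf output are plain ratios of the raw counters. The following sketch recomputes three of them from the values printed above; the helper ratio and the constant names are made up for illustration, and the counts are copied from the first run.

```cpp
#include <cstdint>

// Ratio of two raw counter values, e.g. misses / loads.
constexpr double ratio(std::uint64_t numerator, std::uint64_t denominator) {
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}

// Counter values copied from the first perf stat run.
constexpr std::uint64_t kCycles       = 2'725'958'586;
constexpr std::uint64_t kInstructions = 2'102'075'504;
constexpr std::uint64_t kL1Loads      = 410'411'322;
constexpr std::uint64_t kL1Misses     = 41'618'630;
constexpr std::uint64_t kLlcLoads     = 17'904'419;
constexpr std::uint64_t kLlcMisses    = 15'112'462;

constexpr double kIpc         = ratio(kInstructions, kCycles);  // perf reports 0.77 insn per cycle
constexpr double kL1MissRate  = ratio(kL1Misses, kL1Loads);     // perf reports 10.14%
constexpr double kLlcMissRate = ratio(kLlcMisses, kLlcLoads);   // perf reports 84.41%
```

An instructions-per-cycle value below 1 together with a high LLC miss rate is a typical sign that the CPU spends much of its time waiting for memory.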
Hotspot simplifies using perf
record events: cpu-clock,LLC-loads,LLC-load-misses
operator* produces the cache misses
Performance counter stats for 'build/main' (5 runs):
275.27 msec task-clock # 0.980 CPUs utilized ( +- 2.46% )
14 context-switches # 0.052 K/sec ( +- 32.62% )
0 cpu-migrations # 0.001 K/sec ( +-100.00% )
32’912 page-faults # 0.120 M/sec ( +- 0.00% )
1’089’431’506 cycles # 3.958 GHz ( +- 0.95% ) (49.11%)
1’799’930’776 instructions # 1.65 insn per cycle ( +- 0.28% ) (62.40%)
240’095’134 branches # 872.222 M/sec ( +- 0.42% ) (63.33%)
149’096 branch-misses # 0.06% of all branches ( +- 13.85% ) (63.52%)
349’218’671 L1-dcache-loads # 1268.648 M/sec ( +- 1.83% ) (63.53%)
9’999’605 L1-dcache-load-misses # 2.86% of all L1-dcache accesses ( +- 1.47% ) (63.05%)
231’426 LLC-loads # 0.841 M/sec ( +- 17.77% ) (48.79%)
62’802 LLC-load-misses # 27.14% of all LL-cache accesses ( +- 9.36% ) (48.67%)
0.28085 +- 0.00556 seconds time elapsed ( +- 1.98% )
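Compared with the first run, LLC-loads drop from about 17.9 million to about 0.23 million and the elapsed time from about 1.27 s to about 0.28 s. The source does not show what was changed between the two measurements. As a purely illustrative sketch, and not necessarily the change made here, the code below shows how loop order alone affects the memory access pattern of a matrix product. It assumes square row-major matrices in flat vectors; the names Flat, multiply_ijk and multiply_ikj are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored in a flat vector (an assumption of this sketch).
using Flat = std::vector<double>;

// i-j-k order: the innermost loop reads b with stride n (column-wise),
// so consecutive iterations touch different cache lines once n is large.
Flat multiply_ijk(const Flat &a, const Flat &b, std::size_t n) {
    Flat c(n * n);  // zero-initialized
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    return c;
}

// i-k-j order: the innermost loop reads b and updates c with stride 1 (row-wise),
// so each loaded cache line is used completely before moving on.
Flat multiply_ikj(const Flat &a, const Flat &b, std::size_t n) {
    Flat c(n * n);  // zero-initialized
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double a_ik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a_ik * b[k * n + j];
        }
    return c;
}
```

Both functions compute the same product; only the order in which memory is touched differs. Whether one is measurably faster depends on matrix size, cache sizes and compiler optimizations, so a claim like this should be checked with perf stat as above.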