Introduce a custom LineWriter for high-performance hex line printing

Authored by mwolff on Apr 25 2018, 9:05 PM.

Description

The LineWriter fills the output buffer using a custom
unsigned-integer to hex-string converter. Benchmarking this with
perf stat on tests/manual/threaded shows a noticeable performance gain:

Before:
Performance counter stats for 'heaptrack ./tests/manual/threaded' (5 runs):

  2246,525202      task-clock (msec)         #    1,772 CPUs utilized            ( +-  4,44% )
      109.471      context-switches          #    0,049 M/sec                    ( +- 14,85% )
       25.173      cpu-migrations            #    0,011 M/sec                    ( +- 20,76% )
       56.174      page-faults               #    0,025 M/sec                    ( +-  0,17% )
6.828.825.882      cycles                    #    3,040 GHz                      ( +-  4,36% )
6.068.732.957      instructions              #    0,89  insn per cycle           ( +-  2,68% )
1.187.665.333      branches                  #  528,668 M/sec                    ( +-  2,66% )
   20.658.618      branch-misses             #    1,74% of all branches          ( +-  4,26% )

  1,267992445 seconds time elapsed                                          ( +-  6,18% )

After:
Performance counter stats for 'heaptrack ./tests/manual/threaded' (5 runs):

  1283,488536      task-clock (msec)         #    2,178 CPUs utilized            ( +-  0,19% )
       27.880      context-switches          #    0,022 M/sec                    ( +-  0,26% )
        6.034      cpu-migrations            #    0,005 M/sec                    ( +-  0,72% )
       56.193      page-faults               #    0,044 M/sec                    ( +-  0,04% )
3.824.835.231      cycles                    #    2,980 GHz                      ( +-  0,22% )
4.324.071.695      instructions              #    1,13  insn per cycle           ( +-  0,19% )
  851.553.131      branches                  #  663,468 M/sec                    ( +-  0,19% )
   14.281.586      branch-misses             #    1,68% of all branches          ( +-  0,14% )

  0,589248478 seconds time elapsed                                          ( +-  0,45% )

The drastic reduction in CPU time for the string conversion also
leads to a significant reduction in lock contention. Together, these
cut the elapsed time of this test by more than 2x, from about 1.27 s
to about 0.59 s.
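
For readers unfamiliar with the technique, here is a minimal sketch of an unsigned-integer to hex-string converter that writes straight into a caller-supplied buffer. It is not the code from this commit: the names writeHex, writeHexLine and formatHexLine, and the "<type> <hex> <hex> ...\n" line layout, are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not heaptrack's actual implementation): writes `value`
// as lowercase hex without prefix or leading zeros into `buffer` and returns
// the number of characters written. No terminating '\0' is appended.
inline std::size_t writeHex(std::uint64_t value, char* buffer, std::size_t capacity)
{
    static const char digits[] = "0123456789abcdef";

    // Count the hex digits needed; zero still needs one digit.
    std::size_t length = 1;
    for (std::uint64_t v = value >> 4; v != 0; v >>= 4) {
        ++length;
    }
    assert(length <= capacity);

    // Fill the digits back to front, starting with the least significant nibble.
    for (std::size_t i = length; i-- > 0;) {
        buffer[i] = digits[value & 0xf];
        value >>= 4;
    }
    return length;
}

// Hypothetical line layout: a type character, then each value in hex
// preceded by a space, then a newline. Returns the number of bytes written.
inline std::size_t writeHexLine(char type, const std::uint64_t* values, std::size_t count,
                                char* buffer, std::size_t capacity)
{
    // Worst case: type + count * (space + 16 digits) + newline.
    assert(capacity >= 2 + count * 17);

    std::size_t pos = 0;
    buffer[pos++] = type;
    for (std::size_t i = 0; i < count; ++i) {
        buffer[pos++] = ' ';
        pos += writeHex(values[i], buffer + pos, capacity - pos);
    }
    buffer[pos++] = '\n';
    return pos;
}

// Convenience wrapper for demonstration only; a hot path would write into a
// reused output buffer instead of allocating a std::string per line.
inline std::string formatHexLine(char type, const std::uint64_t* values, std::size_t count)
{
    std::string line(2 + count * 17, '\0');
    line.resize(writeHexLine(type, values, count, &line[0], line.size()));
    return line;
}
```

Under these assumptions, formatHexLine('+', values, 3) with values {0x0, 0xff, 0xdeadbeef} yields "+ 0 ff deadbeef\n". The sketch avoids format-string parsing and locale handling entirely, which is the general reason hand-written converters of this kind tend to be cheaper than generic formatted output; the measurements above remain the authority on what this commit actually gains.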

Details

Committed
mwolff on Apr 27 2018, 11:18 AM
Parents
R45:4e72941cd56a: Always use std::mutex for locking, never a custom spin lock
Branches
Unknown
Tags
Unknown