Introduce a custom LineWriter for high-performance hex line printing

The LineWriter fills the output buffer with a custom implementation
of an unsigned-integer to hex-string converter. Benchmarking this
with tests/manual/threaded shows a noticeable performance gain:
Before:
Performance counter stats for 'heaptrack ./tests/manual/threaded' (5 runs):

      2246,525202  task-clock (msec)     #    1,772 CPUs utilized     ( +-  4,44% )
          109.471  context-switches      #    0,049 M/sec             ( +- 14,85% )
           25.173  cpu-migrations        #    0,011 M/sec             ( +- 20,76% )
           56.174  page-faults           #    0,025 M/sec             ( +-  0,17% )
    6.828.825.882  cycles                #    3,040 GHz               ( +-  4,36% )
    6.068.732.957  instructions          #    0,89  insn per cycle    ( +-  2,68% )
    1.187.665.333  branches              #  528,668 M/sec             ( +-  2,66% )
       20.658.618  branch-misses         #    1,74% of all branches   ( +-  4,26% )

      1,267992445  seconds time elapsed                               ( +-  6,18% )

After:
Performance counter stats for 'heaptrack ./tests/manual/threaded' (5 runs):

      1283,488536  task-clock (msec)     #    2,178 CPUs utilized     ( +-  0,19% )
           27.880  context-switches      #    0,022 M/sec             ( +-  0,26% )
            6.034  cpu-migrations        #    0,005 M/sec             ( +-  0,72% )
           56.193  page-faults           #    0,044 M/sec             ( +-  0,04% )
    3.824.835.231  cycles                #    2,980 GHz               ( +-  0,22% )
    4.324.071.695  instructions          #    1,13  insn per cycle    ( +-  0,19% )
      851.553.131  branches              #  663,468 M/sec             ( +-  0,19% )
       14.281.586  branch-misses         #    1,68% of all branches   ( +-  0,14% )

      0,589248478  seconds time elapsed                               ( +-  0,45% )

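The string conversion now needs far less CPU time: task-clock drops
from about 2247 msec to 1283 msec. This also reduces lock contention
significantly, as the drop in context switches from about 109k to 28k
suggests. Together, this cuts the elapsed time of this test from 1.27s
to 0.59s, a speedup of roughly 2.15x.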
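
For illustration, a minimal sketch of what an unsigned-integer to
hex-string converter that writes directly into a caller-provided buffer
could look like. This is not the code added by this commit: the names
writeHex and toHex are hypothetical, and the sketch assumes lowercase
digits without leading zeros or a "0x" prefix.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not heaptrack's actual LineWriter code.
// Writes `value` as lowercase hex without leading zeros into `buffer`
// and returns the number of characters written. The caller must provide
// room for at least 16 characters (two per byte of a 64-bit value).
inline std::size_t writeHex(std::uint64_t value, char* buffer)
{
    static const char digits[] = "0123456789abcdef";

    // Count the hex digits needed; zero still takes one digit.
    std::size_t length = 1;
    for (std::uint64_t rest = value >> 4; rest != 0; rest >>= 4) {
        ++length;
    }

    // Fill the buffer backwards, starting at the least significant nibble.
    for (std::size_t i = length; i-- > 0;) {
        buffer[i] = digits[value & 0xf];
        value >>= 4;
    }
    return length;
}

// Convenience wrapper for illustration only; a line writer would append
// to its own output buffer instead of allocating a std::string.
inline std::string toHex(std::uint64_t value)
{
    char buffer[16];
    const std::size_t length = writeHex(value, buffer);
    return std::string(buffer, length);
}
```

Compared to a generic formatted-output call, such a converter does not
parse a format string and handles no locale, which is where the CPU
time savings in a hot path would be expected to come from.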