Pooled memory allocations
Needs Revision · Public

Authored by poke1024 on Sep 28 2017, 2:11 PM.

Details

Reviewers
dkazakov
Group Reviewers
Krita
Summary

Dynamic allocations of KisOpenGLUpdateInfo show up in profiling at a couple of percent; quite a lot of these items are created and deleted during drawing.

This PR adds a lock-free (thread-safe) memory pool for these objects, so that no new/delete happens anymore (constructor and destructor calls still run as expected, though).
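The idea can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the code in the diff (the actual patch is built around a lock-free queue); the class and method names here are invented for the sketch, and a production version would also have to deal with the ABA problem (e.g. via tagged pointers) and over-aligned types:

```cpp
#include <atomic>
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical lock-free pool for one object type: freed slots are kept
// on a Treiber stack and reused, so the heap is only touched when the
// free list is empty. Constructors/destructors still run on every
// acquire/release; only the new/delete heap calls are saved.
template <typename T>
class LockFreePool {
    union Slot {
        Slot *next;                               // link while on the free list
        alignas(T) unsigned char storage[sizeof(T)];  // payload while in use
    };
    std::atomic<Slot *> m_head{nullptr};

public:
    template <typename... Args>
    T *acquire(Args&&... args) {
        // try to pop a cached slot from the free list
        Slot *slot = m_head.load(std::memory_order_acquire);
        while (slot && !m_head.compare_exchange_weak(
                   slot, slot->next, std::memory_order_acq_rel)) {
        }
        if (!slot) {
            // free list empty: fall back to the heap
            slot = static_cast<Slot *>(::operator new(sizeof(Slot)));
        }
        // placement-new: the constructor runs as usual
        return new (slot->storage) T(std::forward<Args>(args)...);
    }

    void release(T *p) {
        p->~T();  // the destructor runs as usual
        Slot *slot = reinterpret_cast<Slot *>(p);
        // push the slot back onto the free list
        slot->next = m_head.load(std::memory_order_relaxed);
        while (!m_head.compare_exchange_weak(
                   slot->next, slot, std::memory_order_acq_rel)) {
        }
    }
};
```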

Diff Detail

Repository
R37 Krita
Lint
Lint Skipped
Unit
Unit Tests Skipped
poke1024 created this revision. Sep 28 2017, 2:11 PM
alvinhochun added a subscriber: alvinhochun.

I would like to know how the profiling times compare before and after this change. Do you have any data?

libs/ui/canvas/kis_update_info.cpp
20

How was 64 decided? Is it just a random number choice or are there any reasons to it?

libs/ui/canvas/kis_update_info.h
34

This template class might be useful in other places too, so it might make sense to put this in its own header file somewhere...

48

If you don't pass std::nothrow to ::operator new, wouldn't it throw std::bad_alloc already anyway? Seems like this condition will always be true.
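The distinction the reviewer raises can be illustrated with a small sketch (the helper names here are invented for illustration): plain `::operator new` reports failure by throwing `std::bad_alloc`, so a null check after it is dead code; only the `std::nothrow` overload signals failure by returning `nullptr`.

```cpp
#include <cstddef>
#include <new>

// Throwing form: on failure this throws std::bad_alloc and never
// returns, so checking the result against nullptr is unreachable code.
inline void *allocThrowing(std::size_t n)
{
    void *p = ::operator new(n);
    // if (!p) { ... }  // dead branch: p can never be nullptr here
    return p;
}

// Non-throwing form: this is the overload that reports failure via
// a nullptr return value.
inline void *allocNoThrow(std::size_t n)
{
    return ::operator new(n, std::nothrow);
}
```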

Hi, @poke1024!

There are several questions that should be answered before the patch is pushed (some were already asked by @alvinhochun):

  1. What is the speed benefit of this change (numbers before and after)?
  2. Does it fix any memory fragmentation problem?
  3. Can we use our own KisLockfreeStack for that purpose? (A stack is a more trivial lockfree structure than a queue, and is therefore preferred.)
  4. [main one] What is the strategy for releasing this pool? When do we release? There might be thousands of update objects requested for big images. How and when do we release them?
  5. [funny one] KisOpenGLUpdateInfo object has tileList object, which may have up to 1000+ items. Should we also pool it?

From my experience, most memory allocators do quite a good job for allocating small objects 200-300 bytes long. They start to fail and chop the memory only when the object sizes grow to dozens of kilobytes, which is not the case, I guess.

@alvinhochun @dkazakov, the specific use case for this PR, KisUpdateInfo, is indeed bogus (I had actually confused profiler values for the KisUpdateInfo constructor with malloc times). As it turns out, now that I've taken more measurements, you're right, Dmitry: the total time spent in (non-Qt) memory allocation, even under load, usually is not an issue (under 3%).

Here are some measurements to substantiate that claim.

First I ran this code to measure the mean time of allocations via "new" on OS X:

QElapsedTimer timer;
const int size = 8192;
static quint8 *buf[8192];
timer.start();
for (int i = 0; i < size; i++) {
    buf[i] = new quint8[64];
}
printf("mean alloc time %f\n", timer.nsecsElapsed() / (float)size);

The result was 52.6 ns per allocation, which is reasonably fast and matches almost exactly what other people measure for standard new and malloc calls (see https://stackoverflow.com/questions/23591196/are-calloc-malloc-faster-than-operator-new-in-c). Adding delete[], i.e. measuring the mean alloc-free time, I get 66 ns.

Now I added custom new and delete operators to count allocations (I put them in kis_debug):

#include <QAtomicInt>
#include <QElapsedTimer>

QAtomicInt _total_count;
QElapsedTimer _timer;

void * operator new(std::size_t n) throw(std::bad_alloc)
{
    int count = _total_count.fetchAndAddRelaxed(1);
    const int slice = 16384;
    if ((count % slice) == 0) {
        if (_timer.isValid()) {
            long ms = _timer.elapsed();
            printf("alloc %d %.1f\n", count, slice / (float)ms);
        }
        _timer.restart();  // also starts the timer on first use
    }

    return malloc(n);
}

void operator delete(void * p) throw()
{
    free(p);
}

Here are some outputs while drawing very big strokes with "deevad 1e digital smooth ink":

alloc 7766016 390,1
alloc 7782400 819,2
alloc 7798784 819,2
alloc 7815168 780,2
alloc 7831552 585,1
alloc 7847936 420,1
alloc 7864320 630,2
alloc 7880704 399,6
alloc 7897088 1092,3
alloc 7913472 862,3
alloc 7929856 780,2
alloc 7946240 1024,0
alloc 7962624 390,1
alloc 7979008 2730,7
alloc 7995392 655,4
alloc 8011776 468,1
alloc 8028160 565,0
alloc 8044544 655,4
alloc 8060928 630,2
alloc 8077312 780,2
alloc 8093696 268,6
alloc 8110080 409,6
alloc 8126464 381,0
alloc 8142848 1365,3
alloc 8159232 819,2
alloc 8175616 606,8
alloc 8192000 468,1
alloc 8208384 210,1
alloc 8224768 963,8
alloc 8241152 862,3
alloc 8257536 712,3
alloc 8273920 356,2
alloc 8290304 1820,4
alloc 8306688 1170,3
alloc 8323072 248,2
alloc 8339456 546,1
alloc 8355840 348,6
alloc 8372224 2340,6
alloc 8388608 744,7
alloc 8404992 334,4
alloc 8421376 334,4
alloc 8437760 780,2
alloc 8454144 252,1
alloc 8470528 1170,3

The first number is the total of allocations, the second is the mean number of allocations per millisecond.

Let's make a worst-case calculation: 2000 allocations per millisecond, each alloc-free pair taking 60 ns, gives 2000 * 60 = 120000 ns of allocation time per millisecond (1 ms = 1000000 ns), i.e. roughly 12% of the time.

But: most of the time (when drawing smaller strokes, or when using other brushes) the numbers are well below 500 allocations per millisecond, i.e. <= 3%.
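The worst-case arithmetic can be checked in a few lines (pure arithmetic, using only the measured numbers from above):

```cpp
// 2000 alloc-free pairs per millisecond at ~60 ns per pair:
constexpr long long kPairsPerMs = 2000;     // measured worst-case rate
constexpr long long kNsPerPair  = 60;       // measured alloc-free cost
constexpr long long kNsPerMs    = 1000000;  // nanoseconds in one millisecond

constexpr long long kNsSpent = kPairsPerMs * kNsPerPair;         // 120000 ns
constexpr double kFraction   = double(kNsSpent) / kNsPerMs;      // 0.12, i.e. 12%
```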

There are definitely other bottlenecks (Qt seems to eat quite some time), but memory bottlenecks are probably best attacked by finding code that excessively allocates complex structures in inner loops; it's not a general problem as long as the memory allocator is solid.

libs/ui/canvas/kis_update_info.h
48

you're right, this check is bogus

Hi, @poke1024!

Let's make a worst-case calculation: 2000 allocations per millisecond, each alloc-free pair taking 60 ns, gives 2000 * 60 = 120000 ns of allocation time per millisecond (1 ms = 1000000 ns), i.e. roughly 12% of the time.

As far as I can tell from the code, that thing measures *all* memory allocations happening in Krita, not only those for KisOpenGLUpdateInfo. Is that right?

If so, how does this number change after the pooling is turned on? Does it actually change?

Could you please answer the following questions:

  1. How did you choose the size of the pool (64)? Our updates are usually tiled into pieces of 256x256 pixels. It means that for a normal image of size 6000x4000, there will be at least 550 KisUpdateInfo objects per full update. Can we somehow automate the pool size (or use boost::pool)?
  2. Each KisOpenGLUpdateInfo object has a set of shared-pointer-guarded KisTextureTileUpdateInfo objects. The number of these objects is much higher than the number of KisUpdateInfo objects, so if we decide to optimize memory allocation, we should start with texture tiles instead. Did you check what portion of the allocations the KisTextureTileUpdateInfo objects account for?
  3. Actually, each KisTextureTileUpdateInfo object also has a pointer to a buffer object that stores the real pixels. We already have a pool for these buffers (KisTextureTileInfoPool), which is based on boost::pool.
  4. Can we use the standard boost::pool for the purpose of pooling? It has only one drawback: one can purge data from the pool *only* when it has no allocations at all.
  5. If we choose a lockfree structure, can we use our own KisLockfreeStack for that purpose? It is already used for caching in a few places, and a queue implementation is generally more complex than that of a stack.

Hi @dkazakov, regarding your questions:

(1) I chose 64 because there were about 20 to 30 items allocated on my system during drawing, so I doubled it for good measure for slower systems.
(2) I didn't have KisTextureTileUpdateInfo on my radar until now, but indeed it's a hit (at least 5% in the main thread alone, see the profile screenshot below), so yes, we should definitely tackle that (especially as it blocks the main thread).
(3), (4) Using boost::pool is definitely an option, as long as we stay in one thread (boost::pool is not thread safe). Is this the case here? As long as the number of objects needed over an application lifetime stays roughly constant (which should be the case for tile infos), the deallocation is no problem, I believe, as these objects are basically always in use and only temporarily pooled.
(5) I don't think we can use a stack for pooling. If we need a thread-safe memory pool, we need to come up with something lock-free of our own, as boost::pool with standard mutexes kills the benefits of using a memory pool. I wrote one a few months ago (https://github.com/poke1024/cmathics/blob/master/concurrent/pool.h), but a simple lock-free queue as originally described in this PR should really do the trick if we focus on one object type like KisTextureTileUpdateInfo.

Hi @dkazakov, after looking at the profiling and the code again, I'm a bit confused. The hotspot looks like the m_pool.purge_memory() call inside KisTextureTileInfoPoolSingleSize::free(), but it's really not called that often. I did that profile on an iMac 5K, where performance is abysmal, so maybe something is fishy in that specific configuration. I will investigate further.

Hi, @poke1024!

Hi @dkazakov, regarding your questions:

(1) I chose 64 because there were about 20 to 30 items allocated on my system during drawing, so I doubled it for good measure for slower systems.

Well... Ideally, it should be calculated dynamically somehow... The problem is: we had a lockfree tiles pool of fixed size until recently, and I recently found out that it only degrades performance. So I'm a bit suspicious of such (especially constant-size) pools.

(2) I didn't have KisTextureTileUpdateInfo on my radar until now, but indeed it's a hit (at least 5% in the main thread alone, see the profile screenshot below), so yes, we should definitely tackle that (especially as it blocks the main thread).

If we go into "pool everything we allocate", then we need some uniform way of doing that with auto-estimation of the pool size. Right now we have at least four types of objects that might need pooling:

  1. KisOpenGLUpdateInfo
  2. KisTextureTileUpdateInfo
  3. DataBuffer
  4. DataBuffer::m_data (already pooled with boost::pool)

There is a slight difference between the cases: in cases 1-3 the object size is small and constant. In case 4, the chunks can have slightly different sizes and they might be big. It means that the requirements for case 4 are a bit stricter: it should also fight the memory fragmentation problem. That was actually the reason we started to use boost::pool: we got huge memory fragmentation on Windows (it has a weird memory allocator, at least in MSVC).

(3), (4) Using boost::pool is definitely an option, as long as we stay in one thread (boost::pool is not thread safe). Is this the case here?

No, it is not the case; all four object types are accessed from different threads. And, according to your screenshot, QMutex::lock() in KisTextureTileInfoPool::free() takes 0% of the time, so it should not be a problem :)

As long as the number of objects needed over an application lifetime stays roughly constant (which should be the case for tile infos) the deallocation is no problem I believe, as these objects are basically always in use and only temporarily pooled.

Well, the problem is that the user expects the memory to be released when the document is closed. That is more a "mental" than a "technical" problem, but we get bug reports telling us "Krita eats too much memory and does not release it back when I close the document".
That is why I'm asking about ways of releasing this memory. It might not be the major point in cases 1-3, but it is perfectly true for case 4, which can allocate up to a GiB of memory.

(5) I don't think we can use a stack for pooling. If we need a thread safe memory pool, we need to come up with something lock free of our own

KisLockfreeStack is already lockfree. The only difference from the boost queue is that it doesn't have a size limit.

as boost::pool with standard mutexes kills the benefits of using a memory pool.

As your screenshot says, the mutexes take 0% in KisTextureTileInfoPool::free(), so that shouldn't be a problem.

{F4852518}

5% of the time is spent cleaning up memory chunks when they are no longer used. That usually happens at the end of the stroke, after its rendering has completed.

I would actually propose having a "unified" pool system for constant-sized objects, but basing it on boost::pool + mutex instead. We should also clean such pools when they occupy more than XX MiB of memory (counted not in the number of objects, but in occupied memory). It would solve both problems: memory fragmentation and allocation time.
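A sketch of what such a unified pool could look like, using only the standard library (the class name, the byte-budget policy, and the use of a plain free list instead of boost::pool are all assumptions for illustration; the real thing would presumably wrap boost::pool behind the mutex):

```cpp
#include <cstddef>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical mutex-protected pool for constant-sized objects that
// caps its cached memory by a byte budget rather than an object count.
// Unlike boost::pool, it can release memory back even while other
// blocks are still allocated.
class FixedSizePool {
    const std::size_t m_blockSize;       // size of one pooled object
    const std::size_t m_maxCachedBytes;  // budget for the free list
    std::vector<void *> m_free;          // cached, currently unused blocks
    std::mutex m_mutex;

public:
    FixedSizePool(std::size_t blockSize, std::size_t maxCachedBytes)
        : m_blockSize(blockSize), m_maxCachedBytes(maxCachedBytes) {}

    ~FixedSizePool() {
        for (void *p : m_free) ::operator delete(p);
    }

    void *allocate() {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_free.empty()) {
            void *p = m_free.back();  // reuse a cached block
            m_free.pop_back();
            return p;
        }
        return ::operator new(m_blockSize);
    }

    void deallocate(void *p) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if ((m_free.size() + 1) * m_blockSize > m_maxCachedBytes) {
            ::operator delete(p);   // over budget: give memory back
        } else {
            m_free.push_back(p);    // under budget: keep for reuse
        }
    }

    std::size_t cachedBytes() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_free.size() * m_blockSize;
    }
};
```

The byte budget directly addresses the "Krita eats too much memory" complaint: once the cache exceeds XX MiB, frees go straight back to the allocator instead of accumulating in the pool.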

dkazakov requested changes to this revision. Oct 23 2017, 12:53 PM

I'm not sure whether this patch is still relevant after D8279. Anyway, I will just mark this patch as needing changes, so that it doesn't show up in the "must review" list.

This revision now requires changes to proceed. Oct 23 2017, 12:53 PM