Improving the performance of the floodfill algorithm
Open, Needs TriagePublic

Description

Improving the performance floodfill algorithm

Abstract

In this task I want to talk about some experiments I made to explore possible improvements in the performance of the floodfill algorithm.

The main approach revolves around the parallelization of the algorithm, that is, dividing the work spatially in multiple threads. Each thread will compute the floodfill in a small region of the image. Each of those regions do not overlap, avoiding that way the need for synchronization.

The parallelization alone improves a bit the performance, but not much. I think that the main bottleneck of an algorithm like the floodfill is the cache misses due to randomly accessing a large data chunk like an image. So a way to improve cache locality is also presented. That's what really improves the performance, but for this to work, parallelization is still needed.

I made small test app that implements four floodfill algorithms to be able to compare different approaches. Those are: the naive stack-based floodfill, the scanline floodfill, and the corresponding multithreaded versions of those. Following I provide a figure that shows a graph comparing the four approaches. It can be seen that the naive algorithm does not scale well with the image size, since it needs to randomly access pixels in very different locations of the image. The scanline floodfill is much better in terms of cache locality because it works with horizontal spans of pixels, which are contiguous in memory. In my tests the multithreaded naive floodfill is faster than the scanline floodfill, because it reduces even more the cache misses. But the king is with no doubt the multithreaded scanline floodfill, since it combines the cache locality inherent to the scanline floodfill with the cache locality of the multithreaded version.

I also made a test app that implements the algorithm. It uses plain QImage. I don't know how feasible will be implementing this approach in Krita.

Parallelizing the floodfill algorithm

At first, it may seem that an algorithm like the floodfill is hard to parallelize. In fact, after searching the internet for how to do that, I couldn't find any solution of interest. But at the end it is very simple.

First of all we need some mechanism to distribute work in different threads, some sort of thread pool. I'll refer to that as TP in the following text for brevity sake.

I'll refer to the global image region of interest as GlobalRect. That is the region where the overall floodfill must be performed. So, conceptually, we divide that GlobalRect into smaller contiguous, non-overlaping, regions or tiles. I'll call those TileRect. We'll perform the floodfill locally on those TileRects. Here the tiles are just subregions of the image, not real tiles.

We maintain a global (in contrast to tile-local) hash that I'll call TilePropagationInfo. A TilePropagationInfo has a TileId as a key and a SeedList as value for each entry. A TileId is just a pair of numbers that identify a tile by its position on the tile grid. The SeedList is a list of seeds that indicates where in the tile the local floodfill should be performed. Since we work locally on each tile, it may be necessary to perform multiple floodfills to ensure that the tile is correctly filled (think of a tile that is divided in the middle by some non-fillable pixels, creating two non-tile-connected regions of fillable pixels that are connected globally).

In the case of the naive floodfill, the SeedList is a list of points inside the tile where the floodfill needs to be performed. On the other hand, in the scanline floodfill it is a list of spans.

We also need a routine that can perform the floodfill on a locally on a tile. I'll call that localFloodFill(...). The parameters passed to the localFloodFill are not important since they may depend on the implementation, but all the information needed to perform the floodfill locally should be passed. This routine needs very few modifications with respect to the normal flood fill. When it finds a border of the GlobalRect it stops propagating, as expected. The main differences are:

  1. It takes a list of seeds instead of just one, so at the end it performs various smaller floodfills.
  1. A local TilePropagationInfo is maintained. It has four s corresponding to the neighbor tiles. When a border of the TileRect is reached, the floodfill stops propagating, but in this case the intention or need of keep propagating outside the tile is achieved by adding some seeds to the corresponding TilePropagationInfo entry.
  1. The routine returns the local TilePropagationInfo.

At first the global TilePropagationInfo is filled with just an entry that has as key the TileId of the seed point passed to the algorithm, and the single seed point inserted in the SeedList (or a span constructed from it) as value.

In subsequent passes the TilePropagationInfo hash may contain multiple entries.

The following steps are repeated until the TilePropagationInfo is empty:

  1. We then need to perform a local floodfill on each tile which id is in the hash, using the corresponding seeds. For each entry in the hash we do TP.addJob(localFloodFill(...)).
  1. After all the jobs are added, we need to wait for all of them to finish, so we insert kind of a barrier: TP.waitForFinished().
  1. Then, each job returns a local TilePropagationInfo object. It has four entries corresponding to the neighbor tiles of the tile that was processed. If any of those neighbor entries has some seeds, then a new job will have to be added to process that neighbor tile with those seeds. Different jobs may return a local TilePropagationInfo that has entries with the same TileId. This may happen for example if we just processed two tiles that share some neighbor tile. So we clear the global TilePropagationInfo and then add all the entries in the local TilePropagationInfo returned by the jobs (as long as they have seeds) while combining them if they have the same TileId (this is easy since we use a hash. There will be no repeated keys, so no repeated TileIds).

Making the algorithm cache friendly

This is what really speeds up the algorithm. At first it may seem like a hard task, but the solution is very straightforward, although it may seem counter-intuitive when one thinks about it.

Since we now work on tiles, instead of having the localFloodFill access the pixels of the images directly, we can do the following:

  1. When we enter the localFloodFill routine, we make an array of tileWidth * tileHeight contiguous entries and copy the contents of the images that lie inside TileRect to it.
  1. We perform the floodfill using that array. This ensures that not only the pixels on the same scanline are near, but also the pixels on different scanlines.
  1. We copy back the contents of the array to the images.

In the test app I made a reference image and a fill mask image are passed to the localFloodFill routine. The reference image has the pixel data that we need to know if a pixel should be selected. The fill mask image is used to store the selected pixels as well as to know if a pixel was already selected. So the type of each entry in the array is TileData, a struct that contains a reference pixel value and a fill mask pixel value. I guess this will depend on the implementation.

deiflou created this task.Jul 10 2022, 8:52 PM
deiflou renamed this task from Improving the performance floodfill algorithm to Improving the performance of the floodfill algorithm.

Hi, @deiflou!

Comments

So, conceptually, we divide that GlobalRect into smaller contiguous, non-overlaping, regions or tiles

We have routines for that: KritaUtils::splitRectIntoPatches and KritaUtils::splitRectIntoPatchesTight

When we enter the localFloodFill routine, we make an array of tileWidth * tileHeight contiguous entries and copy the contents of the images that lie inside TileRect to it.

In real Krita you can mimic the same behavior without copying. If your tile size is equal to the size of tiles and aligned with the tiles grid of the device (which may be offset by device->{x,y}()), then you will operate on a "contiguous" memory. Just make sure that the tiles grid is aligned and you use KisRandomAccessor for accessing the pixels (you can verify the "optimized" state by checking accessor->numContiguousRows/Columns).

Concerns

The main question for me in this algorithm is how to define spans on the borderline of the tiles. KisScanlineFill uses "running" spans as a border for other span propagation. So you'll need to keep "already executed" spans of the tiles to merge them later. I'm not sure I know how to do that (though I haven't cheked your app yet).

Implementation suggestions

As far as I understood the algorithm, it can be implemented quite easily in Krita within the strokes framework. You need two types of jobs. First is "tile fill job", which is KisStrokeJobData::CONCURRENT and fills a single tile. Second is the "tiles merge job", which merges the tiles' spans and restarts the filling process in them. This latter job should obviously be KisStrokeJobData::SEQUENTIAL.

So first, you implement "tiles merge job", which generates/merges the tiles and then adds a bunch of jobs using addMutatedJobs right after itself. This bunch consists of a set of concurrent "tile fill jobs" and a single "tiles merge job" at the end that will restart the loop.