ccache
Closed, Wontfix · Public

Description

ccache certain builds that take way too long for their own good.

As we are using multiple build nodes, ccache is a tricky thing to do well. There is the general concern of running out of disk space on the slaves, since builds take roughly twice as much disk space with a cache as without. OTOH, build time can be improved by some 80-90% for the majority of builds.
While we can't do much about the disk space bit, we can deal with the distributed nature of things.

For disk space problems, jobs should be forced onto a suitably fat node (master?). Builds that would risk disk space issues are the ones that benefit the most from a cache. Changing the node a build can appear on is easy to do and should be the way to deal with this.

As for the distributed slave problem:

  • There will be a cache directory on master on the DO block volume.
  • The cache will contain tar.gz'd ccache dirs, one per binary job. So there'll be xenial_unstable_frameworks_kio.tar.gz and xenial_unstable_frameworks_ki18n.tar.gz, and each will contain a fully qualified ccache dir for use with the env var CCACHE_DIR.
  • Upon build start the slaves retrieve this cache tarball from the master. The way this should happen remains to be determined. Reverse SSH access to the jenkins user is a no-go as that'd expose the entire build system to potentially compromised nodes. Ideally the master would push the tarball to the slave; to my knowledge there is no Jenkins plugin for this though. We could somehow archive the tarball and unarchive it via Jenkins. Another alternative would be to run an rsyncd on the master and rsync on the slave.
  • The cache should probably be unpacked into $WORKSPACE/ccache so the caches are isolated per job and get cleaned up along with the build artifacts.
  • Inside the docker container CCACHE_DIR is set accordingly (i.e. /workspace/ccache)
  • Upon successful build the ccache dir is pruned (i.e. ccache's cleanup for dropping now-unused objects), tar'd, gzip'd and pushed back to the server (see the slave-side sketch after this list). Pushing might best be done via Jenkins job artifacts. Reverse access is again a no-go, and since workspaces on the slaves are deleted when the bin_amd64 job ends, we can't easily have the master pull the tarball after that.
  • If done via Jenkins job artifacts, the master job (i.e. xenial_unstable_frameworks_kio for xenial_unstable_frameworks_kio_bin_amd64) moves the tarball out of the Jenkins job's archive directory into its cache directory (see the master-side sketch below).
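
To make the intended slave-side flow a bit more concrete, here is a rough sketch assuming the rsyncd option, with a `ccache` module on the master; the host name, docker image and build script are placeholders, not the actual tooling:

```
#!/bin/sh
# Hypothetical slave-side wrapper. Assumes an rsyncd on the master exporting a
# "ccache" module and that JOB_NAME/WORKSPACE are set by Jenkins as usual;
# the master host, image and entry point are placeholders.
set -e

MASTER=master.example.local           # placeholder master host
CACHE_NAME="${JOB_NAME%_bin_amd64}"   # e.g. xenial_unstable_frameworks_kio
TARBALL="${CACHE_NAME}.tar.gz"

# Fetch the per-job cache tarball from the master; a missing cache is not fatal.
rsync "rsync://${MASTER}/ccache/${TARBALL}" "${WORKSPACE}/" || true

# Unpack into the workspace so the cache is isolated per job and gets cleaned
# up along with the rest of the workspace.
mkdir -p "${WORKSPACE}/ccache"
if [ -f "${WORKSPACE}/${TARBALL}" ]; then
    tar -xzf "${WORKSPACE}/${TARBALL}" -C "${WORKSPACE}/ccache"
fi

# Build inside docker with CCACHE_DIR pointing at the mounted cache dir.
docker run --rm \
    -v "${WORKSPACE}:/workspace" \
    -e CCACHE_DIR=/workspace/ccache \
    build-image /workspace/build.sh   # placeholder image and build script

# Prune now-unused objects, then repack the cache so Jenkins can archive it
# as a job artifact.
CCACHE_DIR="${WORKSPACE}/ccache" ccache --cleanup
tar -czf "${WORKSPACE}/${TARBALL}" -C "${WORKSPACE}/ccache" .
```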
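And a minimal sketch of the corresponding master-side move; the Jenkins archive path and the cache directory mount point are assumptions, not the real layout:

```
#!/bin/sh
# Hypothetical master-side step, run by e.g. xenial_unstable_frameworks_kio
# after its _bin_amd64 child finished: move the archived cache tarball into
# the cache directory on the block volume. All paths are illustrative.
set -e

BIN_JOB="${JOB_NAME}_bin_amd64"
ARCHIVE="/var/lib/jenkins/jobs/${BIN_JOB}/builds/lastSuccessfulBuild/archive"
CACHE_DIR="/mnt/volume/ccache"        # placeholder mount point of the DO block volume

mv "${ARCHIVE}/${JOB_NAME}.tar.gz" "${CACHE_DIR}/"
```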
sitter created this task. Mar 14 2017, 10:48 AM
sitter updated the task description. Mar 28 2017, 6:59 AM
sitter moved this task from Doing to Ready To Do on the Neon board. Mar 28 2017, 7:03 AM

Looked at the prototype behavior and adjusted the game plan accordingly.

Observations:

  • We can't have per-slave caches: they are not only inefficient, they are also useless, as we can't have huge caches due to disk space restrictions. With the number of builds we have, by the time another build of foobar triggers, the previous one might already have been kicked out of the cache by other builds due to space exhaustion.
  • Per-slave caches are also silly as we do not try to build jobs on the previous slave but on any available one, so chances are good a build does not even happen on the node that holds the most recent cache.
  • Smaller builds benefit next to nothing from caching, and we may well want a metric by which to decide whether to cache at all. I suspect we may actually slow things down if we cache a super small build.
sitter updated the task description. Mar 28 2017, 7:04 AM
sitter removed sitter as the assignee of this task.

Used on arm64 and it doesn't make a difference, says Harald.

sitter closed this task as Wontfix. Nov 19 2019, 2:04 PM
sitter claimed this task.

The arm setup actually disappeared since (it possibly wasn't carried over to the new server), so there are no up-to-date metrics. It does not really matter though, since we do not have a way to pull this off for amd64. We'd require persistent storage for the cache: the nodes are largely ephemeral, so the results need "offsite" storage. This would be tricky to do since we currently have no space for it, and the actual transfer overhead would likely destroy any gains made by the cache. Plus there is of course the risk of false positives anyway. All in all, no longer viable to implement.