KWin tests requiring OpenGL are failing
Closed, Resolved · Public

Description

Hi all,

The KWin tests which require OpenGL have started to fail on build.kde.org. The failures began with https://build.kde.org/job/Plasma/job/kwin/job/kf5-qt5%20SUSEQt5.11/262/

Could it be that the vgem devices are not passed to the container?

graesslin created this task. Jan 1 2019, 6:27 PM
Restricted Application added a subscriber: sysadmin. Jan 1 2019, 6:27 PM

I've investigated this, and it appears that shortly before this regression occurred we rebuilt our SUSE images.
So it would seem the regression is due to something in the software stack (probably Mesa/X/Wayland).

The VGem devices are still being passed through and the permissions on them still permit them to be accessed by the tests.

Can we get some more debug output from Mesa / KWin's tests to figure out why OpenGL isn't initializing?

What we have from the tests is:

```
QDEBUG : WobblyWindowsShadeTest::initTestCase() kwin_core: Compositing forced to OpenGL mode by environment variable
QDEBUG : WobblyWindowsShadeTest::initTestCase() org.kde.kcoreaddons: Checking for plugins in ("/home/jenkins/workspace/Plasma/kwin/kf5-qt5 SUSEQt5.11/build/bin/org.kde.kwin.scenes", "/home/jenkins/install-prefix/lib64/plugins/org.kde.kwin.scenes", "/usr/lib64/qt5/plugins/org.kde.kwin.scenes")
QDEBUG : WobblyWindowsShadeTest::initTestCase() kwin_scene_opengl: Initializing OpenGL compositing
QDEBUG : WobblyWindowsShadeTest::initTestCase() kwin_platform_virtual: Found a device: /dev/dri/card0
MESA-LOADER: failed to retrieve device information
gbm: failed to open any driver (search paths /usr/lib64/dri)
gbm: Last dlopen error: /usr/lib64/dri/vgem_dri.so: cannot open shared object file: No such file or directory
failed to load driver: vgem
QDEBUG : WobblyWindowsShadeTest::initTestCase() kwin_core: Instantiated compositing plugin: "SceneQPainter"
```

AFAIK there are some Mesa env variables which might provide more information; they are documented on Mesa's web page.
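For example, something along these lines could be set from the test harness before any GL initialization (or the equivalent variables exported in the CI job). This is only a sketch; LIBGL_DEBUG and MESA_DEBUG are standard Mesa environment variables, but the exact set supported depends on the Mesa version:

```cpp
#include <QtGlobal>
#include <QByteArray>

// Hypothetical helper: turn on verbose Mesa loader/driver output before the
// OpenGL backend is created. Equivalent to exporting these variables in the
// environment of the test run.
static void enableMesaDebugOutput()
{
    qputenv("LIBGL_DEBUG", QByteArrayLiteral("verbose"));
    qputenv("MESA_DEBUG", QByteArrayLiteral("1"));
}
```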

It would appear this is our cause:

MESA-LOADER: failed to retrieve device information gbm: failed to open any driver (search paths /usr/lib64/dri) gbm: Last dlopen error: /usr/lib64/dri/vgem_dri.so: cannot open shared object file: No such file or directory failed to load driver: vgem

Checking on the current SUSE image I see:

d9e3e9339f89:/ # ls  /usr/lib64/dri/
i915_dri.so  i965_drv_video.so  nouveau_drv_video.so  r300_dri.so  r600_drv_video.so  radeonsi_dri.so        swrast_dri.so      vmwgfx_dri.so
i965_dri.so  kms_swrast_dri.so  r200_dri.so           r600_dri.so  radeon_dri.so      radeonsi_drv_video.so  virtio_gpu_dri.so

If I compare it with what we used to have I see:

d07a16340691:/ # ls /usr/lib64/dri/
i915_dri.so  i965_dri.so  kms_swrast_dri.so  nouveau_drv_video.so  r200_dri.so  r300_dri.so  r600_dri.so  r600_drv_video.so  radeon_dri.so  radeonsi_dri.so  radeonsi_drv_video.so  swrast_dri.so  virtio_gpu_dri.so  vmwgfx_dri.so

I'm speculating here, but given the major change was a Mesa update, I'd say we've found a Mesa regression here.
Is that a possibility?

zzag added a subscriber: zzag. Jan 2 2019, 2:50 PM

It's a possibility.

For comparison, on my system the debug output is:

QDEBUG : SceneOpenGLTest::initTestCase() kwin_core: Compositing forced to OpenGL mode by environment variable
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: Initializing OpenGL compositing
QDEBUG : SceneOpenGLTest::initTestCase() kwin_platform_virtual: Found a device:  /dev/dri/card1
MESA-LOADER: failed to retrieve device information
gbm: failed to open any driver (search paths /usr/lib/x86_64-linux-gnu/dri:${ORIGIN}/dri:/usr/lib/dri)
gbm: Last dlopen error: /usr/lib/dri/vgem_dri.so: cannot open shared object file: No such file or directory
failed to load driver: vgem
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: Egl Initialize succeeded
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: EGL version:  1 . 4
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: Created EGL context with attributes: 
Version requested:      false
Robust: false
Forward compatible:     false
Core profile:   false
Compatibility profile:  false
High priority:  false
OpenGL vendor string:                   VMware, Inc.
OpenGL renderer string:                 llvmpipe (LLVM 6.0, 256 bits)
OpenGL version string:                  3.0 Mesa 18.0.5
OpenGL shading language version string: 1.30
Driver:                                 LLVMpipe
GPU class:                              Unknown
OpenGL version:                         3.0
GLSL version:                           1.30
Mesa version:                           18.0.5
Linux kernel version:                   4.15
Requires strict binding:                no
GLSL shaders:                           yes
Texture NPOT support:                   yes
Virtual Machine:                        no
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: OpenGL 2 compositing enforced by environment variable
QWARN  : SceneOpenGLTest::initTestCase() libkwinglutils: Skipping self test as it is reported to return false positive results on Mesa drivers
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: OpenGL 2 compositing successfully initialized

So we also get the failure to load the driver, but it works nevertheless.

With test coverage I found that we fail in: https://phabricator.kde.org/source/kwin/browse/master/platformsupport/scenes/opengl/abstract_egl_backend.cpp$99

Interestingly, we seem to be missing some debug output on build.kde.org. It looks like the KWIN_OPENGL category is not enabled.
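As a side note, those categories could be forced on for the test runs, either via QT_LOGGING_RULES in the CI environment or from code. A minimal sketch, using the category names visible in the output above:

```cpp
#include <QLoggingCategory>
#include <QString>

// Enable debug output for all KWin logging categories (kwin_core,
// kwin_scene_opengl, kwin_platform_virtual, ...) in this process.
// Equivalent to exporting QT_LOGGING_RULES="kwin_*.debug=true" in the CI job.
static void enableKWinDebugLogging()
{
    QLoggingCategory::setFilterRules(QStringLiteral("kwin_*.debug=true"));
}
```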

graesslin added a comment (edited). Jan 2 2019, 7:02 PM

There is one additional env variable which could be set: EGL_LOG_LEVEL=debug

Most likely the Mesa package is misconfigured. Mesa switched to Meson, and we have already had bugs caused by distros configuring Mesa incorrectly.

I've now re-run the builds - does that output give any clues?

New additional output:
libEGL debug: EGL user error 0x3001 (EGL_NOT_INITIALIZED) in eglInitialize
libEGL debug: EGL user error 0x3001 (EGL_NOT_INITIALIZED) in eglInitialize
libEGL debug: EGL user error 0x3001 (EGL_NOT_INITIALIZED) in eglDestroyContext

According to the documentation, EGL_NOT_INITIALIZED is generated if the display cannot be initialized.

But none of this tells us why...
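For context, a minimal sketch of the initialization step where this error typically surfaces (illustrative only, not KWin's actual code):

```cpp
#include <EGL/egl.h>
#include <cstdio>

// eglInitialize() returns EGL_FALSE and eglGetError() reports
// EGL_NOT_INITIALIZED (0x3001) when the display cannot be brought up,
// which matches the libEGL debug output above.
bool initializeEglDisplay(EGLDisplay display)
{
    EGLint major = 0;
    EGLint minor = 0;
    if (eglInitialize(display, &major, &minor) == EGL_FALSE) {
        std::fprintf(stderr, "eglInitialize failed, error 0x%x\n",
                     static_cast<unsigned>(eglGetError()));
        return false;
    }
    std::fprintf(stderr, "EGL version: %d.%d\n", major, minor);
    return true;
}
```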

So it doesn't like connecting to our Xvfb instance?

> So it doesn't like connecting to our Xvfb instance?

No, that should be impossible. We already have the EGLDisplay, and it is created for GBM.
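For reference, the display in question is obtained roughly like this on the GBM path; a simplified sketch, not the exact KWin code:

```cpp
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <gbm.h>
#include <fcntl.h>
#include <unistd.h>

// Obtain an EGLDisplay for a DRM node via GBM. No X11 connection (and thus
// no Xvfb) is involved anywhere on this path.
EGLDisplay createGbmDisplay(const char *drmNode) // e.g. "/dev/dri/card0"
{
    const int fd = open(drmNode, O_RDWR | O_CLOEXEC);
    if (fd < 0) {
        return EGL_NO_DISPLAY;
    }
    gbm_device *device = gbm_create_device(fd);
    if (!device) {
        close(fd);
        return EGL_NO_DISPLAY;
    }
    const auto getPlatformDisplay = reinterpret_cast<PFNEGLGETPLATFORMDISPLAYEXTPROC>(
        eglGetProcAddress("eglGetPlatformDisplayEXT"));
    if (!getPlatformDisplay) {
        return EGL_NO_DISPLAY;
    }
    return getPlatformDisplay(EGL_PLATFORM_GBM_KHR, device, nullptr);
}
```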

When you try to reproduce locally, are you using the Docker images we ship on Docker Hub, or a SUSE VM?

On another note - does KWin's test infrastructure start its own display services or anything along those lines, by any chance? (I wonder if the issue is because the CI tooling starts Xvfb and then KWin does some additional stuff, and DRM/GBM/Mesa freaks out as a result.)

> When you try to reproduce locally, are you using the Docker images we ship on Docker Hub, or a SUSE VM?

Neither; I just use my normal system.

> On another note - does KWin's test infrastructure start its own display services or anything along those lines, by any chance? (I wonder if the issue is because the CI tooling starts Xvfb and then KWin does some additional stuff, and DRM/GBM/Mesa freaks out as a result.)

Yes, the test infrastructure starts a display server, but the failure is prior to starting it.

Okay. The CI system is currently using Mesa 18.3.1 - how does this compare to your local system?

My guess at this point is that we're dealing with a Mesa behaviour change / regression. Given that it "can't initialize the display", I would hazard a guess that it wants to see a monitor - which VGem, of course, is never going to have (hence the failure).

I can't see anything further I can do aside from giving you the details needed to reproduce this in a local environment. This probably requires assistance from the Mesa developers, unless you have any other ideas as to why a display couldn't be initialized?

zzag added a comment. Jan 4 2019, 8:25 PM

> Okay. The CI system is currently using Mesa 18.3.1 - how does this compare to your local system?

I have 18.3.1, all tests pass.

zzag added a comment. Jan 4 2019, 8:39 PM

Oh, I forgot to load vgem. Now the tests fail.

That would seem to support the theory that Mesa has a regression with regards to VGem support...

zzag added a comment. Jan 4 2019, 9:02 PM

I downgraded Mesa to 18.2.6 and tests pass again.

Guess that confirms it's a Mesa regression - will you and Martin handle sorting it with the Mesa devs?

Could you please downgrade Mesa on the CI system?

I would love to, but unfortunately Tumbleweed has already withdrawn the older versions of Mesa from its repositories, so we're unable to do so :(
The only repositories containing older Mesa builds for Tumbleweed now are people's personal home:* repositories (per https://software.opensuse.org/package/Mesa).

So what should we do? We cannot keep the tests failing until Mesa fixes this - if they ever do. Can we get another CI base system which is not constantly rolling and introducing issues?

By the way, I cannot report it to Mesa, as my Mesa does not expose the problem, so I would be unable to provide any debug information.

graesslin added a subscriber: fvogt. Jan 5 2019, 8:19 AM

Pulling in @fvogt - our unit tests found a regression in the Mesa shipped in Tumbleweed. Could you please trigger an investigation inside openSUSE into how it could happen that this was not discovered before shipping the update?

I'll ask our openSUSE packagers if something can be done. Given that the packaging for the older version is still around, it should hopefully be fairly straightforward for them to provide something.

As for why the CI system base is constantly rolling: this was put in place to handle KWin's and Plasma's very bleeding-edge requirements at the time around Mesa / Wayland / EGL / etc. In any event, a regression is a regression, so we would have been hit eventually.

Note that per the investigation by @zzag above, it looks like this is only triggered in scenarios where VGem is present, so it's likely most environments wouldn't see this.

> Note that per the investigation by @zzag above, it looks like this is only triggered in scenarios where VGem is present, so it's likely most environments wouldn't see this.

That's true, but if openSUSE ran our unit tests it would have been found.

I investigated a little what changed in the new Mesa version. It introduces an EGL_MESA_device_software extension, which seems to support what we do with vgem without needing vgem. So we might need to adjust our code to support this new extension in addition to vgem.

OK, I found something and implemented it in D17980. @zzag, can you please test it?
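For the record, a minimal sketch of what using the surfaceless platform looks like, assuming the client extension is advertised; this is illustrative only, not the actual D17980 change:

```cpp
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <cstring>

#ifndef EGL_PLATFORM_SURFACELESS_MESA
#define EGL_PLATFORM_SURFACELESS_MESA 0x31DD // from EGL_MESA_platform_surfaceless
#endif

// Create an EGLDisplay through EGL_MESA_platform_surfaceless, which needs no
// DRM device (and therefore no vgem) at all. Returns EGL_NO_DISPLAY so the
// caller can fall back to the vgem path if the extension is missing.
EGLDisplay createSurfacelessDisplay()
{
    const char *clientExtensions = eglQueryString(EGL_NO_DISPLAY, EGL_EXTENSIONS);
    if (!clientExtensions ||
        !std::strstr(clientExtensions, "EGL_MESA_platform_surfaceless")) {
        return EGL_NO_DISPLAY;
    }
    const auto getPlatformDisplay = reinterpret_cast<PFNEGLGETPLATFORMDISPLAYEXTPROC>(
        eglGetProcAddress("eglGetPlatformDisplayEXT"));
    if (!getPlatformDisplay) {
        return EGL_NO_DISPLAY;
    }
    // The extension requires the native display argument to be
    // EGL_DEFAULT_DISPLAY (i.e. a null pointer).
    return getPlatformDisplay(EGL_PLATFORM_SURFACELESS_MESA, nullptr, nullptr);
}
```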

Unfortunately this didn't work - we now have: "Failed to create surfaceless platform, trying with vgem device"

I played a little bit with strace to see what is used.

If I use EGL_MESA_platform_surfaceless, the Intel driver for my Intel GPU is used. If I use vgem, the kms_swrast_dri driver is used.

More results from playing with strace and reading Mesa code: @bcooksley, please add LIBGL_ALWAYS_SOFTWARE=true to the environment variables used for running the tests. With that variable set, I got the test to use the kms_swrast device instead of the Intel device.

bshah added a subscriber: bshah. Jan 5 2019, 5:26 PM

Please retry.

@graesslin Should I remove the VGem device as well? It seems the tests still fail after setting that environment variable.

I don't think that removing vgem will help. The next code path then uses the default platform, and that will certainly fail. What I would like to see is an strace of a failing test.

A further hint: https://build.opensuse.org/public/build/X11:XOrg/openSUSE_Tumbleweed/x86_64/Mesa/_log shows --with-platforms=x11,drm,wayland, while on https://sources.debian.org/src/mesa/18.2.8-2/debian/rules/ and https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa we see surfaceless additionally enabled. This explains why the surfaceless change did not work. Looks like an openSUSE packaging issue to me.

Luca and Fabian, what's the best way of getting this rectified? Filing a bug against Mesa at SUSE?

In the meantime, would it be possible to get the older version or a tweaked version of Mesa so we can restore normal service?

We can't roll back as far as I know. Please file a bug at bugzilla.opensuse.org against Mesa, adding all the relevant information.

Martin and Vlad, could you please file the relevant bug? I'm not familiar enough with Mesa to provide all the information which they may need.

fvogt added a comment (edited). Jan 6 2019, 12:14 AM

> That's true, but if openSUSE ran our unit tests it would have been found.

That's being worked on, but it's not in Tumbleweed yet. Sorry for the inconvenience!

> In the meantime, would it be possible to get the older version or a tweaked version of Mesa so we can restore normal service?

I did a quick test build and submitted the change to X11:XOrg - if you created a bug, please tell me the number so that I can link it.

KDE:Qt:5.11 (do you need other projects as well, such as 5.9, 5.10, 5.12?) has Mesa binaries with surfaceless enabled now (version 18.3.1-914.1 and up),
so it should work after rebuilding the CI image. Not tested though.

If for some reason it does not work as expected, we could rebuild an older revision of Mesa in the project as well.

Edit: Binaries from older Tumbleweed snapshots are available at http://download.opensuse.org/history/. So if you know which file needs to be replaced, you could install the RPM from there.

Many thanks for enabling surfaceless in those packages, Fabian - very much appreciated.
I can confirm that this has fixed the issue with the KWin tests - https://build.kde.org/job/Plasma/job/kwin/job/kf5-qt5%20SUSEQt5.11/

It's probably not a bad idea to add them to the Qt 5.12 project as well, for when Plasma bumps to that version. (I should probably set up a 5.12 image.)

Frameworks will bump to Qt 5.10 soon with the next release due to go out, so Qt 5.9 support will be discontinued then.

Thanks everyone for the help!

@bcooksley I think we can now remove vgem from the CI system.

Cool, I've now made that change to remove it.

Is any further action needed from the CI system side on this?

bcooksley closed this task as Resolved. Jul 5 2019, 9:59 PM
bcooksley claimed this task.

Closing due to lack of response.