Hi all,
the KWin tests which require OpenGL started to fail on build.kde.org. The problem started with https://build.kde.org/job/Plasma/job/kwin/job/kf5-qt5%20SUSEQt5.11/262/
Could it be that the vgem devices are not passed to the container?
I've investigated this, and it appears that shortly before this regression occurred we did a rebuild of our SUSE images.
So it would seem that this regression is due to something in the software stack (probably Mesa/X/Wayland).
The VGem devices are still being passed through and the permissions on them still permit them to be accessed by the tests.
Can we get some more debug output from Mesa / KWin's tests to figure out why OpenGL isn't initializing?
What we have from the tests is:
'''
QDEBUG : WobblyWindowsShadeTest::initTestCase() kwin_core: Compositing forced to OpenGL mode by environment variable
QDEBUG : WobblyWindowsShadeTest::initTestCase() org.kde.kcoreaddons: Checking for plugins in ("/home/jenkins/workspace/Plasma/kwin/kf5-qt5 SUSEQt5.11/build/bin/org.kde.kwin.scenes", "/home/jenkins/install-prefix/lib64/plugins/org.kde.kwin.scenes", "/usr/lib64/qt5/plugins/org.kde.kwin.scenes")
QDEBUG : WobblyWindowsShadeTest::initTestCase() kwin_scene_opengl: Initializing OpenGL compositing
QDEBUG : WobblyWindowsShadeTest::initTestCase() kwin_platform_virtual: Found a device: /dev/dri/card0
MESA-LOADER: failed to retrieve device information
gbm: failed to open any driver (search paths /usr/lib64/dri)
gbm: Last dlopen error: /usr/lib64/dri/vgem_dri.so: cannot open shared object file: No such file or directory
failed to load driver: vgem
QDEBUG : WobblyWindowsShadeTest::initTestCase() kwin_core: Instantiated compositing plugin: "SceneQPainter"
'''
AFAIK there are some Mesa environment variables which might provide more information. They are documented on Mesa's web page.
It would appear this is our cause:
MESA-LOADER: failed to retrieve device information
gbm: failed to open any driver (search paths /usr/lib64/dri)
gbm: Last dlopen error: /usr/lib64/dri/vgem_dri.so: cannot open shared object file: No such file or directory
failed to load driver: vgem
Checking on the current SUSE image I see:
d9e3e9339f89:/ # ls /usr/lib64/dri/
i915_dri.so i965_drv_video.so nouveau_drv_video.so r300_dri.so r600_drv_video.so radeonsi_dri.so swrast_dri.so vmwgfx_dri.so
i965_dri.so kms_swrast_dri.so r200_dri.so r600_dri.so radeon_dri.so radeonsi_drv_video.so virtio_gpu_dri.so
If I compare it with what we used to have I see:
d07a16340691:/ # ls /usr/lib64/dri/
i915_dri.so i965_dri.so kms_swrast_dri.so nouveau_drv_video.so r200_dri.so r300_dri.so r600_dri.so
r600_drv_video.so radeon_dri.so radeonsi_dri.so radeonsi_drv_video.so swrast_dri.so virtio_gpu_dri.so vmwgfx_dri.so
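As a side note, the two listings are easier to compare mechanically than by eye. The following is only a minimal sketch: the file names are copied from the two `ls` outputs above, and the variable names are purely illustrative.

```python
# File names copied from the two `ls /usr/lib64/dri/` outputs in this thread.
current_image = """
i915_dri.so i965_drv_video.so nouveau_drv_video.so r300_dri.so
r600_drv_video.so radeonsi_dri.so swrast_dri.so vmwgfx_dri.so
i965_dri.so kms_swrast_dri.so r200_dri.so r600_dri.so radeon_dri.so
radeonsi_drv_video.so virtio_gpu_dri.so
""".split()

previous_image = """
i915_dri.so i965_dri.so kms_swrast_dri.so nouveau_drv_video.so r200_dri.so
r300_dri.so r600_dri.so r600_drv_video.so radeon_dri.so radeonsi_dri.so
radeonsi_drv_video.so swrast_dri.so virtio_gpu_dri.so vmwgfx_dri.so
""".split()

# Set differences show what appeared and what disappeared between the images.
added = sorted(set(current_image) - set(previous_image))
removed = sorted(set(previous_image) - set(current_image))
has_vgem = "vgem_dri.so" in set(current_image) | set(previous_image)

print("only in current image: ", added)
print("only in previous image:", removed)
print("vgem_dri.so present in either image:", has_vgem)
```

Going by these listings, the only difference is one additional video driver in the current image, and neither image ships a vgem_dri.so. So the missing file on its own does not distinguish the working image from the failing one.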
I'm speculating here, but given the major change was a Mesa update, I'd say we've found a Mesa regression here.
Is that a possibility?
For comparison: on my system the debug output is
QDEBUG : SceneOpenGLTest::initTestCase() kwin_core: Compositing forced to OpenGL mode by environment variable
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: Initializing OpenGL compositing
QDEBUG : SceneOpenGLTest::initTestCase() kwin_platform_virtual: Found a device: /dev/dri/card1
MESA-LOADER: failed to retrieve device information
gbm: failed to open any driver (search paths /usr/lib/x86_64-linux-gnu/dri:${ORIGIN}/dri:/usr/lib/dri)
gbm: Last dlopen error: /usr/lib/dri/vgem_dri.so: cannot open shared object file: No such file or directory
failed to load driver: vgem
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: Egl Initialize succeeded
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: EGL version: 1 . 4
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: Created EGL context with attributes:
Version requested: false
Robust: false
Forward compatible: false
Core profile: false
Compatibility profile: false
High priority: false
OpenGL vendor string: VMware, Inc.
OpenGL renderer string: llvmpipe (LLVM 6.0, 256 bits)
OpenGL version string: 3.0 Mesa 18.0.5
OpenGL shading language version string: 1.30
Driver: LLVMpipe
GPU class: Unknown
OpenGL version: 3.0
GLSL version: 1.30
Mesa version: 18.0.5
Linux kernel version: 4.15
Requires strict binding: no
GLSL shaders: yes
Texture NPOT support: yes
Virtual Machine: no
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: OpenGL 2 compositing enforced by environment variable
QWARN : SceneOpenGLTest::initTestCase() libkwinglutils: Skipping self test as it is reported to return false positive results on Mesa drivers
QDEBUG : SceneOpenGLTest::initTestCase() kwin_scene_opengl: OpenGL 2 compositing successfully initialized
So we also have the failure for loading the driver but it works nevertheless.
With test coverage I found that we fail in: https://phabricator.kde.org/source/kwin/browse/master/platformsupport/scenes/opengl/abstract_egl_backend.cpp$99
Interestingly, we seem to be missing some debug output on build.kde.org. It looks like the KWIN_OPENGL category is not enabled.
There is one additional env variable which could be set: EGL_LOG_LEVEL=debug
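For anyone who wants to collect the extra output locally, a rough sketch of a wrapper that runs one test binary with more logging enabled could look like this. `EGL_LOG_LEVEL=debug` is the variable named above, and `LIBGL_DEBUG=verbose` is a Mesa loader variable. The `QT_LOGGING_RULES` pattern is an assumption based on the `kwin_*` category names visible in the logs, and the helper names are made up for illustration.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with extra EGL/Mesa/Qt logging enabled."""
    env = dict(os.environ if base is None else base)
    env["EGL_LOG_LEVEL"] = "debug"   # named in this thread
    env["LIBGL_DEBUG"] = "verbose"   # Mesa driver-loader diagnostics
    # Assumption: enabling every kwin_* category also covers the OpenGL one.
    env["QT_LOGGING_RULES"] = "kwin_*.debug=true"
    return env


def run_test(binary):
    """Run one test binary with the debug environment; return its exit code."""
    return subprocess.run([binary], env=debug_env()).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is whatever test binary you want to inspect):
    #   python3 this_script.py /path/to/test-binary
    sys.exit(run_test(sys.argv[1]))
```

The environment is copied rather than modified in place, so the extra logging only affects the test process that is started.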
Most likely the mesa package is misconfigured. Mesa switched to meson and we already had bugs due to distros incorrectly configuring mesa.
New additional output:
libEGL debug: EGL user error 0x3001 (EGL_NOT_INITIALIZED) in eglInitialize
libEGL debug: EGL user error 0x3001 (EGL_NOT_INITIALIZED) in eglInitialize
libEGL debug: EGL user error 0x3001 (EGL_NOT_INITIALIZED) in eglDestroyContext
According to the documentation: EGL_NOT_INITIALIZED is generated if display cannot be initialized.
But this all doesn't tell us why...
No, that should be impossible. We already have the egldisplay, and it is created for gbm.
When you try to reproduce locally, are you using the docker images we ship at Dockerhub, or a SUSE VM?
On another note - does KWin's test infrastructure start its own display services or anything along those lines by any chance? (I wonder if the issue is because the CI Tooling starts Xvfb and then KWin does some additional stuff, and DRM/GBM/Mesa freaks out as a result)
Neither, nor. I just use my normal system.
On another note - does KWin's test infrastructure start its own display services or anything along those lines by any chance? (I wonder if the issue is because the CI Tooling starts Xvfb and then KWin does some additional stuff, and DRM/GBM/Mesa freaks out as a result)
Yes, the test infrastructure starts a display server, but the failure is prior to starting it.
Okay. The CI system is currently using Mesa 18.3.1 - how does this compare to your local system?
My guess at this point is that we're dealing with a Mesa behaviour change / regression. Given that it "can't initialize the display" I would hazard a guess that it wants to see a monitor - which VGem of course is never going to have (hence causing the failure).
I can't see anything further that I can do at this point, aside from giving you the details needed to reproduce this in a local environment. This probably requires assistance from the Mesa developers, unless you have any other ideas as to why a display couldn't be initialized?
That would seem to support the theory that Mesa has a regression with regards to VGem support...
Guess that confirms it's a Mesa regression - will you and Martin handle sorting it with the Mesa devs?
I would love to, but unfortunately Tumbleweed has already withdrawn the older versions of Mesa from its repositories, so we're unable to do so :(
The only repositories containing older Mesa versions for Tumbleweed now are people's personal home:* repositories (per https://software.opensuse.org/package/Mesa)
So what should we do? We cannot keep the tests failing until Mesa gets that fixed - if it ever does. Can we get another CI base system which is not constantly rolling and introducing issues?
By the way, I cannot report it to Mesa myself: my Mesa does not expose the problem, so I could not provide any debug information they might ask for.
Pulling in @fvogt - our unit tests found a regression in Mesa shipped in Tumbleweed. Could you please trigger an investigation inside openSUSE into how it could happen that this was not discovered prior to shipping the update?
I'll ask our openSUSE packagers if something can be done. Given that the packaging for the older version is still around, it should hopefully be fairly straightforward for them to provide something.
As for why the CI system base is constantly rolling, this was put in place to handle KWin & Plasma's very bleeding edge requirements at the time around Mesa / Wayland / EGL / etc. In any event, a regression is a regression, so we would have been hit eventually.
Note that per the investigation by @zzag above, it looks like this is only triggered in scenarios where VGem is present, so it's likely most environments wouldn't see this.
I investigated a little bit what changed in the new Mesa version. It introduces an EGL_MESA_device_software extension which seems to support what we do with vgem without needing vgem. So it might be that we need to adjust our code to support this new extension in addition to vgem.
Unfortunately this didn't work - we now have: Failed to create surfaceless platform, trying with vgem device
I played a little bit with strace to see what is used.
If I use the EGL_MESA_platform_surfaceless the intel driver for my intel gpu is used. If I use vgem, the kms_swrast_dri driver is used.
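In case it helps others repeat this, here is a small sketch for pulling the loaded DRI driver out of an strace log. It assumes the usual `openat(...) = <fd>` line format that strace prints. The two sample lines are illustrative only and are not taken from a real trace, and the function name is hypothetical.

```python
import re

# Matches open()/openat() calls on a Mesa DRI driver and captures the
# path and the return value, e.g.
#   openat(AT_FDCWD, "/usr/lib64/dri/kms_swrast_dri.so", O_RDONLY|O_CLOEXEC) = 5
DRI_OPEN = re.compile(r'open(?:at)?\(.*"([^"]*_dri\.so)".*\)\s*=\s*(-?\d+)')


def loaded_dri_drivers(strace_lines):
    """Return the DRI driver paths that were opened successfully, in order."""
    loaded = []
    for line in strace_lines:
        match = DRI_OPEN.search(line)
        # A negative return value (e.g. -1 ENOENT) means the open failed.
        if match and int(match.group(2)) >= 0 and match.group(1) not in loaded:
            loaded.append(match.group(1))
    return loaded


# Illustrative sample lines (made up, not from a real trace).
sample = [
    'openat(AT_FDCWD, "/usr/lib64/dri/vgem_dri.so", O_RDONLY|O_CLOEXEC)'
    ' = -1 ENOENT (No such file or directory)',
    'openat(AT_FDCWD, "/usr/lib64/dri/kms_swrast_dri.so", O_RDONLY|O_CLOEXEC) = 5',
]
drivers = loaded_dri_drivers(sample)
print(drivers)
```

Failed attempts such as the vgem_dri.so lookup are skipped, so only the driver that actually ended up being used is reported.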
More results from playing with strace and reading Mesa code: @bcooksley please add LIBGL_ALWAYS_SOFTWARE=true to the environment variables for running the test. With that env variable set I got the test to use the kms_swrast device instead of the intel device.
@graesslin Should I remove the VGem device as well? It seems the tests still fail after setting that environment variable.
I don't think that removing vgem will help. The next code path is then using default platform and that will certainly fail. What I would like to see is strace of a failing test.
Some further hint: https://build.opensuse.org/public/build/X11:XOrg/openSUSE_Tumbleweed/x86_64/Mesa/_log shows --with-platforms=x11,drm,wayland while on https://sources.debian.org/src/mesa/18.2.8-2/debian/rules/ and https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa we see surfaceless additionally enabled. This explains why the change for surfaceless did not work. Looks like an openSUSE packaging issue to me.
The spec file: https://build.opensuse.org/package/view_file/X11:XOrg/Mesa/Mesa.spec?expand=1 - no surfaceless
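A quick way to check a build log for this, sketched under the assumption that the platform list shows up as a `--with-platforms=` flag like in the openSUSE log quoted above. The excerpt string around the flag is a placeholder and the function name is made up.

```python
import re


def configured_platforms(build_log_text):
    """Return the list of EGL platforms passed via --with-platforms=, if any."""
    match = re.search(r"--with-platforms=([\w,]+)", build_log_text)
    return match.group(1).split(",") if match else []


# Only the flag itself is taken from the log quoted above.
log_excerpt = "configure ... --with-platforms=x11,drm,wayland ..."
platforms = configured_platforms(log_excerpt)

print(platforms)
print("surfaceless enabled:", "surfaceless" in platforms)
```

For the openSUSE flag this reports that surfaceless is not in the list, which matches the "Failed to create surfaceless platform" message from the tests.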
Luca and Fabian, what's the best way of getting this rectified? Filing a bug against Mesa at SUSE?
In the meantime, would it be possible to get the older version or a tweaked version of Mesa so we can restore normal service?
We can't roll back as far as I know. Please file a bug at bugzilla.opensuse.org against Mesa, adding all the relevant information.
Martin and Vlad, could you please file the relevant bug? I'm not familiar enough with Mesa to provide all the information which they may need.
That's true, but if openSUSE ran our unit tests, it would have been found.
That's being worked on, but it's not in Tumbleweed yet. Sorry for the inconvenience!
In the meantime, would it be possible to get the older version or a tweaked version of Mesa so we can restore normal service?
I did a quick test build and submitted the change to X11:XOrg - if you created a bug, please tell me the number so that I can link it.
KDE:Qt:5.11 (do you need other projects as well, such as 5.9, 5.10, 5.12?) has Mesa binaries with surfaceless enabled now (version 18.3.1-914.1 and up),
so it should work after rebuilding the CI image. Not tested though.
If it for some reason does not work as expected, we could rebuild an older revision of Mesa in the project as well.
Edit: Binaries from older Tumbleweed snapshots are available at http://download.opensuse.org/history/. So if you know which file needs to be replaced, you could install the RPM from there.
Many thanks for enabling surfaceless in those packages Fabian - very much appreciated.
I can confirm that has fixed the issue with KWin tests - https://build.kde.org/job/Plasma/job/kwin/job/kf5-qt5%20SUSEQt5.11/
It's probably not a bad idea to add them to Qt 5.12 for when Plasma bumps to that version. (I probably should set up a 5.12 image)
Frameworks will bump to 5.10 soon with the next release due to go out - so Qt 5.9 support will be discontinued then.