Add linux-drm-syncobj-v1 protocol #411

Merged: 1 commit, Jan 15, 2025

ids1024
Member

@ids1024 ids1024 commented Apr 9, 2024

Based on Smithay/smithay#1356.

I think the blocker logic should be correct for handling acquire points (if I properly understand the transaction system in Smithay). Though I don't see rendering issues with Mesa git when the blocker is removed... maybe it needs to be tested with something heavier than vkcube. (Or is there something still forcing implicit sync?).

The logic I added in Smithay for signaling releases may be a little less correct. Though maybe not more incorrect than how buffer releases are currently handled? (If I understand DRM correctly, with direct scanout we should make sure not to release until we've committed a new buffer and are sure the display controller won't want to read the buffer.)

We'll be able to test more when the next Nvidia driver is released. This at least gives us a way to test the explicit sync support they're adding.

Presumably we should test if drmSyncobjEventfd is supported... maybe just creating a syncobj and calling that to see if it works? I'm also still a little unsure how this ends up working with multiple GPUs... particularly if one is Nvidia.

@Drakulix
Member

I think the blocker logic should be correct for handling acquire points (if I properly understand the transaction system in Smithay). Though I don't see rendering issues with Mesa git when the blocker is removed... maybe it needs to be tested with something heavier than vkcube. (Or is there something still forcing implicit sync?).

For that you probably have to remove the old dmabuf.generate_blocker logic, which pulls a fence out of the dmabuf to do essentially the same thing. We basically should check, if the client uses explicit-sync and then use the acquire fence and otherwise fallback to polling the dmabuf directly.

The logic I added in Smithay for signaling releases may be a little less correct. Though maybe not more incorrect than how buffer releases are currently handled? (If I understand DRM correctly, with direct scanout we should make sure not to release until we've committed a new buffer and are sure the display controller won't want to read the buffer.)

Yeah, I am pretty sure smithay's code isn't correct in what it does today, but given direct-scanout mostly works, that is probably using implicit-sync in the background.

What needs to happen is storing the fence generated by compositing or in case of direct scanout, we need to get the OUT_FENCE_PTR property of the respective plane. Then we can wait for that fence, once the buffer is replaced and won't be used for new rendering/scanout operations and signal release once that fence is done. (Maybe there is even kernel api to "signal once this other fence is signalled"?)

Presumably we should test if drmSyncobjEventfd is supported... maybe just creating a syncobj and calling that to see if it works? I'm also still a little unsure how this ends up working with multiple GPUs... particularly if one is Nvidia.

Yeah, imo this needs a compile test similar to what we do for gbm in smithay: https://github.com/Smithay/smithay/blob/master/build.rs#L99-L125

So if the local kernel/drm headers of the system support it, we enable the feature and assume the kernel does as well. I don't know how well runtime detection would work, but we have to make sure to not advertise the global, if this function isn't supported.

@ids1024
Member Author

ids1024 commented Apr 10, 2024

For that you probably have to remove the old dmabuf.generate_blocker logic, which pulls a fence out of the dmabuf to do essentially the same thing. We basically should check, if the client uses explicit-sync and then use the acquire fence and otherwise fallback to polling the dmabuf directly.

This should already be doing that. If there's an acquire point, it adds a DrmSyncPointBlocker and skips adding the DmabufBlocker.
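
As a rough illustration of that selection rule, here is a minimal Rust sketch. The `Blocker` enum, the `Commit` struct and `choose_blocker` are hypothetical stand-ins invented for this example; they only mirror the roles of DrmSyncPointBlocker and DmabufBlocker and are not the Smithay or cosmic-comp types.

```rust
/// Hypothetical stand-ins for the two blocker kinds discussed above.
#[derive(Debug, PartialEq)]
enum Blocker {
    /// Wait on an explicit acquire point: (timeline id, point value).
    SyncPoint { timeline: u32, point: u64 },
    /// Fall back to polling the dmabuf's implicit fence.
    Dmabuf,
}

/// Simplified view of what a surface commit carries (assumption).
struct Commit {
    /// Set when the client used explicit sync for this commit.
    acquire_point: Option<(u32, u64)>,
    /// True when the attached buffer is a dmabuf.
    has_dmabuf: bool,
}

/// Prefer the explicit acquire point; only poll the dmabuf when there is none.
fn choose_blocker(commit: &Commit) -> Option<Blocker> {
    match commit.acquire_point {
        Some((timeline, point)) => Some(Blocker::SyncPoint { timeline, point }),
        None if commit.has_dmabuf => Some(Blocker::Dmabuf),
        None => None,
    }
}
```

The point of the sketch is only that the two blockers are mutually exclusive per commit: an explicit acquire point replaces the implicit dmabuf wait rather than adding to it.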

Yeah, I am pretty sure smithay's code isn't correct in what it does today, but given direct-scanout mostly works, that is probably using implicit-sync in the background.

I'm not sure if implicit sync does something to help with releases (blocking client writes to the buffer until the display controller is done with the buffer)... but yeah, it does seem to mostly work. If implicit sync isn't involved, this won't be more problematic with explicit sync and the same limitation.

(Maybe there is even kernel api to "signal once this other fence is signalled"?)

It should be possible to do something with drmSyncobjTransfer and such. Probably the other implementations of the protocol do something like that, so we can look at those.

We just have to make sure to only do that once the buffer is no longer used elsewhere in the compositor.

@Drakulix
Member

I'm not sure if implicit sync does something to help with releases (blocking client writes to the buffer until the display controller is done with the buffer)... but yeah, it does seem to mostly work. If implicit sync isn't involved, this won't be more problematic with explicit sync and the same limitation.

It definitely does, but the nvidia-driver at least isn't doing that correctly when you send dmabufs directly to KMS without going through egl-gbm and allocating an EGLSurface. Which is why we have the needs_sync workaround in smithay. But this should work for mesa.

(Maybe there is even kernel api to "signal once this other fence is signalled"?)

It should be possible to do something with drmSyncobjTransfer and such. Probably the other implementations of the protocol do something like that, so we can look at those.

👍

We just have to make sure to only do that once the buffer is no longer used elsewhere in the compositor.

I am pretty sure that is what our current release logic does, with the exception of handling direct-scanout. Which would be handled by the fence anyway in this case, so that should be correct. We just need to adjust the DrmCompositor to extract the out-fence and expose that in the RenderResult.

@ids1024
Member Author

ids1024 commented Apr 11, 2024

I am pretty sure that is what our current release logic does, with the exception of handling direct-scanout. Which would be handled by the fence anyway in this case, so that should be correct. We just need to adjust the DrmCompositor to extract the out-fence and expose that in the RenderResult.

I mean if we want to have OUT_FENCE_PTR directly signal the release point:

  • We can't do that if we are using the same buffer for direct scanout or rendering elsewhere (on a different monitor; screencopy)
  • We can't do that if we might use the buffer again. Which could even be the case if another buffer has been committed, since the new buffer may still be blocked when we want to render the next frame.

So I'm not sure when we could actually do that? Maybe with commit-queue-v1, where we might know the next buffer is ready, but aren't using it until the next frame.

So at least for now I think we need to stick to signaling the release point from CPU? But should still track when OUT_FENCE_PTR has signaled scanout is done with the buffer.

@Drakulix
Member

I mean if we want to have OUT_FENCE_PTR directly signal the release point:

* We can't do that if we are using the same buffer for direct scanout or rendering elsewhere (on a different monitor; screencopy)

Right, so this rather needs to be a list of fences to wait for. Meaning we probably have to wait and signal ourselves instead of relying on drmSyncobjTransfer.

* We can't do that if we might use the buffer again. Which could even be the case if another buffer has been committed, since the new buffer may still be blocked when we want to render the next frame.

But merge of the state should only happen, once all blockers are resolved. And only then we release, so I believe that issue is already handled correctly. Nothing would be able to use that buffer for rendering any more at that point.
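
A toy model of that ordering, sketched in Rust under simplifying assumptions: buffers are plain ids, blockers are a counter, and `SurfaceState` with its methods is a hypothetical name for this example, not Smithay's transaction API. It only shows that a commit becomes current once its blockers are resolved, and that the previous buffer becomes a release candidate at that moment.

```rust
use std::collections::VecDeque;

/// A commit waiting on some number of unresolved blockers (hypothetical model).
struct PendingCommit {
    buffer: u32,
    blockers: usize,
}

#[derive(Default)]
struct SurfaceState {
    /// Buffer that rendering and scanout currently use.
    current: Option<u32>,
    /// Commits in arrival order that are not applied yet.
    queue: VecDeque<PendingCommit>,
}

impl SurfaceState {
    /// Queue a commit; returns buffers that were replaced as a result.
    fn commit(&mut self, buffer: u32, blockers: usize) -> Vec<u32> {
        self.queue.push_back(PendingCommit { buffer, blockers });
        self.apply_ready()
    }

    /// One blocker of the given commit resolved (e.g. its acquire point signaled).
    fn resolve_blocker(&mut self, buffer: u32) -> Vec<u32> {
        if let Some(c) = self.queue.iter_mut().find(|c| c.buffer == buffer) {
            c.blockers = c.blockers.saturating_sub(1);
        }
        self.apply_ready()
    }

    /// Apply unblocked commits in order; a still-blocked commit holds back later ones.
    fn apply_ready(&mut self) -> Vec<u32> {
        let mut replaced = Vec::new();
        while self.queue.front().map_or(false, |c| c.blockers == 0) {
            let commit = self.queue.pop_front().unwrap();
            if let Some(old) = self.current.replace(commit.buffer) {
                replaced.push(old);
            }
        }
        replaced
    }
}
```

In this model the old buffer is only reported as replaced after the new commit has actually been merged, so nothing can pick it up for new rendering afterwards; whether it can be released immediately still depends on uses that are already in flight.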

But we still need to track the buffer to be able to signal later, so we might as well unify the approach and handle the release-event the same. I feel like this could benefit from some infrastructure and refactoring in smithay.

So I'm not sure when we could actually do that? Maybe with commit-queue-v1, where we might know the next buffer is ready, but aren't using it until the next frame.

I think we can implement both fifo and commit-queue with blockers as well.

So at least for now I think we need to stick to signaling the release point from CPU? But should still track when OUT_FENCE_PTR has signaled scanout is done with the buffer.

Yeah, I am coming to the same conclusion, but that isn't too bad, as that is just another fd in the loop and a very small action.
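
The "list of fences, signal from the CPU" approach could be sketched roughly as follows. This is an illustration only: `ReleaseTracker` and its methods are hypothetical, and fence ids stand in for real fence file descriptors (render fences or plane out-fences) that an event loop would watch.

```rust
use std::collections::HashSet;

/// Hypothetical per-buffer bookkeeping for CPU-side release signaling.
#[derive(Default)]
struct ReleaseTracker {
    /// Fences of uses that are still in flight (compositing, scanout, screencopy).
    pending_fences: HashSet<u64>,
    /// True once a newer buffer has replaced this one on the surface.
    replaced: bool,
    /// True once the release point has been signaled.
    released: bool,
}

impl ReleaseTracker {
    /// Record one more use of the buffer, identified by its completion fence.
    fn add_use(&mut self, fence: u64) {
        self.pending_fences.insert(fence);
    }

    /// The buffer was replaced. Returns true if the release point should be signaled now.
    fn mark_replaced(&mut self) -> bool {
        self.replaced = true;
        self.try_release()
    }

    /// A recorded fence signaled. Returns true if the release point should be signaled now.
    fn fence_signaled(&mut self, fence: u64) -> bool {
        self.pending_fences.remove(&fence);
        self.try_release()
    }

    /// Release exactly once: only after replacement and after every use has finished.
    fn try_release(&mut self) -> bool {
        if self.replaced && !self.released && self.pending_fences.is_empty() {
            self.released = true;
            true
        } else {
            false
        }
    }
}
```

A caller would signal the client's release point on the timeline whenever one of these methods returns true. Keeping a set rather than a single fence is what covers the case of the same buffer being used on several outputs or for screencopy.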

@@ -450,6 +451,9 @@ pub fn init_backend(
// Create relative pointer global
RelativePointerManagerState::new::<State>(&dh);

// TODO check if main device supports syncobj eventfd?
Member

I guess worst case we can still fail the import_timeline-request, right?
Is Xwayland/mesa able to handle this / fallback to implicit sync?

Member Author

I don't think there's any way to fallback once we expose the global. The protocol doesn't make import_timeline failable, except as a protocol error.

It looks like Mutter uses drmSyncobjEventfd(drm_fd, 0, 0, -1, 0) != -1 || errno != ENOENT to check if the call is supported. So we probably want to do the same, and only create the global if it is supported.
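
One way to read that probe, sketched in Rust with the libdrm call abstracted away: handle 0 is never a valid syncobj, so a kernel that implements the ioctl is expected to reject the call with ENOENT, while success or any other error suggests the call is not usable. The function name and the `(ret, errno)` inputs are hypothetical; a real check would call drmSyncobjEventfd on the DRM fd and read errno. Under this reading, the quoted C expression evaluating to true would mean "not supported", which is an assumption about how Mutter uses it.

```rust
/// ENOENT as defined on Linux.
const ENOENT: i32 = 2;

/// Interpret the result of probing with an invalid syncobj handle (0) and fd -1.
/// `ret` is the call's return value, `errno` the error code observed after it.
/// Hypothetical helper: supported only if the call failed specifically with ENOENT.
fn probe_indicates_support(ret: i32, errno: i32) -> bool {
    ret == -1 && errno == ENOENT
}

/// Only advertise the protocol global when the probe succeeded (sketch).
fn should_create_global(ret: i32, errno: i32) -> bool {
    probe_indicates_support(ret, errno)
}
```

Gating the global on this result matters because, as noted above, a client has no fallback once the global is exposed.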

@ryzendew

ryzendew commented Aug 7, 2024

Can we get this rebased on the newest master please

@gabriele2000

Is this going to be implemented soon?
It's the only thing that forces me to use vsync in every game, something that I'd rather turn off since Wayland doesn't need it.

@ids1024
Member Author

ids1024 commented Sep 19, 2024

Is this going to be implemented soon?

I've just marked Smithay/smithay#1356 as ready for review, so hopefully if no issues come up this should be merged soon.

I'm not sure if explicit sync will help with whatever issues you're seeing, but if you'd like to test this PR and report how it impacts behavior (for a certain game / graphics card / driver), that could be helpful.

@gabriele2000

gabriele2000 commented Sep 19, 2024

I'm not sure if explicit sync will help with whatever issues you're seeing, but if you'd like to test this PR and report how it impacts behavior (for a certain game / graphics card / driver), that could be helpful.

I think it will.
The issue that I'm seeing is that with vsync disabled anything less than 60FPS feels like 15FPS, maybe 20FPS, it's definitely the sync issue that I began to encounter half a year ago.

I will report back.

@ids1024 There's definitely something weird with my setup, the issue is still here, no weird tearing now, but FPS are halved if they're less than 60 (they're even more than halved).

  • iGPU Intel 630
  • dGPU Nvidia GTX 1050TI

I'm not telling anything to the game, it just uses the nvidia gpu, I do not force any GPU.
Every game is affected by this, turning on Vsync will fix it immediately.

Mesa is at version Mesa 24.0.3-1pop1~1711635559~22.04~7a9f319
Nvidia driver is at version 560.35.03-1pop0~1726601312~22.04~92f4f94

EDIT: for more details refer to this pop-os/cosmic-epoch#184 (comment)

@ids1024
Member Author

ids1024 commented Sep 19, 2024

FPS being lower than expected (often around 30 on a 60 fps monitor) with vsync is something I've noticed with Intel-rendered windows on a 1650 mobile (later NVIDIA cards don't seem to have this issue). In my testing Gnome Wayland seemed similar, so I assume it's something in the driver.

#211 has some previous testing I've done with that.

@gabriele2000

gabriele2000 commented Sep 19, 2024

FPS being lower than expected (often around 30 on a 60 fps monitor) with vsync is something I've noticed with Intel-rendered windows on a 1650 mobile (later NVIDIA cards don't seem to have this issue). In my testing Gnome Wayland seemed similar, so I assume it's something in the driver.

#211 has some previous testing I've done with that.

In my case the inverse is true.
Internal framerate is good with vsync disabled and enabled, but with vsync disabled what I see isn't what the computer sees.

Personally I see extreme lag, the game renders fine though because if you record a gameplay and watch it, you'll see it buttery-smooth

UPDATE: Basically the chronological order of the whole thing is:

  • Frames would jump back with vsync disabled
  • One day the issue was fixed thanks to PRIME offloading
  • cosmic-comp started to show this "high internal FPS but what you see is a stuttery mess" with vsync off (still no noticeable diagonal tearing)
  • I tried this fix that I'd been waiting months for, and it didn't fix anything
  • Disappointment

@Tipcat-98

Tried this briefly with NVIDIA 560.35.03.
Firefox would frequently crash with:
"MozCrashReason":"Error flushing display: Broken pipe"

Other things seemed to work fine though.

Using Pop!_os 22.04 with popdev:master branch.

@gabriele2000

Tried this briefly with NVIDIA 560.35.03. Firefox would frequently crash with: "MozCrashReason":"Error flushing display: Broken pipe"

Other things seemed to work fine though.

Using Pop!_os 22.04 with popdev:master branch.

Can you read my comment and please tell me if you don't have the issue?
Are you on a hybrid setup?

@Tipcat-98

I'm using a desktop, with my intel integrated gpu disabled in BIOS.
I couldn't really see any frame rate oddities with or without v-sync.

@ids1024
Member Author

ids1024 commented Sep 20, 2024

Hm, are there still issues with Firefox? nvidia/egl-wayland 1.1.15 was supposed to fix some issues like that. Not sure if Firefox also needed fixes.

@Tipcat-98

Appears so.
Can confirm that I'm on egl-wayland 1.1.16

I could send the firefox crash log if you think it could help.

@ptr1337

ptr1337 commented Sep 26, 2024

I think everything should be good now. So this can be merged if no more issues are occurring.

I've tested this MR on my 4070 Super with 560 drivers. I was not able to open discord, cosmic settings, or similar apps.
I will check tomorrow to gather some logs.

@ids1024
Member Author

ids1024 commented Sep 27, 2024

Hm. cosmic-settings should be using Vulkan (via wgpu), and Nvidia's Vulkan doesn't even use explicit sync yet in the 560 driver (while their EGL implementation does). Unless it's a multi-gpu system and is running on the integrated GPU.

Hopefully the logs provide more context.

@skygrango
Contributor

skygrango commented Sep 27, 2024

Nvidia's Vulkan doesn't even use explicit sync yet in the 560 driver (while their EGL implementation does).

sorry, my env has WGPU_BACKEND=vulkan, so my test should not be considered a valid reference


I'm trying this.

spec : gtx 1080
driver : 560.35.03
cosmic version: alpha 2

quick check the following apps

work list:
chromium
firefox
evolution
cosmic-term
cosmic-files
cosmic-edit
filezilla
kate
vlc

not work list:
cosmic-settings on vulkan.
discord
vs-code
steam
gimp

It can be started, but the menu cannot be displayed:
konsole
gnome-terminal

cosmic-settings vulkan log : log
cosmic-settings opengl log : log
discord log : log

maybe I also need to test latest cosmic-comp

Hopefully the logs provide more context.

what I can do for you

cosmic-comp keep printing in journal ...

cosmic-comp[1718]: [GL] Buffer detailed info: Buffer object 3 (bound to GL_VERTEX_ATTRIB_ARRAY_BUFFER_BINDING_ARB (0), usage hint is GL_STATIC_DRAW) will use VIDEO memory as the source for buffer object operations.

If my gpu is not supported, please let me know

@skygrango
Contributor

skygrango commented Sep 27, 2024

I tested this PR on my rx7900xtx

driver : mesa 24.3.0_devel.194431.6f3c003433f-1
monitor: 4k 240hz , render x11 application at native resolution

I tried to play DBD and encountered the following situation, I don't see this regression in cosmic-comp 1.0.0.alpha.2

And it only happens in full screen mode, windowed mode has no problem

IMG_7135

@ids1024
Member Author

ids1024 commented Sep 27, 2024

If my gpu is not supported, please let me know

We expect COSMIC to work on any GPU supported by the current NVIDIA drivers. (Though if there's a bug in their drivers on certain cards, we may not be able to do anything about it.)

And it only happens in full screen mode, windowed mode has no problem

Hm, probably something involving direct scanout. We should be waiting properly for the explicit sync point before the buffer is used...

@skygrango
Contributor

We expect COSMIC to work on any GPU supported by the current NVIDIA drivers. (Though if there's a bug in their drivers on certain cards, we may not be able to do anything about it.)

got it :)

Hm, probably something involving direct scanout. We should be waiting properly for the explicit sync point before the buffer is used...

very much like what you surmised, this damage is not visible in the capture screenshots.

@Tipcat-98

Tipcat-98 commented Oct 3, 2024

Since the latest update, cosmic settings seems to be working on nvidia.

@ids1024
Member Author

ids1024 commented Oct 3, 2024

Yep, Smithay/smithay#1554 should fix that issue on Nvidia, at least.

Not sure about the issue on the rx7900xtx, which does look like a real synchronization issue. I assume that's using vkd3d on wine on xwayland for rendering? The only RDNA hardware I have is a Steam Deck, and I'm not seeing that issue running a couple games on it under cosmic-comp.

@skygrango
Contributor

Yep, Smithay/smithay#1554 should fix that issue on Nvidia, at least.

Not sure about the issue on the rx7900xtx, which does look like a real synchronization issue. I assume that's using vkd3d on wine on xwayland for rendering? The only RDNA hardware I have is a Steam Deck, and I'm not seeing that issue running a couple games on it under cosmic-comp.

yep, I run the game with dx12, and it supports dx11 too.

You are welcome to ping me at any time with your list of testing needs and I will be available to assist with testing after get off work today :)

my work computer is equipped with gtx 1080, I will test it again when I have time

@skygrango
Contributor

skygrango commented Oct 4, 2024

GTX1080 (Proprietary driver)

after updating the branch and turning off WGPU_BACKEND=vulkan
all my commonly used apps can now be used.

but there are still problems with the menus of konsole and gnome terminal

konsole : menu can be opened, but the submenu cannot be displayed
gnome terminal : menu cannot be opened

There is also a problem with cosmic editor I am not sure if it is related to cosmic-comp

the submenu opens to the left and cannot be displayed correctly (this PR does not affect this?)
screenshot-2024-10-04-02-09-50

update:

I went back to KDE Wayland, and all the problems disappeared

gnome terminal
螢幕截圖_20241004_102105
konsole
螢幕截圖_20241004_102028
cosmic editor
螢幕截圖_20241004_102411

@gabriele2000

Apparently my weird issue translates to VR too since I see "backwards" frames or something.
Yes, I can play in VR using cosmic, amazing!

@skygrango
Contributor

I'm still having the same sync issues playing dx12 and dx11 games on the latest update on my 7900xtx, and some temporary freezes during daily use.

I might have to try to collect more logs, but I'm not sure what information in the journal would be useful.

@StayBlue

I'm using this with the latest Nvidia driver (560.35.03), which has worked great for me. If any other NixOS users would like to try this, you may try applying the overlay I made for it, which is available here.

@skygrango
Contributor

skygrango commented Dec 7, 2024

as a 7900xtx user : I upgraded to cosmic-epoch alpha 4 and I rebase this branch on master, everything is working fine now regardless of whether vrr is enabled or not!

edit : What I tested was that the strange tear line no longer appeared.
edit 2 : I upload my wayland-info : https://gist.github.com/skygrango/aac43ac2e789c9a708fee4410244d4ba , and it show interface: 'wp_linux_drm_syncobj_manager_v1', version: 1, name: 51

@Tipcat-98

Does this require any testing still or is there anything else an end-user can do to help?

@ids1024
Member Author

ids1024 commented Dec 19, 2024

I need to look into this more. But there are a couple reports here of visual issues due to synchronization issues (so a bug in cosmic-comp/smithay, or the driver), which I haven't so far reproduced on hardware I have. It's hard to debug without having a way to reproduce the issue, though I could try a few more things to see if I can.

More testing on different hardware and software and specific details about what does and does not work is always welcome.

@ptr1337

ptr1337 commented Jan 11, 2025

Hey,

would you mind to rebase the patchset on latest master?

@skygrango
Contributor

linux : 6.12.9
spec : gtx 1080
driver : proprietary 565.77
cosmic version: alpha 5 with drm-syncobj

quick check the following apps again

work list:
chromium
firefox
evolution
cosmic-term
cosmic-files
cosmic-edit
cosmic-settings
filezilla
kate
vlc
mpv
vs-code
steam
gimp
discord (On startup, the animation appears black and the auto-tiling seems to have some incorrect borders. but it may be a problem with nvidia and should have nothing to do with drm-syncobj patch.

I still record the results here for reference
border issue :
screenshot-2025-01-14-01-57-29

it should be :
screenshot-2025-01-14-02-00-46)

not work list: empty for me !

Thanks a lot !!!

@ids1024
Member Author

ids1024 commented Jan 14, 2025

So the only known issue with this, other than the Nvidia issue that was fixed a while ago, seems to be #411 (comment) on the RX7900 XTX. That was a while ago, but does look like a synchronization issue, and I can't recall that we've merged anything that would fix it since (unless it was a bug in Mesa/XWayland/DXVK or VKD3D/etc and fixed there).

@skygrango How much have you tested this on the RX7900 XTX lately? It seems to be working now?

@skygrango
Contributor

So the only known issue with this, other than the Nvidia issue that was fixed a while ago, seems to be #411 (comment) on the RX7900 XTX. That was a while ago, but does look like a synchronization issue, and I can't recall that we've merged anything that would fix it since (unless it was a bug in Mesa/XWayland/DXVK or VKD3D/etc and fixed there).

@skygrango How much have you tested this on the RX7900 XTX lately? It seems to be working now?

I'm still testing the GTX1080 today and I noticed that when VRR is enabled and watch full screen video, there is a constant screen flashing white randomly.

As far as I can remember from last time, everything was working fine with the 7900xtx, I will update branch to test AMD 7900XTX again.

@skygrango
Contributor

skygrango commented Jan 15, 2025

on GTX 1080, There is an interesting problem.

the menus of kate and gnome-terminal cannot be displayed, but if I click on another window or desktop first, and then go back and click on the menu, it will be displayed successfully, and only the first clicked menu can be displayed normally. as long as the sub-menu is triggered, the main menu will disappear directly, just like I no longer focus on the window.

cosmic-comp shows a warning about it:

cosmic-comp[1337]: Client bug: Unable to re-configure repositioned popup.

kate[138639]: qt.qpa.wayland: Creating a popup with a parent, QWidgetWindow(0x57c81b95a140, name="MainWindow#1Window") which does not match the current topmost grabbing popup, QWidgetWindow(0x57c81bec0ef0, name="LSPClient MenubarWindow") With some shell surface protocols, this is not allowed. The wayland QPA plugin is currently handling it by setting the parent to the topmost grabbing popup. Note, however, that this may cause positioning errors and popups closing unxpectedly. Please fix the transient parent of the popup.

cosmic-comp[1337]: Client bug: Unable to re-configure repositioned popup.

Gimp's menu can be used normally, but what's interesting is that if you click on other windows while the menu is open, the menu will not close, but the main window will be hidden under other windows normally.

I'm not sure if this is specifically Nvidia's or drm-syncobj's responsibility, I'm just documenting it

@ids1024
Member Author

ids1024 commented Jan 15, 2025

The errors about popup positioning are unrelated to the rendering code, so they won't be dependent on driver or on drm-syncobj. So that's probably unrelated. Though that's interesting, I don't think I've seen issues like that with menus, on native Wayland clients.

#1122 may be relevant, and has some improvements to popup positioning.

@Drakulix
Member

Merging for more widespread testing after 5.1 and before alpha 6. This seems to be good on a bunch of our test machines. Thanks everybody else as well for providing test results.

@Drakulix Drakulix merged commit 9dddead into master Jan 15, 2025
7 of 9 checks passed
@Drakulix Drakulix deleted the drm-syncobj branch January 15, 2025 19:21
8 participants