
Sometimes misses drive detection with lots of drives #1133

Closed
martinpitt opened this issue Jun 29, 2023 · 18 comments · Fixed by #1221

Comments

@martinpitt
Contributor

martinpitt commented Jun 29, 2023

In cockpit we have a storage scaling integration test. It creates 200 SCSI devices/drives with modprobe scsi_debug num_tgts=200, and then checks that they all propagate correctly through the kernel → udev → udisks → dbus → cockpit chain.

This test now fails very often on Arch Linux: it often shows fewer than the expected 200 SCSI drives (plus the virtio root disk and the cloud-init ISO, 202 in total), e.g. only 188 or 193.

This seems to be a udisks bug. In the "good" case, udisks should pick up all 200 devices and drives:

# udisksctl dump | grep devices/sd|wc -l
200
# udisksctl dump | grep ^/org.*drives/Linux_scsi_debug_|wc -l
200

But when the test fails, it has fewer drives:

# udisksctl dump | grep devices/sd|wc -l
200
# udisksctl dump | grep ^/org.*drives/Linux_scsi_debug_|wc -l
179

The udisksctl dump diff between the good and the broken state shows that the device isn't properly connected to the drive, and the corresponding drive object is missing:

--- /tmp/good	2023-06-29 12:58:11.013662135 +0200
+++ /tmp/broken	2023-06-29 12:58:05.021631876 +0200
@@ -4707,7 +4707,7 @@
     CryptoBackingDevice:        '/'
     Device:                     /dev/sdfu
     DeviceNumber:               33536
-    Drive:                      '/org/freedesktop/UDisks2/drives/Linux_scsi_debug_18500'
+    Drive:                      '/'
     HintAuto:                   false
     HintIconName:
     HintIgnore:                 false

However, it is good in udev. The udevadm info --export-db diff between the
good and broken situation looks identical aside from time stamps and minor reordering:

@@ -11034,8 +11034,8 @@
 D: b 131:0
 N: sdfu
 L: 0
-S: disk/by-diskseq/188
 S: disk/by-id/scsi-3333333300002d2a8
+S: disk/by-diskseq/188
 S: disk/by-id/wwn-0x333333300002d2a8
 Q: 188
 E: DEVPATH=/devices/pseudo_0/adapter0/host3/target3:0:177/3:0:177:0/block/sdfu
@@ -11061,7 +11061,7 @@
 E: ID_TARGET_PORT=400
 E: ID_SCSI_SERIAL=18500
 E: ID_BUS=scsi
-E: DEVLINKS=/dev/disk/by-diskseq/188 /dev/disk/by-id/scsi-3333333300002d2a8 /dev/disk/by-id/wwn-0x333333300002d2a8
+E: DEVLINKS=/dev/disk/by-id/scsi-3333333300002d2a8 /dev/disk/by-diskseq/188 /dev/disk/by-id/wwn-0x333333300002d2a8
 E: TAGS=:systemd:
 E: CURRENT_TAGS=:systemd:

So far we only see this on Arch Linux, not (yet?) on Fedora/Debian/RHEL.

# pacman -Q udisks2 systemd linux
udisks2 2.9.4-4
systemd 253.5-1
linux 6.3.7.arch1-1

Arch's udisks package has no patches. Fedora 38 has kernel 6.3.9 and systemd 253.3 (same version), so Arch is not really ahead here. So I'm not yet sure what the significant difference is; it might just be sheer timing luck.

@martinpitt
Contributor Author

martinpitt commented Jun 29, 2023

I am trying to reproduce this outside of cockpit with a repr.sh script (it has to end in .txt so that I can attach it here -- please rename). But so far no luck -- that does fail after a few iterations, but on a different issue. So this needs something more, perhaps because Cockpit is connected to udisks and has signal listeners, etc.? I'll try to refine this and report back if I have a working shell reproducer.

Update: I ran busctl monitor org.freedesktop.UDisks2 > bus.log in parallel, and that didn't help. Nor with running udisksctl monitor > udisks.log (or logging to stderr), repr.sh keeps working fine. This is tricky.

@tbzatek
Member

tbzatek commented Jun 29, 2023

One potential issue at first sight - it takes a while until udisksd processes all uevents. There's a side probing thread that may do additional I/O (e.g. for CD-ROM drives). Even after udevadm settle and similar, it is not guaranteed that everything is up-to-date on the UDisks side. See the similar issue #711.

What happens if you put sleep(60) or something in there?
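
For instance, rather than a blind sleep, a rough polling sketch along these lines (hypothetical, not part of the actual test) would show whether udisks is merely lagging behind or never catches up:

# hypothetical wait loop: poll until udisks reports all 200 scsi_debug drives,
# giving up after 60 seconds
for i in $(seq 60); do
    n=$(udisksctl dump | grep -c '^/org.*drives/Linux_scsi_debug_')
    [ "$n" -ge 200 ] && break
    sleep 1
done
echo "udisks sees $n scsi_debug drives"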

@martinpitt
Contributor Author

I forgot to mention -- it's not a race condition. I've kept the VM up in that state for an hour, and it didn't catch up. I can reload the cockpit page or do udisksctl dump all I want, it's a stable state. Only restarting udisks helped.

@tbzatek
Member

tbzatek commented Jun 29, 2023

Thanks, I don't like race conditions :-)

Any chance to get ssh access to that machine? I'd be interested in:

  • lsscsi -wiu
  • ls -al /dev/disk/by-id/

We've seen some scsi_debug weirdness with missing identifiers before but so far I haven't managed to find the root cause: https://bugzilla.redhat.com/show_bug.cgi?id=2139318

And perhaps the obvious:

  • udisksctl dump
  • udevadm info -e

@martinpitt
Contributor Author

udisks and udevadm dumps are already linked from the description (the "good" and "broken" links).

I reproduced the failure again, and grabbed all four things. Unfortunately uploading to this github issue is broken right now, so I put them onto my server: https://piware.de/tmp/udisks-info/

I'm happy to give you ssh access to that machine. It's a VM running on my laptop, so it requires some port forwarding, but that can be arranged. Please send me your public SSH key, and let's coordinate on Slack or so? (My laptop isn't always on, and I'll be working on a train tomorrow afternoon.) Alternatively, I can talk you through running the test on your machine, in a container (it's quite a lot of stuff to download, but no danger to your OS or home dir).

@martinpitt
Contributor Author

I tried to run the daemon with /usr/lib/udisks2/udisksd --debug, but that does not show any extra messages at all: even when running the above repr.sh, the whole output stays at

udisks-Message: 05:50:41.039: udisks daemon version 2.9.4 starting
udisks-Message: 05:50:41.060: Acquired the name org.freedesktop.UDisks2 on the system message bus

With G_MESSAGES_DEBUG=all I get tons of

(udisksd:32988): GLib-GIO-DEBUG: 05:51:26.255: GTask 0x5607c9739930 (source object: 0x5607c97395d0, source tag: (nil)) finalized without ever returning (using g_task_return_*()). This potentially indicates a bug in the program.

but that's not very useful, and unrelated to this bug (I suppose).
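
For the systemd-managed daemon instance, the same environment could be injected with a drop-in (just a sketch, assuming the unit name udisks2.service):

# hypothetical drop-in to run the service with GLib debug messages enabled
mkdir -p /etc/systemd/system/udisks2.service.d
printf '[Service]\nEnvironment=G_MESSAGES_DEBUG=all\n' > /etc/systemd/system/udisks2.service.d/debug.conf
systemctl daemon-reload
systemctl restart udisks2.service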

I played around further with repr.sh, but no luck so far. I updated the comment above with the results. So I am trying my luck with catching it from the other side, and modifying our integration test. That does the following steps:

  1. boot VM
  2. modprobe scsi_debug
  3. log into cockpit, go to storage page
  4. wait for expected number of devices

I tried to vary this. "stable" means it survived 3 parallel runs of 10 iterations each; "sleep" means adding time.sleep(5); "settle" means calling udevadm settle.

  • boot, sleep, modprobe, login: fail -- okay, so it's not a race against stuff that is still happening during boot
  • boot, modprobe, sleep, login: stable -- not surprising, udisks is less disturbed by the cockpit login and d-bus listeners
  • boot, modprobe, settle, login: stable -- ditto
  • boot, login, modprobe: stable - !! that is surprising -- I had expected this to disturb udisksd more, not less
  • boot, login to overview, modprobe, open storage page: fail
  • boot, login to overview, sleep, modprobe, open storage page [1]: fail -- okay, so it's really not the login or the overview page which disturbs things here, it's the storage page connecting to udisks.

Current theory: happens when connecting to udisks while events are going on. I tried to simulate that with an updated
repr.sh, with varying sleeps, and serial/parallel monitor commands, but still no luck.

There must be something specific about cockpit's udisks monitoring. The Storage page makes a single D-Bus connection to udisks (it also connects to Stratis). Right after connecting, it enables the LVM2 and iscsi modules; that also sounds significant. I'll play around with that.
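
For reference, that module enabling boils down to a single call on the Manager object -- roughly the equivalent of the following (shown with busctl only as an approximation; cockpit does it over its own D-Bus connection):

# rough equivalent of what the storage page triggers right after connecting;
# EnableModules loads all available udisks modules (lvm2, iscsi, ...)
busctl call org.freedesktop.UDisks2 /org/freedesktop/UDisks2/Manager \
    org.freedesktop.UDisks2.Manager EnableModules b true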

[1] diff, for posterity:

--- test/verify/check-storage-scaling
+++ test/verify/check-storage-scaling
@@ -17,6 +17,8 @@
 # You should have received a copy of the GNU Lesser General Public License
 # along with Cockpit; If not, see <http://www.gnu.org/licenses/>.
 
+import time
+
 import storagelib
 import testlib
 
@@ -27,9 +29,14 @@ class TestStorageScaling(storagelib.StorageCase):
         m = self.machine
         b = self.browser
 
+        self.login_and_go()
+        time.sleep(5)
         m.execute("modprobe scsi_debug num_tgts=200")
 
-        self.login_and_go("/storage")
+        b.go("/storage")
+        b.enter_page("/storage")
+        b.wait_visible("#devices")
+
         with b.wait_timeout(60):
             b.click("#drives button:contains(Show all 202 drives)")
             b.wait_not_in_text("#drives", "Show all")

@martinpitt
Contributor Author

martinpitt commented Jun 30, 2023

I reworked repr.sh a bit to be closer to what cockpit does: initially udisksd is not running, then it gets started and modules enabled. But still no luck with reproducing that way 😢 I need to stop here.

Is there some other debugging that I could turn on during the test? --debug unfortunately doesn't seem to help at all.

You can also try to run it yourself. To make sure you don't have to change your running system, it's safest and easiest to do that in a toolbox with our devel/CI container:

toolbox create --image quay.io/cockpit/tasks -c cockpit
toolbox enter cockpit

Inside, check out cockpit, build an arch image:

git clone https://github.com/cockpit-project/cockpit
cd cockpit/
test/image-prepare -q arch

Then open 2 or 3 terminals (for more parallelism, and to fail faster), and run

for i in `seq 10`; do TEST_OS=arch test/verify/check-storage-scaling -stv || break; done

at some point this should fail and "sit":

wait_js_cond(ph_is_present("#drives button:contains(Show all 202 drives):not([disabled]):not([aria-disabled=true])")): Uncaught (in promise) Error: condition did not become true
[...]
SSH ACCESS
  $ ssh -p 2401 -i bots/machine/identity [email protected]

COCKPIT
  http://127.0.0.2:9291
Press RET to continue...

Then you can log into the VM using the provided command (the port number will vary depending on which parallel test failed). You can log in as user "admin" or "root", both have password "foobar".

jelly pushed a commit to cockpit-project/bots that referenced this issue Jun 30, 2023
@tbzatek
Member

tbzatek commented Jul 3, 2023

(rather busy here at the moment, sorry, and PTO until Jul 17)

Just another thing that I've noticed:

sdc           8:32   0    8M  0 disk  
└─mpathc    253:0    0    8M  0 mpath 
sdd           8:48   0    8M  0 disk  
└─mpathb    253:2    0    8M  0 mpath 
sde           8:64   0    8M  0 disk  
└─mpatha    253:3    0    8M  0 mpath 
sdf           8:80   0    8M  0 disk  
sdg           8:96   0    8M  0 disk  
sdh           8:112  0    8M  0 disk  
sdi           8:128  0    8M  0 disk  
sdj           8:144  0    8M  0 disk  
└─mpathd    253:1    0    8M  0 mpath 

For some reason, multipathd is kicking in.

@martinpitt
Contributor Author

@tbzatek : Very well spotted!

There is no multipathd process running. However, there is /usr/lib/udev/rules.d/56-multipath.rules, which triggers a lot of processes like "/sbin/multipath -u sdbv". I tried

--- test/verify/check-storage-scaling
+++ test/verify/check-storage-scaling
@@ -27,6 +27,7 @@ class TestStorageScaling(storagelib.StorageCase):
         m = self.machine
         b = self.browser
 
+        m.execute("rm /usr/lib/udev/rules.d/56-multipath.rules; udevadm control --reload")
         m.execute("modprobe scsi_debug num_tgts=200")
 
         self.login_and_go("/storage")

I have run it 3x parallel for 15 minutes now, and it eventually failed. It may have taken longer than usual.

lsscsi -wiu is "clean" -- it has exactly 200 entries like

[3:0:0:0]    disk    3333333000001f40                  0x33333333000001f40                 /dev/sda   33333333000001f40

However, that was already the case in my original lsscsi output, with multipath. What command did you run that produced the "sdc-mpathc" output? I don't see that anywhere in any of the dumps. If it was lsblk, this also looks clean in the failed instance with disabled mpath:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0    8M  0 disk 
[... they all look the same, no mpath ...]
sdgr   132:112  0    8M  0 disk 
sr0     11:0    1  366K  0 rom  
vda    254:0    0   10G  0 disk 
├─vda1 254:1    0    1M  0 part 
├─vda2 254:2    0  300M  0 part /efi
└─vda3 254:3    0  9.7G  0 part /

@martinpitt
Contributor Author

This now also affects Fedora 38.

@tbzatek
Member

tbzatek commented Nov 16, 2023

Testing this on an F39 cloud image:

# modprobe scsi_debug num_tgts=200
# udisksctl dump | grep devices/sd|wc -l
184
# udisksctl dump | grep ^/org.*drives/Linux_scsi_debug_|wc -l
184
# udisksctl dump | grep '^/org/freedesktop/UDisks2/block_devices/sd' | wc -l
184
# lsblk | grep '^sd' | wc -l
200
# ls -1 /dev/sd* | wc -l
200
# ls -1 /dev/disk/by-id/wwn-0x33333330000* | wc -l
200
# ls -1 /dev/disk/by-id/scsi-333333330000* | wc -l
200
# udevadm info --export-db | grep -e '^P: /devices/pseudo_.*/adapter.*/host.*/target.*/block/sd.*' | wc -l
200

Not good.
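
For anyone reproducing this, a quick way to list exactly which sd devices are missing on the udisks side is to compare the kernel view against the udisks object paths (a hypothetical helper, not something used above):

# devices known to the kernel (lsblk) but absent from udisksctl dump
comm -23 \
    <(lsblk -dno NAME | grep '^sd' | sort) \
    <(udisksctl dump | sed -n 's|^/org/freedesktop/UDisks2/block_devices/\(sd[a-z]*\):.*|\1|p' | sort)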

@tbzatek
Member

tbzatek commented Nov 16, 2023

Btw. same thing as last time - if I start udisksd from a root shell, everything works; it only fails when started through the systemd service. Same binary, no extra args. SELinux is permissive.

@martinpitt
Contributor Author

Weird... I suppose the next thing to look at is udev, which sits in the middle? I checked that the udev dump is identical in the good/bad situations, but it would be good to find out whether udev fails to send some uevents, or whether udisks fails to receive/handle some of them.

@tbzatek
Member

tbzatek commented Nov 16, 2023

It's a performance issue mixed with what looks like a ring buffer on the kernel side, read through netlink. Reading the gudev and libudev sources, records are polled and read directly and not buffered anywhere.

Just putting a 100ms sleep at the beginning of the udisks uevent processing chain results in about 90% uevent loss after a burst of about 250-300 uevents is read, likely because older records get overwritten in the ring buffer. Looking at the uevent sequence numbers after the buffer is exhausted, there are significant gaps. I also noticed a random order of incoming records (but that's generally not guaranteed anyway and we're kinda ready for that).

So we'll need a dedicated high-priority side thread that only reads and queues all incoming uevents; these then go into a probing queue processed in another thread, and the results are finally sent to the main thread where the appropriate D-Bus objects are created.

@martinpitt
Contributor Author

Nice debugging! I'm not entirely sure if we aren't asking too much here. I added that test case back then as we got a bug report from users with a gazillion drives -- but they usually aren't being hotplugged all at the same time.

We could certainly also rewrite the test to create the 200 drives first and then restart udisks, so that they are treated as "coldplug".
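
Roughly like this in the test setup (just a sketch, assuming the service is named udisks2.service):

# create the drives first, then restart udisks so that it enumerates them
# during its coldplug pass at startup instead of via a burst of uevents
modprobe scsi_debug num_tgts=200
udevadm settle
systemctl restart udisks2.service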

@tbzatek
Member

tbzatek commented Nov 16, 2023

Oh no worries, the change will be reasonably small I think. Sooner or later we'd hit the limit anyway, and it might actually have been the source of some random failures we've been seeing.

The problem is that there's a lot going on in the main thread, and things would only get worse once new features are added.

@tbzatek
Member

tbzatek commented Nov 21, 2023

With G_MESSAGES_DEBUG=all I get tons of

(udisksd:32988): GLib-GIO-DEBUG: 05:51:26.255: GTask 0x5607c9739930 (source object: 0x5607c97395d0, source tag: (nil)) finalized without ever returning (using g_task_return_*()). This potentially indicates a bug in the program.

Another good catch, addressed in #1224.

@martinpitt
Contributor Author

Great work @tbzatek ! 💯
