
Feature request: Handle fail cases caused by missing LVM devices. #326

Open
kvaps opened this issue Nov 18, 2022 · 11 comments

@kvaps

kvaps commented Nov 18, 2022

Hi, I just ran into an issue while resizing a volume:

# linstor r l -r pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node           ┊ Port ┊ Usage ┊ Conns ┊              State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d ┊ monster-killer ┊ 7004 ┊ InUse ┊ Ok    ┊ Resizing, UpToDate ┊ 2022-09-13 09:47:31 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

I tried to invoke the resize operation manually:

# linstor vd set-size pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d 0 19531250KiB
SUCCESS:
Description:
    Volume definition with number '0' of resource definition 'pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d' modified.
Details:
    Volume definition with number '0' of resource definition 'pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d' UUID is: 7f380b22-6ece-41cf-9f2b-5032b29c6868
ERROR:
    (Node: 'monster-killer') Failed to access DRBD super-block of volume pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/0
Show reports:
    linstor error-reports show 635BD872-5C0FA-000126

error report:

# linstor error-reports show 635BD872-5C0FA-000126
ERROR REPORT 635BD872-5C0FA-000126

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.19.1
Build ID:                           a758bf07796c374fd2004465b0d8690209b74356
Build time:                         2022-07-28T04:54:55+00:00
Error time:                         2022-11-03 09:52:23
Node:                               monster-killer

============================================================

Reported error:
===============

Description:
    Failed to access DRBD super-block of volume pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/0

Category:                           LinStorException
Class name:                         VolumeException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.VolumeException
Generated at:                       Method 'hasMetaData', Source file 'DrbdLayer.java', Line #1067

Error message:                      Failed to access DRBD super-block of volume pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/0

Error context:
    An error occurred while processing resource 'Node: 'monster-killer', Rsc: 'pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d''

Call backtrace:

    Method                                   Native Class:Line number
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1067
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:627
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:393
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:847
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:359
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:169
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         NoSuchFileException
Class canonical name:               java.nio.file.NoSuchFileException
Generated at:                       Method 'translateToIOException', Source file 'UnixException.java', Line #92

Error message:                      /dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000

Call backtrace:

    Method                                   Native Class:Line number
    translateToIOException                   N      sun.nio.fs.UnixException:92
    rethrowAsIOException                     N      sun.nio.fs.UnixException:111
    rethrowAsIOException                     N      sun.nio.fs.UnixException:116
    newFileChannel                           N      sun.nio.fs.UnixFileSystemProvider:182
    open                                     N      java.nio.channels.FileChannel:292
    open                                     N      java.nio.channels.FileChannel:345
    readObject                               N      com.linbit.linstor.layer.drbd.utils.MdSuperblockBuffer:74
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1062
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:627
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:393
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:847
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:359
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:169
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.

It seems the satellite wasn't able to find the /dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000 device. Okay, let's exec into the pod:

The LVM volume is found (already resized):

# lvs | grep pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
  pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000 linstor -wi-ao----  18.63g

The DRBD device is found (not resized):

# lsblk /dev/drbd1004
NAME     MAJ:MIN  RM SIZE RO TYPE MOUNTPOINT
drbd1004 147:1004  0  10G  0 disk /var/lib/kubelet/pods/56332201-3640-4de8-9ebb-52244111c406/volumes/kubernetes.io~csi/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/mount

drbdadm adjust does not change anything:

# lsblk /dev/drbd1004
NAME     MAJ:MIN  RM SIZE RO TYPE MOUNTPOINT
drbd1004 147:1004  0  10G  0 disk /var/lib/kubelet/pods/56332201-3640-4de8-9ebb-52244111c406/volumes/kubernetes.io~csi/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/mount

# drbdadm adjust pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d

# lsblk /dev/drbd1004
NAME     MAJ:MIN  RM SIZE RO TYPE MOUNTPOINT
drbd1004 147:1004  0  10G  0 disk /var/lib/kubelet/pods/56332201-3640-4de8-9ebb-52244111c406/volumes/kubernetes.io~csi/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/mount

The drbdadm down/up cycle could not complete because of the missing device:

# drbdadm down pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d

# drbdadm up pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
open(/dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000) failed: No such file or directory
Command 'drbdmeta 1004 v09 /dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000 internal apply-al' terminated with exit code 20
command terminated with exit code 1

# drbdadm up pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
Defaulted container "linstor-satellite" out of: linstor-satellite, kube-rbac-proxy, drbd-prometheus-exporter, kernel-module-injector (init)
new-minor pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d 1004 0: sysfs node '/sys/devices/virtual/block/drbd1004' (already? still?) exists
pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d: Failure: (161) Minor or volume exists already (delete it first)
Command 'drbdsetup new-minor pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d 1004 0' terminated with exit code 10
command terminated with exit code 1

# drbdadm status pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d role:Secondary
  disk:Diskless

Deactivating and then reactivating the LV with lvchange makes the device appear on the node again (activating alone is not enough):

# lvchange -ay linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000
# ls /dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d*
ls: cannot access '/dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d*': No such file or directory

# lvs | grep pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
  pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000 linstor -wi-a-----  18.63g

# lvchange -an linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000
# ls /dev/linstor/ | grep pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
# lvchange -ay linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000
# ls /dev/linstor/
pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000
# drbdadm adjust pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
Moving the internal meta data to its proper location
Internal drbd meta data successfully moved.
# drbdadm status pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d role:Secondary
  disk:UpToDate
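For the record, the recovery sequence that eventually worked above can be sketched as a small helper script. This is only a sketch under the assumptions of this report: the function name and its dry-run "runner" parameter are made up, the _00000 suffix is the volume-0 naming seen here, and the final step that actually fixed things was drbdadm adjust rather than drbdadm up:

```shell
#!/bin/sh
# Sketch of the recovery sequence that worked in this report.
# recover_lv_backed_resource and its "runner" dry-run parameter are
# made up for illustration; pass "echo" to preview, "" to execute.
recover_lv_backed_resource() {
    runner=$1; vg=$2; res=$3
    lv="${res}_00000"                  # volume-0 naming used in this report
    $runner drbdadm down "$res"        # stop the stale DRBD device
    $runner lvchange -an "$vg/$lv"     # deactivate the LV ...
    $runner lvchange -ay "$vg/$lv"     # ... reactivate it so /dev/$vg/$lv reappears
    $runner drbdadm adjust "$res"      # re-attach; this restored UpToDate above
}

# Dry run: print the commands instead of executing them
recover_lv_backed_resource echo linstor pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
```

Running it with "echo" first lets you review the exact commands before executing anything against a live resource.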

This is not the first time I have seen LVM devices disappear from a node this way.

Since we can't influence LVM itself to behave more predictably, I suggest a few enhancements in linstor-server to improve diagnostics and the troubleshooting process:

  1. Detect a missing backing device path and report the problem (or refuse to run resize and related operations).
  2. Consider adding some automation for fixing such issues, e.g. if the resource is not InUse, run drbdadm down; lvchange -an; lvchange -ay; drbdadm up. Or is there a better method?
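Suggestion 1 could be sketched as a simple pre-flight check. The backing_dev helper below is hypothetical; the /dev/<vg>/<resource>_<5-digit volume number> path convention is the one visible in this report:

```shell
#!/bin/sh
# Hypothetical pre-flight check for suggestion 1: verify the backing
# device node exists before allowing a resize. backing_dev is made up
# for illustration; the path layout matches this report.
backing_dev() {
    # $1 = volume group, $2 = resource name, $3 = volume number
    printf '/dev/%s/%s_%05d\n' "$1" "$2" "$3"
}

dev=$(backing_dev linstor pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d 0)
if [ ! -e "$dev" ]; then
    echo "ERROR: backing device $dev is missing; refusing to resize" >&2
fi
```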
@kvaps
Author

kvaps commented Dec 28, 2022

Today this issue repeated itself on a different cluster: the resource was stuck resizing because of a missing LV device node:

# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, kube-rbac-proxy
+-------------------------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | Node                   | Port | Usage | Conns |              State | CreatedOn           |
|=====================================================================================================================================|
| pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 | slt-dev-kube-system-01 | 7000 | InUse | Ok    | Resizing, UpToDate | 2022-10-06 09:32:06 |
+-------------------------------------------------------------------------------------------------------------------------------------+
# linstor vd l
Defaulted container "linstor-controller" out of: linstor-controller, kube-rbac-proxy
+------------------------------------------------------------------------------------------------+
| ResourceName                             | VolumeNr | VolumeMinor | Size    | Gross | State    |
|================================================================================================|
| pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 | 0        | 1000        | 100 GiB |       | resizing |
+------------------------------------------------------------------------------------------------+
# linstor vd set-size pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 0 100G
Defaulted container "linstor-controller" out of: linstor-controller, kube-rbac-proxy
SUCCESS:
Description:
    Volume definition with number '0' of resource definition 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' modified.
Details:
    Volume definition with number '0' of resource definition 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' UUID is: a58b59cd-ce4a-46c2-b9cd-1d7a7eca1b4e
ERROR:
    (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0
Show reports:
    linstor error-reports show 639FE3FF-E8C1E-000009
command terminated with exit code 10
# linstor vd l
Defaulted container "linstor-controller" out of: linstor-controller, kube-rbac-proxy
+------------------------------------------------------------------------------------------------+
| ResourceName                             | VolumeNr | VolumeMinor | Size    | Gross | State    |
|================================================================================================|
| pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 | 0        | 1000        | 100 GiB |       | resizing |
+------------------------------------------------------------------------------------------------+
# linstor error-reports show 639FE3FF-E8C1E-000009
ERROR REPORT 639FE3FF-E8C1E-000009

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.20.0
Build ID:                           9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time:                         2022-11-07T16:37:38+00:00
Error time:                         2022-12-28 11:16:25
Node:                               slt-dev-kube-system-01

============================================================

Reported error:
===============

Description:
    Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Category:                           LinStorException
Class name:                         VolumeException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.VolumeException
Generated at:                       Method 'hasMetaData', Source file 'DrbdLayer.java', Line #1087

Error message:                      Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Error context:
    An error occurred while processing resource 'Node: 'slt-dev-kube-system-01', Rsc: 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434''

Call backtrace:

    Method                                   Native Class:Line number
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1087
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:622
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         NoSuchFileException
Class canonical name:               java.nio.file.NoSuchFileException
Generated at:                       Method 'translateToIOException', Source file 'UnixException.java', Line #92

Error message:                      /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000

Call backtrace:

    Method                                   Native Class:Line number
    translateToIOException                   N      sun.nio.fs.UnixException:92
    rethrowAsIOException                     N      sun.nio.fs.UnixException:111
    rethrowAsIOException                     N      sun.nio.fs.UnixException:116
    newFileChannel                           N      sun.nio.fs.UnixFileSystemProvider:182
    open                                     N      java.nio.channels.FileChannel:292
    open                                     N      java.nio.channels.FileChannel:345
    readObject                               N      com.linbit.linstor.layer.drbd.utils.MdSuperblockBuffer:74
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1082
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:622
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.

I know that this is not a LINSTOR issue, but since we rely on existing technologies, we need to know how to live with them and work around their bugs.

The issue above was fixed by recreating the symlink manually:

# lvscan | grep pvc
  ACTIVE            '/dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000' [100.02 GiB] inherit
# ls -lah /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000
ls: cannot access '/dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000': No such file or directory
# dmsetup ls | grep pvc
data-pvc--96665a02--7aaa--4f19--b10a--74ec53fac434_00000	(253:0)
# ls -lah /dev/dm-* | grep "253, 0"
brw-rw---- 1 root disk 253, 0 Dec 28 10:06 /dev/dm-0
# ln -s /dev/dm-0 /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000
# linstor vd set-size pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 0 100G
SUCCESS:
Description:
    Volume definition with number '0' of resource definition 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' modified.
Details:
    Volume definition with number '0' of resource definition 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' UUID is: a58b59cd-ce4a-46c2-b9cd-1d7a7eca1b4e

Thus symlinks can be recovered without invoking the lvchange -an; lvchange -ay commands.
@ghernadi the devices are active anyway; can't we automate this so we don't rely on the udev daemon?
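The manual steps above (dmsetup ls, then ln -s) can be sketched as a script. The doubling of '-' in device-mapper names matches the dmsetup output shown earlier; using the /dev/mapper alias instead of the raw /dev/dm-N node is an assumption made for robustness:

```shell
#!/bin/sh
# Sketch of the manual symlink recovery above: derive the device-mapper
# name for the LV (every '-' in the VG/LV names is escaped as '--',
# as seen in the dmsetup ls output) and relink /dev/<vg>/<lv> to it.
# VG/LV names are taken from this report.
VG=data
LV=pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000

esc() { printf '%s' "$1" | sed 's/-/--/g'; }
dm_name="$(esc "$VG")-$(esc "$LV")"
dm_dev="/dev/mapper/$dm_name"

if [ -e "$dm_dev" ]; then
    ln -sf "$dm_dev" "/dev/$VG/$LV"    # equivalent to the manual ln -s above
fi
```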

@kvaps
Author

kvaps commented Feb 1, 2023

Today I hit the missing-symlink problem again. I ran into many errors trying to fix the stuck resize attempt, e.g.:

root@slt-dev-kube-system-02:/# linstor r l
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node                   ┊ Port ┊ Usage ┊ Conns ┊              State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 ┊ slt-dev-kube-system-01 ┊ 7000 ┊ InUse ┊ Ok    ┊ Resizing, UpToDate ┊ 2022-10-06 09:32:06 ┊
┊ pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 ┊ slt-dev-kube-system-02 ┊ 7000 ┊       ┊ Ok    ┊  Resizing, Unknown ┊ 2023-01-31 15:38:42 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

root@slt-dev-kube-system-02:/# linstor r d slt-dev-kube-system-02 pvc-96665a02-7aaa-4f19-b10a-74ec53fac434
SUCCESS:
Description:
    Node: slt-dev-kube-system-02, Resource: pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 preparing for deletion.
Details:
    Node: slt-dev-kube-system-02, Resource: pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 UUID is: 8691638c-2caf-4779-a462-a6b54f13cd71
SUCCESS:
    Preparing deletion of resource on 'slt-dev-kube-system-02'
ERROR:
    (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0
Show reports:
    linstor error-reports show 63D51331-E8C1E-000017
ERROR:
Description:
    Deletion of resource 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' on node 'slt-dev-kube-system-02' failed due to an unknown exception.
Details:
    Node: slt-dev-kube-system-02, Resource: pvc-96665a02-7aaa-4f19-b10a-74ec53fac434
Show reports:
    linstor error-reports show 63CACC00-00000-000007
# linstor error-reports show 63CACC00-00000-000007
ERROR REPORT 63CACC00-00000-000007

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.20.0
Build ID:                           9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time:                         2022-11-07T16:37:38+00:00
Error time:                         2023-02-01 13:58:47
Node:                               linstor-controller-766b7f6574-h469w
Peer:                               RestClient(192.168.236.102; 'PythonLinstor/1.15.1 (API1.0.4): Client 1.15.1')

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         DelayedApiRcException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils.DelayedApiRcException
Generated at:                       Method 'lambda$mergeExtractingApiRcExceptions$4', Source file 'CtrlResponseUtils.java', Line #126

Error message:                      Exceptions have been converted to responses

Error context:
    Deletion of resource 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' on node 'slt-dev-kube-system-02' failed due to an unknown exception.

Asynchronous stage backtrace:
    (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

    Error has been observed at the following site(s):
    	|_ checkpoint ⇢ Prepare resource delete
    	|_ checkpoint ⇢ Activating resource if necessary before deletion
    Stack trace:

Call backtrace:

    Method                                   Native Class:Line number
    lambda$mergeExtractingApiRcExceptions$4  N      com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils:126

Suppressed exception 1 of 2:
===============
Category:                           RuntimeException
Class name:                         ApiRcException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at:                       Method 'handleAnswer', Source file 'CommonMessageProcessor.java', Line #337

Error message:                      (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Error context:
    Deletion of resource 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' on node 'slt-dev-kube-system-02' failed due to an unknown exception.

ApiRcException entries:
Nr: 1
  Message: (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Call backtrace:

    Method                                   Native Class:Line number
    handleAnswer                             N      com.linbit.linstor.proto.CommonMessageProcessor:337
    handleDataMessage                        N      com.linbit.linstor.proto.CommonMessageProcessor:284
    doProcessInOrderMessage                  N      com.linbit.linstor.proto.CommonMessageProcessor:235
    lambda$doProcessMessage$3                N      com.linbit.linstor.proto.CommonMessageProcessor:220
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.FluxFlatMap$FlatMapMain:418
    drainAsync                               N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
    drain                                    N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
    onNext                                   N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
    drainFused                               N      reactor.core.publisher.UnicastProcessor:286
    drain                                    N      reactor.core.publisher.UnicastProcessor:329
    onNext                                   N      reactor.core.publisher.UnicastProcessor:408
    next                                     N      reactor.core.publisher.FluxCreate$IgnoreSink:618
    next                                     N      reactor.core.publisher.FluxCreate$SerializedSink:153
    processInOrder                           N      com.linbit.linstor.netcom.TcpConnectorPeer:383
    doProcessMessage                         N      com.linbit.linstor.proto.CommonMessageProcessor:218
    lambda$processMessage$2                  N      com.linbit.linstor.proto.CommonMessageProcessor:164
    onNext                                   N      reactor.core.publisher.FluxPeek$PeekSubscriber:177
    runAsync                                 N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
    run                                      N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
    call                                     N      reactor.core.scheduler.WorkerTask:84
    call                                     N      reactor.core.scheduler.WorkerTask:37
    run                                      N      java.util.concurrent.FutureTask:264
    run                                      N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
    runWorker                                N      java.util.concurrent.ThreadPoolExecutor:1128
    run                                      N      java.util.concurrent.ThreadPoolExecutor$Worker:628
    run                                      N      java.lang.Thread:829

Suppressed exception 2 of 2:
===============
Category:                           RuntimeException
Class name:                         OnAssemblyException
Class canonical name:               reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at:                       Method 'lambda$mergeExtractingApiRcExceptions$4', Source file 'CtrlResponseUtils.java', Line #126

Error message:
Error has been observed at the following site(s):
	|_ checkpoint ⇢ Prepare resource delete
	|_ checkpoint ⇢ Activating resource if necessary before deletion
Stack trace:

Error context:
    Deletion of resource 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' on node 'slt-dev-kube-system-02' failed due to an unknown exception.

Call backtrace:

    Method                                   Native Class:Line number
    lambda$mergeExtractingApiRcExceptions$4  N      com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils:126
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onComplete                               N      reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber:207
    onComplete                               N      reactor.core.publisher.FluxMap$MapSubscriber:136
    checkTerminated                          N      reactor.core.publisher.FluxFlatMap$FlatMapMain:838
    drainLoop                                N      reactor.core.publisher.FluxFlatMap$FlatMapMain:600
    innerComplete                            N      reactor.core.publisher.FluxFlatMap$FlatMapMain:909
    onComplete                               N      reactor.core.publisher.FluxFlatMap$FlatMapInner:1013
    onComplete                               N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
    request                                  N      reactor.core.publisher.FluxJust$WeakScalarSubscription:101
    set                                      N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2152
    onSubscribe                              N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:68
    subscribe                                N      reactor.core.publisher.FluxJust:70
    subscribe                                N      reactor.core.publisher.Flux:8357
    onError                                  N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:97
    onError                                  N      reactor.core.publisher.FluxMap$MapSubscriber:126
    onError                                  N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2021
    onError                                  N      reactor.core.publisher.MonoIgnoreElements$IgnoreElementsSubscriber:76
    onError                                  N      reactor.core.publisher.FluxPeek$PeekSubscriber:214
    onError                                  N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:100
    error                                    N      reactor.core.publisher.Operators:196
    subscribe                                N      reactor.core.publisher.FluxError:43
    subscribe                                N      reactor.core.publisher.Flux:8357
    onError                                  N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:97
    onError                                  N      reactor.core.publisher.FluxMap$MapSubscriber:126
    onError                                  N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2021
    error                                    N      reactor.core.publisher.FluxCreate$BaseSink:452
    drain                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:781
    error                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:726
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedSink:229
    drain                                    N      reactor.core.publisher.FluxCreate$SerializedSink:205
    error                                    N      reactor.core.publisher.FluxCreate$SerializedSink:181
    apiCallError                             N      com.linbit.linstor.netcom.TcpConnectorPeer:451
    handleAnswer                             N      com.linbit.linstor.proto.CommonMessageProcessor:349
    handleDataMessage                        N      com.linbit.linstor.proto.CommonMessageProcessor:284
    doProcessInOrderMessage                  N      com.linbit.linstor.proto.CommonMessageProcessor:235
    lambda$doProcessMessage$3                N      com.linbit.linstor.proto.CommonMessageProcessor:220
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.FluxFlatMap$FlatMapMain:418
    drainAsync                               N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
    drain                                    N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
    onNext                                   N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
    drainFused                               N      reactor.core.publisher.UnicastProcessor:286
    drain                                    N      reactor.core.publisher.UnicastProcessor:329
    onNext                                   N      reactor.core.publisher.UnicastProcessor:408
    next                                     N      reactor.core.publisher.FluxCreate$IgnoreSink:618
    next                                     N      reactor.core.publisher.FluxCreate$SerializedSink:153
    processInOrder                           N      com.linbit.linstor.netcom.TcpConnectorPeer:383
    doProcessMessage                         N      com.linbit.linstor.proto.CommonMessageProcessor:218
    lambda$processMessage$2                  N      com.linbit.linstor.proto.CommonMessageProcessor:164
    onNext                                   N      reactor.core.publisher.FluxPeek$PeekSubscriber:177
    runAsync                                 N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
    run                                      N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
    call                                     N      reactor.core.scheduler.WorkerTask:84
    call                                     N      reactor.core.scheduler.WorkerTask:37
    run                                      N      java.util.concurrent.FutureTask:264
    run                                      N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
    runWorker                                N      java.util.concurrent.ThreadPoolExecutor:1128
    run                                      N      java.util.concurrent.ThreadPoolExecutor$Worker:628
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.
linstor error-reports show 63D51331-E8C1E-000017
ERROR REPORT 63D51331-E8C1E-000017

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.20.0
Build ID:                           9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time:                         2022-11-07T16:37:38+00:00
Error time:                         2023-02-01 13:58:46
Node:                               slt-dev-kube-system-01

============================================================

Reported error:
===============

Description:
    Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Category:                           LinStorException
Class name:                         VolumeException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.VolumeException
Generated at:                       Method 'hasMetaData', Source file 'DrbdLayer.java', Line #1087

Error message:                      Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Error context:
    An error occurred while processing resource 'Node: 'slt-dev-kube-system-01', Rsc: 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434''

Call backtrace:

    Method                                   Native Class:Line number
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1087
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:622
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         NoSuchFileException
Class canonical name:               java.nio.file.NoSuchFileException
Generated at:                       Method 'translateToIOException', Source file 'UnixException.java', Line #92

Error message:                      /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000

Call backtrace:

    Method                                   Native Class:Line number
    translateToIOException                   N      sun.nio.fs.UnixException:92
    rethrowAsIOException                     N      sun.nio.fs.UnixException:111
    rethrowAsIOException                     N      sun.nio.fs.UnixException:116
    newFileChannel                           N      sun.nio.fs.UnixFileSystemProvider:182
    open                                     N      java.nio.channels.FileChannel:292
    open                                     N      java.nio.channels.FileChannel:345
    readObject                               N      com.linbit.linstor.layer.drbd.utils.MdSuperblockBuffer:74
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1082
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:622
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.

I found that vgscan --mknodes fixes the issue of the missing symlink.
So couldn't we simply run it before the resize attempt when the device is missing?
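As a rough illustration of that suggestion (a minimal sketch, not LINSTOR code; the helper name is hypothetical and the device path is taken from the error report above), a pre-flight check could look like this:

```shell
#!/bin/sh
# Hypothetical pre-flight check before a resize: if the LVM symlink has
# vanished, try to recreate the /dev/<vg>/<lv> nodes with vgscan --mknodes.
ensure_lv_device() {
    dev="$1"    # e.g. /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000
    [ -e "$dev" ] && return 0
    echo "device $dev missing, attempting vgscan --mknodes" >&2
    # guard so the sketch degrades gracefully where LVM tools are absent
    command -v vgscan >/dev/null 2>&1 && vgscan --mknodes
    [ -e "$dev" ]   # succeed only if the node reappeared
}

ensure_lv_device /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000 \
    || echo "device still missing, would abort the resize" >&2
```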

@flant-team-zulu

flant-team-zulu commented Apr 15, 2023

root@kube-master:~# kubectl -n dev get pvc data-dispace-redis-0 -o jsonpath='{.spec.resources.requests.storage}' && echo 
512Mi
root@kube-node-1:~# ls /dev/linstor_data/pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d*
/dev/linstor_data/pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d_00000
root@kube-master:~# kubectl -n dev patch pvc data-dispace-redis-0 --type='json' -p='[{"op": "replace", "path": "/spec/resources/requests/storage", "value":"530Mi"}]'
persistentvolumeclaim/data-dispace-redis-0 patched
root@kube-master:~# kubectl -n dev get pvc data-dispace-redis-0 -o jsonpath='{.spec.resources.requests.storage}' && echo 
530Mi
root@kube-node-1:~# ls /dev/linstor_data/pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d*
ls: cannot access '/dev/linstor_data/pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d*': No such file or directory
root@kube-master:~# linstor v l

+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node        | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |              State |
|===========================================================================================================================================================|
| kube-node-1 | pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d | lvm                  |     0 |    1004 | /dev/drbd1004 |   532 MiB | Unused | Resizing, UpToDate |
| kube-node-2 | pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d | lvm                  |     0 |    1004 | /dev/drbd1004 |   532 MiB | InUse  | Resizing, UpToDate |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+

@kvaps
Author

kvaps commented Jul 10, 2023

kvaps added a commit to kvaps/linstor-server that referenced this issue Oct 3, 2023
Introduce an additional handler that checks for the device path before each such modification.
If the device is not found, it attempts to fix the symlink using dmsetup output.

This change is a workaround for a specific set of issues, often related to udev,
which lead to the disappearance of symlinks for LVM devices on a running system.
These issues commonly occur during device resizing and deactivation,
causing LINSTOR exceptions when accessing the DRBD super-block of a volume.

fixes: LINBIT#326

Signed-off-by: Andrei Kvapil <[email protected]>
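The repair step described in the commit message could be sketched roughly as follows (a simplified shell illustration of the idea, not the actual Java patch; the name-escaping detail reflects LVM's convention of doubling '-' in device-mapper names):

```shell
#!/bin/sh
# Sketch of the symlink-repair idea: if /dev/<vg>/<lv> is gone but
# device-mapper still exposes the volume under /dev/mapper, relink it.
dm_escape() {
    # LVM escapes '-' as '--' when building <vg>-<lv> device-mapper names
    echo "$1" | sed 's/-/--/g'
}

fix_lv_symlink() {
    vg="$1"; lv="$2"
    link="/dev/$vg/$lv"
    [ -e "$link" ] && return 0
    dm_dev="/dev/mapper/$(dm_escape "$vg")-$(dm_escape "$lv")"
    if [ -e "$dm_dev" ]; then
        mkdir -p "/dev/$vg" && ln -s "$dm_dev" "$link"
    else
        echo "no device-mapper entry for $vg/$lv either" >&2
        return 1
    fi
}

# e.g. fix_lv_symlink data pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000
```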
@maxpain

maxpain commented Nov 3, 2024

I faced the same problem

@dimm0

dimm0 commented Dec 18, 2024

Same here

@WanzenBug
Contributor

WanzenBug commented Dec 18, 2024

This has been fixed in recent Piraeus Operator releases. It doesn't have much to do with LINSTOR itself, which just calls the usual lvresize commands. The issue seemed to occur when running in a container without access to the host's udev daemon.

See piraeusdatastore/piraeus-operator#728 for details on the fix.

As for "detecting" the issue in LINSTOR: there is nothing to detect until it is already too late, because the missing link is caused by the resize command itself.

So please upgrade to the latest Piraeus Operator (at least version 2.7.0); the issue will go away.

@dimm0

dimm0 commented Jan 21, 2025

After the upgrade I'm getting a "No space" error when resizing:

  Warning  FailedMount             2s    kubelet                  MountVolume.Setup failed while expanding volume for volume "pvc-9e67b737-53ae-42e3-b4c9-6b6812e9ef68" : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = NodeExpandVolume - expand volume failed for target /var/lib/kubelet/pods/3f0a28de-2945-41f7-b073-8f9ce565a465/volumes/kubernetes.io~csi/pvc-9e67b737-53ae-42e3-b4c9-6b6812e9ef68/mount, err: resize of device /var/lib/kubelet/pods/3f0a28de-2945-41f7-b073-8f9ce565a465/volumes/kubernetes.io~csi/pvc-9e67b737-53ae-42e3-b4c9-6b6812e9ef68/mount failed: exit status 1. xfs_growfs output: xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: No space left on device
meta-data=/dev/drbd1370          isize=512    agcount=8, agsize=655474 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0 nrext64=0
data     =                       bsize=4096   blocks=5243792, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
  Warning  FailedMount  1s  kubelet  MountVolume.SetUp failed for volume "pvc-9e67b737-53ae-42e3-b4c9-6b6812e9ef68" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-9e67b737-53ae-42e3-b4c9-6b6812e9ef68: unable to resize volume: resize of device /var/lib/kubelet/pods/3f0a28de-2945-41f7-b073-8f9ce565a465/volumes/kubernetes.io~csi/pvc-9e67b737-53ae-42e3-b4c9-6b6812e9ef68/mount failed: exit status 1. xfs_growfs output: xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: No space left on device
meta-data=/dev/drbd1370          isize=512    agcount=8, agsize=655474 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0 nrext64=0
data     =                       bsize=4096   blocks=5243792, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   

@dimm0

dimm0 commented Jan 22, 2025

@WanzenBug we can't use several volumes now because of this. Could you please take a look?

@dimm0

dimm0 commented Jan 23, 2025

The weird thing is that both the LVM volume and the XFS filesystem have already been resized to the requested size. Manually mounting the DRBD device and running xfs_growfs produces the same error.
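One way to investigate that state (a hedged diagnostic sketch; /dev/drbd1370 comes from the xfs_growfs output above, the helper name and mountpoint are hypothetical) is to compare the size xfs_info reports with the size of the backing device — if they already match, xfs_growfs has nothing to grow:

```shell
#!/bin/sh
# Diagnostic sketch: compute the filesystem size in bytes from xfs_info
# output, for comparison against `blockdev --getsize64` on the DRBD device.
fs_bytes_from_xfs_info() {
    # extract bsize and blocks from the "data = ..." line of xfs_info output
    awk '/^data / { for (i = 1; i <= NF; i++) {
            if ($i ~ /^bsize=/)  { sub(/bsize=/, "", $i); bsize = $i }
            if ($i ~ /^blocks=/) { gsub(/blocks=|,/, "", $i); blocks = $i }
        } }
        END { printf "%.0f\n", bsize * blocks }'
}

# usage (run as root, on the node where the volume is mounted):
#   xfs_info /mnt | fs_bytes_from_xfs_info
#   blockdev --getsize64 /dev/drbd1370
```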

@WanzenBug
Contributor

Please open an issue in the piraeus-operator repository; this has nothing to do with LINSTOR itself. It may also be a kernel issue: I remember a version of RHEL 9.2 that produced this error when trying to resize a volume that did not need resizing. So perhaps try updating everything?
