Failed to find logical volume vm-13472-disk-0 after reverting snapshot (StarLVM plugin, FC storage)

Hello,

We've recently migrated from ESXi to Proxmox. We are using the same storage we used in the ESXi environment: an FC SAN, attached in the usual way for this kind of setup.
We have two disks of 15 TB each. We run Proxmox as a cluster with multiple nodes; 4 nodes are currently active.

One is plain LVM: no volume chains, no other tricks, and we don't use snapshots on it. No issues.
The other one is starlvm, for dev VMs where we need snapshot functionality. This is the problematic one.

For development purposes we need snapshots as a tool to roll back VM changes quickly and reliably.
But sometimes, seemingly at random, some VMs lose their disks during the snapshot revert procedure:

Code:

unsupported storage of vg 'star_vg_fc_san_3738'
 activating vm-13472-disk-0...
 deactivating vm-13472-disk-0...
Use of uninitialized value in string ne at /usr/share/perl5/PVE/Storage/Custom/StarLvmPlugin.pm line 481.
TASK ERROR: no such logical volume star_vg_fc_san_3738/vm-13472-disk-0

To keep things consistent, the snapshot revert procedure is as follows:

shutdown VM (gracefully)
get list of snapshots
revert to the latest snapshot (the snapshot itself has no parent)
start VM
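
The four steps above can be sketched in Python roughly as follows. This is only an illustration, not the actual automation: the function names are hypothetical, the snapshot list is taken from the `[name]` sections and `snaptime:` keys of the VM config text (as in the config shown in this post), and the commands are the standard `qm shutdown`, `qm rollback` and `qm start`.

```python
import re
import subprocess
from typing import List, Optional


def latest_snapshot(config_text: str) -> Optional[str]:
    """Return the name of the snapshot section with the highest snaptime."""
    best_name, best_time = None, -1
    section = None
    for raw in config_text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"\[(.+)\]", line)
        if header:
            section = header.group(1)  # entering a snapshot section
            continue
        if section and line.startswith("snaptime:"):
            taken = int(line.split(":", 1)[1])
            if taken > best_time:
                best_name, best_time = section, taken
    return best_name


def revert_commands(vmid: int, snapshot: str) -> List[List[str]]:
    """Build the command sequence: shutdown, rollback, start."""
    return [
        ["qm", "shutdown", str(vmid)],
        ["qm", "rollback", str(vmid), snapshot],
        ["qm", "start", str(vmid)],
    ]


def revert(vmid: int, config_text: str, run=subprocess.run) -> str:
    """Run the whole routine for one VM; `run` is injectable for dry runs."""
    snapshot = latest_snapshot(config_text)
    if snapshot is None:
        raise ValueError(f"VM {vmid} has no snapshot to revert to")
    for cmd in revert_commands(vmid, snapshot):
        run(cmd, check=True)  # stop at the first failing step
    return snapshot
```

Passing a different `run` callable (for example one that only logs the commands) gives a dry run without touching any VM.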

I run this routine sequentially within each VM group (6 VMs per group), but there are multiple VM groups in the cluster (alpha, beta, each with 6 VMs), and the routine may sometimes run for VMs of different groups in parallel. I use some randomization and time shifts to spread the snapshot and power-related tasks out as much as possible. I'm not sure whether this could be the cause, but it seems worth mentioning.

Here is the config of the problematic VM.
All VMs are built similarly; only the name, tags, and MAC differ per VM version and group.

Code:

#Clone made from SOME VM
agent: enabled=1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 2
cpu: host
machine: q35
memory: 4096
meta: creation-qemu=11.0.0,ctime=1780414247
name: VM-NAME-delta
net0: virtio=some_MAC,bridge=somebridge,firewall=1,queues=1
onboot: 0
ostype: l26
parent: phase_2
scsi0: swsan3738lv:vm-13472-disk-0,aio=native,cache=none,detect_zeroes=1,discard=on,iothread=1,queues=2,size=70G
scsihw: virtio-scsi-single
smbios1: uuid=UUID
tags: SOME-TAGS
vmgenid: 896cef72-1765-41dd-8b93-89f4ea668e09
[phase_2]
#Cloned VM Reverting this snapshot when needed
agent: enabled=1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 2
cpu: host
machine: q35
memory: 4096
meta: creation-qemu=11.0.0,ctime=1780414247
name: VM-NAME-delta
net0: virtio=some_MAC,bridge=somebridge,firewall=1,queues=1
onboot: 0
ostype: l26
scsi0: swsan3738lv:vm-13472-disk-0,aio=native,cache=none,detect_zeroes=1,discard=on,iothread=1,queues=2,size=70G
scsihw: virtio-scsi-single
smbios1: uuid=UUID
snaptime: 1780436041
tags: SOME-TAGS
vmgenid: 1f6ec138-c221-497e-ae8f-23d69c5c92d8

I'm comparing two VMs: 13472, which is already affected by this issue, and 13649, which is OK.
Here are the checks I ran after the failure to get the full picture:

Code:

#
# Affected VM  13472
#
pvesm status
unsupported storage of vg 'star_vg_fc_san_3738'
Name                  Type     Status     Total (KiB)      Used (KiB) Available (KiB)        %
local                  dir   disabled               0               0               0      N/A
local-lvm          lvmthin   disabled               0               0               0      N/A
san3739lv              lvm     active     16106123264     10011447296      6094675968   62.16%
swsan3738lv        starlvm     active     16106123264      4272119808     11834003456   26.52%
#
pvesm list swsan3738lv --vmid 13472
unsupported storage of vg 'star_vg_fc_san_3738'
Volid Format  Type      Size VMID
#
sudo /usr/sbin/lvscan | grep /star_vg_fc_san_3738  | grep 13472
  inactive          '/dev/star_vg_fc_san_3738/lvmth-13472' [70.00 GiB] inherit
  inactive          '/dev/star_vg_fc_san_3738/snap_vm-13472-disk-0_phase_2' [70.00 GiB] inherit
#
vgs; lvs -a -o vg_name,lv_name,lv_attr,lv_size,pool_lv,data_percent,metadata_percent | grep 13472
  VG                  #PV #LV #SN Attr   VSize   VFree
  pve                   1   3   0 wz--n- 277.87g  16.00g
  star_vg_fc_san_3738   1 167   0 wz--n- <15.00t <10.96t
  vg_fc_san_3739        1  55   0 wz--n- <15.00t  <5.68t
  star_vg_fc_san_3738 lvmth-13472                    twi---tz-k   70.00g
  star_vg_fc_san_3738 [lvmth-13472_tdata]            Twi-------   70.00g
  star_vg_fc_san_3738 [lvmth-13472_tmeta]            ewi-------   72.00m
  star_vg_fc_san_3738 snap_vm-13472-disk-0_phase_2   Vri---tz-k   70.00g lvmth-13472

#
ls -lah /dev/star_vg_fc_san_3738/ | grep 13472
-- none ---

#
ls -l /dev/mapper/ | grep 13472
-- none ---

#
vgchange -a y star_vg_fc_san_3738
  18 logical volume(s) in volume group "star_vg_fc_san_3738" now active

#
lvs | grep 13472
  lvmth-13472                    star_vg_fc_san_3738 twi---tz-k   70.00g
  snap_vm-13472-disk-0_phase_2   star_vg_fc_san_3738 Vri---tz-k   70.00g lvmth-13472

# It does not help me to check the disk itself, yet
mount /dev/star_vg_fc_san_3738/lvmth-13472 /mnt/pve/
mount /dev/star_vg_fc_san_3738/snap_vm-13472-disk-0_phase_2 /mnt/pve/

#
# Example of healthy VM 13649
#
pvesm list swsan3738lv --vmid 13649
unsupported storage of vg 'star_vg_fc_san_3738'
Volid                       Format  Type             Size VMID
swsan3738lv:vm-13649-disk-0 raw     images    69793218560 13649

#
sudo /usr/sbin/lvscan | grep /star_vg_fc_san_3738  | grep 13649
  ACTIVE            '/dev/star_vg_fc_san_3738/lvmth-13649' [65.00 GiB] inherit
  inactive          '/dev/star_vg_fc_san_3738/snap_vm-13649-disk-0_phase_2' [65.00 GiB] inherit
  ACTIVE            '/dev/star_vg_fc_san_3738/vm-13649-disk-0' [65.00 GiB] inherit

#
vgs; lvs -a -o vg_name,lv_name,lv_attr,lv_size,pool_lv,data_percent,metadata_percent | grep 13649
  VG                  #PV #LV #SN Attr   VSize   VFree
  pve                   1   3   0 wz--n- 277.87g  16.00g
  star_vg_fc_san_3738   1 167   0 wz--n- <15.00t <10.96t
  vg_fc_san_3739        1  55   0 wz--n- <15.00t  <5.68t
  star_vg_fc_san_3738 lvmth-13649                    twi-aotz-k   65.00g             30.48  20.12
  star_vg_fc_san_3738 [lvmth-13649_tdata]            Twi-ao----   65.00g
  star_vg_fc_san_3738 [lvmth-13649_tmeta]            ewi-ao----   68.00m
  star_vg_fc_san_3738 snap_vm-13649-disk-0_phase_2   Vri---tz-k   65.00g lvmth-13649
  star_vg_fc_san_3738 vm-13649-disk-0                Vwi-aotz-k   65.00g lvmth-13649 28.84

#
ls -l /dev/mapper/ | grep 13649
lrwxrwxrwx 1 root root       8 Jun 12 08:34 star_vg_fc_san_3738-lvmth--13649 -> ../dm-45

The interesting part is this line, which exists only for the working VM:

Code:

star_vg_fc_san_3738 vm-13649-disk-0                Vwi-aotz-k   65.00g lvmth-13649 28.84

So, if I understand correctly, using the unaffected VM as an example:

1	VM main disk	as volume	lvmth-13649 twi-aotz-k 65.00g
2	VM disk mounted	volume as disk	vm-13649-disk-0 Vwi-aotz-k 65.00g lvmth-13649
3	VM snapshot	separate volume, mounted as disk?	snap_vm-13649-disk-0_phase_2 Vri---tz-k 65.00g lvmth-13649
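
To read the attribute strings in these rows, here is a small decoder sketch. It covers only the `lv_attr` characters that appear in this post, with meanings taken from the `lvs` man page; the function and dictionary names are hypothetical.

```python
# Meanings per the lvs(8) man page; only the values seen in this post.
VOLUME_TYPE = {
    "V": "thin volume",
    "t": "thin pool",
    "T": "thin pool data",
    "e": "pool metadata",
}
PERMISSIONS = {"w": "writeable", "r": "read-only"}


def decode_lv_attr(attr: str) -> dict:
    """Decode the parts of an lv_attr string that matter for this issue."""
    return {
        "type": VOLUME_TYPE.get(attr[0], attr[0]),
        "permissions": PERMISSIONS.get(attr[1], attr[1]),
        "active": attr[4] == "a",                # 5th character: state
        "open": attr[5] == "o",                  # 6th character: device open
        "skip_activation": len(attr) > 9 and attr[9] == "k",  # 10th character
    }


for name, attr in [
    ("lvmth-13649", "twi-aotz-k"),
    ("vm-13649-disk-0", "Vwi-aotz-k"),
    ("snap_vm-13649-disk-0_phase_2", "Vri---tz-k"),
]:
    print(name, decode_lv_attr(attr))
```

Read this way, `lvmth-13649` is a thin pool, `vm-13649-disk-0` is a writeable thin volume inside it, and the snapshot is a read-only thin volume in the same pool. All three carry the `k` (skip activation) flag, which LVM honours during normal activation unless `-K` / `--ignoreactivationskip` is passed; that may be why `vgchange -a y` left the 13472 volumes inactive.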

The affected VM somehow lost item #2 (the vm-13472-disk-0 volume) during the snapshot revert procedure:

1	VM main disk	as volume	lvmth-13472 twi---tz-k 70.00g
2	VM disk mounted	LOST?	no such logical volume star_vg_fc_san_3738/vm-13472-disk-0
3	VM snapshot	separate volume, mounted as disk?	snap_vm-13472-disk-0_phase_2 Vri---tz-k 70.00g lvmth-13472
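
To find other VMs in the same state, the comparison above can be automated. The sketch below scans `lvs` output for VMIDs that still have a thin pool but no `vm-<vmid>-disk-N` volume. It assumes the naming seen in this post (`lvmth-<vmid>`, `vm-<vmid>-disk-N`, `snap_vm-<vmid>-disk-N_<snapshot>`) and that the VG and LV names are the first two columns; the function name is hypothetical.

```python
import re
from collections import defaultdict
from typing import List

# Naming patterns as they appear in the lvs output in this post (assumption:
# the plugin always names volumes this way).
PATTERNS = {
    "pool": re.compile(r"^lvmth-(\d+)$"),
    "disk": re.compile(r"^vm-(\d+)-disk-\d+$"),
    "snap": re.compile(r"^snap_vm-(\d+)-disk-\d+_.+$"),
}


def find_vms_missing_disk(lvs_output: str) -> List[int]:
    """Return VMIDs that have a thin pool but no vm-<vmid>-disk-N volume."""
    seen = defaultdict(set)
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        lv_name = fields[1].strip("[]")  # hidden LVs are shown in brackets
        for kind, pattern in PATTERNS.items():
            match = pattern.match(lv_name)
            if match:
                seen[int(match.group(1))].add(kind)
    return sorted(
        vmid for vmid, kinds in seen.items()
        if "pool" in kinds and "disk" not in kinds
    )
```

It can be fed the output of `lvs -a -o vg_name,lv_name,lv_attr` for the whole volume group; for the two VMs compared in this post it would flag only 13472.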

Questions:

What am I doing wrong?
Is there a way to bring the lost volume back, since I still have the snapshot and the original disk of the affected VM in place?

UPD: I'm still investigating the issue and trying to pin down a cause (network issues, an improper setup, or concurrency), but I cannot find strong evidence or a relevant error in the host logs.

It happens randomly, and I cannot reproduce it on demand, even by running the snapshot revert routine in a loop 100 times in parallel with the normal routines. I also tried switching the routine between threaded execution and sequential execution (one VM after another); the issue still occurs in both modes.

Thanks in advance

