Hello,
We've recently migrated from ESXi to Proxmox. We are using the same storage we used in the ESXi environment: an FC SAN, attached the usual way for such a setup.
We have two disks, 15 TB each. We run Proxmox as a cluster with multiple nodes; 4 nodes are currently active.
One is plain LVM: no volume chains, no other tricks, and no snapshots on it. No issues.
Another one is starlvm - for dev VMs where we need to use snapshot functionality. This is the problematic one.
For development purposes we need to use snapshots as a tool to roll-back VM changes fast and reliably.
But sometimes, randomly, some VMs lose their disks during the snapshot revert procedure:
Code:
unsupported storage of vg 'star_vg_fc_san_3738'
activating vm-13472-disk-0...
deactivating vm-13472-disk-0...
Use of uninitialized value in string ne at /usr/share/perl5/PVE/Storage/Custom/StarLvmPlugin.pm line 481.
TASK ERROR: no such logical volume star_vg_fc_san_3738/vm-13472-disk-0
To keep things consistent, the snapshot revert procedure is as follows:
- shut down the VM (gracefully)
- get the list of snapshots
- revert to the latest snapshot (it has no parent)
- start the VM
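The steps above can be sketched roughly like this. It is a simplified sketch, not my exact script: it assumes the stock `qm` CLI on a Proxmox node, and the function name `revert_vm`, the `SHUTDOWN_TIMEOUT` variable and its 180 second default are made up for illustration.

```shell
# Sketch of the revert routine for a single VM (assumes the Proxmox qm CLI).
# revert_vm and SHUTDOWN_TIMEOUT are illustrative names, not from my real script.
revert_vm() {
    local vmid="$1"
    local snap="${2:-phase_2}"
    local timeout="${SHUTDOWN_TIMEOUT:-180}"

    # 1. Graceful shutdown; stop here if the guest does not power off.
    qm shutdown "$vmid" --timeout "$timeout" || return 1

    # 2. Check that the snapshot is actually listed for this VM.
    if ! qm listsnapshot "$vmid" | grep -qw -- "$snap"; then
        echo "snapshot $snap not found for VM $vmid" >&2
        return 2
    fi

    # 3. Roll back; 4. start the VM only if the rollback succeeded.
    qm rollback "$vmid" "$snap" || return 3
    qm start "$vmid" || return 4
}
```

A call would look like `revert_vm 13472 phase_2`. Each step returns a distinct exit code so that a failed rollback never leads to a start attempt.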
I run this routine sequentially within each VM group (6 VMs per group), but there are multiple VM groups in the cluster (alpha, beta, each with 6 VMs in it), so the routine can sometimes run for VMs of different groups in parallel. I add some randomization and time shifts so that snapshot and power-related tasks are spread out as much as possible. I'm not sure whether this is the cause, but it may be worth mentioning.
Here is the config of the problematic VM. All VMs are built the same way; only the name, tags, and MAC differ per VM version and group:
Code:
#Clone made from SOME VM
agent: enabled=1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 2
cpu: host
machine: q35
memory: 4096
meta: creation-qemu=11.0.0,ctime=1780414247
name: VM-NAME-delta
net0: virtio=some_MAC,bridge=somebridge,firewall=1,queues=1
onboot: 0
ostype: l26
parent: phase_2
scsi0: swsan3738lv:vm-13472-disk-0,aio=native,cache=none,detect_zeroes=1,discard=on,iothread=1,queues=2,size=70G
scsihw: virtio-scsi-single
smbios1: uuid=UUID
tags: SOME-TAGS
vmgenid: 896cef72-1765-41dd-8b93-89f4ea668e09
[phase_2]
#Cloned VM Reverting this snapshot when needed
agent: enabled=1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 2
cpu: host
machine: q35
memory: 4096
meta: creation-qemu=11.0.0,ctime=1780414247
name: VM-NAME-delta
net0: virtio=some_MAC,bridge=somebridge,firewall=1,queues=1
onboot: 0
ostype: l26
scsi0: swsan3738lv:vm-13472-disk-0,aio=native,cache=none,detect_zeroes=1,discard=on,iothread=1,queues=2,size=70G
scsihw: virtio-scsi-single
smbios1: uuid=UUID
snaptime: 1780436041
tags: SOME-TAGS
vmgenid: 1f6ec138-c221-497e-ae8f-23d69c5c92d8
I'm comparing two VMs: 13472, which is already affected by this issue, and 13649, which is OK.
Here are the checks I ran after the failure to get the full picture:
Code:
#
# Affected VM 13472
#
pvesm status
unsupported storage of vg 'star_vg_fc_san_3738'
Name Type Status Total (KiB) Used (KiB) Available (KiB) %
local dir disabled 0 0 0 N/A
local-lvm lvmthin disabled 0 0 0 N/A
san3739lv lvm active 16106123264 10011447296 6094675968 62.16%
swsan3738lv starlvm active 16106123264 4272119808 11834003456 26.52%
#
pvesm list swsan3738lv --vmid 13472
unsupported storage of vg 'star_vg_fc_san_3738'
Volid Format Type Size VMID
#
sudo /usr/sbin/lvscan | grep /star_vg_fc_san_3738 | grep 13472
inactive '/dev/star_vg_fc_san_3738/lvmth-13472' [70.00 GiB] inherit
inactive '/dev/star_vg_fc_san_3738/snap_vm-13472-disk-0_phase_2' [70.00 GiB] inherit
#
vgs; lvs -a -o vg_name,lv_name,lv_attr,lv_size,pool_lv,data_percent,metadata_percent | grep 13472
VG #PV #LV #SN Attr VSize VFree
pve 1 3 0 wz--n- 277.87g 16.00g
star_vg_fc_san_3738 1 167 0 wz--n- <15.00t <10.96t
vg_fc_san_3739 1 55 0 wz--n- <15.00t <5.68t
star_vg_fc_san_3738 lvmth-13472 twi---tz-k 70.00g
star_vg_fc_san_3738 [lvmth-13472_tdata] Twi------- 70.00g
star_vg_fc_san_3738 [lvmth-13472_tmeta] ewi------- 72.00m
star_vg_fc_san_3738 snap_vm-13472-disk-0_phase_2 Vri---tz-k 70.00g lvmth-13472
#
ls -lah /dev/star_vg_fc_san_3738/ | grep 13472
-- none ---
#
ls -l /dev/mapper/ | grep 13472
-- none ---
#
vgchange -a y star_vg_fc_san_3738
18 logical volume(s) in volume group "star_vg_fc_san_3738" now active
#
lvs | grep 13472
lvmth-13472 star_vg_fc_san_3738 twi---tz-k 70.00g
snap_vm-13472-disk-0_phase_2 star_vg_fc_san_3738 Vri---tz-k 70.00g lvmth-13472
# It does not help me to check the disk itself, yet
mount /dev/star_vg_fc_san_3738/lvmth-13472 /mnt/pve/
mount /dev/star_vg_fc_san_3738/snap_vm-13472-disk-0_phase_2 /mnt/pve/
#
# Example of healthy VM 13649
#
pvesm list swsan3738lv --vmid 13649
unsupported storage of vg 'star_vg_fc_san_3738'
Volid Format Type Size VMID
swsan3738lv:vm-13649-disk-0 raw images 69793218560 13649
#
sudo /usr/sbin/lvscan | grep /star_vg_fc_san_3738 | grep 13649
ACTIVE '/dev/star_vg_fc_san_3738/lvmth-13649' [65.00 GiB] inherit
inactive '/dev/star_vg_fc_san_3738/snap_vm-13649-disk-0_phase_2' [65.00 GiB] inherit
ACTIVE '/dev/star_vg_fc_san_3738/vm-13649-disk-0' [65.00 GiB] inherit
#
vgs; lvs -a -o vg_name,lv_name,lv_attr,lv_size,pool_lv,data_percent,metadata_percent | grep 13649
VG #PV #LV #SN Attr VSize VFree
pve 1 3 0 wz--n- 277.87g 16.00g
star_vg_fc_san_3738 1 167 0 wz--n- <15.00t <10.96t
vg_fc_san_3739 1 55 0 wz--n- <15.00t <5.68t
star_vg_fc_san_3738 lvmth-13649 twi-aotz-k 65.00g 30.48 20.12
star_vg_fc_san_3738 [lvmth-13649_tdata] Twi-ao---- 65.00g
star_vg_fc_san_3738 [lvmth-13649_tmeta] ewi-ao---- 68.00m
star_vg_fc_san_3738 snap_vm-13649-disk-0_phase_2 Vri---tz-k 65.00g lvmth-13649
star_vg_fc_san_3738 vm-13649-disk-0 Vwi-aotz-k 65.00g lvmth-13649 28.84
#
ls -l /dev/mapper/ | grep 13649
lrwxrwxrwx 1 root root 8 Jun 12 08:34 star_vg_fc_san_3738-lvmth--13649 -> ../dm-45
The interesting part is in the working VM:
Code:
star_vg_fc_san_3738 vm-13649-disk-0 Vwi-aotz-k 65.00g lvmth-13649 28.84
So, if I understand correctly, using the unaffected VM as an example:
| 1 | VM main disk | as volume | lvmth-13649 twi-aotz-k 65.00g |
| 2 | VM disk mounted | volume as disk | vm-13649-disk-0 Vwi-aotz-k 65.00g lvmth-13649 |
| 3 | VM snapshot | separate volume, mounted as disk? | snap_vm-13649-disk-0_phase_2 Vri---tz-k 65.00g lvmth-13649 |
The affected VM somehow lost mount #2 during the snapshot revert procedure:
| 1 | VM main disk | as volume | lvmth-13472 twi---tz-k 70.00g |
| 2 | VM disk mounted | LOST? | no such logical volume star_vg_fc_san_3738/vm-13472-disk-0 |
| 3 | VM snapshot | separate volume, mounted as disk? | snap_vm-13472-disk-0_phase_2 Vri---tz-k 70.00g lvmth-13472 |
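To compare the attribute strings in the two tables, here is a small helper that decodes the positions that differ. It is only a sketch based on my reading of the lv_attr description in lvs(8); the function name `describe_lv_attr` and the output wording are my own, and it covers only the flag values that appear in this post.

```shell
# Decode selected lv_attr positions (per my reading of lvs(8)):
#   1 = volume type, 2 = permissions, 5 = state, 6 = device open,
#   10 = skip-activation flag.
describe_lv_attr() {
    local attr="$1" type perm state open skip
    local type_s perm_s state_s open_s skip_s

    # Pick single characters by 1-based position.
    type=$(printf '%s' "$attr" | cut -c1)
    perm=$(printf '%s' "$attr" | cut -c2)
    state=$(printf '%s' "$attr" | cut -c5)
    open=$(printf '%s' "$attr" | cut -c6)
    skip=$(printf '%s' "$attr" | cut -c10)

    case "$type" in
        t) type_s="thin-pool" ;;
        V) type_s="thin-volume" ;;
        *) type_s="other($type)" ;;
    esac
    case "$perm" in
        w) perm_s="writable" ;;
        r) perm_s="read-only" ;;
        *) perm_s="perm($perm)" ;;
    esac
    case "$state" in
        a) state_s="active" ;;
        -) state_s="not-active" ;;
        *) state_s="state($state)" ;;
    esac
    if [ "$open" = "o" ]; then open_s="open"; else open_s="not-open"; fi
    if [ "$skip" = "k" ]; then skip_s="skip-activation"; else skip_s="no-skip"; fi

    echo "$type_s $perm_s $state_s $open_s $skip_s"
}
```

For example, `describe_lv_attr twi---tz-k` (the pool of the affected VM) prints `thin-pool writable not-active not-open skip-activation`, while `describe_lv_attr Vwi-aotz-k` (the disk of the healthy VM) prints `thin-volume writable active open skip-activation`. As far as I understand lvchange(8), volumes with the `k` flag are skipped by a plain `vgchange -a y` unless `-K` is given, which may be why the pool stayed inactive in my check above.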
Questions:
- What am I doing wrong?
- Is there a way to mount the lost volume again, since I still have the snapshot and the original disk of the affected VM in place?
UPD: I'm still investigating the issue and trying to find the cause (network issues, improper setup, or concurrency), but I cannot find strong evidence or a relevant error in the host logs.
It happens randomly, and I could not reproduce it by running the snapshot revert routine in a loop 100 times in parallel with the normal routines. I tried running the routine both in threads and sequentially, one VM after another, and it still happens.
Thanks in advance