Background
I run two Proxmox hosts: a primary, and a secondary that serves as a Proxmox Backup Server target. As is too common, the primary differs considerably from the secondary, including in its boot configuration, which was the component affected by the latest Proxmox major upgrade (v7 to v8).
The situation was made worse by a “temporary” implementation of virtualized OPNsense after a failure of pfSense hardware in 2021. This makes the primary Proxmox host a rather juicy single point of failure.
Failed upgrade from Proxmox v7 to v8
Upgrading the secondary host went smoothly. While upgrading the primary host, I received an error towards the end of the apt dist-upgrade step, which caused a cascading failure: /usr/sbin/grub-probe: error: failed to get canonical path of `/dev/disk/by-id/ata-WDC_WDS250G2B0A_201688800263-part3'. Most of the rest of the upgrade then failed, seemingly because the proxmox-ve package depends on a functioning kernel. With grub, the kernel, and the proxmox-ve package all left in a failed state, the host would almost certainly not have come back in a functional state after a reboot (and neither would the virtualized OPNsense).
After a bit of poking around to remember the boot configuration and learn how Proxmox boots on ZFS, I noticed that the output of ls -lah /dev/disk/by-id did not include the path referenced in the grub-probe error. This was odd, as this system had booted fine for 2.5 years with the same ZFS configuration. My current assumption is that something changed in Debian Bookworm, but the only path that existed was /dev/disk/by-id/ata-WDC_WDS250G2B0A_201688800263, without the -part3 suffix.
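A minimal sketch of how this mismatch could be checked in one step, assuming the pool reports its members as full paths via zpool status -P. The function names missing_paths and pool_device_paths are hypothetical helpers for illustration, not part of ZFS or Proxmox.

```shell
#!/bin/sh
# Print every path read from stdin (one per line) that does not exist.
missing_paths() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        [ -e "$path" ] || echo "$path"
    done
}

# Keep only the first column of lines whose first field is a /dev/ path,
# which is how vdevs appear in `zpool status -P` output.
pool_device_paths() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# Usage on a ZFS host (assumption: the pool is named rpool):
#   zpool status -P rpool | pool_device_paths | missing_paths
```

Any path this prints is one the pool still references but that no longer exists under /dev, which is the situation grub-probe tripped over here.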
At this point I was at a loss: the root zpool was still up and working with a path that seemingly did not exist. It wasn’t clear why the path was missing, but I figured it wouldn’t hurt to detach and reattach the disk to the zpool and see what happened. This post was helpful on the process (the ZFS docs are a bit disjointed): https://plantroon.com/changing-disk-identifiers-in-zpool/
The process to detach/reattach the disk was like this:
- Verify pool status
zpool status rpool
- Detach the disk:
zpool detach rpool ata-WDC_WDS250G2B0A_201688800263-part3
- Clear the labels using a path that existed
zpool labelclear -f /dev/disk/by-id/ata-WDC_WDS250G2B0A_201688800263
- Attach the disk after the part3 path appeared
zpool attach rpool /dev/disk/by-id/ata-WDC_WDS250G2B0A-00SM50_212702A00863-part3 /dev/disk/by-id/ata-WDC_WDS250G2B0A_201688800263-part3
- Verify pool status was online with a resilvering in progress
zpool status rpool
root@pve01:/etc/lvm# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 25 21:22:41 2023
        22.8G scanned at 4.57G/s, 236M issued at 47.2M/s, 22.8G total
        238M resilvered, 1.01% done, 00:08:11 to go
config:

        NAME                                               STATE     READ WRITE CKSUM
        rpool                                              ONLINE       0     0     0
          mirror-0                                         ONLINE       0     0     0
            ata-WDC_WDS250G2B0A-00SM50_212702A00863-part3  ONLINE       0     0     0
            ata-WDC_WDS250G2B0A_201688800263-part3         ONLINE       0     0     0  (resilvering)

errors: No known data errors
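The detach/reattach steps above can be collected into a small script. This is only a sketch under stated assumptions: the device names are the ones from this host, while the DRY_RUN variable, the run helper, and the reattach_mirror_member function are illustrative additions. It defaults to printing the commands rather than executing them, since detaching a mirror member removes redundancy until the resilver finishes.

```shell
#!/bin/sh
# Sketch of the detach/reattach sequence from this post.
# DRY_RUN, run() and reattach_mirror_member() are hypothetical helpers.
POOL="rpool"
BY_ID="/dev/disk/by-id"
HEALTHY="ata-WDC_WDS250G2B0A-00SM50_212702A00863-part3"  # mirror member left in place
STALE="ata-WDC_WDS250G2B0A_201688800263"                 # disk whose -part3 link was missing
DRY_RUN="${DRY_RUN:-1}"                                  # 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reattach_mirror_member() {
    run zpool status "$POOL"
    run zpool detach "$POOL" "${STALE}-part3"
    # The whole-disk path was the only one that existed at this point.
    run zpool labelclear -f "${BY_ID}/${STALE}"
    # Outside dry-run, refuse to attach until the -part3 symlink is back.
    if [ "$DRY_RUN" != "1" ] && [ ! -e "${BY_ID}/${STALE}-part3" ]; then
        echo "${BY_ID}/${STALE}-part3 is still missing, not attaching" >&2
        return 1
    fi
    run zpool attach "$POOL" "${BY_ID}/${HEALTHY}" "${BY_ID}/${STALE}-part3"
    run zpool status "$POOL"
}

reattach_mirror_member
```

Running it as-is only prints the five zpool commands; setting DRY_RUN=0 would execute them. On OpenZFS 2.0 or later, zpool wait -t resilver rpool could follow the attach to block until the resilver completes.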
Summary
It is still not clear to me why the -part3 path was missing, but somewhere in the process of detaching and clearing labels it reappeared. Once the resilvering process was close to complete, I was finally able to get apt dist-upgrade to run cleanly, as grub-probe found the two boot devices. The total downtime for the upgrade was around one hour. While the downtime was stressful, it was good practice in relearning the implementation, and it encouraged me to both (1) buy hardware for the routing/firewall functions and (2) speed up planning the next lab iteration to remove a large single point of failure.
root@pve01:/etc/lvm# apt dist-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages were automatically installed and are no longer required:
cryptsetup-run libfmt7 libopts25 libthrift-0.13.0 pve-kernel-5.13 pve-kernel-5.13.19-6-pve pve-kernel-5.15.102-1-pve pve-kernel-5.15.107-2-pve pve-kernel-5.15.83-1-pve python-pastedeploy-tpl telnet
Use 'apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
5 not fully installed or removed.
Need to get 0 B/123 kB of archives.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
Setting up grub-pc (2.06-13) ...
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.2.16-4-pve
Found initrd image: /boot/initrd.img-6.2.16-4-pve
Found linux image: /boot/vmlinuz-5.15.108-1-pve
Found initrd image: /boot/initrd.img-5.15.108-1-pve
Found linux image: /boot/vmlinuz-5.15.107-2-pve
Found initrd image: /boot/initrd.img-5.15.107-2-pve
Found linux image: /boot/vmlinuz-5.15.102-1-pve
Found initrd image: /boot/initrd.img-5.15.102-1-pve
Found linux image: /boot/vmlinuz-5.15.83-1-pve
Found initrd image: /boot/initrd.img-5.15.83-1-pve
Found linux image: /boot/vmlinuz-5.13.19-6-pve
Found initrd image: /boot/initrd.img-5.13.19-6-pve
Found linux image: /boot/vmlinuz-5.13.19-2-pve
Found initrd image: /boot/initrd.img-5.13.19-2-pve
Found linux image: /boot/vmlinuz-5.11.22-7-pve
Found initrd image: /boot/initrd.img-5.11.22-7-pve
Found linux image: /boot/vmlinuz-5.11.22-4-pve
Found initrd image: /boot/initrd.img-5.11.22-4-pve
done
Setting up pve-kernel-6.2.16-4-pve (6.2.16-5) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 6.2.16-4-pve /boot/vmlinuz-6.2.16-4-pve
update-initramfs: Generating /boot/initrd.img-6.2.16-4-pve
cryptsetup: ERROR: Couldn't resolve device rpool/ROOT/pve-1
cryptsetup: WARNING: Couldn't determine root device
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/C315-3D06
Copying kernel and creating boot-entry for 5.15.108-1-pve
Copying kernel and creating boot-entry for 6.2.16-4-pve
Copying and configuring kernels on /dev/disk/by-uuid/C315-AAEB
Copying kernel and creating boot-entry for 5.15.108-1-pve
Copying kernel and creating boot-entry for 6.2.16-4-pve
run-parts: executing /etc/kernel/postinst.d/proxmox-auto-removal 6.2.16-4-pve /boot/vmlinuz-6.2.16-4-pve
run-parts: executing /etc/kernel/postinst.d/zz-proxmox-boot 6.2.16-4-pve /boot/vmlinuz-6.2.16-4-pve
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/C315-3D06
Copying kernel and creating boot-entry for 5.15.108-1-pve
Copying kernel and creating boot-entry for 6.2.16-4-pve
Copying and configuring kernels on /dev/disk/by-uuid/C315-AAEB
Copying kernel and creating boot-entry for 5.15.108-1-pve
Copying kernel and creating boot-entry for 6.2.16-4-pve
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 6.2.16-4-pve /boot/vmlinuz-6.2.16-4-pve
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.2.16-4-pve
Found initrd image: /boot/initrd.img-6.2.16-4-pve
Found linux image: /boot/vmlinuz-5.15.108-1-pve
Found initrd image: /boot/initrd.img-5.15.108-1-pve
Found linux image: /boot/vmlinuz-5.15.107-2-pve
Found initrd image: /boot/initrd.img-5.15.107-2-pve
Found linux image: /boot/vmlinuz-5.15.102-1-pve
Found initrd image: /boot/initrd.img-5.15.102-1-pve
Found linux image: /boot/vmlinuz-5.15.83-1-pve
Found initrd image: /boot/initrd.img-5.15.83-1-pve
Found linux image: /boot/vmlinuz-5.13.19-6-pve
Found initrd image: /boot/initrd.img-5.13.19-6-pve
Found linux image: /boot/vmlinuz-5.13.19-2-pve
Found initrd image: /boot/initrd.img-5.13.19-2-pve
Found linux image: /boot/vmlinuz-5.11.22-7-pve
Found initrd image: /boot/initrd.img-5.11.22-7-pve
Found linux image: /boot/vmlinuz-5.11.22-4-pve
Found initrd image: /boot/initrd.img-5.11.22-4-pve
done
Setting up pve-kernel-6.2 (8.0.3) ...
Setting up proxmox-ve (8.0.1) ...
root@pve01:/etc/lvm# date
Tue Jul 25 09:28:58 PM CDT 2023
root@pve01:/etc/lvm#