I started this post in April 2024 but never committed it to the repository. I’ve recently dug into K8s and am strongly considering migrating my lab over, so I figured I should finish this up: GitOps upgrades for virtual machines have been cutting down the time I spend on updates for about a year now.
Reliability
A core principle in my lab has always been reliability. Many design choices may appear outdated, inefficient, expensive, or bespoke, but they are intentional, driven by my desire to maintain reliability and simplicity in updates, backup restoration, availability, and ongoing maintenance. There are single points of failure (many, actually), but fixing any of them is trivial save for a full non-disk hardware failure, which I hope to better solve with K8s and Ceph (or OpenEBS, TBD).
Maintenance Purgatory
There are 12 “services” that are critical to me, primarily serving communication and productivity purposes. Keeping them all up to date has become laborious over the years. I am firm about maintaining what I’ve put into place: things are not left to rot. Again, reliability (and the security that underpins it) is a core principle.
Brief Architecture Overview
A packet traveling inbound to an Internet-facing service will pass through a cable modem, a Deciso OPNsense DEC750, a Unifi switch (formerly Mikrotik), and into a Dell R730XD running Proxmox. Each “service” resides in a dedicated VM running Fedora or Rocky Linux 9. Some are RPM packages from either the project’s repos (e.g. Miniflux) or Fedora’s repos, but most are Podman with podman-compose set up as a systemd service to start at boot.
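For illustration, here is a minimal sketch of how a compose stack can be wired into systemd from Ansible. The unit name, working directory, and service are placeholders, not my actual files:

```yaml
# Hypothetical Ansible tasks that wire a podman-compose stack into
# systemd so it starts at boot; names and paths are illustrative.
- name: Install a systemd unit that runs the compose stack
  ansible.builtin.copy:
    dest: /etc/systemd/system/myservice-compose.service
    content: |
      [Unit]
      Description=myservice via podman-compose
      After=network-online.target
      Wants=network-online.target

      [Service]
      Type=oneshot
      RemainAfterExit=true
      WorkingDirectory=/opt/myservice
      ExecStart=/usr/bin/podman-compose up -d
      ExecStop=/usr/bin/podman-compose down

      [Install]
      WantedBy=multi-user.target

- name: Enable the unit so the stack starts at boot
  ansible.builtin.systemd:
    name: myservice-compose.service
    enabled: true
    daemon_reload: true
```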
There are inefficiencies and performance penalties to pay with Podman nested in virtual machines, but they’re an acceptable price for the highly attractive combination of a self-contained service within a VM that can be snapshotted for reliable backups and restores.
Adding GitOps
Adding Renovate has been fantastic. Rules hold each update for a set cadence (generally a few days), so that by the time an MR is opened it’s likely safe to apply. I still have a complex set of recurring tasks on my to-do list via 2Do, which I’ve used for more than a decade.
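The cadence is just Renovate configuration. A rough sketch of the kind of rule I mean — `minimumReleaseAge` is Renovate’s real option for this, but the manager match and values here are assumptions, not my actual config:

```json
{
  "packageRules": [
    {
      "matchManagers": ["docker-compose"],
      "minimumReleaseAge": "3 days"
    }
  ]
}
```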
Once an MR from Renovate is merged, an Ansible playbook is kicked off on a self-hosted GitLab runner. The playbook shuts down the VM, takes a ZFS snapshot of the VM disk, starts the VM, and upgrades the software. These steps are a mix of calls to the Proxmox API and tasks run on the target VM.
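Condensed into a sketch (not my actual playbook; the VM ID, ZFS dataset, hosts, and service paths are placeholders), the flow looks something like this:

```yaml
# Sketch of the merge-triggered upgrade flow. The Proxmox API
# credentials are assumed to come from group_vars or CI variables.
- hosts: pve_host
  tasks:
    - name: Shut down the service VM via the Proxmox API
      community.general.proxmox_kvm:
        api_host: "{{ proxmox_api_host }}"
        api_user: "{{ proxmox_api_user }}"
        api_password: "{{ proxmox_api_password }}"
        vmid: 112
        state: stopped

    - name: Snapshot the VM disk while it is shut down
      community.general.zfs:
        name: rpool/data/vm-112-disk-0@pre-upgrade
        state: present

    - name: Start the VM back up
      community.general.proxmox_kvm:
        api_host: "{{ proxmox_api_host }}"
        api_user: "{{ proxmox_api_user }}"
        api_password: "{{ proxmox_api_password }}"
        vmid: 112
        state: started

- hosts: service_vm
  tasks:
    - name: Pull the updated images and restart the compose stack
      ansible.builtin.shell:
        cmd: podman-compose pull && podman-compose up -d
        chdir: /opt/myservice
```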
Bad upgrade? The healthcheck step in each playbook restores the latest snapshot if the service doesn’t become healthy. Healthchecks pass, but something is still wrong? Restore the ZFS snapshot and boot the VM to restore availability. Issue noticed after a few days? Restore from a Proxmox Backup Server snapshot (months of retention). I’ve gradually built this automatic restore-snapshot-and-start-VM mechanism into the Ansible playbooks for services that fail healthchecks post-upgrade. It’s a great combination of modern GitOps and “simple” infrastructure.
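The rollback is essentially Ansible’s block/rescue around a healthcheck. Another hedged sketch — the URL, VM ID, and dataset are placeholders:

```yaml
# Sketch of the post-upgrade healthcheck with automatic rollback.
- hosts: pve_host
  tasks:
    - block:
        - name: Wait for the service to come back healthy
          ansible.builtin.uri:
            url: https://myservice.lab.internal/healthz
          register: health
          until: health.status == 200
          retries: 30
          delay: 10
      rescue:
        - name: Stop the unhealthy VM
          community.general.proxmox_kvm:
            api_host: "{{ proxmox_api_host }}"
            api_user: "{{ proxmox_api_user }}"
            api_password: "{{ proxmox_api_password }}"
            vmid: 112
            state: stopped

        - name: Roll back to the pre-upgrade ZFS snapshot
          ansible.builtin.command:
            cmd: zfs rollback rpool/data/vm-112-disk-0@pre-upgrade

        - name: Boot the restored VM
          community.general.proxmox_kvm:
            api_host: "{{ proxmox_api_host }}"
            api_user: "{{ proxmox_api_user }}"
            api_password: "{{ proxmox_api_password }}"
            vmid: 112
            state: started
```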
Scheduled Runs
While most of the upgrades are kicked off by Renovate MRs being merged, I do have a variety of scheduled runs to keep things up to date via CI and Ansible (a minimal CI sketch follows the list):
- Operating system packages (essentially `dnf upgrade --refresh -y`)
- Software that generally upgrades reliably (Photoprism, Pi-hole, and Nextcloud)
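In GitLab these are just scheduled pipelines. A minimal sketch — the job name, inventory, and playbook path are made up here:

```yaml
# Hypothetical .gitlab-ci.yml job that only runs from a pipeline schedule.
os-packages:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - ansible-playbook -i inventory/lab.yml playbooks/os-upgrade.yml
```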
The Past
There have been many iterations of the maintenance process, some frankly embarrassing, dating from before I had the requisite automation skillset:
- Manually VNC into each VM and update every few days (early on when I only had a few VMs)
- Manually run Ansible playbooks to upgrade software/OS
- Run a podman container with Ansible/proxmox dependencies to perform snapshots/upgrades
- Finally: a GitOps workflow that updates services via CI, automatic Renovate MRs, and Ansible upon merge (even from my phone, anywhere)
Potential Future
I’ve ignored K8s for some time. It would make some of this easier (and other items a lot more complex). However, my current employer relies heavily on K8s, and I’ve finally dug in to learn it; you cannot secure what you do not understand. I’ve found K8s very interesting thus far. If I do migrate, I’ll lose much of my infrastructure simplicity, but I think the tooling/architecture might be worth the added complexity.
If nothing else it will be a great learning experience, and I can always fall back to my current setup, which I’ve found quite comfortable.