I’ve had a couple of issues with brown-outs recently which have interrupted my Proxmox server, and stopped my connected disks from coming back up cleanly (yes, I’m working on that separately!) but it’s left me in a state where several of my containers and virtual machines on the cluster are down.
It’s possible to point-and-click your way around this, but far easier to script it!
A failed state may look like this:
root@proxmox1:~# ha-manager status
quorum OK
master proxmox2 (active, Fri Mar 22 10:40:49 2024)
lrm proxmox1 (active, Fri Mar 22 10:40:52 2024)
lrm proxmox2 (active, Fri Mar 22 10:40:54 2024)
service ct:101 (proxmox1, error)
service ct:102 (proxmox2, error)
service ct:103 (proxmox2, error)
service ct:104 (proxmox1, error)
service ct:105 (proxmox1, error)
service ct:106 (proxmox2, error)
service ct:107 (proxmox2, error)
service ct:108 (proxmox1, error)
service ct:109 (proxmox2, error)
service vm:100 (proxmox2, error)
Once you’ve fixed your issue, you can do this on each node:
for worker in $(ha-manager status | grep "($(hostnamectl hostname), error)" | cut -d\ -f2)
do
echo "Disabling $worker"
ha-manager set $worker --state disabled
until ha-manager status | grep "$worker" | grep -q disabled ; do sleep 1 ; done
echo "Restarting $worker"
ha-manager set $worker --state started
until ha-manager status | grep "$worker" | grep -q started ; do sleep 1 ; done
done
Note that this hasn’t been tested, but a scan over it with those nodes working suggests it should. I guess I’ll be updating this the next time I get a brown-out!