Wooha…fsck!!!
This is yet another example of a rolling release update ending in disaster, with no mission critical, self resolving recovery implemented. Has no one considered it important enough to develop a solution for such disasters?? IMO I’d have implemented a solution before deploying a rolling release strategy; it seems no one else in the Linux community has bothered either.
Ideally, under no circumstances should a machine need to be rebooted for software updates to be applied. A solution must be found to ensure this isn’t required so that production isn’t interrupted by a reboot.
Under no circumstances should a machine, if rebooted, fall into a non bootable state like this, ever. This is unacceptable in any production environment, and it raises the question of whether the typical rolling release solutions we have implemented are suitable for production at all. If they are, then safeguards need to be in place so this failed to boot situation never happens.
Why can’t the system put itself into a pseudo sandboxed in memory disk image, test that it runs, then apply the service restarts to the live system, all without a reboot?
IDK the answers, but maybe swupd can evaluate the current system state before updating and put that data in a database, along with some automated instructions on how to reinstate it.
If the machine fails to boot after the update, it should automatically drop into emergency mode, where an automated script consults the swupd state database mentioned above, restores the previous known bootable state, reboots the machine, and alerts the user at the ssh command prompt after login:
WARNING, swupd release # failed, restored previous release # configuration, we’ve automatically sent a failed update/reboot log to developers.
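To make the idea a bit more concrete, here is a rough Python sketch of the bookkeeping I have in mind. This is not how swupd works today; the JSON state file, the function names and the restore callback are all my own assumptions for illustration. The only thing borrowed from a real system is the VERSION_ID field of an os-release style file.

```python
import json
from pathlib import Path


def read_version_id(os_release_text):
    """Return the VERSION_ID value from os-release style text, or None."""
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "VERSION_ID":
            return value.strip().strip('"')
    return None


def record_pre_update(db_path, current_version, target_version):
    """Before updating, store the release that works and the one being installed."""
    state = {"known_good": current_version, "pending": target_version}
    Path(db_path).write_text(json.dumps(state))
    return state


def resolve_boot(db_path, boot_ok, restore):
    """After a boot attempt, confirm the pending release or roll it back.

    `restore` is a hypothetical callable that reinstates the given release;
    how it actually does that is outside the scope of this sketch.
    Returns the warning text when a rollback happened, otherwise None.
    """
    path = Path(db_path)
    if not path.exists():
        return None  # no update was ever recorded
    state = json.loads(path.read_text())
    pending = state.get("pending")
    if pending is None:
        return None  # nothing waiting to be confirmed
    if boot_ok:
        # The new release booted, so it becomes the known good one.
        path.write_text(json.dumps({"known_good": pending, "pending": None}))
        return None
    # The new release failed to boot: reinstate the previous one.
    known_good = state["known_good"]
    restore(known_good)
    path.write_text(json.dumps({"known_good": known_good, "pending": None}))
    return (f"WARNING, swupd release {pending} failed, "
            f"restored previous release {known_good} configuration.")
```

The updater would call record_pre_update() right before applying an update, and an early boot or emergency mode hook would call resolve_boot() with whatever health check result it has. The returned text is what the user would see after logging in.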
A failed update shouldn’t be dumped in the user’s lap, requiring them to manually issue swupd repair commands.
Given the current circumstances of this thread, a question here is: when the machine fails to boot and is dumped into this emergency mode, is the network stack available to ssh into the machine?
If not, how is one expected to recover a remote bare metal Clear Linux machine without physical access (so no ability to use a USB live boot) and no ability to ssh into the machine to run the swupd repair command? How would we even see, remotely, that the machine is stuck in this emergency mode? (Maybe this is possible on some machines with IPMI, but it would seriously take some hours to set up and finally get the machine back online.)
This situation is exactly why I’m shit scared to update, let alone have swupd on auto, and it’s not an isolated case. A colleague’s openSUSE Tumbleweed machine auto updated itself the other week and it too failed to boot because of some GRUB boot loader issue. This is not cool.