Hi there, one of our machines started throwing a handful of Machine Check Events over the past month, which are shown in dmesg and journald with very little info:
mcelog is running but unfortunately does not report any events when I check "mcelog --client" (the command returns nothing, and /var/log/mcelog is empty as well, even though the logfile param is set).
The errors are on a SuperMicro machine, and IPMI is not showing hardware errors either. Curious if anyone has other ideas on how I can go about diagnosing this? I was looking into rasdaemon but wasn't sure if it'd help or not (it seems I'd need to build it, as it didn't seem to be available in swupd).
Hi, have you reached out to SuperMicro? You say mcelog is running; did you confirm that with ps -aux | grep -i mcelog (or the actual name of the process, if not mcelog)?
Also check other distributions (Ubuntu, Red Hat, etc.)... It sounds like a big deal that this functionality is not working for you.
Thanks for the reply! I've not yet checked in with SuperMicro, but it's helpful to know it's unusual... Unfortunately this is a production machine, so I can't easily test other operating systems at the moment.
I did confirm that mcelog is running:
$ systemctl status mcelog
● mcelog.service - Machine Check Exception Logging Daemon
     Loaded: loaded (/usr/lib/systemd/system/mcelog.service; disabled; preset: disabled)
     Active: active (running) since Sun 2023-07-23 16:53:57 EDT; 23h ago
   Main PID: 2664656 (mcelog)
      Tasks: 1 (limit: 154253)
     Memory: 224.0K
     CGroup: /system.slice/mcelog.service
             └─2664656 /usr/sbin/mcelog --daemon --foreground
Jul 23 16:53:57 x systemd[1]: Started Machine Check Exception Logging Daemon.
$ ps -aux | grep -i mcelog
...
root 2664656 0.0 0.0 2624 1836 ? Ss Jul23 0:00 /usr/sbin/mcelog --daemon --foreground
I'll check in with SuperMicro to see if they have any suggestions and report back.
SuperMicro support recommended we first update the BMC and BIOS firmware, which were out of date. Upon rebooting the machine to run the new BIOS, though, it detected a fault in a DIMM and self-healed using PPR (Post Package Repair; this machine has DDR4 memory). Now that the DIMM is healed, the OS can no longer see any faults.
So good news that we've identified the faulty hardware; not-so-good news that we won't be able to test mcelog again until the next fault.
It is good that mce works in the Clear Linux distribution. The Clear Linux team does not talk much about how Clear Linux performs in server use cases. That you are using Clear Linux says something.
I have an ASUS system with an AMD processor running Clear Linux. I have Samba running in the "active directory domain controller" role. I just have to make sure I test updates on a USB rescue disk before applying them to my system; sometimes things go a little sideways. Other than that, Clear Linux is great. It does not suffer from "bloatware-itis" like some operating systems and other Linux distributions do. I also have a server version running in VMware, and that thing boots wicked fast, even with hardware allocation constrained.
I have an outside interest in SuperMicro (stock ticker SMCI) hardware and software support; it's one of my largest investment positions. I just like getting confirmation that they are more than marketing hype and that they really are producing server hardware and providing services that justify the stock price.
Yes, we have 3 VM hypervisors running Clear Linux and libvirt / KVM, including some more advanced features like SR-IOV, etc. I think it's a great hypervisor OS, especially with the stateless design, but for the actual VMs running applications, we rely on Debian (which has a meaningfully wider library of packages via apt). SuperMicro is generally fantastic when it comes to support (when calling, you immediately get a very knowledgeable support person, without having to go through your entire life story before they begin helping you).
I realized today those two other SuperMicro machines are running the exact same version of Clear Linux (by design, but I forgot), and that even without memory faults, mcelog still returns "something". E.g.:
The hardware differs between all three machines, so it's interesting that mcelog is returning "nothing" on only one of the machines. I just followed up on the ticket with SuperMicro to see if they have any other ideas.
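For comparing the three machines, one option is to look at what the kernel itself logged, independently of the mcelog daemon. The following is only a minimal sketch: mce_lines is a hypothetical helper name (not part of mcelog), and the grep pattern is an assumption about how machine-check lines appear in the kernel log, so it may need adjusting for a given kernel.

```shell
#!/bin/sh
# Hypothetical helper (not part of mcelog): keep only the lines that look
# like machine-check messages from whatever kernel log text is piped in.
# The pattern is an assumption and may need widening for other kernels.
mce_lines() {
    grep -iE 'mce:|machine check'
}

# Possible use on each host (may need root):
#   journalctl -k --no-pager | mce_lines
#   dmesg | mce_lines
# The mcelog daemon reads from the legacy /dev/mcelog device, so it may
# also be worth checking that the device exists on the quiet machine:
#   ls -l /dev/mcelog
```

If the kernel log shows machine-check lines on the quiet machine while "mcelog --client" still returns nothing, that would point at the daemon or its device rather than at the hardware.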