Diagnosing Machine Check Events (mce)

Hi there, one of our machines started throwing a handful of Machine Check Events over the past month, which are shown in dmesg and journald with very little info:

Jul 23 18:00:19 x kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 15:18:02 x kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 15:18:02 x kernel: mce: [Hardware Error]: Machine check events logged
[23573307.469481] mce: [Hardware Error]: Machine check events logged
[23573307.469492] mce: [Hardware Error]: Machine check events logged
[23583044.888540] mce: [Hardware Error]: Machine check events logged
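
For anyone following along, a rough way to see how often these are firing is to count the matching lines. This is only a minimal sketch: count_mce_lines is a hypothetical helper name (not part of mcelog or any package), and it assumes the message text matches the lines shown above exactly.

```shell
# Hypothetical helper: count "Machine check events logged" lines in kernel
# log text read from stdin. Prints 0 when there are no matches.
count_mce_lines() {
    # grep -c exits non-zero when nothing matches, so mask that exit status.
    grep -c 'mce: \[Hardware Error\]: Machine check events logged' || true
}

# Example usage (commented out; needs a real log source):
#   journalctl -k | count_mce_lines
#   dmesg | count_mce_lines
```

This only counts events; it does not decode them, which is the part mcelog is supposed to handle.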

mcelog is running but unfortunately does not report any events when checking ‘mcelog --client’ (the command returns nothing, and /var/log/mcelog is empty as well, even though the logfile param is set).

The errors are on a SuperMicro machine, and IPMI is not showing hardware errors either. Curious if anyone has other ideas on how else I can go about diagnosing this? I was looking into rasdaemon but wasn’t sure if it’d help or not (seems like I’d need to build it, as it didn’t seem to be available in swupd).

Thank you,
AP

Hi, have you reached out to SuperMicro? You say mcelog is running; did you confirm that with ps -aux | grep -i mcelog (or the actual name of the process, if not mcelog)?

Also check other distributions (Ubuntu, Red Hat, etc.). It sounds like a big deal that this functionality is not working for you.

Thanks for the reply! I’ve not yet checked in with SuperMicro, but it’s helpful to know it’s unusual… unfortunately this is a production machine so I can’t easily test other operating systems at the moment.

I did confirm that mcelog is running:

$ systemctl status mcelog
● mcelog.service - Machine Check Exception Logging Daemon
     Loaded: loaded (/usr/lib/systemd/system/mcelog.service; disabled; preset: disabled)
     Active: active (running) since Sun 2023-07-23 16:53:57 EDT; 23h ago
   Main PID: 2664656 (mcelog)
      Tasks: 1 (limit: 154253)
     Memory: 224.0K
     CGroup: /system.slice/mcelog.service
             └─2664656 /usr/sbin/mcelog --daemon --foreground

Jul 23 16:53:57 x systemd[1]: Started Machine Check Exception Logging Daemon.

$ ps -aux | grep -i mcelog
...
root     2664656  0.0  0.0   2624  1836 ?        Ss   Jul23   0:00 /usr/sbin/mcelog --daemon --foreground

I’ll check in with SuperMicro to see if they have any suggestions and report back.

Thanks,
AP

To close the loop here:

SuperMicro support recommended we first update the BMC and BIOS firmware, which were out of date. Upon rebooting the machine to run the new BIOS, though, it detected a fault in a DIMM and self-healed using PPR (Post Package Repair; this machine has DDR4 memory). But now that the DIMM is healed, the OS can no longer see any faults.

So good news that we’ve identified the faulty hardware; not-so-good news that we won’t be able to test mcelog again until the next fault :slight_smile:

It is good that MCE works in the Clear Linux distribution. The Clear Linux team does not talk much about how Clear Linux performs in server use cases. The fact that you are using Clear Linux says something.

I have an ASUS system with an AMD processor running Clear Linux. I have Samba running in the “active directory domain controller” role. I just have to make sure I test updates on a USB rescue disk before applying them to my system. Sometimes things go a little sideways. Other than that, Clear Linux is great; it does not suffer from “bloatware-itis” like some operating systems and other Linux distributions do. I also have a server version running in VMware, and that thing boots wicked fast, even with hardware allocation constrained.

I have an outside interest in SuperMicro (stock ticker SMCI) hardware and software support: it’s one of my largest investment positions. I just like getting confirmation that they are more than marketing hype and that they really are producing server hardware and providing services that justify the stock price.

Yes, we have 3 VM hypervisors running Clear Linux and libvirt / KVM, including some more advanced features like SR-IOV, etc. I think it’s a great hypervisor OS – especially with the stateless design – but for the actual VMs running applications, we rely on Debian (which has a meaningfully wider library of packages via apt). SuperMicro is generally fantastic when it comes to support (when calling, you immediately get a very knowledgeable support person, without having to go through your entire life story before they begin helping you).

I realized today that those two other SuperMicro machines are running the exact same version of Clear Linux (by design, but I forgot :slight_smile: ), and that even without memory faults, mcelog still returns “something”. E.g.:

$ mcelog --client
Memory errors
SOCKET 0 CHANNEL 0 DIMM 0
DMI_NAME "DIMMA1" DMI_LOCATION "P0_Node0_Channel0_Dimm0"
corrected memory errors:
        0 total
uncorrected memory errors:
        0 total
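
To compare the machines quickly, the output above could be condensed to one line per DIMM. This is a minimal sketch, not anything shipped with mcelog: summarize_mcelog is a hypothetical helper name, and it assumes the layout shown above (a DMI_NAME line, then the corrected and uncorrected headers, each followed by an "N total" line). Real output may contain extra lines, which the sketch ignores.

```shell
# Hypothetical helper: summarize `mcelog --client` output read from stdin.
# Prints one line per DIMM, or a notice when the input is completely empty.
summarize_mcelog() {
    awk '
        # Remember the DIMM name, i.e. the first quoted string on the line.
        /^DMI_NAME/                   { split($0, q, "\""); name = q[2] }
        # Track which counter the next "N total" line belongs to.
        /^corrected memory errors:/   { kind = "corrected" }
        /^uncorrected memory errors:/ { kind = "uncorrected" }
        $2 == "total" && kind != "" {
            count[kind] = $1
            if (kind == "uncorrected") {
                printf "%s corrected=%s uncorrected=%s\n", name, count["corrected"], count["uncorrected"]
                kind = ""
            }
        }
        # Zero input records is the "returns nothing" case.
        END { if (NR == 0) print "no output from mcelog" }
    '
}

# Example usage (commented out; needs a running mcelog daemon):
#   mcelog --client | summarize_mcelog
```

On the two healthy machines this should give one line per DIMM, while the problem machine would fall into the empty-input case.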

The hardware differs between all three machines, so it’s interesting that mcelog is returning ‘nothing’ on only one of them. I just followed up on the ticket with SuperMicro to see if they have any other ideas.