Saturday, September 9, 2017

From Supernova to Intel Xeon L2 CPU Cache: My Own Machine Check Event (MCE) Glitch!

Supratim Sanyal's Blog: A Supernova Causes a MCE Machine Check Event on Intel Processor
Less than thirteen and three-quarter billion years ago, a star about fifteen times the size of our own sun ran out of hydrogen fuel in its core to burn into helium.

Undeterred and left with prodigious amounts of helium, it nonchalantly started burning the helium into carbon for a few billion years. Then it lit up the carbon, and spent billions of years continuing up the periodic table - aluminum, silicon, nickel, copper, lead ... all the while pushing the lighter stuff outwards in layers and getting heavier in the middle where gravity kept getting happier. In another few billion years, gravity betrayed a little smile when the star crossed over the Chandrasekhar Limit. For gravity had won again, as it always does; all the energy of the burning core could no longer hold the star up. The collapse started.

The unrelenting crush of gravity then continued to make that star's core so dense and so hot that human equations trying to describe it started to fail and, more importantly, something had to give.

After billions of years of cooking the elements, it took barely one and a half minutes for the core to explode, lighting up the universe with such brightness that it would be clearly visible to naked human eyes in daytime when that light would reach planet Earth.

The supernova explosion scattered the periodic table into space. Some of that ejected matter coagulated into a scary collection of mostly hydrogen and carbon-based molecules which would be labeled together as "Supratim Sanyal". 

The explosion also fired off, at nearly light speed in all directions, billions of little monsters - atomic nuclei with no electrons, alpha particles, electrons and friends. One of these, a hydrogen nucleus (which is just a proton), traveled unchallenged for a few billion light years, only to finally get arrested by the L2 cache of the 8th Xeon CPU core in my Dell PowerEdge 2950 in the basement.

Supratim Sanyal's Blog: Machine Check Event (MCE) Error - Intel Xeon L2 Cache Error
Machine Check Event (MCE)
I have never faced a Machine Check Event before.

I logged into my old faithful and rock-solid Dell PowerEdge 2950 server just now, and was informed:

ABRT has detected 1 problem(s). For more info run: abrt-cli list --since 1504666020

Okay, so I ran the recommended command, and got:

# abrt-cli list --since 1504666020
id ea6720f12a431197ca717b7bcd90f43f7a92d366
reason:         mce: [Hardware Error]: Machine check events logged
time:           Thu 07 Sep 2017 07:28:16 PM UTC
cmdline:        BOOT_IMAGE=/vmlinuz-3.10.0-514.26.2.el7.x86_64 root=/dev/mapper/centos_dellpoweredge2950-root ro rd.lvm.lv=centos_dellpoweredge2950/root rd.lvm.lv=centos_dellpoweredge2950/swap rhgb quiet LANG=en_US.UTF-8
package:        kernel
uid:            0 (root)
count:          1
Directory:      /var/spool/abrt/oops-2017-09-07-19:28:16-12996-0
Reported:       cannot be reported


The Autoreporting feature is disabled. Please consider enabling it by issuing
'abrt-auto-reporting enabled' as a user with root privileges
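The number passed to --since is a Unix epoch timestamp (seconds since 1970-01-01 UTC). As a small sketch, assuming Python 3 is available on the box, it can be turned into a readable date to see how far back ABRT is looking; the helper name epoch_to_utc is made up for this example.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Format a Unix epoch timestamp as a human-readable UTC string."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")

# The value ABRT printed in its login banner.
print(epoch_to_utc(1504666020))  # 2017-09-06 02:47:00 UTC
```

So the banner was counting problems detected since early on September 6, and the one problem it found was logged the following evening.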

At this point, I googled "Machine Check Event" and learned that one of the reasons an MCE could happen is cosmic rays! Unless, of course, the processor or hardware or bus or some such thing is really going bad; the PowerEdge 2950 is a decade old anyway.

The forums also recommended running "mcelog", which I did not have, but was readily available in the repos.

# yum install mcelog

Now I could run mcelog.

# mcelog
Hardware event. This is not a software error.
MCE 0
CPU 7 BANK 3
ADDR 43f883580
TIME 1504812495 Thu Sep  7 19:28:15 2017
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
Threshold based error status: green
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000570001010a MCGSTATUS 0
MCGCAP 806 APICID 7 SOCKETID 1
CPUID Vendor Intel Family 6 Model 23

OK, so it clearly says this MCE is not software-related, and whatever it was, it was corrected. It is also probably trying to say that the L2 cache serving CPU 7 (the 8th core, since numbering starts at zero), on socket 1, misfired that one time.
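For the curious, the STATUS value that mcelog printed can be picked apart by hand. The sketch below is my own illustration, not how mcelog itself is implemented: it assumes the architectural IA32_MCi_STATUS bit layout described in Intel's Software Developer's Manual, and the function and table names are hypothetical. The threshold field is only meaningful when the processor advertises it (bit 11 of MCGCAP, which is set in the 806 value above) and the error is a corrected one.

```python
# Sketch only: bit positions assume the architectural IA32_MCi_STATUS layout.
# All names here (decode_mci_status, LEVELS, ...) are made up for illustration.

LEVELS = ["L0", "L1", "L2", "generic"]                 # LL field, bits 1:0
TRANSACTIONS = ["instruction", "data", "generic", "reserved"]  # TT, bits 3:2
THRESHOLD = ["no tracking", "green", "yellow", "reserved"]     # bits 54:53

def decode_mci_status(status):
    """Decode the flag bits and the MCA error code of a 64-bit status value."""
    def bit(n):
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF  # bits 15:0, the architectural error code
    decoded = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "threshold": THRESHOLD[(status >> 53) & 0x3],
        "mca_code": mca_code,
    }
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL;
    # mask out bit 12 (F) before comparing.
    if (mca_code & 0xEF00) == 0x0100:
        decoded["cache_level"] = LEVELS[mca_code & 0x3]
        decoded["transaction"] = TRANSACTIONS[(mca_code >> 2) & 0x3]
        decoded["request"] = (mca_code >> 4) & 0xF  # 0 means generic error
    return decoded

# The STATUS value from the mcelog output above.
result = decode_mci_status(0x942000570001010A)
for name, value in result.items():
    print(f"{name:>16}: {value}")
```

Run against 942000570001010a, this agrees with what mcelog reported: a valid, enabled, corrected (not uncorrected) error with a valid address, threshold status green, and an error code of 0x010a, which breaks down to a generic request, generic transaction type, at cache level L2.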

A few quick checks with htop, top, iotop, etc. do not indicate any issues. Therefore, I will blame it on cosmic rays this time and let it go. If hardware is indeed failing, I will know soon enough.

It may be worth keeping an eye on eBay for a replacement server.
