Kernel errors and NMI
These are the errors we've been getting in /var/log/messages and on the command line.
May 2 07:22:24 lucolo0628 kernel: Uhhuh. NMI received for unknown reason 00.
May 2 07:22:24 lucolo0628 kernel: Do you have a strange power saving mode enabled?
May 2 07:22:24 lucolo0628 kernel: Dazed and confused, but trying to continue
These have been happening every few days, but the server has been under negligible load, as it is not yet a production server.
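To put a number on "every few days", the log can be tallied by date. The following is only a rough sketch: the function name count_nmi_by_day is made up for illustration, and the pattern assumes the standard syslog line layout shown in the excerpt above.

```python
import re
from collections import Counter

# Matches the first line the kernel prints for an unexplained NMI.
# Assumes the usual syslog layout: "Mon D HH:MM:SS host kernel: ...".
NMI_PATTERN = re.compile(
    r"^(\w{3}\s+\d{1,2}) [\d:]{8} \S+ kernel: Uhhuh\. NMI received"
)

def count_nmi_by_day(lines):
    """Return a Counter mapping 'Mon D' date strings to NMI message counts."""
    counts = Counter()
    for line in lines:
        match = NMI_PATTERN.match(line)
        if match:
            # syslog pads single-digit days with an extra space; normalise it.
            day = " ".join(match.group(1).split())
            counts[day] += 1
    return counts
```

Any iterable of lines will do, so an open file handle on the messages log can be passed straight in. Only the "Uhhuh" line is counted, so each NMI event is tallied once rather than three times.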
Some research on the web suggests that this is a known problem and that it's related to the NMI watchdog using the High Precision Event Timer (HPET). It seems like a non-fatal bug, although it can cause some boxes to hang on boot, or to crash under heavy load, as lots of these errors are generated.
Red Hat suggests either preventing the kernel from using HPET, or switching off the NMI watchdog. Both changes are made by editing the kernel options in /boot/grub/grub.conf.
We turned off HPET use by adding "nohpet" to the kernel line in grub.conf as follows:
title CentOS (2.6.18-308.4.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-308.4.1.el5 ro nohpet root=/dev/VolGroup00/LogVol00 rhgb quiet
        initrd /initrd-2.6.18-308.4.1.el5.img
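After a reboot it is worth confirming that the option actually took effect. This is a minimal sketch, not something from the pages linked below: the helper names are hypothetical, /proc/cmdline holds the options the running kernel was booted with, and the clocksource sysfs path is an assumption that may not exist on every kernel (the code returns None in that case).

```python
import os

# Assumed locations; the clocksource file may be absent on some kernels.
CMDLINE_PATH = "/proc/cmdline"
CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"

def has_boot_flag(cmdline, flag):
    """Return True if `flag` appears as a whole word in a kernel command line."""
    return flag in cmdline.split()

def read_text(path):
    """Return the stripped contents of `path`, or None if it does not exist."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return handle.read().strip()

def hpet_report():
    """Summarise whether nohpet was passed and which clocksource is in use."""
    cmdline = read_text(CMDLINE_PATH) or ""
    return {
        "nohpet_on_cmdline": has_boot_flag(cmdline, "nohpet"),
        "current_clocksource": read_text(CLOCKSOURCE_PATH),
    }
```

Calling hpet_report() on the rebooted box should show nohpet_on_cmdline as True, and if the clocksource file is present it should name something other than hpet.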
I'll come back and let you know if that works. In the meantime, here are some of the pages we found which led us to this workaround:
- https://access.redhat.com/knowledge/solutions/15443
- http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01670591
- https://www.centos.org/modules/newbb/viewtopic.php?topic_id=24267&forum=38
- https://bugzilla.redhat.com/show_bug.cgi?id=663912
- http://slacksite.com/slackware/nmi.html (NMI watchdog definition)
HPET change makes no difference
Well, turning off HPET didn't work for us: we got another one of these messages in the messages log on May 7th. So the next thing to try is changing the power regulation settings in the BIOS to use "OS Control" (presumably changing it from "Hardware control" or something similar) to see if Linux can control power saving modes in a way that doesn't annoy the NMI watchdog.
The next alternatives are, possibly, an upgrade from CentOS 5 to CentOS 6, or turning the NMI watchdog off.
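If we do end up turning the watchdog off, the usual way is the nmi_watchdog=0 kernel option, added to the kernel line in grub.conf just as nohpet was. Here is a small sketch of making that edit with a script rather than by hand; add_kernel_param is a hypothetical helper, and it only transforms text, so writing the result back to /boot/grub/grub.conf (after taking a backup) is left to the caller.

```python
def add_kernel_param(grub_conf_text, param):
    """Append `param` to every 'kernel' line that lacks it; return the new text."""
    updated = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        # Only touch 'kernel' lines, and skip any that already carry the option.
        if words and words[0] == "kernel" and param not in words:
            line = line.rstrip() + " " + param
        updated.append(line)
    return "\n".join(updated) + "\n"
```

Running it twice gives the same result as running it once, so it is safe to re-apply after a kernel update adds a new stanza. Note that it changes every kernel line in the file, not just the default entry.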
More as we get it...