龙空技术网

Case Study: A Kernel Crash on a Production Virtualization Host



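A few days ago, the kernel on an HP 380 Gen8 server acting as a virtualization host crashed, and the machine rebooted without warning. Every virtual machine on it rebooted as well, so the impact was significant.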

The host was running CentOS 6.6 x64. We investigated after the reboot.

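Kernel crash dumps are written under /var/crash (kdump's default location on CentOS 6), in a directory named after the crash time, down to the second. That directory holds two files: vmcore, a dump of memory at the moment of the crash, and vmcore-dmesg.txt, a plain-text kernel log.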

ls

vmcore vmcore-dmesg.txt
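If a host has accumulated several dumps, a small script can pick out the newest one. The following is a minimal Python sketch, not part of the original investigation. It assumes each crash gets its own subdirectory and treats the most recently modified one as the latest crash, rather than parsing the timestamp in the directory name. `latest_crash_dir` and `dump_files` are hypothetical helper names.

```python
from pathlib import Path
from typing import List, Optional


def latest_crash_dir(root: str = "/var/crash") -> Optional[Path]:
    """Return the most recently modified subdirectory of root, or None.

    Assumption: one subdirectory per crash, so the newest directory
    (by modification time) belongs to the latest crash.
    """
    base = Path(root)
    if not base.is_dir():
        return None
    dirs = [p for p in base.iterdir() if p.is_dir()]
    if not dirs:
        return None
    return max(dirs, key=lambda p: p.stat().st_mtime)


def dump_files(crash_dir: Path) -> List[str]:
    """List the file names inside one crash directory, sorted by name."""
    return sorted(p.name for p in crash_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    newest = latest_crash_dir()
    if newest is None:
        print("no crash dumps found")
    else:
        print(newest, dump_files(newest))
```

Sorting by modification time keeps the sketch independent of how the directory names are formatted.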

The vmcore-dmesg.txt file shows clearly what the kernel was doing when it crashed:

<0>Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
<0>1. Integrated Management Log (IML)
<0>2. OA Syslog
<0>3. OA Forward Progress Log
<0>4. iLO Event Log
<4>Pid: 0, comm: swapper Not tainted 2.6.32-504.el6.x86_64 #1
<4>Clocksource tsc unstable (delta = -34359715026 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
<4>Call Trace:
<4> <NMI> [<ffffffff815292bc>] ? panic+0xa7/0x16f
<4> [<ffffffff8152fa46>] ? kprobe_exceptions_notify+0x16/0x430
<4> [<ffffffffa001f4df>] ? hpwdt_pretimeout+0x9f/0xcc [hpwdt]
<4> [<ffffffff8152e589>] ? perf_event_nmi_handler+0x9/0xb0
<4> [<ffffffff81530075>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffff815300da>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810a4eae>] ? notify_die+0x2e/0x30
<4> [<ffffffff8152de23>] ? do_nmi+0x2a3/0x340
<4> [<ffffffff8152d600>] ? nmi+0x20/0x30
<4> [<ffffffff812ea5c1>] ? intel_idle+0xb1/0x170
<4> <<EOE>> [<ffffffff81426cb8>] ? menu_select+0x178/0x390
<4> [<ffffffff81425b97>] ? cpuidle_idle_call+0xa7/0x140
<4> [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
<4> [<ffffffff8151061a>] ? rest_init+0x7a/0x80
<4> [<ffffffff81c2af8f>] ? start_kernel+0x424/0x430
<4> [<ffffffff81c2a33a>] ? x86_64_start_reservations+0x125/0x129
<4> [<ffffffff81c2a453>] ? x86_64_start_kernel+0x115/0x124

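The log makes it clear that an NMI triggered the kernel panic. The frame hpwdt_pretimeout [hpwdt] shows that the panic was raised by the NMI handler of the hpwdt module, HP's iLO hardware watchdog driver.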
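To pull the same conclusion out of a dump without reading the whole trace, the panic reason and the call-trace frames can be extracted from vmcore-dmesg.txt. The Python sketch below is an illustration, not a tool used in the original investigation. `summarize_dmesg` is a hypothetical name, and the regular expressions assume the 2.6.32-era frame format seen in the log above (`[<address>] ? function+offset/size [module]`).

```python
import re
from typing import List, Optional, Tuple

# "Kernel panic - not syncing: <reason>"; an en dash is also accepted,
# since copied logs sometimes contain one after typographic conversion.
PANIC_RE = re.compile(r"Kernel panic\s+[-\u2013]\s+not syncing:\s*(.*)")
# One call-trace frame, e.g. "[<ffffffffa001f4df>] ? hpwdt_pretimeout+0x9f/0xcc [hpwdt]".
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+\?\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def summarize_dmesg(text: str) -> Tuple[Optional[str], List[Tuple[str, Optional[str]]]]:
    """Return (panic reason, [(function, module or None), ...]) from a dmesg log."""
    reason = None
    frames = []
    for line in text.splitlines():
        if reason is None:
            m = PANIC_RE.search(line)
            if m:
                reason = m.group(1).strip()
        f = FRAME_RE.search(line)
        if f:
            # group(2) is None for frames in the core kernel (no [module] suffix)
            frames.append((f.group(1), f.group(2)))
    return reason, frames


if __name__ == "__main__":
    import sys

    path = sys.argv[1] if len(sys.argv) > 1 else "vmcore-dmesg.txt"
    with open(path, encoding="utf-8", errors="replace") as fh:
        reason, frames = summarize_dmesg(fh.read())
    print("panic:", reason)
    for func, module in frames:
        print("  ", func, f"[{module}]" if module else "")
```

Run against the log above, the only frame with a module name is hpwdt_pretimeout [hpwdt], which points straight at the watchdog driver's NMI handler.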

What is an NMI?

A non-maskable interrupt (NMI) is a hardware interrupt that cannot be ignored by standard interrupt masking techniques in the system. It is typically used to signal attention for non-recoverable hardware errors. (Some NMIs may be masked, but only by using proprietary methods specific to the particular NMI.)

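In short, an NMI is a hardware interrupt that the system's standard interrupt masking cannot block, and it usually signals an unrecoverable hardware error.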

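A Google search showed that HP Gen8 servers running Ubuntu or VMware have hit similar problems, and an advisory on HP's website explains the issue clearly: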

;docId=emr_na-c04332584-3&docLocale=en_US

Advisory: (Revision) – HP Integrated Lights-Out 4 – FIRMWARE UPDATE REQUIRED: Intermittent Non-Maskable Interrupt (NMI) Events May Occur on ProLiant Gen8 Servers with HP Integrated Lights-Out 4 Firmware Versions 1.30, 1.32, 1.40 and 1.50

In other words, Gen8 servers running iLO 4 firmware version 1.30, 1.32, 1.40, or 1.50 can trigger this NMI problem.
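When auditing several Gen8 servers, the affected-version list from the advisory can be checked mechanically. A minimal Python sketch, assuming the iLO 4 firmware version has already been collected for each server as a string such as "1.40" (how it is collected is not covered here). `ilo4_affected` is a hypothetical helper, and the accepted input formats are assumptions.

```python
# Versions named in HP's advisory as affected by the intermittent NMI issue.
AFFECTED_ILO4_VERSIONS = frozenset({"1.30", "1.32", "1.40", "1.50"})


def ilo4_affected(version: str) -> bool:
    """Return True if an iLO 4 firmware version string is on the affected list.

    Assumption: the version number is the first whitespace-separated token,
    optionally prefixed with "v" (e.g. "1.40", "v1.50", "1.32 Jan 2014").
    """
    tokens = version.split()
    if not tokens:
        return False
    return tokens[0].lstrip("vV") in AFFECTED_ILO4_VERSIONS


if __name__ == "__main__":
    for v in ["1.30", "v1.50", "1.51", "2.03"]:
        print(v, "affected" if ilo4_affected(v) else "not on the list")
```

An exact string match is used on purpose: "not on the list" only means the version is not one of the four named in the advisory.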

If this issue occurs, the operating system will indicate that an NMI has happened; however, the specific indication will vary by OS:

VMware ESXi operating systems will experience a Purple Screen of Death (PSOD).

Linux operating systems will display a message indicating that an NMI occurred.

Windows will become completely unresponsive or experience a Blue Screen of Death (BSOD).

VMware, Linux, and Windows systems are all affected.

Lesson learned: keep an eye on server firmware and hardware on a regular basis, and upgrade promptly.
