2.6.32-33 kernel crashes after about 208 days of uptime

A machine running Ubuntu 10.04 with the 2.6.32-33-server SMP x86_64 kernel crashed after roughly 208 days of uptime. Screenshot:


/var/log/kern.log:
Mar 21 12:45:17 unode11 kernel: [18446743993.492018] BUG: soft lockup - CPU#1 stuck for 17163091968s! [ruby:18753]
Mar 21 12:45:17 unode11 kernel: [18446743993.492438] Modules linked in: ip6table_filter ip6_tables ipt_LOG xt_limit xt_tcpudp xt_state ipt_REJECT iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables fbcon tileblit font bitblit softcursor psmouse lp parport dell_wmi power_meter serio_raw vga16fb bnx2 joydev vgastate dcdbas usbhid hid megaraid_sas
Mar 21 12:45:17 unode11 kernel: [18446743993.492476] CPU 1:
Mar 21 12:45:17 unode11 kernel: [18446743993.492479] Modules linked in: ip6table_filter ip6_tables ipt_LOG xt_limit xt_tcpudp xt_state ipt_REJECT iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables fbcon tileblit font bitblit softcursor psmouse lp parport dell_wmi power_meter serio_raw vga16fb bnx2 joydev vgastate dcdbas usbhid hid megaraid_sas
Mar 21 12:45:17 unode11 kernel: [18446743993.492515] Pid: 18753, comm: ruby Not tainted 2.6.32-33-server #72-Ubuntu PowerEdge R410
Mar 21 12:45:17 unode11 kernel: [18446743993.492518] RIP: 0033:[] [] 0x47a88a
Mar 21 12:45:17 unode11 kernel: [18446743993.492527] RSP: 002b:00007fff2c0a50d0 EFLAGS: 00000202
Mar 21 12:45:17 unode11 kernel: [18446743993.492531] RAX: 000000000000000b RBX: 0000000002067600 RCX: 000000000000001e
Mar 21 12:45:17 unode11 kernel: [18446743993.492535] RDX: 0000000000000000 RSI: 0000000000433a10 RDI: 0000000002067600
Mar 21 12:45:17 unode11 kernel: [18446743993.492538] RBP: ffffffff81013cae R08: 0000000001fe5000 R09: 0000000000000000
Mar 21 12:45:17 unode11 kernel: [18446743993.492542] R10: 00007f895350b5c0 R11: 00007f895345d942 R12: 00007fff2c0a5110
Mar 21 12:45:17 unode11 kernel: [18446743993.492545] R13: 0000000000000180 R14: 000000000779c080 R15: 0000000000000000
Mar 21 12:45:17 unode11 kernel: [18446743993.492550] FS: 00007f8954478720(0000) GS:ffff880009000000(0000) knlGS:0000000000000000
Mar 21 12:45:17 unode11 kernel: [18446743993.492554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 21 12:45:17 unode11 kernel: [18446743993.492557] CR2: 0000000008aa3000 CR3: 0000000426311000 CR4: 00000000000006e0
Mar 21 12:45:17 unode11 kernel: [18446743993.492561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 21 12:45:17 unode11 kernel: [18446743993.492564] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 21 12:45:17 unode11 kernel: [18446743993.492567] Call Trace:
Mar 21 12:45:17 unode11 kernel: [18446743993.492577] BUG: soft lockup - CPU#2 stuck for 17163091968s! [ruby:18759]
Mar 21 12:45:17 unode11 kernel: [18446743993.492997] Modules linked in: ip6table_filter ip6_tables ipt_LOG xt_limit xt_tcpudp xt_state ipt_REJECT iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables fbcon tileblit font bitblit softcursor psmouse lp parport dell_wmi power_meter serio_raw vga16fb bnx2 joydev vgastate dcdbas usbhid hid megaraid_sas
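
The numbers in this log are worth a second look: neither the timestamp nor the "stuck for" duration can be a real elapsed time. A rough sketch in Python of how one might pull them out of a log line and sanity-check them follows. The function names and the 50-year plausibility limit are my own choices for illustration, and reading the values as wrapped 64-bit counters is an interpretation, not something the log states.

```python
import re

# Matches the soft-lockup lines in the kern.log excerpt above.
# \S accepts whatever single dash character the log viewer rendered.
LOCKUP_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] BUG: soft lockup \S CPU#(\d+) stuck for (\d+)s!"
)

SECONDS_PER_DAY = 86400


def parse_lockup(line):
    """Return (timestamp_seconds, cpu, stuck_seconds), or None if no match."""
    m = LOCKUP_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), int(m.group(3))


def looks_wrapped(ts_seconds, stuck_seconds, max_days=365 * 50):
    """Flag values no real machine could reach (limit is an arbitrary 50 years)."""
    limit = max_days * SECONDS_PER_DAY
    return ts_seconds > limit or stuck_seconds > limit


def seconds_below_2_64_ns(ts_seconds):
    """Distance of a timestamp from 2**64 nanoseconds, expressed in seconds."""
    return 2**64 / 1e9 - ts_seconds
```

For the first line of the excerpt, the timestamp sits only about 80 seconds below 2**64 nanoseconds, and the "stuck for" value 17163091968 equals 2**34 - 2**24. Both look like artifacts of a counter wrapping around rather than real durations.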

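This has been confirmed as a bug in 2.6.32-33, caused by the sched_clock function. It occurs under the following conditions:
* An Intel CPU
* /proc/cpuinfo lists the constant_tsc and nonstop_tsc flags
* Neither dmesg nor /var/log/boot.msg contains the string "Marking TSC unstable"
* The machine is not running under Xen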
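
The four conditions above can be checked mechanically. What follows is a minimal Python sketch, not an official detection tool: the function names are hypothetical, the presence of /proc/xen is only a rough heuristic I am assuming for Xen detection, and after a long uptime the dmesg ring buffer may no longer hold the boot messages, so a "True" result only means the machine might be affected.

```python
import os
import subprocess

REQUIRED_FLAGS = {"constant_tsc", "nonstop_tsc"}


def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def is_intel(cpuinfo_text):
    """True if any vendor_id line reports GenuineIntel."""
    return any(
        line.startswith("vendor_id") and "GenuineIntel" in line
        for line in cpuinfo_text.splitlines()
    )


def may_be_affected(cpuinfo_text, boot_log_text, is_xen):
    """Apply the four listed conditions to already-collected text."""
    return (
        is_intel(cpuinfo_text)
        and REQUIRED_FLAGS <= cpu_flags(cpuinfo_text)
        and "Marking TSC unstable" not in boot_log_text
        and not is_xen
    )


def check_this_machine():
    """Collect the inputs from the running system (Linux only)."""
    with open("/proc/cpuinfo") as f:
        cpuinfo = f.read()
    try:
        # dmesg may be restricted or already rotated; treat failure as empty.
        boot_log = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        ).stdout
    except OSError:
        boot_log = ""
    is_xen = os.path.isdir("/proc/xen")  # assumption: rough heuristic only
    return may_be_affected(cpuinfo, boot_log, is_xen)
```

Calling `check_this_machine()` on a server returns True when all four conditions hold. The kernel version itself still has to be compared by hand against the fixed releases.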

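The same problem occurs on servers running CentOS Linux release 6.0 (Final) with kernel 2.6.32-71.el6.x86_64 #1 SMP. For now, the better options are to reboot the server before it reaches 208 days of uptime, or to upgrade the kernel.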
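
To plan the preventive reboot, it helps to know how much time a machine has left. The sketch below assumes the overflow lands at 2**54 nanoseconds, which is how I understand the sched_clock arithmetic (a 64-bit multiply followed by a 10-bit shift) and which matches the roughly 208 days observed; treat that constant as an assumption. Also note that the counter is driven by the TSC, so uptime from /proc/uptime is only an approximation of its value.

```python
NS_PER_DAY = 86400 * 10**9

# Assumption: the nanosecond value overflows at 2**54 ns (about 208.5 days).
OVERFLOW_NS = 2**54


def overflow_days():
    """Days of runtime at which the assumed overflow occurs."""
    return OVERFLOW_NS / NS_PER_DAY


def parse_uptime(uptime_text):
    """Extract uptime in seconds from the contents of /proc/uptime."""
    return float(uptime_text.split()[0])


def days_until_overflow(uptime_seconds):
    """Remaining days before the assumed overflow; negative if already past."""
    return overflow_days() - uptime_seconds / 86400


def report():
    """Print the remaining margin for the running machine (Linux only)."""
    with open("/proc/uptime") as f:
        remaining = days_until_overflow(parse_uptime(f.read()))
    print("days until assumed overflow: %.1f" % remaining)
```

Running `report()` from a daily cron job and alerting once the margin drops below a week or two would leave enough time to schedule the reboot or the kernel upgrade.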

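According to the official recommendation, another workaround is to append the noapic parameter to the kernel line in the boot loader configuration.
Note: this is not the same as acpi=off.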
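
Since noapic and acpi=off are easy to confuse, it can be worth verifying after the reboot which one the kernel was actually started with. A small Python sketch, with a hypothetical helper name, that checks the text of /proc/cmdline:

```python
def has_boot_param(cmdline_text, name):
    """True if the exact parameter appears on the kernel command line."""
    return name in cmdline_text.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()
```

For example, `has_boot_param(read_cmdline(), "noapic")` should be True on a machine where the workaround is in place, while `has_boot_param(read_cmdline(), "acpi=off")` should be False.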

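A few words on the side. After this bug was discovered, Ubuntu did not fix 10.04 Lucid first, even though it is the LTS release; Maverick, Natty and other releases were fixed first. Only now (May 1) is there any sign of intent to fix Lucid: the Launchpad bug finally shows "Nominated for Lucid by", and as of publication (May 7) it is still in that state. Community software being free is a good thing, but when something goes wrong, especially with a bug this serious, it gets awkward if nothing is fixed unless someone pushes for it. By comparison, SUSE released a fixed kernel early on, and Red Hat probably will not lag far behind.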

ref:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/805341
https://www.kernel.org/pub/linux/kernel/v2.6/longterm/ChangeLog-2.6.32.50
http://www.novell.com/support/viewContent.do?externalId=7009834&sliceId=1
http://thread.gmane.org/gmane.linux.kernel/1132515