Even NX-OS (N7K) Can Fall Over (Nexus 7000 Stuck Sending TCNs Every 2 Seconds)

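The title is a bit of an exaggeration, but this bug really did hit our production environment hard: the more fundamental the component that fails, the bigger the damage. Below is the packet-loss monitoring from one front-end machine to some of the back-end machines during the two time windows in which the bug was triggered.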

This bug looks a lot like the 208.5 bug (1, 2) in the 2.6.32 kernel that I ran into two years ago. For all I know, some unlicensed code monkey divided by zero again.

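The symptoms of this unicast flooding are fairly easy to spot. For example, iftop unexpectedly shows traffic exchanged between other hosts, and a tcpdump capture shows the same thing, whereas normally a host should see little beyond some ARP and HSRP requests and replies. This document summarizes several possible causes of the situation.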
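To put a number on what tcpdump shows, one option is to count frames whose destination MAC is unicast but does not belong to the capturing host. The following is a minimal sketch, assuming the capture was taken with `tcpdump -e -n` (so that the link-level header is printed) and that the host's own MAC addresses are known. The function names and the sample MAC addresses are hypothetical and not taken from our environment.

```python
import re
from collections import Counter

# Matches the "src > dst" MAC pair printed at the start of a `tcpdump -e -n` line.
MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
MAC_PAIR = re.compile(rf"({MAC}) > ({MAC})", re.IGNORECASE)


def is_foreign_unicast(dst_mac, local_macs):
    """Return True if dst_mac is a unicast address that is not one of ours."""
    first_octet = int(dst_mac[:2], 16)
    if first_octet & 0x01:
        # Group bit set: broadcast or multicast (ARP requests, HSRP hellos, ...).
        return False
    return dst_mac.lower() not in local_macs


def count_foreign_unicast(lines, local_macs):
    """Count frames per destination MAC that were addressed to some other host."""
    local = {m.lower() for m in local_macs}
    counts = Counter()
    for line in lines:
        match = MAC_PAIR.search(line)
        if match is None:
            continue  # not a line with a link-level header
        dst = match.group(2).lower()
        if is_foreign_unicast(dst, local):
            counts[dst] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical capture lines; 02:00:00:00:00:01 stands in for the local NIC.
    sample = [
        "12:00:00.000001 00:11:22:33:44:55 > 66:77:88:99:aa:bb, ethertype IPv4 (0x0800), length 98: 10.0.0.5 > 10.0.0.9: ICMP echo request",
        "12:00:00.000002 00:11:22:33:44:55 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.0.1 tell 10.0.0.5",
        "12:00:00.000003 00:11:22:33:44:55 > 02:00:00:00:00:01, ethertype IPv4 (0x0800), length 98: 10.0.0.5 > 10.0.0.7: ICMP echo request",
    ]
    for mac, n in count_foreign_unicast(sample, {"02:00:00:00:00:01"}).items():
        print(mac, n)
```

On a healthy switched network the counter should stay close to empty. A steadily growing count for many different destination MACs is consistent with the flooding described above, though it does not by itself identify the cause.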

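All in all, although NX-OS has been on the market for several years, it is still not mature enough; I also stumbled on what may be a kernel panic risk. Since some readers may not have permission to open the link above, I am attaching a complete copy here: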

Nexus 7000 Stuck Sending TCNs Every 2 Seconds
CSCuo80937

Symptom:
If there is any TC after an upgrade to 6.2(6), 6.2(6a), or 6.2(8), then after
approximately 90 days of active supervisor uptime, STP TC BPDUs are sent out
every 2 seconds for a long period of time.

Conditions:
Nexus 7000 or 7700 running 6.2(6), 6.2(6a), or 6.2(8).

Workaround:
In order to circumvent this issue until an upgrade to fixed code can be
performed, execute the appropriate workaround, depending on whether you have
a dual-supervisor or single-supervisor configuration, before each 90 days of
active supervisor uptime.

For dual-supervisor setups:
1. Reload the standby supervisor using the CLI command "reload module x",
where x is the standby supervisor slot number.
2. Use the "show module" command to confirm that the standby supervisor is up
and in the ha-standby mode.
3. Use the "system switchover" command to switch to the standby supervisor.

For single-supervisor setups:
1. Upgrade to 6.2(6b) or 6.2(8a), depending on your business requirements.
2. Reload the switch.

Further Problem Description:
Active Supervisor Uptime can be found from "show system uptime":
N7K-7009-3# show system uptime
System start time: Fri Oct 25 09:40:58 2013
System uptime: 236 days, 8 hours, 56 minutes, 59 seconds
Kernel uptime: 110 days, 23 hours, 7 minutes, 49 seconds
Active supervisor uptime: 110 days, 23 hours, 2 minutes, 23 seconds <<<<<<<<<<<
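
Since the workaround has to be applied before each 90 days of active supervisor uptime, it may help to check the output above automatically. The following is a minimal sketch, assuming the "show system uptime" text has already been collected by some other means and has the same layout as the sample above. The function names and the 7-day safety margin are hypothetical; the 90-day figure is the approximate value given in the advisory.

```python
import re

# Layout taken from the "Active supervisor uptime" line in the sample output above.
UPTIME_RE = re.compile(
    r"Active supervisor uptime:\s*"
    r"(\d+) days?, (\d+) hours?, (\d+) minutes?, (\d+) seconds?"
)


def active_supervisor_uptime_days(output):
    """Return the active supervisor uptime in (fractional) days."""
    match = UPTIME_RE.search(output)
    if match is None:
        raise ValueError("no 'Active supervisor uptime' line found")
    days, hours, minutes, seconds = (int(g) for g in match.groups())
    return days + hours / 24 + minutes / 1440 + seconds / 86400


def needs_switchover(output, threshold_days=90.0, margin_days=7.0):
    """Return True once uptime is within margin_days of the ~90-day threshold."""
    return active_supervisor_uptime_days(output) >= threshold_days - margin_days


if __name__ == "__main__":
    sample = """System start time: Fri Oct 25 09:40:58 2013
System uptime: 236 days, 8 hours, 56 minutes, 59 seconds
Kernel uptime: 110 days, 23 hours, 7 minutes, 49 seconds
Active supervisor uptime: 110 days, 23 hours, 2 minutes, 23 seconds"""
    print(needs_switchover(sample))
```

The check deliberately reads the active supervisor uptime rather than the system uptime, because the advisory ties the problem to the former; in the sample the two differ by more than 120 days.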