Even NX-OS (N7K) Can Fall Over (Nexus 7000 Stuck Sending TCNs Every 2 Seconds)

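The title is a bit of an exaggeration, but this bug really did hit our production environment hard: the more fundamental the component, the greater the loss when it fails. This is the packet-loss monitoring from one front-end machine to some of the back-end machines during the two time windows in which the bug was triggered.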

This bug looks a lot like the 208.5 bug (1, 2) in the 2.6.32 kernel that I ran into two years ago; chances are some unlicensed code monkey has divided by zero again.

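The symptoms of this unicast flooding are fairly easy to spot. iftop, for example, shows traffic exchanged between other hosts, and a tcpdump capture shows the same thing, whereas apart from some ARP and HSRP requests and replies a host should not be seeing large numbers of other packets. This document summarizes several possible causes of this situation.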
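The check described above can be sketched in a few lines of Python. This is only a rough illustration, assuming the destination MAC addresses have already been extracted from a capture; the function name `is_flooded_unicast` and all MAC addresses are made up for the example. A frame counts as suspicious when its destination is a unicast address (the group bit of the first octet is clear) that does not belong to the capturing host.

```python
def is_flooded_unicast(dst_mac: str, own_mac: str) -> bool:
    """True for a unicast frame whose destination is some other host."""
    first_octet = int(dst_mac.split(":")[0], 16)
    # The lowest bit of the first octet is the group bit:
    # it is set for both multicast and broadcast destinations.
    is_group = bool(first_octet & 0x01)
    return not is_group and dst_mac.lower() != own_mac.lower()


# Hypothetical addresses, for illustration only.
OWN_MAC = "00:25:90:aa:bb:01"
SEEN_DST_MACS = [
    "ff:ff:ff:ff:ff:ff",  # broadcast, e.g. an ARP request
    "01:00:5e:00:00:02",  # multicast, e.g. an HSRP hello
    "00:25:90:aa:bb:01",  # addressed to this host
    "00:25:90:aa:bb:02",  # unicast to another host: should not reach us
]

if __name__ == "__main__":
    flooded = [m for m in SEEN_DST_MACS if is_flooded_unicast(m, OWN_MAC)]
    print(f"{len(flooded)} of {len(SEEN_DST_MACS)} frames look flooded: {flooded}")
```

A handful of such frames is normal right after a MAC table entry ages out; what stands out during flooding is a sustained, high proportion of them.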

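Overall, although NX-OS has been on the market for several years, it is still not mature enough; I also stumbled on a possible risk of kernel panic. Since some readers may not be able to open the link above because of access permissions, I am attaching a complete archive here: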

Nexus 7000 Stuck Sending TCNs Every 2 Seconds

If there is any TC after upgrade to 6.2(6), 6.2(6a), or 6.2(8), then after
approximately 90 days of active supervisor uptime, STP TC BPDUs are sent out
every 2 seconds for a long period of time.

Nexus 7000 or 7700 running 6.2(6), 6.2(6a), or 6.2(8).

In order to circumvent this issue until an upgrade to fixed code can be
performed, execute the appropriate workaround depending on whether you have a
dual-supervisor or single-supervisor configuration before each 90 days of
active supervisor uptime.

For dual-supervisor setups:
1. Reload the standby supervisor using cli "reload module x" where x is
standby supervisor slot number.
2. Use the 'show module' command to confirm that the standby supervisor is up
and in the ha-standby mode.
3. Use the 'system switchover' command to switch to the standby supervisor.

For single-supervisor setups:
1. Upgrade to 6.2(6b) or 6.2(8a), depending on your business requirements.
2. Reload the switch.

Further Problem Description:
Active Supervisor Uptime can be found from "show system uptime":
N7K-7009-3# show system uptime
System start time: Fri Oct 25 09:40:58 2013
System uptime: 236 days, 8 hours, 56 minutes, 59 seconds
Kernel uptime: 110 days, 23 hours, 7 minutes, 49 seconds
Active supervisor uptime: 110 days, 23 hours, 2 minutes, 23 seconds <<<<<<<<<<<
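
As a supplement to the archived note, here is a minimal Python sketch of how the "Active supervisor uptime" line could be checked against the roughly 90-day window. It assumes the output of "show system uptime" has already been collected as text; the function names and the `margin_days` parameter are my own inventions for illustration, not anything provided by NX-OS.

```python
import re

# "Approximately 90 days" comes from the bug description above.
THRESHOLD_DAYS = 90


def active_sup_uptime_days(show_output: str) -> int:
    """Return the day count from the 'Active supervisor uptime' line."""
    match = re.search(r"Active supervisor uptime:\s*(\d+)\s+days?", show_output)
    if match is None:
        raise ValueError("no 'Active supervisor uptime' line found")
    return int(match.group(1))


def needs_switchover(show_output: str, margin_days: int = 10) -> bool:
    """True once uptime is within margin_days of the threshold (hypothetical policy)."""
    return active_sup_uptime_days(show_output) >= THRESHOLD_DAYS - margin_days


# Sample text copied from the output shown above.
SAMPLE = """\
System start time: Fri Oct 25 09:40:58 2013
System uptime: 236 days, 8 hours, 56 minutes, 59 seconds
Kernel uptime: 110 days, 23 hours, 7 minutes, 49 seconds
Active supervisor uptime: 110 days, 23 hours, 2 minutes, 23 seconds
"""

if __name__ == "__main__":
    days = active_sup_uptime_days(SAMPLE)
    print(f"active supervisor uptime: {days} days, act now: {needs_switchover(SAMPLE)}")
```

Only the active supervisor uptime matters here; the system and kernel uptime lines are deliberately ignored, since a switchover resets the former but not the latter two.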