tcp_keepalive_time and the RST Flag in NAT Environments

Here I'm not going to explain in detail what TCP keepalive is or what the three related parameters tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes mean.
What you do need to know is their default values: net.ipv4.tcp_keepalive_time = 7200, net.ipv4.tcp_keepalive_intvl = 75, net.ipv4.tcp_keepalive_probes = 9. The first one in particular, tcp_keepalive_time, can and usually will cause some nasty behavior in environments such as NAT.
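
You can check the values currently in effect with sysctl; on a box still running the defaults the output looks like this:
$ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9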

Let's say your client wants to connect to a server through one or more NAT devices along the way, whether SNAT or DNAT. The NAT device, no matter whether it's a router or a Linux box with ip_forward enabled, needs to maintain a TCP connection mapping table to track each incoming and outgoing connection. Since the device's resources are limited, the table cannot grow without bound, so it has to drop connections that have been idle for some period of time with no data exchanged between client and server.
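
If the NAT device happens to be a Linux box doing netfilter NAT (an assumption; hardware routers have their own knobs), the relevant setting is the conntrack established-connection timeout:
$ sysctl net.netfilter.nf_conntrack_tcp_timeout_established    # defaults to 432000 s (5 days), but busy NAT boxes often lower it drastically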

This behavior is quite common, not only in consumer-grade/low-end/poorly implemented routers, but also in data-center NAT servers. On most of these NAT servers the idle-connection timeout is set to around 90 s, so if your client's tcp_keepalive_time is larger than 90 and the connection carries no data for more than 90 s, the NAT server will send a TCP packet with the RST flag set to tear down both ends of the connection.

In this situation, if you can't control the NAT server, one workaround is to lower tcp_keepalive_time below that timeout, e.g. to 30 s.
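
A minimal sketch on the client side, assuming a ~90 s NAT timeout (the exact numbers are up to you):
$ sudo sysctl -w net.ipv4.tcp_keepalive_time=30
$ sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10
$ sudo sysctl -w net.ipv4.tcp_keepalive_probes=6
$ echo 'net.ipv4.tcp_keepalive_time = 30' | sudo tee -a /etc/sysctl.conf    # persist across reboots
Remember that these kernel settings only affect sockets that have SO_KEEPALIVE enabled.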

Recently, while syncing data between a MongoDB master and slave (note that this is the old Master/Slave architecture, not a standard replica set), we often observed a large number of errors indicating that synchronization failed during index creation, after the MongoDB data itself had finished copying. The log below is from our slave server:

Fri Nov 14 12:52:00.356 [replslave] Socket recv() errno:110 Connection timed out 100.80.1.26:27018
Fri Nov 14 12:52:00.356 [replslave] SocketException: remote: 100.80.1.26:27018 error: 9001 socket exception [RECV_ERROR] server [100.80.1.26:27018]
Fri Nov 14 12:52:00.356 [replslave] DBClientCursor::init call() failed
Fri Nov 14 12:52:00.356 [replslave] repl: AssertionException socket error for mapping query
Fri Nov 14 12:52:00.356 [replslave] Socket flush send() errno:104 Connection reset by peer 100.80.1.26:27018
Fri Nov 14 12:52:00.356 [replslave]   caught exception (socket exception [SEND_ERROR] for 100.80.1.26:27018) in destructor (~PiggyBackData)
Fri Nov 14 12:52:00.356 [replslave] repl: sleep 2 sec before next pass
Fri Nov 14 12:52:02.356 [replslave] repl: syncing from host:100.80.1.26:27018
Fri Nov 14 12:52:03.180 [replslave] An earlier initial clone of 'sso_production' did not complete, now resyncing.
Fri Nov 14 12:52:03.180 [replslave] resync: dropping database sso_production
Fri Nov 14 12:52:04.272 [replslave] resync: cloning database sso_production to get an initial copy

100.80.1.26 is the DNAT IP address of the MongoDB master. As you can see, the slave receives a reset from 100.80.1.26.

After some packet capture and analysis, we concluded that the root cause was tcp_keepalive_time related. If you are attentive, you will probably suspect a TCP-stack issue the moment you see the "reset by peer" keyword.
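
To confirm, a quick capture of RST segments from the NAT address is enough (a sketch; replace eth0 with your actual interface):
$ sudo tcpdump -nn -i eth0 'host 100.80.1.26 and tcp[tcpflags] & tcp-rst != 0'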

Be careful: the same issue also occurs with some IMAP proxies, IPVS environments, etc. After many years of hands-on operations experience, I have to say that DNS and NAT are two perennial sources of trouble, even in some so-called 100% uptime environments.

Monitoring MongoDB with Nagios

There is a ready-made plugin on GitHub:
$ wget -c  https://github.com/mzupan/nagios-plugin-mongodb/zipball/master
$ unzip master -d nagios_plugin_mongodb
$ rm master
$ sudo apt-get install python-dev python-setuptools -y
$ sudo easy_install pymongo
$ cd nagios_plugin_mongodb/mzupan-nagios-plugin-mongodb-59a9247
$ cp check_mongodb.py $NAGIOSROOTPATH/libexec
$ $NAGIOSROOTPATH/libexec/check_mongodb.py -H 192.168.10.42 -P 27017 -A connect -W 1 -C 3
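
To hook the plugin into Nagios, a command and service definition along these lines should work (a sketch; the host name mongo-01 and the generic-service template are placeholders for your own setup, and $USER1$ is assumed to point at libexec):
define command{
        command_name    check_mongodb_connect
        command_line    $USER1$/check_mongodb.py -H $HOSTADDRESS$ -P 27017 -A connect -W 1 -C 3
        }
define service{
        use                     generic-service
        host_name               mongo-01
        service_description     MongoDB connect
        check_command           check_mongodb_connect
        }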

ref:
http://zcentric.com/2010/06/18/mongodb-nagios-plugin-created/

The Impact of Code Changes on the System

On one of our Mongo machines, the monitoring agent suddenly could not fetch any data, and ssh went straight to:
ssh_exchange_identification: Connection closed by remote host
and that was with my public key already added.

Logging in through the remote management card, the console showed a flood of messages like the following:
INFO: task dpkg:5206 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disable this message.

Trying to log in, I couldn't even get a shell; the error was:
login: failure forking: Cannot allocate memory

My guess was that the machine had exhausted some resource. I had no choice but to force a warm boot. For a short while afterwards the problem did not recur, but about 2 h later the same situation appeared again. Looking at the system resource usage, the three graphs below show the inode table, the interrupts and context switches, and the threads during that period.

You can see a fluctuation at 20:00 on the 18th; it eased a little after the reboot, but from 24:00 the thread count started climbing in a straight line again. With no changes to the network and no changes to the system, the only remaining possibility was a problem in the code, and the investigation confirmed that the code was indeed the cause.
Besides having dev keep a close watch on code after a change, we should also be more alert to this class of problem in the future.
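
For reference, the resources mentioned above can also be eyeballed directly on the box when the monitoring graphs are not handy (a quick sketch with standard tools):
$ cat /proc/sys/fs/inode-nr      # allocated vs. free entries in the inode table
$ vmstat 1 5                     # the in and cs columns are interrupts and context switches
$ ps -eLf | wc -l                # roughly the total number of threads on the system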

Some NUMA Issues with Databases

For the concept of NUMA (Non-Uniform Memory Access), see this document from MicroSHIT. In a nutshell, NUMA binds CPUs to specific memory, which pays off when there are many physical CPUs.
NUMA comes in hardware and software flavors. The hardware kind has several system buses, each serving a small group of CPUs, and each group has its own memory; the software kind is like software RAID: without the hardware, it can only be done in software.
The counterpart of NUMA is SMP (Symmetric Multi-Processor), where all CPUs share the system bus. As CPUs are added they compete for the bus and it simply can't keep up; a normal SMP system only supports a dozen or so CPUs before the system bus becomes the bottleneck.
There is now also an architecture called MPP (Massive Parallel Processing), which scales even better than NUMA.
There is a diagram here that illustrates the three quite vividly; it is worth a look.
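
To see whether a box is NUMA at all and how its nodes are laid out, numactl and numastat are enough (standard tools from the numactl package):
$ numactl --hardware     # nodes, their CPUs, and per-node memory
$ numastat               # per-node numa_hit / numa_miss counters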

The paragraphs above are meant to lead into what follows: when starting mongo, the following warning shows up:
Thu Sep 7 15:47:30 [initandlisten] ** WARNING: You are running on a NUMA machine.
Thu Sep 7 15:47:30 [initandlisten] **          We suggest launching mongod like this to avoid performance problems:
Thu Sep 7 15:47:30 [initandlisten] **              numactl --interleave=all mongod [other options]

According to the official documentation, Linux, NUMA, and MongoDB do not get along very well. If the current hardware is NUMA, you can effectively switch it off by interleaving allocations:
# numactl --interleave=all sudo -u mongodb mongod --port xxx  --logappend --logpath yyy --dbpath zzz

The official docs also recommend setting vm.zone_reclaim_mode to 0. When the system allocates memory on a NUMA node that is already full, it will reclaim memory on the local node rather than hand the extra allocations to a remote node, which is usually better for overall performance. In some cases, however, allocating from a remote node is better than reclaiming on the local one, and that's when zone_reclaim_mode needs to be turned off. This article records the tragedy of not setting it to 0 on web/file/email servers. Here is another one describing what its author ran into, which turned out to be a kernel bug.
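
Checking and applying it is plain sysctl work:
$ cat /proc/sys/vm/zone_reclaim_mode                               # current value
$ sudo sysctl -w vm.zone_reclaim_mode=0
$ echo 'vm.zone_reclaim_mode = 0' | sudo tee -a /etc/sysctl.conf   # persist across reboots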

That said, what really does the work is still numactl. Distilling what this document says, NUMA has several memory-allocation policies.
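
For reference, those standard Linux memory policies map onto numactl flags roughly as follows (a sketch based on numactl(8); the mongod command line is just an example):
$ numactl --interleave=all mongod [options]   # MPOL_INTERLEAVE: round-robin pages across all nodes
$ numactl --membind=0,1 mongod [options]      # MPOL_BIND: allocate only from the listed nodes
$ numactl --preferred=0 mongod [options]      # MPOL_PREFERRED: prefer node 0, fall back to others
$ numactl --localalloc mongod [options]       # default: allocate on the node the thread runs on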