OpenVPN Connectivity Issue in Public Network

We established an OpenVPN tunnel between two IDCs in September 2014, and we have been monitoring the availability and performance of both ends for a long time. The geographical distance between the two IDCs is quite short, but they sit behind different telecom carriers. mtr shows that there are about 7 hops from one end to the other. The screenshot below shows the standard ping loss.


The result is quite interesting. At first we used UDP, and we often ran into disconnection issues; after we switched to TCP, things improved a lot. From the diagram, the average packet loss is 1.11%.
Why 1.11%? The best explanation I have is that the tunnel is often fully saturated during peak hours, and this can't be solved at the moment, so no matter which protocol we use, some packet loss will exist. Another possible reason is the complexity of the public network, which I can't quantify.
The current setup works under the current conditions, but there is no "how many 9s" guarantee. In any case, if we want more stable connectivity, a DLL (dedicated leased line) is a better choice.
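For reference, the UDP-to-TCP switch is a small change in the OpenVPN configuration on both ends. A minimal sketch, assuming a routed tun setup; the port and keepalive values here are illustrative, not our production config:

# server side (the other IDC uses a matching "proto tcp-client")
proto tcp-server        # was "proto udp" before the switch
port 1194
dev tun0
keepalive 10 60         # ping every 10s, declare the peer down after 60s of silence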

Router Matters

We were transferring PBs of HDFS data from one data center to another via a router, and we never thought the performance of the router would become the bottleneck until we found the statistics below:

#show interfaces Gi0/1
GigabitEthernet0/1 is up, line protocol is up
  Hardware is iGbE, address is 7c0e.cece.dc01 (bia 7c0e.cece.dc01)
  Description: Connect-Shanghai-MSTP
  Internet address is 10.143.67.18/30
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 250/255, rxload 3/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full Duplex, 1Gbps, media type is ZX
  output flow-control is unsupported, input flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:06, output 00:00:00, output hang never
  Last clearing of "show interface" counters 1d22h
  Input queue: 0/75/0/6 (size/max/drops/flushes); Total output drops: 22559915
  Queueing strategy: fifo
  Output queue: 39/40 (size/max)

The output queue is full, hence the txload is obviously high.
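For context, IOS reports the load counters as fractions of 255 relative to the configured bandwidth, so txload 250/255 on a BW of 1000000 Kbit/sec works out to roughly 250/255 × 1 Gbps ≈ 980 Mbps of outbound traffic. The link is effectively saturated, and with a FIFO output queue capped at 40 packets, the 22559915 output drops follow directly.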

How awful. At the beginning, we found there were many failures and retransmissions during the transfer between the two data centers. After adding some metrics, everything became clear: the latency between the two data centers was quite unstable, sometimes around 30ms, sometimes reaching 100ms or even more, which is unacceptable for latency-sensitive applications. We then SSHed into the router and found the result above.

After that, we dropped it and replaced it with a more advanced model. Now everything is back to normal: latency is around 30ms and packet drop is below 1%.

How Many Non-Persistent Connections Can Nginx/Tengine Support Concurrently

Recently, I took over a product line with horrible performance issues that our customers complained about a lot. The architecture is quite simple: clients, which are SDKs installed on our customers' handsets, send POST requests to a cluster of Tengine servers via a cluster of IPVS load balancers. Tengine is a highly customized Nginx server that comes with tons of handy features. The Tengine proxy then forwards the requests to the upstream servers, and after some computation, the app servers write the results to MongoDB.

When using curl to send a POST request like this:
$ curl -X POST -H "Content-Type: application/json" -d '{"name": "jaseywang","sex": "1","birthday": "19990101"}' http://api.jaseywang.me/person -v

Out of every 10 tries, you'll probably get 8 or even more failures with a "connection timeout" response.

After some basic debugging, I found that Nginx was quite abnormal, with the TCP accept queue totally full, so it's no surprise that clients got unwanted responses. CPU and memory utilization were acceptable, but the networking went wild because the packet rate the NICs received was tremendous, about 300Kpps on average and 500Kpps at peak times; with that many packets, the interrupt rate was correspondingly high. Fortunately, these Nginx servers all ship with 10G network cards in mode 4 bonding, and the link layer and lower levels were pre-optimized: TSO/GSO/GRO/LRO, ring buffer size, etc. I hadn't seen any packet drops/overruns using ifconfig or similar tools.
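A quick way to confirm the accept-queue symptom is to look at the listening socket and the listen-overflow counters; a sketch, run on the Nginx box itself:

$ ss -lnt                        # for a LISTEN socket, Recv-Q is the current accept queue, Send-Q the backlog limit
$ netstat -s | grep -i listen    # look for "times the listen queue of a socket overflowed" / "SYNs to LISTEN sockets dropped"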

After some packet capturing, I found even more terrifying facts: almost all of the incoming packets are less than 75 bytes, and most of the connections are non-persistent. They start with the 3-way handshake, send one or a few (usually fewer than 2, when TCP segmentation is needed) HTTP POST requests, and exit with the 4-way handshake. Besides that, these clients usually resend the same requests every 10 minutes or longer; the interval is set by the app's developers and is beyond our control. That means:

1. More than 80% of the traffic is purely small TCP packets, which has a significant impact on the network card and CPU. You can get an overall idea of the proportion from the image below. Actually, the proportion is about 88%.

2. Since the clients can't keep the connections persistent, each request is just a TCP 3-way handshake, one or more POSTs, and a TCP 4-way handshake, quite simple. You have no way to reuse the connection. That's OK for the network card and CPU, but it's a nightmare for Nginx; even if I enlarge the backlog, the TCP accept queue quickly becomes full again after reloading Nginx. The two images below show a client packet's lifetime. The first one is the packet capture between the IPVS load balancer and Nginx, the second one is the communication between Nginx and the upstream server.

3. It's quite expensive to set up a TCP connection, especially when Nginx runs out of resources. I can see that during that period, the network traffic is quite large, but the number of new connections established per second is quite small compared to normal. HTTP 1.0 needs the client to specify Connection: Keep-Alive in the request header to enable persistent connections, while HTTP 1.1 enables them by default. Turning it off didn't have much effect (the relevant directives are sketched right after this list).
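For reference, client-side keepalive behaviour in Nginx/Tengine is controlled by a couple of directives; a minimal sketch with illustrative values, not our production settings:

http {
    keepalive_timeout  30s;     # how long an idle client connection is kept open; 0 disables keepalive entirely
    keepalive_requests 100;     # maximum number of requests served over one client connection
}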

There is now plenty of evidence that our 3 Nginx servers, yes, it's 3, we never thought Nginx would become the bottleneck one day, are half broken in such a harsh environment. How many connections can each one keep, and how many new connections can it accept? I need to run some real-traffic benchmarks to get accurate numbers.

Since recovering the product was the top priority, I added another 6 Nginx servers with the same configuration to IPVS. With 9 servers serving the online traffic, it behaves normally now. Each Nginx gets about 10K qps, with 300ms response time, zero TCP queue, and 10% CPU utilization.

The benchmark process is not complex: remove one Nginx server (real server) from IPVS at a time, and monitor its metrics like qps, response time, TCP queue, and CPU/memory/network/disk utilization. When qps or a similar metric stops going up and begins to turn around, that's usually the maximum load the server can support.
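Taking a real server in and out of IPVS is a one-liner with ipvsadm; a sketch, with a hypothetical VIP and real-server address:

$ sudo ipvsadm -d -t 192.0.2.10:80 -r 10.0.0.11:80      # detach the real server from the virtual service
$ sudo ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.11:80 -g   # add it back (direct-routing mode here) after the test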

Before kicking off, make sure some key parameters and directives are set correctly, including Nginx worker_processes/worker_connections, CPU affinity to distribute interrupts, and kernel parameters (tcp_max_syn_backlog/file-max/netdev_max_backlog/somaxconn, etc.). Keep an eye on rmem_max/wmem_max; during my benchmark, I noticed quite different results with different values.
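A minimal sysctl sketch covering the parameters mentioned above; the values are illustrative starting points, not tuned recommendations:

# /etc/sysctl.conf -- apply with sysctl -p
net.ipv4.tcp_max_syn_backlog = 65536
net.core.somaxconn = 65536
net.core.netdev_max_backlog = 262144
fs.file-max = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216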

Here are the results:
The best performance for a single server is 25K qps; however, at that level it's not so stable: I observed an almost full TCP accept queue and some connection failures when requesting the URI, but apart from that, everything seemed normal. A conservative value is about 23K qps. That comes with 300K established TCP connections, 200Kpps, 500K interrupts, and 40% CPU utilization.
During that time, the total resources consumed from the IPVS perspective were 900K concurrent connections, 600Kpps, 800Mbps, and 100K qps.
The benchmark above was run between 10:30PM and 11:30PM; the peak time usually falls between 10:00PM and 10:30PM.

Turning off the access log to cut down on IO, and disabling timestamps in the TCP stack, may achieve better performance; I haven't tested this.

Don't confuse TCP keepalive with HTTP keepalive; they are totally different concepts. One thing to keep in mind is that the client-LB-Nginx-upstream model usually has an LB TCP session timeout of 90s by default. That means if Nginx doesn't respond to the client within 90s of the client sending a request, the LB will tear down the TCP connection on both ends by sending RST packets, in order to save the LB's resources and sometimes for security reasons. In this case, you can decrease the TCP keepalive parameters as a workaround.

tcp_keepalive_time and rst Flag in NAT Environment

Here, I'm not going to explain the details of what TCP keepalive is, or what the 3 related parameters tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes mean.
What you do need to know is the default values of these 3 parameters: net.ipv4.tcp_keepalive_time = 7200, net.ipv4.tcp_keepalive_intvl = 75, net.ipv4.tcp_keepalive_probes = 9. The first one in particular, tcp_keepalive_time, may, and usually will, cause some nasty behavior in environments like NAT.

Let's say your client wants to connect to a server via one or more NATs along the way, whether SNAT or DNAT. The NAT device, no matter whether it's a router or a Linux box with ip_forward on, needs to maintain a TCP connection mapping table to track each incoming and outgoing connection. Since the device's resources are limited, the table can't grow as large as it wants, so the device drops connections that have been idle for a period of time with no data exchanged between client and server.

This phenomenon is quite common, not only with consumer-oriented/low-end/poorly implemented routers, but also with data center NAT servers. For most of these NAT servers, the idle session timeout is set to around 90s, so if your client's tcp_keepalive_time is larger than 90 and there is no data exchanged for more than 90s, the NAT server will send a TCP packet with the RST flag to disconnect both ends.

In this situation, if you can't control the NAT server, one workaround is to lower the keepalive time below that threshold, e.g. to 30s.
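A minimal sketch of that workaround on the client host; 30s is just an example value below the NAT timeout, and these settings only affect sockets that enable SO_KEEPALIVE:

$ sudo sysctl -w net.ipv4.tcp_keepalive_time=30      # first probe after 30s of idle time instead of 2 hours
$ sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10     # optionally tighten the probe interval
$ sudo sysctl -w net.ipv4.tcp_keepalive_probes=3     # and the number of probes before giving up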

Recently, we were syncing data from a MongoDB Master/Slave setup (note, the architecture is a little old, not the standard replica set). We often observed a large number of errors indicating that synchronization failed during the index-creation phase, after the MongoDB data itself had finished syncing. The log below is from our slave server:

Fri Nov 14 12:52:00.356 [replslave] Socket recv() errno:110 Connection timed out 100.80.1.26:27018
Fri Nov 14 12:52:00.356 [replslave] SocketException: remote: 100.80.1.26:27018 error: 9001 socket exception [RECV_ERROR] server [100.80.1.26:27018]
Fri Nov 14 12:52:00.356 [replslave] DBClientCursor::init call() failed
Fri Nov 14 12:52:00.356 [replslave] repl: AssertionException socket error for mapping query
Fri Nov 14 12:52:00.356 [replslave] Socket flush send() errno:104 Connection reset by peer 100.80.1.26:27018
Fri Nov 14 12:52:00.356 [replslave]   caught exception (socket exception [SEND_ERROR] for 100.80.1.26:27018) in destructor (~PiggyBackData)
Fri Nov 14 12:52:00.356 [replslave] repl: sleep 2 sec before next pass
Fri Nov 14 12:52:02.356 [replslave] repl: syncing from host:100.80.1.26:27018
Fri Nov 14 12:52:03.180 [replslave] An earlier initial clone of 'sso_production' did not complete, now resyncing.
Fri Nov 14 12:52:03.180 [replslave] resync: dropping database sso_production
Fri Nov 14 12:52:04.272 [replslave] resync: cloning database sso_production to get an initial copy

100.80.1.26 is the DNAT IP address of the MongoDB master. As you can see, the slave receives a reset from 100.80.1.26.

After some packet capture and analysis, we concluded that the root cause was tcp_keepalive_time related. If you are sensitive to this kind of issue, you'll probably suspect TCP-stack-related problems the moment you see the "reset by peer" keyword.
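If you want to confirm who injects the reset, a capture filtered on RST flags is usually enough; a sketch, with the interface name being illustrative:

$ sudo tcpdump -i eth0 -nn 'host 100.80.1.26 and tcp[tcpflags] & tcp-rst != 0'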

Be careful: the same issue also occurs in some IMAP proxy and IPVS environments, etc. After many years of hands-on operations experience, I have to say that DNS and NAT are two potential sources of all kinds of issues, even in some so-called 100% uptime environments.

Multiple Benchmark Runs on BCM 5719 Small Packets via tcpcopy (pf_ring)

I have already written elsewhere that tcpcopy has a lot of room for improvement in documentation and in how users can get involved. In the end, by reading the code ourselves and combining it with pf_ring, we managed to get the whole pipeline working and used it to run several comparative benchmarks on our current BCM 5719 NICs; the conclusions are at the end.
When using tcpcopy for benchmarks, make absolutely sure your tcpcopy syntax is correct, even though most documents on the Internet, including the official ones, are vague about it.
For example, we once wrote the filter condition -F "tcp and dst port 80" as -F "tcp and src port 80". The result was a set of bizarre numbers and inexplicable phenomena built on a wrong foundation: the traffic to the target server was extremely small and extremely unstable, and the pps we measured fluctuated wildly, from as high as 400Kpps down to 20K or even 0.
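For reference, a minimal sketch of the kind of invocation we ended up with on the online server; the addresses are placeholders, and the -F filter assumes a tcpcopy build with pcap capture enabled:

# copy port-80 requests from the online server to the target server 10.0.0.20,
# with intercept running on the assistant server 10.0.0.30
$ sudo ./tcpcopy -x 80-10.0.0.20:80 -s 10.0.0.30 -F "tcp and dst port 80"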

Here is one set of configurations from our test environment.
All machines run RedHat 6.2 with kernel 2.6.32-279.el6.x86_64, tcpcopy 1.0, PF_RING 6.0.1, and BCM 5719 NICs with dual-NIC bonding. Also, each of our production machines carries fairly heavy traffic, the vast majority of it small packets under 100B, so the demands on the system are quite high.
To avoid confusion later, let's define a few terms:
1. online server (OS): the machines in our online production environment
2. target server (TS): the machine we want to benchmark, to see how much load it can actually sustain
3. assistant server (AS): the machine with intercept installed
4. mirror server (MS): the machine the OS traffic is copied to via tcpcopy's mirror tool

We ran three major groups of benchmarks, each containing several smaller ones:
1. tcpcopy + intercept
2. tcpcopy + intercept + tcpcopy mirror; the extra mirror is there to reduce the load on the OS, so that tcpcopy runs on the MS instead
3. tcpcopy + intercept + switch mirror; a more direct way of copying production traffic that guarantees zero interference with the OS and a 100% full copy of the OS traffic


PF_RING Improves Packet Capture Performance by Far More Than 30% – 40%

pf_ring involves quite a few moving parts, so it may feel confusing at first, but after a few passes through the official documentation the overall picture should become clear.
The installation steps can be found here. I recommend going through them yourself so you get familiar with what each component does; if you really don't have the time, you can also install directly from the official rpm/deb packages.
One note: besides the compiled pf_ring.ko, if your NIC doesn't support a PF_RING™-aware driver you can only use mode 0; if it does, you can also use mode 1 and mode 2. In our benchmarks the difference between the three modes was not large, but the difference between loading pf_ring.ko and not loading it was night and day. Most mainstream Intel and Broadcom NICs are supported (igb, ixgbe, e1000e, tg3, bnx2). Note that for tg3 the latest stable release 6.0.1 (as of 10/18/2014) does not include it; you need the nightly builds, PF_RING_aware/non-ZC-drivers/broadcom/tg3-3.136h. The drivers under PF_RING_aware/non-ZC-drivers/2.6.x/broadcom/tg3/tg3-3.102, and everything else under the 2.6.x directory, are deprecated, so don't use them.
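For reference, a minimal sketch of loading the module with the mode setting discussed above; the values are illustrative, and this assumes the transparent_mode/min_num_slots module parameters of the PF_RING builds of that era:

$ sudo insmod ./kernel/pf_ring.ko transparent_mode=2 min_num_slots=32768   # mode 2 needs a PF_RING-aware driver
$ cat /proc/net/pf_ring/info                                               # verify the module is loaded and check the ring settings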

Ordinary applications rarely need pf_ring, but for systems like IDS or snort it comes in handy. Our situation is a bit special: on one product line, the front-end machines have relatively weak NICs (BCM5719), and the vast majority of incoming packets are small ones under 75B, which made rx_discards climb at a terrifying rate of several thousand per second. Initially I just wanted to capture some packets to see what kind of packets were arriving, but plain tcpdump was completely unusable; traditional BPF (libpcap) capture simply couldn't keep up. Of the two images below, one shows the packet-size distribution analyzed by ntopng (which also uses traditional pcap), consistent with the iptraf result; the other capture caught less than 2% of the packets :-(


Where there's a need, there's motivation. After looking into pf_ring, things immediately took a turn for the better; just compare the two results below.

With pf_ring loaded:
$ sudo /var/tmp/pfring/PF_RING-6.0.1/userland/tcpdump-4.1.1/tcpdump   -i bond1 -n -c 1000000 -w tmp
tcpdump: listening on bond1, link-type EN10MB (Ethernet), capture size 8192 bytes
1000000 packets captured
1000000 packets received by filter
0 packets dropped by kernel

Capturing with traditional libpcap:
$ sudo tcpdump   -i bond1 -n -c 1000000 -w tmp
tcpdump: listening on bond1, link-type EN10MB (Ethernet), capture size 65535 bytes
1000000 packets captured
1722968 packets received by filter
722784 packets dropped by kernel
12139 packets dropped by interface

So pf_ring's improvement to packet capture is not just the officially advertised 30%-40%; it is one technology superseding another, and a huge boost to real production efficiency.
By the way, a plug for ntop.org: the site has quite a few interesting tools, ntop, ntopng, and nDPI as a replacement for OpenDPI. I once played with OpenDPI, and the project stopped being updated shortly afterwards; the most important thing for this kind of tool is real-time analysis of today's mainstream protocols, and fortunately nDPI has stood on the giant's shoulders and carried the work forward. A reminder: this is for fun only; if you need something serious, go buy a commercial appliance.