How Many Non-Persistent Connections Can Nginx/Tengine Support Concurrently

Recently, I took over a product line that has horrible performance issues, and our customers complain a lot. The architecture is quite simple: clients, which are SDKs installed in our customers' handsets, send POST requests to a cluster of Tengine servers via a cluster of IPVS load balancers. Tengine is actually a highly customized Nginx server that comes with tons of handy features. The Tengine proxy then forwards the requests to the upstream servers; after some computation, the app servers write the results to MongoDB.

When using curl to send a POST request like this:
$ curl -X POST -H "Content-Type: application/json" -d '{"name": "jaseywang","sex": "1","birthday": "19990101"}' http://api.jaseywang.me/person -v

Out of every 10 tries, you'll probably get 8 or more failures with a "connection timeout" response.

After some basic debugging, I found that Nginx was quite abnormal, with the TCP accept queue completely full, which explains why clients got those unwanted responses. CPU and memory utilization were acceptable, but the networking went wild because the packet rate the NICs received was tremendous: about 300 kpps on average and 500 kpps at peak times. Since the packet rate was so high, the interrupt rate was correspondingly high. Fortunately, these Nginx servers all ship with 10G network cards in mode 4 (802.3ad) bonding, and the link layer and lower levels were already pre-optimized (TSO/GSO/GRO/LRO, ring buffer size, etc.). I haven't seen any packet drops or overruns using ifconfig or similar tools.

After some packet capturing, I found even more terrifying facts: almost all of the incoming packets are less than 75 bytes, and most of the connections are non-persistent. They complete the 3-way handshake, send one or a few (usually less than 2, if TCP segmentation is needed) HTTP POST requests, and exit with a 4-way handshake. Besides that, these clients usually resend the same requests every 10 minutes or longer; the interval is set by the app's developers and is beyond our control. That means:

1. More than 80% of the traffic consists of purely small TCP packets, which has a significant impact on the network cards and CPUs. You can get an overall idea of the percentage from the image below; it is actually about 88%.

2. Since the clients can't keep connections persistent, each connection is just a TCP 3-way handshake, one or more POSTs, and a TCP 4-way handshake; quite simple, and there is no way to reuse the connection. That's OK for the network cards and CPUs, but it's a nightmare for Nginx: even after I enlarged the backlog, the TCP accept queue quickly filled up again after reloading Nginx. The two images below show a client packet's lifetime. The first one is the packet capture between the IPVS load balancer and Nginx, the second one the communication between Nginx and the upstream server.

3. It's quite expensive to set up a TCP connection, especially when Nginx runs out of resources. I can see that during that period the network traffic is quite large, but the number of new connections accepted per second is quite small compared to normal. HTTP 1.0 requires the client to specify Connection: Keep-Alive in the request header to enable persistent connections, while HTTP 1.1 enables them by default. Turning it off didn't have much effect either.

There is now plenty of evidence that our 3 Nginx servers (yes, only 3; we never thought Nginx would become the bottleneck one day) are half broken in such a harsh environment. How many connections can one server keep, and how many new connections can it accept? I needed to run some real traffic benchmarks to get accurate numbers.

Since recovering the product was the top priority, I added another 6 Nginx servers with the same configuration to IPVS. With 9 servers serving the online traffic, it behaves normally now. Each Nginx gets about 10K qps, with 300ms response time, an empty TCP accept queue, and 10% CPU utilization.

The benchmark process is not complex: remove one Nginx server (real server) from IPVS at a time, and monitor metrics like qps, response time, TCP queue size, and CPU/memory/network/disk utilization. When qps or a similar metric stops going up and starts to turn around, that's usually the maximum load the server can support.
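For reference, pulling a real server out of the pool and putting it back is a one-liner each with ipvsadm; the VIP/RIP addresses below are placeholders, and -g assumes direct routing mode:

$ ipvsadm -d -t 10.0.0.100:80 -r 10.0.1.11:80            # remove one real server from the virtual service
$ ipvsadm -a -t 10.0.0.100:80 -r 10.0.1.11:80 -g -w 100  # add it back with direct routing and weight 100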

Before kicking off, make sure some key parameters and directives are set correctly, including Nginx worker_processes/worker_connections, CPU affinity to distribute interrupts, and kernel parameters (tcp_max_syn_backlog/file-max/netdev_max_backlog/somaxconn, etc.). Also keep an eye on rmem_max/wmem_max; during my benchmark, I noticed quite different results with different values.
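As a rough sketch, the kernel side of that tuning looks something like the following; the numbers are illustrative, not the exact values we settled on:

$ cat /etc/sysctl.d/99-nginx.conf
net.ipv4.tcp_max_syn_backlog = 262144
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 262144
fs.file-max = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
$ sysctl -p /etc/sysctl.d/99-nginx.conf

And the matching Nginx directives (remember the listen backlog is capped by somaxconn):

worker_processes auto;
worker_rlimit_nofile 1048576;
events {
    worker_connections 65536;
}
...
listen 80 backlog=65535;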

Here are the results:
The best performance for a single server is 25K qps; however, at that level it's not so stable. I observed an almost full TCP accept queue and some connection failures when requesting the URI, but apart from that, everything seemed normal. A conservative value is about 23K qps. That comes with 300K established TCP connections, 200 kpps, 500K interrupts per second, and 40% CPU utilization.
During that time, the total resources consumed, from the IPVS perspective, were 900K concurrent connections, 600 kpps, 800 Mbps, and 100K qps.
The benchmark above was run between 10:30PM and 11:30PM; the peak time usually falls between 10:00PM and 10:30PM.

Turning off the access log to cut down on I/O, and disabling timestamps in the TCP stack, may achieve better performance; I haven't tested either.
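Both are one-line changes if you want to try them yourself; I haven't measured the effect here:

access_log off;                          # in the nginx config, drops the per-request disk write
$ sysctl -w net.ipv4.tcp_timestamps=0    # disables TCP timestamps kernel-wide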

Don't confuse TCP keepalive with HTTP keepalive; they are totally different concepts. One thing to keep in mind is that the client-LB-Nginx-upstream setup usually has an LB TCP session timeout of 90s by default. That means that when a client sends a request to Nginx and Nginx doesn't respond to the client within 90s, the LB will tear down the TCP connection on both ends by sending RST packets, in order to save the LB's resources and sometimes for security reasons. In this case, you can decrease the TCP keepalive parameters as a workaround.

Set up an AQI (Air Quality Index) Monitoring System with Dylos and Raspberry Pi 2

I have been using air purifiers for years in Beijing, China. So far so good, except for the one question that troubles me: are they effective? Is my PM2.5 or PM10 actually reduced? The answer is probably obvious, but how effective are they? Nobody knows.

At the moment, Dylos is the only manufacturer that provides a consumer level product with an accurate air quality index, so there aren't many choices. I got a Dylos air quality counter from Amazon. There are many models; if you want to export the index data from the black box to your desktop for later processing, you'll need at least a DC1100 Pro with PC interface, or a higher-end version. I strongly recommend not buying from Taobao or similar online stores; as far as I saw, none of them can provide the correct version, and most of them exaggerate the specs to get a sale.

Now we're half done. You also need a Raspberry Pi; at the time of writing, the Raspberry Pi 2 is just coming to market. I got a complete starter kit with a Pi 2 Model B, clear case, power supply, WiFi dongle and an 8GB Micro SD card.

To bring the Raspberry Pi up, it's better to find a monitor to attach, otherwise the setup will be quite painful.

After turning the Dylos and the Raspberry Pi on, the remaining process is quite simple. You need to connect the Dylos and the Raspberry Pi with a serial-to-USB cable. Such cables are uncommon these days; if you are a network engineer, you should be quite familiar with them, otherwise you can get one online or somewhere else.

Now, write a tiny program to read the data from the Dylos. You can make use of Python's pyserial module to read the data with a few lines; here is mine. Besides that, you can implement it in another language, such as PHP. The polling interval depends on the Dylos's data collection interval; the minimum is 60s, and usually 3600s is enough.
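Since the script itself isn't pasted here, below is a minimal pyserial sketch of the idea. The device path, the 9600 baud rate, and the "small,large" line format (particle counts per 0.01 cubic foot) are assumptions about the DC1100 Pro's PC interface, so adjust them to your own unit:

#!/usr/bin/env python
# Minimal sketch: log readings from a Dylos DC1100 Pro over a USB-serial adapter.
# Assumptions: the adapter shows up as /dev/ttyUSB0 and the counter prints one
# "small,large" line per sample interval configured on the Dylos itself.
import serial

PORT = '/dev/ttyUSB0'
LOGFILE = '/var/log/dylos.log'

def main():
    ser = serial.Serial(PORT, baudrate=9600, timeout=120)
    with open(LOGFILE, 'a') as log:
        while True:
            line = ser.readline().decode('ascii', 'ignore').strip()
            if not line:
                continue        # read timed out with no data, keep waiting
            try:
                small, large = [int(x) for x in line.split(',')]
            except ValueError:
                continue        # skip malformed lines
            log.write('%d %d\n' % (small, large))
            log.flush()

if __name__ == '__main__':
    main()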

Once you have the data, you can feed it into different metric systems like Munin, or into Highcharts for a prettier look.

Installing Debian Jessie on a Loongson Notebook (8089_B) with an Ubuntu Keyboard Suit

I got a white Yeeloong notebook last year; it cost me 300 RMB and ships with a Loongson 2F CPU, a 1GB DDR2 memory module, and a 160GB HDD.

The netbook came with a pre-installed Linux based operating system. I can't tell which distribution it is, but it looks a lot like Windows XP. I put it in my closet back then and never used it again.

Yesterday the untouched toy crossed my mind, so I took it out and spent a whole night getting Debian Jessie working on the Yeeloong. Here are some steps you may want to note if you want to get your own box running.


At the beginning, I downloaded the vmlinux and initrd.gz files from a Debian mirror. I set up a TFTP server on my Mac and made sure it worked locally. Then I powered on the notebook, entered the PMON screen, used ifaddr to assign an IP address manually, and it could ping my Mac, which means the networking worked. The problem came after that: I executed the load directive to load the vmlinux file, and every time, after several minutes of waiting, it showed a connection timeout. After some debugging, nothing abnormal could be found, since the TFTP server and the connectivity between my Mac and the Yeeloong were both fine.

I gave up on that and found a USB stick; this time I was going to put the vmlinux and initrd.gz files on the USB stick and let the notebook boot from USB. You have to make sure the filesystem is formatted as ext2/ext3. Most importantly, a single partition must not be larger than 4GB; say you have an 8GB USB stick, you need to create at least 3 partitions, and a 3-3-2 split is a good choice. If a partition is larger than 4GB, you can't even enter PMON; it just stalls after you power on with the USB stick attached, with "filesys_type() usb0 don't find" showing on the screen.
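For reference, a rough sketch of creating such a layout on a Linux machine, assuming the stick shows up as /dev/sdb (double-check the device name before running anything destructive):

$ parted --script /dev/sdb mklabel msdos \
    mkpart primary ext3 1MiB 3GiB \
    mkpart primary ext3 3GiB 6GiB \
    mkpart primary ext3 6GiB 100%
$ mkfs.ext3 /dev/sdb1 && mkfs.ext3 /dev/sdb2 && mkfs.ext3 /dev/sdb3
$ mount /dev/sdb1 /mnt && cp vmlinux initrd.gz /mnt/ && umount /mnt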


After entering PMON, you should be able to find your USB stick using the devls directive, then type:
> bl -d ide (usb0, 0)/vmlinux
> bl -d ide (usb0, 0)/initrd.gz

Don't copy these lines verbatim; you need to figure out the partition location on your own USB stick. Maybe yours is (usb0, 1), and your vmlinux file may be called vmlinux-3.14-2-loongson-2f.

Be patient: I waited about 10 minutes before both files were loaded into memory successfully. Press "g" to let it run. There are only a few steps left before you get a brand new toy.

The installation process now begins; press Yes/No and answer the questions as you normally would. At the end, I chose to install the LXDE desktop environment. For me, the installation took about 2 hours to finish.

Now reboot, and in the end it comes up in runlevel 3 without X because of some buggy tools.

Download the xserver-xorg-video-siliconmotion package from here; don't be put off by the package's version, you just need some modifications to make it work on Jessie.

Use dpkg to unpack the deb file, remove xorg-video-abi-12 from the "Depends" section of its control file, and repack the file. Before running dpkg -i, use apt-get to install the xserver-xorg-core package.
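A minimal sketch of that unpack/edit/repack dance; the sed pattern assumes the dependency sits mid-list in the control file, so edit it by hand if in doubt:

$ dpkg-deb -R xserver-xorg-video-siliconmotion_*.deb smi
$ sed -i 's/, *xorg-video-abi-12//' smi/DEBIAN/control
$ dpkg-deb -b smi xserver-xorg-video-siliconmotion_fixed.deb
$ apt-get install xserver-xorg-core
$ dpkg -i xserver-xorg-video-siliconmotion_fixed.deb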

Run startx and you will get the final screen. The default network manager for LXDE is Wicd Network Manager, which is fine; if you want to use the command line instead, you need to modify the interfaces file and follow a few other steps before you can connect to the internet.
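If you go the command line route, a minimal /etc/network/interfaces entry for DHCP on the wired port might be all you need (the interface name is an assumption):

auto eth0
iface eth0 inet dhcp

then bring it up with ifup eth0.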

Wow, together with the Ubuntu keyboard suit, it's definitely a good combo.

How about the performance? No matter what I run, it always ends up CPU bound, and the load average with LXDE is around 1. That doesn't hurt, since it doesn't bother me too much, but it takes more than 3 minutes to boot up, probably slower than the majority of Windows users ;-)

Since the little box is terribly slow, why did I buy it? Well, just for fun.

tcp_keepalive_time and rst Flag in NAT Environment

Here, I'm not going to explain the details of what TCP keepalive is, or what the 3 related parameters tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes mean.
What you do need to know is the default values of these 3 parameters: net.ipv4.tcp_keepalive_time = 7200, net.ipv4.tcp_keepalive_intvl = 75, net.ipv4.tcp_keepalive_probes = 9. The first one in particular, tcp_keepalive_time, can and usually will cause some nasty behavior in environments like NAT.

Let's say your client wants to connect to a server via one or more NAT devices along the way, whether SNAT or DNAT. The NAT device, no matter whether it's a router or a Linux based box with ip_forward enabled, needs to maintain a TCP connection mapping table to track each incoming and outgoing connection. Since the device's resources are limited, the table can't grow as large as it wants, so the device needs to drop connections that have been idle for a period of time with no data exchanged between client and server.
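On a Linux based NAT box, for example, the tracking table size and its idle timeout are exposed as conntrack sysctls (the exact names vary slightly across kernel versions):

$ sysctl net.netfilter.nf_conntrack_max
$ sysctl net.netfilter.nf_conntrack_tcp_timeout_established
$ conntrack -L | wc -l    # current number of tracked connections, if conntrack-tools is installed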

This phenomenon is quite common, not only on consumer oriented, low end, or poorly implemented routers, but also on data center NAT servers. For most of these NAT servers, the idle session timeout is usually set to around 90s, so if your client sets tcp_keepalive_time to more than 90 and no data is exchanged for more than 90s, the NAT server will send a TCP packet with the RST flag to disconnect both ends.

In this situation, if you can't control the NAT server, one workaround is to lower the keepalive time below that threshold, for example to 30s.
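A minimal sketch of that client-side workaround, assuming a roughly 90s idle timeout on the NAT device; the 30/10/3 values are illustrative:

$ sysctl -w net.ipv4.tcp_keepalive_time=30
$ sysctl -w net.ipv4.tcp_keepalive_intvl=10
$ sysctl -w net.ipv4.tcp_keepalive_probes=3

Keep in mind that keepalive probes are only sent on sockets that actually enable SO_KEEPALIVE.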

Recently, when syncing data between a MongoDB master and slave (note, the architecture is a little old: master/slave, not the standard replica set), we often observed a large number of errors indicating that synchronization failed during the index creation phase, after the MongoDB data itself had finished syncing. The log below is from our slave server:

Fri Nov 14 12:52:00.356 [replslave] Socket recv() errno:110 Connection timed out 100.80.1.26:27018
Fri Nov 14 12:52:00.356 [replslave] SocketException: remote: 100.80.1.26:27018 error: 9001 socket exception [RECV_ERROR] server [100.80.1.26:27018]
Fri Nov 14 12:52:00.356 [replslave] DBClientCursor::init call() failed
Fri Nov 14 12:52:00.356 [replslave] repl: AssertionException socket error for mapping query
Fri Nov 14 12:52:00.356 [replslave] Socket flush send() errno:104 Connection reset by peer 100.80.1.26:27018
Fri Nov 14 12:52:00.356 [replslave]   caught exception (socket exception [SEND_ERROR] for 100.80.1.26:27018) in destructor (~PiggyBackData)
Fri Nov 14 12:52:00.356 [replslave] repl: sleep 2 sec before next pass
Fri Nov 14 12:52:02.356 [replslave] repl: syncing from host:100.80.1.26:27018
Fri Nov 14 12:52:03.180 [replslave] An earlier initial clone of 'sso_production' did not complete, now resyncing.
Fri Nov 14 12:52:03.180 [replslave] resync: dropping database sso_production
Fri Nov 14 12:52:04.272 [replslave] resync: cloning database sso_production to get an initial copy

100.80.1.26 is the DNAT IP address for the MongoDB master. As you can see, the slave receives a reset from 100.80.1.26.

After some packet capture and analysis, we concluded that the root cause was related to tcp_keepalive_time. If you are attentive, you may suspect TCP stack related issues the moment you see the "reset by peer" keyword.

Be careful: the same issue also occurs with some IMAP proxies, IPVS environments, etc. After many years of hands-on operations experience, I have to say that DNS and NAT are two potential sources of all sorts of issues, even in some so-called 100% uptime environments.

Migrating GitHub Enterprise From Beijing To Shanghai

We needed to migrate our GitHub Enterprise instance from a data center located in Beijing to one in Shanghai. The whole process is not complex but is time consuming; it took us more than one week to finish the migration. I will share some of the practical details with you.

The version of GitHub Enterprise we are running is 11.10.331, a little bit out of date. It runs on VirtualBox with 4 CPU cores and 16G of memory, and the code takes about 100 GB of disk space in total. According to GitHub staff, VirtualBox has some performance issues and sometimes even data corruption, so since version 2.0 GitHub no longer supports VirtualBox; they recommend VMware's free vSphere hypervisor, AWS, or OpenStack KVM as a replacement.

Our new environment in Shanghai has quite strict restrictions on choosing platforms from a security standpoint; it's impossible to install something that hasn't been comprehensively investigated and tested, so VMware and KVM were out. Since AWS is not so widespread in China and, most importantly, it can't be hosted in our private network, it was out too. The only choice for us was to keep using VirtualBox.

Now we need to export the code data from VirtualBox. There are at least two ways: the most straightforward is to copy the vmdk file directly, the other is to use VirtualBox's export feature, which lets you import the image into another environment. We chose the first one, copying the vmdk file directly. Remember to shut down the VirtualBox instance before doing this, or your data will be corrupted.

Originally, we planned to rsync the 100 GB of data from the Beijing data center to Shanghai directly over the WAN. Although both ends are private networks, we could open a DNAT mapping in Beijing temporarily, so the server hosted in Shanghai could open a connection to Beijing and start the transfer. This seemed fine, except that the outbound bandwidth is only 20 Mbps, which means that, in theory, the transfer time is 100 GB × 8 × 1024 / 20 Mbps = 40960s, almost half a day. There was also another issue we couldn't control: since the data would travel a long distance over the public network, stability is not guaranteed. What if the connection drops due to some unknown issue? All the time spent so far would be wasted.

So we used a quite traditional way to get it done: take a USB HDD to the data center and copy the data onto it, which took about 2 hours. Later, we rsynced the data from the HDD to our Shanghai data center through a leased line.
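For the HDD-to-Shanghai leg, rsync's resumable flags are worth having in case the leased line hiccups; the paths and hostname below are placeholders:

$ rsync -av --partial --progress /mnt/usb-hdd/github-enterprise.vmdk backup@shanghai-dc:/data/ghe/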

Now the raw data is ready in Shanghai, and the next step is to set up a new VirtualBox instance, which is quite easy. We recommend installing VNC or something similar such as ssh X11 forwarding, so you can use a GUI for the remaining operations. If you're quite familiar with VBoxManage, that also works. The only thing you need to do is create a new instance and attach the vmdk file as its storage disk.
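For the headless VBoxManage route, a minimal sketch might look like this; the VM name, OS type, network adapter and paths are placeholders, while the memory/CPU figures match our old instance:

$ VBoxManage createvm --name ghe --ostype Ubuntu_64 --register
$ VBoxManage modifyvm ghe --memory 16384 --cpus 4 --nic1 bridged --bridgeadapter1 eth0
$ VBoxManage storagectl ghe --name SATA --add sata
$ VBoxManage storageattach ghe --storagectl SATA --port 0 --device 0 --type hdd --medium /data/github-enterprise.vmdk
$ VBoxManage startvm ghe --type headless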

After booting up GitHub Enterprise, we found that we couldn't get into http://example.com/setup even after we uploaded the license. After opening a ticket, a GitHub engineer confirmed that there is a bug in this version which occasionally prevents you from entering the management console. In order to continue, we first needed to ssh into the box as the admin user, then remove the previous license file via sudo rm /data/enterprise/enterprise.ghl, then run god restart enterprise-manage. But we got stuck at the first step because we had no sudo permission.

Knowing the issue, we now needed to get root on the instance. To gain root access we have to boot into recovery mode: just shut down the VM and power it back on, and inside the hypervisor console hold the left shift key down while it boots. This takes us to the GRUB menu where we can choose to boot into recovery mode. As the boot process goes by so fast, we came up with a way to delay booting by running the command below outside VirtualBox:
$ VBoxManage modifyvm YOUR_VM_INSTANCE_NAME --bioslogodisplaytime 10000

After that, we had enough time to press the key. Note, it's the *shift* key, not the tab key, for VirtualBox. We made the mistake of pressing tab without thinking until we noticed that stupid slip.

Once in recovery mode, do as the documentation says: change the root password, and add an ssh public key to /root/.ssh/authorized_keys. After the server has rebooted normally, we can then use public key authentication to log in as root.

If everything goes well, we can now upload the license and enter the management console.

Why do we want to enter the management console? Besides normal user add/delete management, we also need to change the authentication method from the built-in auth to LDAP-backed auth. The setup is straightforward; we only need to fill out the form with the values our IT support team provided, such as Host, Port, User ID field, etc. We clicked the "Save settings" button and waited about 10 minutes for it to take effect, and then something weird happened: we couldn't log in with our LDAP accounts, and every login attempt returned a 500 error. After some troubleshooting, we found that the LDAP setting didn't work because the configuration run wasn't properly triggered when the settings were saved, so the default authentication was still active. You could see this from the login portal: it should say something like "Sign in via LDAP", but what we actually saw was "Sign in". Later, we ran the enterprise-configure command as our technical support suggested, and this time it worked. Why? We can only suspect that some customizations made to the VirtualBox image caused the configuration process to fail.

After signing into the new LDAP-backed portal (say the old username was barackobama, and the new account name is barackobama.bo), we saw an empty page with the new account. That makes sense, since barackobama.bo is a brand new account with no connection to the old account barackobama. If it stayed that way, all of our code would be lost, or we would need to git clone with the old account and git push with the new account manually to bring the code back. That can't beat GitHub: just rename the old user to the new username via the http://example.com/stafftools/[username]/admin URL. Note that the dot "." will be changed to a hyphen, so remember to rename the user to "barackobama-bo".

Our GitHub Enterprise came back to life within the pre-scheduled downtime.

Finally, thanks to @michaeltwofish, @b4mboo, @buckelij, @donal. Your professional skills and attitude really impressed me!