Big transaction and MySQL replication lag

This Saturday afternoon I was on call and got an alert: one of our production MySQL slaves began to lag like this:

After some time reviewing the monitoring metrics, I got a breakthrough: there was a large spike in lock structs and delete operations during the lag period:

Usually this is caused by a big transaction on the master node. Verification was quite straightforward: just analyze the binlog with a script copied from Percona.
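The Percona script itself isn't reproduced here; in essence it decodes the row-based binlog and counts row events per transaction, roughly like this sketch (the binlog file name is only an example):

# rough sketch, not the actual Percona script: count row images per transaction
# in a row-based binlog (the file name is only an example)
mysqlbinlog --no-defaults --base64-output=decode-rows -vv mysql-bin.000123 \
  | awk '/^### (INSERT INTO|UPDATE|DELETE FROM)/ { rows++ }
         /^COMMIT/ { if (rows) print rows " row(s) in this transaction"; rows=0 }'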

After running the script, I really did find something astonishing: during that time the master had one transaction per minute, and each was comparatively big, with nearly 10k delete operations per transaction:
Timestamp : #160904 13:26:54 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7559 row(s) affected
Timestamp : #160904 13:27:56 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7726 row(s) affected
Timestamp : #160904 13:28:57 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7864 row(s) affected
Timestamp : #160904 13:29:59 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8039 row(s) affected
Timestamp : #160904 13:31:01 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7961 row(s) affected
….
Timestamp : #160904 13:33:05 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7985 row(s) affected
Timestamp : #160904 13:34:06 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7849 row(s) affected
Timestamp : #160904 13:35:08 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7927 row(s) affected
Timestamp : #160904 13:36:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8014 row(s) affected
Timestamp : #160904 13:37:11 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8031 row(s) affected
Timestamp : #160904 13:38:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8104 row(s) affected
Timestamp : #160904 13:39:15 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7730 row(s) affected
Timestamp : #160904 13:40:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7668 row(s) affected
Timestamp : #160904 13:41:18 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7836 row(s) affected
Timestamp : #160904 13:42:20 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7757 row(s) affected

Timestamp : #160904 13:48:29 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7835 row(s) affected
Timestamp : #160904 13:49:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7839 row(s) affected
Timestamp : #160904 13:50:32 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8028 row(s) affected
Timestamp : #160904 13:51:34 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8223 row(s) affected
Timestamp : #160904 13:52:35 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8119 row(s) affected
Timestamp : #160904 13:53:37 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7924 row(s) affected
Timestamp : #160904 13:54:38 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 2139 row(s) affected
Timestamp : #160904 14:17:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 5592 row(s) affected
Timestamp : #160904 14:18:12 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8260 row(s) affected
Timestamp : #160904 14:19:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7876 row(s) affected
Timestamp : #160904 14:20:14 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:21:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:22:17 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7813 row(s) affected

Timestamp : #160904 18:15:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8191 row(s) affected
Timestamp : #160904 18:16:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8180 row(s) affected
Timestamp : #160904 18:17:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8304 row(s) affected
Timestamp : #160904 18:18:28 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8128 row(s) affected

For comparison with other time periods, below are the top 10 transactions of Sep 2, each far smaller than 10,000 rows:
Timestamp : #160902 18:00:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 850 row(s) affected
Timestamp : #160902 10:00:33 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 813 row(s) affected
Timestamp : #160902 8:00:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 786 row(s) affected
Timestamp : #160902 10:55:42 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 646 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 608 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 606 row(s) affected
Timestamp : #160902 19:07:37 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 599 row(s) affected
Timestamp : #160902 19:10:09 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 582 row(s) affected
Timestamp : #160902 19:07:23 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 575 row(s) affected

That explains why the IO thread works fine but the lag is still big: by default the slave has only a single SQL thread to apply everything, including big transactions from the master, so it simply can't keep up.
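A quick way to confirm that a single applier thread is doing all the work (a sketch from memory, not the exact commands we ran):

$ mysql -e "SHOW SLAVE STATUS\G" | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
$ mysql -e "SHOW PROCESSLIST" | grep -i 'system user'
# without multi-threaded replication there are only two 'system user' rows:
# the IO thread and the single SQL (applier) thread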

What's next? Find out which service caused the big transactions. Tracing back through the DB logs, there really is a job that purges old data every minute, like this:
[2016-09-04 14:21:53  INFO com.benmu.rapidq.backup.backup.task.HistoryCleaner:?] clean history qmq_backup_messages data by condition create_time < '2016-08-20 14:21:53', total clean: 892, elapse: 60 ms

After discussing with the developer, we decided to stop the job. Besides that, we are considering increasing the slave's parallel replication workers (slave_parallel_workers), since we are already on MySQL 5.7, where multi-threaded replication is a big selling point.
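A minimal sketch of what enabling it on a 5.7 slave looks like (the worker count of 8 is only an illustrative number, not something we have benchmarked):

mysql> STOP SLAVE SQL_THREAD;
mysql> SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';  -- parallelize by group commit, not only by schema
mysql> SET GLOBAL slave_parallel_workers = 8;             -- illustrative value
mysql> START SLAVE SQL_THREAD;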

Nginx 200 response with empty body caused by double slash

We set up a new Nginx cluster and sent requests to HIS using curl with the domain name as the Host header; unfortunately, we got a 200 return code but an empty body, like this:

When we use the IP as the Host header, it works well and returns the expected result:

After some debugging with nothing found, tcpdump came to the rescue:

In the tcpdump output, the upper red arrow marks the request using the domain name as the Host header and the lower one the request using the IP; we can see a double slash in the URI when the domain name is used.

After reviewing the Nginx config file, we found the misconfigured location:
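Our actual config is not reproduced here, but as an illustration of how such a double slash typically creeps in: a prefix location without a trailing slash combined with a proxy_pass URI that ends in one will duplicate the slash (hypothetical upstream and path below).

# hypothetical example, not our production config:
# for a request to /api/foo, the matched prefix "/api" is replaced by the
# proxy_pass URI "/", so the upstream receives "//foo"
location /api {
    proxy_pass http://backend_servers/;
}

# keeping both sides consistent avoids the doubled slash:
location /api/ {
    proxy_pass http://backend_servers/;
}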

OpenVPN Connectivity Issue in Public Network

We established an OpenVPN tunnel between two IDCs in September 2014, and we have been monitoring the availability and performance of both ends for a long time. The geographical distance between the two IDCs is quite short, but they are on different telecom carriers; mtr shows about 7 hops from one end to the other. The screenshot below shows the standard ping loss.


The result is quite interesting. At first we used the UDP protocol and often experienced disconnection issues; later we switched to TCP and it improved a lot. From the diagram, the average packet loss is 1.11%.
Why 1.11%? What I can explain is that the tunnel is often fully saturated during peak hours, and this can't be solved at the moment, so no matter which protocol we use, some packet loss will exist. Another possible reason is the complexity of the public network, which I can't quantify.
The current setup works under the current circumstances, but with no "how many nines" guarantee. Anyway, if we want more stable connectivity, a DLL (dedicated leased line) is a better choice.
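For reference, the UDP-to-TCP switch mentioned above is only a one-line change on each end of the tunnel; a sketch with placeholder values, not our real config:

# server side (sketch, placeholder port)
proto tcp-server        # previously: proto udp
port 1194
dev tun

# client side
proto tcp-client        # previously: proto udp
remote vpn.example.com 1194
dev tun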

Metrics Dashboard Comparison for Linux Desktop

I need to know how well (or poorly) my desktops (Loongson & Raspberry Pi 2) are running, so there should at least be a solution that shows realtime and historical metrics of the system, like CPU, memory, network, disk I/O, etc.

Tools like Glances, nmon, sar (sysstat) or atop provide nice realtime metrics, but you can't easily get historical data out of them, and there is no web dashboard for a nice view, so I passed on them. Systems like Icinga, Nagios, Graphite, Zabbix, etc. are heavy and complicated, and better suited to large production setups than to a desktop with 2GB of memory.

Below are some tools that may help you set up a tiny-to-small monitoring/metrics system. At the end, I'll pick the one most suitable for my own use.

psdash
A dashboard overview of the system built with psutil and Flask.

Installation is quite straightforward with the pip package manager. Unfortunately, on my Debian jessie (testing) box I hit a "PROTOCOL_SSLv3 is not defined" bug that stopped it from running; you can modify the Python gevent lib to work around it (1, 2).
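For the record, the install is roughly as follows (package name as published on PyPI; check the project README for the bind address, port and log-file options):

$ pip install psdash
$ psdash --help     # lists the bind/port/log options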

The web dashboard is quite lightweight and nice; you get realtime data from the page, which refreshes every 3 seconds automatically. The fatal problem is that there is no historical data and no charts.

Linux-dash 
Quite similar to psdash, but built on a bigger technical stack, with server implementations in Node.js, Go, and PHP. Its official slogan is "A simple, low-overhead web dashboard for GNU / Linux. (~1MB)"; besides that, not much stands out.

ezservermonitor-web 
Besides all the features of psdash and Linux-dash, it also has a simple threshold configuration. Say, for the load average metric, it is displayed with gauges: below 50% it's green, above 76% it turns red. It also has built-in ping and server check functions, which is handy for home use.

Ezservermonitor also has a console-based tool called EZ SERVER MONITOR`SH, without a web interface.

Web VMStat 
It's a small application written in Java and HTML which displays live Linux system statistics. It simply wraps the vmstat command in a pretty web page, using SmoothieCharts and websocketd, in realtime.

munin 
The last is usually the best. Yes, it's the only one I really want to recommend: no matter how many desktops you have, Munin can handle them easily, and more importantly, installation only takes a few minutes (depending on your network) if you install it directly with apt/yum. A real out-of-the-box product. It uses RRD as its backend storage engine, with a 300-second interval by default, which I think is enough for most desktop users, and about one year of history.
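On a Debian-flavored box the whole setup is roughly the following (a sketch; package names are the stock Debian ones, paths may differ on other distros):

$ sudo apt-get install munin munin-node   # node and master on the same box is fine for a desktop
$ sudo service munin-node restart
# the master polls every 5 minutes and writes static HTML and graphs,
# by default under /var/cache/munin/www -- point any web server at that directory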

If the hundreds of existing plugins don't satisfy you, just write a new one, or port one from any Unix platform yourself, in any scripting language. This is its killer feature.

After the above comparison, you can choose according to your needs.

Want near-realtime (5m or so) and historical data with charts? Munin is the best option.

Want realtime data without history? psdash, Linux-dash, and Web VMStat are what you're looking for.

Still not satisfied? You may consider Graphite or Zabbix if you have sysadmin experience, since both are enterprise-level open source products.

Router Matters

We are transferring PBs of HDFS data from one data center to another via a router. We never thought the performance of a router would become the bottleneck until we found the statistics below:

#show interfaces Gi0/1
GigabitEthernet0/1 is up, line protocol is up
  Hardware is iGbE, address is 7c0e.cece.dc01 (bia 7c0e.cece.dc01)
  Description: Connect-Shanghai-MSTP
  Internet address is 10.143.67.18/30
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 250/255, rxload 3/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full Duplex, 1Gbps, media type is ZX
  output flow-control is unsupported, input flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:06, output 00:00:00, output hang never
  Last clearing of "show interface" counters 1d22h
  Input queue: 0/75/0/6 (size/max/drops/flushes); Total output drops: 22559915
  Queueing strategy: fifo
  Output queue: 39/40 (size/max)

The output queue is full (39/40), and the txload is obviously high: 250/255 is roughly 98% of the 1 Gbps link.

How awful. At the beginning we noticed many failures and retransmissions during the transfer between the two data centers. After adding some metrics, everything became clear: the latency between the two data centers was quite unstable, sometimes around 30ms and sometimes reaching 100ms or more, which is unacceptable for latency-sensitive applications. We then logged into the router and found the result above.

After that, we dropped it and replaced it with a more capable one. Now everything is back to normal: latency is around 30ms and packet drop is below 1%.

How Many Non-Persistent Connections Can Nginx/Tengine Support Concurrently

Recently, I took over a product line that has horrible performance issues, and our customers complain a lot. The architecture is quite simple: clients, which are SDKs installed on our customers' handsets, send POST requests to a cluster of Tengine servers via a cluster of IPVS load balancers. Tengine is actually a heavily customized Nginx server that comes with tons of handy features. The Tengine proxy then forwards the requests to the upstream servers, and after some computation the app servers write the results to MongoDB.

When using curl to send a POST request like this:
$ curl -X POST -H "Content-Type: application/json" -d '{"name": "jaseywang","sex": "1","birthday": "19990101"}' http://api.jaseywang.me/person -v

Out of every 10 tries, you'll probably get 8 or more failures with "connection timeout".

After some basic debugging, I found that Nginx was quite abnormal: the TCP accept queue was completely full, which explains why clients got the unwanted responses. CPU and memory utilization were acceptable, but the networking went wild because the packet rate the NICs received was tremendous: about 300kpps on average and 500kpps at peak. Since the packet count is so large, the interrupt rate is correspondingly high. Fortunately these Nginx servers all ship with 10G network cards in mode 4 (LACP) bonding, and the link layer and lower levels were already tuned: TSO/GSO/GRO/LRO, ring buffer size, etc. I haven't seen any packet drops/overruns with ifconfig or similar tools.
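The state of the accept queue is easy to check; a sketch of the kind of commands used:

# listening sockets: Recv-Q is the current accept-queue length, Send-Q is its limit
$ ss -lnt '( sport = :80 )'
# cumulative counters of listen-queue overflows and drops
$ netstat -s | egrep -i 'listen'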

After some packet capturing, I found more terrifying facts: almost all of the incoming packets are less than 75 bytes, and most of the connections are non-persistent. They go through the TCP 3-way handshake, send one HTTP POST request (or a couple of segments, usually no more than 2 if TCP segmentation is needed), and exit with the 4-way teardown. Besides that, these clients usually resend the same requests every 10 minutes or longer; the interval is set by the app developers and is beyond our control. That means:

1. More than 80% of the traffic is pure small TCP packets, which has a significant impact on the network card and CPU. You can get an overall idea of the percentage from the image below; it is actually about 88%.

2. Since the connections are not persistent (just a TCP 3-way handshake, one or more POSTs, then the TCP 4-way teardown), there is no way to reuse a connection. That's fine for the network card and CPU, but it's a nightmare for Nginx: even after I enlarged the backlog, the TCP accept queue quickly filled up again after reloading Nginx. The two images below show a client packet's lifetime: the first is the packet capture between the IPVS load balancer and Nginx, the second is the communication between Nginx and the upstream server.

3. It's quite expensive to set up a TCP connection, especially when Nginx runs out of resources. I can see that during that period the network traffic is quite large, but the number of new connections accepted per second is quite small compared to normal. HTTP/1.0 needs the client to send Connection: Keep-Alive in the request header to enable persistent connections, while HTTP/1.1 enables them by default; toggling this didn't have much effect. The relevant Nginx directives are sketched right after this list.
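The Nginx side of keep-alive, both towards clients and towards the upstream, boils down to a few directives; a minimal sketch with illustrative values, not our production config:

http {
    keepalive_timeout  30s;        # how long an idle client connection is kept open
    keepalive_requests 100;        # how many requests may reuse one client connection

    upstream app_servers {         # hypothetical upstream name
        server 10.0.0.10:8080;
        keepalive 32;              # connection pool towards the upstream
    }

    server {
        listen 80 backlog=65535;   # the enlarged accept backlog mentioned above
        location / {
            proxy_pass http://app_servers;
            proxy_http_version 1.1;          # required for upstream keepalive
            proxy_set_header Connection "";  # don't forward "Connection: close"
        }
    }
}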

There is now plenty of evidence that our 3 Nginx servers (yes, only 3; we never thought Nginx would become the bottleneck one day) are half broken in such a harsh environment. How many connections can one server keep, and how many new connections can it accept? I needed to run some real-traffic benchmarks to get accurate results.

Since recovering the product was the top priority, I added another 6 Nginx servers with the same configuration to IPVS. With 9 servers serving the online traffic, things behave normally now: each Nginx gets about 10K qps, with 300ms response time, zero TCP queue, and 10% CPU utilization.

The benchmark process is not complex: remove one Nginx (real server) from IPVS at a time, and monitor metrics like qps, response time, TCP queues, and CPU/memory/network/disk utilization. When qps or a similar metric stops going up and starts to turn around, that's usually the maximum the server can support.
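Taking a real server in and out of the pool is a one-liner with ipvsadm (VIP and RS addresses are placeholders; -g assumes direct-routing mode, adjust to whatever your setup uses):

$ sudo ipvsadm -d -t 10.0.0.100:80 -r 10.0.1.11:80          # remove one real server
$ sudo ipvsadm -a -t 10.0.0.100:80 -r 10.0.1.11:80 -g -w 1  # add it back afterwards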

Before kicking off, make sure some key parameters and directives are set correctly, including Nginx worker_processes/worker_connections, CPU affinity to distribute interrupts, and kernel parameters (tcp_max_syn_backlog/file-max/netdev_max_backlog/somaxconn, etc.). Keep an eye on rmem_max/wmem_max; during my benchmark I noticed quite different results with different values.
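A sketch of the sysctl side; the numbers are illustrative starting points, not tuned recommendations:

# /etc/sysctl.conf fragment -- illustrative values, apply with `sysctl -p`
net.ipv4.tcp_max_syn_backlog = 65535
net.core.somaxconn = 65535           # caps the listen backlog Nginx can actually get
net.core.netdev_max_backlog = 262144
fs.file-max = 1000000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216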

Here are the results:
The best performance for a single server is 25K qps; however, at that level it's not so stable: I observed an almost-full TCP accept queue and some connection failures when requesting the URI, though everything else seemed normal. A conservative value is about 23K qps. That comes with 300K established TCP connections, 200Kpps, 500K interrupts, and 40% CPU utilization.
During that time, total consumption from the IPVS perspective was 900K concurrent connections, 600Kpps, 800Mbps, and 100K qps.
The benchmark above was run between 10:30PM and 11:30PM; peak time usually falls between 10:00PM and 10:30PM.

Turning off the access log to cut down on IO, and disabling timestamps in the TCP stack, may achieve better performance; I haven't tested this.
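Concretely, that would be something like the following; untested on this workload, as said:

# nginx: skip the per-request log write
access_log off;

# kernel: drop TCP timestamps; saves a little per-packet work at the cost of
# better RTT estimation and PAWS protection
net.ipv4.tcp_timestamps = 0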

Don't confuse TCP keepalive with HTTP keep-alive; they are totally different concepts. One thing to keep in mind is that the client-LB-Nginx-upstream setup usually has an LB TCP session timeout, 90s by default. That means that when a client sends a request to Nginx and Nginx doesn't respond to the client within 90s, the LB will tear down the TCP connection on both ends by sending RST packets, to save the LB's resources and sometimes for security reasons. In this case, you can decrease the TCP keepalive parameters as a workaround.
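For example, bringing the keepalive timer below the LB's 90s idle timeout (illustrative values; they only affect sockets that enable SO_KEEPALIVE):

# system-wide TCP keepalive knobs (kernel defaults are 7200 / 75 / 9)
net.ipv4.tcp_keepalive_time = 60     # start probing after 60s idle, i.e. before the LB's 90s timeout
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 5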