Big transaction and MySQL replication lag

This saturday afternoon I was on call, and got a alert, one of our production MySQL slave begin to have lag like this:

After some time reviewing the monitoring metrics, I got the big breakthrough, there existed large spike of lock structs and delete operations during the lag period:

Usually, this was caused by big transaction on master node. The verfication was quite straightfoward, just analyze the binlog, the script was copied from percona.

After running the script, I do really found something astonishing, during that time the master had one transaction per minute, and each was comparatively big, nearly 10k delete operations for one transaction:
Timestamp : #160904 13:26:54 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7559 row(s) affected
Timestamp : #160904 13:27:56 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7726 row(s) affected
Timestamp : #160904 13:28:57 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7864 row(s) affected
Timestamp : #160904 13:29:59 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8039 row(s) affected
Timestamp : #160904 13:31:01 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7961 row(s) affected
Timestamp : #160904 13:33:05 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7985 row(s) affected
Timestamp : #160904 13:34:06 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7849 row(s) affected
Timestamp : #160904 13:35:08 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7927 row(s) affected
Timestamp : #160904 13:36:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8014 row(s) affected
Timestamp : #160904 13:37:11 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8031 row(s) affected
Timestamp : #160904 13:38:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8104 row(s) affected
Timestamp : #160904 13:39:15 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7730 row(s) affected
Timestamp : #160904 13:40:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7668 row(s) affected
Timestamp : #160904 13:41:18 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7836 row(s) affected
Timestamp : #160904 13:42:20 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7757 row(s) affected

Timestamp : #160904 13:48:29 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7835 row(s) affected
Timestamp : #160904 13:49:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7839 row(s) affected
Timestamp : #160904 13:50:32 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8028 row(s) affected
Timestamp : #160904 13:51:34 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8223 row(s) affected
Timestamp : #160904 13:52:35 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8119 row(s) affected
Timestamp : #160904 13:53:37 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7924 row(s) affected
Timestamp : #160904 13:54:38 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 2139 row(s) affected
Timestamp : #160904 14:17:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 5592 row(s) affected
Timestamp : #160904 14:18:12 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8260 row(s) affected
Timestamp : #160904 14:19:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7876 row(s) affected
Timestamp : #160904 14:20:14 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:21:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:22:17 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7813 row(s) affected

Timestamp : #160904 18:15:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8191 row(s) affected
Timestamp : #160904 18:16:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8180 row(s) affected
Timestamp : #160904 18:17:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8304 row(s) affected
Timestamp : #160904 18:18:28 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8128 row(s) affected

If we compare the transaction of other time period, we can get the result, below is the top 10 transaction of Sep 2 which is far less than 10000:
Timestamp : #160902 18:00:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 850 row(s) affected
Timestamp : #160902 10:00:33 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 813 row(s) affected
Timestamp : #160902 8:00:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 786 row(s) affected
Timestamp : #160902 10:55:42 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 646 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 608 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 606 row(s) affected
Timestamp : #160902 19:07:37 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 599 row(s) affected
Timestamp : #160902 19:10:09 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 582 row(s) affected
Timestamp : #160902 19:07:23 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 575 row(s) affected

That explains why io thread works well but still have big lag, since by default slave has only single thread to process all including big transaction from master, it can't handle that and break.

What's next? find which service cause that big transaction, back-flush from the DB log, there do really have a job that purge the old data every minute like this:
[2016-09-04 14:21:53  INFO com.benmu.rapidq.backup.backup.task.HistoryCleaner:?] clean history qmq_backup_messages data by condition create_time < '2016-08-20 14:21:53', total clean: 892, elapse: 60 ms

After the discussion with developer, we decided to stop the job. Besides that, we consider to increase the thread(slave_parallel_worker) of slave since we currently use MySQL 5.7 which is its big selling point.

Nginx 200 response with empty body by double slash

We setup a new cluster of Nginx to send requests to HIS using curl with domain name as the Host header, unfortunately, we get 200 return code but empty body like this:

When we use IP as Host, it works well, and return the expected result:

After some debug, nothing found and tcpdump comes:

From the tcpdump, the above red arrow uses domain name as Host header and the below one uses IP as Host header, we see there exists a double slash when using domain name.

After some review of Nginx config file, we found the misconfiguration location:

OpenVPN Connectivity Issue in Public Network

We established a ovpn tunnel between 2 IDCs in September 2014, and we have monitored the availability and performance of two ends for a long time. The geographical distance between 2 IDCs are quite short, but with different telecom carries. mtr shows that there exits ahout 7 hops from one end to the other. The below screenshot shows the standard ping loss.

The result is quite interesting. At first we used UDP protocol, and we often experienced network disconnection issue, later we switched to TCP, and it improved a lot. From the digram, the average package is 1.11%.
Why 1.11%, what I can explain is the tunnel is often fully saturated during the peak hour, and this can't solved at the moment, so no matter what protocol, the package loss should exists. Another possible reason is the complexity of public network which I can't quantitate.
The current plan works during current background, but no "how many 9s" guarantee. Anyway, if we want to achieve more stable connectivity, a DLL(dedicated leased line) is a better choice.

Can Venta Airwasher Effectively Reduce PM2.5 or PM10?

There is not official tests state that whether Venta airwasher is able to reduce PM2.5 or PM10 effectly, hence, I made some tests with the help of Dylos air particle counter during the good weather, PM2.5 index less than 100 means good or moderate, PM2.5 larger than 100 which means unhealthy or even hazardous. My bedroom's door or window is closed all the time and it's a confined space.

The answer is partially efficient. In centain conditions, it works, other times, no.

When the PM2.5 outside is less than 100, Venta can effectly keep the indoor PM2.5 around 30 or even less.

However, when outside is unhealthy, which means the index is above 100 and beyond, it really can't effectly reduce the particles inside, and the metrics I got from Dylos indoor have positive correlation with outside index, sometimes, Dylos even get 100 or even more which is totally unacceptable for people in the bedroom.

Now, the answer is quite clear, if the weather outside is good, just keep the Venta open and it can handle. When outside is unhealty, you really need to turn on your air filter, and don't rely on it, you need HEPA filter.

What about PM10? Venta seems work all the time, no matter good or bad outside, it can keep the index under 100 or lower, most of time, my bedroom PM10 is around 50, sometimes, after long time filter without opening the door, you can see single digit.

Metrics Dashboard Comparision for Linux Desktop

I need to know how good/poorly my desktop(Loongson & Rasberry Pi 2) is running, so there should at least exits a solution that can see its realtime and history metrics of the the system like cpu, memory, network, disk io etc.

Also tools like Glances, nmon, sar(sysstat) or atop provide a nice realtime metircs, it can't get its history data easily which it more suitable for massive production use, also no web dashborard for a nicely view, so I just pass them. Systems like Icinga, Nagios, Graphite, Zabbix, etc. are so heavy and complicated which are also not convenient for a 2GB memory desktop use.

Below are some tools that may help you for setting up a tiny to small monitoring/metrics system. At the end, I'll pick up the most suitable one for my own.

A dashboard overview of the system using psutils and Flask.

The installation is quite straightforward with pip packager manager. Unfortunately, As my Debian jessie testing version, I happened to a "PROTOCOL_SSLv3 is not defined" bug that stop the system running, you can modify the python gevent lib to work around(1, 2).   

The web dashboard is qute lighweight and nice, you can get the realtime data from the every-3-second refresh web automatically, the fatal problem is it can't get the history data and charts.

Quite similar to psdash, but comes with more technical stack like Node.js, Go, and PHP. It's offical slogon is "A simple, low-overhead web dashboard for GNU / Linux. (~1MB)", besides that, not much eye-catching point.

Besides all the feature psdash and Linux-dash, it also has a simple trigger configuration, Say, for the load average metric, it will be displayed with gauges, when is less than 50%, it's green, when is more than 76% util, it turns to red. Also, it has the built-in ping and servers check function, which is handy for your family internal use.

Ezservermonitor also has a console based tools called EZ SERVER MONITOR`SH, without web interface.

Web VMStat 
It’s a small application written in Java and HTML which displays live Linux system statistics. It just takes over vmstat command in a pretty web page with SmoothieCharts and websocketd in realtime.

The last is usually the best. Yes, it's the only one I want to recommend for you, no matter how many desktops you have, Munin can handle them easily and more importantly, it only taks you few minutes depending on your network quality if you use apt/yum to install it directly, the real out-of-box product. it used RRD as it's backend storage engine. by default, 300s interval which I think it's enough for most of desktop users, with at one year history. 

If you are unsatisfied with its hundreds of plugins, just write a new one or porting from any Unix platform by yourself by any scripts language. This is its killer feature.

After the above comparision, you could choose ones according to your demands.

Want near-realtime(5m or so) and history data with charts? Munin is a best option.

Want realtime data, without history data? psdash, Linux-dash, Web-VmStat are those you're looking for.

Still not satisfied? You may consider using Graphite, Zabbix if you have system admin experience, since both are enterprise level open source product.

Router Matters

We are transfering PBs of our HDFS data from one data center to another via a router, we never thought the performance of a router will becomes the bottleneck until we find the below statistic:

#show interfaces Gi0/1
GigabitEthernet0/1 is up, line protocol is up
  Hardware is iGbE, address is 7c0e.cece.dc01 (bia 7c0e.cece.dc01)
  Description: Connect-Shanghai-MSTP
  Internet address is
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 250/255, rxload 3/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full Duplex, 1Gbps, media type is ZX
  output flow-control is unsupported, input flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:06, output 00:00:00, output hang never
  Last clearing of "show interface" counters 1d22h
  Input queue: 0/75/0/6 (size/max/drops/flushes); Total output drops: 22559915
  Queueing strategy: fifo
  Output queue: 39/40 (size/max)

The output queue is full, hence the txload is obviously high.

How awful it is. At the beginning, we found there were many failures or retransmissions during the transfer between two data centers. After adding some metrics, everything is clear, the latency between two data centers is quite unstable, sometimes around 30ms, and sometimes reaches 100ms or even more which is unacceptable for some latency sensitive application. we then ssh into the router and found the above result.

After that, we drop it and replace it with a more advanced one, now, everything returns to normal, latency is around 30ms, packet drop is below 1%.