Big transactions and MySQL replication lag

This Saturday afternoon I was on call and got an alert: one of our production MySQL slaves began to lag like this:

After some time reviewing the monitoring metrics, I got the big breakthrough: there was a large spike in lock structs and delete operations during the lag period:
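(The "lock structs" figure in the graphs is presumably aggregated from the per-transaction counters InnoDB itself reports; without a dashboard you can see the raw numbers by hand, for example:)

> show engine innodb status\G
-- under the TRANSACTIONS section, each active transaction prints a line like
--   "... 2 lock struct(s), heap size 376, 1 row lock(s) ..."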

Usually this is caused by a big transaction on the master node. Verification was quite straightforward: just analyze the binlog with a script copied from Percona.
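I won't paste the whole Percona script here; the gist is to decode the row-based binlog and count row events per transaction. A rough sketch of that kind of analysis (the binlog file name and the 1000-row threshold are just examples, and the real script handles more cases):

# mysqlbinlog --no-defaults --base64-output=decode-rows -vv mysql-bin.000123 | \
  awk '/^#[0-9]+ .*server id/ { ts = $1 " " $2 }        # remember the last event timestamp
       /^### INSERT INTO/ { ins++ }
       /^### UPDATE/      { upd++ }
       /^### DELETE FROM/ { del++; tbl = $4 }
       /^COMMIT/ {                                       # one summary line per transaction
           if (ins + upd + del > 1000)
               printf "%s %s INSERT:%d UPDATE:%d DELETE:%d\n", ts, tbl, ins, upd, del;
           ins = upd = del = 0
       }'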

After running the script, I really did find something astonishing: during that period the master ran one such transaction per minute, and each one was comparatively big, with nearly 10k delete operations per transaction:
Timestamp : #160904 13:26:54 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7559 row(s) affected
Timestamp : #160904 13:27:56 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7726 row(s) affected
Timestamp : #160904 13:28:57 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7864 row(s) affected
Timestamp : #160904 13:29:59 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8039 row(s) affected
Timestamp : #160904 13:31:01 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7961 row(s) affected
….
Timestamp : #160904 13:33:05 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7985 row(s) affected
Timestamp : #160904 13:34:06 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7849 row(s) affected
Timestamp : #160904 13:35:08 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7927 row(s) affected
Timestamp : #160904 13:36:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8014 row(s) affected
Timestamp : #160904 13:37:11 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8031 row(s) affected
Timestamp : #160904 13:38:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8104 row(s) affected
Timestamp : #160904 13:39:15 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7730 row(s) affected
Timestamp : #160904 13:40:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7668 row(s) affected
Timestamp : #160904 13:41:18 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7836 row(s) affected
Timestamp : #160904 13:42:20 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7757 row(s) affected

Timestamp : #160904 13:48:29 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7835 row(s) affected
Timestamp : #160904 13:49:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7839 row(s) affected
Timestamp : #160904 13:50:32 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8028 row(s) affected
Timestamp : #160904 13:51:34 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8223 row(s) affected
Timestamp : #160904 13:52:35 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8119 row(s) affected
Timestamp : #160904 13:53:37 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7924 row(s) affected
Timestamp : #160904 13:54:38 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 2139 row(s) affected
Timestamp : #160904 14:17:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 5592 row(s) affected
Timestamp : #160904 14:18:12 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8260 row(s) affected
Timestamp : #160904 14:19:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7876 row(s) affected
Timestamp : #160904 14:20:14 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:21:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:22:17 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7813 row(s) affected

Timestamp : #160904 18:15:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8191 row(s) affected
Timestamp : #160904 18:16:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8180 row(s) affected
Timestamp : #160904 18:17:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8304 row(s) affected
Timestamp : #160904 18:18:28 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8128 row(s) affected

Comparing with other time periods makes the contrast clear: below are the top 10 transactions of Sep 2, each affecting far fewer than 10,000 rows:
Timestamp : #160902 18:00:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 850 row(s) affected
Timestamp : #160902 10:00:33 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 813 row(s) affected
Timestamp : #160902 8:00:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 786 row(s) affected
Timestamp : #160902 10:55:42 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 646 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 608 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 606 row(s) affected
Timestamp : #160902 19:07:37 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 599 row(s) affected
Timestamp : #160902 19:10:09 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 582 row(s) affected
Timestamp : #160902 19:07:23 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 575 row(s) affected

That explains why the IO thread worked fine yet the lag was still big: by default the slave has only a single SQL thread to apply everything, including big transactions from the master, and it simply cannot keep up.
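On the slave this shows up clearly in SHOW SLAVE STATUS: both threads report running, yet the lag keeps climbing because the single SQL thread is stuck replaying the big deletes. Roughly like this (values are illustrative, not taken from our box):

> show slave status\G
...
       Slave_IO_Running: Yes
      Slave_SQL_Running: Yes
  Seconds_Behind_Master: 1200
...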

What's next? Find the service that caused the big transactions. Tracing back through the application's DB log, there really was a job purging old data every minute, like this:
[2016-09-04 14:21:53  INFO com.benmu.rapidq.backup.backup.task.HistoryCleaner:?] clean history qmq_backup_messages data by condition create_time < '2016-08-20 14:21:53', total clean: 892, elapse: 60 ms
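Judging from that log line and the binlog above, the job presumably fires one bulk statement of roughly this shape every minute (a guess for illustration, not the job's actual SQL), which is exactly how a single ~8k-row transaction ends up in the binlog:

> delete from qmq_backup.qmq_log_records where create_time < '2016-08-20 14:21:53';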

After discussing with the developer, we decided to stop the job. Besides that, we are considering increasing the slave's applier threads (slave_parallel_workers), since we are already on MySQL 5.7, where multi-threaded replication is one of the big selling points.
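For reference, switching a 5.7 slave to multi-threaded replication looks roughly like this (the worker count is only an illustration, and with LOGICAL_CLOCK the real gain also depends on group commit on the master):

> stop slave sql_thread;
> set global slave_parallel_type = 'LOGICAL_CLOCK';
> set global slave_parallel_workers = 8;
> start slave sql_thread;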

Lessons learned from using zabbix proxy/node

The applicable scenarios for these two are described in this official document.
In our setup, different VLANs within the same data center are collected through a proxy, while the cross-datacenter part runs as a node alongside the current master. Operating a proxy or a node is not hard in itself; what I want to share here are a few lessons we learned while splitting off a node.

# zabbix_server -n 20 -c /etc/zabbix/zabbix_server.conf
Dropping foreign keys ………………………………………. done.
Converting tables ………………………. done.
Creating foreign keys ………………………………………. done.
Conversion completed successfully.

The remarkably clean stdout above came from converting a freshly deployed zabbix server, with very few items, into a node. When we converted a zabbix server that had been running for over half a year (in a different data center from the new one) into a node, things were not nearly as pretty.

Running the command above actually executes the SQL statements defined in db_schema_fkeys and db_schema_fkeys_drop inside src/libs/zbxdbhigh/dbschema.c.
A serious warning: if your zabbix has been running for a long time, like ours, where the history_uint table alone still holds close to 100 million rows even after partitioning, executing all of that SQL will take a very long time.
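A rough way to gauge how long the conversion will take is to look at the biggest tables beforehand, for example (assuming the database is named zabbix; for InnoDB, table_rows is only an estimate):

> select table_name, table_rows from information_schema.tables where table_schema = 'zabbix' order by table_rows desc limit 10;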
I did not understand the command well beforehand and assumed it would finish in a few minutes at most. Tens of minutes later it still had not; fortunately zabbix_server prints slow queries to the screen, which is how I finally traced them back to the SQL from dbschema.c mentioned above.
In addition, the following statement shows what is currently running:
> select * from information_schema.processlist where info is not null;

Also, the official documentation says the command above can only be run once and must not be interrupted with ctrl+c midway, so we simply waited for it to finish; the whole process took more than 6 hours.

zabbix_server [310083]: Warning: slow query: 2965.823846 sec, "update history set itemid=itemid+1001000000000000 where itemid>0"
zabbix_server [310083]: Warning: slow query: 12539.601075 sec, "update history_uint set itemid=itemid+1001000000000000 where itemid>0"

When the infrastructure is not built well, maintenance costs rise accordingly later on: the manpower, energy, and time you have to spend multiply, and you take on much bigger risks. Foundational work of this kind is best done as early as possible.

LNMP on CentOS 5.5 x86_64

The x86_64 platform has been an utter pain: dependency problems and conflicts everywhere. I had previously installed nginx from EPEL and it was more or less usable; later I needed to add MySQL, and installing it from the mysql.com rpm packages went fine (installing it with yum gave all kinds of errors); then I needed PHP support, and that is where the tragedy began. Hence this write-up.

First update the system. I am using the mirrors.163.com repositories; see the appendix for details.

# yum update

Install EPEL
See the appendix for an explanation of what EPEL is.

# rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
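A quick sanity check that the repo actually came in:

# rpm -q epel-release
# yum repolist | grep epel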