Big transaction and MySQL replication lag

This saturday afternoon I was on call, and got a alert, one of our production MySQL slave begin to have lag like this:

After some time reviewing the monitoring metrics, I got the big breakthrough, there existed large spike of lock structs and delete operations during the lag period:

Usually, this was caused by big transaction on master node. The verfication was quite straightfoward, just analyze the binlog, the script was copied from percona.

After running the script, I do really found something astonishing, during that time the master had one transaction per minute, and each was comparatively big, nearly 10k delete operations for one transaction:
Timestamp : #160904 13:26:54 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7559 row(s) affected
Timestamp : #160904 13:27:56 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7726 row(s) affected
Timestamp : #160904 13:28:57 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7864 row(s) affected
Timestamp : #160904 13:29:59 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8039 row(s) affected
Timestamp : #160904 13:31:01 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7961 row(s) affected
….
Timestamp : #160904 13:33:05 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7985 row(s) affected
Timestamp : #160904 13:34:06 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7849 row(s) affected
Timestamp : #160904 13:35:08 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7927 row(s) affected
Timestamp : #160904 13:36:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8014 row(s) affected
Timestamp : #160904 13:37:11 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8031 row(s) affected
Timestamp : #160904 13:38:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8104 row(s) affected
Timestamp : #160904 13:39:15 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7730 row(s) affected
Timestamp : #160904 13:40:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7668 row(s) affected
Timestamp : #160904 13:41:18 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7836 row(s) affected
Timestamp : #160904 13:42:20 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7757 row(s) affected

Timestamp : #160904 13:48:29 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7835 row(s) affected
Timestamp : #160904 13:49:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7839 row(s) affected
Timestamp : #160904 13:50:32 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8028 row(s) affected
Timestamp : #160904 13:51:34 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8223 row(s) affected
Timestamp : #160904 13:52:35 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8119 row(s) affected
Timestamp : #160904 13:53:37 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7924 row(s) affected
Timestamp : #160904 13:54:38 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 2139 row(s) affected
Timestamp : #160904 14:17:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 5592 row(s) affected
Timestamp : #160904 14:18:12 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8260 row(s) affected
Timestamp : #160904 14:19:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7876 row(s) affected
Timestamp : #160904 14:20:14 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:21:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:22:17 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7813 row(s) affected

Timestamp : #160904 18:15:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8191 row(s) affected
Timestamp : #160904 18:16:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8180 row(s) affected
Timestamp : #160904 18:17:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8304 row(s) affected
Timestamp : #160904 18:18:28 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8128 row(s) affected

If we compare the transaction of other time period, we can get the result, below is the top 10 transaction of Sep 2 which is far less than 10000:
Timestamp : #160902 18:00:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 850 row(s) affected
Timestamp : #160902 10:00:33 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 813 row(s) affected
Timestamp : #160902 8:00:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 786 row(s) affected
Timestamp : #160902 10:55:42 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 646 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 608 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 606 row(s) affected
Timestamp : #160902 19:07:37 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 599 row(s) affected
Timestamp : #160902 19:10:09 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 582 row(s) affected
Timestamp : #160902 19:07:23 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 575 row(s) affected

That explains why io thread works well but still have big lag, since by default slave has only single thread to process all including big transaction from master, it can't handle that and break.

What's next? find which service cause that big transaction, back-flush from the DB log, there do really have a job that purge the old data every minute like this:
[2016-09-04 14:21:53  INFO com.benmu.rapidq.backup.backup.task.HistoryCleaner:?] clean history qmq_backup_messages data by condition create_time < '2016-08-20 14:21:53', total clean: 892, elapse: 60 ms

After the discussion with developer, we decided to stop the job. Besides that, we consider to increase the thread(slave_parallel_worker) of slave since we currently use MySQL 5.7 which is its big selling point.