Tuesday, May 31, 2022

Triggering Replication Lag for Testing a Script

I am currently working on a script to auto-enable parallel replication / multi-threaded replication (MTR) when there is replication lag.  For testing this script, I need to trigger replication lag that would disappear after enabling MTR.  I came-up with a simple solution for that, and I thought it could be useful to more people, so I am writing this blog post about it.  Read-on for the details.

Before explaining how to trigger replication lag, a few words about this Auto-MTR script.  It could seem uselessly complex to run a script enabling MTR when there is lag (it is even a little more complex than that as the script is also disabling MTR once the lag is gone).  A simpler solution might be having MTR enabled all the time, but things are not that simple.

MTR is sadly not as stable as it should.  It sometimes deadlock, which is paging a human to resume replication (for this, we monitor GTIDs progressing, and when it get stuck, we alarm).  To minimize disturbing DBAs, we prefer to run with MTR only when needed (yes, I could also write a self-healing script for this, but I am not there yet).

Also, running with MTR means that replication load monitoring is not working (I have not blogged about what I recently implemented, I will probably share more details about this one day).  If you have not yet implemented any replication load monitoring, you should look at the blog post by Mark Leith about A MySQL Replication Load Average with Performance Schema.  Its main idea is to use the "idle time" from the SQL Thread to compute a replication load.  However, this computation only works with single-threaded replication.

Let's now talk about triggering replication lag.  I have completed the Auto-MTR script, but I want to test it, which needs replication lag.  I could run sysbench with many tables to generate an IO-bound replication workload with sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1, but I would prefer something simpler.

The simpler solution uses statement-based replication

With statement-based replication, replicas run the same statements as were run on the primary.  So if  statements contain a sleep, this sleep will also be run on replicas, which can be used to trigger replication lag.

So after creating this table:

CREATE TABLE t(id INT AUTO_INCREMENT PRIMARY KEY, v DOUBLE, d DATETIME);

And inserting a few rows in it (below is a trick to insert 10 rows in the table in a short statement):

INSERT INTO t (v, d) SELECT 1, NOW() FROM information_schema.INNODB_METRICS LIMIT 10;

I run this script for generating replication lag:

while sleep 1; do for i in $(seq 1 10); do
  mysql $schema <<< "
    SET binlog_format = STATEMENT;
    UPDATE t SET v = SLEEP(0.2), d = now() WHERE id = $i" &
done; done

So now, I have replication lag on a replica, but no way yet to remove the lag by enabling MTR.  To be able to remove lag on the replica, I need to run the sleep in parallel.  For this, I need to tune the primary for making sure that the UPDATE statements above can be run in parallel on the replica.  For that I use Slowing Down the Primary to Speed-Up the Replica (I mentioned this tuning in my recent Percona Live MTR talk and in the blog post A Metric for Tuning Parallel Replication in MySQL 5.7, but as this post is old, that post refers to this as Slowing Down the Master [sic] to Speed-Up the Slave [sic]).  To enable this tuning, I run this on the primary:

SET GLOBAL binlog_group_commit_sync_delay = 50000;

With the variable binlog_group_commit_sync_delay set as above, MySQL waits 50 milliseconds (the variable is in microseconds) for transactions to join the same commit group.  As auto-commit transactions from the same commit-group can be run in parallel on replicas, the UPDATE statements will be able to run in parallel after this change.  So to process the artificially generated lag, I just need to set slave_parallel_workers to 10 (or more), which my script is doing when it detects lag (slave_parallel_type should have previously been set to LOGICAL_CLOCK and slave_preserve_commit_order to ON as I also explained in my Percona Live MTR talk, and this is already done in the environment I am working in).

Some variations on this script are:

  • running more updates in parallel to generate more lag quicker (increase the 10 in the seq 1 10, which also needs more rows in the table),
  • reducing the sleep (the SQL SLEEP(0.2) not the sleep 1 from the while) to catch-up on lag quicker after enabling MTR (this also needs more workers),
  • but keeping the product (multiplication) of the number of updates and the SQL sleep delay higher than 1 (related to the sleep 1 from the while) so the script keeps generating replication lag while running in single-threaded replication.

That is all what I have for today, I hope you find this useful or funny (I find interesting that statement-based replication is still useful here).

No comments:

Post a Comment