Thursday, November 15, 2018

MySQL Master Replication Crash Safety Part #2: lagging slaves

This is Part #2 of the MySQL Master Replication Crash Safety series.  In the previous post, we explored the consequence of reducing durability on masters with slaves using legacy file+position replication.  The consequences are data inconsistencies with a clear warning sign: the slaves stop replicating and report an error.  In this post, we extend our understanding of the impact of running a master with sync_binlog != 1 by considering lagging slaves that are using file+position replication.  This introduces a new inconsistency scenario without replication breakage, which is trickier to detect.  But let's start with some reminders.

In Part #1, we explained why a DBA would set sync_binlog != 1 and what happens on a master OS crash when sync_binlog != 1.  TL&DR: we get performance improvements at the cost of losing binary logs in the case of an OS crash, which could lead to data inconsistencies on master and slaves, and replication breakage on slaves.

In the breakage scenario of Part #1, our slave has more data than the master as it executed transactions that vanished from InnoDB because of trx_commit = 2 (this writes the InnoDB Redo logs after each transaction but only persists — sync — the log once per second, so up to one second of transactions disappeared from the master, and those ghost transactions are almost all present on the slave as they momentarily existed in the binary logs of the master).  Also, replication stops with an error on the slave as when it reconnects to the master, it is pointing in vanished binary logs because of sync_binlog = 0 (this configuration writes the binary logs after each transaction, but does not provide any guarantees on syncing).  We are in the situation depicted in the diagram below.

Illustration #1: IO Thread pointing in a vanished binary log
(after an operating system crash on the master).

But what would happen if the IO Thread is pointing in a persisted binary log ?  We know, as described in Part #1, that the current implementation of MySQL forces a flush on binary log rotation (I would not assume this will always be the case in future versions).  It is not excessive to assume that a lagging slave might be pointing in a previous binary log.  This could happen in the following situations:
  1. a lagging slave has been recently started and it is in the process of downloading binary logs while the master crashes (unlikely but possible),
  2. a lagging (or delayed) slave has relay_log_space_limit set to a value that prevents it from going beyond the sync boundary (more likely),
  3. or a lagging slave pointing in vanished binary logs (lot of unexecuted relay logs) is restarted with relay-log-recovery = 1, which makes it reconnects to its master at the position of the SQL Thread, which might be in persisted binary logs.
(In the demonstration presented in the annexe below, I am simply stopping the slave before flushing the binary logs, which simulate case #1 above.)

In the three cases described above, we are in the situation depicted in the diagram below.

Illustration #2: IO Thread pointing in a persisted binary log
(after an operating system crash on the master).

In this case, the slave will eventually reach the truncated binary log, will happily process it as MySQL crash recovery will make it acceptable, and will continue to the newly created binary log.  So our slave doe not see the transactions in the vanished binary logs and happily continues replicating from the master.

If we take the same example as in Part #1 for transactions on the master:
  • Transactions A, B, ..., G are committed and persisted in both the InnoDB Redo logs and in the binary logs.
  • Transactions H, I, ..., N are committed but only persisted in the InnoDB Redo logs.
  • Transactions O, P, ..., U are committed but not persisted in either the InnoDB Redo logs or the binary logs.
  • After the crash, the master registers transactions V, W, ...
And we assume that the lagging slave it at position B when the master crash.  After reconnecting, the slave will:
  • get B to G,
  • skip H to U as these have disappeared from the binary logs,
  • get V, W ...
So we have an interestingly different failure mode from Part #1.  The slave does not break, but it gets fewer transactions than the master and than the broken slave.  A way to see this type of breakage is presented in the annexe below.

So in this second post in the Master Replication Crash Safety series, we explored a new failure mode of setting sync_binlog != 1 on a master when replicating with file+position.  In the next post, we will explore the failure modes with GTIDs.

To be continued...

Annexe: how to see things break by yourself ?

(As in the previous post of the series and for reproducing the situation described here, I am using two virtual machines with dbdeployer.  The details about why and how are in Part #1.)

Once we have two vms with dbdeployer (lets call them V1 and V2), we need to set up a master on V1 and two slaves on V2.  The easiest way I found to achieve that is to create a dbdeployer replicated environment on each of V1 and V2 and to tweak them.  So lets start:

1) On each vm, create a dbdeployer replicated environment (by default, dbdeployer binds MySQL on, so we have to change that for the slave on V2 to be able to replicate from the master on V1).
$ dbdeployer deploy replication mysql_5.7.23 -c "bind-address = *"
2) On V1, stop the slave nodes, and on V2, the master node (note the prompt notation for V1 and V2 below, they assume the current directory is the respective sandboxes):
V1$ node1/stop; node2/stop
V2$ master/stop
3) On the master, modify the replication user to be available from the network (SQL commands are in blue, execution output is in orange).  The flush binary logs and show master status are useful to allow easy alignment in replication for the slave (next step), and the sync is needed to persist the user as MySQL 5.7 still stores grants in MyISAM (if we do not do that, the crash might lose the user and the slave will not be able to reconnect).
V1$ master/use -N <<< "rename user 'rsandbox'@'127.%' to 'rsandbox'@'%'"; \
    master/use -N <<< "flush binary logs; show master status"; \
mysql-bin.000002        154
4) Make the slaves replicate from the master (use the IP of your master instead of mine and use the binary log file name shown by show master status above).  A small master_connect_retry is useful for avoiding delays in slave connection retries to the master, and a small slave_net_timeout for triggering retries earlier, and the reduced durability for allowing the slave to process transactions at a rate comparable to the master.
V2$ ./use_all_slaves "
    stop slave;
    change master to master_host='', master_connect_retry=1,
                     master_log_file='mysql-bin.000002', master_log_pos=4;
    set global slave_net_timeout = 10;
    set global innodb_flush_log_at_trx_commit = 2; set global sync_binlog = 0;
    start slave"
5) On the master, start the test:
V1$ master/use -N <<< "
    set global innodb_flush_log_at_trx_commit = 2; set global sync_binlog = 0;
    create database test_jfg;
    create table test_jfg.t (id int not null primary key)"; \
    seq -f "insert into t values(%.0f);" 1 1000000 | master/use test_jfg &
5b) Stop the second slave:
V2$ node2/use <<< "stop slave"
6) Rotate the binary logs so the stopped slave points in a fully persisted file, wait a little (5 seconds in our case with more than 1 second is needed because of trx_commit = 2), and then crash the master:
V1$ master/use <<< "flush binary logs"; \
    sleep 5; sudo bash -c "echo c > /proc/sysrq-trigger"
6b) Start back the stopped slave:
V2$ node2/use <<< "start slave"
7) Restart the master vm, then MySQL, add some data on the master, and look at the content of the database:
V1$ master/start; \
    master/use -N test_jfg <<< "
    insert into t values(1000001);
    select * from t order by id desc limit 2"
8) Use show slave status to see the broken state of the first slave (not shown), and look at the content of both slaves:
V2$ ./use_all_slaves "select * from test_jfg.t order by id desc limit 2"
# server: 1 
# server: 2 
The first slave has more data than the master (same as in Part #1), and the second has less data (skipped the truncated binary log) and is not broken: CQFD / QED !


  1. Good one Jeff! Will be interesting to know about the impact of setting sync_binlog = 1 and innodb_flush_log_at_trx_commit != 1 on replication and crash recovery!

    1. IMHO, this makes little sense: you are making the InnoDB part fast but still having the binlog part slow. Can you give me more details about your motivations for doing this, and the exact value you have in mind for trx_commit (0 or 2) ?

    2. I would be interested in decreasing the IO load by not syncing the redo logs for every transaction. At the same time, I do not want any consistency issues with data going to slave and hence want to keep sync_binlog=1. But as a combination, will it make sense ?

  2. It kind of makes sense. Instead of having 2 IOs on commit (Redo log and binlog), you will only have one (binlog). But on an OS crash (if trx_commit = 2) or simply on a mysqld crash (if trx_commit = 0), InnoDB will go back in the past as the Redo log will be missing data.

    IMHO, this is not a great optimisation: you are "only" dividing the numbers of IOs by 2 (and we are not considering the InnoDB flushing in this equation). In the best case, you will double the transaction throughput: and you could go farther by also setting sync_binlog to 0. But then, you would have to make sure you failover in the case of a crash of the master (having READ_ONLY = ON and OFFLINE_MODE = ON in the master configuration file would make sure that, if the master comes back quicker than the failover, nothing get read or written on the master).

    Your idea still work though. If you want to go forward with dividing the IOs by 2, you could do this, but you would have to manage data consistency on the master in case of a crash. By looking at the error logs, you could know the position in the binlogs of the last sync-ed transaction in InnoDB, and replay the missing bits from the binlogs (but replay with sql_log_bin = 0 as you do not want those the get in the binlogs twice). From what I know, MySQL crash recovery is not doing this by itself even if it should...

    This is a complicated for a relatively marginal gain IMHO (only 2 times better). If you want really save IOs (probably in the order of 10 times better), disable sync-ing everywhere and failover in case of a crash.

    Good luck !