Tuesday, February 12, 2019

MySQL Master Replication Crash Safety Part #3: GTID

This is a follow-up post in the MySQL Master Replication Crash Safety series.  In the two previous posts, we explored the consequences of reducing durability on masters (including setting sync_binlog to a value different from 1) when slaves are using legacy file+position replication.  In this post, we cover GTID replication.  This introduces a new inconsistency scenario with a potential replication breakage that depends on transaction execution on the master and on timing on the slave.  Before discussing this violation of ACID, we start with some reminders about the previous posts and with some explanations about GTIDs.

Update 2019-07-13: the fourth post of the series is out: MySQL Master Replication Crash Safety Part #4: benchmarks of high and low durability.

In Part #1, we explained why a DBA would set sync_binlog to something other than 1 and what happens on a master OS crash in this case, including how slaves using file+position replication react to this crash.  TL;DR: we get performance improvements at the cost of losing binary logs on a master OS crash, which could/will lead to data inconsistencies on the master and slaves, and to replication breakage on slaves.

In Part #2, we extended our understanding of the impact of running a master with sync_binlog != 1 by considering lagging slaves (still using file+position replication).  TL;DR: this introduces a new inconsistency scenario after a master OS crash, and the resulting data drift is trickier to detect because replication might not break.

In this post, we cover the impact of sync_binlog != 1 on a master with slaves using GTID replication.  Before doing that, we need to know what GTIDs are and we need to understand some specifics about their implementation.  Please stay with me during the lengthy GTID explanations, as these will make the rest of the discussion very straightforward.

Quick Presentation of MySQL GTIDs


One of the challenges for MySQL DBAs in the pre-GTID era was to repoint a slave to a new master.  One of the possible situations is depicted below: making S2 a slave of S1 is not impossible, but it is challenging.  It needs M1 to be up, and if S1 or S2 are lagging, it poses some additional challenges.  The exact way of doing this is left as an exercise to the reader (but be careful if you implement this yourself as taking into account all the edge cases is not easy, especially with parallel replication and out-of-order commit).
+---+
| M1|
+---+
  |
  +----------+
  |          |
+---+      +---+
| S1|      | S2|
+---+      +---+
A similar situation is depicted below: making T2 a slave of M2 is not trivial.  It also needs T1 to be up, and lag and parallel replication can make this more challenging.
+---+      +---+      +---+
| M2| ---> | T1| ---> | T2|
+---+      +---+      +---+
Finally, an even more challenging situation is the following (the corresponding diagram is a little further below):
  • It is easy to repoint U2 as a slave of U1 (it is the same as S2 to S1 above).
  • It is also easy to repoint U3 as a slave of M3 (it is the same as T2 to M2 above).
  • But moving U3 as a slave of U1 is less trivial.
This last repointing needs two operations: first we have to repoint U3 as a slave of M3, and then we have to repoint U3 as a slave of U1.  This also needs U2 and M3 to be up and it is more challenging when dealing with lag and parallel replication.
+---+
| M3|
+---+
  |
  +----------+
  |          |
+---+      +---+      +---+
| U1|      | U2| ---> | U3|
+---+      +---+      +---+
(Note that in all three situations presented above, it is possible to achieve satisfying results even when dealing with failures, but it involves some dark magic and voodoo incantations with careful parsing of the binary logs.  This is very error-prone and I would not encourage you to do this unless you know exactly what you are doing and have no other alternative.  Also note that this becomes close to impossible with parallel replication and out-of-order commit.)

GTIDs were introduced to simplify the three situations above.  With GTIDs, it becomes easy to perform all these repointing operations in a single step, even with lag and when dealing with failures.  This allows replacing failed intermediate masters (failure of T1 by repointing T2 to M2) and allows master failover (failure of M1 by repointing S2 to S1).
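With GTIDs and MASTER_AUTO_POSITION (more on this below), each of the repointing operations above boils down to something like the following on the slave being moved (a sketch, the host name is a placeholder):
stop slave;
change master to master_host='new_master_host', master_auto_position=1;
start slave;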

In the next sections, I give more details about the MySQL GTID implementation.  It needs to be noted that even if the MariaDB implementation is similar, there are some important differences that I do not discuss here.  Also, before moving forward, I have to point out that I do not like GTID replication (GTIDs might still have use cases outside of replication).  I have to recognize that they are the only currently available solution for repointing slaves, but GTIDs come with many problems which are yet to be solved (an example is described on the ScaleGrid blog with title Slow MySQL Start Time in GTID mode, and I commented on this blog reporting a similar problem).  I prefer Binlog Servers, which I think come with fewer problems and introduce new possibilities (including simplifying point-in-time recovery), but there is no satisfying implementation of this technology yet, and I hope this will change in the future (there is something very interesting coming on this subject...).

Some MySQL GTID Implementation Details


Note that describing the whole MySQL GTID implementation would need a full series of posts.  Here, I focus on the elements needed for understanding the impacts of sync_binlog != 1.  I voluntarily omit many details that are not relevant to master replication crash safety, so if your goal is to understand GTIDs, please do not limit yourself to this discussion.

As presented in the section above, GTIDs were introduced to ease slave repointing.  To allow that, MySQL servers must track a GTID state, they must tag every transaction with a global transaction identifier (GTID), and a new GTID replication negotiation must be implemented (as opposed to the legacy file+position one).  I quickly describe these subjects in the next three sections.

Tagging Transactions With GTIDs


On commit and when a MySQL server writes a transaction to the binary logs, it tags this transaction with a global transaction identifier: this is the GTID of the transaction.  This GTID is composed of the server_uuid and of an increasing number (counter).  Normally, this "counter" part of the GTID is strictly increasing without gaps in the master binary logs.

The GTID of a transaction can be read from the COMMIT OK packet with some MySQL Connectors if the session_track_gtids variable is set accordingly.  Contrary to MariaDB, MySQL does not expose the GTID of the last transaction in a session variable, so I opened Bug#84747 in January 2017 to get this fixed (MariaDB does that with the last_gtid session variable, and a nice use case for this is described in the MASTER_GTID_WAIT page of the Knowledge Base).  In all cases, the GTID of a transaction can be found in the binary logs using the mysqlbinlog command as shown below (note the usage of the gtid_next session variable).
# at 4089
#190210  9:43:36 server id 100  end_log_pos 4154 [...] GTID [...]
SET @@SESSION.GTID_NEXT='00019125-1111-1111-1111-111111111111:17'/*!*/;
# at 4154
#190210  9:43:36 server id 100  end_log_pos 4260 [...] Query [...]
SET TIMESTAMP=1549791816/*!*/;
create database test_jfg
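For reference, an excerpt like the one above can be produced with something like the command below (the binary log path matches my dbdeployer sandbox and might differ in your environment):
$ mysqlbinlog master/data/mysql-bin.000001 | grep -B 2 -A 4 GTID_NEXT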
When a transaction is replicated to a slave, the GTID is kept as is in the slave binary logs.  However, there are many edge cases of binary log morphing on slaves that do not involve the GTID, some of which are:
  • transaction filtering preserving the GTID but not the content of the transactions,
  • parallel replication generating out-of-order commit (temporary gaps in gtid_executed),
  • SBR to RBR replication changing the content of the transactions (but not the GTID).
But I am digressing; what is described so far is enough to continue the discussion...

Tracking the GTID State


Both on a master and on the slaves, it is important to track a GTID state.  This state allows knowing which transactions are available on a master and which transactions have been replicated to a slave.  This state can be observed by looking at the gtid_executed variable, and it is also shown in the output of the SHOW MASTER STATUS and SHOW SLAVE STATUS commands in the Executed_Gtid_Set field.
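As a quick illustration, this state can be queried directly (the GTID set shown below is made up):
select @@global.gtid_executed;
00019125-1111-1111-1111-111111111111:1-17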

But gtid_executed is not something that is persisted by itself on a master or on the slaves.  This variable is a view on something that is much more complex.  For a master, this variable is derived from the content of the binary logs.  For MySQL 5.6 slaves, this variable is also derived from the binary logs, which explains why log_slave_updates is needed for enabling GTIDs in 5.6.  MySQL 5.7 introduces the possibility of using GTIDs without enabling binary logging, and for that, it adds the mysql.gtid_executed table.  The way this table is updated is described in the GTID Format and Storage section of the manual (look for mysql.gtid_executed in there).  So for MySQL 5.7+ slaves, the gtid_executed variable is derived from the content of the binary logs and from the content of the mysql.gtid_executed table.
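On a MySQL 5.7+ server, the persisted part of this state can also be looked at directly in the table (the column names below are the real ones, the values obviously depend on your server):
select source_uuid, interval_start, interval_end from mysql.gtid_executed;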

Note1: because the GTID state of MySQL 5.6 slaves is stored in the binary logs, GTID replication cannot be crash safe in 5.6 with sync_binlog != 1 as reported in Bug#70659.

Note2: because the mysql.gtid_executed table is only updated on binary log rotation when log_slave_updates is enabled on a MySQL 5.7 slave (see the manual for details), GTID replication also cannot be crash safe in 5.7 with sync_binlog != 1.  However, I think this could easily be addressed if that table was updated after each transaction, so I opened Bug#92109 to get this fixed.

GTID Replication Negotiation


The legacy file+position replication negotiation is very simple: ask the master to stream the binary logs from a specific offset in a file.  This is implemented with the COM_BINLOG_DUMP MySQL protocol command and, from a Linux system-call point of view, it is a simple sequence of open/seek/read calls.  As an interesting note, the COM_BINLOG_DUMP command takes a four-byte integer for the offset parameter, which is one of the reasons a binary log file is restricted to a relatively small size (one Gigabyte for max_binlog_size).

If we want to enable GTID replication negotiation, we have to set MASTER_AUTO_POSITION to 1 in the CHANGE MASTER TO command on the slave.  In this case, instead of using the COM_BINLOG_DUMP command to connect to the master, the slave uses the COM_BINLOG_DUMP_GTID command: instead of sending a filename and an offset to the master, the slave sends its GTID state, and then receives the missing transactions.  This puts extra work on the master to find what the slave is missing (it needs to read the content of the binary logs), and this explains why the connection from a slave to a master sometimes takes a significant amount of time (this was the cause of the now-fixed Bug#74607).
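To make the difference concrete, here is a sketch of the two ways of pointing a slave to a master (the host, file and position values are made up):
-- legacy file+position negotiation (COM_BINLOG_DUMP):
change master to master_host='10.164.0.10', master_log_file='mysql-bin.000002', master_log_pos=4089;
-- GTID negotiation (COM_BINLOG_DUMP_GTID), the slave sends its GTID state instead:
change master to master_host='10.164.0.10', master_auto_position=1;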

A Master OS Crash With Slaves Replicating Using GTID


After this somewhat long GTID presentation, we can now discuss what happens after an OS crash on a master with sync_binlog != 1 and with slaves replicating using GTID negotiation (and I promise this will be quick as all the needed background is already explained above).  The first thing to remember (from the first post in the series) is that after such a crash, some binary logs are lost.  As the GTID state of the master is stored in the binary logs, this state rewinds to a certain point in the past.

Let's suppose that the master has committed transactions 1 to 60 before the crash, and that after the crash, the master only remembers up to transaction 49.  In this case, the next committed transaction on the master is tagged with GTID number 50.  However, a slave might have already seen transaction 50 (let's say it has seen up to transaction 58).  This is depicted in the diagram below.

Illustration #1: Lost GTID state on the master and
corresponding mismatching transactions on the master and the slave
(after an operating system crash on the master).
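In GTID-set notation, and re-using the made-up server_uuid from the mysqlbinlog excerpt above, the two states just after the crash look like this:
master (after the crash): gtid_executed = 00019125-1111-1111-1111-111111111111:1-49
slave:                    gtid_executed = 00019125-1111-1111-1111-111111111111:1-58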

The main problem is that after an OS crash,
a master with sync_binlog != 1 re-uses GTIDs
already seen by slaves

From now on, there are two possibilities:
  1. The slave connects back to the master before the master has re-committed transaction 58.
  2. The slave connects back to the master after the master has re-committed transaction 58.
In case #1, replication breaks as the slave requests transactions unknown to the master.

In case #2, the master sends the slave transaction 59 and onward.  So in this case, the slave has transactions 50 to 58 from before the crash (those are lost from the master binary logs because of the crash, but they could be in InnoDB), the slave also misses transactions 50 to 58 from after the crash (because the GTID state of the slave already includes these from before the crash), and it executes transaction 59 and onward from after the crash.  Obviously, depending on the data modified by these transactions, replication might break, but in all cases, we have serious data inconsistencies.

In the next posts, I will dive even deeper into this subject, and I will give solutions to avoid the problems that I presented so far.  To be continued...

Annexe: how to see things break by yourself?


(As in the previous posts of the series and for reproducing the situation described here, I am using two virtual machines with dbdeployer.  The details about why and how are in Part #1.)

Once we have two vms with dbdeployer (let's call them V1 and V2), we need to set up a master on V1 and two slaves on V2.  The easiest way I found to achieve that is to create a dbdeployer replicated environment on each of V1 and V2 and to tweak them.  So let's start:

1) On each vm, create a dbdeployer replicated environment with GTIDs (by default, dbdeployer binds MySQL on 127.0.0.1, so we have to change that for the slaves on V2 to be able to replicate from the master on V1).
$ dbdeployer deploy replication --gtid mysql_5.7.23 -c "bind-address = *"
2) On V1, stop the slave nodes, and on V2, the master node (note the prompt notation for V1 and V2 below; it assumes the current directory is the respective sandbox):
V1$ node1/stop; node2/stop
V2$ master/stop
3) On the master, modify the replication user to be available from the network.  The reset master allows starting with empty binary logs, and the sync is needed to persist the user as MySQL 5.7 still stores grants in MyISAM (if we do not do that, the crash might lose the user and the slaves will not be able to reconnect).
V1$ master/use -N <<< "reset master"; \
    master/use -N <<< "rename user 'rsandbox'@'127.%' to 'rsandbox'@'%'"; \
    sync
4) Make the slaves replicate from the master (use the IP of your master instead of mine).  The first slave has a small master_connect_retry, so it will probably reconnect to the master before the master has re-used all the lost GTIDs.  The second keeps the default (60 seconds), which should give the master enough time to re-use 5 seconds' worth of GTIDs.  A small slave_net_timeout is useful for triggering retries earlier, and the reduced durability allows the slaves to process transactions at a rate comparable to the master.
V2$ ./use_all_slaves "stop slave"; \
    node1/use <<< "change master to master_connect_retry=1"; \
    ./use_all_slaves "
    change master to master_host='10.164.0.10';
    set global slave_net_timeout = 10;
    set global innodb_flush_log_at_trx_commit = 2; set global sync_binlog = 0;
    reset master; start slave"
5) On the master, start the test:
V1$ master/use -N <<< "
    set global innodb_flush_log_at_trx_commit = 2; set global sync_binlog = 0;
    create database test_jfg;
    create table test_jfg.t (id int not null primary key);
    select @@global.gtid_executed"; \
    seq -f "insert into t values(%.0f);" 1 1000000 | master/use test_jfg &
    00019024-1111-1111-1111-111111111111:1-3
6) Rotate the binary logs for generating a checkpoint, wait a little (5 seconds in our case, more than 1 second is needed because of trx_commit = 2), and then crash the master.
V1$ master/use <<< "flush binary logs"; \
    sleep 5; sudo bash -c "echo c > /proc/sysrq-trigger"
6b) Look at the GTID state on the slaves (in this case, we can see that the slaves are not in sync):
V2$ ./use_all_slaves "select @@global.gtid_executed"
# server: 1 
@@global.gtid_executed
00019024-1111-1111-1111-111111111111:1-127705
# server: 2 
@@global.gtid_executed
00019024-1111-1111-1111-111111111111:1-125184
7) Restart the master vm, then MySQL, look at the GTID state of the master, inject transactions (with low durability for the master to get ahead of the second slave), and look at the content of the database:
V1$ master/start; \
    master/use -N <<< "
    select @@global.gtid_executed;
    set global innodb_flush_log_at_trx_commit = 2; set global sync_binlog = 0"; \
    seq -f "insert into t values(1000000+%.0f);" 1 1000000 | master/use test_jfg & \
    master/use -N test_jfg <<< "
    do sleep(1);
    select * from t where id < 1000000 order by id desc limit 1;
    select * from t where id > 1000000 order by id limit 1"
00019024-1111-1111-1111-111111111111:1-78858
127092
1000001
8) Use show slave status to see the broken state of the first slave (not shown), and look at the content of both slaves after sleeping a little to allow the second slave to reconnect:
V2$ sleep 60; ./use_all_slaves "
    select @@global.gtid_executed;
    (select * from test_jfg.t where id < 1000000 order by id desc limit 1)
    union (select * from test_jfg.t where id > 1000000 order by id limit 1)"
# server: 1 
@@global.gtid_executed
00019024-1111-1111-1111-111111111111:1-127705
id
127702
# server: 2 
@@global.gtid_executed
00019024-1111-1111-1111-111111111111:1-160747
id
125181
1046327
The first slave is broken, and the second slave skipped transactions with ids 1,000,001 to 1,046,326: CQFD / QED!
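For completeness, the broken state of the first slave can be checked with something like the command below; its IO thread should be stopped with an error 1236 from the master (the exact message depends on the MySQL version):
V2$ node1/use <<< "show slave status\G" | grep -E "Slave_IO_Running|Last_IO_Errno|Last_IO_Error"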
