Saturday, December 5, 2020

Fixing low durability GTID replica with Voodoo

At my FOSDEM talk earlier this year, I gave a trick for fixing a crashed GTID replica.  I never blogged about this, so now is a good time.  What is pushing me to write about this today is my talk at MinervaDB Athena 2020 this Friday.  At this conference, I will present more details about MySQL replication crash safety.  So you know what to do if you want to learn more about this subject.  Let's now talk about voodoo.

(This post uses the new MySQL terminology: primary, source and replica.  You can read the Oracle terminology update for more details.)

So in my FOSDEM talk – The consequences of sync_binlog != 1 – I gave a trick for fixing a GTID replica after a crash.  But why is such a replica in need of fixing?  It is because in some situations, replication is not crash-safe, and one of these cases is when using GTIDs with low durability settings and log_slave_updates [sic].  If you want to know more about replication crash safety, you can attend my talk at MinervaDB Athena, and you can find the abstract on Facebook, Twitter or LinkedIn.

When replication is not crash-safe, you need to restore a replica from backups after a crash.  This trick allows avoiding such a restore on a GTID replica that uses low durability settings and log_slave_updates [sic], if and only if this replica is doing in-order transaction committing.  That is a lot of conditions, so let's take them one at a time.

By low durability settings above, I mean a replica using sync_binlog != 1 and innodb_flush_log_at_trx_commit != 1.  This configuration is common for getting extra replication throughput, especially in the cloud.  If you want to know more about this performance optimization and its consequences, watch my FOSDEM talk or read my blog post series.
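As a quick check, something like the below should tell you whether a replica is running with these low durability settings (this assumes the dbdeployer sandbox layout used further down in this post; any mysql client would do):
./node1/use <<< "
  show global variables where variable_name in
    ('sync_binlog', 'innodb_flush_log_at_trx_commit')"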

By in-order transaction committing above, I mean a replica that is committing transactions in the same order as they appear in its source's binary logs.  This is always the case for single-threaded replication (when slave_parallel_workers [sic] is less than 2).  This is not the case with parallel replication in MySQL 5.6, nor with the DATABASE parallel replication type (slave_parallel_type [sic]) in MySQL 5.7 and 8.0.  With the LOGICAL_CLOCK type, the replica also does out-of-order transaction committing by default (so voodoo cannot be used here), unless the slave_preserve_commit_order [sic] variable is set to ON (in which case voodoo can be used).

The trick can only be used with single-threaded replication
or with the LOGICAL_CLOCK parallel replication type
with slave_preserve_commit_order set to ON
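As an illustration, something like the below should tell you whether a replica qualifies for the trick (this uses the MySQL 5.7 variable names, containing slave [sic], and the dbdeployer sandbox layout from the tests further down):
./node1/use <<< "
  show global variables where variable_name in
    ('slave_parallel_workers', 'slave_parallel_type',
     'slave_preserve_commit_order')"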

Here is the trick:
  • Make sure the replica starts with replication disabled (skip-slave-start [sic])
  • Note the GTID state (SHOW MASTER STATUS [sic])
  • Wipe the binary logs (RESET MASTER [sic])
  • Use file+pos. replication (CHANGE MASTER TO MASTER_AUTO_POSITION = 0 [sic sic])
  • Start replication (START SLAVE [sic])
  • After running a few transactions, stop replication and restore the GTID state
Before continuing, please note that this method is of my own invention.  It is not supported.  So apply good judgment when using it and make sure you understand what you are doing.

I went a little fast on the "restore the GTID state" part above, and it is on purpose.  If you do not understand how this works, you probably should not use this trick.  If you want to understand everything there is to know about this, attend my talk at MinervaDB Athena.  :-)
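Still, to make the steps a little more concrete, here is a sketch of what they look like on a replica (this is only a sketch: the GTID set below is a placeholder, and the dbdeployer demonstration further down shows how the real value is computed):
# Sketch only (the replica is assumed to have been started with skip-slave-start [sic]):
./node1/use <<< "
  show master status\G
  reset master;
  change master to master_auto_position = 0;
  start slave"
# After a few transactions have been replicated, restore the GTID state and
# go back to auto-positioning (RESET MASTER [sic] again because setting
# gtid_purged needs an empty gtid_executed; the GTID set is a placeholder):
./node1/use <<< "
  stop slave;
  show master status\G
  reset master;
  set global gtid_purged = '<source_uuid>:1-<last_replicated_transaction>';
  change master to master_auto_position = 1;
  start slave"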

But does this trick really work?  Let's see with dbdeployer.  First, we create a replicated sandbox with MySQL 5.7.32 and our specific settings:
dbdeployer deploy replication mysql_5.7.32 --gtid -c log-slave-updates \
  -c "sync_binlog = 0" -c "innodb_flush_log_at_trx_commit = 2"
Now let's stop both replicas (node1 and node2), initialize the primary with some transactions, and do a clean reboot to make sure everything is persisted on disk (I am doing this on a VM, and you should also test in an environment that you can crash, because we will kernel-panic Linux in the next steps):
cd ~/sandboxes/rsandbox_mysql_*; \
  ./node1/stop& ./node2/stop& wait &&
  ./master/use <<< "
    create database test_jfg;
    create table test_jfg.t(id bigint not null primary key)" &&
  seq -f "insert into t value(%.0f);" 1 500000 | ./master/use test_jfg &&
  ./master/stop && sudo reboot
After the reboot, start both the primary and the first replica (node1), start replication, wait a little, and crash the VM:
cd ~/sandboxes/rsandbox_mysql_*; \
  ./master/start& ./node1/start --skip-slave-start& wait &&
  ./node1/use <<< "start slave; do sleep(2); show slave status\G" &&
  sudo bash -c "echo c > /proc/sysrq-trigger"
Then reconnect, restart everything, and apply the first steps of the trick:
cd ~/sandboxes/rsandbox_mysql_*; \
  ./master/start& ./node1/start --skip-slave-start& wait &&
  ./node1/use <<< "
    show master status\G reset master;
    change master to master_auto_position = 0;
    start slave; do sleep(1); stop slave; show slave status\G
    select count(*), max(id) from test_jfg.t"
The last steps above (SHOW SLAVE STATUS [sic] and SELECT [...]) are not part of the trick.  If replication broke, you would see it in SHOW SLAVE STATUS.  And if both numbers returned by the SELECT are the same, it means there is no silent data corruption (as the ids were inserted sequentially starting at 1, count(*) can only equal max(id) if no row is missing).  If everything is good, you can finish applying the trick and crash the replica again (the Bash substitution ${egs/:*-/:1-} below rewrites the Executed_Gtid_Set from source_uuid:N-M to source_uuid:1-M, which is the full GTID state to restore):
sms="$(./node1/use <<< "show master status\G")" &&
  egs=$(awk '$1 =="Executed_Gtid_Set:"{print $2}' <<< "$sms") &&
  ./node1/use <<< "
    reset master; set global gtid_purged='${egs/:*-/:1-}';
    change master to master_auto_position = 1;
    start slave; do sleep(2); show slave status\G" &&
  sudo bash -c "echo c > /proc/sysrq-trigger"
And you can loop back to the trick (two steps above) until you get bored.  If you get a broken replica or if the numbers do not match, the trick is wrong, and please let me know.

If you want to convince yourself that not using the trick would result in replication breakage, you can do this with the second replica (node2):
cd ~/sandboxes/rsandbox_mysql_*; \
  ./master/start; ./node2/start --skip-slave-start &&
  ./node2/use <<< "start slave; do sleep(2); show slave status\G" &&
  sudo bash -c "echo c > /proc/sysrq-trigger"
Then reconnect and try resuming replication:
cd ~/sandboxes/rsandbox_mysql_*; \
  ./master/start && ./node2/start &&
  ./node2/use <<< "do sleep(1); show slave status\G stop slave"
The above should show a broken replica (it broke for me on the first try).  If it is not broken for you, run the commands below and repeat the above to give the replication breakage another try:
./node2/use <<< "start slave; do sleep(2); show slave status\G" &&
  sudo bash -c "echo c > /proc/sysrq-trigger"
Note that you will not be able to break replication this way with MySQL 8.0.17 and above.  Something changed in 8.0.17, but Bug#70659 and Bug#92109 have not been closed.  I asked for details about this in both bugs; to be followed.
