Thursday, January 9, 2020

Puzzled by MySQL Replication (War Story)

Recently, I was puzzled by MySQL replication!  Some weird, but completely documented, behavior of replication had me scratching my head for hours.  I am sharing this war story so you can avoid losing time as I did (and also maybe avoid corrupting your data when restoring a backup).  The exact explanation will come in a follow-up post, so you can also scratch your head trying to understand what I faced.  So let's dive in.

Update 2020-01-27: the follow-up post has been published: A Legacy Behavior of MySQL Corrupting Restored Backups.

I am working on automatic master failover with Orchestrator (the problem is with MySQL, not with Orchestrator).  My test environment consists of a few servers organized as shown below.

+---+      +---+         +---+
| M | ---> | S1| ---+--> | S2|
+---+      +---+    |    +---+
                    |         
                    |    +---+
                    +--> | S3|
                         +---+

My tests consist of causing a failure on M, and I expect Orchestrator to fail over to S1.  But once S1 is the new master, I need to revive M to do more tests, so I took a backup of M while it was replicating from S1, as shown below (a rough command-line sketch of this test cycle follows the diagram).

              +---+
         +--> | M |
         |    +---+
         |         
+---+    |    +---+
| S1| ---+--> | S2|
+---+    |    +---+
         |         
         |    +---+
         +--> | S3|
              +---+
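
For context, this is roughly how I drive that test cycle from the command line.  The host names and the ssh access below are placeholders, and the orchestrator-client calls are only a sketch of the kind of commands involved, not a copy-paste from my environment:

# Crash the current master (placeholder host name and access):
ssh M "sudo kill -9 \$(pidof mysqld)"

# Orchestrator should detect the dead master and promote S1; check what it sees:
orchestrator-client -c replication-analysis
orchestrator-client -c topology -i S1:3306

# Once M is revived and replicating from S1, take a backup of it
# (the dbdeployer test case below uses a simple stop / tar / start).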

After that, I did a takeover with Orchestrator to make M the master again and begin my tests.  And things started going wrong when I restored the backup on M.  After the restore, which gave me the setup above, I was not able to make S2 or S3 replicate from the restored M: there were missing GTIDs!
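
To make the symptom more concrete, this is roughly what pointing S2 at the restored M looks like, and where it failed for me.  The host name and credentials below are placeholders (a sketch, not the exact commands from my environment); with GTID auto-positioning, this is where the missing GTIDs show up:

# On S2, point replication at the restored M (placeholder host, replication
# credentials omitted for brevity):
mysql -h S2 <<< "
  STOP SLAVE;
  CHANGE MASTER TO MASTER_HOST = 'M', MASTER_PORT = 3306, MASTER_AUTO_POSITION = 1;
  START SLAVE"

# Replication refused to start for me; the GTID-related error shows up in the IO thread status:
mysql -h S2 <<< "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Last_IO_Error"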

I thought this was a bug, so I created a test case (details below).  I tested with the latest MySQL 5.7, with an older version of 5.7 (maybe it was a regression...), and even with 5.6: I always got the same missing GTIDs!  Once I convinced myself that it was not a bug, I quickly found the problem.  Can you guess what it is?
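
By the way, re-running the test case below against another version is just a matter of changing the version string given to dbdeployer.  For example (assuming the corresponding binaries are unpacked for dbdeployer; the exact version strings are only examples):

dbdeployer deploy replication mysql_5.7.26 --gtid -c log-slave-updates
dbdeployer deploy replication mysql_5.6.46 --gtid -c log-slave-updates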

If you want to check your answer with me before I publish the solution, either send me a private message in the MySQL Community Slack (I am jfgagne there), send me a direct message on Twitter, or answer this Google Form.  If you want to be credited for the correct solution in the next post, be explicit about it.  And please do not post a comment with the answer; I will moderate comments anyway.

To be continued in a future post...

Test case for reproducing the problem with dbdeployer:


# Create a sandbox:
dbdeployer deploy replication mysql_5.7.28 --gtid -c log-slave-updates

# We do not need s2/node2:
./node2/stop

# Initialize the test environment:
./m <<< "
  CREATE DATABASE test_jfg;
  CREATE TABLE test_jfg.t(
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    v INTEGER)"

# Load data in m:
yes "INSERT INTO test_jfg.t(v) value(1);" | head -n 100 | ./m

# Show the GTID content of m and s1:
./m -N <<< "SELECT 'm ', @@GLOBAL.gtid_executed" &&
  ./s1 -N <<< "SELECT 's1', @@GLOBAL.gtid_executed"
m       00019529-1111-[...]:1-118
s1      00019529-1111-[...]:1-118

# Take a backup of s1/node1:
./node1/stop && tar -zcf node1.tgz node1/ && ./node1/start

# Load more data in m:
yes "INSERT INTO test_jfg.t(v) value(1);" | head -n 100 | ./m

# Failover to s1/node1:
s1_port=$(./s1 -N <<< "SELECT @@port") &&
  s1_user=$(./s1 -N <<< "SELECT User_name from mysql.slave_master_info") &&
  s1_password=$(./s1 -N <<< "SELECT User_password from mysql.slave_master_info") && 
  ./s1 <<< "STOP SLAVE; RESET SLAVE ALL" &&
  ./m <<< "
    STOP SLAVE;
    CHANGE MASTER TO MASTER_HOST = '127.0.0.1', MASTER_PORT = $s1_port, 
      MASTER_USER = '$s1_user', MASTER_PASSWORD = '$s1_password',
      MASTER_AUTO_POSITION = 1;
    START SLAVE"

# Load data in s1/node1 (the new master after the failover):
yes "INSERT INTO test_jfg.t(v) value(2);" | head -n 100 | ./s1

# Simulate a crash of s1/node1: promote m to master by stopping its replication:
./m <<< "STOP SLAVE; RESET SLAVE ALL"

# Load even more data in m:
yes "INSERT INTO test_jfg.t(v) value(1);" | head -n 100 | ./m

# Show the GTID content of m and s1/node1:
./m -N <<< "SELECT 'm ', @@GLOBAL.gtid_executed" &&
  ./s1 -N <<< "SELECT 's1', @@GLOBAL.gtid_executed"
m       00019529-1111-[...]:1-318,\n00019530-2222-[...]:1-100
s1      00019529-1111-[...]:1-318,\n00019530-2222-[...]:1-100

# Stop and restore s1/node1 (you can check the GTID state of the backup a few steps above):
./node1/stop && rm -rf ./node1 && tar -zxf node1.tgz && ./node1/start

# Show the GTID content of m and s1/node1 once s1 has caught up on replication:
# (it should be the same as above, but s1 has nothing from when it was the master!):
./m -N <<< "SELECT 'm ', @@GLOBAL.gtid_executed" &&
  ./s1 -N <<< "SELECT 's1', @@GLOBAL.gtid_executed"
m       00019529-1111-[...]:1-318,\n00019530-2222-[...]:1-100
s1      00019529-1111-[...]:1-318
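
# Optional check (my addition, not part of the original test case): compute exactly
# which GTIDs s1 is missing compared to m.  GTID_SUBTRACT is a built-in function;
# the shell plumbing is only a sketch (gtid_executed can contain a newline, which
# the GTID-set parser accepts as far as I know):
m_gtids=$(./m -N <<< "SELECT @@GLOBAL.gtid_executed") &&
  ./s1 -N <<< "SELECT GTID_SUBTRACT('$m_gtids', @@GLOBAL.gtid_executed) AS missing_in_s1"
# Expected result: the 00019530-2222-[...]:1-100 range that s1 generated when it was the master.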

# Show the table content of m and s1/node1:
# (it should be the same on both nodes, but s1 has nothing from when it was the master!):
./m -N <<< "SELECT 'm ', v, COUNT(v) FROM test_jfg.t GROUP BY v" &&
  ./s1 -N <<< "SELECT 's1 ', v, COUNT(v) FROM test_jfg.t GROUP BY v"
m       1       300
m       2       100
s1      1       300
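
# Optional check (my addition): CHECKSUM TABLE is another quick way to confirm that
# the data diverged between m and s1 (the two checksums should differ):
./m -N <<< "CHECKSUM TABLE test_jfg.t" &&
  ./s1 -N <<< "CHECKSUM TABLE test_jfg.t"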

5 comments:

  1. I am thinking GTID holes. Also, during the switch, if we are still writing to the master, chances are the last committed data might not be replicated. Was your setting innodb_flush_log_at_trx_commit = 1?

    Replies
    1. I made sure there were no writes on the master during the switch. And it is not GTID holes. My setting for innodb_flush_log_at_trx_commit is the default, which I think is 1 (we would have to check the documentation to be sure). But this is not related to trx_commit.

  2. I am a bit confused with the following:
    1. "So I took a backup of M, when it was replicating from S1 as shown below."
    2. "# Take a backup of s1:"
    Where was the backup actually taken?
    Also (please correct me if I am wrong): once you restored the physical backup on master M, you put it in the past compared to slaves s2 and s3. Or have I misunderstood the test?

    In any case, my first guess is that it might be related to server_id and the fact that the slaves are skipping events with their own server ID.

  3. Two good answers in the comments; I will publish them in a few days.
