Tuesday, January 8, 2019

Care when changing MASTER_DELAY with CHANGE MASTER TO (delayed replication)

A few days ago, I stepped into a trap !  This made me lose time for fixing things (and even more for writing this post...).  In the hope that you will avoid my mistake, I am sharing this war story.  I also obviously opened a bug, more about this below.

TL&DR: be careful when using CHANGE MASTER TO MASTER_DELAY = N: it might wipe your relay logs !

As written in the TL&DR, running CHANGE MASTER TO MASTER_DELAY = N wipes the relay logs except if...  This is a documented feature.  This is what we can read in the manual:
Prior to MySQL 5.7.4, CHANGE MASTER TO deletes all relay log files and starts a new one, unless you specify RELAY_LOG_FILE or RELAY_LOG_POS.
In MySQL 5.7.4 and later, relay logs are preserved if at least one of the slave SQL thread and the slave I/O thread is running; if both threads are stopped, all relay log files are deleted unless at least one of RELAY_LOG_FILE or RELAY_LOG_POS is specified.
It is not because it is written in the manual that it makes this right.  If the behaviour of a software, documented or not, is setting traps for users, this is a bug !  This is what happened to me a few days ago:
  • I had a crash of a MySQL intermediate master which corrupted the binary logs (I do not give details about the corruption, you can read one of my last post or come to my FOSDEM talk for details).
  • All the slaves of this intermediate master (let's call it X) had downloaded up to the same position in the binary logs, so I regrouped all the up-to-date slaves under a single slave (let's call it Y), and invoked dark magic and voodoo (mysqlbinlog) to re-slave Y to the master or X.
  • I also had delayed slaves replicating from X, but I could not easily re-slave them to Y because this replication chain is not using GTIDs.  As these were my only delayed slaves, I did not want to process all their relay log yet, so I waited until they processed most of their relay logs (I should have been more patient...) and I set their MASTER_DELAY to 0.  This was a mistake...
I ran STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 0;
START SLAVE;
 this was a mistake...

This was a mistake because I did not think that this would wipe my relay logs.  Also, to make things worse, I had purged the relay logs from X as they were corrupted (I kept a copy, but it was not easily usable).  I also stopped the IO Thread on the delayed slaves because I did not want them to reconnect to their master where I purged binary logs.  This created the perfect situation to make my delayed slave close to unrecoverable (I could have recovered them with dark magic and voodo, but re-cloning them was easier).

From what I gathered reading the documentation (cited above), I would have had the expected result if the IO Thread was still running (only if using MySQL 5.7.4+, so does not apply to 5.6), or if I would have specified RELAY_LOG_FILE and RELAY_LOG_POS.  I reported Bug#93843: Setting MASTER_DELAY to 0 on a stopped delayed slave wipes the relay-log.  Please subscribe to it and click the affect me link if you think this should be fixed.

Some could argue that my bug is actually a feature request, but as I said above and IMHO, a trap in a software, even if documented, should be considered a bug.  Feel free to try changing my mind by commenting below and/or sharing a use case for changing MASTER_DELAY alone needs purging the relay logs.

One last thing, there was a feature request opened in July 2017 about this (Bug#87249: Relay logs shouldn't be purged if options unrelated in CHANGE MASTER is executed): it is a little disappointing that this was already found by someone else and that this trap was not removed from MySQL for 17 months.  Hopefully, this trap will be gone soon.

Update 2019-01-10: I wrote "this trap was not removed" in paragraph above.  This might be a little hard as this was not identified as a trap in Bug#87249.  Even if Bug#93843 was closed as a duplicated of Bug#87249, the trap/bug has been acknowledge in Bug#93843.

1 comment:

  1. MariaDB is affected. Primitive check:

    MariaDB [test]> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Host: 127.0.0.1
    Master_User: root
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: ao756-bin.000001
    Read_Master_Log_Pos: 154
    Relay_Log_File: ao756-relay-bin.000003
    Relay_Log_Pos: 375
    Relay_Master_Log_File: ao756-bin.000001
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    ...
    SQL_Delay: 10
    SQL_Remaining_Delay: NULL
    Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
    1 row in set (0,00 sec)

    MariaDB [test]> stop slave;
    Query OK, 0 rows affected (0,02 sec)

    MariaDB [test]> change master to master_delay=0;
    Query OK, 0 rows affected (0,17 sec)

    MariaDB [test]> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State:
    Master_Host: 127.0.0.1
    Master_User: root
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: ao756-bin.000001
    Read_Master_Log_Pos: 154
    Relay_Log_File: ao756-relay-bin.000001
    Relay_Log_Pos: 4
    Relay_Master_Log_File: ao756-bin.000001
    Slave_IO_Running: No
    Slave_SQL_Running: No
    ...
    SQL_Delay: 0
    SQL_Remaining_Delay: NULL
    Slave_SQL_Running_State:
    1 row in set (0,00 sec)

    As you can see Relay_Log_File changed from ao756-relay-bin.000003 to ao756-relay-bin.000001, it was reset.

    Also:

    MariaDB [test]> \!sudo ls data/*relay*
    data/ao756-relay-bin.000001 data/ao756-relay-bin.index data/relay-log.info

    ReplyDelete