In a previous post, I listed some sequences of commands that you should not run on a MariaDB slave that is lagging and which is using the GTID protocol. Those are the following (do not run them, it's a trap):
- "STOP SLAVE; START SLAVE UNTIL ...;",
- or "STOP SLAVE; START SLAVE;" (to remove an UNTIL condition as an example),
- or "STOP SLAVE; SET GLOBAL slave_parallel_threads=...; START SLAVE;",
- and maybe others.
If those are bad, what should you run to achieve the same result? Well, I believe these are much better:
- "STOP SLAVE SQL_THREAD; START SLAVE SQL_THREAD UNTIL ...;",
- or "STOP SLAVE SQL_THREAD; START SLAVE SQL_THREAD;" (to remove an UNTIL condition as an example),
- or "STOP SLAVE SQL_THREAD; SET GLOBAL slave_parallel_threads=...; START SLAVE SQL_THREAD;".
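As a rough sketch of the first sequence, here it is spelled out with a GTID-based UNTIL condition. The GTID position '0-1-1000' is a hypothetical value chosen only for illustration; substitute the position you actually want the SQL thread to stop at.

```sql
-- Stop only the SQL thread; the IO thread is left running.
STOP SLAVE SQL_THREAD;

-- Restart only the SQL thread, up to a target position.
-- '0-1-1000' is a hypothetical GTID position, not a value from this post.
START SLAVE SQL_THREAD UNTIL master_gtid_pos = '0-1-1000';

-- Check that Slave_IO_Running still says Yes, meaning the IO thread
-- was never restarted.
SHOW SLAVE STATUS;
```

The point of this form is that the IO thread is never touched, so nothing on the slave should trigger a restart of the download from the master.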
OK, now you can guess that the problem is with the IO_THREAD, but what exactly is wrong? Well, it is the following outbound bandwidth consumption on the master of the slave where I ran "STOP SLAVE; START SLAVE UNTIL ...;":
And I can tell you that maxing out the bandwidth of a master for more than 60 minutes is never good!
When the IO thread is started with GTIDs enabled, a MariaDB 10.0.19 slave will wipe its relay logs and re-download them from the position of the SQL_THREAD (I think this is part of replication crash safety). MySQL 5.6 (and I guess 5.7) has similar problems with relay_log_recovery (I already blogged about this on the Booking.com dev blog under Better Crash-safe replication for MySQL).
In this particular case, the slave was lagging by ~4 days and there were more than 250 GB of unprocessed relay logs on that slave. Re-downloading the corresponding binary logs generated the above graph.
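Before stopping anything on a lagging slave, it can help to see how much is at stake. A minimal sketch, assuming a MariaDB 10.0 slave with GTIDs enabled; the fields named in the comments are the ones I would look at, and Relay_Log_Space is only a rough upper bound on what would be downloaded again:

```sql
-- GTID position of the last transaction applied by the SQL thread.
SELECT @@gtid_slave_pos;

-- In the output, look at:
--   Using_Gtid            (whether the GTID protocol is in use),
--   Seconds_Behind_Master (how far the slave is lagging),
--   Relay_Log_Space       (total size in bytes of the relay logs on disk).
SHOW SLAVE STATUS;
```

If Relay_Log_Space is in the hundreds of gigabytes, as it was here, restarting the IO thread is the thing to avoid.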
With MariaDB, if you are careful about which commands you use, you can avoid the bad side effects of GTIDs when restarting replication on a lagging slave, as long as you do not restart mysqld. But both MariaDB (with GTIDs enabled) and MySQL (with crash-safe replication enabled) risk overloading the network interface of the master when mysqld is restarted on a lagging (or delayed) slave.
Which one of MariaDB.com or Oracle will fix this first? Time will tell...
Some related feature requests (upvote or subscribe if you think they are important):