Wednesday, April 8, 2015

Even Easier Master Promotion (and High Availability) for MySQL (no need to touch any slave)

Dealing with the failure of a MySQL master is not simple.  The most common solution is to promote a slave as the new master, but in an environment with many slaves, the asynchronous nature of replication gets in your way.  The problem is that each slave might be in a different state:
  • some could be very close to the dead master,
  • some could be missing the latest transactions,
  • and some could be far behind (lagging, delayed slaves, or slaves in maintenance).

Before promoting a new master and reorganizing the slaves in a new replication topology, making all the slaves identical is very important when you care about data consistency (and most of us do care).  I call this step leveling the slaves.  Many solutions exist to do that:
  • MHA parses and analyses the relay logs of the slaves to apply what is needed by each slave.
  • GTIDs (both MySQL and MariaDB) skip the leveling step if you promote the most up to date slave as the new master (the slaves of the new master will level themselves with the binary logs of the new master).  If the most up to date slave is not a good candidate master (not in the right vlan, not in the same data center, ...), you will need to level that candidate master with the most up to date slave (this is easy).  But for GTID to be useful in this way, log-slave-updates is needed on most (if not all) slaves, and this comes with costs.
  • Binlog Servers store a copy of the binary logs from the master; when the master is gone but a Binlog Server is still there, using this copy to level the slaves is trivial.
But all the solutions above need reorganization of the replication topology (repointing all the slaves to the new master).  Even after leveling, this is not without pain, especially if you have a lot of slaves.  This can be even more painful if some slaves are unreachable (rebooting, stopped for backup, maintenance, ...), and dealing with delayed slaves is also complex.
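
To make the leveling step more concrete, here is a minimal sketch in Python of the bookkeeping involved: given the master binary log position each slave has executed, find the most up to date slave and list the slaves that still need to catch up to it.  This is only an illustration, not how MHA or the Binlog Servers implement it.  The names BinlogPosition, parse_position and leveling_plan are hypothetical, and the sketch assumes that binary log file names end with a numeric suffix (such as binlog.000042) and that the positions were collected beforehand, for example from the Relay_Master_Log_File and Exec_Master_Log_Pos fields of SHOW SLAVE STATUS.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True, order=True)
class BinlogPosition:
    """Position in the master's binary logs (hypothetical helper type).

    With order=True, instances compare by (file_index, offset), so a
    later file always wins over a larger offset in an earlier file.
    """
    file_index: int  # numeric suffix of the binlog file, e.g. 42 for "binlog.000042"
    offset: int      # byte offset inside that file


def parse_position(file_name: str, offset: int) -> BinlogPosition:
    """Build a position from a file name such as "binlog.000042" and an offset.

    Assumption: the file name ends with ".<number>".
    """
    return BinlogPosition(int(file_name.rsplit(".", 1)[1]), offset)


def leveling_plan(
    executed: Dict[str, BinlogPosition],
) -> Tuple[str, Dict[str, Tuple[BinlogPosition, BinlogPosition]]]:
    """Return the most up to date slave and, for every slave behind it,
    the (from, to) range of master binlog positions it is missing."""
    most_advanced = max(executed, key=executed.get)
    target = executed[most_advanced]
    behind = {
        name: (pos, target)
        for name, pos in executed.items()
        if pos < target
    }
    return most_advanced, behind
```

Calling leveling_plan with a dictionary mapping slave names to their executed positions returns the name of the most advanced slave together with the range each other slave is missing.  Actually applying that range (from relay logs, from the binary logs of another slave, or from a Binlog Server) is the hard part that the solutions above address, and it is not covered by this sketch.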

There is a better way!  The solution I have in mind:
  • does not need GTIDs,
  • does not need log-slave-updates on any node (no intermediate master),
  • is fully redundant with zero operation in most failure scenarios,
  • does not need slave repointing in any failure scenario (zero operation on all slaves except the one that will be the new master),
  • needs some Binlog Servers.
I will present this solution in my talk at Percona Live Santa Clara on Wednesday April 15, 14:00, Ballroom G.  I will be happy to meet you there, answer your questions and talk to you about Binlog Servers at Booking.com.

If you cannot make it to the talk but want to know more, I will be easy to find at the Booking.com booth (#315).  I will also be at the community dinner at Pedro’s on Tuesday evening.
