Tuesday, February 26, 2019

MySQL Master High Availability and Failover: more thoughts

Some months ago, Shlomi Noach published a series about Service Discovery.  In his posts, Shlomi describes many ways for an application to find the master.  He also gives detail on how these solutions cope with failover to a slave, including their integration with Orchestrator.

This is a great series, and I recommend reading it to everybody implementing master failover, with or without Orchestrator, and even if you are not fully automating the process yet.  Taking a step back, I realized that service discovery is only one of the five parts of a full MySQL Master Failover Strategy, and this post is about these five parts.  In some follow-up posts, I might analyze some deployments using the framework presented in this post.

UPDATE 2019-03-03: as pointed out by Shlomi in the comments, his post "MySQL High Availability tools" followup, the missing piece: orchestrator is a complementary discussion of some of the topics covered in this post.

I consider this a Request for Comments.  I do not claim my thoughts are fully mature or that I am covering all aspects of MySQL master high availability and failover.  I just think that this presentation is more complete than anything I have read so far on the subject, so let’s dive in.

I already mentioned the MySQL Master Failover Strategy/Framework at the GrowIT Conference (Novi Sad, Serbia) in my talk MySQL Scalability and Reliability for Replicated Environment (slides & recording).  At slide #27, I present the five parts which are 1) plan, 2) when, 3) tell, 4) protect, and 5) fix.  Being a little more explicit, we can name them as follows:
  1. Failure detection (when)
  2. Reaction to failures (plan or how)
  3. Service discovery (tell)
  4. Fencing (protect)
  5. Dealing with split-brain (fix)
You can see that the plan and the when are in a different order.  The list above presents things as they happen during a failure.  In my talk, I put plan first because it is something that strongly impacts the rest of the strategy, and for this reason, I describe it first below.

Reaction to Failures


The way we react to a failure shapes much of the implementation of MySQL master high availability.  It can be as simple as restarting the mysqld process (or rebooting the server hosting the MySQL database), or as complex as failing-over to a slave in another datacenter.  Neither of the two allows recovering from a dropped table or schema, so restoring a backup or using delayed replication should also be included in a complete data availability strategy, but I am not covering these two in more detail in this post.

The way of reacting to failures is the most important factor impacting the Recovery Point Objective (RPO).  Said otherwise, it largely determines how much data is missing after recovery.  Restoring a backup implies losing a lot (potentially hours), and this can be mitigated by saving and applying the binary logs (maybe with Binlog Servers).  Promoting a slave as the new master loses very little (only the last few committed transactions that were not yet replicated).  Failing-over to another node in an XtraDB Cluster (Galera) or a Group Replication deployment loses no data, and it is the same for semi-synchronous replication when rpl_semi_sync_master_wait_point is set to AFTER_SYNC (also known as lossless).

Another important property of a failover strategy is the Recovery Time Objective (RTO).  Waiting for a server to reboot, or for MySQL crash recovery to complete (including InnoDB), can take a long time and this can be a major part of the RTO (moreover, and unrelated to the RTO, a MySQL instance has a cold cache after a restart, and this leads to degraded performance until all the data is back in RAM).  However, compared to other factors impacting the RTO, a failover to a slave (or to another node in a cluster) is relatively quick: another node is usually ready to accept writes with minimal reconfiguration, and failing-over to a slave in a small or well-configured replication setup takes only a few seconds for repointing other slaves (and in both cases, the cache of the new master is warm).  Other factors impacting the RTO in a failover solution are Failure Detection, Service Discovery and Fencing, and we will come back to these in the next sections.

Finally, to conclude this section, we have to highlight that the reaction to failures impacts the requirement for dynamic service discovery (or the absence of such requirement).  If, after a failure, we restart MySQL or reboot the server hosting the database, there is no need for dynamic service discovery.  But if we fail-over to a slave, the application must be informed that the master has changed.  As we will soon discover, all the parts of a MySQL master high availability strategy are interconnected, and an adjustment in one impacts the requirement and implementation of the others.

Failure Detection


In the previous section, we discussed the subject of reacting to a failure.  But having a plan is useless if we do not know when a failure happens.  And this is where Failure Detection comes into play.

Reliable failure detection is hard

The first thing to realize when implementing failure detection is that this is a very complex and hard subject: anyone claiming to have a simple, reliable and efficient failure detector is naive.  The same way you would not take seriously someone claiming to have solved the halting problem, you should dismiss people saying they have a reliable failure detector (they might have solved failure detection for what they think is their class of failures, but they probably forgot something, and they surely could not have solved it for the most general case).  Implementing failure detection is always a trade-off between:
  • time taken for failure detection
  • probability of false positive
  • probability of false negative
We previously talked about the RTO saying that the way we react to failure is one of the factors impacting the total time taken for restoring availability of the master; the time taken to detect a failure is also a major part of the RTO.
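To make this trade-off concrete, here is a minimal Python sketch of a detector that declares the master dead after a number of consecutive failed probes.  It is only an illustration, not how Orchestrator or any other tool works: the class name, the threshold and the probe interval are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ThresholdDetector:
    """Declare the master dead after `threshold` consecutive failed probes."""
    threshold: int              # hypothetical tuning knob
    interval_s: float           # hypothetical delay between two probes
    consecutive_failures: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True once a failure is declared."""
        if probe_ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold

    def worst_case_detection_s(self) -> float:
        """Approximate delay when the master dies right after a good probe."""
        return self.threshold * self.interval_s


def declares_failure(detector: ThresholdDetector, probes: Iterable[bool]) -> bool:
    """Feed a sequence of probe results; True if a failure is ever declared."""
    return any([detector.observe(ok) for ok in probes])


# A transient glitch: two failed probes, then the master answers again.
glitch = [True, False, False, True, True]
eager = ThresholdDetector(threshold=1, interval_s=1.0)
patient = ThresholdDetector(threshold=3, interval_s=1.0)
eager_fires = declares_failure(eager, glitch)      # quick, but a false positive
patient_fires = declares_failure(patient, glitch)  # no false positive, but slower
```

The eager detector reacts after one probe interval but fails over on a transient glitch, while the patient one ignores the glitch and pays for it with a detection time three times longer.  Raising the threshold further also increases the risk of a false negative on a master that flaps without ever failing enough probes in a row.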

Failure detection is, in most cases,
the biggest impacting factor of the RTO

Sometimes, the consequence of a false positive is so high that it is ok to spend a lot of time detecting a failure.  In other cases, the cost of being down is so high that we do not want any false negative and we want to minimize the RTO, so the flip-side is coping with false positives.  And false positives, combined with imperfect service discovery, introduce the need for fencing and dealing with split-brains, but I am getting ahead of myself.

Some failures are easy to detect: an example is a mysqld process that crashed.  And depending on the way things are configured, we can make sure that it does not come back (fencing).  This assumes we have a view on the Linux server running MySQL, which is not always possible, either because of the environment (like in Cloud environments or other database hosting solutions) or because the failure prevents connecting to Linux.

Sometimes, MySQL comes back from the dead…

Some failures are harder to ascertain: an example is when a server is unavailable due to a CPU overload, a network problem, etc.  In this case, and after triggering failure reaction, the old master might come back from the dead if the failure was transient.  We will discuss the consequence of this in more detail in the sections about service discovery, fencing and dealing with split-brains.

Some failures depend on the observer: an example is a network partition.  Some nodes in the partition are still able to access the master, but others out of the partition are not.  This situation is one of the most complex to deal with in terms of recovery procedure for ensuring data consistency.

Service Discovery


Service discovery is the process by which the application knows how to connect to an endpoint/service, and the service that we are discussing in this post is the MySQL master.  Service discovery can be static (an IP address in a configuration file) or dynamic (a DNS name updated when switching master).  Shlomi is already doing a very good Service Discovery Inventory in his series, so I am referring you to his posts for more details about different ways of implementing this.

In this section, I cover the properties of service discovery.  As it involves many components (application nodes, network, DNS, proxies, service registries, …), service discovery is a complex distributed system on its own.  And as for all distributed systems, service discovery implies many choices and trade-offs in its implementation.  Those choices impact some fundamental characteristics of service discovery:
  • Global or local
  • Centralized or distributed
  • Atomic or convergent (eventual consistency)
  • Performance behavior (introducing latency or not)
  • Impact on RTO
  • Failure modes
  • Ease of debugging
And as we pointed out in the section about reaction to failures, some strategies do not need dynamic service discovery.  If your plan is to restart a crashed mysqld process, to reboot the server hosting MySQL, or to restore a backup, informing the application of the location of the new master dynamically is not needed.  In those cases, you can easily do with an IP address in a configuration file, and in the unlikely situation where this IP changes (restoring a backup on another server), you can change the configuration using your normal (and potentially slow) deployment procedure.  However, this usually means that the RTO is high and that the master is unavailable for a significant time after a failure, which is not always satisfactory.

To get a very short downtime when a master fails (small RTO), you need to fail-over to another node, which implies dynamic service discovery.  The way you implement service discovery also impacts your overall failover strategy.  Taking a Local Service Discovery example, you can use a floating IP (VIP) for the master, but then you are limited to failing-over to a node in the same network segment (both old and new master must have the same gateway as their first IP routing hop).  A floating IP in the same Ethernet segment also usually means that you cannot fail-over to another datacenter, so this does not solve disaster recovery (also, it does not solve unavailability caused by a network equipment failure).  Floating IPs are very simple to understand, which means they are easy to debug, but they come with the risk of misbehaving if two nodes claim the same IP.  And to prevent this problem, we could avoid claiming the IP on boot, or maybe disconnect the old master at the switch level.  This operation at the network level is one way to fence the old master, but we will come back to fencing later in this section and in the following section.

To be able to do the most flexible failover, including to another node in a remote datacenter, we need a Global Service Discovery.  The most common example of such a system is DNS.  I do not want to get into the details, but usually, updating a DNS entry is done on a master server, and this change propagates rapidly to all other DNS servers.  This means that a DNS change is not atomic, which implies that some application servers might connect to the new master while others still connect to the old master.  In this regard, DNS is a distributed and eventually consistent way of implementing service discovery.  Moreover, there is the TTL property of DNS entries that can get in the way, but now I am diving too far into the details of DNS implementation.
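The non-atomic behavior can be illustrated with a small Python simulation of client-side caching.  This is only a sketch under simplifying assumptions: the resolver class, the DNS name, the addresses and the 30-second TTL are all hypothetical, and real DNS has more layers (secondary servers, recursive resolvers, application caches).

```python
class CachingResolver:
    """A client-side resolver that caches an answer until its TTL expires."""

    def __init__(self, authoritative: dict, ttl_s: float):
        self.authoritative = authoritative  # shared "zone": name -> address
        self.ttl_s = ttl_s
        self._cache = {}                    # name -> (address, expiry time)

    def resolve(self, name: str, now: float) -> str:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]                # possibly stale answer
        address = self.authoritative[name]
        self._cache[name] = (address, now + self.ttl_s)
        return address


# Hypothetical name and addresses: 10.0.0.1 is the old master, 10.0.0.2 the new.
zone = {"master.db.example": "10.0.0.1"}
app_a = CachingResolver(zone, ttl_s=30)
app_b = CachingResolver(zone, ttl_s=30)

app_a.resolve("master.db.example", now=0)   # app_a caches the old master
zone["master.db.example"] = "10.0.0.2"      # failover: the entry is updated

seen_by_a = app_a.resolve("master.db.example", now=10)        # still cached
seen_by_b = app_b.resolve("master.db.example", now=10)        # fresh lookup
seen_by_a_later = app_a.resolve("master.db.example", now=31)  # TTL expired
```

At time 10, the two application servers disagree on who the master is, and they only converge once the cached entry of the first one expires.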

DNS is not an atomic way to implement service discovery

Back to the service discovery properties of DNS, one of its failure modes is network partition.  If a DNS slave server is co-located in a partition with the MySQL master, the DNS update cannot reach the slave, and when the partition is resolved (a router or a firewall comes back online), there is a short period of time during which the DNS slave still points to the old master, and this can lead to writes happening there instead of on the new MySQL master (I have seen that happening in production: not fun !).  This again looks like a situation that could be solved by fencing, and it also introduces the split-brain problem where writes are happening on two unsynchronized nodes: we will come back to both later in this section and in the following sections.

So we saw that DNS is a way to implement global service discovery in a distributed way, what about doing it in a centralized way ?  One architecture achieving this is using proxies (and this is something I am implementing right now with ProxySQL: I will surely blog about this in the future).  Obviously, using a single proxy would not work as it would be a single point of failure, but by being smart, I think two proxies can be enough for managing the simple failure of a master, and four additional proxies are enough to manage complex network partitions.  This centralized deployment (if we can call two to six servers a centralized solution) can totally avoid the need for fencing and for dealing with split-brains, but at the cost of higher latency for statement routing in the proxy layers.  Proxies also hide the source IP of the application server making the connection to MySQL, which makes things harder to debug.  This could be fixed using connection attributes, but I do not know of any proxies implementing this yet.

There are intermediary solutions involving service registries like ZooKeeper, Consul or etcd, and also sometimes involving proxies.  Shlomi describes some of these in his series, so I do not give details about them here.  I only mention that those have the benefit of reducing the latencies introduced by the centralized solutions, but with the drawback of reintroducing the need for fencing and dealing with split-brains.  They are also much more complex and harder to understand, which makes their failure modes harder to predict and debug.

I have already mentioned fencing and split-brain many times in this section, and before describing these in detail in the next two sections, I pause here to introduce them.  The need for fencing and dealing with split-brains is introduced by imperfect failure detection combined with imperfect service discovery.  If we had perfect failure detection (and this is impossible, remember the halting problem, but let’s do this thought experiment), we would never need to make sure the old master does not receive writes (fencing) and we could not have a split-brain.  The same applies for perfect service discovery, and we saw above that it is possible in the centralized proxy solution, in which case fencing and dealing with split-brains are not needed.  But when false positives in failure detection might happen (in most cases), and if service discovery is not perfect, there is a risk of some application servers writing to the old master while some others are writing to the new master, and this is a split-brain.  To avoid a split-brain, one solution is to make sure the old master is unavailable, and this is called fencing.  And when fencing is not perfect, we might need to reconcile data from both the old and the new master, and this is what I call dealing with split-brains.  Those subjects are discussed in the next two sections.

Fencing


As quickly explained in the previous section, the need for fencing the old master comes from the combination of imperfect failure detection and imperfect service discovery.  Improving failure detection is very hard to do, and perfect service discovery introduces latencies that might be unacceptable in some cases.  So sometimes, when we want to avoid these latencies, fencing is a good way to prevent split-brains.  Best effort fencing might be quick, but fencing with high confidence could take time, so depending on the effort invested in this, fencing might be a non-negligible part of the RTO.

There are many ways for fencing a MySQL master, one of which is avoiding restarting the mysqld process after a crash, or always starting MySQL in READ-ONLY (or SUPER READ-ONLY) or OFFLINE mode.  But this would not cover the case where a server is unreachable and could come back from the dead.

For an unreachable operating system (Linux), powering off the hardware via a LOM device might be possible (iLO for HP servers, DRAC for Dell servers, IPMI for the generic interface).  An option for virtual machines is to shut them down at the hypervisor level (or using APIs for Amazon Web Services, Google Cloud Platform or Microsoft Azure).  Those solutions are known as STONITH (Shoot The Other Node In The Head).  Both API and hardware solutions are not fully satisfying because they are not 100% reliable and they might not work for network partitions.  Disconnecting a server at the switch level has the same problems.

Another way of fencing is to put a circuit breaker in front of the master.  This circuit breaker can be a smart proxy, deployed on the same server as the database, which holds a lock in ZooKeeper (or other distributed synchronization solutions like Consul or etcd).  This smart proxy also monitors the health of the database, and if the database misbehaves, the proxy blocks connections and releases the lock allowing another node to become the master (this is similar to STONITH to make sure a dead MySQL stays dead).  In the case of a server failure, the lock expires after some time (typically 30 seconds), so this adds to the RTO, but this guarantees that there is only one master.  But in the case of a network partition, the proxy keeps accepting connections during this 30-second timeout, which allows writes to reach the old master.  In this case, this is not a full split-brain as the two masters are not alive at the same time, but those writes on the old master are lost on failover (they never reached the slave), so the RPO includes those 30 seconds, or data needs to be reconciled (details about this in the split-brain section).
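The lock-with-expiry mechanism can be sketched in a few lines of Python.  This is an in-memory stand-in and not a real ZooKeeper, Consul or etcd client: the class names, the heartbeat method and the single shared clock are assumptions made for the illustration (a real deployment also has to deal with clock drift between nodes).

```python
class LeaseLock:
    """In-memory stand-in for a lock with a TTL in a coordination service."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.holder = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Take the lock if it is free or expired, renew it if `node` holds it."""
        if self.holder in (None, node) or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl_s
            return True
        return False

    def release(self, node: str) -> None:
        if self.holder == node:
            self.holder = None


class CircuitBreakerProxy:
    """Local proxy letting writes through only while it holds a valid lease."""

    def __init__(self, node: str, lock: LeaseLock):
        self.node = node
        self.lock = lock
        self.lease_valid_until = 0.0

    def heartbeat(self, db_healthy: bool, now: float, can_reach_lock: bool = True) -> None:
        if not db_healthy:
            # Block connections at once and free the lock for another node.
            if can_reach_lock:
                self.lock.release(self.node)
            self.lease_valid_until = 0.0
            return
        if can_reach_lock and self.lock.acquire_or_renew(self.node, now):
            self.lease_valid_until = now + self.lock.ttl_s

    def accepts_writes(self, now: float) -> bool:
        return now < self.lease_valid_until


lock = LeaseLock(ttl_s=30)
old = CircuitBreakerProxy("old-master", lock)
new = CircuitBreakerProxy("new-master", lock)

old.heartbeat(db_healthy=True, now=0)            # old master takes the lock
new.heartbeat(db_healthy=True, now=1)            # candidate cannot take it
# Network partition: the old master cannot renew its lock anymore.
old.heartbeat(db_healthy=True, now=10, can_reach_lock=False)
old_writable_during_lease = old.accepts_writes(now=20)
new.heartbeat(db_healthy=True, now=30)           # the lease has expired
old_writable_after_lease = old.accepts_writes(now=30)
new_writable_after_lease = new.accepts_writes(now=30)
```

In this simulation, the partitioned old master still accepts writes at time 20, and the candidate only becomes writable once the lease has expired, so there is never more than one writable master.  The writes accepted between the start of the partition and the expiry are the ones that are lost or need to be reconciled.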

A good way to fence the master, and this is known to be used by Facebook (see “node fence dead master” at Slide #22 of the talk Binlog Server at Facebook), is to use lossless semi-synchronous replication to prevent commit from succeeding on the old master.  This is a good way to prevent writes from completing, but this does not prevent reads, and this blocks clients in commit (they eventually time out).  We need to realize/remember that a commit timeout implies an undefined state in MySQL, so clients cannot know if they should retry or not (maybe data has been committed: the OK packet was lost; or maybe not: the commit packet was lost).  This bubbles up the failure to the application level, which allows presenting the user with an error message, but without clear ways to move forward in handling the problem.

Semi-sync has an even more complex behavior on crash recovery: when there are no semi-sync slaves and when the master is configured to require semi-sync ACKs (rpl_semi_sync_master_wait_for_slave_count > 0 and rpl_semi_sync_master_wait_no_slave = ON), the commit hangs, but the transactions are rolled forward if MySQL crashes and restarts.  I wrote about this behavior on the Percona Community Blog in the post Question about Semi-Synchronous Replication: the Answer with All the Details, and I also mention the Facebook way of avoiding this roll-forward.

Back to fencing with semi-sync, the Facebook implementation does not block writes in the case of a network partition because their semi-sync slaves are co-located in the same datacenter as the master (they might even be co-located in the same rack as the master to minimize commit latency).  If we are willing to accept a longer commit latency (20 to 50 milliseconds), we could put the semi-sync slaves in other datacenters.  This avoids the higher latencies of sending statements to the database via proxies, and reliably fences the master in case of network partitions, but at the cost of a higher latency on commit (probably acceptable if you are sending statements to the master from another datacenter).  I have not yet implemented this myself, but I plan to experiment with this in the near future to make master accesses faster than via proxies.  This is the only 100% reliable fencing mechanism that I know of which can deal with imperfect service discovery in network partitions.

A section about fencing would be incomplete without mentioning XtraDB Cluster (Galera) or Group Replication.  Those solutions have fencing built-in as they need at least the majority of nodes to agree on committing a transaction (I think Group Replication needs a majority and Galera needs all non-failed nodes to agree on commit, but I do not fully know those technologies, so I am not 100% sure).  The flip-side of this property is that commit can return an error, which is not possible in normal MySQL (commit taking forever, or losing the connection to the master, are different from commit returning an error).  Because this is an unusual behavior, a lot of code might not manage commit returning an error, and for that reason, claiming that Group Replication or XtraDB Cluster are drop-in replacements for MySQL is not something I fully agree with.  Also, unless you deploy those clustering solutions across multiple datacenters (and I am not sure that they are supported or well tested in geo-distributed deployments), you still have to manage disaster recovery by failing-over to another datacenter, in which case the built-in fencing does not prevent split-brains.

Dealing with Split-Brains


And now, we can cover the worst thing of all, the nightmare of all data engineers: data inconsistencies (in this case, introduced by split-brains).  Providing consistent data is highly valued by data geeks.  It is so highly valued that when we cannot guarantee consistency, our intuitive and first reaction is to make a data-store unavailable for writes and also potentially for reads.  But this means low availability, which conflicts with business continuity, so a good trade-off needs to be found.  But let’s go back to the specific subject of the split-brain.

Having a MySQL master split-brain means that two independent (or unsynchronized) servers are accepting writes at the same time.  The consequence of a split-brain is data inconsistency: the data on each of the servers is not the same because of different writes, and this can lead to disastrous situations (think bank account…).

In the context of MySQL master failover, split-brains happen when combining false positives in failure detection, with imperfect service discovery, and with improper fencing.  Because getting better at any of these implies increasing the RTO or degrading performance, split-brains cannot always be prevented.  To cover this risk, a good strategy is knowing beforehand what to do when facing a split-brain (a bad strategy would be ignoring this possibility when we know it can happen).

The first way to deal with a split-brain is to discard one of the brains.  This is equivalent to dropping committed transactions, which is an abomination for data engineers.  But this might be the right solution when recovering data would cost more than the value provided by this data.  Other situations where dropping transactions is acceptable are when the unavailability of the system costs more than losing data or when it does not imply a significant increase in existing compensating activities.

To extend a little on the last point above (existing compensating activities), if you are an online merchant, dropping 100 orders because of a split-brain obviously sounds bad.  But if you have 1000 orders that are lost daily because of stolen packages or bad delivery, maybe dropping those 100 orders once in a while is not such a big problem (maybe losing 5000 orders because of a split-brain would even be acceptable).  Because you already have a customer service department dealing with this type of problem, making them deal with 100 lost orders because of a split-brain is probably better than not accepting orders for 30 minutes.  As long as you know what you lost and it is in the acceptable range for your business, things are fine(-ish).

To have an idea of what you lost, you can check the binary logs of the discarded MySQL node.  There are probably ways of structuring your data that would make this easier to check (think transaction log), but I do not have good advice to give here.
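As an illustration, and assuming GTIDs are enabled (something this post does not otherwise require), the transactions that exist only on the discarded node are the difference between its executed GTID set and the one of the surviving master.  MySQL provides GTID_SUBTRACT() for this; the Python sketch below does a similar computation on the textual representation.  The helper names and the short server identifiers are hypothetical, and expanding intervals into sets is only reasonable for small examples.

```python
def parse_gtid_set(text: str) -> dict:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            first, _, last = interval.partition("-")
            numbers.update(range(int(first), int(last or first) + 1))
    return result


def gtid_subtract(left: str, right: str) -> dict:
    """Return the transactions of `left` that are missing from `right`."""
    left_set, right_set = parse_gtid_set(left), parse_gtid_set(right)
    diff = {}
    for uuid, numbers in left_set.items():
        only_left = numbers - right_set.get(uuid, set())
        if only_left:
            diff[uuid] = sorted(only_left)
    return diff


# Hypothetical sets: "aaa" is the old master, "bbb" the promoted slave.
discarded_brain = "aaa:1-105"
surviving_brain = "aaa:1-100,bbb:1-7"
lost = gtid_subtract(discarded_brain, surviving_brain)
```

Here, the five transactions aaa:101 to aaa:105 were committed on the old master but never reached the new one, and those are the ones to look for in the binary logs of the discarded node.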

In some other cases, you might want to reconcile the data of the two brains.  This can be very challenging in the general case, and the only way you can achieve that is by knowing your data well.  I cannot give you advice here either because I do not know your data.

But one thing I know and experienced the hard way, and which I mentioned in my talk Autopsy of an automation disaster (slides & recording), is that AUTO_INCREMENTs (also known as SEQUENCEs in other databases, or as their alias SERIAL) make data reconciliation much harder.  Because auto-increments are allocated by the database, when you have two masters, they might allocate the same values.  In this case, you have two rows sharing the same identifier but with different data.  When you use this identifier in another table, things become messy, and data reconciliation becomes very hard (you might even show the wrong data to users of your system depending on the behavior of the split-brain).  My advice here is to stay away from all identifiers that are allocated by the database, and to use UUIDs instead (in my talk, I mention the Percona post about Storing UUID in an optimized way).
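The collision can be reproduced with a small Python simulation, where two counters stand in for the AUTO_INCREMENT of each brain.  The starting value and the row contents are hypothetical, and the last line only shows that a UUID fits in 16 bytes: it is not the optimized layout described in the Percona post.

```python
import itertools
import uuid

# Both brains continue from the same last replicated value (1000 here).
old_master_ids = itertools.count(1001)
new_master_ids = itertools.count(1001)

old_rows = {next(old_master_ids): f"order {i} on old master" for i in range(3)}
new_rows = {next(new_master_ids): f"order {i} on new master" for i in range(3)}

# Same identifiers, different data: these rows cannot simply be merged.
conflicting_ids = sorted(old_rows.keys() & new_rows.keys())

# Same scenario with identifiers that are not allocated by the database.
old_uuid_rows = {uuid.uuid4(): f"order {i} on old master" for i in range(3)}
new_uuid_rows = {uuid.uuid4(): f"order {i} on new master" for i in range(3)}
uuid_conflicts = old_uuid_rows.keys() & new_uuid_rows.keys()

# A UUID can be stored as 16 bytes instead of a 36-character string.
packed = next(iter(old_uuid_rows)).bytes
```

With the counters, the three rows of each brain share the identifiers 1001 to 1003, while the rows identified by UUIDs can be merged in a single table without any key conflict.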

An example of a split-brain situation where data reconciliation was challenging is in the GitHub 21st of October 2018 post-incident analysis.  In this case, when the split-brain was discovered, the service was interrupted/degraded for hours to allow data reconciliation.

It is better to avoid split-brains with perfect service discovery or perfect fencing, but if you cannot for some reason, and if you end up in a split-brain situation, hopefully you thought about it beforehand and you know what to do.  If you have not planned for this, hopefully you can think quickly.  And if not, then you have to push the big red button and then…

Good luck !


Some Closing Remarks


As I wrote at the beginning of the post, this is the current state of my thoughts about the very complex subject of MySQL Master Failover Strategies.  If we want to summarize:
  • Perfect failure detection is very hard, and to provide a small RTO, there are always false positives.
  • Service discovery is hard (but not impossible), but perfect service discovery impacts the RTO on failover and performance of normal operations.
  • Imperfect service discovery can lead to split-brains, which could be prevented by fencing.
I do not claim this whole discussion is completely mature, I just think it is better than what I have read so far.  Also, I did not describe all the ways of implementing each of the five parts of a MySQL Master Failover Strategy, I probably forgot some, and new ones will be invented.  This post can be considered a Request For Comment, so please share your thoughts in the comments below, or on the MySQL Community Slack (I am @jfgagne there), or via email (jfg DOT mysql AT gmail.com).

I am planning follow-up posts.  The two I have in mind so far are:
  • Circular/ring replication: living in constant split-brain
  • XtraDB Cluster, Galera and Group Replication: a partial way of doing fencing but not a full availability solution
Moreover, I did not cover how to make failing-over the master to a slave efficient, fast or simple.  This has partially been covered in my previous post Abstracting Binlog Servers and MySQL Master Promotion without Reconfiguring all Slaves.

Finally, I want to write a post about the Rules of MySQL Replication for explaining how to organize a MySQL replication topology for making sure failing-over the master to a slave does not have adverse consequences, including when dealing with replication filters, multi-source replication and other edge cases.

As you can guess, I am not done writing on this very complex subject...

10 comments:

  1. Hi JF,

    Excellent and enjoyable post as always. Thanks for the mention. I agree with many of the issues you raise here. I have some comments, thoughts and clarifications. Comments by order of content in the post:


    Failure detection
    =================

    I agree that it is a very complex and hard subject.

    > you should dismiss people saying they have a reliable failure detector
    > they might have solved failure detection for what they think is their class of failures, but they probably forgot something, and they surely could not have solved it for the most general case

    Erm, I would like to humbly present orchestrator's reliable failure detector (it's not "simple"). It's been developed over the past 4.5 years, at different environments, the largest of which was Booking.com which in itself was a crazy big MySQL playground. It has adapted over the years onto many environments, thanks to community feedback and contributions, and does attempt to solve the general use case.

    Orchestrator's failure detection is very reliable, saying so by examining the past incidents. I will not share percentages here. But the holistic approach, where orchestrator monitors both master and its replicas, has proven to be sturdy. Latest developments tackle more difficult cases, such as the master going into some limbo, existing connections seem to be alive but doing nothing, new clients unable to connect, replicas claiming they're replicating well while their binlog coordinates stay in one place. This can happen on "too many connections" error or other scenarios we've seen. In such case orchestrator will proactively restart replication on the replicas to kick their connections, to find out they actually report replication to be broken, to reach the conclusion the master is dead or "as good as dead". Orchestrator's failure detection goes from detecting the death of a single master and as far as correctly (verified in real life incident) analyzing a full DC network isolation scenario.

    Docs: https://github.com/github/orchestrator/blob/master/docs/failure-detection.md


    Service discovery
    =================

    > with being smart, I think two proxies can be enough for managing the simple failure of a master, and four additional proxies are enough to manage complex network partitions.

    Having multiple proxies is of course a necessity, as you suggest, or else it's a single point of failure. But then again, it reintroduces similar issues to our MySQL detection/failover problem:

    Are all your proxies in the same datacenter? Probably not, because you would not survive a DC disaster.
    Then they are in different datacenters. What happens when a DC goes network isolated? The isolated proxies cannot communicate with the other proxies; which server do they relay the traffic to? How long will it take them to realize they are isolated? During that time, they did not do any fencing.


    Continued in next comment due to 4096 characters limitation...

    Shlomi

    1. Continued, 2nd part of comment

      Fencing
      =======

      > Another way of fencing is to put a circuit breaker in front of the master. This circuit breaker can be a smart proxy, deployed on the same server as the database, which holds a lock in Zookeeper (or other distributed synchronization solutions like Consul or etcd). This smart proxy also monitors the health of the database, and if the database miss behaves, the proxy blocks connections and releases the lock allowing another node to become the master

      Unfortunately this approach reintroduces the problem of detection. With this approach you now have two detection mechanisms that do not collaborate. One is your general solution (say orchestrator). The other is some logic incorporated by a smart proxy, which gets to be the judge on the health of a master.

      Then, how does it pass judgement? Now you have a single point of judgement. If the proxy is wrong in its judgement, it will block your master, leading to loss of availability. Define "misbehaves"? If it were simple to decide that a server "misbehaves" then we could have a "simple, reliable and efficient failure detector"...

      The problem intensifies because the proxy is not designed to speak with other components about those health checks. This is why higher level solutions such as orchestrator (or others of course) can pass better judgement: they are designed to monitor the larger scale topology, cross DC and based on insights on the MySQL topology and rules.


      Dealing with Split-Brains
      =========================

      > An example of a split-brain situation where data reconciliation was challenging is in the GitHub 21st of October 2018 post-incident analysis. In this case, when the split-brain was discovered, the service was interrupted/degraded for hours to allow data reconciliation.

      I found that many people were confused about the situation we had. To summarize it: we had a datacenter network isolation. Orchestrator recognized it and failed over all masters to another DC. Meanwhile, a split brain was created in the isolated datacenter (local apps were able to keep writing to their local master for some time).

      The split brain scenario was known and _expected_, see this post from before the incident: https://githubengineering.com/mysql-high-availability-at-github/#limitations-and-drawbacks

      What we did not expect is that we would need to move back to the original datacenter. The new datacenter was already taking substantial writes which we could not afford to lose. But circumstances forced us to move back. The duration of the outage was largely accounted for by the time needed to restore the original DC servers from backup or reclone them, and to bring them up to speed.

      That outage was unquestionably painful to everyone affected and involved. We had many resolutions, understandings, and insights as a result, and quite a few things did come out of this incident. In our database team, we tackled many of the time-impacting barriers we met during the incident and are managing to bring down time-to-restore. I will mention here one out of multiple outcomes:

      gh-mysql-rewind, aka Un-split brain for MySQL, is a new tool I was happy to present at FOSDEM:

      - https://www.youtube.com/watch?v=UL--ew3n3QI
      - https://speakerdeck.com/shlominoach/un-split-brain-mysql

      It automatically reconciles a split brain scenario by canceling (reverting/rewinding) the writes on one of the masters, essentially moving it back in time to a consistent point, and then configures it as a healthy replica in the other master's topology.
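
      To make the rewinding idea concrete, here is a toy sketch, not gh-mysql-rewind's implementation. It assumes each divergent write is recorded with its before and after row images, and undoes the writes by applying the inverse of each change in reverse order. The `Change` tuple and the `apply_change` and `rewind` functions are hypothetical names used only for this illustration.

```python
from typing import Dict, List, Optional, Tuple

# A change is (primary_key, before_image, after_image); None means "row absent".
Change = Tuple[int, Optional[str], Optional[str]]


def apply_change(table: Dict[int, str], change: Change) -> None:
    """Apply one row change to an in-memory table."""
    key, _before, after = change
    if after is None:
        table.pop(key, None)  # the change deleted the row
    else:
        table[key] = after    # the change inserted or updated the row


def rewind(table: Dict[int, str], changes: List[Change]) -> None:
    """Undo the changes by applying their inverses, newest first."""
    for key, before, after in reversed(changes):
        # Swapping the images turns an insert into a delete, and so on.
        apply_change(table, (key, after, before))


table = {1: "a"}
consistent_point = dict(table)
divergent = [(1, "a", "b"), (2, None, "x"), (1, "b", None)]
for change in divergent:
    apply_change(table, change)
rewind(table, divergent)
```

      After the rewind, the table is back at the consistent point, at which stage the server could be reconfigured as a replica of the other master.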

      It is to be released to the public very shortly (I will write a blog post when this happens).

      =====

      Finally, I'd like to link to http://code.openark.org/blog/mysql/mysql-high-availability-tools-followup-the-missing-piece-orchestrator which similarly discusses many issues around detection/failovers.

      Thank you!

    2. As clarification, I do not claim orchestrator's detection is perfect, but I do find it reliable. Reliability is a matter of numbers, of course; one's tolerance for false negatives and false positives.

    3. Hi Shlomi, thanks for your comments.  I will reply to each of them individually so that each can become an independent discussion.  I also updated the post with a link to your post. And btw, I corrected “miss behave” to “misbehave” in the post, which could explain why some of your quotes are out-of-sync with the post.  Cheers, JFG

  2. Thanks to Shlomi's comment, I realized that split-brain is only one consequence of a “bad failover”.  The general case is data inconsistency. It is not only caused by a split-brain resulting from the combination of imperfect failure detection, imperfect service discovery, and imperfect fencing; it can also be caused by a bad failover. The case presented in my talk Autopsy of an automation disaster, and the GitHub outage, are examples of data inconsistency caused by an error in failover. I will not update the post yet; I will think about this more and eventually write a follow-up.

  3. Shlomi, about your failure detection comment:

    > I would like to humbly present orchestrator’s reliable failure detector

    I agree that Orchestrator’s failure detector is very good (and it is not simple).  It is probably the best we have so far. However, I think we do not agree on terminology.

    As you point out, Orchestrator’s failure detector is not perfect and its imperfection has the risk of causing split-brains.  In that sense and IMHO, it is not 100% reliable. I believe that something that is right 99.9% of the time cannot be called reliable, and for that reason, I do not call Orchestrator’s failure detector a reliable one.

    However, and as you point out, it is still very well battle-tested and has been iterated on many times.  As I point out in the post, failure detection is a matter of trade-off, and I would happily use the Orchestrator’s failure detector in most of my deployments (I am actually deploying it at my current employer).  I think we agree it does not fully avoid split-brains (IMHO, no failure detector is able to reliably avoid them).

    The reference paper on this is “Unreliable Failure Detectors for Reliable Distributed Systems” published by Chandra and Toueg in 1996.  I remember spending a lot of time understanding this during my master's degree (2000 to 2002); it is not an easy read.

    Globally, I think we do not fundamentally disagree here.

  4. Shlomi, about your service discovery questions and comment:

    > Are all your proxies in the same datacenter?

    No.

    > What happens when a DC goes network isolated?

    Things keep working.  I do not want to give details here; I will do this in a future post presenting the whole solution.  You can already guess that I am making a trade-off to avoid split-brains, and I can hint that this implies degrading performance in a way that is acceptable in my case.

    > How long will it take them [the proxies] to realize they are isolated? During that time, they did not do any fencing.

    In my solution, the proxies are not collaborating for doing fencing.  More about this in my next post which should be published in the next three months.

  5. Shlomi, about your “Fencing Proxy” questions and comments…

    > With this approach you now have two detection mechanisms who do not collaborate.

    This is right, there are two failure detectors here. However, I think one collaborates with the other. If the proxy detects a failure and fences the master, Orchestrator also detects it and triggers a failover. More about this below.

    > If the proxy is wrong in its judgement, it will block your master, leading to loss of availability.

    True: this is a false positive in failure detection. I think I explained in my post that this is unavoidable. The good thing here is that the proxy fenced the master, so this class of false positives cannot lead to split-brains. Maybe we end up doing a failover that was not needed, but sometimes it is better to fail over quickly while avoiding a split-brain. The less attractive alternative is taking more time to do failure detection, maybe eventually failing over, but risking a split-brain. I am not claiming this solution is the best of all, but I think it is an interesting trade-off that someone might want to make.

    > Define misbehaves

    MySQL bugs, InnoDB long semaphore wait, stalled data_dir filesystem (maybe full), … There are many classes of misbehavior for a MySQL server that can be detected by the fencing proxy failure detector.

    > If it were simple to decide that a server "misbehaves" then we could have a "simple, reliable and efficient failure detector"

    This generalization does not work. In some specific cases, we are able to decide that a server misbehaves (and I like the word decide here, it has a very specific meaning in computability theory, which is exactly the meaning it should have here). But in other cases, we are not able to decide. In this sense, the “Fencing Proxy” failure detector is reliable for a class of failures, but its behavior has not yet been defined for all the other classes of failures. It is the same thing for the halting problem: some programs obviously stop or run in an infinite loop, but others are more complicated to analyze. Things are not simple!

    > The problem intensifies because the proxy is not designed to speak with other components about those health checks.

    I believe this is not a problem. If Orchestrator polls the MySQL master via the proxy, when the proxy detects a failure and fences the master, this failure bubbles up to Orchestrator, and failover is triggered. The nice thing here is that the “Fencing Proxy” guarantees that some classes of transient MySQL master failures become permanent, hence avoiding a split-brain (only in some specific classes of failures).
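
    The "transient failure becomes permanent" behavior can be sketched as follows. This is a minimal single-process illustration under stated assumptions, not a production design: `InMemoryLock` is a hypothetical stand-in for a lock held in Zookeeper, Consul, or etcd, and `FencingProxy`, `tick`, and `accepts_connections` are names invented for this example.

```python
from typing import Callable, Optional


class InMemoryLock:
    """Hypothetical stand-in for a distributed lock (Zookeeper, Consul, etcd)."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def acquire(self, who: str) -> bool:
        if self.holder is None:
            self.holder = who
        return self.holder == who

    def release(self, who: str) -> None:
        if self.holder == who:
            self.holder = None


class FencingProxy:
    """Accepts connections only while it holds the lock and has never seen a failure."""

    def __init__(self, name: str, lock: InMemoryLock,
                 health_check: Callable[[], bool]) -> None:
        self.name = name
        self.lock = lock
        self.health_check = health_check
        # Without the lock, the proxy starts fenced.
        self.fenced = not lock.acquire(name)

    def tick(self) -> None:
        """Run one health check; on failure, fence for good and free the lock."""
        if not self.fenced and not self.health_check():
            self.fenced = True
            self.lock.release(self.name)

    def accepts_connections(self) -> bool:
        return not self.fenced
```

    Note that `tick` never un-fences: even if the database recovers a moment later, the proxy keeps blocking connections, so a monitor polling through it sees a persistent failure and another node can safely take the lock.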

  6. Shlomi, about your split-brain questions and comments:

    > many people were confused about the situation [GitHub outage]

    This was indeed a very intricate situation: as I wrote in my post, network partitions are the most complex cases to manage.

    > gh-mysql-rewind, aka Un-split brain for MySQL, is a new tool I was happy to present at FOSDEM

    This is an interesting idea.  MariaDB has something very similar with Flashback ([1]).  I agree it can help in some situations for data reconciliation, but I am not sure it solves the problem in the general case. We will probably discuss this more in the future.  As I point out in my post, I do not mention everything and new things will be invented: gh-mysql-rewind and Flashback are examples of both.

    [1]: https://mariadb.com/kb/en/library/flashback/

  7. Worth mentioning: I discovered a similar rewind feature for PostgreSQL: https://github.com/avito-tech/dba-docs/blob/f1754faf62c49e30e43101bc07a72a39400cef89/PGCon2018_Ottawa/Recovery_Use_Cases_for_Logical_Replication_in_PostgreSQL10.txt#L92
