Update 2018-11-15: the second post of the series is out: MySQL Master Replication Crash Safety Part #2: lagging slaves.
I call the theme of this series MySQL Master Replication Crash Safety. By this, I mean that after a crash and restart of a master, its slaves are able to reconnect to it and resume replication. Note that intermediate masters (IM) are also masters, so everything I write in this series applies to IM.
TL&DR: a master is not replication crash safe for OS failures when sync_binlog != 1.
I always knew that bad things could happen when setting sync_binlog to a value different from 1, but until a few weeks ago, I had not thought about this much. I am now diving into this subject while preparing my talk Demystifying MySQL Replication Crash Safety. I am giving this talk on Tuesday the 6th of November 2018 at Percona Live Frankfurt and on Friday the 9th of November 2018 at HighLoad++ Moscow. If you are interested in MySQL Replication Crash Safety and want to know more about the internals of replication, I will be happy to give you more details and to answer your questions during the talks or in the hallway track.
Why would sync_binlog be set to a value different from 1?
To provide full durability (the D of ACID), MySQL has to permanently store a transaction in both the InnoDB Redo logs and the binary logs. Naively using the POSIX write system call is not enough, as the operating system might buffer those writes in memory to batch flushes to disk. This optimisation is very useful for magnetic disks (which I like to call spinning-rust), where storing data takes more than 10 milliseconds (ms).
To permanently write data to storage, MySQL has to use the fsync or fdatasync system call. As explained above, doing this on spinning-rust is a high-latency operation. For SSDs, things are faster and a fsync costs around 0.5 ms. It is even better if there is a write cache in front of the disks, where a fsync costs less than 0.1 ms (this cache can be RAM in the RAID controller and might include a battery to survive power failures). For more details about disk latency, you can read the following post by Yves Trudeau on the Percona blog: Fsync Performance on Storage Devices.
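If you want a rough feel for the sync latency of your own storage, a quick and crude check is to time a batch of synchronous writes from a directory on the device you care about; this is only a sketch, the Percona post above is the proper way to benchmark:
# 1000 writes of 4 KB, each synced to storage (oflag=dsync); GNU dd prints the
# elapsed time, so dividing it by 1000 approximates the per-write sync latency.
$ dd if=/dev/zero of=./dd-dsync-test bs=4k count=1000 oflag=dsync
$ rm ./dd-dsync-test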
As making a transaction fully durable needs two fsyncs (a fsync is also called a sync or a flush), when a fsync costs 10 ms the maximum MySQL throughput in transactions per second (TPS) is around 50. With SSDs, we can have ~1000 TPS, and ~5000 TPS with a RAM write cache. All the previous numbers are for local disks: when the storage is remote, the link latency is usually the limiting factor and we can get ~500 TPS with a network round-trip of 1 ms.
(For now and for simplicity, I am omitting the details about InnoDB and binary log group commit. I am assuming that all transactions are serialised from a storage and sync point of view. Please forgive my temporary laziness and note that this post is already very long. I will come back to group commit later in the series.)
When the throughput is too low, or when adding the cost of two syncs to the total commit latency is too much, it is tempting to skip the flushing part for both the binary logs and the InnoDB Redo logs (the InnoDB behaviour can be controlled by the innodb_flush_log_at_trx_commit parameter, which I call trx_commit for short). This is typically done by setting sync_binlog to 0 and trx_commit to 2. When doing this, the transaction throughput usually increases greatly and the commit latency drops at least a little. However, like most good things, it comes with side effects.
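As a concrete illustration (a sketch, not a recommendation), such a reduced-durability setup typically looks like this in my.cnf:
[mysqld]
# Reduced durability: faster commits, but committed transactions can be lost
# on an OS crash or power failure, which is the subject of this post.
sync_binlog                    = 0   # do not sync the binary logs at commit
innodb_flush_log_at_trx_commit = 2   # write the Redo logs at commit, sync them about every second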
When sync_binlog != 1 and trx_commit = 2, a mysqld crash (just of the MySQL process, not of the operating system) is of no consequence: even if the transactions have not reached the storage, the data written to the binary logs and to the InnoDB Redo logs is still present in the OS file buffers. However, things are different for operating system failures.
What happens on a master OS crash when sync_binlog != 1?
On an operating system crash (or a power failure), the data stored in the OS file buffers is lost. This means that InnoDB loses in the order of one second of transactions (trx_commit = 2 means that the InnoDB Redo logs are flushed approximately every second) and that the last transactions have vanished from the binary logs. I will discuss what this means for a master below, but first and to fully understand what is coming next, I need to give more details about binary logs flushing.
When sync_binlog = 1, the binary logs are flushed after each transaction. When it is set to a value greater than 1 (let's call it N), the binary logs are flushed every N transactions (the documentation for 5.6 and newer says every N commit groups, but as I want to avoid this complexity for now, please allow me this simplification). When sync_binlog = 0, the binary logs are not flushed at transaction commit. However, there is an exception/subtlety to this.
On binary log rotation, the content of the binary log is flushed to storage all at once. This behaviour has been identified by Vadim Tkachenko in a MyRocks Performance post on the Percona blog. This has been the subject of a follow-up post: How Binary Logs Affect MySQL Performance. Both have detailed benchmark results, but less information about the reason for this behaviour. My guess (and this is only a hypothesis) is that MySQL wants to avoid losing a full binary log on an OS crash. So it syncs the previous binary log and the header of the new file on binlog rotation. This has the drawbacks described by Vadim in his posts, with the advantage that a binary log is never lost (only its content might be truncated if sync_binlog != 1).
Back to the consequences of an OS failure on a master when sync_binlog != 1, we have to understand what it means to truncate the last binary log. Let’s assume that the following sequence of transactions has been committed on the master with the described durability properties:
- Transactions A, B, ..., G are committed and persisted in both the InnoDB Redo logs and in the binary logs.
- Transactions H, I, ..., N are committed but only persisted in the InnoDB Redo logs (with trx_commit = 2, the Redo logs are synced every second, which is usually more often than the binary logs are flushed).
- Transactions O, P, Q, R, S, T and U are committed but not persisted in either the InnoDB Redo logs or the binary logs.
And then our master suffers an operating system failure. After bringing the server back up, starting MySQL and waiting for crash recovery to complete (all this might happen auto-magically thanks to init.d or systemd), we have up to transaction N in InnoDB and up to G in the binary logs. So we obviously have a violation of durability (the lost transactions O to U in InnoDB and H to U in the binary logs) and a violation of consistency (InnoDB and the binary logs are out of sync). However, the ACID violations do not stop here: remember that we are in a replicated environment and we have slaves... As our master is back up and running, it is probably accepting new transactions: let's call those V, W, ...
About slaves, let's suppose there are two: S1 and S2. S1 might have received up to transaction R and S2 up to T. After the crash, those slaves try to reconnect to the master, and they succeed once MySQL has recovered. However, they would report the following error:
Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from position > file size'
This is expected, as the slaves are pointing to a position in the binary logs that has vanished because of the crash. We are exactly in the situation depicted in the diagram below:
Illustration #1: IO Thread pointing in a vanished binary log (after an operating system crash on the master).
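As a side note, here is a hedged sketch of how to see where such a slave is stuck: the IO thread error and the coordinates it is asking for show up in SHOW SLAVE STATUS (the host below is a placeholder, adapt the connection to your environment):
$ mysql -h <slave_host> -e "show slave status\G" | \
    grep -E "Last_IO_Errno|Last_IO_Error|Master_Log_File|Read_Master_Log_Pos"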
If this happens on a critical system, we have two choices:
- Fix replication, accepting inconsistencies,
- Reclone the slaves from the master.
Obviously the less bad solution is #2 above (solution #0, avoiding all this, is the subject of a later post in the series), but recloning would take a non-negligible amount of time, and if we need read capacity from the slaves, we might be ready to accept a compromise. Solution #1 can be achieved by skipping the truncated binary log and pointing the slaves to the new binary log created by MySQL after the master restart (in red in the diagram above); a sketch of the commands is shown after the list below. In this case, we would have the following inconsistent data in the replication topology:
- The master has transactions A to N and then V, W, ...
- S1 has transactions A to R and then V, W, ...
- S2 has transactions A to T and then V, W, ...
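For completeness, here is what solution #1 looks like with file and position replication; this is only a sketch: mysql-bin.000005 is a made-up example of the new binary log created after the restart (use the file shown by show master status on the recovered master), it knowingly accepts the inconsistencies listed above, and GTIDs change the picture (see the next posts):
-- on each slave (S1 and S2), skip the truncated binary log:
stop slave;
change master to master_log_file='mysql-bin.000005', master_log_pos=4;
start slave;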
Note that during crash recovery, MySQL also reports in the error log the binary log position known to InnoDB, as in the line below:
2018-10-27T18:08:46.907543Z 0 [Note] InnoDB: Last MySQL binlog file position 0 11887354, file name mysql-bin.000004
If you want to see such inconsistencies for yourself, I have added an annexe below with detailed steps to reproduce such behaviour.
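If you want to find that line yourself after a crash, something like the grep below should work (the error log path is an assumption, adjust it to your setup; for the dbdeployer sandbox used in the annexe, it should be master/data/msandbox.err):
$ grep "Last MySQL binlog file position" /path/to/mysql-error.log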
There is an edge case to what I described here, and things are more complicated with GTIDs. These will be the subject of the next posts of the series, but I cannot write them now as I am still preparing my talk for Percona Live Frankfurt and HighLoad++ Moscow. If you want the details sooner, you will have to attend one of my talks:
- Frankfurt: Demystifying MySQL Replication Crash Safety (Tuesday, November 6, 2018)
- Moscow: Demystifying MySQL Replication Crash Safety (Friday, November 9, 2018)
To be continued...
Annexe: how to see things break by yourself?
To reproduce the situation described in this post, I am using two virtual machines (vms) in Google Cloud with dbdeployer (I guess you can do the same with Amazon, Azure, ..., or with two physical servers, but I have not tested it myself). We need two vms because we crash Linux on the vm hosting the master while the vm hosting the slave stays up. I am not explaining how to set up vms; you will have to figure this out yourself. Same for dbdeployer, but to help you with this, I can point you to the following post on the Percona blog: Using dbdeployer to manage MySQL, Percona Server and MariaDB sandboxes.
Once we have two vms with dbdeployer (let’s call them V1 and V2), we need to set up a master on V1 and a slave on V2. The easiest way I found to achieve that is to create a dbdeployer replicated environment on both V1 and V2 and to tweak them. So let’s start:
1) On each vm, create a dbdeployer replicated environment (by default, dbdeployer binds MySQL to 127.0.0.1, so we have to change that so that the slave on V2 is able to replicate from the master on V1):
$ dbdeployer deploy replication mysql_5.7.23 -c "bind-address = *"
2) On V1, stop the slave nodes, and on V2, stop the master node and slave node2 (note the prompt notation for V1 and V2 below; it assumes the current directory is the respective sandbox):
V1$ node1/stop; node2/stop
V2$ master/stop; node2/stop
3) On the master, modify the replication user to be available from the network (the execution output is shown after the commands). The flush binary logs and show master status are useful to allow easy alignment of the slave in replication (next step). The sync is needed to persist the user, as MySQL 5.7 still stores grants in MyISAM (if we do not do that, the crash might lose the user and the slave will not be able to reconnect).
V1$ master/use -N <<< "rename user 'rsandbox'@'127.%' to 'rsandbox'@'%'"; \
  master/use -N <<< "flush binary logs; show master status"; \
  sync
mysql-bin.000002 154
4) Make the slave replicate from the master (use the IP of your master instead of mine and the binary log file name shown by show master status above). A small master_connect_retry is useful to avoid delays in slave connection retries to the master, a small slave_net_timeout triggers retries earlier, and the reduced durability allows the slave to process transactions at a rate comparable to the master.
V2$ node1/use -N <<< "
stop slave;
change master to master_host='10.164.0.10', master_connect_retry=1,
master_log_file='mysql-bin.000002', master_log_pos=4;
set global slave_net_timeout = 10;
set global innodb_flush_log_at_trx_commit = 2; set global sync_binlog = 0;
start slave"
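Not part of the original steps, but before crashing the master, a quick sanity check that the slave is replicating can be done with something like this (a sketch; the field names are from show slave status in 5.7):
V2$ node1/use <<< "show slave status\G" | \
  grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"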
5) On the master, start the test:
V1$ master/use -N <<< "
  set global innodb_flush_log_at_trx_commit = 2; set global sync_binlog = 0;
  create database test_jfg;
  create table test_jfg.t (id int not null primary key)"; \
  seq -f "insert into t values(%.0f);" 1 1000000 | master/use test_jfg &
6) Crash the master:
V1$ sudo bash -c "echo c > /proc/sysrq-trigger"
7) Restart the master vm, then MySQL, and look at the content of the database:
V1$ master/start; \
  master/use -N test_jfg <<< "select * from t order by id desc limit 1"
87550
8) Use show slave status to see the broken state of the slave (not shown), and look at the content of the database:
V2$ node1/use -N test_jfg <<< "select * from t order by id desc limit 1"
93667
The slave has more data than the master: CQFD / QED!
I had to deal with a strange situation with an application + database server: the application in PHP and the database in MySQL, with
innodb_flush_log_at_trx_commit = 1, sync_binlog = 1;
Replication was failing due to missing data in the binary log.
The server was getting an improper shutdown 4-5 times per day.
Hard to help you without more details. I suggest posting a complete description of your problem in Slack MySQL Community ([1]) using the #dbahelp channel.
[1]: https://lefred.be/mysql-community-on-slack/
Great article. Question: how would the slaves receive not persisted binlog data? Wouldn't the data have to be at least partially persisted for the slave's I/O thread to ever pick them up and drop them in the relay log? Or can an I/O thread pick up data somehow from committed but not persisted data?
> how would the slaves receive not persisted binlog data?
It is all in the definition of "persisted". Persistence, also known as the D (durability) in ACID, means that after a crash, data written to the db is still there. That is not exactly the meaning here.
When I write that the binlogs are not yet persisted, I mean not yet "sync-ed" / "flushed" to disk. A transaction might have been appended to the binlogs, but in the case of an OS crash, the data have not yet been "sync-ed" / "flushed" to disk. So after recovery, the transaction will not be in the binlogs.
> Wouldn't the data have to be at least partially persisted for the slave's I/O thread to ever pick them up
Yes, it is partially persisted. It has been "written" to the binlog file, but it has not yet been flushed to disk. It basically still lives in an operating system memory buffer.
> Or can an I/O thread pick up data somehow from committed but not persisted data?
Yes, the IO thread can get data that is not yet "fully" persistent. When running with sync_binlog = 0, a source can send a replica data that had been written to the binlogs, but not yet been "flushed" to disk. With sync_binlog = 1, this is avoided by a source not sending data to a replica before it has been flushed to disk (even if it is already written).
(If above is not clear, feel free to ask more questions.)
Thank you for the response. I assumed only flushed/synced binlog data was available to the slave I/O thread. I'm anxious to learn how to better evaluate proper server and replication continuation post-crash. I ran into this exact situation and wondered how the slaves ended up with data the master didn't have. Essentially both logs were "written" to, but only the relay logs persisted, and I ended up with a truncated binlog. On to your next post! I'll try and get through all of your material before I ask more questions :)
Ok, good luck figuring things out. I also gave a talk on this subject; I am putting the link below in case it is useful to you.
https://archive.fosdem.org/2020/schedule/event/sync_binlog/
I am not sure there is a clear answer in my next posts, but if you are running with sync_binlog != 1, the TL&DR is that you cannot trust the binlogs after an OS crash. This means that on a primary, you need to fail over to a replica and not try to write on the recovered crashed primary.