Tuesday, July 9, 2019

MySQL Master Replication Crash Safety Part #4: benchmarks of high and low durability

This is a follow-up post in the MySQL Master Replication Crash Safety series.  In the three previous posts, we explored the consequence of reducing durability on masters (including setting sync_binlog to a value different from 1).  But so far, I only quickly presented why a DBA would run MySQL with such configuration.  In this post, I present actual benchmark results.  I also present a fundamental difference between on-premise servers and cloud virtual machines as my tests are done in Google Cloud Platform (GCP).  But before going further, let's summarise the previous posts.

In Part #1, I presented the theory behind why a DBA would set sync_binlog to something else than 1 and what happens on a master OS crash in this case, including how slaves using file+position replication react to such crash.  TL&DR: we get performance improvements, at the cost of losing binary logs when the OS crashes, which will lead to data inconsistencies on master & slaves and replication breakage on slaves.

In Part #2, we extended our understanding of the impact of running a master with sync_binlog != 1 by considering lagging slaves using file+position replication.  TL&DR: this introduces a new inconsistency scenario after a master OS crash and the resulting data drift is trickier to detect because replication might not break.

In Part #3, we even more extended our understanding of the impact of running a master with sync_binlog != 1 by considering GTID replication.  TL&DR: this introduces even more inconsistency scenarios after a master OS crash and the resulting data drift is also tricky to detect because replication might not break.

In this post, I present real numbers explaining why DBAs are running MySQL with reduced durability (sync_binlog != 1 and innodb_flush_log_at_trx_commit != 1 — I am using trx_commit for short in my posts).  The theory behind this has been presented in Part #1, so I am not repeating myself: if you do not remember, you should read the Why would sync_binlog be set to a value different from 1 section of that post again.

This post has an annexe: Under the Hood.  Benchmarking is a complex art and reporting results accurately is even harder.  If all the details were put in a single article, it would make a very long post.  The links to the annexe should satisfy readers eager for more details.

For my tests, I am using two sysbench client vms and one MySQL vm.  The MySQL vm is using dbdeployer for running MySQL.  The details about the environment, dbdeployer and sysbench are in the annexe.

My first test is running sysbench with one thread on the same server as MySQL (co-located).  I ran with high durability (sync_binlog = 1 and trx_commit = 1), and with low durability (sync_binlog = 0 and trx_commit = 2).  The results are below in transactions per seconds (TPS).
  • Co-located, high durability: ~220 TPS
  • Co-located, low durability: ~6280 TPS
This (220 TPS vs 6280 TPS) is a huge difference: low durability provides more than 25 times the throughput of high durability !  If you need more than 200 TPS in those conditions, you might be tempted to run MySQL with reduced durability, but this comes with the consequences presented in the previous posts of the series.  However, this test is somehow unfair: a normal application needing a high transaction throughput is usually not co-located with the database server.  If I try the same tests from two other vms, one in the same zone (co-located with the master) and one a remote zone from the same region, I get below:
  • Remote vm, zone co-located, high durability: ~200 TPS
  • Remote vm, zone co-located, low durability: ~3770 TPS
  • Remote vm, remote zone, high durability: ~200 TPS
  • Remote vm, remote zone, low durability: ~2190 TPS
For a high durability configuration, we basically get the same throughput for the three setups (co-located, remote vm in the same zone, and remote vm in a remote zone of the same region).  This is because the cost of the network round-trip to the database is negligible compared with the cost of the two syncs at transaction commit (real numbers are in the environment section of the annexe).  For a low durability configuration, as those two syncs are removed, the network round-trip becomes the limiting factor of the benchmark so we see a big difference in the transaction throughput of the three setups.  And still, in the architecture that is the most likely (the application talking with MySQL is in a remote zone but in the same region as the database server), we see a ten time difference in throughput between high and low durability configurations.  This is very big: it is understandable that people are tempted to get this performance boost !

Low durability provides better transaction throughput !

So clearly, running with reduced durability provides a great performance improvement.  This is especially visible in a cloud environment where a sync to disk is a network round-trip to the storage system (I am using GCP SSD persistent disks for my tests, but magnetic disks have similar behaviour latency-wise for binary logs — and probably also for InnoDB Redo logs — because sequential writes are cached at the storage level, and this is why on my production systems, I store binary logs on a dedicated magnetic disk which is cheaper than SSD).  In a high durability configuration in GCP, the high latencies of disk syncs are the limiting factor.  I would expect the performance of a high durability configuration with lower latencies, like a local SSD or a battery-backed-up RAID cache, to be much higher.

Physical servers with local SSDs
or with a battery-backed-up RAID cache
would reach higher transaction throughput !

And for those of you who know about local SSD in cloud environments, more is coming about this subject in a next post.  But in short and on local GCP SSD, I see higher write latency than on persistent SSD (yes, this is weird !).  On local SSD in Amazon Web Services (AWS), I see the lower write latencies that I am expecting.  However, it is too soon for me to state that AWS is better than GCP.

The high sync latencies of cloud environments
are making the problem worse !

As cloud environments have higher sync latency than physical servers, it is not completely false to say that for getting decent transaction throughput in such environments, compromising on durability is the easiest solution (there is more to say about this statement, but it is the subject of a next post).  With more and more people moving to the cloud, I am expecting more and more people to run with low durability configuration, and this is one of the reasons I think this series is important.  I would also like database editors (Oracle and MariaDB) to understand the importance of supporting such configuration (cf the discussion about Bug#92109 below).

Database editors must support low durability configuration
for allowing running their products in the cloud !

It is also important that cloud providers allow running managed database with reduced durability settings (from what I have seen, Google Cloud SQL does not support such configuration, and I cannot say about Amazon RDS, if you know, please write a comment below).

Cloud providers must provide low durability configuration
for managed database !

It is relatively easy to get past the high durability transaction throughput limitation on a master without lowering durability (I am not telling how to do this yet, this is the subject of a next post, but the title of my last post — Getting past 2400 threads with sysbench and dbdeployer for benchmarking MySQL 5.7.26 — is already giving you a good hint), but we are missing an important part of the puzzle: replication.

Yes, replication !  So far, the results above were sysbench pointed to a master, what about transaction throughput on slaves ?  Replication is by default single-threaded, so blocking on sync is problematic.  Below are transaction throughputs from my tests, the methodology is in the replication section of the annexe:
  • Replication, high durability: ~230 TPS
  • Replication, low durability: ~7050 TPS
So replication is giving us a little more throughput than sysbench co-located on the master.  This is probably because replication is using row-based binary logging format (the default in MySQL 5.7) which is skipping statement parsing.  But there is one hidden conclusion in those numbers:

Reducing durability is a very efficient and common way
for dealing with a replication lag problem !

And this is why I am so disappointed that replication with GTID is not yet crash safe with low durability configuration and I would like Bug#92109 to get the attention it deserves.  My trick to avoid this problem right now is to disable GTID replication for slaved on which I am using low durability by running the below command, but this is only a patch and things should not be this way.
> CHANGE MASTER TO MASTER_AUTO_POSITION = OFF;
Hopefully, this post will remind Oracle of the importance of fixing this bug, which is not affecting MariaDB as its GTID table is updated after each transaction on slaves.

Before closing this post, I would like to mention a few things that I detail in the annexe:
  • I open Bug#96134 for adding START UNTIL and WAIT functions for the IO Thread.
  • For replication tests, I mentioned having to work around the very annoying limitations of relay log recovery, which I reported as Bug#74321 in MySQL 5.6.  It is disappointing that this is still not fixed in 5.7 and 8.0, more than 4 years after being reported.
This is all I have for now.  In the next posts, I will explain how to go beyond the numbers presented in this post (and yes, it will involve using many threads).

P.S.: if you want to learn more about running MySQL in GCP, you can watch the great Percona Live talk MySQL on Google Cloud: The Good, The Bad, and The Ugly by Jeremy Cole.

5 comments:

  1. Great article JF. To your point, Amazon RDS does support modifying innodb_flush_log_at_trx_commit and sync_binlog. Hope Google implements that soon.

    ReplyDelete
    Replies
    1. Hello Unknown, good to know RDS allows reducing durability. But this probably impacts their availability solution. I have to check that in more details.

      Delete
  2. 0) Excellent post

    1) Can replication replay do more work per fsync? Run N transactions, then commit changes rather than fsync per transaction today.

    2) SSD is fast usually means reads are fast. Fsync on SSD can be much slower. Would be interesting to document this across clouds and devices.

    ReplyDelete
    Replies
    1. 0) Thanks Mark.

      2) I have some of what you mention in my next post to be published Tuesday July 16.

      1) Yes, replication can do more work per fsync, and this is the subject of a later post in the series.

      Delete
  3. I decided to not publish a comment that had a suspicious link in it. If you want to maximise your chances to successfully pass my moderation, do not include unrelated links in your comments.

    ReplyDelete