Wednesday, February 8, 2017

A Metric for Tuning Parallel Replication in MySQL 5.7

MySQL 5.7 introduced the LOGICAL_CLOCK type of multi-threaded slave (MTS).  When using this type of parallel replication (and when slave_parallel_workers is greater than zero), slaves use information from the binary logs (written by the master) to run transactions in parallel.  However, enabling parallel replication on slaves might not be enough to get a higher replication throughput (VividCortex blogged about such a situation recently in Solving MySQL Replication Lag with LOGICAL_CLOCK and Calibrated Delay).  To get a faster slave with parallel replication, some tuning is needed on the master.

I already wrote/spoke many times about this tuning (when presenting slave group commit on the Booking.com blog, when speaking about MariaDB 10.0 parallel replication, and when speaking about MySQL 5.7 parallel replication).  I call this slowing down the master to speed-up the slave.  This consists of delaying commit in the hope of identifying more parallelism on the master and in the hope that the slaves will replicate faster.  This has visible side effects on clients by increasing commit latency, which could be a problem if a session is in a tight commit loop (the commit throughput will drop) or in other situations where a client relies on commit being quick.  So this tuning should be done with caution and its impact should be monitored carefully.

In MariaDB Server 10+


When doing the "slowing down the master to speed-up the slave" tuning in MariaDB Server 10+ (10.0, 10.1 and 10.2), the binlog_commit_wait_count and binlog_commit_wait_usec global variables are used.  By default, the delay is 100 milliseconds (binlog_commit_wait_usec=100000), but delaying commit is disabled (binlog_commit_wait_count=0).  To enable delaying commit, binlog_commit_wait_count should be increased to a positive value.

When tuning those parameters, the binlog_commits and binlog_group_commits global statuses should be monitored and from those, an average group commit size can be computed.  This average group commit size is useful to get a sense of the tuning quality.  While increasing binlog_commit_wait_usec:
  • if the group commit size grows: the tuning has a good change of giving better parallel replication speed (I write "a good chance" here as some situation, like large transactions, could slowdown parallel replication and this is not shown by this metric).
  • if the group commit size stays constant: the tuning should probably be reverted because the parallelism sent to slave does not increase and commit latency probably grows.
  • if the commit rate drops noticeably: the tuning might have an undesirable effect on the application as throughput is reduced, and we should be extra-careful in moving forward.
Below is an example of this tuning where the commit rate clearly drops (slowing down the master) and the group commit size increases noticeably (speeding up the slave).  The value of binlog_commit_wait_usec is the default (100000).  In this particular case, after more investigations and because I needed a maximum slave throughput, I went even further into obtaining a much higher replication throughput.


Also, not only the group commit size allows to measure the tuning quality and to quantify the amount of parallelism identified by the master and sent to slaves, it allows to size the value of slave_parallel_threads to use on the slave.  In my opinion, the average group commit size is an invaluable metric to have to monitor and optimize performance.

So in MariaDB Server, we have all instrumentation to perform an efficient and safe tuning of parallel replication on the master and on the slaves.  What about MySQL 5.7 ?

In MySQL 5.7


When doing the "slowing down the master to speed-up the slave" tuning in MySQL 5.7, the binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count global variables are used.  By default, both values are set to zero, which disable delaying commit.  To enable this, binlog_group_commit_sync_delay should be set to a positive value.  Note the difference between MySQL and MariaDB here: binlog_commit_wait_count=0 disable delay in MariaDB and binlog_group_commit_sync_no_delay_count=0 simply put no upper bound on group commit size in MySQL.

However, because MySQL 5.7 does not rely on group commit to identify parallelism, there is no corresponding global statuses allowing to measure the tuning quality.  This increases the complexity of this already challenging tuning.  At this point, the only way to measure the tuning efficiency is to look at the slave throughput.  But if slaves are lagging when doing the tuning, we need to wait until the optimized transactions reach the slave and then measure if the slave is actually faster.  This is not very practical.

The Wish for a MySQL 5.7 MTS Tuning Metric


I explain above how the average group commit size can be used as a metric to tune parallel replication in MariaDB Server.  I wanted to find a corresponding metric for MySQL 5.7.  Before looking at the metric that I came up with and to be able to understand it, we need to dive into how a MySQL 5.7 master transfers parallelism information to slaves.  To do that, I will first describe the MariaDB implementation because MySQL is more complex and it is useful to first understand something simpler.

A MariaDB 10+ master puts a group commit id in the binary log header of each transaction.  Below, I extracted the group commit id (cid) from a MariaDB binary log where tuning has been done.
...
#150316 11:33:46 ... GTID 0-1-184 cid=2324
#150316 11:33:46 ... GTID 0-1-185 cid=2335
...
#150316 11:33:46 ... GTID 0-1-189 cid=2335

#150316 11:33:46 ... GTID 0-1-190
#150316 11:33:46 ... GTID 0-1-191 cid=2346
...
#150316 11:33:46 ... GTID 0-1-197 cid=2346

#150316 11:33:46 ... GTID 0-1-198 cid=2361
...
In the extract above, we can see that:
  • the first line is the last transaction of a group commit identified by cid 2324,
  • we then have the first and the last transactions of cid 2335 (by looking at the GTID, we can see that there are five transactions in that group),
  • we then have a transaction that commits alone (for a group commit size of one, MariaDB omits the cid),
  • we then have the first and the last transactions of cid 2346 (seven transactions in that group),
  • and we then have the first transaction of cid 2361.
Enough about MariaDB, what about MySQL ?  In MySQL, the parallelism information is not modeled as group commit but as intervals.  The beginning of the interval corresponds to a view on which the current transaction is based and the end of the interval corresponds to the view materialized by the current transaction.

So, the master tags each transaction with an increasing number.  This number is reset to one on binary log rotation, so the number for the first transaction in a binary log is always one.  This transaction number, not to be confused with its GTID, is the interval end and is labeled as sequence_number in the output of mysqlbinlog.  The beginning of the interval, identified as last_committed, corresponds to the sequence number of the most recently committed transaction when the current transaction gets its last lock (this is zero for the first transaction in a binary log).  So the beginning of the interval (last_committed) represents the most recent transaction on which the current transaction depends on.

Below, I have extracted the interval information from a MySQL instance where partial tuning has been done.
...
#170206 20:08:33 ... last_committed=6201 sequence_number=6203
#170206 20:08:33 ... last_committed=6203 sequence_number=6204
#170206 20:08:33 ... last_committed=6203 sequence_number=6205
#170206 20:08:33 ... last_committed=6203 sequence_number=6206
#170206 20:08:33 ... last_committed=6205 sequence_number=6207
...
In the extract above, we can see:
  • the first line is the transaction with sequence number 6203 and this transaction depends on transaction 6201,
  • we then have three transactions (6204, 6205 and 6206) committing together (they depend on 6203),
  • and we then have a transaction (6207) that depends on the second transaction of the previous group (6205).
That last transaction shows the advantage of the intervals: transaction 6207 can be started as soon as transaction 6205 completes (6207 can run in parallel with 6206).  So the MySQL interval design might allow for a quicker scheduling of some transactions.  However, the interval design increases the complexity of analyzing parallelism potential, so tuning becomes harder.

A MySQL 5.7 MTS Tuning Metric: the Average Modified Interval Length


I came up with a tuning metric for MySQL 5.7: the average "modified" interval length (I will explain the modified part in a few paragraphs below).  This metric is based on the following observation: the larger the intervals are, the better the parallelism identification is.

The larger the intervals are,
the better the parallelism identification is.

If we prefer to think in terms of group commit size, we can map the average interval length to a pseudo-group commit size.  This can be useful to be able to tune the number of slave threads (slave_parallel_workers).  Looking at the group commit of size three in the mysqlbinlog extract above (all transactions depending on 6203), we can see that the interval length of the transactions are respectively 1, 2 and 3, which average to 2.  We can generalize that for a group commit of size n, the sum of the intervals length should be n*(n+1)/2, and the average length should be (n+1)/2.  So to get back n, we need to multiply this by 2 and subtract 1.  This means that to map an average interval length to a group commit size, we should multiply it by 2 and subtract 1 (we can validate that an average interval length of 1 (no parallelism) maps to a group commit size of 1).

The way to get the average "modified" interval length is by parsing the binary logs and by doing some messy computations.  This is done by running the script below on a master (it could easily be modified to be run from anywhere but I advise against it as it would waste network bandwidth to transfer binary logs):
file=my_binlog_index_file;
echo _first_binlog_to_analyse_ > $file;
mysqlbinlog --stop-never -R --host 127.0.0.1 $(cat $file) |
  grep "^#" | grep -e last_committed -e "Rotate to" |
  awk -v file=$file -F "[ \t]*|=" '
    $11 == "last_committed" {
       if (length($2) == 7) {$2 = "0" $2;}
       if ($12 < max) {$12 = max;} else {max = $12;}
       print $1, $2, $14 - $12;}
    $10 == "Rotate"{print $12 > file; close(file); max=0;}' |
  awk -F " |:" '
    {my_h = $2 ":" $3 ":" $4;}
     NR == 1 {d=$1; h=my_h; n=0; sum=0; sum2=0;}
     d != $1 || h < my_h {print d, h, n, sum, sum2; d=$1; h=my_h;}
    {n++; sum += $5; sum2 += $5 * $5;}'
The script above is a complex pipe of many commands.  Let's understand them one by one:
  1. mysqlbinlog connects (locally) to the master, download the binary log specified in the file $file, and never stops (you will need to CTRL-C the script or kill one of its processes to stop it),
  2. two consecutive grep filter the output of mysqlbinlog to what is needed by the next awk commands (I could also do the same in awk and save two processes but using grep makes things more efficient by reducing the work done by awk: if I remove grep, awk is the bottleneck to the script, and if I shift the work from awk to grep, mysqlbinlog becomes the bottleneck of the script),
  3. the first awk computes the "modified" interval length (modified explained below) and saves the binary log name on rotation in the file $file (this allows to start back the script from this saved binary log if interrupted),
  4. and the last awk counts the number of transactions, computes the sum of the modified interval length, and the sum of their square (I am computing the sum of the square to eventually be able to expose a standard deviation).
The "modified" part is in the first awk: if the start of the current interval is smaller than the one from the previous interval, this start is replaced by the highest last_committed seen so far.  Said otherwise, the last_committed are modified to never "rewind".  This is needed because of a limitation in the slave MTS applier: in the current implementation, a transaction that cannot be scheduled acts as a barrier in scheduling.  To understand this better, let's take the partial output of mysqbinlog below.
...
#170206 21:19:31 ... last_committed=93124 sequence_number=93131
#170206 21:19:31 ... last_committed=93131 sequence_number=93132
#170206 21:19:31 ... last_committed=93131 sequence_number=93133
#170206 21:19:31 ... last_committed=93131 sequence_number=93134
#170206 21:19:31 ... last_committed=93131 sequence_number=93135

#170206 21:19:31 ... last_committed=93124 sequence_number=93136
#170206 21:19:31 ... last_committed=93131 sequence_number=93137
#170206 21:19:31 ... last_committed=93131 sequence_number=93138

#170206 21:19:31 ... last_committed=93132 sequence_number=93139
#170206 21:19:31 ... last_committed=93138 sequence_number=93140
...
In the first line, the transaction 93131 must wait for 93124 to complete.  Once 93124 is completed, 93131 is scheduled.  However, because 93132 needs to wait for 93131, the MTS applier stop processing transaction.  This barrier will be lifted once 93131 completes.  So even if 93136 could start after 93124 (in red and underlined above), it will actually start after 93131 because of the current implementation of the MTS applier.  Hence, for computing the "modified" interval length, I use 93131 as the start of transaction 93136 (instead of 93124).

Note again the advantage of the interval parallelism identification model in the mysqlbinlog extract above: transaction 93139 will be started once 93132 completes and does not need to wait for transactions 93133 to 93138 to complete.

Also note that a more complex MTS applier could see that a further transaction (93136 in red above) can be run at the same time as 93131 (both are unblocked by transaction 93124).  The current applier blocks at 93132, but this might change in the future.

If you need to transform the output of the script above to include UNIX epoch (for sending to graphite as an example), you can pipe it to the following hairy awk script:
awk -F " |:" '
    {YY = substr($1, 2, 2);
     MM = strtonum(substr($1, 4, 2));
     DD = strtonum(substr($1, 6, 2));
     YYYYMMDDHHMMSS = "20" YY " " MM " " DD " " $2 " " $3 " " $4;
     print mktime(YYYYMMDDHHMMSS), $5, $6, $7, $6 / $5;}' |
  awk -v p="my.graphite.metric.name.$(hostname -s).binlog_mts_metric." '
    {print p "trx", $2, $1;
     print p "interval_sum", $3, $1;}'

Some Average Modified Interval Length Metric Examples


Below, you will find results of the average modified interval length metric while tuning binlog_group_commit_sync_delay on four different masters.  As you can see, with the default value (binlog_group_commit_sync_delay=0), the metric is very close to 1 (no parallelism).  As I increase binlog_group_commit_sync_delay (to 10 milliseconds, 50 ms, 100 ms and 200 ms), I get bigger and bigger value for the metric (bigger is better as it means more parallelism to work with on the slave).





Looking at the four graphs above and without any tuning, I infer that I would not benefit from MTS on slaves.  With binlog_group_commit_sync_delay=200000, I can expect the first and third master to get moderate MTS speedups (average modified interval length of ~2.25 and ~3.1).  I can also expect the second and fourth masters to get good MTS speedups (average modified interval length of ~21 in both cases).  For those last two masters, we might even consider using a delay lower to reduce the commit latency.

However, a good metric does not mean a good MTS performance on the slave.  A low value of the metric will always lead to a poor MTS performance but a high value might not result in a good performance.  This could happen in the situation where some transactions are much bigger than others but this is a whole different subject (you can learn more about this by looking at the slides of my parallel replication talk).

Some Bug Reports (Feature Requests)


While doing this work, I imagined some improvements that could be made in MySQL.  The first one is that parsing binary log is a pain, so MySQL should expose metrics allowing the computation of the average modified interval length.  Those metrics could be the following global statuses:
  • BINLOG_COMMIT: would count the number of transactions written to the binary logs,
  • BINLOG_INTERVAL_LENGTH: would accumulate (sum) the unmodified interval length,
  • BINLOG_INTERVAL_LENGTH_SQUARE: would accumulate (sum) the square of the unmodified interval length (for computing a standard deviation),
  • BINLOG_INTERVAL_LENGTH_MODIFIED_MYSQL_5_7_19: would accumulate (sum) the "modified" interval length,
  • BINLOG_INTERVAL_LENGTH_MODIFIED_MYSQL_5_7_19_SQUAREwould accumulate (sum) the square of the "modified" interval length.
Note that I am using _MYSQL_5_7_19 as a suffix for the suggested modified length metrics because the MTS applier might be improved and a new "modified" counter (or the unmodified one) might be needed.  Both metrics should live side-by-side because some slaves could be 5.7.19 and others could be 8.0.

The metrics above are useful on the master but something similar is also useful on slaves.  The corresponding metrics as "seen" by the SQL_THREAD would provide information on what is executed on the slaves.  Those metrics would be "shifted" by the slave lag and compressed/extended with the throughput difference between master and slave.  Those slave metrics cannot be global statuses as they are replication channel specific, so we need new fields to SHOW SLAVE STATUS to the corresponding tables in performance schema.

The last improvement I see is adding an option to mysqlbinlog to only output specific information.  As quickly mentioned above, the limiting process in the pipe is mysqlbinlog.  As I did mysqlbinlog | grep "^#", an option for only showing event headers would have been useful.

So I opened (will open) the following MySQL bugs:
  • Bug#85965: Expose, on the master, counters for monitoring parallel information quality.
  • Bug#85966: Expose, on slaves, counters for monitoring parallel information quality.
  • Bug#XXXXX (I still need to open this bug...): Provide a way to avoid mysqlbinlog to generate useless output.


I hope that, thanks to the average modified interval length metric, you will be able to tune MySQL 5.7 parallel replication more easily.  I would appreciate if you would share your experience using this metric in the comments below or in a blog post.

If you want to know more about this subject, I will be happy to discuss it with you in April at Percona Live Santa Clara where I have a talk on MySQL/MariaDB Parallel Replication.

11 comments:

  1. Hi - Is it safe to use this setting binlog_group_commit_sync_delay with
    innodb_flush_log_at_trx_commit = 1 ?

    Thanks
    Prasad

    ReplyDelete
  2. I also meant to ask that in general setting binlog_group_commit_sync_delay will not impact the crash safety of the server.

    ReplyDelete
    Replies
    1. My understanding is that binlog_group_commit_sync_delay does not impact crash safety, it only increases the latency of a COMMIT. Think of it as adding a sleep before a COMMIT: this should not impact crash safety. But maybe you have something else in mind: could you share you concerns in a little more details ?

      Delete
    2. Hi Jean - Thanks for your quick response.
      To be more specific on my question, I was just wondering that since innodb_flush_log_at_trx_commit = 1, is for every commit, will it be safe to batch the commits for binlog sync.
      Also, is it recommended to have this setting on the slaves as well or just on master?

      Thanks!

      Delete
    3. Ah, ok, I understand better.

      The way InnoDB manages its REDO log (and data) is very different to the way MySQL manages its binary log. The binary log is only written to at the end of the transaction at commit - if a transaction rolls-back, normally (and there are exceptions) nothing is written to the binary log. The InnoDB REDO log is "written" to at every mutation of InnoDB data. I put written in quotes as this is not directly written to the REDO log files, but written in RAM in the InnoDB Redo Log Buffer ([1]). When InnoDB flushes its log, it flushes the whole content of the InnoDB Redo Log Buffer, not just the current transaction.

      [1]: https://dev.mysql.com/doc/refman/5.7/en/innodb-redo-log-buffer.html

      To give an example, imagine that 2 transactions consisting or one INSERT and one UPDATE run in parallel. An order of serialization of those transactions could be: t1-BEGIN, t1-INSERT, t2-BEGIN, t1-UPDATE, t2-INSERT, t1-COMMIT, t2-UPDATE and t2-COMMIT. At t1-COMMIT, the InnoDB Redo Log Buffer contains t1-INSERT, t1-UPDATE and t2-INSERT, so those 3 mutations are flushed at t1-COMMIT; and the t2-COMMIT will only flush t2-UPDATE.

      Let's take a very similar example, but this time with the following serialization (closer to grouping transactions): t1-BEGIN, t1-INSERT, t2-BEGIN, t1-UPDATE, t2-INSERT, t2-UPDATE, t1-COMMIT and then t2-COMMIT. As t1-COMMIT will flush both mutations of t2, t2-COMMIT will not have to flush anything. This is InnoDB Group Commit.

      So when many transactions are grouped (let's say 4 transactions), the flush of the first transaction will actually flush the 4 transactions.

      Something that is not part of your question, but that needs to be written for completeness: what happen if t2 rolls-back... In the case of the 1st serialization (t1-BEGIN, t1-INSERT, t2-BEGIN, t1-UPDATE, t2-INSERT, t1-COMMIT, t2-UPDATE, t2-ROLLBACK), the InnoDB REDO log will include the following:
      -t1-INSERT, t1-UPDATE, t2-INSERT (flushed by t1-COMMIT)
      -t2-UPDATE, t2-UNDO-INSERT, t2-UNDO-UPDATE (flushed by t2-ROLLBACK)
      This is why a ROLLBACK is expensive in InnoDB: if must UNDO (including logging) the transaction. For a small transaction, this is quick, but for a big transaction, it can be very much longer. Said otherwise, InnoDB optimizes COMMITs (which is the "usual" case) at the cost of penalizing ROLLBACKs (which should happen less often).

      I hope this clarifies things.

      Delete
    4. Hi Jean - Thanks for the detailed explanation. One clarification I would still need is that for MYSQL server durability, I am setting both innodb_flush_log_at_trx_commit=1 and sync_binlog=1. The point here is that in case of a crash, i might lose max 1 transaction in the binary log.
      Now suppose I set binlog_group_commit_sync_delay = 100ms and there are 10 transactions that have happened within those 100 ms, and the server crashes - will I not be losing 10 transactions now ?

      Delete
    5. I am not sure that with trx_commit=1 and sycn_binlog=1 (high durability settings) you will loose max 1 transaction. IMHO, you will loose 0 committed transaction.

      The confusion could be on the exact state of transactions. With respect to COMMIT and IMHO, you can be in 3 states: uncommitted, committing and committed. Those are respectively:
      -uncommitted: COMMIT has not been called yet,
      -committing: COMMIT has been called but has not returned yet,
      -committed: COMMIT has been called and has returned without error.

      In my statement above, I wrote that with high durability settings, you will loose 0 committed transaction when crashing. Obviously, you will loose all uncommitted transactions. It is the state of the committing transactions that are more complicated.

      The state of those committing transactions is actually undefined. Once you issue COMMIT and MySQL crashes without the application receiving an answer to COMMIT, you cannot know if the transaction will be committed or rolled-back after crash-recover. Two obvious cases are the following and all other cases in between are more complicated:
      1) you issue COMMIT and MySQL crashes right away without anything being done: the transaction will be rolled-back after crash-recovery,
      2) MySQL crashes after COMMIT is complete but just before it returns to the client: the transaction will be committed after crash-recovery.

      When you write that you might loose max 1 transaction with high durability settings, I think you mean loose max 1 _committing_ transaction. This is not even true. If your application COMMIT 10 transactions exactly at the same time and then MySQL crash, if all those transactions are in state #1 above (MySQL only received the COMMIT request and has not started doing anything), those 10 transactions will be roll-ed back after crash recovery.

      This is exactly what would happen in the example you give ("I set binlog_group_commit_sync_delay = 100ms and there are 10 transactions that have happened within those 100 ms"). For MySQL, those 10 transactions just started committing, so if MySQL crashes in those 100ms, the 10 transactions are lost. This does not violate any persistence guaranty: those transactions were not _committed_ (commit did not return) so their state is undefined.

      I hope this is clear...

      Delete
  3. Hi Jean - Thanks so much for your detailed explanation. I got a good understanding of this now.

    ReplyDelete
  4. Hi Jean - I ran into a strange issue with high durability settings and semi synchronous replication. Basically, in my master, there are executed GTIDs but are missing in the binary logs. I dont think it is directly related to this blogging topic though and my apologies for raising this here. If there are other suitable ways to share this question, please let me know, so that I can share more details.

    ReplyDelete
    Replies
    1. Sorry for the late reply, I was busy... You could share you question at below mail, but I am not sure if I will have a lot of time for looking into that.
      jfg DOT mysql AT gmail.com

      Delete