Wednesday, February 8, 2017

A Metric for Tuning Parallel Replication in MySQL 5.7

MySQL 5.7 introduced the LOGICAL_CLOCK type of multi-threaded slave (MTS).  When using this type of parallel replication (and when slave_parallel_workers is greater than zero), slaves use information from the binary logs (written by the master) to run transactions in parallel.  However, enabling parallel replication on slaves might not be enough to get a higher replication throughput (VividCortex blogged about such a situation recently in Solving MySQL Replication Lag with LOGICAL_CLOCK and Calibrated Delay).  To get a faster slave with parallel replication, some tuning is needed on the master.

I already wrote/spoke many times about this tuning (when presenting slave group commit on the blog, when speaking about MariaDB 10.0 parallel replication, and when speaking about MySQL 5.7 parallel replication).  I call this slowing down the master to speed-up the slave.  This consists of delaying commit in the hope of identifying more parallelism on the master and in the hope that the slaves will replicate faster.  This has visible side effects on clients by increasing commit latency, which could be a problem if a session is in a tight commit loop (the commit throughput will drop) or in other situations where a client relies on commit being quick.  So this tuning should be done with caution and its impact should be monitored carefully.

In MariaDB Server 10+

When doing the "slowing down the master to speed-up the slave" tuning in MariaDB Server 10+ (10.0, 10.1 and 10.2), the binlog_commit_wait_count and binlog_commit_wait_usec global variables are used.  By default, the delay is 100 milliseconds (binlog_commit_wait_usec=100000), but delaying commit is disabled (binlog_commit_wait_count=0).  To enable delaying commit, binlog_commit_wait_count should be increased to a positive value.

When tuning those parameters, the binlog_commits and binlog_group_commits global statuses should be monitored and from those, an average group commit size can be computed.  This average group commit size is useful to get a sense the tuning quality.  While increasing binlog_commit_wait_usec:
  • if the group commit size grows: the tuning has a good change of giving better parallel replication speed (I write "a good chance" here as some situation, like large transactions, could slowdown parallel replication and this is not shown by this metric).
  • if the group commit size stays constant: the tuning should probably be reverted because the parallelism sent to slave does not increase and commit latency probably grows.
  • if the commit rate drops noticeably: the tuning might have an undesirable effect on the application as throughput is reduced, and we should be extra-careful in moving forward.
Below is an example of this tuning where the commit rate clearly drops (slowing down the master) and the group commit size increases noticeably (speeding up the slave).  The value of binlog_commit_wait_usec is the default (100000).  In this particular case, after more investigations and because I needed a maximum slave throughput, I went even went further into obtaining a much higher replication throughput.

Also, not only the group commit size allows to measure the tuning quality and to quantify the amount of parallelism identified by the master and sent to slaves, it allows to size the value of slave_parallel_threads to use on the slave.  In my opinion, the average group commit size is an invaluable metric to have to monitor and optimize performance.

So in MariaDB Server, we have all instrumentation to perform an efficient and safe tuning of parallel replication on the master and on the slaves.  What about MySQL 5.7 ?

In MySQL 5.7

When doing the "slowing down the master to speed-up the slave" tuning in MySQL 5.7, the binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count global variables are used.  By default, both values are set to zero, which disable delaying commit.  To enable this, binlog_group_commit_sync_delay should be set to a positive value.  Note the difference between MySQL and MariaDB here: binlog_commit_wait_count=0 disable delay in MariaDB and binlog_group_commit_sync_no_delay_count=0 simply put no upper bound on group commit size in MySQL.

However, because MySQL 5.7 does not rely on group commit to identify parallelism, there is no corresponding global statuses allowing to measure the tuning quality.  This increases the complexity of this already challenging tuning.  At this point, the only way to measure the tuning efficiency is to look at the slave throughput.  But if slaves are lagging when doing the tuning, we need to wait until the optimized transactions reach the slave and then measure if the slave is actually faster.  This is not very practical.

The Wish for a MySQL 5.7 MTS Tuning Metric

I explain above how the average group commit size can be used as a metric to tune parallel replication in MariaDB Server.  I wanted to find a corresponding metric for MySQL 5.7.  Before looking at the metric that I came up with and to be able to understand it, we need to dive into how a MySQL 5.7 master transfers parallelism information to slaves.  To do that, I will first describe the MariaDB implementation because MySQL is more complex and it is useful to first understand something simpler.

A MariaDB 10+ master puts a group commit id in the binary log header of each transaction.  Below, I extracted the group commit id (cid) from a MariaDB binary log where tuning has been done.
#150316 11:33:46 server id 1 end_log_pos x GTID 0-1-184 cid=2324
#150316 11:33:46 server id 1 end_log_pos x GTID 0-1-185 cid=2335
#150316 11:33:46 server id 1 end_log_pos x GTID 0-1-189 cid=2335

#150316 11:33:46 server id 1 end_log_pos x GTID 0-1-190
#150316 11:33:46 server id 1 end_log_pos x GTID 0-1-191 cid=2346
#150316 11:33:46 server id 1 end_log_pos x GTID 0-1-197 cid=2346

#150316 11:33:46 server id 1 end_log_pos x GTID 0-1-198 cid=2361
In the extract above, we can see that:
  • the first line is the last transaction of a group commit identified by cid 2324,
  • we then have the first and the last transactions of cid 2335 (by looking at the GTID, we can see that there are five transactions in that group),
  • we then have a transaction that commits alone (for a group commit size of one, MariaDB omits the cid),
  • we then have the first and the last transactions of cid 2335 (seven transactions in that group),
  • and we then have the first transaction of cid 2361.
Enough about MariaDB, what about MySQL ?  In MySQL, the parallelism information is not modeled as group commit but as intervals.  The beginning of the interval corresponds to a view on which the current transaction is based and the end of the interval corresponds to the view materialized by the current transaction.

So, the master tags each transaction with an increasing number.  This number is reset to one on binary log rotation, so the number for the first transaction in a binary log is always one.  This transaction number, not to be confused with its GTID, is the interval end and is labeled as sequence_number in the output of mysqlbinlog.  The beginning of the interval, identified as last_committed, corresponds to the sequence number of the most recently committed transaction when the current transaction gets its last lock (this is zero for the first transaction in a binary log).  So the beginning of the interval (last_committed) represents the most recent transaction on which the current transaction depends on.

Below, I have extracted the interval information from a MySQL instance where partial tuning has been done.
#170206 20:08:33 ... last_committed=6201 sequence_number=6203
#170206 20:08:33 ... last_committed=6203 sequence_number=6204
#170206 20:08:33 ... last_committed=6203 sequence_number=6205
#170206 20:08:33 ... last_committed=6203 sequence_number=6206
#170206 20:08:33 ... last_committed=6205 sequence_number=6207
In the extract above, we can see:
  • the first line is the transaction with sequence number 6203 and this transaction depends on transaction 6201,
  • we then have three transactions (6204, 6205 and 6206) committing together (they depend on 6203),
  • and we then have a transaction (6207) that depends on the second transaction of the previous group (6205).
That last transaction shows the advantage of the intervals: transaction 6207 can be started as soon as transaction 6205 completes (6207 can run in parallel with 6206).  So the MySQL interval design might allow for a quicker scheduling of some transactions.  However, the interval design increases the complexity of analyzing parallelism potential, so tuning becomes harder.

A MySQL 5.7 MTS Tuning Metric: the Average Modified Interval Length

I came up with a tuning metric for MySQL 5.7: the average "modified" interval length (I will explain the modified part in a few paragraphs below).  This metric is based on the following observation: the larger the intervals are, the better the parallelism identification is.

The larger the intervals are,
the better the parallelism identification is.

If we prefer to think in terms of group commit size, we can map the average interval length to a pseudo-group commit size.  This can be useful to be able to tune the number of slave threads (slave_parallel_workers) .  Looking at the group commit of size three in the mysqlbinlog extract above (all transactions depending on 6203), we can see that the interval length of the transactions are respectively 1, 2 and 3, which average to 2.  We can generalize that for a group commit of size n, the sum of the intervals length should be n*(n+1)/2, and the average length should be (n+1)/2.  So to get back n, we need to multiply this by 2 and subtract 1.  This means that to map an average interval length to a group commit size, we should multiply it by 2 and subtract 1 (we can validate that an average interval length of 1 (no parallelism) maps to a group commit size of 1).

The way to get the average "modified" interval length is by parsing the binary logs and by doing some messy computations.  This is done by running the script below on a master (it could easily be modified to be run from anywhere but I advise against it as it would waste network bandwidth to transfer binary logs):
echo _first_binlog_to_analyse_ > $file;
mysqlbinlog --stop-never -R --host $(cat $file) |
  grep "^#" | grep -e last_committed -e "Rotate to" |
  awk -v file=$file -F "[ \t]*|=" '
    $11 == "last_committed" {
       if (length($2) == 7) {$2 = "0" $2;}
       if ($12 < max) {$12 = max;} else {max = $12;}
       print $1, $2, $14 - $12;}
    $10 == "Rotate"{print $12 > file; close(file); max=0;}' |
  awk -F " |:" '
    {my_h = $2 ":" $3 ":" $4;}
     NR == 1 {d=$1; h=my_h; n=0; sum=0; sum2=0;}
     d != $1 || h < my_h {print d, h, n, sum, sum2; d=$1; h=my_h;}
    {n++; sum += $5; sum2 += $5 * $5;}'
The script above is a complex pipe of many commands.  Let's understand them one by one:
  1. mysqlbinlog connects (locally) to the master, download the binary log specified in the file $file, and never stops (you will need to CTRL-C the script or kill one of its processes to stop it),
  2. two consecutive grep filter the output of mysqlbinlog to what is needed by the next awk commands (I could also do the same in awk and save two processes but using grep makes things more efficient by reducing the work done by awk: if I remove grep, awk,  is the bottleneck to the script, and if I shift the work from awk to grep, mysqlbinlog becomes the bottleneck of the script),
  3. the first awk computes the "modified" interval length (modified explained below) and saves the binary log name on rotation in the file $file (this allows to start back the script from this saved binary log if interrupted),
  4. and the last awk counts the number of transactions, computes the sum of the modified interval length, and the sum of their square (I am computing the sum of the square to eventually be able to expose a standard deviation).
The "modified" part is in the first awk: if the start of the current interval is smaller than the one from the previous interval, this start is replaced by the highest last_committed seen so far.  Said otherwise, the last_committed are modified to never "rewind".  This is needed because of a limitation in the slave MTS applier: in the current implementation, a transaction that cannot be scheduled acts as a barrier in scheduling.  To understand this better, let's take the partial output of mysqbinlog below.
#170206 21:19:31 ... last_committed=93124 sequence_number=93131
#170206 21:19:31 ... last_committed=93131 sequence_number=93132
#170206 21:19:31 ... last_committed=93131 sequence_number=93133
#170206 21:19:31 ... last_committed=93131 sequence_number=93134
#170206 21:19:31 ... last_committed=93131 sequence_number=93135

#170206 21:19:31 ... last_committed=93124 sequence_number=93136
#170206 21:19:31 ... last_committed=93131 sequence_number=93137
#170206 21:19:31 ... last_committed=93131 sequence_number=93138

#170206 21:19:31 ... last_committed=93132 sequence_number=93139
#170206 21:19:31 ... last_committed=93138 sequence_number=93140
In the first line, the transaction 93131 must wait for 93124 to complete.  Once 93124 is completed, 93131 is scheduled.  However, because 93132 needs to wait for 93131, the MTS applier stop processing transaction.  This barrier will be lifted once 93131 completes.  So even if 93136 could start after 93124 (in red and underlined above), it will actually start after 93131 because of the current implementation of the MTS applier.  Hence, for computing the "modified" interval length, I use 93131 as the start of transaction 93136 (instead of 93124).

Note again the advantage of the interval parallelism identification model in the mysqlbinlog extract above: transaction 93139 will be started once 93132 completes and does not need to wait for transactions 93133 to 93138 to complete.

Also note that a more complex MTS applier could see that a further transaction (93136 in red above) can be run at the same time as 93131 (both are unblocked by transaction 93124).  The current applier blocks at 93132, but this might change in the future.

If you need to transform the output of the script above to include UNIX epoch (for sending to graphite as an example), you can pipe it to the following hairy awk script:
awk -F " |:" '
    {YY = substr($1, 2, 2);
     MM = strtonum(substr($1, 4, 2));
     DD = strtonum(substr($1, 6, 2));
     YYYYMMDDHHMMSS = "20" YY " " MM " " DD " " $2 " " $3 " " $4;
     print mktime(YYYYMMDDHHMMSS), $5, $6, $7, $6 / $5;}' |
  awk -v p="$(hostname -s).binlog_mts_metric." '
    {print p "trx", $2, $1;
     print p "interval_sum", $3, $1;}'

Some Average "Modified" Interval Length Metric Examples

Below, you will find results of the average "modified" interval length metric while tuning binlog_group_commit_sync_delay on four different masters.  As you can see, with the default value (binlog_group_commit_sync_delay=0), the metric is very close to 1 (no parallelism).  As I increase binlog_group_commit_sync_delay (to 10 milliseconds, 50 ms, 100 ms and 200 ms), I get bigger and bigger value for the metric (bigger is better as it means more parallelism to work with on the slave).

Looking at the four graphs above and without any tuning, I infer that I would not benefit from MTS on slaves.  With binlog_group_commit_sync_delay=200000, I can expect the first and third master to get moderate MTS speedups (average "modified" interval length of ~2.25 and ~3.1).  I can also expect the second and fourth masters to get good MTS speedups (average "modified" interval length of ~21 in both cases).  For those last two masters, we might even consider using a delay lower to reduce the commit latency.

However, a good metric does not mean a good MTS performance on the slave.  A low value of the metric will always lead to a poor MTS performance but a high value might not result in a good performance.  This could happen in the situation where some transactions are much bigger than others but this is a whole different subject (you can learn more about this by looking at the slides of my parallel replication talk).

Some Bug Reports (Feature Requests)

While doing this work, I imagined some improvements that could be made in MySQL.  The first one is that parsing binary log is a pain, so MySQL should expose metrics allowing the computation of the average "modified" interval length.  Those metrics could be the following global statuses:
  • BINLOG_COMMIT: would count the number of transactions written to the binary logs,
  • BINLOG_INTERVAL_LENGTH: would accumulate (sum) the unmodified interval length,
  • BINLOG_INTERVAL_LENGTH_SQUARE: would accumulate (sum) the square of the unmodified interval length (for computing a standard deviation),
  • BINLOG_INTERVAL_LENGTH_MODIFIED_MYSQL_5_7_17: would accumulate (sum) the "modified" interval length,
  • BINLOG_INTERVAL_LENGTH_MODIFIED_MYSQL_5_7_17_SQUAREwould accumulate (sum) the square of the "modified" interval length.
Note that I am using _MYSQL_5_7_17 as a suffix for the suggested "modified" length metrics because the MTS applier might be improved and a new "modified" counter (or the unmodified one) might be needed.  Both metrics should live side-by-side because some slaves could be 5.7.17 and others could be 8.0.

The metrics above are useful on the master but something similar is also useful on slaves.  The corresponding metrics as "seen" by the SQL_THREAD would provide information on what is executed on the slaves.  Those metrics would be "shifted" by the slave lag and compressed/extended with the throughput difference between master and slave.  Those slave metrics cannot be global statuses as they are replication channel specific, so we need new fields to SHOW SLAVE STATUS to the corresponding tables in performance schema.

The last improvement I see is adding an option to mysqlbinlog to only output specific information.  As quickly mentioned above, the limiting process in the pipe is mysqlbinlog.  As I did mysqlbinlog | grep "^#", an option for only showing event headers would have been useful.

So I will open the following MySQL bugs (as I want to put a link to this post in the bug reports, I need to publish this post before opening the bugs, so I will update this post with bug numbers and links in the next hours/days):
  • Bug#XXXXX: Expose, on the master, counters for monitoring parallel information quality.
  • Bug#XXXXX: Expose, on the slave, counters for monitoring parallel information quality.
  • Bug#XXXXX: Provide a way to avoid mysqlbinlog to generate useless output.

I hope that, thanks to the average "modified" interval length metric, you will be able to tune MySQL 5.7 parallel replication more easily.  I would appreciate if you would share your experience using this metric in the comments below or in a blog post.

If you want to know more about this subject, I will be happy to discuss it with you in April at Percona Live Santa Clara where I have a talk on MySQL/MariaDB Parallel Replication.

No comments:

Post a Comment