Wednesday, January 3, 2018

An update on Write Set (parallel replication) bug fix in MySQL 8.0

In my MySQL Parallel Replication session at Percona Live Santa Clara 2017, I talked about a bug in Write Set tracking for parallel replication (Bug#86078).  At the time, I did not fully understand what was going wrong, but since then, we (engineers at Oracle and I) have understood what happened, and the bug is supposed to be fixed in MySQL 8.0.4.  This journey taught me interesting MySQL behavior and bug reporting practices.  In this post, I am sharing both, in addition to some insight on Write Set tracking for parallel replication.

Let's first start with a little note: even though this post is talking about a bug, my goal is not product bashing but to use this bug to describe the inner workings of MySQL.  Write Set is a new technology, and bugs/corner cases are often found in new stuff.  My opinion is that Write Set is an awesome improvement for Parallel Replication!  With this out of the way, let's dive into Write Set.

Write Set tracking for parallel replication


The Write Set is a new technology in MySQL 8.0 (also in 5.7 but a little hidden: this will be the subject of another post) for improving parallel replication.  It is a data structure in the MySQL Server that allows widening the parallelism intervals (it also allows other things, but let's focus on Parallel Replication in this post).  You can read more about parallelism intervals in my previous post: A Metric for Tuning Parallel Replication in MySQL 5.7.

MySQL 5.7 uses parallelism on the master to generate parallelism intervals.  Without good parallelism on the master, intervals will be short and the slaves will not run much faster.  The Write Set allows tracking actual transaction dependencies, which produces much wider intervals.  Explaining in detail how Write Set works would need a complete post (more information can be found on the MySQL High Availability blog in the post Improving the Parallel Applier with Writeset-based Dependency Tracking by Vitor Oliveira), but the example below illustrates the important parts:
> SELECT version();
+--------------+
| version()    |
+--------------+
| 8.0.3-rc-log |
+--------------+
1 row in set (0.00 sec)

> SET autocommit = ON;
Query OK, 0 rows affected (0.00 sec)

> SET GLOBAL transaction_write_set_extraction = XXHASH64;
Query OK, 0 rows affected (0.00 sec)

> SET GLOBAL binlog_transaction_dependency_tracking = WRITESET;
Query OK, 0 rows affected (0.00 sec)

> FLUSH BINARY LOGS;
Query OK, 0 rows affected (0.00 sec)

> CREATE DATABASE test_jfg_ws;
Query OK, 1 row affected (0.01 sec)

> CREATE TABLE test_jfg_ws.test_jfg_ws (
    ->   id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ->   str VARCHAR(80) NOT NULL UNIQUE);
Query OK, 0 rows affected (0.01 sec)

> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('a');
Query OK, 1 row affected (0.00 sec)

> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('b');
Query OK, 1 row affected (0.00 sec)

> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('c');
Query OK, 1 row affected (0.00 sec)
Above, after enabling Write Set, creating a database and a table, and inserting three rows, we get the following in the binary logs (comments at the end added for clarity):
# mysqlbinlog my_binlog.N | grep last_ |
    sed -e 's/server id.*last/[...] last/' -e 's/.rbr_only.*/ [...]/'
#171127 10:04:58 [...] last_committed=0 sequence_number=1 [...] -- CREATE DB
#171127 10:05:21 [...] last_committed=1 sequence_number=2 [...] -- CREATE TB
#171127 10:07:13 [...] last_committed=2 sequence_number=3 [...] -- INSERT a
#171127 10:07:15 [...] last_committed=2 sequence_number=4 [...] -- INSERT b
#171127 10:07:18 [...] last_committed=2 sequence_number=5 [...] -- INSERT c
As the intervals of the three INSERTs overlap, those three transactions can be run in parallel on slaves.  Note that even though the transactions are executed in the same session on the master (so not in parallel), the Write Set is able to detect that they are non-conflicting.  This is a great improvement for replication speed in MySQL 8.0.
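To make the interval logic concrete, here is a minimal Python sketch (the function name is mine, not MySQL's) of the rule the slave applier follows: two transactions may run in parallel when their (last_committed, sequence_number] intervals overlap.

```python
def can_run_in_parallel(txn_a, txn_b):
    """Two binlog transactions, each a (last_committed, sequence_number)
    pair, may run in parallel on a slave when their intervals overlap:
    each one's last_committed is below the other's sequence_number."""
    lc_a, sn_a = txn_a
    lc_b, sn_b = txn_b
    return lc_a < sn_b and lc_b < sn_a

# The three INSERTs from the binary log above all share
# last_committed=2, so any pair of them can run in parallel.
insert_a, insert_c = (2, 3), (2, 5)
print(can_run_in_parallel(insert_a, insert_c))  # True

# CREATE TABLE (1, 2) depends on CREATE DATABASE (0, 1): no overlap.
print(can_run_in_parallel((0, 1), (1, 2)))  # False
```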

Bug in Write Set tracking in MySQL 8.0.1 (and still in 8.0.3)


While testing Write Set in MySQL 8.0.1 and preparing my talk for Percona Live Santa Clara 2017 (the talk title is MySQL Parallel Replication: all the 5.7 - and some of the 8.0 - details and the slides are available on SlideShare), I found a bug in Write Set tracking.  At the time, I did not understand what the problem was, and I opened Bug#86078 with very limited information (the bug title is Bad Write Set tracking with UNIQUE KEY on a DELETE followed by an INSERT).

Note: I am writing this section only with the information I had at the time; the full analysis is in the next two sections.

I discovered this bug in the case of a DELETE followed by an INSERT where the INSERT creates a new row (new auto-increment value as the primary key) with the same UNIQUE KEY as the previously deleted row.  This is a weird usage of MySQL as, normally, this DELETE and INSERT would be done in a single transaction.  Using two transactions means that a SELECT using the UNIQUE KEY could return zero rows (between the DELETE and the INSERT).  This might be a bug in the usage of MySQL, but it is out of the scope of this post.

The problem with this DELETE followed by an INSERT in two different transactions is that the Write Set was not correctly detecting the dependency between the transactions.  The INSERT obviously depends on the DELETE, but I got the following in the binary logs (comments at the end added for clarity):
#170415 23:49:04 ... last_committed=113398 sequence_number=113432 ... -- DELETE
#170415 23:49:04 ... last_committed=113398 sequence_number=113433 ... -- INSERT
The transaction with sequence_number 113433 (the INSERT) should have last_committed set to 113432 (the DELETE), but its last_committed is something else (113398).  This allows both transactions to be executed in parallel on slaves, where the INSERT might fail if it runs just before the DELETE.  When this happens, replication on the slave breaks with a duplicate key error.

I got this replication breakage many times during my tests.  When it happened, restarting replication would solve the problem, as the DELETE would by then be completed (just before replication broke) and the INSERT would then succeed.  This is nonetheless a bug in Write Set tracking for parallel replication, and it needed fixing.  However, without more information, Oracle was not able to reproduce this bug, which made things hard to fix.

MySQL behavior


Above, I described a situation leading to a duplicate key error on the slave.  I was (naively) assuming that a duplicate key error could only happen if the data was strictly the same.  That assumption did not take case-insensitive collations into account.

The default collation for MySQL 5.7 is latin1_swedish_ci (it is the same for MySQL 8.0 so far).  The _ci at the end of the collation name means case-insensitive (there are also accent-sensitive/insensitive collations; more details in the manual).  Let's do some tests with the table created above:
> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('d');
Query OK, 1 row affected (0.00 sec)

> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('D');
ERROR 1062 (23000): Duplicate entry 'D' for key 'str'
So 'd' and 'D' cause a duplicate entry error because the field str is in a UNIQUE KEY.  Even though 'd' and 'D' are not exactly the same value, their comparison returns equal because a case-insensitive collation is used.  In short, a duplicate key error can happen even when the data is not strictly the same ('d' vs 'D').
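A rough Python analogue makes this visible: under a case-insensitive collation, two values can differ byte-by-byte yet compare equal.  Here Python's casefold stands in for the collation's comparison rules; it is an approximation, not MySQL's algorithm.

```python
a, b = 'd', 'D'

# Byte-wise, the two values are different...
print(a == b)  # False

# ...but after case folding (our stand-in for a _ci collation's
# comparison), they are "equal", hence the duplicate key error.
print(a.casefold() == b.casefold())  # True
```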

MySQL collation behavior combined with Write Set


Knowing the collation behavior of MySQL, let's look again at our Write Set bug:
> FLUSH BINARY LOGS;
Query OK, 0 rows affected (0.00 sec)

> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('e');
Query OK, 1 row affected (0.00 sec)

> DELETE FROM test_jfg_ws.test_jfg_ws WHERE str = 'e';
Query OK, 1 row affected (0.00 sec)

> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('E');
Query OK, 1 row affected (0.00 sec)
The above commands result in the following in the binary logs:
# mysqlbinlog my_binlog.M | grep last_ |
    sed -e 's/server id.*last/[...] last/' -e 's/.rbr_only.*/ [...]/'
#171127 10:11:14 [...] last_committed=0 sequence_number=1 [...] -- INSERT e
#171127 10:12:21 [...] last_committed=1 sequence_number=2 [...] -- DELETE e
#171127 10:12:26 [...] last_committed=1 sequence_number=3 [...] -- INSERT E
So the second INSERT is not identified as depending on the DELETE.  This is the bug we were looking for.  To understand why this happens, we need to dive into the Write Set implementation.

The Write Set tracks the rows modified by each transaction.  With this information, the parallelism intervals can be computed.  Keeping the full rows in the Write Set would consume a lot of RAM and would make dependency tracking computationally expensive.  To make dependency tracking compact and efficient, MySQL only stores hashes of the PRIMARY and UNIQUE KEYs of the rows modified by a transaction.  By storing the hashes of previous transactions in the Write Set, it becomes easy to compute dependencies.  Hash collisions can happen, but in that case a non-existent dependency is found, which is not a big problem (it only reduces parallelism).
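The mechanism described above can be sketched in a few lines of Python.  This is a toy model under my own naming, not MySQL's actual code (MySQL hashes the encoded keys with XXHASH64; I use a truncated SHA-256 here to stay within the standard library):

```python
import hashlib

def key_hash(value):
    # MySQL uses XXHASH64 on the encoded key; any hash function
    # illustrates the idea (truncated SHA-256 here, stdlib-only).
    return hashlib.sha256(value.encode()).hexdigest()[:16]

class WriteSetHistory:
    """Toy dependency tracker: maps key hashes to the sequence_number
    of the last transaction that wrote them."""
    def __init__(self):
        self.last_writer = {}    # key hash -> sequence_number
        self.sequence_number = 0

    def commit(self, keys):
        self.sequence_number += 1
        # A transaction depends on the most recent writer of any key
        # it touches; unseen keys contribute no dependency (0).
        last_committed = max(
            (self.last_writer.get(key_hash(k), 0) for k in keys),
            default=0)
        for k in keys:
            self.last_writer[key_hash(k)] = self.sequence_number
        return last_committed, self.sequence_number

h = WriteSetHistory()
print(h.commit(['pk=1', 'uk=a']))  # (0, 1): no dependency
print(h.commit(['pk=2', 'uk=b']))  # (0, 2): disjoint keys
print(h.commit(['pk=3', 'uk=a']))  # (1, 3): depends on transaction 1
```

Transactions 1 and 2 touch disjoint keys, so they get the same last_committed and can run in parallel on slaves; transaction 3 reuses the unique key 'uk=a' and is correctly marked as depending on transaction 1.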

However, hashing and case-insensitive collations do not play well together: if you hash 'e' and 'E', you end up with different values.  This is the source of Bug#86078: the case (and accent) insensitivity of the collation was lost when the keys were hashed.
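The following Python sketch illustrates both the bug and the shape of the fix (again with casefold standing in for the collation's rules and truncated SHA-256 standing in for XXHASH64; this is my illustration, not the actual patch):

```python
import hashlib

def raw_hash(value):
    # Hashing the raw bytes of the key: what caused Bug#86078.
    return hashlib.sha256(value.encode()).hexdigest()[:16]

def collation_aware_hash(value):
    # Sketch of the fix: hash a normalized form of the key (a simple
    # case fold here, standing in for the collation's rules), so that
    # values comparing equal under the collation also hash equal.
    return raw_hash(value.casefold())

# 'e' and 'E' are duplicates under a _ci collation, but their raw
# hashes differ, so the Write Set missed the dependency.
print(raw_hash('e') == raw_hash('E'))                          # False
print(collation_aware_hash('e') == collation_aware_hash('E'))  # True
```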

Bug reporting practices


While working on this bug, I did not see that the field causing the UNIQUE KEY violation had a case difference.  I probably was too focused on the duplicate key error, assuming the data would be the same.

When I reported the bug, and because it happened on a production system with production data, I did not share the exact binary logs with Oracle.  I shared a manually anonymized output of mysqlbinlog, in which the case difference was lost.

To give an example, let's take my last name: Gagné.  If the bug above had happened with "Gagné" and "gagné", sharing the exact information would have allowed reproducing the problem.  However, if we manually anonymize the data to "xyz" and "xyz", thinking it is the same because of the duplicate key violation, the case difference is lost.  This is exactly what I did in my bug report, making it impossible for Oracle to reproduce the bug.

So my learnings on bug reporting practices are:
  • Data anonymization is hard,
  • and sometimes, it might be better to share the exact production data.
But sharing production data is not easy: the other party must be able to process that data with the necessary care, and agreements allowing sharing that data should be in place.


Thanks for reading and best wishes for 2018.

2 comments:

  1. mysql [localhost] {msandbox} ((none)) > show variables like '%coll%';
    +----------------------+--------------------+
    | Variable_name        | Value              |
    +----------------------+--------------------+
    | collation_connection | utf8mb4_0900_ai_ci |
    | collation_database   | utf8mb4_0900_ai_ci |
    | collation_server     | utf8mb4_0900_ai_ci |
    +----------------------+--------------------+
    3 rows in set (0.00 sec)

    mysql [localhost] {msandbox} ((none)) > use test
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

    Database changed
    mysql [localhost] {msandbox} (test) > insert t2(str) value('d');
    Query OK, 1 row affected (0.01 sec)

    mysql [localhost] {msandbox} (test) > insert t2(str) value('D');
    ERROR 1062 (23000): Duplicate entry 'D' for key 'str'
    mysql [localhost] {msandbox} (test) > delete from t2 where str='d';
    Query OK, 1 row affected (0.01 sec)

    mysql [localhost] {msandbox} (test) > insert t2(str) value('D');
    Query OK, 1 row affected (0.01 sec)

    #180105 22:17:31 [...] last_committed=5 sequence_number=9 [...] insert t2(str) value('d');
    #180105 22:18:19 [...] last_committed=9 sequence_number=10 [...] delete from t2 where str='d';
    #180105 22:18:22 [...] last_committed=5 sequence_number=11 [...] insert t2(str) value('D');

    ### INSERT INTO `test`.`t2`
    ### SET
    ### @1=4 /* INT meta=0 nullable=0 is_null=0 */
    ### @2='d'

    ### DELETE FROM `test`.`t2`
    ### WHERE
    ### @1=4 /* INT meta=0 nullable=0 is_null=0 */
    ### @2='d' /* VAR

    ### INSERT INTO `test`.`t2`
    ### SET
    ### @1=6 /* INT meta=0 nullable=0 is_null=0 */
    ### @2='D' /* VARSTRING(320) meta=320 nullable=0 is_null=0 */
    # at 3024

Do you notice in your case that the primary key is the same?

    Replies
    1. The primary key is different in my case (like in your case).

      Look at the intervals:

      #180105 22:18:19 [...] last_committed=9 sequence_number=10 [...] delete from t2 where str='d';
      #180105 22:18:22 [...] last_committed=5 sequence_number=11 [...] insert t2(str) value('D');

      Even with a UNIQUE KEY on str, those 2 transactions can be run in parallel on slaves, and this will break if the insert is run before the delete.
