Monday, May 22, 2023

Avoiding a STOP SLAVE Crash with MTR in Percona Server older than 5.7.37-40

I am finalizing my Percona Live talk MySQL and Vitess (and Kubernetes) at HubSpot.  In this talk, I mention that I like that Percona is providing a better MySQL with Percona Server.  This comes with a little inconvenience though: with improvements sometimes come regressions.  This post is about one such regression and a workaround I implemented some time ago (I should have shared it earlier).  This regression is fixed in Percona Server 5.7.37-40, so this is mostly interesting for historical purposes, or if you are running a Percona Server version older than that.

Another similar subject / inconvenience is the Free Wait Stalls of the Percona backoff algorithm.  I fully describe this in another post: Tail Latencies in Percona Server because of InnoDB Stalls on Empty Free List.

So Percona introduced a regression, but in which context?  They attempted to fix Bug#96400: MTS STOP SLAVE takes over a minute when master crashed during event logging.  This was done in PS-5824 by merging a Facebook patch (available in 5.7.28-31 and 8.0.17-8, released on 2019-11-13 and 2019-10-30).  But there was a crashing bug in this patch, which was addressed in PS-7147 (fixed in 8.0.19-10, released 2020-03-23, so 5 months later, which is quick).  However, fixing 5.7 was forgotten at that time and addressed later in PS-8030 (reported 2022-01-11, fixed in 5.7.37-40, released 2022-03-31).

My guess as to why there was a delay between the fix in 8.0 and the one in 5.7 is that not many people are using MTS (multi-threaded slave, also known as MTR - multi-threaded replication - in more recent terminology).  Because this feature is not used much, the bug was not brought to the attention of Percona earlier.  We have to recognize Percona's reactivity though: the bug was fixed very quickly after being reported (less than 3 months).

My mentioning the above (regression, crashing bug, and 5.7 being forgotten) should not be considered a criticism of Percona.  When improving things, regressions are sometimes introduced: this is part of the risk involved in pushing new code.  In the case of Percona, I think more good comes from improving things that upstream is not addressing, so I accept that regressions are a normal consequence of bug fixing.

In the specific case of this bug and where I work, we were affected by this regression.  I already mentioned that we have automation in place to enable MTR when replication lag starts to manifest, and this automation also disables MTR when replication lag is gone (the post where I mentioned this automation is Triggering Replication Lag for Testing a Script).  Disabling MTR requires running STOP SLAVE, and because of the bug, it risks crashing MySQL.
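For context, the core of what such automation runs looks like the two commands below.  This is only a minimal sketch, not our actual automation, and the value of 4 workers is an arbitrary example: changing slave_parallel_workers only takes effect after restarting the SQL Thread, which is why disabling MTR cannot avoid a STOP SLAVE (or STOP SLAVE SQL_THREAD).

# Sketch: enabling MTR (4 workers is an arbitrary example value).
mysql <<< "STOP SLAVE SQL_THREAD; SET GLOBAL slave_parallel_workers = 4; START SLAVE SQL_THREAD"

# Sketch: disabling MTR; this is the STOP SLAVE that risks crashing mysqld.
mysql <<< "STOP SLAVE SQL_THREAD; SET GLOBAL slave_parallel_workers = 0; START SLAVE SQL_THREAD"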

A logical question to ask at this point is: why does MTR need to be disabled?  After all, if disabling MTR risks crashing MySQL, we could just leave it enabled.  The problem with this is that Replication Load Monitoring does not work well when MTR is enabled, so by keeping MTR enabled, we generate a blind spot on replication load.  But what is Replication Load Monitoring?  It is a nice usage of Performance Schema to keep track of the idle, and hence busy, time percentage of the SQL Thread.  Replication Load Monitoring is a subject I still need to write about, but you can already understand most of it in a post by Mark Leith: A MySQL Replication Load Average with Performance Schema (at the time of publishing this post, the link to Mark's post is not working and returns a WordPress error, but it can be viewed on the Internet Archive).
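To give an idea of how this works, below is a minimal sketch of the Performance Schema sampling (it is not the exact query from Mark's post): when the SQL Thread has nothing to do, it waits on the relay log update condition, so repeatedly sampling its current wait event gives an idle, and hence busy, percentage.  The instrument name is from memory, and both the wait/synch instruments and the events_waits_current consumer need to be enabled for this to return anything, so treat it as an assumption to verify on your version.

# Sample the current wait event of the SQL Thread: if EVENT_NAME is
#   wait/synch/cond/sql/MYSQL_RELAY_LOG::update_cond (assumption: the exact
#   instrument name might differ between versions), the SQL Thread is idle.
mysql <<< "
  SELECT t.NAME, w.EVENT_NAME
  FROM performance_schema.threads t
  LEFT JOIN performance_schema.events_waits_current w USING (THREAD_ID)
  WHERE t.NAME = 'thread/sql/slave_sql'"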

To avoid crashing MySQL when disabling MTR, I came up with an interesting workaround.  Because the crash involves one of the workers finishing its work while STOP SLAVE is running, a way to prevent the crash is to make sure the workers have no work to execute.  This can be done by starving the SQL Thread of relay log events to execute.  To do this, we just need to stop the IO Thread and wait for all relay logs to be processed.  This is what is done in the bash script in Annex #1; most of its inner workings are explained in the comments.

Starving the SQL Thread by stopping the IO Thread
and waiting for all relay logs to be processed avoids the crash

Another solution that I thought about but did not test: running FLUSH TABLES WITH READ LOCK.  This would freeze the SQL Thread and would probably avoid the STOP SLAVE crash, as no worker would be finishing work while stopping replication.  This solution might be better for situations where there is replication lag: the starving solution only works well, and quickly, when there is little lag, because it needs all relay logs to be processed before safely stopping replication.
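If you want to experiment with that untested alternative, a minimal sketch could look like the lines below (all statements on a single connection, because the global read lock is released as soon as the session ends):

# Untested sketch: freeze the applier with a global read lock, then stop
#   replication and disable MTR while no worker can be finishing work.
mysql <<< "
  FLUSH TABLES WITH READ LOCK;
  STOP SLAVE;
  SET GLOBAL slave_parallel_workers = 0;
  UNLOCK TABLES;
  START SLAVE"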

There is one comment needing a little more explanation in the script below though:

Vitess (at least, in v5) luckily does not restart the IO Thread

When running STOP SLAVE on a MySQL replica that is managed by Vitess, Vitess will fight us and restart replication.  To correctly stop replication in a Vitess environment, the Stop Replication command from Vitess must be used.  Luckily for us, when only stopping the IO Thread, Vitess does not fight us and will not restart it.  Also, from what I know, there is no command in Vitess to stop only the IO Thread.  Maybe this is needed; I might open a feature request for it.
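For reference, stopping replication the Vitess way goes through vtctlclient.  The exact command name depends on the Vitess version (it used to be StopSlave and was later renamed StopReplication), so treat the line below as a hypothetical example to adapt to your setup:

# Hypothetical example: adjust the command name, vtctld address and tablet
#   alias to your own Vitess deployment.
vtctlclient -server vtctld-host:15999 StopReplication zone1-0000000101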

This is all I have for today, I hope you found this interesting, and thanks Percona for fixing bugs upstream is not addressing.


Annex #1: Script for Safely Disabling MTR (avoiding the crashing bug)

function disable_mtr() {
  # STOP SLAVE SQL_THREAD might crash mysqld until PS 5.7.37-40 [1].
  #   [1]: https://jira.percona.com/browse/PS-8030

  # Because the crash is related to workers finishing their work during
  #   STOP SLAVE SQL_THREAD, a workaround is to starve the workers by stopping
  #   the IO Thread first.  By doing this and then waiting for the SQL Thread to
  #   process all relay logs before running STOP SLAVE, the crash is avoided.

  # Vitess (at least, in v5) luckily does not restart the IO Thread.

  local sss=$(mysql <<< "STOP SLAVE IO_THREAD; SHOW SLAVE STATUS\G")

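  # Retrieved_Gtid_Set contains the GTIDs fetched by the IO Thread in the
  #   relay logs: once they are all executed, the workers have nothing left to do.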
  local sss_rgs=$(awk '$1 == "Retrieved_Gtid_Set:"{print $2}' <<< "$sss")

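  # WAIT_FOR_EXECUTED_GTID_SET returns 0 once all these GTIDs are in the
  #   executed set (all relay logs processed), and 1 if the timeout is reached.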
  local wait_timeout=5
  local sql_cmd="SELECT WAIT_FOR_EXECUTED_GTID_SET('${sss_rgs}', ${wait_timeout})"
  local wait_gtid=$(mysql -N <<< "$sql_cmd")

  test "$wait_gtid" -eq 1 && {
    # If we are here, WAIT_FOR_EXECUTED_GTID_SET timed-out, so we were not able
    #  to process all relay logs in a decent amount of time.
    mysql $tracking_schema <<< "START SLAVE IO_THREAD"
    return 1
  }
  
  mysql <<< "STOP SLAVE; SET GLOBAL slave_parallel_workers = 0; START SLAVE"  
}
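
For completeness, a hypothetical way of calling the function above (assuming the mysql client used in the script authenticates without arguments, for example via ~/.my.cnf):

disable_mtr || echo "Could not disable MTR: relay logs not drained within the timeout, IO Thread restarted"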
