In the last weeks / months, I have been working on understanding and improving MySQL startup with many tables. I have already written five posts on the subject; they are listed below. In this post, I use the knowledge gained in the two previous posts to show the benefit of tuning InnoDB Tablespace Duplicate Check Threads, making startup 30% faster in one case (2:28 vs. 3:33) and 5% faster in another (5:33 vs. 5:53).
(Sorry: even though this post was published in December, the URL makes it look like it was published in November. Blogger played a trick on me on this one, and I do not know how to fix it.)
This post is written in the context of my work on MySQL at Aiven : check our Developer Center, blog or careers page to know more about us.
The five posts I already wrote on the subject of long MySQL startup with many tables are :
- Faster MySQL Startup with Many Tables;
- Impressed by Disabling InnoDB Redo Logging for Creating Tables as Fast as Possible;
- Long and Silent / Stressful MySQL Startup with Many Tables;
- Understanding InnoDB Tablespace Duplicate Check;
- The Light MySQL Startup Optimization on EBS Volumes.
In the first post above, I present the fadvise and light optimizations (reminder: they consist of reducing the IO size from 32 KiB to 16 and 4 KiB respectively). The fourth post explains in greater detail the InnoDB Tablespace Duplicate Check phase (Duplicate Check for short), including the thread heuristic for Duplicate Check (reminder: at most 16 threads, at most 2 threads per core, and one thread per 8000 tablespaces). Also in that fourth post, I presented a system analysis of Duplicate Check, which can be summarized as:
a. t threads (example: 4) with an IO latency of l (example: 0.5 ms) are able to use t/l iops (8000 in our example);
b. for a disk capacity of c (example: 64k iops) and an IO latency of l (example: 0.5 ms), c*l threads are needed to saturate that capacity (32 in our example).
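These two formulas are simple enough to sanity-check with a one-liner. Below is a quick sketch using the example values above (latencies given in ms and converted to seconds in the computation):
$ awk 'BEGIN {t=4; l=0.5; c=64000;
    printf "iops from %d threads at %.1f ms latency: %d\n", t, l, t / (l / 1000);
    printf "threads to saturate %d iops at %.1f ms latency: %d\n", c, l, c * (l / 1000)}'
iops from 4 threads at 0.5 ms latency: 8000
threads to saturate 64000 iops at 0.5 ms latency: 32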
From b. above, we would expect the current maximum of 16 threads not to be able to fully use 64k iops, and from a. above, we would expect a 2-core vm not to be able to use more than 8k iops (because it only gets 4 threads). Let's verify this.
For this verification, we need a very simple patch allowing to set the number of threads for Duplicate Check. This patch can be found in a Fake PR on GitHub, and it has been contributed in Bug #116716 : Allow Setting Nb Thread for Tablespace Duplicate Check.
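With this patch, the number of threads becomes a normal MySQL option. A minimal usage sketch, mirroring what the test scripts of annex #2 do (based on the results below, the value -1 appears to select the default heuristic):
$ echo "innodb_tablespace_duplicate_check_threads=16" >> my.sandbox.cnf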
The tests with an io2 volume of 64k iops were inconclusive (at least from the point of view of Duplicate Check threads). On this volume and on a m6i.2xlarge instance, the IO latency I observed was lower than what I expected (0.3 ms vs. 0.5 ms) and I was not able to schedule more than 40k iops (both shown in annex #1). Using the formulas above, using 40k iops at 0.3 ms latency only needs 12 threads, so using more than 16 threads is not useful here. More in-depth tests allowed me to find interesting characteristics of io2 volumes, but this is too long a discussion for this post on Duplicate Check threads, and it will be covered in a follow-up post.
The tests with a small vm and a "beefy" volume were conclusive (done with a m6i.large instance, which only has 2 vcpu). Because of the way the number of Duplicate Check threads is currently computed, this instance only uses 4 threads. According to the system analysis, this would not saturate a 16k iops gp3 volume (it would only use 8k iops). Tests done with 4, 8 and 16 threads show exactly this (the full results are in annex #2) : 3:33 with 4 threads (3 minutes 33 seconds) compared to 2:28 and 2:23 with 8 and 16 threads, which is 65 seconds faster (30%).
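For reference, below is a sketch of the thread heuristic as I understand it from the fourth post (the exact computation in the server code might differ), showing why a 2-vcpu instance only gets 4 threads for 1 million tablespaces. Note also that the 8k iops prediction is consistent with annex #2: 4 threads at the measured 0.518 ms average latency give about 7.7k iops, very close to the 7698 reads/s of the sysbench run with 4 threads.
$ awk 'BEGIN {vcpus=2; tablespaces=1000000;
    t = int(tablespaces / 8000);       # one thread per 8000 tablespaces
    if (t > 2 * vcpus) t = 2 * vcpus;  # at most 2 threads per core
    if (t > 16) t = 16;                # at most 16 threads
    print "Duplicate Check threads:", t}'
Duplicate Check threads: 4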
Duplicate Check thread tuning is useful for small vms
connected to high-throughput network disks
(these usually have high latency, needing many threads)
The last test of this post is not as conclusive as the above, but I cannot skip it given what we found in the previous post. In that post, more precisely in its annex #4, we found that fully using all the iops of very large local-ssd vms needs 32 or even 64 threads. Because Duplicate Check is capped at 16 threads, not all iops are utilized on such vms. With 1 million tables, 32 and 64 threads lead to startup times of 0:55 and 0:56 vs. 0:58 for 16 threads, which is not a noticeable improvement and is within the margin of error of the tests. So I ran tests with 10 million tables (yes, this is a little excessive, but I was curious), and the results with 32 and 64 threads are 5:33 and 5:30 vs. 5:53 for 16 threads. So 20 seconds faster (5%), which, even if nice to have, is not much of a game changer (saving 20 seconds when more than 330 are still spent is not a significant saving). The full results are in annex #3.
The reason the saving is relatively small with 10M tables is that the startup time is not dominated by Duplicate Check. So even with Duplicate Check going from 1:07 with 16 threads to 0:45 and 0:40 with 32 and 64 threads (I explain how I got these numbers in annex #3), the startup time with 10M tables stays high because DD Tablespace Files Reading takes 2:03.
As our attention is brought to Tablespace Reading, it is interesting to mention that Huaxiong Song wrote, in a comment in Bug #116715 (20 Nov 3:39), that Alibaba parallelized the acquisition of DD tablespace information. This looks like an interesting optimization; I hope this patch will be made available, or even better, contributed to MySQL.
Before ending this post, even if this is not especially relevant here, I want to bring your attention to latency test results on gp3 volumes and AWS local SSDs. They confirm an average latency of 0.51 ms for gp3 volumes, and show an average latency of 0.028 ms for AWS local SSDs. The full results are in annex #4. The lower latency of io2 volumes will be confirmed in a follow-up post.
Annex #1 : Tests with an io2 Volume of 64k iops
The latency test below was run on a m6i.2xlarge instance with an io2 volume scaled to 64k iops : the IO latency on io2 volumes is better than I expected (0.3 ms vs. 0.5 ms). The important arguments are in blue and the output of the commands is in orange for readability.
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 75.6 us / 307.7 us / 2.22 ms / 86.8 us
I did a test comparing the throughput of 4, 16, and 32 KiB IOs (sysbench fileio). As you can see below, contrary to gp3 volumes, this io2 volume's throughput drops by 25% when increasing the IO size from 4 to 32 KiB (in annex #1 of the previous post, we saw that default gp3 volumes and 4x-scaled gp3 volumes have the same throughput for 4 and 32 KiB IOs, which explains why the light optimization did not show an improvement on gp3 volumes). This hints that the light optimization could be useful on io2 volumes, and this will be covered in a follow-up post.
$ prepare="--file-num=1 --file-total-size=100G"; \
  sysbench fileio $prepare prepare | grep "bytes written"
min/avg/max/mdev = 76.3 us / 306.4 us / 2.21 ms / 86.4 us
$ for s in 4 16 32; do
  fbs="--file-block-size=$(printf "%-5d" $(($s*1024)))"
  run="$prepare --file-test-mode=rndrd --file-extra-flags=direct --time=60"
  run2="$run --file-block-size=$(($s*1024)) --threads=16"
  sysbench fileio $run2 run | sed -ne "s|reads/s: *|$fbs: |p"
done
--file-block-size=4096 : 40165.74
--file-block-size=16384: 38219.43
--file-block-size=32768: 30034.69
On this io2 volume, the above result of 40k iops is somewhat strange. With an IO latency of 0.3 ms and 16 threads, we would expect a throughput of 53k iops (and even increasing the latency to 0.35 ms, we would still expect 45k iops). There is something unexpected here: even with 32 threads, I was stuck at 40k iops. This explains why using more Duplicate Check threads did not show an improvement in startup time. I found out why, and this will be covered in a follow-up post.
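For reference, these expectations come straight from the formula at the beginning of this post (t threads at latency l sustain t/l iops):
$ awk 'BEGIN {printf "%d iops at 0.30 ms, %d iops at 0.35 ms\n", 16 / 0.00030, 16 / 0.00035}'
53333 iops at 0.30 ms, 45714 iops at 0.35 ms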
$ for t in 16 32 64; do
  run="$prepare --file-test-mode=rndrd --file-extra-flags=direct --time=60"
  run2="$run --threads=$t --file-block-size=4096"
  sysbench fileio $run2 run | sed -ne "s|reads/s: *|--threads=$t: |p"
done
--threads=16: 40176.96
--threads=32: 40017.03
--threads=64: 40017.29
Annex #2 : Tests with a Small VM and a gp3 Volume of 16k iops
The latency test below was run on a m6i.large instance with a 16k iops gp3 volume lean-scaled for MySQL (bandwidth of 256 MiB/s). The important arguments are in blue and the output of the commands is in orange for readability.
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 267.7 us / 518.1 us / 8.04 ms / 110.7 us
The script below is inspired by Bug #115988.
function status_string() { printf "%6s %15s %9s" $mv $str $1; }
function start() {
  sudo sync; sudo /usr/bin/bash -c "echo 3 > /proc/sys/vm/drop_caches"
  ( ./start > /dev/null & )
  while sleep 1; do test -e data/mysql_sandbox*.pid && break; done |
    pv -t -N "$(status_string start)"
}
function stop() {
  local pid=$(cat data/mysql_sandbox*.pid)
  ( ./stop > /dev/null & )
  while sleep 1; do ps -p $pid | grep -q $pid || break; done
}
function set_bin() { (
  # In a sub-shell to not have to undo cd.
  cd ~/opt/mysql/mysql_$mv/bin
  rm -f mysqld; ln -s mysqld_$1 mysqld;
) }
function stop_wipe_restore() {
  stop; rm -rf data; set_bin $1; str=$2
  pv -te ~/sandboxes/mysql_${mv}.data.${n}.tgz -N "$(status_string restore)" |
    tar -zx
}
# Example: run_test 1000000 9.1.0 "-1 0 2 4 8 16 32"
function run_test {
  n=$1; mv=$2; is="$3"
  set_bin org
  cd; dbdeployer deploy single mysql_$mv > /dev/null
  cd ~/sandboxes/msb_mysql_${mv//./_}
  stop_wipe_restore org default_org; start
  stop_wipe_restore explo default_explo; start
  echo "innodb_tablespace_startup_testing_light=ON" >> my.sandbox.cnf
  echo "innodb_tablespace_duplicate_check_threads=-1" >> my.sandbox.cnf
  for i in $is; do
    sed -ie "/innodb_tablespace_duplicate_check_threads/s/=.*/=$i/" my.sandbox.cnf
    stop_wipe_restore explo light_explo_$i; start
  done
  stop; cd; rm -rf ~/sandboxes/msb_mysql_${mv//./_}
  set_bin org
}
Below is the result of running the above script on a m6i.large instance with a 16k iops gp3 volume lean-scaled for MySQL (bandwidth of 256 MiB/s) : we see an improvement to 2:28 and 2:23 with 8 and 16 threads, from 3:33 with the default of 4 threads on this vm (arguments in blue and output in orange for readability).
$ run_test 1000000 9.1.0 "-1 4 8 16 32" 9.1.0 default_org restore: 0:08:55 9.1.0 default_org start: 0:04:07 9.1.0 default_explo restore: 0:08:57 9.1.0 default_explo start: 0:04:01 9.1.0 light_explo_-1 restore: 0:08:55 9.1.0 light_explo_-1 start: 0:03:33 9.1.0 light_explo_4 restore: 0:08:55 9.1.0 light_explo_4 start: 0:03:33 9.1.0 light_explo_8 restore: 0:08:55 9.1.0 light_explo_8 start: 0:02:28 9.1.0 light_explo_16 restore: 0:08:55 9.1.0 light_explo_16 start: 0:02:23 9.1.0 light_explo_32 restore: 0:08:53 9.1.0 light_explo_32 start: 0:02:23 # Start results (without the restore details). 9.1.0 default_org start: 0:04:07 9.1.0 default_explo start: 0:04:01 9.1.0 light_explo_-1 start: 0:03:33 9.1.0 light_explo_4 start: 0:03:33 9.1.0 light_explo_8 start: 0:02:28 9.1.0 light_explo_16 start: 0:02:23 9.1.0 light_explo_32 start: 0:02:23
Below is a simplified reproduction of the above with sysbench fileio : we clearly see better throughput with 16 threads than with 4 threads (arguments in blue and output in orange for readability).
$ prepare="--file-num=1 --file-total-size=100G"; \
  sysbench fileio $prepare prepare | grep "bytes written"
107374182400 bytes written in 398.55 seconds (256.93 MiB/sec).
$ for t in 4 8 16 32; do
  threads="--threads=$(printf "%-2d" $t)"
  run="$prepare --file-test-mode=rndrd --file-extra-flags=direct --time=60"
  run2="$run --threads=$t --file-block-size=4096"
  sysbench fileio $run2 run | sed -ne "s|reads/s: *|$threads: |p"
done
--threads=4 : 7698.38
--threads=8 : 15308.09
--threads=16: 16266.85
--threads=32: 16002.48
Annex #3 : Tests with a Big VM with Local SSDs
Reminder from annex #3 of the previous post : on a m6id.8xlarge instance, 32 and 64 threads are able to get more throughput for 4 KiB IOs than 16 threads, which hints that 32 and 64 Duplicate Check threads could allow a quicker startup than 16 threads. The important arguments are in blue, the output of the commands is in orange for readability, and a note is added after %%.
$ prepare="--file-num=1 --file-total-size=100G"; \
  sysbench fileio $prepare prepare | grep "bytes written"
107374182400 bytes written in 90.26 seconds (1134.50 MiB/sec).
$ for t in 16 32 64; do
  run="$prepare --file-test-mode=rndrd --file-extra-flags=direct --time=60"
  run2="$run --threads=$t --file-block-size=4096"
  sysbench fileio $run2 run | sed -ne "s|reads/s: *|--threads=$t: |p"
done
--threads=16: 162055.74 %% Not actually run, copied from other test for comparison.
--threads=32: 255258.95
--threads=64: 335556.25
Below is the result of running the scripts of annex #2 on a m6id.8xlarge instance (1900 GB local SSDs) : we get a very small improvement with 32 and 64 threads, but it is almost within the margin of error. To see an improvement beyond doubt, we might need more than 1 million tables (arguments in blue and output in orange for readability).
$ run_test 1000000 9.1.0 "-1 16 32 64 128" 9.1.0 default_org restore: 0:08:06 9.1.0 default_org start: 0:01:10 9.1.0 default_explo restore: 0:08:04 9.1.0 default_explo start: 0:01:04 9.1.0 light_explo_-1 restore: 0:08:07 9.1.0 light_explo_-1 start: 0:01:00 9.1.0 light_explo_16 restore: 0:08:03 9.1.0 light_explo_16 start: 0:00:58 9.1.0 light_explo_32 restore: 0:08:02 9.1.0 light_explo_32 start: 0:00:55 9.1.0 light_explo_64 restore: 0:08:02 9.1.0 light_explo_64 start: 0:00:56 9.1.0 light_explo_128 restore: 0:08:04 9.1.0 light_explo_128 start: 0:00:57 # Start results (without the restore details). 9.1.0 default_org start: 0:01:10 9.1.0 default_explo start: 0:01:04 9.1.0 light_explo_-1 start: 0:01:00 9.1.0 light_explo_16 start: 0:00:58 9.1.0 light_explo_32 start: 0:00:55 9.1.0 light_explo_64 start: 0:00:56 9.1.0 light_explo_128 start: 0:00:57
Below is the adjustment of the scripts of annex #2 for 10 million tables (skipping subsequent restores, which saves more than 9 hours over the whole test run) : we see a 20-second improvement from 16 to 32 threads (5:53 to 5:33; output in orange for readability). This is clearly not as big a relative improvement as in the other tests; a more complete analysis follows the results.
$ n=10000000; mv=9.1.0; is="-1 16 32 64 128"; {
set_bin org
cd; dbdeployer deploy single mysql_$mv > /dev/null
cd ~/sandboxes/msb_mysql_${mv//./_}
stop_wipe_restore org default_org; start
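# Redefine stop_wipe_restore to skip wiping and restoring the data directory,
# reusing the already restored data for all subsequent runs (the 9-hour saving).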
function stop_wipe_restore() { stop; set_bin $1; str=$2; }
stop_wipe_restore explo default_explo; start
echo "innodb_tablespace_startup_testing_light=ON" >> my.sandbox.cnf
echo "innodb_tablespace_duplicate_check_threads=-1" >> my.sandbox.cnf
for i in $is; do
sed -ie "/innodb_tablespace_duplicate_check_threads/s/=.*/=$i/" my.sandbox.cnf
stop_wipe_restore explo light_explo_$i; start
done
stop; set_bin org; }
9.1.0 default_org restore: 1:21:10
9.1.0 default_org start: 0:08:10
9.1.0 default_explo start: 0:06:55
9.1.0 light_explo_-1 start: 0:05:55
9.1.0 light_explo_16 start: 0:05:53
9.1.0 light_explo_32 start: 0:05:33
9.1.0 light_explo_64 start: 0:05:30
9.1.0 light_explo_128 start: 0:05:40
When executing the above tests with INFO logging enabled, I am able to see that the Duplicate Check time actually improves from 1:07 with 16 threads to 0:45 and 0:40 with 32 and 64 threads. The reason these improvements are not very visible is that the startup time with 10M tables is dominated by DD Tablespace Files Reading, taking 2:03. The full logs and the summary are in my GitHub blog repo.
Annex #4 : More Latency Tests
Not especially relevant for this post, but for completeness, below are the results of latency tests on a m6i.2xlarge instance with different gp3 volumes. The important arguments are in blue and the output of the commands is in orange for readability. These confirm that 0.515 ms is a good average latency for gp3 volumes.
# Below with a default gp3 volume (3k iops & 125 MiB/s).
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 252.6 us / 523.0 us / 11.6 ms / 199.1 us

# Below with a 4x-scaled gp3 volume (12k iops & 500 MiB/s).
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 291.3 us / 510.9 us / 34.9 ms / 184.7 us

# Below with a 16k iops gp3 volume lean-scaled for MySQL (256 MiB/s).
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 165.8 us / 510.4 us / 42.9 ms / 317.7 us
Not especially relevant for this post, but for completeness, below are the results of latency tests on AWS instances with local SSDs. These confirm that 0.0286 ms is a good average latency for AWS local SSDs.
# Below for a m6id.large instance (118 GB local SSDs).
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 24.9 us / 28.6 us / 427.5 us / 1.82 us

# Below for a m6id.xlarge instance (237 GB local SSDs).
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 24.4 us / 28.6 us / 1.49 ms / 2.63 us

# Below for a m6id.2xlarge instance (474 GB local SSDs).
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 24.6 us / 28.7 us / 561.9 us / 2.03 us

# Below for a m6id.4xlarge instance (950 GB local SSDs).
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 24.6 us / 28.9 us / 499.4 us / 3.68 us

# Below for a m6id.8xlarge instance (1900 GB local SSDs).
$ sudo ioping -q -i 0 -w 60 $(df ~/sandboxes | sed -ne '2s/ .*//p') | grep "^min"
min/avg/max/mdev = 24.4 us / 28.4 us / 239.7 us / 1.53 us