Tuesday, November 26, 2024

The Light MySQL Startup Optimization on EBS Volumes

In the last weeks / months, I have been working on understanding / improving MySQL startup with many tables.  I already wrote four posts on the subject; they are listed below.  In this post, I use the system analysis of the previous post to revisit the light optimization on EBS volumes.  With this analysis, I am able to determine why the previous tests did not show improvements, and to provide an example of a faster startup.

This post is written in the context of my work on MySQL at Aiven : check our Developer Center, blog or careers page to know more about us.

The four posts I already wrote on the subject of long MySQL startup with many tables are :

  1. Faster MySQL Startup with Many Tables;
  2. Impressed by Disabling InnoDB Redo Logging for Creating Tables as Fast as Possible;
  3. Long and Silent / Stressful MySQL Startup with Many Tables;
  4. Understanding InnoDB Tablespace Duplicate Check.

In the first post above, I present the fadvise and light optimizations (reminder : they consist of reducing IO size from 32 KiB to 16 and 4 KiB respectively).  In that post, I mentioned that the tests with EBS volumes did not lead to significant improvements.  At that time, I had not done the system analysis presented in the fourth post above.  I only guessed that 32 and 4 KiB IOs have the same "cost" on these volumes.  I later confirmed this hypothesis; the results of the sysbench fileio tests are in annex #1.  But I was missing an important parameter : EBS volume bandwidth.

My previous tests were done with default gp3 volumes (3k iops and 125 MiB/s) and 4x-scaled gp3 volumes (12k iops and 500 MiB/s).  The reason I did not see improvements, in addition to 32 and 4 KiB IOs having the same cost, is that in both cases, volume bandwidth was over-provisioned.

In a normal MySQL workload, IOs are of size 16 KiB (this can be changed with the innodb_page_size parameter / global variable).  With the default page size, a 12k iops volume only needs a bandwidth of about 188 MiB/s (12k iops x 16 KiB, see the quick calculation below), not 500 MiB/s.  When scaling in a lean way, we are able to see an improvement (the details are in annex #2) : on an m6i.2xlarge and a gp3 volume of 16k iops and 256 MiB/s, it takes 3:19 (3 minutes and 19 seconds) to start MySQL without any optimization, and 2:18 with the fadvise or the light optimization.
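
For reference, here is the back-of-the-envelope calculation behind that number (plain shell arithmetic, not part of any test script) : 12k iops of 16 KiB each amount to 192,000 KiB/s, which is a little under 188 MiB/s.

$ echo "12000 * 16" | bc                  # 12k iops x 16 KiB pages, in KiB/s
192000
$ echo "scale=1; 12000 * 16 / 1024" | bc  # same, converted to MiB/s
187.5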

Contrary to what happens on AWS Local SSDs, on EBS volumes the light optimization does not improve on the fadvise optimization.  This is expected because reducing the IO size from 16 to 4 KiB does not allow higher IO throughput on this specific type of disk.  And these results are confirmed by sysbench fileio tests; the results are in annex #3.

I am not specifically suggesting you should run MySQL on a lean-scaled EBS volume (quite the contrary, see the next paragraph); I just want to complement my previous analysis : the light optimization can be useful on EBS volumes.  And there is even another possible optimization on EBS volumes; it will be the subject of a follow-up post.

Back to right / lean scaling a gp3 volume for MySQL, considering that...

  • a 1 TiB volume with 3k iops and 125 MiB/s costs 81.92$ / month,
  • scaling it to 16k iops costs 80$ / month,
  • scaling it to 256 MiB/s (from 125) costs 5.24$ / month,
  • scaling it to 512 MiB/s (from 125) costs 15.48$ / month,
  • scaling it to 1000 MiB/s (from 125) costs 35$ / month...

...spending an extra 10.24$ (15.48$ - 5.24$), or even 29.76$ (35$ - 5.24$), on top of a storage cost of 167.16$ (81.92$ + 80$ + 5.24$) for having extra bandwidth might not be such a big additional cost.  Obviously, this might add up at scale, but there are advantages to having extra bandwidth that offset the additional costs, including faster backups and restores.  In the end, the way an infra is scaled depends on many things, which are beyond the scope of this post.  Restating my goal here : making sure something important is not forgotten, and showing that in some situations, the light optimization benefits EBS volumes.
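
For completeness, the arithmetic above can be reproduced with bc (the inputs are the monthly prices listed in this post, not values taken from AWS price lists) :

$ echo "81.92 + 80 + 5.24" | bc        # storage + 16k iops + 256 MiB/s
167.16
$ echo "15.48 - 5.24" | bc             # extra for 512 MiB/s instead of 256 MiB/s
10.24
$ echo "35 - 5.24" | bc                # extra for 1000 MiB/s instead of 256 MiB/s
29.76
$ echo "scale=3; 10.24 / 167.16" | bc  # 512 MiB/s extra as a fraction of the total
.061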

Back to Local SSDs

Equipped with a better understanding of IO throughput and bandwidth, we can revisit the light optimization on Local SSDs.  The sysbench fileio tests from annex #4 show that AWS Local SSDs behave differently from EBS volumes : when increasing the IO size, the throughput is reduced proportionally.  And this is why the fadvise and light optimizations showed such great improvements on an m6id.xlarge AWS instance, because...

when shrinking IO size on AWS local SSDs,
throughput increases proportionally
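
To make this concrete, below is the m6id.large data of annex #4 converted back to bandwidth (reads/s x IO size; this is my own arithmetic, not sysbench output) : all three IO sizes read from the SSD at a broadly similar bandwidth, which is why reads/s grows almost proportionally when the IO size shrinks.

$ for line in "4096 34177.22" "16384 9757.92" "32768 4793.425"; do
    set -- $line
    echo "$1 B IOs: $(echo "$2 * $1 / 1024 / 1024" | bc) MiB/s"
  done
4096 B IOs: 133 MiB/s
16384 B IOs: 152 MiB/s
32768 B IOs: 149 MiB/s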

This is true up to and including m6id.2xlarge instances.  From the m6id.4xlarge instance, 16 threads are not able to achieve the full throughput of the SSDs; 32 or 64 threads are needed.  Why am I focusing on 16 threads ?  Because, and this is explained in the previous post, the number of threads used for Duplicate Check is capped at 16.  And this is a perfect transition for the follow-up post, where I will cover Duplicate Check Threads.


Annex #1 : IO Size and EBS IO Throughput

Below is run on an m6i.2xlarge instance with a default gp3 volume (3k iops and 125 MiB/s) : we see that increasing the IO size from 4 KiB to 16 and 32 KiB keeps the IO / read throughput almost constant.  The important arguments are in blue and the output of the commands is in orange for readability.

$ prepare="--file-num=1 --file-total-size=100G"; \
  sysbench fileio $prepare prepare | grep "bytes written"
107374182400 bytes written in 818.67 seconds (125.08 MiB/sec).

$ run="$prepare --file-test-mode=rndrd --file-extra-flags=direct --time=60 --threads=16"; \
  for s in 4 16 32; do
    fbs="--file-block-size=$(printf "%-5d" $(($s*1024)))"
    sysbench fileio $run $fbs run | sed -ne "s|reads/s: *|$fbs: |p"
  done
    --file-block-size=4096 : 3048.28
    --file-block-size=16384: 2975.80
    --file-block-size=32768: 2973.12

Below is the result of the same commands as above for an m6i.2xlarge instance with a 4x-scaled gp3 volume (12k iops and 500 MiB/s) : we see the same result as above (increasing the IO size keeps the IO throughput almost constant).

107374182400 bytes written in 204.53 seconds (500.67 MiB/sec).

    --file-block-size=4096 : 12200.13
    --file-block-size=16384: 12001.58
    --file-block-size=32768: 12001.64

Annex #2 : The Light Optimization on a MySQL Sized EBS Volume

I reused the script of Bug #115988 on an m6i.2xlarge instance with a gp3 volume scaled at 16k iops and 256 MiB/s (lean-scaling bandwidth for MySQL with an InnoDB page size of 16 KiB) : we see an improvement of the light optimization over the default (2:18 vs. 3:19) and no improvement from fadvise to light (as explained in the post and shown in annex #3, this is because reducing the IO size from 16 to 4 KiB does not allow higher IO throughput on this specific type of disk).

# Full results.
 9.1.0   default_org   restore: 0:08:14
 9.1.0   default_org     start: 0:03:23
 9.1.0   default_org diskstats: 65695556 sector read.
 9.1.0 default_explo   restore: 0:08:12
 9.1.0 default_explo     start: 0:03:19
 9.1.0 default_explo diskstats: 65686316 sector read.
 9.1.0       fadvise   restore: 0:08:12
 9.1.0       fadvise     start: 0:02:18
 9.1.0       fadvise diskstats: 33666060 sector read.
 9.1.0         light   restore: 0:08:11
 9.1.0         light     start: 0:02:18
 9.1.0         light diskstats: 9668564 sector read.

# Diskstats results (without the restore and start details).
 9.1.0   default_org diskstats: 65695556 sector read.
 9.1.0 default_explo diskstats: 65686316 sector read.
 9.1.0       fadvise diskstats: 33666060 sector read.
 9.1.0         light diskstats: 9668564 sector read.

# Start results (without the restore and diskstats details).
 9.1.0   default_org     start: 0:03:23
 9.1.0 default_explo     start: 0:03:19
 9.1.0       fadvise     start: 0:02:18
 9.1.0         light     start: 0:02:18
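
As a side note on the diskstats figures above (sectors are 512 bytes; this conversion is my own arithmetic, not part of the script of Bug #115988) : they translate to roughly 31, 16 and 4.6 GiB read during startup, whose ratio roughly matches the 32, 16 and 4 KiB IO sizes of the default, fadvise and light variants.

$ for s in 65695556 33666060 9668564; do
    echo "$s sectors x 512 B = $(echo "scale=1; $s * 512 / 1024^3" | bc) GiB"
  done
65695556 sectors x 512 B = 31.3 GiB
33666060 sectors x 512 B = 16.0 GiB
9668564 sectors x 512 B = 4.6 GiB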

Annex #3 : IO Size and EBS IO Throughput Capped

Below is run on an m6i.2xlarge instance with a 16k iops gp3 volume lean-scaled for MySQL (256 MiB/s) : we see the same IO throughput for 4 and 16 KiB IOs (saturating iops) and a lower throughput with 32 KiB IOs (saturating bandwidth).  The important arguments are in blue and the output of the commands is in orange for readability.

$ prepare="--file-num=1 --file-total-size=100G"; \
  sysbench fileio $prepare prepare | grep "bytes written"
107374182400 bytes written in 399.10 seconds (256.58 MiB/sec).

$ run="$prepare --file-test-mode=rndrd --file-extra-flags=direct --time=60 --threads=16"; \
  for s in 4 16 32; do
    fbs="--file-block-size=$(printf "%-5d" $(($s*1024)))"
    sysbench fileio $run $fbs run | sed -ne "s|reads/s: *|$fbs: |p"
  done
    --file-block-size=4096 : 16266.95
    --file-block-size=16384: 16002.32
    --file-block-size=32768: 8342.63
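
As a sanity check on the 32 KiB result above (my own arithmetic, not sysbench output) : a 256 MiB/s bandwidth cap allows at most about 8k IOs of 32 KiB per second, which is close to the 8342 reads/s measured (the prepare step shows the volume actually delivering a little more than 256 MiB/s).

$ echo "256 * 1024 / 32" | bc  # 256 MiB/s divided by 32 KiB IOs = max iops at that IO size
8192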

Annex #4 : IO Size and AWS Local SSDs IO Throughput

Below is run on an m6id.large instance (118 GB local SSDs) : we see that increasing the IO size reduces the throughput almost proportionally.  The important arguments are in blue and the output of the commands is in orange for readability.

$ prepare="--file-num=1 --file-total-size=100G"; \
  sysbench fileio $prepare prepare | grep "bytes written"
107374182400 bytes written in 1426.65 seconds (71.78 MiB/sec).

$ run="$prepare --file-test-mode=rndrd --file-extra-flags=direct --time=60 --threads=16"; \
  for s in 4 16 32; do
    fbs="--file-block-size=$(printf "%-5d" $(($s*1024)))"
    sysbench fileio $run $fbs run | sed -ne "s|reads/s: *|$fbs: |p"
  done
    --file-block-size=4096 : 34177.22
    --file-block-size=16384: 9757.92
    --file-block-size=32768: 4793.425

Below is the result of the same commands as above on an m6id.xlarge instance (237 GB local SSDs) : we see the same result as above (increasing the IO size reduces the throughput almost proportionally).

107374182400 bytes written in 712.26 seconds (143.77 MiB/sec).

    --file-block-size=4096 : 68236.07
    --file-block-size=16384: 19553.94
    --file-block-size=32768: 9606.49

(Please bear with me : even though m6id.2xlarge has the same result as m6id.xlarge and m6id.large, something interesting is coming for m6id.4xlarge and m6id.8xlarge.)

Below is the result of the same commands as above on an m6id.2xlarge instance (474 GB local SSDs) : same result as above.

107374182400 bytes written in 356.11 seconds (287.55 MiB/sec).

    --file-block-size=4096 : 136652.40
    --file-block-size=16384: 39091.36
    --file-block-size=32768: 19212.81

Below is the result of the same commands as above on an m6id.4xlarge instance (950 GB local SSDs) : we have an anomaly here; the details are below.

107374182400 bytes written in 178.02 seconds (575.21 MiB/sec).

    --file-block-size=4096 : 158538.84
    --file-block-size=16384: 78323.53
    --file-block-size=32768: 38424.97

In the above, we do not have the same ratio between 4 and 16 KiB IOs as in the previous tests.  My hypothesis was that 16 threads do not saturate iops on this instance, and this is confirmed below using 32 and 64 threads.

$ run2="$prepare --file-test-mode=rndrd --file-extra-flags=direct --time=60 --file-block-size=4096"; \
  for t in 32 64; do
    sysbench fileio $run2 --threads=$t run | sed -ne "s|reads/s: *|--threads=$t: |p"
  done
    --threads=32: 249438.90
    --threads=64: 273438.75

Below is the result of the same commands as above on an m6id.8xlarge instance (1900 GB local SSDs) : we have the same anomaly as above, and also improved throughput with more threads.

107374182400 bytes written in 90.26 seconds (1134.50 MiB/sec).

    --file-block-size=4096 : 162055.74
    --file-block-size=16384: 122257.32
    --file-block-size=32768: 79475.42

    --threads=32: 255258.95
    --threads=64: 335556.25
