In my previous post about InnoDB Stalls on Empty Free List, I used a test environment that might look a little artificial: a table larger than the InnoDB Buffer Pool but fitting in the Linux Page Cache. This configuration allows what MySQL thinks are IOs to be served very quickly because they are hits in the filesystem cache. In this post, I explain why this environment is not totally artificial and can be a good representation of another, more complex, environment. I also explain how an Empty Free List backoff stall can occur when no IO is done. You have probably already guessed what this post is covering from the title, so let's dive in.
InnoDB Table Compression
Note that in this post, I am not covering InnoDB Compression based on filesystem hole punching, also known as InnoDB Page Compression and sometimes called InnoDB Transparent Page Compression. I am covering InnoDB Compression using the Compressed Row Format, which requires adding a KEY_BLOCK_SIZE to the table definition.
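As a quick illustration, below is what this looks like in a table definition (a minimal sketch with a hypothetical table name; specifying a valid KEY_BLOCK_SIZE implies ROW_FORMAT=COMPRESSED).

-- Hypothetical table using InnoDB Table Compression
--   (16 KB in-memory pages stored as 4 KB compressed pages on disk).
CREATE TABLE t_compressed (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c CHAR(120) NOT NULL DEFAULT ''
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;

-- An existing table can also be compressed with an ALTER,
--   as done in Annex #2 for the sysbench table:
-- ALTER TABLE sbtest1 KEY_BLOCK_SIZE=4;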
Disk space is not super expensive these days, but it is still worth being lean in its usage to optimize costs (disk space, IOPS and disk bandwidth, backups, replicas, RAM, ...). One way to do that, once all other schema optimizations have been explored, is to use InnoDB Table Compression. This compression works at the page level (more about InnoDB Pages in my previous post), trying to make a 16-kilobyte uncompressed page fit in an 8, 4, 2 or 1 KB compressed page (in my previous post, when I mention that pages can be smaller than 16 KB, these smaller compressed pages are what I had in mind); a quick way to check whether pages actually fit is sketched after the list below. I have previously blogged about disk space savings with InnoDB Table Compression; you can check these two posts for more details:
- An Adventure in InnoDB Table Compression (for read-only tables)
- More Details about InnoDB Compression Levels (innodb_compression_level)
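As mentioned just above the list, whether 16 KB pages actually fit in the chosen KEY_BLOCK_SIZE can be checked with the compression statistics of the INNODB_CMP Information Schema table (a minimal sketch; compress_ops_ok much lower than compress_ops means many pages do not compress enough and have to be split, which wastes CPU):

-- Compression operations per compressed page size; compress_ops_ok close
--   to compress_ops means pages fit in the chosen KEY_BLOCK_SIZE.
SELECT page_size, compress_ops, compress_ops_ok, uncompress_ops
  FROM information_schema.INNODB_CMP;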
In my previous post, in the section about Free Pages, I mention that before being able to read or write data, the matching pages need to be in the InnoDB Buffer Pool. This is also true for data stored in a compressed table, but with more complexity: InnoDB first fetches the compressed page, and then decompresses that page, storing the result in another page. This means that when a cache miss happens on a compressed page, InnoDB reads the compressed page from disk (let's assume the compressed page is of size 4 KB), stores this page (of 4 KB) in the Buffer Pool, and then decompresses it into a 16 KB page, also stored in the Buffer Pool. At this point, and still using the example of 4 KB compressed pages, we have 4 + 16 = 20 KB in two pages in the Buffer Pool. These pages live their life in the LRU, and the decompressed page might be evicted to store other data while the compressed page is kept in the Buffer Pool for future usage. This means that with the same amount of RAM, we can store more compressed pages than uncompressed pages, so IOs can be traded for CPU, which is another advantage of InnoDB Table Compression (in addition to using less disk space).
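The presence of both page types in the Buffer Pool can be observed with the queries below (a minimal sketch, assuming the sbtest1 table used later in this post; note that querying INNODB_BUFFER_PAGE is expensive and should be avoided on busy production servers).

-- Memory used by compressed pages in the Buffer Pool, per compressed page size.
SELECT page_size, pages_used, pages_free
  FROM information_schema.INNODB_CMPMEM;

-- Buffer Pool pages of the table, grouped by compressed size
--   (pages of non-compressed tables have no compressed size).
SELECT compressed_size, COUNT(*) AS nb_pages
  FROM information_schema.INNODB_BUFFER_PAGE
  WHERE table_name LIKE '%sbtest1%'
  GROUP BY compressed_size;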
This advantage has a consequence though: guess what happens when a compressed page is decompressed? It needs a Free Page! And what happens if the Free List is empty and the backoff algorithm is used? There will be a backoff stall (micro-stall or macro-stall, impossible to tell). This is how a backoff stall can occur without needing an IO: when InnoDB Table Compression is used and compressed pages are decompressed. If the last sentences about Free Pages, the Free List and backoff stalls read like Latin to you, you can read more about them in my previous post, in the Free List section and the Empty Free List section.
A backoff stall can happen without needing an IO
when a compressed page already in the Buffer Pool
is decompressed and the Free List is empty
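To relate this to your own servers, the algorithm in use (a Percona Server variable) and the Free Page situation can be checked as below (a minimal sketch):

-- Percona Server variable selecting the Empty Free List algorithm (legacy or backoff).
SHOW GLOBAL VARIABLES LIKE 'innodb_empty_free_list_algorithm';

-- Current number of Free Pages, and number of times a query had to wait for a Free Page.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';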
Observing backoff Stalls in a Compressed Table
For observing backoff stalls in a compressed table, I am using a test configuration very similar to the one in my previous post (dbdeployer, Percona Server 8.0.29 (I get similar results with 5.7.39), and sysbench, running on a vm with 16 vCPUs and 64 GB of RAM). One difference this time is that I do not want the table to fit in the Linux Page Cache because I want realistic IO behavior, so I am using my Huge Pages trick to avoid this. I also use a larger table to make sure no significant portion of it is in the Page Cache (it is very hard to avoid having 500 MB to 1 GB of data in the Page Cache), especially since compression reduces the size of the table (the 2.5 GB table would be 600 MB compressed, so I use a 20 GB table that compresses to 4.7 GB). The script to create the uncompressed table is in Annex #1, and the compression and the Huge Pages trick are in Annex #2.
Out of curiosity, and opening a parenthesis, I ran sysbench single-threaded for 60 seconds before and after compressing, to compare the performance of non-compressed and compressed tables (these were run with a 1 GB Buffer Pool and with the table fitting in the Page Cache, so before applying the Huge Pages trick). I get 33,056 qps for non-compressed and 10,528 qps for compressed, which is about 3 times slower. In this test configuration, decompression really shows its overhead. Closing the parenthesis.
In Annex #3, I repeat the tests done in the previous post for showing the backoff stalls. Before doing the tests, I extended the size of the Buffer Pool to 8 GB to allow fully fitting the 4.7 GB table "in RAM", and I ran a SELECT to try to load the table in the Buffer Pool. However, and as seen later in the post, this does not fully lead to the expected result of having the table "fully in RAM".
Very similar stalls are happening, with the difference that most (but not all) of these stalls are logged in the slow log as not needing an IO. Below is an example of a slow log entry for a stall not needing an IO.
# Time: YYYY-MM-DDThh:mm:ss.uuuuuuZ
# User@Host: msandbox[msandbox] @ localhost [] Id: 27
# Schema: sbtest Last_errno: 0 Killed: 0
# Query_time: 0.763248 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 1 Rows_affected: 0 Bytes_sent: 197
# InnoDB_trx_id: 0
# InnoDB_IO_r_ops: 0 InnoDB_IO_r_bytes: 0 InnoDB_IO_r_wait: 0.000000
# InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000
# InnoDB_pages_distinct: 6
SET timestamp=1670168419;
SELECT c FROM sbtest1 WHERE id=1379872;
However, some slow queries need an IO, which surprised me because the 4.7 GB table should fully fit in the Buffer Pool. Below is an example of a slow log entry for a stall needing an IO. These do not happen very often; you might have to run the tests many times to get one.
# Time: YYYY-MM-DDThh:mm:ss.uuuuuuZ
# User@Host: msandbox[msandbox] @ localhost [] Id: 27
# Schema: sbtest Last_errno: 0 Killed: 0
# Query_time: 0.199434 Lock_time: 0.000001 Rows_sent: 1 Rows_examined: 1 Rows_affected: 0 Bytes_sent: 197
# InnoDB_trx_id: 0
# InnoDB_IO_r_ops: 1 InnoDB_IO_r_bytes: 4096 InnoDB_IO_r_wait: 0.000566
# InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000
# InnoDB_pages_distinct: 5
SET timestamp=1670174908;
SELECT c FROM sbtest1 WHERE id=145080;
When I saw that for the first time, I was puzzled and wanted to know more. So in addition to counting slow queries while running sysbench, I started looking at IOs and decompressions with the global status variable Innodb_buffer_pool_reads and with uncompress_ops from the Information Schema table INNODB_CMP. This confirmed what I saw in the slow logs: even if the compressed table should fit in the Buffer Pool, InnoDB is doing IOs to serve queries. I understood why, and I will give more details in a future post (every time I start looking closer at something, I find more things to look at: MySQL is full of surprises).
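For reference, below is a sketch of these checks (the status and Information Schema queries are the same as those wrapped in functions in Annex #3; the awk one-liner classifying slow log entries is a hypothetical helper assuming the slow log file of the sandbox and log_slow_verbosity=innodb).

# Classify slow log entries by whether they needed an IO.
awk '$2=="InnoDB_IO_r_ops:"{print ($3 == 0 ? "no IO" : "IO")}' data/slow.log | sort | uniq -c

# IOs missing the Buffer Pool, and page decompression operations.
./use -Ne "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'"
./use -Ne "SELECT SUM(uncompress_ops) FROM information_schema.INNODB_CMP"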
Opening another parenthesis: in Annex #3, I show results for Percona Server 8.0.29, but I also ran the tests with 5.7.39. The 5.7 tests also show stalls, but more interestingly, they show 36,161 and 11,143 qps for the non-compressed and compressed tests (reminder: it was 33,056 and 10,528 for 8.0.29). This is 8.4% and 5.5% less throughput in 8.0. This is another example of performance regressions in newer MySQL versions (link #1, link #2, and link #3), which is concerning. I have to confess I did not do a full analysis of where these regressions come from, maybe it is a default change (like the character set), but as these regressions are also observed and reported by others, I guess they are real. Closing this other parenthesis.
From the point of view of Free Page Consumption,
a non-compressed table larger than the Buffer Pool (BP)
and fitting in the Page Cache is similar to
a compressed table smaller than the BP
whose decompressed pages would not fit in the BP
So this is it: reproducing Empty Free List stalls in an environment where a non-compressed table is larger than the Buffer Pool but fits in the Page Cache is similar to reproducing these stalls with a compressed table that is smaller than the Buffer Pool but whose decompressed pages do not fit in the Buffer Pool.
This is all I have for today. The TL;DR is the following:
- Even without doing an IO, a SELECT query can need Free Pages because of decompressions (a SELECT can also need Free Pages because of Page-Splits when merging index updates from the Change Buffer, but this is another story),
- When using the backoff Empty Free List Algorithm, these Free Page needs might cause stalls,
- InnoDB Table Compression can significantly impact the throughput of MySQL (3 times fewer qps for the micro-benchmark of this post),
- There are significant throughput regressions from 5.7 to 8.0 (8.4% and 5.5% lower throughput for the micro-benchmarks of this post),
- InnoDB Table Compression generates unexpected IOs (this will be the subject of another post),
- and my previous environment with a table larger than the Buffer Pool but fitting in the Page Cache was a way to simulate InnoDB Table Compression.
Annex #1: dbdeployer Sandbox for Free Page Consumption with Compression
# Create a sandbox for our tests.
# Because the used sysbench binaries do not work with caching_sha2_password,
#   use mysql_native_password.
dbdc="-c skip-log-bin"
dbdc="$dbdc -c slow_query_log=ON -c log_slow_verbosity=innodb"
dbdc="$dbdc -c slow_query_log_file=slow.log -c long_query_time=0.05"
dbdc="$dbdc -c innodb_adaptive_hash_index=OFF"
dbdc="$dbdc -c innodb_buffer_pool_instances=1"
dbdc="$dbdc -c innodb_buffer_pool_size=$((1024*1024*1024))"
dbdc="$dbdc -c default-authentication-plugin=mysql_native_password"
dbdeployer deploy single ps_8.0.29 $dbdc

# Create a schema for sysbench and fill it.
# The table size is computed for having a 16 GB table with 16 rows per 16KB page.
# (sba: sysbench arguments)
./use <<< "CREATE DATABASE sbtest"
nb_rows="$((16*1024*1024*1024 / (16*1024) * 16))"
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"
time sysbench $sba prepare --create_secondary=off
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Creating table 'sbtest1'...
Inserting 16777216 records into 'sbtest1'

real    2m37.847s
user    0m18.220s
sys     0m0.971s

# Make the table bigger by padding rows with data.
# This is done in such a way as each InnoDB Page contains 16 rows
#   (the table was created with two CHAR of size 120 and 60).
time {
  echo "ALTER TABLE sbtest1 ADD COLUMN c0 CHAR(60) DEFAULT ''"
  seq -f " ADD COLUMN c%.0f CHAR(240) DEFAULT ''" 1 3
} | paste -s -d "," | ./use sbtest

real    10m17.757s
user    0m0.009s
sys     0m0.000s

# Normally, because of instant ADD COLUMN, 8.0 needs running an identity
#   ALTER for inflating the table. But the ALTER above took longer than
#   expected, which looks like a bug, so I opened PS-6857. For PS 8.0.29,
#   below is not absolutely needed.
time ./use sbtest <<< "ALTER TABLE sbtest1 ENGINE=InnoDB"

real    10m22.727s
user    0m0.002s
sys     0m0.005s

# Make sure all pages are flushed.
time ./use sbtest <<< "FLUSH TABLE sbtest1 FOR EXPORT"

real    0m0.010s
user    0m0.007s
sys     0m0.000s

# Look at the size of the table.
du -h data/sbtest/sbtest1.ibd
20G     data/sbtest/sbtest1.ibd
Annex #2: Compressing the Table and Realistic IOs (not All Hit in the Page Cache)
# Common stuff.
nb_rows="$((16*1024*1024*1024 / (16*1024) * 16))"
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"

# Out of curiosity, sysbench before compression (for comparing performance).
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
sysbench $sba --threads=1 --time=60 run | grep -e queries: -e avg: -e max
    queries:                             1983454 (33056.57 per sec.)
         avg:                                    0.03
         max:                                    0.51

# Compress the table.
time ./use sbtest <<< "ALTER TABLE sbtest1 KEY_BLOCK_SIZE=4"

real    9m44.327s
user    0m0.002s
sys     0m0.005s

# Make sure all pages are flushed.
time ./use sbtest <<< "FLUSH TABLE sbtest1 FOR EXPORT"

real    0m0.010s
user    0m0.002s
sys     0m0.005s

# Look at the size of the table.
du -h data/sbtest/sbtest1.ibd
4.7G    data/sbtest/sbtest1.ibd

# Out of curiosity, sysbench after compression (for comparing performance).
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
sysbench $sba --threads=1 --time=60 run | grep -e queries: -e avg: -e max
    queries:                             631707 (10528.13 per sec.)
         avg:                                    0.09
         max:                                    1.94

# Allocate Huge Pages to prevent the table from being in the Linux Page Cache.
# Depending on the amount of RAM, the numbers might have to be adjusted.
# To learn more about this trick, read this blog post:
#   https://jfg-mysql.blogspot.com/2021/11/trick-simulate-linux-server-with-less-ram.html
grep -e MemTotal -e SwapTotal -e MemFree -e "^Cached" /proc/meminfo
MemTotal:       64819156 kB
MemFree:        54962156 kB
Cached:          6115828 kB
SwapTotal:             0 kB
sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
grep -e MemTotal -e SwapTotal -e MemFree -e "^Cached" /proc/meminfo
MemTotal:       64819156 kB
MemFree:        61419540 kB
Cached:           362476 kB
SwapTotal:             0 kB
sudo bash -c "echo $((512*50+256)) > /proc/sys/vm/nr_hugepages"
grep -e MemTotal -e SwapTotal -e MemFree -e "^Cached" /proc/meminfo
MemTotal:       64819156 kB
MemFree:         8857076 kB
Cached:            52460 kB
SwapTotal:             0 kB
Annex #3: Reproducing backoff Stalls with a Compressed Table
# Common stuff.
nb_rows="$((16*1024*1024*1024 / (16*1024) * 16))"
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"
sq_query="SHOW GLOBAL STATUS LIKE 'Slow_queries'"
function init_sq_count() { local a; read a sq1 <<< $(./use -Ne "$sq_query"); }
function show_sq_count() {
  local a; read a sq2 <<< $(./use -Ne "$sq_query")
  echo "Slow_queries: $(($sq2 - $sq1))"; }
ibpr_query="SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'"
function init_ibpr_count() { local a; read a ibpr1 <<< $(./use -Ne "$ibpr_query"); }
function show_ibpr_count() {
  local a; read a ibpr2 <<< $(./use -Ne "$ibpr_query")
  echo "Innodb_buffer_pool_reads: $(($ibpr2 - $ibpr1))"; }
icuo_query="SELECT SUM(uncompress_ops) FROM information_schema.INNODB_CMP"
function init_icuo_count() { icuo1=$(./use -Ne "$icuo_query"); }
function show_icuo_count() {
  icuo2=$(./use -Ne "$icuo_query")
  echo "Innodb_cmp_uncompress_ops: $(($icuo2 - $icuo1))"; }
function init_counts() { init_sq_count; init_ibpr_count; init_icuo_count; }
function show_counts() { show_sq_count; show_ibpr_count; show_icuo_count; }
threads=4; rate=$(($threads * 100)); limit=$((16 * 1024 * 6/5))

# Grow the buffer pool to fully fit the compressed table "in RAM", and try loading
#   the table in the Buffer Pool (does not fully work but out of scope of this post).
time ./use sbtest <<< "
  SET GLOBAL innodb_buffer_pool_size = $((8*1024*1024*1024));
  SELECT COUNT(*) FROM sbtest1" > /dev/null

real    0m54.341s
user    0m0.009s
sys     0m0.002s

# Sysbench with backoff, filtered output and counts.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
init_counts; \
  sysbench $sba --threads=1 --time=2 run | grep -e queries: -e avg: -e max; \
  show_counts
    queries:                             10329  (5159.46 per sec.)
         avg:                                    0.19
         max:                                  763.27
Slow_queries: 2
Innodb_buffer_pool_reads: 203
Innodb_cmp_uncompress_ops: 11944

# Sysbench with legacy, filtered output and counts.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
init_counts; \
  sysbench $sba --threads=1 --time=2 run | grep -e queries: -e avg: -e max; \
  show_counts
    queries:                             23688  (11833.20 per sec.)
         avg:                                    0.08
         max:                                    1.19
Slow_queries: 0
Innodb_buffer_pool_reads: 593
Innodb_cmp_uncompress_ops: 20504

# Repeating backoff with sleep 0, 5 and 60.
# The hairy awk script outputs queries, max, slow query, IO and uncompress counts.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
for sl in 0 5 60; do
  echo; echo "Sleep 300 and then loops with sleep $sl"; sleep 300
  for i in {0..4}; do
    sleep $sl; init_counts
    sysbench $sba --threads=1 --time=2 run | grep -e queries: -e max
    show_counts
  done | awk '{t[++i]=$2} i==5{i=0; printf "%6d %7.2f %2d %6d %6d\n", t[1], t[2], t[3], t[4], t[5]}'
done | tail -n +2

Sleep 300 and then loops with sleep 0
  5482  971.51  2    158   4710
 23457    1.33  0    688  20173
 23389    1.20  0    721  19984
 23928    1.16  0    714  20420
 24067    1.07  0    654  20469

Sleep 300 and then loops with sleep 5
 12205  682.73  2    303  10482
 23225    9.25  0    634  19933
 23018   17.54  0    597  19744
 23966   27.56  0    557  20516
 23045   27.54  0    556  19670

Sleep 300 and then loops with sleep 60
  7369  851.67  2    201   6337
 19854  245.90  2    645  17137
 18377  255.99  1    759  15876
 19750  245.89  1    777  16833
 19494  245.90  1    805  16797

# Sysbench with backoff and 4 threads doing 100 qps each.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
init_counts; \
  sysbench $sba --threads=$threads --time=2 --rate=$rate run | grep -e queries: -e avg: -e max; \
  show_counts
    queries:                             824    (410.49 per sec.)
         avg:                                    0.12
         max:                                    0.65
Slow_queries: 0
Innodb_buffer_pool_reads: 27
Innodb_cmp_uncompress_ops: 707

# Similar to above, emptying the free list while sysbench is running,
#   with filtered output and counts.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
( id=$(($RANDOM * ($nb_rows - $limit) / (32*1024)))
  ./use sbtest -e "
    SET long_query_time = 9; DO sleep(1);
    SELECT id FROM sbtest1 WHERE id > $id LIMIT $limit, 1" > /dev/null &
  init_counts
  sysbench $sba --threads=$threads --time=2 --rate=$rate run | grep -e queries: -e avg: -e max
  show_counts; )
    queries:                             801    (399.18 per sec.)
         avg:                                   89.06
         max:                                  585.05
Slow_queries: 4
Innodb_buffer_pool_reads: 78
Innodb_cmp_uncompress_ops: 1651

# Similar to above, for 10 minutes, emptying the free list every 2 minutes,
#   and showing all slow query times.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
n=5; l=120; time=$(($n * $l)); \
( tail -f -n 0 data/slow.log |
    awk '$2=="Time:"{t=substr($3,12)} $2=="Query_time:"{print t ": " $3}' &
  trap "kill $!" EXIT
  for i in $(seq 1 $n); do
    sleep $(($l / 2))
    id=$(($RANDOM * ($nb_rows - $limit) / (32*1024)))
    ./use sbtest -e "
      SET long_query_time = 9;
      SELECT id FROM sbtest1 WHERE id > $id LIMIT $limit, 1" > /dev/null
    sleep $(($l / 2))
  done &
  init_counts
  sysbench $sba --threads=$threads --time=$time --rate=$rate run | grep -e queries:
  show_counts; )
16:27:24.733619Z: 0.199349
16:27:24.733626Z: 0.199042
16:27:24.733630Z: 0.203437
16:27:24.733624Z: 0.208039
16:29:25.134491Z: 0.168449
16:29:25.134492Z: 0.168464
16:29:25.137346Z: 0.168357
16:29:25.139026Z: 0.188526
16:31:25.537864Z: 0.255990
16:31:25.542367Z: 0.266015
16:31:25.542700Z: 0.265974
16:31:25.543174Z: 0.276117
16:35:25.856932Z: 0.057693
16:35:25.858968Z: 0.057736
    queries:                             239343 (398.90 per sec.)
Slow_queries: 14
Innodb_buffer_pool_reads: 8147
Innodb_cmp_uncompress_ops: 209818