Monday, December 12, 2022

Free Page Consumption by InnoDB Table Compression (Percona backoff Algo. part #2)

In my previous post about InnoDB Stalls on Empty Free List, I used a test environment that might look a little artificial: a table larger than the InnoDB Buffer Pool but fitting in the Linux Page Cache.  This configuration serves what MySQL thinks are IOs very quickly because they are hits in the filesystem cache.  In this post, I explain why this environment is not totally artificial and can be a good representation of another, more complex environment.  I also explain how an Empty Free List backoff stall can occur when no IO is done.  You have probably already guessed from the title what this post is covering, so let's dive in.

InnoDB Table Compression

Note that in this post, I am not covering InnoDB Compression based on filesystem hole punching, also known as InnoDB Page Compression and sometimes called InnoDB Transparent Page Compression.  I am covering the InnoDB Compression using the Compressed Row Format, which requires adding a KEY_BLOCK_SIZE to the table definition.
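To make this concrete, below is a minimal sketch of what enabling this compression looks like (the table and column names are only for illustration; this needs innodb_file_per_table = ON, which is the default in 5.7 and 8.0).

-- Create a compressed table: specifying KEY_BLOCK_SIZE implies
--   ROW_FORMAT=COMPRESSED.
CREATE TABLE t (
  id INT NOT NULL PRIMARY KEY,
  c  VARCHAR(255)
) ENGINE=InnoDB KEY_BLOCK_SIZE=4;

-- Or compress an existing table (this fully rebuilds the table).
ALTER TABLE t KEY_BLOCK_SIZE=4;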

Disk space is not super expensive these days, but it is still worthwhile to be lean in its usage to optimize costs (disk space, IOPS and disk bandwidth, backups, replicas, RAM, ...).  One way to do that, once all other schema optimizations have been explored, is to use InnoDB Table Compression.  This compression works at the page level (more about InnoDB Pages in my previous post), trying to make a 16-kilobyte uncompressed page fit in an 8, 4, 2 or 1 KB compressed page (in my previous post, when I mentioned that pages can be smaller than 16 KB, these smaller compressed pages were what I had in mind).  I have previously blogged about disk space savings with InnoDB Table Compression; you can check these two earlier posts for more details.
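As a side note, whether pages actually fit in the chosen compressed size can be checked in the INNODB_CMP table of the Information Schema.  Below is a sketch; the counters are global per page size, not per table, and a compress_ops_ok close to compress_ops means the chosen KEY_BLOCK_SIZE is suitable.

-- Compression success rate per compressed page size (global, all tables).
SELECT page_size, compress_ops, compress_ops_ok,
       ROUND(100 * compress_ops_ok / NULLIF(compress_ops, 0), 1) AS ok_pct
  FROM information_schema.INNODB_CMP;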

In my previous post, in the section about Free Pages, I mention that before being able to read or write data, the matching pages need to be in the InnoDB Buffer Pool.  This is also true for data stored in a compressed table, but with more complexity: InnoDB first fetches the compressed page, and then decompresses it, storing the result in another page.  This means that when a cache miss happens on a compressed page, InnoDB reads the compressed page from disk (let's assume the compressed page is of size 4 KB), stores this 4 KB page in the Buffer Pool, and then decompresses it into a 16 KB page also stored in the Buffer Pool.  At this point, and still using the example of 4 KB compressed pages, we have 4 + 16 = 20 KB in two pages in the Buffer Pool.  These pages live their lives in the LRU, and the decompressed page might be evicted to store other data while the compressed page is kept in the Buffer Pool for future usage.  So with the same amount of RAM, we can store more compressed pages than uncompressed pages, which means IOs can be traded for CPU: another advantage of InnoDB Table Compression (in addition to using less disk space).
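These two copies of a page can be observed with the INNODB_BUFFER_PAGE table of the Information Schema.  Below is a sketch (beware: querying this table is expensive, so avoid it on a busy production server; also, I assume here that the decompressed 16 KB copies show up with an empty compressed size).

-- Count pages of the sbtest1 table in the Buffer Pool, split by
--   compressed size (the 16 KB decompressed copies have no compressed size).
SELECT compressed_size, COUNT(*) AS nb_pages
  FROM information_schema.INNODB_BUFFER_PAGE
  WHERE table_name LIKE '%sbtest1%'
  GROUP BY compressed_size;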

This advantage has a consequence though: guess what happens when a compressed page is decompressed?  It needs a Free Page!  And what happens if the Free List is empty and the backoff algorithm is used?  There will be a backoff stall (micro-stall or macro-stall, impossible to tell).  This is how a backoff stall can occur without needing an IO: when InnoDB Table Compression is used and compressed pages are decompressed.  If the last sentences about Free Pages, the Free List and backoff stalls read like Latin to you, you can read more about this in my previous post, in the Free List section and the Empty Free List section.

A backoff stall can happen without needing an IO
when a compressed page already in the Buffer Pool
is decompressed and the Free List is empty
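Before moving to the actual tests, a simple way to see this coming is to watch the number of Free Pages while a workload runs.  Below is a sketch assuming a dbdeployer sandbox like in the annexes (./use is the sandbox client wrapper).

# Check which Empty Free List algorithm is active (Percona Server only).
./use -Ne "SHOW GLOBAL VARIABLES LIKE 'innodb_empty_free_list_algorithm'"

# Watch the number of Free Pages every second: a value at or close to 0
#   while decompressions are happening means backoff stalls can occur.
while sleep 1; do
  ./use -Ne "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free'"
done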

Observing backoff Stalls in a Compressed Table

For observing backoff stalls in a compressed table, I am using a test configuration very similar to the one in my previous post (dbdeployer, Percona Server 8.0.29 (I get similar results with 5.7.39), and sysbench, running on a VM with 16 vCPUs and 64 GB of RAM).  One difference this time is that I do not want the table to fit in the Linux Page Cache because I want realistic IO behavior, so I am using my Huge Page trick to avoid this.  I also use a larger table to make sure no significant portion of it is in the Page Cache (it is very hard to prevent 500 MB to 1 GB of data from being in the Page Cache), especially since compression reduces the size of the table (the 2.5 GB table would be 600 MB compressed, so I use a 20 GB table that compresses to 4.7 GB).  The script to create the uncompressed table is in Annex #1, and the compression and Huge Pages trick is in Annex #2.

Out of curiosity, and opening a parenthesis, I ran sysbench single-threaded for 60 seconds before and after compressing, to compare the performance of non-compressed and compressed tables (these runs were done with a 1 GB Buffer Pool and with the table fitting in the Page Cache, so before applying the Huge Pages trick).  I get 33,056 qps for non-compressed and 10,528 qps for compressed, which is about 3 times slower.  In this test configuration, decompression really shows its overhead.  Closing the parenthesis.

In Annex #3, I repeat the tests done in the previous post to show the backoff stalls.  Before running the tests, I extended the size of the Buffer Pool to 8 GB, allowing the 4.7 GB table to fully fit "in RAM", and I ran a SELECT to try to load the table in the Buffer Pool.  However, as seen later in the post, this does not fully lead to the expected result of having the table "fully in RAM".

Very similar stalls are happening, with the difference that most (but not all) of these stalls are logged in the slow log as not needing an IO.  Below is an example of a slow log entry for a stall not needing an IO.

# Time: YYYY-MM-DDThh:mm:ss.uuuuuuZ
# User@Host: msandbox[msandbox] @ localhost []  Id:    27
# Schema: sbtest  Last_errno: 0  Killed: 0
# Query_time: 0.763248  Lock_time: 0.000000  Rows_sent: 1  Rows_examined: 1  Rows_affected: 0  Bytes_sent: 197
# InnoDB_trx_id: 0
#   InnoDB_IO_r_ops: 0  InnoDB_IO_r_bytes: 0  InnoDB_IO_r_wait: 0.000000
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 6
SET timestamp=1670168419;
SELECT c FROM sbtest1 WHERE id=1379872;

However, some slow queries need an IO, which surprised me because the 4.7 GB table should fully fit in the Buffer Pool.  Below is an example of a slow log entry for a stall needing an IO.  These do not happen very often; you might have to run the tests many times to get one.

# Time: YYYY-MM-DDThh:mm:ss.uuuuuuZ
# User@Host: msandbox[msandbox] @ localhost []  Id:    27
# Schema: sbtest  Last_errno: 0  Killed: 0
# Query_time: 0.199434  Lock_time: 0.000001  Rows_sent: 1  Rows_examined: 1  Rows_affected: 0  Bytes_sent: 197
# InnoDB_trx_id: 0
#   InnoDB_IO_r_ops: 1  InnoDB_IO_r_bytes: 4096  InnoDB_IO_r_wait: 0.000566
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 5
SET timestamp=1670174908;
SELECT c FROM sbtest1 WHERE id=145080;
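To tell apart the stalls needing an IO from those that do not, the slow log entries can be grouped by their InnoDB_IO_r_ops value.  Below is a sketch assuming the log_slow_verbosity=innodb format shown above.

# Count slow log entries by number of InnoDB read IOs
#   (IO_r_ops=0 means the stall did not need an IO).
awk '$2 == "InnoDB_IO_r_ops:" {count[$3]++}
     END {for (ops in count) print "IO_r_ops=" ops ": " count[ops] " queries"}' \
  data/slow.log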

When I saw that for the first time, I was puzzled and wanted to know more.  So in addition to counting slow queries while running sysbench, I also looked at IOs and decompressions, with the global status variable Innodb_buffer_pool_reads and uncompress_ops from the Information Schema table INNODB_CMP.  This confirmed what I saw in the slow logs: even though the compressed table should fit in the Buffer Pool, InnoDB is doing IOs to serve queries.  I understood why, and I will give more details in a future post (every time I start looking closer at something, I find more things to look at: MySQL is full of surprises).

Another parenthesis: in Annex #3, I show results for Percona Server 8.0.29, but I also ran the tests with 5.7.39.  The 5.7 tests also show stalls, but more interestingly, they show 36,161 and 11,143 qps for the non-compressed and compressed tests (reminder: it was 33,056 and 10,528 for 8.0.29).  This is 8.6% and 5.5% less throughput in 8.0.  This is one more example of performance regressions in newer MySQL versions (link #1, link #2, and link #3), which is concerning.  I have to confess I did not do a full analysis of where these regressions come from, maybe it is a changed default (like the character set), but as these regressions are also observed and reported by others, I assume they are real.  Closing this other parenthesis.

From the point of view of Free Page Consumption,
a non-compressed table larger than the Buffer Pool (BP)
and fitting in the Page Cache is similar to
a compressed table smaller than the BP
whose uncompressed pages do not fit in the BP

So this is it: reproducing Empty Free List stalls in an environment where a non-compressed table is larger than the Buffer Pool but fits in the Page Cache is similar to reproducing these stalls with a compressed table that is smaller than the Buffer Pool but whose uncompressed pages do not fit in the Buffer Pool.


This is all I have for today.  The TL;DR is the following:

  • Even without doing an IO, a SELECT query can need Free Pages because of decompressions (a SELECT can also need Free Pages because of Page-Splits when merging index updates from the Change Buffer, but this is another story),
  • When using the backoff Empty Free List Algorithm, these Free Page needs might cause stalls,
  • InnoDB Table Compression can significantly impact the throughput of MySQL (3 times fewer qps for the micro-benchmark of this post),
  • There are significant throughput regressions from 5.7 to 8.0 (8.6% and 5.5% less throughput for the micro-benchmarks of this post),
  • InnoDB Table Compression generates unexpected IOs (this will be the subject of another post),
  • and my previous environment with a table larger than the Buffer Pool but fitting in the Page Cache was a way to simulate InnoDB Table Compression.

Annex #1: dbdeployer Sandbox for Free Page Consumption with Compression

# Create a sandbox for our tests.
# Because the used sysbench binaries do not work with caching_sha2_password,
#   use mysql_native_password.
dbdc="-c skip-log-bin"
dbdc="$dbdc -c slow_query_log=ON -c log_slow_verbosity=innodb"
dbdc="$dbdc -c slow_query_log_file=slow.log -c long_query_time=0.05"
dbdc="$dbdc -c innodb_adaptive_hash_index=OFF"
dbdc="$dbdc -c innodb_buffer_pool_instances=1"
dbdc="$dbdc -c innodb_buffer_pool_size=$((1024*1024*1024))"
dbdc="$dbdc -c default-authentication-plugin=mysql_native_password"
dbdeployer deploy single ps_8.0.29 $dbdc

# Create a schema for sysbench and fill it.
# The table size is computed to get a 16 GB table with 16 rows per 16 KB page.
# (sba: sysbench arguments)
./use <<< "CREATE DATABASE sbtest"
nb_rows="$((16*1024*1024*1024 / (16*1024) * 16))"
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"
time sysbench $sba prepare --create_secondary=off
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Creating table 'sbtest1'...
Inserting 16777216 records into 'sbtest1'

real    2m37.847s
user    0m18.220s
sys     0m0.971s

# Make the table bigger by padding rows with data.
# This is done in such a way that each InnoDB Page contains 16 rows
#   (sysbench created the table with two CHAR columns of size 120 and 60).
time { echo "ALTER TABLE sbtest1 ADD COLUMN c0 CHAR(60) DEFAULT ''"
       seq -f " ADD COLUMN c%.0f CHAR(240) DEFAULT ''" 1 3
} | paste -s -d "," | ./use sbtest
real    10m17.757s
user    0m0.009s
sys     0m0.000s

# Normally, because 8.0 uses instant ADD COLUMN, an identity ALTER is needed
#   for actually inflating the table.  But the ALTER above took longer than
#   expected, which looks like a bug, so I opened PS-6857.  For PS 8.0.29,
#   the ALTER below is not strictly needed.
time ./use sbtest <<< "ALTER TABLE sbtest1 ENGINE=InnoDB"
real    10m22.727s
user    0m0.002s
sys     0m0.005s

# Make sure all pages are flushed.
time ./use sbtest <<< "FLUSH TABLE sbtest1 FOR EXPORT"
real    0m0.010s
user    0m0.007s
sys     0m0.000s

# Look at the size of the table.
du -h data/sbtest/sbtest1.ibd
20G     data/sbtest/sbtest1.ibd

Annex #2: Compressing the Table and Realistic IOs (not All Hit in the Page Cache)

# Common stuff.
nb_rows="$((16*1024*1024*1024 / (16*1024) * 16))"
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"

# Out of curiosity, sysbench before compression (for comparing performance).
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
sysbench $sba --threads=1 --time=60 run | grep -e queries: -e avg: -e max
    queries:                             1983454 (33056.57 per sec.)
         avg:                                    0.03
         max:                                    0.51

# Compress the table.
time ./use sbtest <<< "ALTER TABLE sbtest1 KEY_BLOCK_SIZE=4"
real    9m44.327s
user    0m0.002s
sys     0m0.005s

# Make sure all pages are flushed.
time ./use sbtest <<< "FLUSH TABLE sbtest1 FOR EXPORT"
real    0m0.010s
user    0m0.002s
sys     0m0.005s

# Look at the size of the table.
du -h data/sbtest/sbtest1.ibd
4.7G    data/sbtest/sbtest1.ibd

# Out of curiosity, sysbench after compression (for comparing performance).
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
sysbench $sba --threads=1 --time=60 run | grep -e queries: -e avg: -e max
    queries:                             631707 (10528.13 per sec.)
         avg:                                    0.09
         max:                                    1.94

# Allocate Huge Pages to prevent the table from being in the Linux Page Cache.
# Depending on the amount of RAM, the numbers might have to be adjusted.
# To learn more about this trick, read this blog post:
# https://jfg-mysql.blogspot.com/2021/11/trick-simulate-linux-server-with-less-ram.html
grep -e MemTotal -e SwapTotal -e MemFree -e "^Cached" /proc/meminfo
MemTotal:       64819156 kB
MemFree:        54962156 kB
Cached:          6115828 kB
SwapTotal:             0 kB
sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
grep -e MemTotal -e SwapTotal -e MemFree -e "^Cached" /proc/meminfo
MemTotal:       64819156 kB
MemFree:        61419540 kB
Cached:           362476 kB
SwapTotal:             0 kB
sudo bash -c "echo $((512*50+256)) > /proc/sys/vm/nr_hugepages"
grep -e MemTotal -e SwapTotal -e MemFree -e "^Cached" /proc/meminfo
MemTotal:       64819156 kB
MemFree:         8857076 kB
Cached:            52460 kB
SwapTotal:             0 kB

Annex #3: Reproducing backoff Stalls with a Compressed Table

# Common stuff.
nb_rows="$((16*1024*1024*1024 / (16*1024) * 16))"
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"
sq_query="SHOW GLOBAL STATUS LIKE 'Slow_queries'"
function init_sq_count() { local a; read a sq1 <<< $(./use -Ne "$sq_query"); }
function show_sq_count() { local a; read a sq2 <<< $(./use -Ne "$sq_query")
  echo "Slow_queries: $(($sq2 - $sq1))"; }
ibpr_query="SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'"
function init_ibpr_count() { local a; read a ibpr1 <<< $(./use -Ne "$ibpr_query"); }
function show_ibpr_count() { local a; read a ibpr2 <<< $(./use -Ne "$ibpr_query")
  echo "Innodb_buffer_pool_reads: $(($ibpr2 - $ibpr1))"; }
icuo_query="SELECT SUM(uncompress_ops) FROM information_schema.INNODB_CMP"
function init_icuo_count() { icuo1=$(./use -Ne "$icuo_query"); }
function show_icuo_count() { icuo2=$(./use -Ne "$icuo_query")
  echo "Innodb_cmp_uncompress_ops: $(($icuo2 - $icuo1))"; }
function init_counts() { init_sq_count; init_ibpr_count; init_icuo_count; }
function show_counts() { show_sq_count; show_ibpr_count; show_icuo_count; }
threads=4; rate=$(($threads * 100)); limit=$((16 * 1024 * 6/5))

# Grow the Buffer Pool to fully fit the compressed table "in RAM", and try loading
#   the table in the Buffer Pool (this does not fully work, but it is out of scope of this post).
time ./use sbtest <<< "
  SET GLOBAL innodb_buffer_pool_size = $((8*1024*1024*1024));
  SELECT COUNT(*) FROM sbtest1" > /dev/null
real    0m54.341s
user    0m0.009s
sys     0m0.002s

# Sysbench with backoff, filtered output and counts.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
init_counts;                                                              \
sysbench $sba --threads=1 --time=2 run | grep -e queries: -e avg: -e max; \
show_counts
    queries:                             10329  (5159.46 per sec.)
         avg:                                    0.19
         max:                                  763.27
Slow_queries: 2
Innodb_buffer_pool_reads: 203
Innodb_cmp_uncompress_ops: 11944

# Sysbench with legacy, filtered output and counts.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
init_counts;                                                              \
sysbench $sba --threads=1 --time=2 run | grep -e queries: -e avg: -e max; \
show_counts
    queries:                             23688  (11833.20 per sec.)
         avg:                                    0.08
         max:                                    1.19
Slow_queries: 0
Innodb_buffer_pool_reads: 593
Innodb_cmp_uncompress_ops: 20504

# Repeating backoff with sleep 0, 5 and 60.
# The hairy awk script outputs queries, max, slow query, IO and uncompress counts.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
for sl in 0 5 60; do
  echo; echo "Sleep 300 and then loops with sleep $sl"; sleep 300
  for i in {0..4}; do
    sleep $sl; init_counts
    sysbench $sba --threads=1 --time=2 run | grep -e queries: -e max
    show_counts
  done |
    awk '{t[++i]=$2}
          i==5{i=0; printf "%6d %7.2f %2d %6d %6d\n", t[1], t[2], t[3], t[4], t[5]}'
done | tail -n +2
Sleep 300 and then loops with sleep 0
  5482  971.51  2    158   4710
 23457    1.33  0    688  20173
 23389    1.20  0    721  19984
 23928    1.16  0    714  20420
 24067    1.07  0    654  20469

Sleep 300 and then loops with sleep 5
 12205  682.73  2    303  10482
 23225    9.25  0    634  19933
 23018   17.54  0    597  19744
 23966   27.56  0    557  20516
 23045   27.54  0    556  19670

Sleep 300 and then loops with sleep 60
  7369  851.67  2    201   6337
 19854  245.90  2    645  17137
 18377  255.99  1    759  15876
 19750  245.89  1    777  16833
 19494  245.90  1    805  16797

# Sysbench with backoff and 4 threads doing 100 qps each.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
init_counts;                                               \
sysbench $sba --threads=$threads --time=2 --rate=$rate run |
  grep -e queries: -e avg: -e max;                         \
show_counts
    queries:                             824    (410.49 per sec.)
         avg:                                    0.12
         max:                                    0.65
Slow_queries: 0
Innodb_buffer_pool_reads: 27
Innodb_cmp_uncompress_ops: 707

# Similar to above, emptying the free list while sysbench is running,
#   with filtered output and counts.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
( id=$(($RANDOM * ($nb_rows - $limit) / (32*1024)))
  ./use sbtest -e "
    SET long_query_time = 9; DO sleep(1);
    SELECT id FROM sbtest1 WHERE id > $id LIMIT $limit, 1" > /dev/null &
  init_counts
  sysbench $sba --threads=$threads --time=2 --rate=$rate run |
    grep -e queries: -e avg: -e max
  show_counts; )
    queries:                             801    (399.18 per sec.)
         avg:                                   89.06
         max:                                  585.05
Slow_queries: 4
Innodb_buffer_pool_reads: 78
Innodb_cmp_uncompress_ops: 1651

# Similar to above, for 10 minutes, emptying the free list every 2 minutes,
#   and showing all slow query times.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
n=5; l=120; time=$(($n * $l)); \
( tail -f -n 0 data/slow.log |
    awk '$2=="Time:"{t=substr($3,12)} $2=="Query_time:"{print t ": " $3}' &
  trap "kill $!" EXIT
  for i in $(seq 1 $n); do
    sleep $(($l / 2))
    id=$(($RANDOM * ($nb_rows - $limit) / (32*1024)))
    ./use sbtest -e "
      SET long_query_time = 9;
      SELECT id FROM sbtest1 WHERE id > $id LIMIT $limit, 1" > /dev/null
    sleep $(($l / 2))
  done &
  init_counts
  sysbench $sba --threads=$threads --time=$time --rate=$rate run | grep -e queries:
  show_counts; )
16:27:24.733619Z: 0.199349
16:27:24.733626Z: 0.199042
16:27:24.733630Z: 0.203437
16:27:24.733624Z: 0.208039
16:29:25.134491Z: 0.168449
16:29:25.134492Z: 0.168464
16:29:25.137346Z: 0.168357
16:29:25.139026Z: 0.188526
16:31:25.537864Z: 0.255990
16:31:25.542367Z: 0.266015
16:31:25.542700Z: 0.265974
16:31:25.543174Z: 0.276117
16:35:25.856932Z: 0.057693
16:35:25.858968Z: 0.057736
    queries:                             239343 (398.90 per sec.)
Slow_queries: 14
Innodb_buffer_pool_reads: 8147
Innodb_cmp_uncompress_ops: 209818
