Wednesday, November 30, 2022

Tail Latencies in Percona Server because of InnoDB Stalls on Empty Free List

If, in Percona Server, you are observing tail latencies on queries that should be fast, this might be a side effect of Percona's improved InnoDB Empty Free List Algorithm.  When using this algorithm (the default in 5.6 and 5.7, and optionally configurable in 8.0), a query needing a free page while none are available waits until the LRU Manager Thread refills the free list.  Because this thread only wakes up at regular intervals, the query is blocked (which explains the latency) until the LRU Manager Thread is done sleeping and completes its work.  These latencies / stalls might last up to one second in the worst case.  All the details are in this post, starting with InnoDB Free Pages.

TL;DR :

  • Consider not using the backoff Empty Free List Algorithm,
  • If you stick to using this algorithm, monitor Innodb_buffer_pool_wait_free (or buffer_pool_wait_free) as each time this counter increases, there might be a stall,
  • To reduce stalls when using backoff, optimize the schema and queries, and carefully tune innodb_lru_scan_depth (such a weird name with no direct relation to free pages; what carefully means is explained in the post).

InnoDB Free Pages and Free List

When using the InnoDB Storage Engine, all operations on data — including reads and writes — happen "in RAM", which in this context means in the InnoDB Buffer Pool.  The buffer pool is a memory buffer where InnoDB caches table and index data as they are accessed.  Its size is a fixed / finite value defined by the configuration variable innodb_buffer_pool_size.  Operations in InnoDB are usually fast because of clever use of the buffer pool.  Fully explaining how it works is beyond the scope of this post; here, I focus on Free Pages.

The buffer pool is organized in Pages.  The size of these pages is usually 16 kilobytes, but it can be smaller or larger depending on the value used at InnoDB initialization for the setting innodb_page_size (and on other things too complex to mention at this point; update 2022-12-12: I mention these in my post Free Page Consumption by InnoDB Table Compression).  Most (all ?) data in the buffer pool is stored in pages.  When some data is not in the buffer pool, InnoDB needs to fetch the matching page(s) before being able to read or write this data.  Because the buffer pool is of finite size, when it is full and new data is needed in RAM, some pages need to be evicted to make room for the new data.  Eviction is a complicated process, and doing it while data is fetched introduces latencies.  So a background InnoDB process keeps a buffer of free pages which can be quickly grabbed when fetching data.  This buffer is the Free List.
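
For reference, the page size and buffer pool size in use can be checked with a simple query (a minimal example; Annex #1 shows how to look at the free list itself) :

mysql> SELECT @@innodb_page_size, @@innodb_buffer_pool_size;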

There is a lot that could be written about the free list, and it would make for a long post.  Below is a summary with links to learn more.

One important thing to understand in the context of this post is what happens when the free list is empty and a free page is needed.  This is the subject of the next section.

Needing Free Pages With an Empty Free List

This is where Oracle MySQL and Percona Server for MySQL (called Percona Server for short in the rest of this post) differ.  But to fully understand what happens in this case, InnoDB Buffer Pool Flushing must be described first.

InnoDB Buffer Pool Flushing, sometimes called InnoDB Flushing or InnoDB Page Flushing, is a background process that maintains the buffer pool.  When a buffer pool page is updated, this modification does not immediately propagate to the corresponding file on disk.  A log entry for this modification is added to the InnoDB Redo Log, and until the page is written to disk, it is considered dirty.  Writing — also called flushing or cleaning — dirty pages is one of the responsibilities of InnoDB Flushing.  Cleaning dirty pages is a complex process juggling many tradeoffs which is out of the scope of this post; more can be read about it in the MySQL Manual (Configuring InnoDB Buffer Pool Flushing) and in many blog posts including InnoDB Page Flushing Diagram, Tuning MySQL/InnoDB Flushing, InnoDB Flushing in Action, and an old post from 2011 on Different flavors of InnoDB flushing (I do not know if it is still up to date, but it very clearly describes the different types of flushing).

One type / flavor of InnoDB Flushing is LRU Flushing.  This is the process of keeping the tail of the LRU clean by flushing dirty pages (I did not describe this in detail; you can infer that the buffer pool is organized in a Least Recently Used / LRU data structure).  Having a non-dirty / clean tail of the LRU is important for free page generation.  The length of the tail of the LRU that is kept clean by the flushing process is controlled by the configuration variable innodb_lru_scan_depth (as described above, this variable also controls the number of free pages in the free list; there should probably be two, and maybe three, variables for this; for more details, read Bug#108928 : Please consider adding innodb_free_list_target_size global variable).

Finally, the free list is refilled by the page cleaner threads (this is the name of the background flushing threads; their number is configurable with the configuration variable innodb_page_cleaners).  This is done by evicting non-dirty / clean pages from the tail of the LRU and adding them to the free list.
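
For reference, the related settings can be inspected with the query below (a minimal sketch; the values depend on your configuration) :

mysql> SELECT @@innodb_page_cleaners, @@innodb_buffer_pool_instances,
         @@innodb_lru_scan_depth;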

One last thing to mention about InnoDB Flushing : Percona implements a Multi-threaded LRU flusher.  In Oracle MySQL, all of the following is done by the page cleaners :

  • flushing dirty pages,
  • keeping the tail of the LRU clean,
  • and refilling the free list.

In Percona Server, keeping the tail of the LRU clean and refilling the free list are delegated to other threads : the LRU Manager Threads.

Back to what happens when the free list is empty...  In Oracle MySQL, when the free list is empty, a query thread needing a free page takes on the responsibility of a page cleaner thread and tries evicting a page from the tail of the LRU.  When the tail of the LRU is clean, this is quick, but when the whole tail of the LRU is dirty, this needs writing to disk, which takes longer.  If there are many simultaneous queries needing free pages when the free list is empty and the whole tail of the LRU is dirty, there will be contention on the LRU mutex while single page flushes are executed.  This is where the Percona improvements come into play.

In Percona Server,
with the backoff Empty Free List Algorithm,
stalls up to one second can happen
waiting for a free page

Percona implements an alternative algorithm for dealing with an empty free list : Priority refill.  When this algorithm is used — which is the default in 5.7, was the default in the now EOL 5.6, and is optionally configurable in 8.0 by setting innodb_empty_free_list_algorithm to backoff — a query thread needing a free page while the free list is empty waits for the LRU Manager Thread to produce free pages.  This avoids contention on the LRU mutex, but introduces a delay waiting for the LRU Manager Thread.  Our tail latencies / stalls come from these delays.  In some situations, the stalls can be nearly nonexistent or unnoticeable (less than 5 milliseconds; these are called micro-stalls in the rest of this post), but in other situations they can be more than 100 milliseconds and up to one second, which is hard to miss (these are called macro-stalls in this post).  The next section shows examples of these stalls and explains their variability (monitoring and solutions are presented in other sections).
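
For reference, the algorithm in use can be checked and changed at runtime in Percona Server (this is how it is switched in the annexes below; the variable does not exist in Oracle MySQL) :

mysql> SELECT @@innodb_empty_free_list_algorithm;
mysql> SET GLOBAL innodb_empty_free_list_algorithm = backoff;  -- or legacy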

Observing Stalls of the backoff Empty Free List Algorithm

In this section, I describe how to observe the stalls of the backoff Empty Free List Algorithm.  For this, I am using dbdeployer, Percona Server 8.0.29 (I get similar results with 5.7.39), and sysbench, running on a VM with 16 vCPUs and 64 GB of RAM (this is a little excessive, but I want to make sure to avoid CPU starvation).  The sandbox and the dataset are created using the commands listed in Annex #2.  One characteristic of this sandbox and dataset is that the table used does not fit in the buffer pool, but it fits in the Linux Page Cache, so IOs are very fast (these are not even "real" IOs : MySQL thinks they are IOs, but they are cache hits in the Page Cache).  This is a little artificial, but it is an easy way to reproduce stalls without very low latency SSDs or complex parallelism.  But bear with me, this scenario will become more realistic as we explore all the details around Free Pages.

In Annex #3, I run sysbench single-threaded without rate-limiting to reproduce tail latencies.  I expect to quickly empty the free list and to produce stalls.  Surprisingly, even though each execution of sysbench produces a stall, there are very few queries in the slow query log.  Also, executing sysbench 10 times in a loop only produces slow queries in the first execution, all subsequent executions being fast.  From that, I am guessing that an adaptive algorithm is at play, waking up the LRU Manager Thread more often when the free list is empty.  To confirm that, I reran sysbench in a loop, but sleeping 5 and 60 seconds between iterations, and I get stalls of around 70 and 300 milliseconds respectively.  When running without the backoff algorithm, I do not get stalls.

As explained in the previous paragraph, the test results made me suspect an adaptive algorithm.  After exploring the Percona Server code, I found the function buf_lru_manager_adapt_sleep_time (link to 5.7.39 code and link to 8.0.29 code) which adjusts the LRU Manager Thread sleep time based on the free list length and the last flush result.  When the free list is filled at less than 5% and the previous iteration resulted in flushed pages, the sleep time is set to zero milliseconds.  The sleep time is then raised by one millisecond for each following iteration where the free list is filled at more than 20%.  Neglecting the flush delay, a sleep / stall of close to sqrt(2t) milliseconds happens t milliseconds after emptying the free list (computation details in Annex #4).  This means that a 100 ms stall happens 5 seconds after emptying the free list, and a stall of 346 ms happens after 60 seconds (the tests in Annex #3 give 70 and 300 ms, which is close enough considering the flush delay is neglected).
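
To give a sense of scale, below is a small awk sketch (purely illustrative) evaluating this sqrt(2t) approximation for the two delays used in Annex #3 :

# Expected worst-case stall (in ms) t milliseconds after emptying the free
#   list, using the sqrt(2t) approximation (flush delay neglected).
for t in 5000 60000; do
  awk -v t="$t" 'BEGIN{printf "after %5d ms : stall of up to ~%3.0f ms\n", t, sqrt(2*t)}'
done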

In my tests from Annex #3, I only get one or two slow queries / stalls while running thousands of queries, and only at the beginning of executing sysbench.  This is inconvenient, but if it only happens once, it might not be a problem worth spending so much time on.  However, by being a little more clever, slow queries can be generated in such a configuration, even on a lightly loaded system.

In Annex #5, I run sysbench with 4 threads rate-limited at around 400 queries per second.  This does not generate any stall, which is expected (this needs around 400 free pages per second, and there is a buffer of 1024 free pages, which is refilled every second).  However, while re-executing sysbench like this, if I run a simple query consuming all free pages, each of the 4 sysbench threads suffers a stall (the simple query also stalls, but this is expected).  This becomes more problematic : there is a possibility of repeating stalls on a very lightly loaded system, on which the steady state usually does not empty the free list, but on which an occasional query empties the free list and causes all running queries to stall.  As shown in Annex #5, these stalls can be significant : queries usually taking less than 1 ms can take more than half a second, and this can happen at regular intervals if the query emptying the free list runs infrequently enough to let the LRU Manager Thread increase its sleep time back to values causing significant stalls.  A way to detect and avoid these stalls must be found.

Detecting Stalls of the backoff Empty Free List Algorithm

Now that the stalls of the backoff Empty Free List Algorithm are understood, we need a way to detect that they are happening.  Unfortunately, nothing in the extended Percona Server Slow log explicitly shows that a slow query is caused by a stall.  Below is an example of a slow log entry for one stall.

# Time: YYYY-MM-DDThh:mm:ss.uuuuuuZ
# User@Host: msandbox[msandbox] @ localhost []  Id:    27
# Schema: sbtest  Last_errno: 0  Killed: 0
# Query_time: 0.744840  Lock_time: 0.000003  Rows_sent: 1  Rows_examined: 1  Rows_affected: 0  Bytes_sent: 197
# InnoDB_trx_id: 0
#   InnoDB_IO_r_ops: 1  InnoDB_IO_r_bytes: 16384  InnoDB_IO_r_wait: 0.000009
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 4
SET timestamp=T;
SELECT c FROM sbtest1 WHERE id=1881701;

In the above, it can only be guessed, by elimination, that this SELECT query suffered from a backoff stall because :

  • it only read a few pages,
  • it only examined and returned a few rows,
  • it did not wait for locks, IOs nor InnoDB queue.

A slow query having IOs might be a clue that free pages were needed, but queries without IO can also suffer a backoff stall.

Update 2022-12-12: I cover queries without IO suffering a backoff stall in my post Free Page Consumption by InnoDB Table Compression.

Looking at Innodb_buffer_pool_wait_free
(or buffer_pool_wait_free)
is a relatively good way for detecting backoff stalls

One way to detect backoff stalls is to look at the global status Innodb_buffer_pool_wait_free or at the InnoDB Metric buffer_pool_wait_free.  When using the backoff algorithm, these do not have the same meaning as when using the legacy algorithm.  In the MySQL Manual, this status is described as follows (my additions in square brackets) :

When InnoDB [a query thread] needs to read or create a page and no clean pages are available, InnoDB [a query thread] flushes some dirty pages first and waits for that operation to finish. This counter counts instances of these waits.

The quote above is basically describing the legacy algorithm.  This algorithm was presented in a previous section of this post, but as a reminder, its description is copied below.

When the free list is empty, a query thread needing a free page takes on the responsibility of a page cleaner thread and tries evicting a page from the tail of the LRU.  When the tail of the LRU is clean, this is quick, but this needs writing to disk — which is longer — when all the tail of the LRU is dirty.

So in Oracle MySQL, or in Percona Server when using the legacy algorithm, the global status Innodb_buffer_pool_wait_free (and the InnoDB Metric buffer_pool_wait_free) is increased each time a query thread needs to flush a dirty page from the LRU to get a free page.  This should not be confused with evicting a clean page from the LRU : evicting a clean page is quick, while flushing a dirty page needs one or two IOs (one for the doublewrite buffer, unless it is disabled, and one to write the page to the ibd file).

When using the backoff algorithm and with an empty free list, a query thread does not scan the tail of the LRU when needing a free page.  Instead, it waits for the LRU Manager Thread to free a page, which causes the stalls described in this post.  When implementing their improvement, instead of leaving the global status Innodb_buffer_pool_wait_free (and the InnoDB Metric buffer_pool_wait_free) unused, Percona recycled this status (and metric) to count occurrences of a query thread waiting for a free page.  So these are the counters to monitor for detecting backoff stalls.
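
A quick way to read the raw counter is to query the global status directly :

mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';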

Below is a simple script for looking at the buffer_pool_wait_free InnoDB Metric.  Running this script to see the stalls of sysbench from Annexes #3 and #5 is left as an exercise to the reader.

# Looking, every second, at the delta of the InnoDB Metric buffer_pool_wait_free.
m=buffer_pool_wait_free
t=information_schema.INNODB_METRICS
query="SELECT now(3), '$m', COUNT FROM $t WHERE NAME = '$m'"
while sleep 1; do ./use -Ne "$query"; done | awk '{v=$4-n; n=$4; $4=v} NR > 1'

However, and as usual, things are not that simple.  Even if every macro-stall increments Innodb_buffer_pool_wait_free and buffer_pool_wait_free, these counters are also increased on non-significant micro-stalls.  So these counters increasing does not mean there are significant query latencies, it only means that a query thread has waited for the LRU Manager Thread.

Moreover, Innodb_buffer_pool_wait_free and buffer_pool_wait_free are not increased only once per stall, they are increased many times per stall.  As an example, the 5 stalls / slow queries caused by emptying the free list in Annex #5 result in increasing the counters by 479 !

For now, I am afraid I do not have anything better.  I think Percona should provide better observability on this, but going into more details is beyond the subject of this post.  In the meantime, Innodb_buffer_pool_wait_free and buffer_pool_wait_free are the best signals I have for stalls, even if they can lead to false positives.

Reducing and Avoiding Stalls of the backoff Empty Free List Algorithm

Now that the stalls of the backoff Empty Free List Algorithm are understood and somewhat detectable (with false positives), a way to reduce their occurrence or to fully avoid them is needed.  Besides improving the algorithm itself, I have what I think is a naive way and what I think is a good way to reduce or avoid the stalls.

Increasing innodb_lru_scan_depth
is a naive way for reducing stalls

I think increasing the configuration variable innodb_lru_scan_depth is a naive way of reducing stalls because it comes from a limited understanding of the behavior of the stalls.  If you used this method, you are in good company, because I did it myself for the last 18 months.  I only realized I was naive while writing this post.

The stalls are caused by the delay between queries starting to wait for free pages and the LRU Manager Thread waking up to refill the free list.  Having more pages in the free list sounds like a good way to have fewer stalls, and increasing innodb_lru_scan_depth is the way to achieve that.  However, even if having more free pages reduces the number of stalls, stalls will still happen when all free pages are consumed between two iterations of the LRU Manager Thread (remember that this thread might sleep up to one second before waking up to produce free pages).  Also, having fewer stalls means that they will last longer because of the adaptive algorithm.  Reducing the occurrence of the stalls while making the remaining stalls even worse does not sound like a good deal !

Increasing innodb_lru_scan_depth can still provide some good results, because fewer stalls will happen (but the remaining stalls will be longer).  However, this variable should be increased with care because more pages in the free list means fewer pages used for caching data.  When using this method, I was running the query below to see the ratio of free pages to total pages and decide if I could increase the number of free pages in the free list (@@hostname and NOW() are there to have a trace of where and when I ran the query).  My rule of thumb was that less than 5% of pages in the free list is easily acceptable, more than 20% is probably problematic, and in between is ok-ish.

mysql> SELECT @@hostname, NOW(), FLOOR(s/1024/1024) AS s_in_mb, d, i,
         d*i*16*1024/s AS free_ratio FROM (
           SELECT @@innodb_buffer_pool_size AS s, @@innodb_lru_scan_depth AS d, 
             @@innodb_buffer_pool_instances AS i) AS t;
+--------------+---------------------+---------+------+------+------------+
| @@hostname   | NOW()               | s_in_mb | d    | i    | free_ratio |
+--------------+---------------------+---------+------+------+------------+
| dbdeployer03 | 2022-11-29 00:30:03 |     128 | 1024 |    1 |     0.1250 |
+--------------+---------------------+---------+------+------+------------+
1 row in set (0.01 sec)
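
If the ratio leaves headroom, the variable can then be raised at runtime and the query above re-run to validate the new ratio (a sketch with a hypothetical value; remember the caveat above that the remaining stalls will be longer) :

mysql> SET GLOBAL innodb_lru_scan_depth = 2048;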

A little more involved way to reduce stalls is to reduce the need for free pages.  Some ways to achieve this are having a larger buffer pool (more RAM) and optimizing queries and schema to need fewer IOs (adding covering indexes, packing the same data in fewer bytes on disk, ...).  These optimizations are very use-case specific and I cannot give more general advice about them in this post.

Another way to have fewer macro-stalls is to reduce the number of free pages in the free list.  I have not explored this in detail, but with fewer free pages, we can expect more stalls, but these will mostly be micro-stalls.  This might consume more CPU because the LRU Manager Thread will not sleep for long, but it will avoid macro-stalls.  However, as long as there is only one parameter to control both the size of the free list and the length of the tail of the LRU that is kept clean (Bug#108928), this might not be a practical solution (if we only have 128 pages in the free list, we probably want more than 128 pages at the tail of the LRU to be kept clean).

The general good way of avoiding stalls
is to not use the backoff algorithm

The stalls are caused by the backoff algorithm, so a reasonable idea would be to avoid stalls by not using this algorithm (using the legacy algorithm instead).  This is even what Percona does in the latest 8.0 : backoff was the default in 5.6 and 5.7, and legacy is the default in recent 8.0.  It is unclear whether this is an omission or a well-thought decision : PS-5881 mentions needing to set the default to backoff and was marked as fixed in 8.0.17 (thanks Ivan Groenewold for finding this), but in 8.0.29, the default is legacy.  I did not do more archeology to find the details about this.

In their documentation, I understand that Percona is telling us that the backoff algorithm is useful for I/O-Bound Highly-Concurrent Workloads.  I would like to add my grain of salt to that : I think the backoff algorithm is useful for Write-I/O-Bound and Highly-Concurrent Workloads where Flushing is not quick enough to keep the tail of the LRU clean.  There is a lot to unpack in this statement, so let's start by revisiting the legacy algorithm and the inception of the backoff algorithm.

In the legacy algorithm, a query thread needing a free page while there are none first tries to find a clean page to evict from the LRU and falls back to flushing a dirty page if no clean pages are found.  Backoff improves on that by waiting for a free page to be produced to avoid LRU contention.  But for me, LRU contention should only happen if the legacy algorithm needs to flush a page, because in the case of a query thread evicting a clean page, the LRU mutex is only held for a very short amount of time.  So for me, as long as query threads do not fall back to single page flush, the legacy algorithm should work just fine.

Using the backoff algorithm by default is a premature
optimization with significant drawbacks (the stalls)

I am sure there are situations where the backoff algorithm performs better than legacy (if not, Percona would not have implemented it).  But as this algorithm might cause stalls, it should only be used when needed.  And before enabling it, flushing should probably be tuned to try to avoid dirty pages reaching the tail of the LRU.  Fully understanding when the backoff algorithm performs better than legacy is a complicated subject that I have yet to explore : if I need to go there, I will write a post about it.  But my current conviction is that in most cases, the best way to avoid stalls is to stop using the backoff algorithm and to use the legacy algorithm instead.

To be sure that I am not introducing a regression by changing the Empty Free List Algorithm, I look at the Percona extended global status innodb_buffer_pool_pages_lru_flushed.  This status is increased each time a dirty page is flushed by the LRU Manager Thread (not to be confused with single page flushes by query threads, occurring when a free page is needed while the free list is empty and no clean page can be evicted from the LRU).  It is my understanding / educated guess that if no dirty pages are reaching the tail of the LRU, the legacy algorithm should perform better than the backoff algorithm because a query thread can quickly evict a clean page from the LRU.  So as long as innodb_buffer_pool_pages_lru_flushed does not increase rapidly, most pages at the tail of the LRU are clean and the legacy algorithm should perform well.
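
Below is a delta watcher for this status, in the same style as the buffer_pool_wait_free script above (a sketch; the LIKE pattern is case-insensitive, so the capitalization of the status name does not matter) :

# Looking, every second, at the delta of the Percona extended status
#   innodb_buffer_pool_pages_lru_flushed.
s=innodb_buffer_pool_pages_lru_flushed
query="SHOW GLOBAL STATUS LIKE '$s'"
while sleep 1; do ./use -Ne "$query"; done | awk '{v=$2-n; n=$2; $2=v} NR > 1'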

Note that I was not able to find any similar counter in Oracle MySQL.  It looks like this was added by Vadim Tkachenko in Percona Server 5.5 (in the post Different flavors of InnoDB flushing, I found "I added the SHOW STATUS variable innodb_buffer_pool_pages_LRU_flushed, which shows only pages flushed by the LRU list").  I think this is an important counter, thanks Vadim for adding this.  For this to be implemented in Oracle MySQL, I opened Bug#109208 : Implement a counter for observing LRU flushing.

There might be some situations where the backoff algorithm is better than legacy, and in these cases, the stalls might be acceptable.  But I think these situations are the exception rather than the norm, and I think optimizing for these is a premature optimization.  Intuitively, I think such a situation would happen when query threads hold the LRU mutex for a long time, cascading into LRU contention.  This can be detected by enabling some InnoDB Metrics and seeing buffer_LRU_search_scanned increasing much faster than buffer_LRU_search_num_scan (query threads need to scan many pages in the LRU before finding a clean page to evict, holding the LRU mutex for some time), or seeing buffer_LRU_single_flush_scanned increasing (query threads need to flush dirty pages from the LRU).  However, I think there are bugs in these metrics, and I opened Bug#109200 : Metrics not increased for 1st iteration in buf_LRU_free_from_common_LRU_list (at the time of writing this post, that bug has been dismissed by the MySQL Verification Team, but it has been accepted by a fork which I cannot name to avoid Planet MySQL censorship, see the sister bug report that has already been fixed).
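
For reference, most of these metrics are not enabled by default; below is a minimal way to enable and read them (innodb_monitor_enable accepts the '%' wildcard) :

mysql> SET GLOBAL innodb_monitor_enable = 'buffer_LRU_%';
mysql> SELECT NAME, COUNT FROM information_schema.INNODB_METRICS
         WHERE NAME IN ('buffer_LRU_search_scanned', 'buffer_LRU_search_num_scan',
                        'buffer_LRU_single_flush_scanned');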

Conclusion and Future Work

In this post, most of the details about the tail latencies of the backoff Empty Free List Algorithm have been covered.  At this point, it is my opinion that the backoff setting is not a good general default and that legacy should be used.  There are situations which can benefit from the backoff algorithm, but it looks to me that these are more the exception than the norm, and that the backoff algorithm should be enabled on a case-by-case basis after understanding that it is better than legacy.  I did not fully explore when the backoff algorithm is better than legacy, I might do it in the future if I find a situation where legacy behaves poorly in one of my production environments.

Something quickly mentioned and not fully explored in this post is the relevance of the testing configuration with a small buffer pool and the table fitting in the Linux Page Cache.  I will cover this in a follow-up post.  But there is a hint about this hidden in the post : when I mentioned that a query might need free pages even if not doing IOs.  If you can guess when this happens, feel free to fill in the form When does MySQL Need Free Pages Without Doing IOs.  You can already forget about Page-Splits on INSERTs or UPDATEs, or the Change Buffer and the Undo Logs : the situation I have in mind concerns SELECT queries.

Update 2022-12-12: I cover queries without IO suffering a backoff stall and the relevance of the testing configuration in my post Free Page Consumption by InnoDB Table Compression.

In addition to a follow-up post about the situation where a query needs free pages without doing IOs, I have another post planned to describe another bad behavior of the backoff algorithm (I mentioned in this post that Innodb_buffer_pool_wait_free is increased more than once per stall).  As you can see, I am not done writing about InnoDB LRU and Free List.


Annex #1 : Looking at the InnoDB Free List Size

The total number of free pages can be obtained by looking at the global status Innodb_buffer_pool_pages_free or the InnoDB Metric buffer_pool_pages_free :

mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| Innodb_buffer_pool_pages_free | 1024  |
+-------------------------------+-------+
1 row in set (0.00 sec)

mysql> SELECT NAME, COUNT FROM information_schema.INNODB_METRICS
         WHERE NAME = 'buffer_pool_pages_free';
+------------------------+-------+
| NAME                   | COUNT |
+------------------------+-------+
| buffer_pool_pages_free |  1024 |
+------------------------+-------+
1 row in set (0.00 sec)

The number of free pages per buffer pool instance can be obtained by querying the INFORMATION_SCHEMA table INNODB_BUFFER_POOL_STATS (example below with 2 instances for a buffer pool of 2 GB just after startup with most pages being free) :

mysql> SELECT POOL_ID, FREE_BUFFERS
         FROM information_schema.INNODB_BUFFER_POOL_STATS;
+---------+--------------+
| POOL_ID | FREE_BUFFERS |
+---------+--------------+
|       0 |        65349 |
|       1 |        65403 |
+---------+--------------+
2 rows in set (0.00 sec)

Also, both metrics are in SHOW ENGINE INNODB STATUS :

mysql> SHOW ENGINE INNODB STATUS\G
[...]
----------------------
BUFFER POOL AND MEMORY
----------------------
[...]
Free buffers       130752
[...]
----------------------
INDIVIDUAL BUFFER POOL INFO
----------------------
---BUFFER POOL 0
[...]
Free buffers       65349
[...]
---BUFFER POOL 1
[...]
Free buffers       65403
[...]
1 row in set (0.00 sec)

Annex #2 : dbdeployer Sandbox for Empty Free List Stalls

# Create a sandbox for our tests.
# Because the used sysbench binaries do not work with caching_sha2_password,
#   use mysql_native_password.
dbdc="-c skip-log-bin"
dbdc="$dbdc -c slow_query_log=ON -c log_slow_verbosity=innodb"
dbdc="$dbdc -c slow_query_log_file=slow.log -c long_query_time=0.05"
dbdc="$dbdc -c innodb_adaptive_hash_index=OFF"
dbdc="$dbdc -c innodb_buffer_pool_instances=1"
dbdc="$dbdc -c innodb_buffer_pool_size=$((128*1024*1024))"
dbdc="$dbdc -c default-authentication-plugin=mysql_native_password"
dbdeployer deploy single ps_8.0.29 $dbdc

# Create a schema for sysbench and fill it.
# The table size is computed for having a 2 GB table with 16 rows per 16KB page.
# (sba: sysbench arguments)
./use <<< "CREATE DATABASE sbtest"
nb_rows=$((2*1024*1024*1024 / (16*1024) * 16))
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"
time sysbench $sba prepare --create_secondary=off
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Creating table 'sbtest1'...
Inserting 2097152 records into 'sbtest1'

real    0m27.030s
user    0m2.313s
sys     0m0.112s

# Make the table bigger by padding rows with data.
# This is done in such a way as each InnoDB Page contains 16 rows
#   (the table was created with two CHAR of size 120 and 60).
time { echo "ALTER TABLE sbtest1 ADD COLUMN c0 CHAR(60) DEFAULT ''"
       seq -f " ADD COLUMN c%.0f CHAR(240) DEFAULT ''" 1 3
} | paste -s -d "," | ./use sbtest
real    1m22.110s
user    0m0.008s
sys     0m0.001s

# Normally, because of instant ADD COLUMN, 8.0 needs running an identity
#   ALTER for inflating the table.  But the ALTER above took longer than
#   expected, which looks like a bug, so I opened PS-8461.  For PS 8.0.29,
#   below is not absolutely needed.
time ./use sbtest <<< "ALTER TABLE sbtest1 ENGINE=InnoDB"
real    1m27.039s
user    0m0.003s
sys     0m0.004s

# Make sure all pages are flushed.
time ./use sbtest <<< "FLUSH TABLE sbtest1 FOR EXPORT"
real    0m0.020s
user    0m0.007s
sys     0m0.000s

# Look at the size of the table.
du -h data/sbtest/sbtest1.ibd
2.5G    data/sbtest/sbtest1.ibd

Annex #3 : First Attempts at backoff Stalls

# First attempts common stuff.
nb_rows=$((4*1024*1024*1024 / (16*1024) * 16))
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"
sq_query="SHOW GLOBAL STATUS LIKE 'Slow_queries'"
function init_sq_count() { local a; read a sq1 <<< $(./use -Ne "$sq_query"); }
function show_sq_count() { local a; read a sq2 <<< $(./use -Ne "$sq_query")
  echo "Slow_queries: $(($sq2 - $sq1))"; }

# Sysbench with backoff, filtered output and slow query count.
# This max is unexpected : it is a stall described in this post.
# But there are few such stalls / slow queries.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
init_sq_count;                                                            \
sysbench $sba --threads=1 --time=2 run | grep -e queries: -e avg: -e max; \
show_sq_count
    queries:                             41317  (20640.18 per sec.)
         avg:                                    0.05
         max:                                  763.21
Slow_queries: 1

# Sysbench with legacy and filtered output : no stalls.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
sysbench $sba --threads=1 --time=2 run | grep -e queries: -e avg: -e max
    queries:                             67702  (33820.64 per sec.)
         avg:                                    0.03
         max:                                    0.17

# Repeat legacy : no stalls.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
for i in {0..9}; do sysbench $sba --threads=1 --time=2 run; done | grep max
         max:                                    0.21
         max:                                    0.21
         max:                                    0.25
         max:                                    0.29
         max:                                    0.20
         max:                                    0.15
         max:                                    0.21
         max:                                    0.22
         max:                                    0.19
         max:                                    0.21

# Repeat backoff.
# The hairy awk script outputs total queries, max, and slow query count.
# The first repetition has a big stall / slow queries, but not the others.
# An adaptive algorithm is probably preventing subsequent stalls.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
for i in {0..4}; do
  init_sq_count
  sysbench $sba --threads=1 --time=2 run | grep -e queries: -e max
  show_sq_count
done | awk '{t[++i]=$2} i==3{i=0; printf "%6d %7.2f %2d\n", t[1], t[2], t[3]}'
 29969  743.05  2
 69241    0.22  0
 69835    0.20  0
 69669    0.19  0
 70284    0.18  0

# Repeat backoff with sleep 5 for messing-up with an adaptive algorithm.
# The first repetition has a big stall / slow queries, the others have
#   smaller ones with their matching slow queries.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
for i in {0..4}; do
  sleep 5; init_sq_count
  sysbench $sba --threads=1 --time=2 run | grep -e queries: -e max
  show_sq_count
done | awk '{t[++i]=$2} i==3{i=0; printf "%6d %7.2f %2d\n", t[1], t[2], t[3]}' 
 41956  763.21  1
 65206   77.80  2
 64125   77.80  2
 66167   77.79  1
 65598   77.82  2

# Repeat backoff with sleep 60 : the subsequent stalls are bigger.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
for i in {0..4}; do
  sleep 60; init_sq_count
  sysbench $sba --threads=1 --time=2 run | grep -e queries: -e max
  show_sq_count
done | awk '{t[++i]=$2} i==3{i=0; printf "%6d %7.2f %2d\n", t[1], t[2], t[3]}'
 20356  811.03  2
 54296  316.36  2
 55339  326.42  2
 55064  306.55  2
 55181  316.31  2

# Verify legacy and sleep : no stalls.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = legacy"
for i in {0..9}; do sleep 60; sysbench $sba --threads=1 --time=2 run | grep max; done
         max:                                    0.16
         max:                                    0.18
         max:                                    0.27
         max:                                    0.20
         max:                                    0.25
         max:                                    0.24
         max:                                    0.22
         max:                                    0.23
         max:                                    0.20
         max:                                    0.53

Annex #4 : LRU Manager Thread Adaptive Sleep Math

In the function buf_lru_manager_adapt_sleep_time (5.7.39 code and 8.0.29 code), the sleep delay is increased by 1 millisecond at each iteration when the free list is more than 20% full.  This means that after n iterations with a free list more than 20% full, the total time elapsed t is 1+2+3+...+n which is n(n+1)/2.  Solving t = n(n+1)/2, we have :

  • 2t = n^2 + n
  • --> n^2 + n - 2t = 0
  • --> n = (-1 +/- sqrt(1 - 4 x 1 x (-2t))) / 2
  • --> n = (-1 +/- sqrt(1 + 8t)) / 2

A negative n is not what we are looking for, so we can drop the negative root.  Also, for t in thousands of milliseconds, 1 + 8t can be approximated by 8t, and we can neglect the -1.  So we have :

  • n = sqrt(8t)/2
  • --> n = sqrt(2t)

So 30 seconds after emptying the free list, the stall can be up to sqrt(2 x 30000) ≈ 245 ms.  And to get a stall of more than 700 ms, the minimum delay after emptying the free list is 700 x (700+1) / 2 = 245350 ms (a little more than 4 minutes).

All these computations are neglecting the time needed for executing an iteration of the LRU Manager Thread, but they are giving us a good idea of what should be expected.
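
As a sanity check of this approximation, below is a small awk sketch (purely illustrative) comparing, for a few delays, the exact smallest n such that n(n+1)/2 >= t with the sqrt(2t) estimate :

# Exact smallest n with n(n+1)/2 >= t versus the sqrt(2t) estimate,
#   for a few delays t (in ms).
for t in 5000 30000 60000 245350; do
  awk -v t="$t" 'BEGIN{
    n = 0; s = 0
    while (s < t) { n++; s += n }   # cumulative sleep 1+2+...+n
    printf "t = %6d ms : exact n = %3d, sqrt(2t) = %5.1f\n", t, n, sqrt(2*t)
  }'
done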


Annex #5 : Regular backoff Stalls in a Lightly Loaded Sandbox

# Regular stall common stuff.
nb_rows=$((4*1024*1024*1024 / (16*1024) * 16))
socket=$(awk '$1=="socket"{print $3; exit}' my.sandbox.cnf)
sba="oltp_point_select --rand-type=uniform --mysql-socket=$socket"
sba="$sba --mysql-user=msandbox --mysql-password=msandbox --mysql-db=sbtest"
sba="$sba --tables=1 --table_size=$nb_rows"
sq_query="SHOW GLOBAL STATUS LIKE 'Slow_queries'"
function init_sq_count() { local a; read a sq1 <<< $(./use -Ne "$sq_query"); }
function show_sq_count() { local a; read a sq2 <<< $(./use -Ne "$sq_query")
  echo "Slow_queries: $(($sq2 - $sq1)); }
threads=4; rate=$(($threads * 100)); limit=$((16 * 1024 * 6/5))

# Sysbench with backoff and 4 threads doing 100 qps each.
# No stall expected as not consuming enough free pages.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
sysbench $sba --threads=$threads --time=2 --rate=$rate run |
  grep -e queries: -e avg: -e max
    queries:                             777    (387.21 per sec.)
         avg:                                    0.07
         max:                                    0.21

# Similar to above, emptying the free list while sysbench is running,
#   with filtered output and slow query count.
# Consuming all free pages needs reading more than 16*1024 rows because
#   some pages might have been prefetched accidentally by sysbench.
# Each thread should suffer a stall, so 4 slow queries are expected.
# If not getting high max, the adaptive algorithm might have been
#   previously activated : retry after waiting 5 minutes / sleep 300.
# Also, if not getting high max, the LRU Manager Thread might have
#   woken up quickly after the free list was emptied : retry.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
( id=$(($RANDOM * ($nb_rows - $limit) / (32*1024)))
  ./use sbtest -e "
    SET long_query_time = 9; DO sleep(1);
    SELECT id FROM sbtest1 WHERE id > $id LIMIT $limit, 1" > /dev/null &
  init_sq_count
  sysbench $sba --threads=$threads --time=2 --rate=$rate run |
    grep -e queries: -e avg: -e max
  show_sq_count; )
    queries:                             728    (362.57 per sec.)
         avg:                                   67.00
         max:                                  554.83
Slow_queries: 4

# Similar to above, for 10 minutes, emptying the free list every 2 minutes,
#   and showing all slow query times.
# This should result in 4 significant stalls at 2-minute intervals,
#   exactly what is observed below.
./use -e "SET GLOBAL innodb_empty_free_list_algorithm = backoff"
n=5; l=120; time=$(($n * $l)); \
( tail -f -n 0 data/slow.log |
    awk '$2=="Time:"{t=substr($3,12)} $2=="Query_time:"{print t ": " $3}' &
  trap "kill $!" EXIT
  for i in $(seq 1 $n); do
    sleep $(($l / 2))
    id=$(($RANDOM * ($nb_rows - $limit) / (32*1024)))
    ./use sbtest -e "
      SET long_query_time = 9;
      SELECT id FROM sbtest1 WHERE id > $id LIMIT $limit, 1" > /dev/null
    sleep $(($l / 2))
  done &
  init_sq_count
  sysbench $sba --threads=$threads --time=$time --rate=$rate run | grep -e queries:
  show_sq_count; )
01:35:41.656535Z: 0.401592
01:35:41.657470Z: 0.396737
01:35:41.658073Z: 0.405711
01:35:41.661335Z: 0.396724
01:37:42.014167Z: 0.326436
01:37:42.016142Z: 0.336401
01:37:42.021072Z: 0.336380
01:37:42.022688Z: 0.336398
01:39:42.373672Z: 0.336466
01:39:42.373876Z: 0.336455
01:39:42.374456Z: 0.336390
01:39:42.375309Z: 0.336403
01:41:42.730329Z: 0.326384
01:41:42.731932Z: 0.336372
01:41:42.732958Z: 0.336367
01:41:42.734310Z: 0.336306
01:43:43.087575Z: 0.336429
01:43:43.088405Z: 0.336386
01:43:43.090872Z: 0.336378
01:43:43.091750Z: 0.336343
    queries:                             239928 (399.88 per sec.)
Slow_queries: 20

2 comments:

  1. This is gold. Not exaggerating.
    One question : can this issue possibly be diagnosed / quantified using Performance Schema wait events? The old school counters method is ... well, who can tell what value is too much?

    1. Thanks.

      I have not found anything like a "wait" counter / event that would allow quantifying the time spent in waiting. I wished this existed, but I do not think it does. This would be a good feature request for Percona, feel free to do it if you have time.

      There is an "os_thread_sleep" call in 5.7 in [1] and a "std::this_thread::sleep_for" in 8.0 in [2], I do not know if these automagically create matching Performance Schema wait events. If you find-out, let us know.

      [1]: https://github.com/percona/percona-server/blob/release-5.7.39-42/storage/innobase/buf/buf0lru.cc#L1547
      [2]: https://github.com/percona/percona-server/blob/release-8.0.29-21/storage/innobase/buf/buf0lru.cc#L1488

      Another way to diagnose this might be to use eBPF [3], but I do not know this tool much.

      [3]: https://ebpf.io/

      Good luck, and please share your finding.
