Tuesday, November 19, 2024

Understanding InnoDB Tablespace Duplicate Check (MySQL Startup with Many Tables)

This post is a little long; please bear with me, and know that after the intro, you can skip to the essentials.

In the last weeks / months, I have been working on understanding and improving MySQL startup with many tables.  I already wrote three posts on the subject; they are listed below.  In this post, I focus on describing the startup phase InnoDB Tablespace Duplicate Check (Duplicate Check for short).  This allows a better understanding of the optimization I already submitted, and leads to another optimization that could be leveraged.

This post is written in the context of my work on MySQL at Aiven : check our Developer Center, blog or careers page to know more about us.

The three posts I already wrote on the subject of long MySQL startup with many tables are :

  1. Faster MySQL Startup with Many Tables;
  2. Impressed by Disabling InnoDB Redo Logging for Creating Tables as Fast as Possible;
  3. Long and Silent / Stressful MySQL Startup with Many Tables.

In the first post above, I presented the light optimization, which makes MySQL start in 1:09 (one minute and nine seconds) instead of 2:39 (with 1 million tables, on an m6id.xlarge AWS instance).  In that post, I wrote that it does not translate directly to EBS volumes, which I revisit in the next post with information from the current post.

In the third post above, I presented the four main / long phases of MySQL startup with 1 million tables, one of which is InnoDB Tablespace Duplicate Check (Duplicate Check for short).  I need to describe this phase in greater detail for the next two posts.  It is a little long, so if you are in a hurry and only want the essentials needed to understand the next posts, jump to the section Duplicate Check System Analysis.

InnoDB Tablespace Duplicate Check (Duplicate Check)

The name Duplicate Check is not obvious from the logging; I chose it from the code (detailed further down below).  The logging would suggest "Tablespace Scanning" or "Space ID Checking", and we see what these mean below.  In fact, none of these three names is completely satisfactory, because this phase does more than what can be described with a simple name (naming is hard).  Below is the log extract for Duplicate Check; the full logs are in the previous post.

2024-08-26T02:35:59.954165Z mysqld_safe Logging to '/home/jgagne/sandboxes/msb_mysql_8_0_39/data/msandbox.err'.
[...]
2024-08-26T02:36:27.504569Z 1 [Note] [MY-012207] [InnoDB] Using 5 threads to scan 1000002 tablespace files
[...]
2024-08-26T02:42:04.758139Z 1 [Note] [MY-012208] [InnoDB] Completed space ID check of 1000004 files.
[...]
2024-08-26T02:43:09.702744Z 0 [System] [MY-010931] [Server] /home/jgagne/opt/mysql/mysql_8.0.39/bin/mysqld: ready for connections. Version: '8.0.39'  socket: '/tmp/mysql_sandbox8039.sock'  port: 8039  MySQL Community Server - GPL.

Code-wise, Duplicate Check happens in the function Tablespace_dirs::scan (no link here, I link to the exact lines below).  This function also includes Directory Scanning (another startup phase presented in the previous post) and Partition Filename Alteration (I will not give more information about this besides pointing to Bug #116713 : Function rename_partition_files should not be in MySQL 9, and more).  The phase Duplicate Check starts at the declaration Space_id_set unique and lasts until the end of the function.

From the logging above, it can be inferred that Duplicate Check is a multithreaded process.  The number of threads is not configurable (it should be, more about this below).  The number of threads is computed in the function fil_get_scan_threads (call, definition) and can be summarized by :
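The following Python sketch is my summary of that logic (the function and parameter names are mine, not InnoDB's; the 8000-files-per-thread unit and the hard maximum of 16 threads come from the behavior described in this post, and the two-threads-per-core cap is an assumption consistent with the "2 cores (so 4 threads)" figure used further down) :

```python
def duplicate_check_threads(n_files, n_cores, scan_unit=8000,
                            threads_per_core=2, max_threads=16):
    # One extra thread per scan_unit (8000) tablespace files (integer division).
    n_threads = n_files // scan_unit
    if n_threads == 0:
        return 0  # the main thread will do all the work itself
    # Capped by the host's core count (assumed 2 threads per core)
    # and by a hard maximum of 16 threads.
    return min(n_threads, threads_per_core * n_cores, max_threads)
```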

IMHO, it is weird that the number of threads is determined by the number of cores.  Duplicate Check is an IO-bound process, reading the header of each tablespace to find the space ID (if you do not remember the details, you can find them in the first post).  So the number of threads should not be determined by the number of cores, but by IO characteristics (iops and bandwidth).  I come back to this further down below.

By the rules of integer division and from the code, 1 to 7999 tablespaces leads to 0 threads, 8000 to 15,999 to 1, 16,000 to 23,999 to 2, etc. (n/8000).  This number is passed to the generic function par_for (call, macro, definition) for thread spawning and waiting.  Interestingly, 0 and 1 threads both lead to all the work being done in a single thread : the main thread for 0, and a sub-thread for 1.  I guess we should jump from 0 to 2, skipping 1, and the details are in Bug #116714 : With 8000 to 15.999 tablespaces, MySQL Startup Spawns one Useless Sub-Thread.
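To illustrate the 0-vs-1 thread oddity, here is a Python sketch of the par_for pattern as I understand it (this is not the InnoDB code, and the slicing details are assumed) : each sub-thread gets an equal slice, and the main thread processes the remainder, so with one sub-thread the main thread is left with nothing to do but wait.

```python
import threading

def par_for(items, n_threads, fn):
    """Sketch of par_for: n_threads sub-threads each process a slice of
    len(items) // n_threads items, and the main thread processes the
    remainder before waiting for the sub-threads."""
    if not items:
        return
    slice_size = len(items) // n_threads if n_threads > 0 else 0
    threads = []
    for i in range(n_threads):
        chunk = items[i * slice_size:(i + 1) * slice_size]
        t = threading.Thread(target=fn, args=(chunk,))
        t.start()
        threads.append(t)
    # With n_threads == 0, the main thread gets everything; with
    # n_threads == 1, it gets an empty remainder and only waits,
    # which is the oddity of Bug #116714.
    fn(items[n_threads * slice_size:])
    for t in threads:
        t.join()
```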

To specify the work to execute, a "callable instance" operating on an "iterator range" is passed to the generic function par_for.  In the case of Duplicate Check, the callable instance is Tablespace_dirs::duplicate_check (and yes, it is from this function name that I chose the phase name).  After getting the space ID of a tablespace (this is the part needing an IO), this ID is added to a unique set, and if it is a duplicate, it is also added to another set (duplicates).  These two sets are declared before the threads are spawned (link), and once the threads are finished, the content of the duplicates set is used to report success or failure.  But in addition to filling these two sets, which are not kept once the phase is completed, another longer-lived data structure is filled with a mapping of space ID to full filename.  To know more about how this data structure is used, you can check the section of the manual about Moving Tablespace Files While the Server is Offline.

What I call Duplicate Check has another name in the manual : Tablespace Discovery.  In the manual section about InnoDB Recovery, there is a subsection about Tablespace Discovery.  There are interesting historical notes about Tablespace Discovery in Bug #96340 (from 2019), quoted below.  Note that since this bug, Tablespace Discovery / Duplicate Check has been optimized to only open the first page of every .ibd file, and threads are spawned at 8k tablespaces instead of at 50k.

  • The 5.7 startup can be faster because 8.x reverts the 5.7 changes where we wrote the open file descriptors in the redo log on checkpoint. The scheme introduced in 5.7 was very buggy and it has a runtime cost (and lots of edge case bugs during recovery) and if you have lots of tables open and written to since the last checkpoint well good luck.
  • We decided to revert to 5.6 behavior to pay the cost on startup and will fix it properly later.
  • The cost comes from opening the first N pages of every .ibd file and checking the table space ID to build the tablespace ID to filename map.
  • We are looking at several options (medium to long run) that will eliminate this cost altogether. The scanning for .ibd files in the directories is not the issue and is fast for both HDD and SSD etc.

These quotes confirm the need for Duplicate Check in crash recovery, but it is still unclear if it is absolutely needed in normal startup (as mentioned above, it is sometimes needed for Moving Tablespace Files).  Also, these quotes mention "will fix it properly later".  And because this looks somewhat forgotten after 5 years, I opened Bug #116715 : Optimize Normal Startup with Many Tables (there are many details in there about why startup is long).

Duplicate Check System Analysis

(In case you are jumping directly to this section, the TL;DR of the previous section is that Duplicate Check is an IO-bound phase, using up to 16 threads to dispatch IOs in parallel, and that the number of threads is determined by the number of cores available, which is weird for an IO-bound process.)

Let's now do a system / resource analysis of Duplicate Check.  Assuming the light optimization is used, t threads (example 4) doing synchronous IOs with a latency of l (example 0.5 ms, i.e. 0.0005 seconds) are able to use t/l iops (8000 for our example), consuming a disk bandwidth of 4*t/l/1024 MiB/s (about 31 for our example, and this assumes IO sizes of 4 KiB).

t/l iops are used by t threads with an IO latency of l seconds,
consuming 4*t/l/1024 MiB/s disk bandwidth

Slightly modifying this formula, for a disk capacity of c (example 64k iops) and an IO latency of l (example 0.5 ms), c*l threads are needed (32 for our example) to saturate the IO capacity, consuming a disk bandwidth of 4*c/1024 MiB/s (250 for our example, and this again assumes IO sizes of 4 KiB).

c*l threads allow using c iops of latency l seconds,
consuming 4*c/1024 MiB/s disk bandwidth
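These formulas are simple enough to check with a few lines of Python (latency expressed in milliseconds here, and still assuming synchronous 4 KiB IOs) :

```python
KIB_PER_IO = 4  # assumed IO size of Duplicate Check reads

def iops(threads, latency_ms):
    # Synchronous IOs: each thread completes 1000 / latency_ms IOs per second.
    return threads * 1000 / latency_ms

def bandwidth_mib_s(total_iops):
    return KIB_PER_IO * total_iops / 1024

def threads_to_saturate(capacity_iops, latency_ms):
    return capacity_iops * latency_ms / 1000

print(iops(4, 0.5))                     # 8000.0 iops for 4 threads at 0.5 ms
print(bandwidth_mib_s(8000))            # 31.25 MiB/s
print(threads_to_saturate(64000, 0.5))  # 32.0 threads to use 64k iops
print(bandwidth_mib_s(64000))           # 250.0 MiB/s
```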

So with 64k iops at a latency of 0.5 ms, having more than the current maximum of 16 threads might be useful (in theory, we would benefit from having 32 threads).  Also, with a latency of 0.5 ms and 2 cores (so 4 threads), we would only be able to use 8000 iops, while AWS gp3 EBS volumes can scale up to 16k iops.  For these two reasons, I opened Bug #116716 : Allow Setting Nb Thread for Tablespace Duplicate Check.  So far, this result is only theoretical; we will revisit it in practice in a follow-up post.

Also, thinking about iops and bandwidth allowed me to revisit the light optimization on EBS volumes.  In the blog post presenting the optimization, I wrote that my tests on EBS did not show improvements, and I think I might not have looked at the right thing.  This will be the subject of the next post.
