This post is written in the context of my work on MySQL at Aiven : check our Developer Center, blog or careers page to know more about us.
After reading the code about a part of MySQL Startup that I call InnoDB Tablespace Duplicate Check (Duplicate Check for short), my understanding was that the first page, usually 16 kilobytes, of each table is read on startup. The relevant part of the code is the function Tablespace_dirs::duplicate_check which is called for each table. This function calls Fil_system::get_tablespace_id which is reading the first page of the table (MAX_PAGES_TO_READ is defined to 1). But when validating my understanding by actually testing with monitoring the disk, I observed that for 1 million tables, 65.7 million sectors (roughly 32 GB) is read, which is twice what I expected.
I was quick to make a hypothesis here: the Linux kernel is reading more by prefetching the next 16 kilobytes. Validating this was easy, I just needed adding a second fadvise call with POSIX_FADV_RANDOM before the fread (there is already one with POSIX_FADV_DONTNEED after), and this indeed divided the number of sectors read by two, reducing MySQL startup time with 1 million tables from 2:39 to 1:44. And I was able to do even better (1:44 is not the 1:07 I mentioned in the introduction).
Once I was in IO optimization mode, it became obvious - it is almost embarrassing that I did not realize this before - that the full first page of each table was not needed, we only need 4 bytes at offset FIL_PAGE_SPACE_ID (which is 34). So I naively thought I could reduce the number of sectors read by another 32x. After testing, I saw that it only achieved an additional reduction of 4x because the XFS block size is 4 kilobytes. Still, combined, these two optimizations reduce the number of sectors read from 65.7 to 9.7 million, which reduces startup time from 2:47 to 1:07 (with 1 million tables and local SSD).
All this is described, with my testing patch, in Bug #115988: Too Much Disk Read on Startup, penalizing deployments with many tables (1M+). You can also check and comment / review the matching Contrib Fake PR in my fork on GitHub.
The tests I did with EBS volumes of type gp3 did not lead to a significant improvement in startup time. I am guessing that reading 4 or 32 kilobytes from an EBS volume has a very similar cost (latency and IOPS), which explains that I was not able to see significant speedup. If I looked closer, I might see less CPU consumption, but I did not take time to do these tests (if someone wants to document this, I can help). But there is more to say about MySQL Startup with EBS volumes, and this will be in another post where I need to dive into the full "system" behavior of InnoDB Tablespace Duplicate Check.
In addition to describing Duplicate Check in more detail, I have bug reports to open and things to write about the following subject of the MySQL / InnoDB Startup, and not only about Duplicate Check:
- Logging improvements;
- Resource consumption and regressions;
- Documentation.
And one last thing: the second optimization (reading only 4 bytes instead of a full page) will probably not work on filesystems that have a 16-kilobyte block size. This means that when avoiding the overhead of the InnoDB Double Write Buffer, one will have to pay an overhead on startup. As usual, every optimization has a tradeoff ¯\_(ツ)_/¯.
That is it for today, stay tuned for more content on this subject.
Nice improvement !
ReplyDeleteGood job! BTW - speaking about general optimisation of MySQL startup and shutdown times, we have observed ~2x improvement when changing innodb_numa_interleave from ON to OFF (at least on 8.0.28). I guess it's unrelated with number of tables, but rather BP size. So if by chance you had it ON, you may also measure it with OFF (of course if this setting makes sense in your environment). As we have small number of tables and in your case majority of time is spent due to handling such big number of tables, I guess the results won't be as impressive as ours.
ReplyDeleteInteresting observation about numa interleave. It would be nice to document it, open a bug, and blog about this.
DeleteIn my case, for these tests, the buffer pool is 128 MB, so would probably not help.