Understanding Bulk Index Creation in InnoDB (and innodb_sort_buffer_size)

In a previous post, I presented an Unexpected Memory Consumption for Bulk Index Creation in InnoDB. This was triggered by an increased innodb_sort_buffer_size and as stated in another post: "the sorting algorithm does not scale well with large sort buffers". In this post, I will present why it does not scale well and I will suggest solutions.

This post also answers feedback request for the changes in defaults for the next MySQL version. A new value for innodb_sort_buffer_size is not suggested yet but before changing it, I think that:

innodb_sort_buffer_size should be a session variable with its default value based on a corresponding dynamic global variable
the bulk index creation algorithm in InnoDB must be changed
innodb_sort_buffer_size of much larger size than 64M should be allowed (maybe larger than 4GB)

A last note before diving into this subject: I do not need this to be fixed in MySQL 8.0. There are things out there that are more important. Said another way, I think this is a low priority idem on the backlog (but others might not agree). In the meantime, innodb_sort_buffer_size default should not be increased.

Bulk Index Creation in InnoDB:

Creating an index can be implemented with the following algorithm:

scan the table, extracting the fields you need, and writing them in a temporary file
sort the file according to the index
insert the sorted file into InnoDB

Of course, online index creation and other optimizations might make things more complex, but let's stick with that for now.

If you have N indexes to create, repeating the above many times will be inefficient as the table will be read N times. So we can modify above in the following way:

scan the table, extracting the fields you need for each index, and writing them in temporary files (N files in total)
sort each file according to the index (N sorts)
insert the sorted files into InnoDB

The InnoDB implementation optimizes the sorting phase by extra work in the scanning phase. As the scanning proceed, the index records are accumulated in a "sort area" of innodb_sort_buffer_size [1]. When this area is full, it is sorted and then written to disk before continuing the scan [2]. Once the table has been fully scanned, N files containing "sorted blocks" have been generated. Those files are then sorted using the "sorted block" property to speed-up the sorting [3].

I am sure that the "sort during scan" has valid motivations because this has been written by smart people. However, I do not know what those motivations are and I doubt they are still valid on current computer architecture (this code was written 10 years ago). The problem with "sort during scan" is that innodb_sort_buffer_size is allocated as many times as we are creating indexes. So for creating 32 indexes with an innodb_sort_buffer_size of 64M, 2048M of sorting area is allocated [4]. This is where the Unexpected Memory Consumption for Bulk Index Creation in InnoDB is coming from.

In the file sorting phase (after scanning), innodb_sort_buffer_size is allocated 3 times: 2 blocks are read and they are merged in a 3rd block using a merge sort algorithm that I have a hard time understanding. It looks like this sorting algorithm is optimized to reduce disk space consumption by sorting the file "in place" at the cost of using 3 buffers. I am not sure I like this trade-off and I would like to read comments on this.

Optimizing Index Creation with the Current Implementation:

Ideally, doubling the size of innodb_sort_buffer_size would remove the need for one pass of the merge sort:

if we allocate a sort buffer of 1M for sorting a 1GB index, we need 10 merge operations
using a buffer of 2M, sorting the same index can be done in 9 passes
with a buffer of 64M, only 4 passes are needed

If I am creating a single index, using 64M as innodb_sort_buffer_size should be much quicker than with a 1M buffer. But if I am creating 32 indexes, this buffer size might make me run out of RAM. This is why I think innodb_sort_buffer_size should be dynamic and it should have a session context.

Also, in an ideal world, I should be able to allocate an innodb_sort_buffer_size of a very large value. If I am creating an index on a Terabyte scale table, I should be able to use a buffer in the Gigabyte range. With the current implementation allocating 3 times the buffer for sorting and as many times this buffer as we are creating indexes, it does not make much sense to set a Gigabyte innodb_sort_buffer_size. But with a different algorithm, it could become interesting.

Changing the Algorithm:

I would suggest the following algorithm for index creation in InnoDB:

scan the table, extracting the fields you need, and writing them in an unsorted temporary file
from the unsorted file, generate 2 files with blocks of sorted records (this needs a sort buffer)
apply a standard iterative merge sort on the 2 files above
insert the sorted file into InnoDB

This algorithm does not need a sort buffer in step #1, it only needs a write buffer which can be of modest size.

Step #2 needs a sort buffer: the larger this buffer is, the fewer merge passes will be needed in step #3. However, this algorithm needs 2 times the disk space: from an unsorted file of size N, it generates 2 "block sorted" files of size N/2. This might not please some people, but I am personally fine with this (see more about that below).

Step #3 is a standard iterative merge sort [5]. It also needs extra disk space, but it needs very limited RAM (only some IO buffers).

I claim that the extra disk space is not a problem. After all, in step #4, when inserting the resulting file in InnoDB, this extra disk space is needed anyway !

With this algorithm, the only major use of RAM is in the sort buffer of phase #2, and it is only allocated once, not 3 times are the current sorting.

It looks to me that this would be a better implementation for InnoDB bulk index creation. What do you think ?

[1]: the allocation of the sort area can be found in MySQL 5.7.13 code here.

[2]: the "sort during scan" can be found in MySQL 5.7.13 code here.

[3]: the final sorting phase can be found in MySQL 5.7.13 code here.

[4]: the N times innodb_sort_buffer_size memory allocated for the sorting area during the table scan is a rough approximation. InnoDB does memory allocation using a heap. An InnoDB heap is a linked list of blocks where memory consumed from the last pre-allocated block (avoiding calling malloc). When the block is full, a new block is allocated. This wastes RAM at the end of blocks and needs more RAM for book-keeping structures. Also, to avoid having to copy large records in memory during sort, InnoDB uses pointer structures. All that grows the amount of memory needed for "sort during scan".

[5]: the merge sort starts by merging the first block of each of the 2 files and writing the result in file A. Then it does the same for the second block of each file writing to file B. This continues, merging corresponding blocks and appending them to file A and B respectively. After one pass, we have 2 files with sorted blocks of twice their initial size. This continues until the generation of a single file that will be fully sorted.

J-F Gagné's MySQL Blog

Saturday, July 16, 2016