Saturday, August 29, 2015

Unexpected Memory Consumption for Bulk Index Creation in InnoDB (MySQL)

In my last Hackathon, I worked on MyISAM vs InnoDB for data loading (LOAD DATA INFILE) and bulk index creation.  My motivation was the following: knowing that some people still use MyISAM for this particular use case, I wanted to verify/understand whether, and why, InnoDB is slower than MyISAM.  I do not yet have complete results on this specific subject, but I found some interesting things that are worth sharing.

The first result is that creating many indexes on an empty InnoDB table with a large value of innodb_sort_buffer_size takes more time than expected (more than 6 seconds for 20 indexes).  I opened 2 bugs for MySQL 5.6.26 and MariaDB 10.0.21 for this particular issue.  Of course, I do not often create indexes on empty tables, so this bug is very minor, but it exposed an unexpected behavior of the InnoDB Sort Buffer.  I wanted to know more...

There is not a lot of online information about the inner workings of the InnoDB Sort Buffer.  Besides the documentation (MySQL and MariaDB), I found a single blog post on this subject.  All of those describe a memory consumption of 3 times the size of the InnoDB Sort Buffer plus some pointers, but my experiments show as much as a 10 GB expansion of the mysqld RAM Resident Set Size (RSS) with innodb_sort_buffer_size=64M.  Obviously, something strange is happening.  I again opened 2 bugs for MySQL 5.6.26 and MariaDB 10.0.21 on this more concerning issue.

My tests start with the following table definition:
CREATE TABLE `test_wo_keys` (
  `f01` int AUTO_INCREMENT,
  `f02` bigint, `f03` bigint, `f04` enum('a','b'),
  `f05` date, `f06` int, `f07` int, `f08` double, `f09` int,
  `f10` bigint, `f11` double, `f12` enum('a','b','c','d','e'),
  `f13` int, `f14` int, `f15` varchar(255), `f16` int,
  `f17` int, `f18` int, `f19` double, `f20` double,
  `f21` double, `f22` double, `f23` double, `f24` tinyint,
  `f25` double, `f26` double, `f27` double, `f28` double,
  `f29` int unsigned, `f30` int unsigned, `f31` bigint,
  `f32` int unsigned, `f33` bigint, `f34` int unsigned,
  `f35` int unsigned,
  PRIMARY KEY `f01` (`f01`)
) ENGINE=InnoDB;
This table contains 5 million rows.  I then create indexes on that table using the following commands:
ALTER TABLE test_wo_keys ADD KEY f02 (f02);

ALTER TABLE test_wo_keys
  ADD KEY f02 (f02), ADD KEY f03 (f03),
  ADD KEY f04 (f04), ADD KEY f05 (f05),
  ADD KEY f06 (f06), ADD KEY f07 (f07),
  ADD KEY f08 (f08), ADD KEY f09 (f09);

ALTER TABLE test_wo_keys
  ADD KEY f02 (f02), ADD KEY f03 (f03),
  ADD KEY f04 (f04), ADD KEY f05 (f05),
  ADD KEY f06 (f06), ADD KEY f07 (f07),
  ADD KEY f08 (f08), ADD KEY f09 (f09), 
  ADD KEY f10 (f10), ADD KEY f11 (f11),
  ADD KEY f12 (f12), ADD KEY f13 (f13),
  ADD KEY f14 (f14), ADD KEY f16 (f16),
  ADD KEY f17 (f17), ADD KEY f18 (f18),
  ADD KEY f19 (f19), ADD KEY f20 (f20),
  ADD KEY f21 (f21), ADD KEY f22 (f22),
  ADD KEY f23 (f23), ADD KEY f24 (f24),
  ADD KEY f25 (f25), ADD KEY f26 (f26),
  ADD KEY f27 (f27), ADD KEY f28 (f28),
  ADD KEY f29 (f29), ADD KEY f30 (f30),
  ADD KEY f31 (f31), ADD KEY f32 (f32),
  ADD KEY f33 (f33), ADD KEY f34 (f34);
The graphs below show that creating 1, 8 and 32 indexes consumes respectively ~400 MB, ~2.7 GB and ~11 GB of memory with innodb_sort_buffer_size=64M.  This is far from what is expected.

Note: the InnoDB Buffer Pool is set to 4 GB on this server and I made sure that it was fully loaded before starting my tests.
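For reference, the relevant settings for these tests can be expressed as a my.cnf fragment (innodb_sort_buffer_size is not a dynamic variable, so it must be set before server startup; other settings are left at their defaults):

```ini
[mysqld]
# Sort buffer used by bulk index creation (the subject of this post).
innodb_sort_buffer_size = 64M
# Buffer pool, fully warmed before the tests.
innodb_buffer_pool_size = 4G
```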

Note 2: those graphs were taken on MariaDB, but MySQL shows a similar behavior (graph below for 32 indexes).

The claim in the documentation (3x the InnoDB Sort Buffer size) is obviously wrong.  If you look closely at the last 3 graphs, you will see a small decrease in RSS on the right side of each graph: this is the 3x InnoDB Sort Buffer size.  The high RAM consumption on the left of the graphs happens while the table is scanned to generate temporary files.  Once the files are generated, RAM consumption goes back to acceptable levels and stabilizes while the files are sorted.
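As a rough sanity check on those numbers (my own back-of-the-envelope guess, not a documented formula): if each of the 32 index builds were to keep 3 buffers of innodb_sort_buffer_size during the scan phase, that alone would account for 6 GB, which is the right order of magnitude for the ~11 GB observed:

```sql
-- Hypothetical per-index accounting:
-- 32 indexes * 3 buffers * 64 MB, expressed in GB.
SELECT 32 * 3 * 64 / 1024 AS sort_buffers_gb;  -- 6 GB
```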

If you want more details, you can either dive into the code in storage/innobase/row/ of the MySQL/MariaDB sources, or wait for my next post on the subject.

Before closing, let me take this opportunity to promote my next talks:
Looking forward to seeing you in Amsterdam and/or in San Francisco.


  1. Hi, Jean-François,

    I am the primary developer of this InnoDB code in MySQL 5.6. It improves on the ‘fast index creation’ feature that I developed for the InnoDB Plugin for MySQL 5.1.
    Some of the issues that you mention were already fixed in MySQL 5.7 quite some time ago, as part of an internally filed bug
    that was initially fixed about a year ago by my colleague Thirunarayanan Balathandayuth.

    Yes, some of this could have been done already in 5.6, but I was under a time pressure to deliver correctly working online ALTER TABLE. To me, correctness was (and still is) more important than performance.

    Unfortunately, the worst-case memory consumption cannot be helped easily. If we are adding multiple secondary indexes, we will need one set of buffers for each of them.

    Roughly speaking, there are six forms of ALTER TABLE in MySQL 5.6 and later. They consist of the 3 main variants:

    ALGORITHM=COPY: this is the only option before the InnoDB Plugin, and the only option for ENGINE=MyISAM tables. This can be forced by SET old_alter_table=1;
    ALGORITHM=INPLACE: this is the default in MySQL 5.6. Some operations are refused with ALGORITHM=INPLACE, and some operations (such as DROP PRIMARY KEY without ADD PRIMARY KEY) require LOCK=SHARED. Note: the word ‘inplace’ is somewhat misleading, because the table may need to be rebuilt.
    ALGORITHM=INPLACE with LOCK=SHARED or LOCK=EXCLUSIVE: an ‘inplace’ ALTER that does not allow concurrent data modification. This will reduce memory overhead, because no log for concurrent DML operations will be allocated or applied.
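    For example, each combination can be requested explicitly; if the server cannot honor the requested combination, the statement fails with an error instead of silently falling back to a slower variant (table and index names here are taken from the blog post above):

    ```sql
    -- Force the pre-InnoDB-Plugin behavior: copy the whole table.
    ALTER TABLE test_wo_keys ADD KEY f02 (f02), ALGORITHM=COPY;

    -- The MySQL 5.6 default: in-place, with concurrent DML allowed
    -- (an online log is allocated for the concurrent changes).
    ALTER TABLE test_wo_keys ADD KEY f03 (f03), ALGORITHM=INPLACE, LOCK=NONE;

    -- In-place, but blocking concurrent DML: no online log is needed.
    ALTER TABLE test_wo_keys ADD KEY f04 (f04), ALGORITHM=INPLACE, LOCK=SHARED;
    ```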

    ALGORITHM=COPY will include both undo and redo logging overhead. Internally, InnoDB treats the operation as a combination of CREATE TABLE, INSERT, RENAME TABLE, DROP TABLE, RENAME TABLE.

    ALGORITHM=INPLACE since 5.6 will skip the undo logging overhead, but everything will be redo-logged. In 5.7, my colleague Shaohua Wang implemented WL#7277, which avoids the redo logging for the initial index build phase, but not for the DML log apply. WL#7277 also speeds up the loading of the merge-sorted index data into the B-trees; before that, the sorted records were inserted one by one.

    Now, there are three subvariants of ALGORITHM=INPLACE:
    (a) Schema-only modifications, such as renaming a column or index, or extending a VARCHAR column.
    (b) WL#5526: Operations that require data changes, but no rebuild of the whole table: ADD INDEX, DROP INDEX
    (c) WL#6255: Operations that require rebuilding the whole table, e.g., ADD/DROP COLUMN, ADD PRIMARY KEY, change column order.

    The schema-only changes (a) are basically instantaneous (requiring an exclusive table lock during the ‘commit’ phase only).

    The DROP INDEX case of (b) is basically instantaneous too: the index B-trees will be dropped in the ‘commit’ phase.

    The ADD INDEX case of (b) will allocate a sort buffer for every index that is being built. If the operation is performed with LOCK=NONE (the default), we will also allocate index->online_log for each index that is being built. The above mentioned bug fix in MySQL 5.7 will avoid some of this overhead.

    The table-rebuilding ALTER operations will allocate a sort buffer for each index that will be built, plus a single index->online_log for the table rows. With the bug fix in 5.7, we can skip sorting of the clustered index if the order of the PRIMARY KEY does not change. There have been at least 2 bugs in this skip-sort functionality. The second fix was after the 5.7.8 release.

    Because some regressions take considerable time to surface in our ongoing tests, we are reluctant to backport this kind of huge fix to GA versions, even in cases where the patches themselves could be applied with minor technical effort.

    With best regards,
    Marko Mäkelä
    Senior Principal Software Developer
    InnoDB Group, MySQL Server, Oracle Corporation

    1. Many thanks, Marko, for introducing yourself here and spending the time to comment on my post.

      Thanks also for pointing out that things have been improved in MySQL 5.7. I wanted to run tests with the latest RC version, but I was short on time. Your comment made me curious; I will probably do tests soon and update the post, especially since "diff mysql-5.{6.26,7.8-rc}/storage/innobase/row/" is massive.

      About the worst-case memory consumption, I have some thoughts about this and I put them in a private comment on Bug #78270.

      Your comment raises some questions and thoughts; I will post a separate reply for each. Let me just say here that I agree with you that correctness is more important than speed: things must first work, and then they can be improved. Something that is fast but does not work is not useful.

      Also, let me point out that I flagged those 2 bugs as performance and minor. I do not think it is worth risking regressions to have them fixed in 5.6. I however think that awareness should be raised about the side effects of setting a large innodb_sort_buffer_size.

      Thanks again for your comment and looking forward to seeing you at PLAMS or OOW.

    2. About 'inplace' being misleading: my understanding is that this 'inplace' is in fact a 'nocopy'. Some ALTERs can be done without duplicating the storage of the table (that would be a copy) but still need to scan the table and modify it (nocopy). You also write about schema-only modifications; this looks like a 3rd algorithm to me (noscan).
      Renaming a column, dropping an index, adding a value to an ENUM, or changing a VARCHAR(10) to a VARCHAR(20) should succeed with 'noscan', but changing a VARCHAR(10) to a VARCHAR(300) would fail (if I am not mistaken).

      Having an explicit way to specify that I expect a given ALTER to be a noscan would be useful: it would allow some mistakes to be avoided. I am thinking of changing a VARCHAR(60) to a VARCHAR(100) on a 100 GB table and realizing after some minutes "oops, this was UTF-8" (my understanding of VARCHAR and UTF-8 is that the length field counts bytes, not characters, but I might be mistaken as I did not test it).
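      A sketch of the mistake I have in mind (hypothetical table, and I did not test this): with utf8 at up to 3 bytes per character, a VARCHAR(100) can need 300 bytes, which crosses the 255-byte boundary where the stored length prefix grows from 1 to 2 bytes, so the change could no longer be done without touching the rows:

      ```sql
      CREATE TABLE t (c VARCHAR(60) CHARACTER SET utf8) ENGINE=InnoDB;

      -- 60 chars * 3 bytes = 180 bytes: still a 1-byte length prefix.
      -- 100 chars * 3 bytes = 300 bytes: needs a 2-byte length prefix,
      -- so requesting an in-place change should be refused.
      ALTER TABLE t MODIFY c VARCHAR(100) CHARACTER SET utf8, ALGORITHM=INPLACE;
      ```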

    3. About "ALGORITHM=COPY will include both undo and redo logging overhead": I understand that the corresponding INSERTs need REDO logging for crash recovery, but I do not understand why they need UNDO. In my still limited understanding of InnoDB, an INSERT does not need to store a previous version of the row; it is a new row. Could you explain that, please?

    4. About "ALGORITHM=INPLACE since 5.6 will skip the undo logging overhead, but everything will be redo-logged": I am not sure I fully understand: are we talking only about ADD INDEX, or about all ALTERs? This might be related to my previous comment above.

    5. About "DML log apply" and avoiding REDO logging during the initial index build phase: you mean that REDO logging is avoided during merge sorting, but not during the insert phase, right? That makes sense to me: during the sorting, if we crash, the index is lost and we do not need REDO. However, if we crash after the insert phase, we need the REDO to rebuild the data, as the corresponding pages could still be dirty. But if the insert phase is "fast", how is the missing REDO of the sort phase handled: if we crash there, how do we avoid losing something? Is a write-back of the pages without REDO forced before the end of the insert phase?

    6. Hi Jean-François, you have good ideas and questions.

      Maybe you should file a feature request regarding the ALGORITHM=NOCOPY and ALGORITHM=NOSCAN.

      Sure, there probably should be another parameter for the DML log buffer size, so that innodb_sort_buffer_size would not be overloaded for that. Maybe you could file a feature request for that as well?

      Yes, it is pretty insane for ALGORITHM=COPY to generate undo log for the inserts, but that is how it currently is. We have an internally filed bug for this, and I would definitely want to remove this completely unnecessary overhead. This overhead prompted Heikki Tuuri to ask me to implement a stupid work-around in MySQL 4.0 or 4.1: we would commit the ALTER TABLE...ALGORITHM=COPY after every 10,000 copied rows, so that if the server was killed (due to the ALTER being so slow), the recovery would not take forever rolling back all those inserts. Why not simply let the recovery drop the incomplete table instead?

      About "ALGORITHM=INPLACE since 5.6 will skip the undo logging overhead, but everything will be redo-logged": by default, all modifications to data pages will be redo-logged, be it user data pages, data dictionary pages, or undo log pages. So, I am talking about all ALTERs. Skipping the undo logging applies to ADD INDEX and to any form of table-rebuilding (WL#6255) not-really-inplace ALGORITHM=INPLACE operation. Other ALTER operations will not generate any undo log, except for modifying the data dictionary tables.

      About "DML log apply" and avoiding REDO logging during the initial index build phase: the merge sort files are temporary files; there is never redo logging for them. There are actually several phases in online ALTER:
      1. scan the clustered index, create the tuples for each index
      2. merge sort the tuples
      3. insert the sorted data into each index
      4. apply DML log into each index
      5. upgrade MDL to MDL_EXCLUSIVE
      6. [if table-rebuild] apply any DML log to each index
      7. commit or roll back the ALTER operation

      WL#7277 in MySQL 5.7 affects step 3 only. In step 4, we will still be generating redo log for applying the log from concurrent DML operations into the tables. This redo logging could be removed in 5.8.

    7. If we crash before an ALTER TABLE operation is committed, we must roll back all changes of the ALTER TABLE. This works in 5.6, with the exception of a possible *.frm file mismatch and orphan #sql-ib*.ibd files in table-rebuilding ALTER TABLE.

      With WL#7277 in MySQL 5.7, we will obviously have to flush the non-redo-logged pages before the ALTER TABLE operation is committed. And we will write redo log about any page allocations, so that crash recovery will be able to drop the incompletely created index tree(s).

    8. Many thanks for all this information, Marko.

  2. Oh, by the way, InnoDB does not yet implement any bulk-loading interface other than the internal one that WL#7277 introduced for ALTER TABLE…ALGORITHM=INPLACE.

    So, LOAD DATA INFILE will include both undo and redo logging overhead, and it will insert data into the secondary indexes in ‘random’ order.

    We are aware of the problem, but I cannot make any promises when it could be fixed. These things (especially crash recovery) are simpler in MyISAM, because in MyISAM, the keys are stored separately from the data file, and there is no system tablespace for MyISAM. Users also do not expect any durability or consistency from MyISAM. In InnoDB, proper recovery from a server kill during a fast-path LOAD DATA INFILE would require some changes to the data dictionary and crash recovery.

    1. Thanks for pointing that out, Marko. My preliminary results show that it is better to just load data with indexes in InnoDB: 542 seconds for loading with indexes vs 134 seconds for loading without indexes + 649 seconds for index creation. The surprising part here is that index creation takes more time than loading with indexes, so I tried to see if I could get this part to take less time. It is while looking in that direction that I found the bad side effects of a large innodb_sort_buffer_size.
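      For reference, the two loading strategies I compared look like this (the file name and the test_w_keys table name are illustrative, and I show only 2 of the keys):

      ```sql
      -- Strategy 1: secondary indexes defined before the load (542 seconds).
      LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE test_w_keys;

      -- Strategy 2: load a bare table (134 seconds),
      -- then build the indexes (649 seconds).
      LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE test_wo_keys;
      ALTER TABLE test_wo_keys ADD KEY f02 (f02), ADD KEY f03 (f03);
      ```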

    2. This result is very surprising, given that there are requests that mysqldump should generate SQL to create the secondary indexes afterwards. Could you share your benchmark? Can you repeat the same with 5.7.8?

    3. My results are still preliminary. Once I am sure I understand everything about them, I will publish them. I can also run on 5.7.8; my scripts are ready and server time is available.

  3. Oh, and I am loading data that fits in RAM (or at least the indexes fit in RAM), so I do not have to pay read IO latency during loading.

  4. This post is very informative, thanks for sharing your thoughts.