Wednesday, July 19, 2017

InnoDB Basics - Compaction: when and when not

This is old news for MySQL/MariaDB expert but people that are starting using InnoDB do not always know that disk space is not automatically released when deleting data from a table.  To explain and demonstrate that, I will take two real-world examples: table1 and table2.

Recently, more than 90% and about 20% of rows were deleted from table1 and table2 (those tables contain real data, I only changed the table names).  Those two tables are quite large (121GiB and 846GiB respectively):
# ls -lh table1.ibd table2.ibd
-rw-rw---- 1 user group 121G Jul 12 05:43 table1.ibd
-rw-rw---- 1 user group 846G Jul 12 11:11 table2.ibd
To understand why those table did not shrink when deleting rows, we have to dive into some InnoDB internals.  InnoDB stores data in pages (default size of 16 KiB).  When a row is deleted from a page, free space is generated but the page still occupies its full size on disk and in RAM.  When consecutive pages have enough free space, they are merged.  A page-merge generates a free page that can be reused for future row insertion but free pages still occupy space on disk.

An approximation of the amount of space consumed by free pages can be observed by looking at the field Data_free from a SHOW TABLE STATUS output.  Below, we can see the disk space consumed by free pages in table1 (converted to GiB with awk).
# sts="show table status like 'table1'\G"
# mysql my_schema <<< "$sts" | awk '$1=="Data_free:"{$2 /= 1024*1024*1024; print}'
Data_free: 117.709
This real-world example is a replicated environment, so running OPTIMIZE TABLE on the master would generate a lot of replication delay (and would lock the table).  To reclaim this free space, I used the online schema change utility from the Percona Toolkit with an alter command of ENGINE=InnoDB.  The table compaction took a little less than an hour and I have this when it completed:
# ls -lh table1.ibd
-rw-rw---- 1 user group 492M Jul 14 14:42 table1.ibd
So in this case, compaction allowed to save about 120 GiB of disk space.  This is more than the 117.7 GiB advertised by the Data_free of the SHOW TABLE STATUS output, so I am concluding that the used pages had significant free space in them.

We are now done with table1, let's have a look at table2.

After deleting about 20% of rows in table2, the amount of Data_free was not significant.  This means that we probably have a lot of free space inside InnoDB pages.  At this point, it is unclear what to do next:
  • compacting with an online schema change will be long as the table is big, it will generate a lot of binary logs (RBR) and we do not know how much space we will save,
  • on the other hand, we could end up freeing significant space, so it might be worth it,
  • but as this table is big, and if the percentage of reclaimed space is small, we might be doing work and taking risks for a very marginal gain (operations on live system always come with risks).
To decide on the way forward, I ran an OPTIMIZE TABLE on a spare slave (I have a few).  So I executed below:
# date; ls -lh table2.ibd; \
  mysql my_schema <<< "STOP SLAVE; OPTIMIZE TABLE LOCAL table2; START SLAVE"; \
  ls -lh table2.ibd; date
Wed Jul 12 11:17:29 CEST 2017
-rw-rw---- 1 user mysql 835G Jul 12 11:17 table2.ibd
Table             Op        Msg_type  Msg_text
my_schema.table2  optimize  note      Table does not support optimize, doing recreate + analyze instead
my_schema.table2  optimize  status    OK
-rw-rw---- 1 user group 640G Jul 13 10:30 table2.ibd
Thu Jul 13 10:30:45 CEST 2017
So it took a little more than 23 hours to compact table2 and it saved a little less than 200 GiB: not bad !  But at this point, the InnoDB pages are mostly full and the next row insertions in the table will generate page splits.  Such splits happen when InnoDB tries to add rows in pages without enough free space.  In this case, InnoDB splits the page generating two pages that are half-full.  The new row is then put in the generated free space.  This allows the row to be inserted but it produces about 16 KiB of free space in used pages.

Note that free space is not generated if rows are inserted in primary key order as the next row is always inserted at the end of the table.  But this does not apply to indexes because INSERTs there are mostly done in random orders.

So even if we saved 200 GiB after running OPTIMIZE TABLE on table2, page-split could end up consuming that reclaimed space very fast.  To confirm that this saving is not virtual, we have to observe the evolution of the size of the table on disk, and it is sadly growing:
# date; ls -lh table2.ibd
Thu Jul 13 10:46:51 CEST 2017
-rw-rw---- 1 user group 660G Jul 13 10:46 table2.ibd
# date; ls -lh table2.ibd
Thu Jul 13 12:56:08 CEST 2017
-rw-rw---- 1 user group 715G Jul 13 12:56 table2.ibd
# date; ls -lh table2.ibd
Fri Jul 14 10:06:55 CEST 2017
-rw-rw---- 1 user group 736G Jul 14 10:06 table2.ibd
So I decided that compacting table2 on all servers of this replicated system was not worth it.


  1. I wonder if Jeremy Cole's innodb_ruby could be used on a live instance directly to get the idea of potential OPTIMIZE savings: :

    $ innodb_space -f test/t.ibd space-index-pages-summary | head -n 10
    page index level data free records
    3 15 2 26 16226 2
    4 15 0 9812 6286 446
    5 15 0 15158 860 689
    6 15 0 10912 5170 496
    7 15 0 10670 5412 485
    8 15 0 12980 3066 590
    9 15 0 11264 4808 512
    10 15 0 4488 11690 204
    11 15 0 9680 6418 440

    1. Very good idea Laurynas, I did not think about this, it is definitely worth trying.

  2. As of MySQL 5.7, it is possible to control the page fill factor used by InnoDB when rebuilding indexes with OPTIMIZE TABLE. Which may help reduce page splits after rebuilding the table. However, that variable only applies to table rebuilds. I don't think it will make any difference for tools like pt-osc or gh-ost.

    By default, innodb_fill_factor=100 is used, which results in the fill factor of 15/16 for the clustered index (to keep the pre-5.7 behavior) and the 100% fill factor for secondary index pages.

    1. Thanks for pointing that out Alexey. I thought about mentioning innodb_fill_factor (and also MyRocks) but I cut short as the post is already quite long and dense. It would take a lot of lines to cover it fully and I was hoping that there would be a comment on this. I owe you a beer ! :-)

      I like the idea of innodb_fill_factor but the implementation in MySQL 5.7 and MariaDB 10.2 is still very coarse. One of the things I do not like about it is "global variable only". If you change it, it will be for the whole server. I think adding a session version of the variable would be good as we could control if we want to affect all current and future operations or only the one we are doing now. Also, it looks to me that this feature in a replicated environment has not been thought through: if you use it on the master, how would it apply to slaves...

      I think that what we need is a fill factor definition at the CREATE TABLE level and per-INDEX.

      So many feature requests to do and not much time for this now. I would be happy if someone would work on improving.

  3. Cannot agree more, innodb_fill_factor is quite limited in its current implementation. Being able to specify it per-session or per-table would be useful. What would be even more useful is being able to control fill factor on all inserts, not just on bulk inserts aka index rebuilds. I hope to get around to implementing those features soon.