Recently, more than 90% of rows were deleted from table1 and about 20% from table2 (those tables contain real data; I only changed the table names). Those two tables are quite large (121 GiB and 846 GiB respectively):
# ls -lh table1.ibd table2.ibd
-rw-rw---- 1 user group 121G Jul 12 05:43 table1.ibd
-rw-rw---- 1 user group 846G Jul 12 11:11 table2.ibd
To understand why those tables did not shrink when deleting rows, we have to dive into some InnoDB internals. InnoDB stores data in pages (default size of 16 KiB). When a row is deleted from a page, free space is generated, but the page still occupies its full size on disk and in RAM. When consecutive pages have enough free space, they are merged. A page merge generates a free page that can be reused for future row insertions, but free pages still occupy space on disk.
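As a quick sanity check, the page size of a running instance can be queried as below (a minimal sketch; 16384 bytes, i.e. 16 KiB, is the default):

# mysql -N <<< "SHOW GLOBAL VARIABLES LIKE 'innodb_page_size'"
innodb_page_size	16384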
An approximation of the amount of space consumed by free pages can be observed by looking at the Data_free field of the SHOW TABLE STATUS output. Below, we can see the disk space consumed by free pages in table1 (converted to GiB with awk):
# sts="show table status like 'table1'\G"
# mysql my_schema <<< "$sts" | awk '$1=="Data_free:"{$2 /= 1024*1024*1024; print}'
Data_free: 117.709

This real-world example is a replicated environment, so running OPTIMIZE TABLE on the master would generate a lot of replication delay (and would lock the table). To reclaim this free space, I used the online schema change utility from the Percona Toolkit with an alter command of ENGINE=InnoDB. The table compaction took a little less than an hour and I had this when it completed:

# ls -lh table1.ibd
-rw-rw---- 1 user group 492M Jul 14 14:42 table1.ibd

So in this case, compaction saved about 120 GiB of disk space. This is more than the 117.7 GiB advertised by Data_free in the SHOW TABLE STATUS output, so I conclude that the used pages also had significant free space in them.
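For reference, the pt-online-schema-change invocation looked something like the command below (a sketch: connection options and throttling flags are omitted, and the schema/table names are the anonymized ones used in this post):

# pt-online-schema-change --alter "ENGINE=InnoDB" --execute D=my_schema,t=table1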
We are now done with table1; let's have a look at table2.
After deleting about 20% of the rows in table2, the amount of Data_free was not significant, which means that the freed space is probably still inside the used InnoDB pages. At this point, it is unclear what to do next:
- compacting with an online schema change will take a long time as the table is big, it will generate a lot of binary logs (RBR), and we do not know how much space we will save;
- on the other hand, we could end up freeing significant space, so it might be worth it,
- but as this table is big, if the percentage of reclaimed space is small, we might be doing work and taking risks for a very marginal gain (operations on a live system always come with risks).
To estimate the potential savings, I ran OPTIMIZE TABLE on one of the slaves of this replicated system (the LOCAL keyword keeps the statement out of the binary logs, so it does not replicate to other servers):

# date; ls -lh table2.ibd; \
  mysql my_schema <<< "STOP SLAVE; OPTIMIZE TABLE LOCAL table2; START SLAVE"; \
  ls -lh table2.ibd; date
Wed Jul 12 11:17:29 CEST 2017
-rw-rw---- 1 user mysql 835G Jul 12 11:17 table2.ibd
Table             Op        Msg_type  Msg_text
my_schema.table2  optimize  note      Table does not support optimize, doing recreate + analyze instead
my_schema.table2  optimize  status    OK
-rw-rw---- 1 user group 640G Jul 13 10:30 table2.ibd
Thu Jul 13 10:30:45 CEST 2017

So it took a little more than 23 hours to compact table2 and it saved a little less than 200 GiB: not bad! But at this point, the InnoDB pages are mostly full and the next row insertions in the table will generate page splits. Such splits happen when InnoDB tries to add a row to a page without enough free space: InnoDB splits the page into two half-full pages and puts the new row in the generated free space. This allows the row to be inserted, but each split produces about 16 KiB of free space in used pages (two half-full 16 KiB pages have about 8 KiB of free space each).
Note that free space is not generated if rows are inserted in primary key order, as the next row is then always inserted at the end of the table. But this does not apply to secondary indexes, where inserts mostly happen in random order.
So even if we saved almost 200 GiB by running OPTIMIZE TABLE on table2, page splits could end up consuming that reclaimed space very quickly. To confirm that this saving is not virtual, we have to observe the evolution of the size of the table on disk, and it is sadly growing:
# date; ls -lh table2.ibd
Thu Jul 13 10:46:51 CEST 2017
-rw-rw---- 1 user group 660G Jul 13 10:46 table2.ibd
# date; ls -lh table2.ibd
Thu Jul 13 12:56:08 CEST 2017
-rw-rw---- 1 user group 715G Jul 13 12:56 table2.ibd
# date; ls -lh table2.ibd
Fri Jul 14 10:06:55 CEST 2017
-rw-rw---- 1 user group 736G Jul 14 10:06 table2.ibd
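The snapshots above were taken by hand; a small loop can watch the growth continuously (a sketch, sampling once per hour):

# while true; do date; ls -lh table2.ibd; sleep 3600; done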
So I decided that compacting table2 on all servers of this replicated system was not worth it.
I wonder if Jeremy Cole's innodb_ruby could be used directly on a live instance to get an idea of the potential OPTIMIZE savings (https://blog.jcole.us/2013/01/03/a-quick-introduction-to-innodb-ruby/):
$ innodb_space -f test/t.ibd space-index-pages-summary | head -n 10
page index level data free records
3 15 2 26 16226 2
4 15 0 9812 6286 446
5 15 0 15158 860 689
6 15 0 10912 5170 496
7 15 0 10670 5412 485
8 15 0 12980 3066 590
9 15 0 11264 4808 512
10 15 0 4488 11690 204
11 15 0 9680 6418 440
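If this can indeed be run against a live instance, summing the free column would give a rough lower bound on what OPTIMIZE could reclaim. A sketch, assuming the column layout above (free is the 5th field, in bytes):

$ innodb_space -f test/t.ibd space-index-pages-summary | \
  awk 'NR > 1 { free += $5 } END { printf "%.2f GiB free in index pages\n", free/1024/1024/1024 }'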
Very good idea Laurynas, I did not think about this, it is definitely worth trying.
As of MySQL 5.7, it is possible to control the page fill factor used by InnoDB when rebuilding indexes with OPTIMIZE TABLE, which may help reduce page splits after rebuilding the table. However, that variable only applies to table rebuilds, so I don't think it will make any difference for tools like pt-osc or gh-ost.
By default, innodb_fill_factor=100 is used, which results in a fill factor of 15/16 for the clustered index (to keep the pre-5.7 behavior) and a 100% fill factor for secondary index pages.
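For illustration, leaving some headroom during a rebuild could look like the command below (a sketch: 90 is an arbitrary value, and since innodb_fill_factor is a global variable, it would also affect any concurrent rebuilds):

# mysql my_schema <<< "SET GLOBAL innodb_fill_factor = 90; OPTIMIZE TABLE LOCAL table2; SET GLOBAL innodb_fill_factor = 100"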
Thanks for pointing that out Alexey. I thought about mentioning innodb_fill_factor (and also MyRocks) but I cut it short as the post is already quite long and dense. It would take a lot of lines to cover it fully and I was hoping that there would be a comment about it. I owe you a beer! :-)
I like the idea of innodb_fill_factor but the implementation in MySQL 5.7 and MariaDB 10.2 is still very coarse. One of the things I do not like about it is that it is a global variable only: if you change it, it changes for the whole server. I think adding a session version of the variable would be good, as we could then control whether we want to affect all current and future operations or only the one we are doing now. Also, it looks to me that this feature has not been thought through for replicated environments: if you use it on the master, how would it apply to slaves...
I think that what we need is a fill factor definition at the CREATE TABLE level and per-INDEX.
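No such syntax exists in MySQL today; a purely hypothetical version of what I am asking for could look like this:

-- hypothetical syntax, not valid in any current MySQL or MariaDB version
CREATE TABLE t (
  id BIGINT NOT NULL PRIMARY KEY,
  val VARCHAR(255),
  INDEX idx_val (val) FILL_FACTOR=80
) ENGINE=InnoDB FILL_FACTOR=90;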
So many feature requests to file and not much time for this now. I would be happy if someone worked on improving this.
Cannot agree more: innodb_fill_factor is quite limited in its current implementation. Being able to specify it per session or per table would be useful. What would be even more useful is being able to control the fill factor on all inserts, not just on bulk inserts (aka index rebuilds). I hope to get around to implementing those features soon.