Monday, August 7, 2017

An Adventure in InnoDB Table Compression (for read-only tables)

In my last post about big MySQL deployments, I am quickly mentioning that InnoDB compression is allowing dividing disk usage by about 4.3 on a 200+ TiB dataset.  In this post, I will give more information about this specific use case of InnoDB table compression and I will share some statistics and learnings on this system and subject.  Note that I am not covering InnoDB page compression which is a new feature of MySQL 5.7 (also known as hole punching).

Before talking about the results, I will describe the system where I have implemented InnoDB compression and give the following warning:
The compression statistics shared here are from a very specific system.
Take them with a grain of salt and test to see how compression benefits your systems.
The system where I implemented InnoDB table compression contains more than one million read-only tables.  Some time ago, those tables were stored in a single MySQL instance, without using InnoDB compression (the size was 200+ TiB) and without using MySQL replication.  The storage was external (disk array accessed via a SAN) and was supporting snapshots.  At the very beginning, the amount of data (and tables) was much smaller, it slowly grew to its current size and it is still growing.  Because administration was becoming tedious, the system was sharded on servers with local storage and replication was used for redundancy.  To reduce the number of shards needed to store all this data, I implemented InnoDB table compression.

One particularity of this system is that the data arrives in MySQL in chunks (tables) that are never updated after being stored (after being fully written, the tables are read-only).  In the new sharded system, the data is written on the master in an InnoDB tables (non-compressed) and it reaches slaves via classical MySQL replication.  Once the table is fully written, a cron job running on all nodes of the replication hierarchy compresses the data locally (not writing to binary logs).

The compression script is trying many KEY_BLOCK_SIZE for finding the one resulting in the smallest footprint on disk.  The script works as follows for each table to compress:
  1. it copies the table twice in a temporary schema (let's call those copies A and B),
  2. it compresses table B with KEY_BLOCK_SIZE of 8,
  3. if compression does not succeed, it keeps table A then exits,
  4. if table B (just compressed) is larger than table A, it keeps table A then exits,
  5. it swaps tables A and B,
  6. it goes back to step #2 using KEY_BLOCK_SIZE of 4, 2 and 1.
The compression script stores the compression state of a table and the resulting sizes of each step in a log table having following structure:
> show create table compression_logs\G
*************************** 1. row ***************************
       Table: compression_logs
Create Table: CREATE TABLE `compression_logs` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `schema_name` varchar(64) NOT NULL,
  `table_name` varchar(64) NOT NULL,
  `initial_size` bigint(20) unsigned NOT NULL,
  `state` enum([...]) NOT NULL,
  `copied_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_8_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_4_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_2_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_1_size` bigint(20) unsigned DEFAULT NULL,
  `final_size` bigint(20) unsigned DEFAULT NULL,
  `final_kbs` tinyint(4) DEFAULT NULL
  PRIMARY KEY (`id`)
1 row in set (0.03 sec)
Note that finding the size on disk of an InnoDB table is not straightforward.  Looking at the size of the ibd file (ls, du or stat) does not work as some data might still be in the Buffer Pool (not yet flushed to disk).  Also, the output of SHOW TABLE STATUS or the content of information_schema.TABLE only shows an approximation of the data and index lengths.  I used the command FLUSH TABLES ... FOR EXPORT for solving that.  Below is an extract from the manual, which is exactly what we need to be able to use ls, du or stat for getting the table size:
[FLUSH TABLES ... FOR EXPORT] ensures that changes to the named tables have been flushed to disk so that binary table copies can be made while the server is running.
I will come back later below to compression failing and to table size going the wrong direction (growing instead of shrinking after compression).  Now, let's look at the compression statistics.  For this post, I am only looking at a single shard of the new system (I am assuming that all shards have similar behavior).  The query below gives a good overview of the InnoDB table compression benefits on the amount of data on disk:
> SELECT nb_tables, initial_gib, final_gib,
         ROUND(initial_gib/final_gib, 2) AS ratio
  FROM (SELECT COUNT(*) AS nb_tables,
               ROUND(SUM(initial_size) / POW(2,30), 2) AS initial_gib,
               ROUND(SUM(final_size)   / POW(2,30), 2) AS final_gib
        FROM compression_logs) AS c;
| nb_tables | initial_gib | final_gib | ratio |
|    248670 |    20701.84 |   4566.50 |  4.53 |
1 row in set (0.14 sec)
What we have here is a very respectable MySQL instance:
  • it contains 248,670 tables,
  • uncompressed, those tables would use 20.21 TiB of disk space,
  • after compression, those tables are using 4.45 TiB of disk space,
  • and the average compression ratio is 4.53.
The results above have been obtained with the default value of innodb_compression_level (6).  Using the highest value (9) might give better results in the general case but the tests I did on that system do not show significant improvements (I might share more about that in another post; UPDATE 2017-08-14: I wrote a More Details about InnoDB Compression Levels post).  So I kept the default for all the shards of this system.

An average compression ratio of 4.53 is very good but it might vary for different KEY_BLOCK_SIZE.  Let's look at more detailed data with this query (a final_kbs of 16 means no compression):
> SELECT final_kbs, nb_tables, initial_gib, final_gib,
         ROUND(initial_gib/final_gib, 2) AS ratio
  FROM (SELECT final_kbs, COUNT(final_kbs) AS nb_tables,
               ROUND(SUM(initial_size) / POW(2,30), 2) AS initial_gib,
               ROUND(SUM(final_size)   / POW(2,30), 2) AS final_gib
        FROM compression_logs GROUP BY final_kbs) AS c;
| final_kbs | nb_tables | initial_gib | final_gib | ratio |
|         1 |     40863 |     1940.72 |    265.15 |  7.32 |
|         2 |    152215 |    14937.60 |   3033.26 |  4.92 |
|         4 |     38862 |     3387.78 |   1048.73 |  3.23 |
|         8 |     13896 |      434.14 |    217.76 |  1.99 |
|        16 |      2834 |        1.60 |      1.60 |  1.00 |
5 rows in set (0.23 sec)
As we see above, we end-up with tables of all KEY_BLOCK_SIZE/final_kbs but at this point, we do not know if compression stopped because ALTER is failing or because the table size grew with the next KEY_BLOCK_SIZE.  The query below is telling us when the next KEY_BLOCK_SIZE resulted in a bigger table:
> SELECT final_kbs, COUNT(final_kbs) AS nb_tables,
         ROUND(SUM(is)      / POW(2,30), 2) AS initial_gib,
         ROUND(SUM(fs)      / POW(2,30), 2) AS final_gib,
         ROUND(SUM(ns - fs) / POW(2,30), 2) AS diff_gib
  FROM (SELECT final_kbs, initial_size as is, final_size as fs,
               CASE final_kbs WHEN 16 THEN kbs_8_size
                              WHEN  8 THEN kbs_4_size
                              WHEN  4 THEN kbs_2_size
                              ELSE         kbs_1_size END AS ns
        FROM compression_logs WHERE final_kbs <> 1) AS c
  WHERE next_size > final_size GROUP BY final_kbs;
| final_kbs | nb_tables | initial_gib | final_gib | diff_gib |
|         2 |    112278 |     9339.80 |   1955.21 |   609.37 |
|         4 |     38862 |     3387.78 |   1048.73 |   137.83 |
|         8 |     13896 |      434.14 |    217.76 |    51.82 |
|        16 |      2834 |        1.60 |      1.60 |     9.17 |
4 rows in set (0.29 sec)
The result above shows that, if we want the smallest footprint on disk, it is important to use the right KEY_BLOCK_SIZE:
  • if using 8 instead of uncompressed (final_kbs of 16), 9.17 GiB is wasted (this might not look like much but it is 5.7 times the initial size),
  • if 4 instead of 8, 51.82 GiB is wasted, which is an increase of 23%,
  • and the increase is 31% for 2 instead of 1.
The attentive reader might have spotted a gap between the two tables above.  In the first tables, we have 152,215 tables that are compressed with a KEY_BLOCK_SIZE of 2 and in the second tables, we have only 112,278 tables.  This is because some tables failed to alter with a KEY_BLOCK_SIZE of 1.  Let's confirm that with the following query.
> SELECT COUNT(final_kbs) as nb_tables,
         ROUND(SUM(initial_size) / POW(2,30), 2) AS initial_gib,
         ROUND(SUM(final_size)   / POW(2,30), 2) AS final_gib
  FROM compression_logs
  WHERE final_kbs = 2 AND kbs_1_size IS NULL
  GROUP BY final_kbs;
| nb_tables | initial_gib | final_gib |
|     39937 |     5597.80 |   1078.05 |
1 row in set (0.18 sec)
So 39,937 tables are failing ALTER with a KEY_BLOCK_SIZE of 1. The manual is saying that compression could fail if a row does not fit in a block, the exact words being:
Generally, MySQL requires that each B-tree page in an InnoDB table can accommodate at least two records. For compressed tables, this requirement has been relaxed.  Leaf pages of B-tree nodes (whether of the primary key or secondary indexes) only need to accommodate one record, but that record must fit, in uncompressed form, in the per-page modification log.
However, I know that at least one of those tables has a row size smaller than 1 KiB.  I opened Bug#84439 for that.

I hope above gave you an idea of a real-world use case of InnoDB table compression.  Before ending this post, here are a few additional comments/thoughts on this work:
  • The script compresses table in a temporary schema and swaps the compressed with the original table (note that this is a cross-schema rename operation).  I found Bug#84038 that results in spamming the error log on MySQL restart for each cross-schema rename operation.  This should be fixed in MySQL 5.7.19 (I have not verified myself yet).
  • I already mentioned in a previous post that I tested MyISAM and myisampack and that this is giving a better compression ratio (I wrote a post Why we still need MyISAM about this).
  • Finally, I am curious how MyRocks would compress this data.  I might be able to find time to test this and write about it.


  1. Could you please clarify: KEY_BLOCK_SIZE=16 is a valid key block size for compression, which is not the same as no compression (it is documented to compress BLOB overflow pages but not the B-tree index pages). But in your test summary KEY_BLOCK_SIZE=16 is simply a notation for regular uncompressed tables, correct?

    1. Thanks for the question Laurynas. I changed a few words in the post. I am never altering with "KEY_BLOCK_SIZE=16" (I have no kbs_16_size in the log table) but I sometimes end-up with 16 in the final_kbs field of the log table. When this is the case, it means that I kept the table uncompressed (usually because using 8 for the KEY_BLOCK_SIZE results in a bigger table). Sorry for the confusion.

  2. Thanks Jean for posting a helpful link. Can you please confirm that such compression will help only in read only tables not in RW table ?

    1. InnoDB table compression will also work for RW tables, but the numbers I am presenting here are for compressing table that are never updated.