Monday, August 14, 2017

More Details about InnoDB Compression Levels (innodb_compression_level)

In one of my previous posts, I shared InnoDB table compression statistics for a read-only dataset using the default value of innodb_compression_level (6).  In it, I claimed, without giving much detail, that using the maximum value for the compression level (9) would not make a big difference.  In this post, I will share more details about this claim.

TL;DR: tuning innodb_compression_level is not very useful for my dataset.

In the previous post, I explained how I am compressing InnoDB tables for a specific read-only dataset (more details can be found there).  I also presented a log table where I keep information about the size of each compression stage.  As a reminder, here is the structure of this table:
> show create table compression_logs\G
*************************** 1. row ***************************
       Table: compression_logs
Create Table: CREATE TABLE `compression_logs` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `schema_name` varchar(64) NOT NULL,
  `table_name` varchar(64) NOT NULL,
  `initial_size` bigint(20) unsigned NOT NULL,
  `state` enum([...]) NOT NULL,
  `copied_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_8_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_4_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_2_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_1_size` bigint(20) unsigned DEFAULT NULL,
  `final_size` bigint(20) unsigned DEFAULT NULL,
  `final_kbs` tinyint(4) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB
1 row in set (0.03 sec)
For testing with other values of innodb_compression_level, I took a copy of the log table.  For that, I used the commands below.  Note that disabling binary logging is important as I do not want this table to be replicated and I want to avoid polluting the content of gtid_executed with errant transactions (this system is using Oracle/MySQL GTIDs).
SET sql_log_bin = 0;
CREATE TABLE compression_logs9 LIKE compression_logs;
INSERT INTO compression_logs9 (id, schema_name, table_name, initial_size, state)
  SELECT id, schema_name, table_name, initial_size, 'todo' FROM compression_logs;
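Before running such statements, it can be worth double-checking that binary logging is really disabled for the session (a quick sanity check, not something the procedure strictly requires):

```sql
-- Must return 0 before running the statements above,
-- otherwise the copy would generate errant transactions.
SELECT @@sql_log_bin;
```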
I then changed the value of innodb_compression_level to 9 and ran my compression test script.  I was ready for a long wait as the script tests all values of KEY_BLOCK_SIZE and runs FLUSH TABLES ... FOR EXPORT to get a reliable size for each table (the details are in the previous post).  While waiting, I checked a few things to see if I could make compression go faster.  I noticed the saturation below in redo log usage:


This is classic: there was not enough space in the redo log to sustain the amount of work I was throwing at InnoDB (I was running 30 table compressions in parallel).  I had the following redo log settings:
innodb_log_files_in_group = 2
innodb_log_file_size      = 2G
I increased them to the following:
innodb_log_files_in_group = 4
innodb_log_file_size      = 8G
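This grows the total redo capacity from 2 x 2 GiB = 4 GiB to 4 x 8 GiB = 32 GiB.  One way to check the configured capacity from a client session:

```sql
-- Total redo log capacity in GiB: number of files times file size.
SELECT @@innodb_log_files_in_group * @@innodb_log_file_size / POW(2,30)
  AS redo_capacity_gib;
```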
I am not sure it helped much in terms of compression speed but InnoDB was happily using this new space in the redo log:


I also checked the read/write ratio of InnoDB.  As I expected, there were fewer reads than writes.  This means that the data was well cached in the Buffer Pool.  Had this not been the case, each compression with a new KEY_BLOCK_SIZE would have needed to read data from disk, which would have slowed down the compression script.  I would have been able to detect that as a smaller gap (or no gap at all) between read and write throughput on InnoDB data files.  And if that had happened, running fewer table compressions in parallel might have sped things up.


After having compressed a few tables with an innodb_compression_level of 9, I ran the query below to see how it compared with a compression level of 6, but I got an unexpected error:
> SELECT nb_tables, final_size6_gib, final_size9_gib, delta_gib, final_size9_gib / final_size6_gib AS ratio
    -> FROM (SELECT COUNT(*) AS nb_tables,
    ->              ROUND(SUM(l.final_size)              / POW(2,30), 2) AS final_size6_gib,
    ->              ROUND(SUM(i.final_size)              / POW(2,30), 2) AS final_size9_gib,
    ->              ROUND(SUM(l.final_size-i.final_size) / POW(2,30), 2) AS delta_gib
    ->       FROM compression_logs l, compression_logs9 i
    ->       WHERE l.id = i.id AND i.state = 'done') AS c;
ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in '(`schema`.`l`.`final_size` - `schema`.`i`.`final_size`)'
This is because the final_size columns of the compression log tables are defined as unsigned and the query above subtracts one from the other, which can give a negative value.  This is the defined behavior of MySQL arithmetic operations, as we can read in the manual:
The result [of an arithmetic operation] is determined according to the following rules:
  • [...]
  • If both operands are integers and any of them are unsigned, the result is an unsigned integer.
  • [...]
A solution to the above is to use the MySQL CAST function to make both operands signed.  Some might find that this makes the query longer and prefer to use the sql_mode NO_UNSIGNED_SUBTRACTION, which forces the result to be signed even if one of the operands is unsigned.  Another option is to change the type of the final_size column to signed, but I do not like this fix as a negative size does not make sense in the context of this table.  I solved the problem by using type casting.
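Both fixes can be seen with a minimal example, no tables needed:

```sql
-- Unsigned subtraction with a negative result fails:
SELECT CAST(1 AS UNSIGNED) - 2;
-- ERROR 1690 (22003): BIGINT UNSIGNED value is out of range

-- Fix 1: cast the unsigned operand to signed.
SELECT CAST(CAST(1 AS UNSIGNED) AS SIGNED) - 2;  -- returns -1

-- Fix 2: enable NO_UNSIGNED_SUBTRACTION for the session.
SET SESSION sql_mode = CONCAT(@@sql_mode, ',NO_UNSIGNED_SUBTRACTION');
SELECT CAST(1 AS UNSIGNED) - 2;                  -- now returns -1
```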

Back to compression: after having processed a few tens of thousands of tables with an innodb_compression_level of 9 (it took about 4 days), I ran the query below to see how it compared with a compression level of 6:
> SELECT nb_tables, final_size6_gib, final_size9_gib, delta_gib,
    ->   ROUND(100 * final_size9_gib / final_size6_gib - 100, 2) AS ratio
    -> FROM (SELECT COUNT(*) AS nb_tables,
    ->              ROUND(SUM(l.final_size) / POW(2,30), 2) AS final_size6_gib,
    ->              ROUND(SUM(i.final_size) / POW(2,30), 2) AS final_size9_gib,
    ->              ROUND(SUM(CAST(l.final_size AS SIGNED) - CAST(i.final_size AS SIGNED))
    ->                    / POW(2,30), 2) AS delta_gib
    ->       FROM compression_logs l, compression_logs9 i
    ->       WHERE l.id = i.id AND i.state = 'done') AS c;
+-----------+-----------------+-----------------+-----------+-------+
| nb_tables | final_size6_gib | final_size9_gib | delta_gib | ratio |
+-----------+-----------------+-----------------+-----------+-------+
|     35000 |          607.42 |          589.58 |     17.84 | -2.94 |
+-----------+-----------------+-----------------+-----------+-------+
1 row in set (0.18 sec)
So the maximum compression level is saving about 2.94% of disk space for 35,000 tables.  This might be interesting for some people, but as I can divide the size of this data by two by using MyISAM and myisampack, saving such a small amount of disk space is not very interesting to me in this case.  So I am not planning to use an innodb_compression_level of 9 for this system.
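As a sanity check, the delta and ratio columns can be recomputed from the two totals in the result above:

```sql
-- Recompute the delta and ratio from the reported totals (in GiB).
SELECT ROUND(607.42 - 589.58, 2)             AS delta_gib,  -- 17.84
       ROUND(100 * 589.58 / 607.42 - 100, 2) AS ratio;      -- -2.94
```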

For completeness, and as I was curious, I also tested with a compression level of 3 (following the same procedure, with results stored in a compression_logs3 table).  Here is the result for the same 35,000 tables:
> SELECT nb_tables, final_size6_gib, final_size3_gib, delta_gib,
    ->   ROUND(100 * final_size3_gib / final_size6_gib - 100, 2) AS ratio
    -> FROM (SELECT COUNT(*) AS nb_tables,
    ->              ROUND(SUM(l.final_size) / POW(2,30), 2) AS final_size6_gib,
    ->              ROUND(SUM(i.final_size) / POW(2,30), 2) AS final_size3_gib,
    ->              ROUND(SUM(CAST(l.final_size AS SIGNED) - CAST(i.final_size AS SIGNED))
    ->                    / POW(2,30), 2) AS delta_gib
    ->       FROM compression_logs l, compression_logs3 i
    ->       WHERE l.id = i.id AND i.state = 'done') AS c;
+-----------+-----------------+-----------------+-----------+-------+
| nb_tables | final_size6_gib | final_size3_gib | delta_gib | ratio |
+-----------+-----------------+-----------------+-----------+-------+
|     35000 |          607.42 |          659.14 |    -51.72 |  8.51 |
+-----------+-----------------+-----------------+-----------+-------+
1 row in set (0.48 sec)
So a compression level of 3 instead of 6 consumes about 8.51% more disk space.  I can guess that using 3 instead of 6 is also faster, but I do not have good data to share on the time taken for compression.

One thing for which I can share visual data is the CPU cycles needed for compression.  On the graph below, the left part shows the compression script running with a level of 9 and the right part with a level of 3.  This shows that using 3 instead of 9 (and probably instead of 6) consumes fewer CPU cycles.  So depending on how abundant (or limited) CPU cycles are in an environment, one could tune the compression level accordingly.


This is all I have on innodb_compression_level.  Personally and for now, I am not planning to use a different value than the default, especially since I know I can divide the size of this data by two by using MyISAM and myisampack.  As a last note, I think that the global scope of this parameter is neither very flexible nor very usable: in some situations, one could want to use a different compression level on different tables.  Like KEY_BLOCK_SIZE, this parameter should probably be specifiable at the table level (and not only in a global variable).  However, as I am not planning to use this for now, I am not opening a feature request for improving this (yet).
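To illustrate the contrast, KEY_BLOCK_SIZE is already a table-level attribute, while the compression level applies to all compressed tables at once (the table t below is hypothetical, and creating it assumes file-per-table tablespaces and the Barracuda file format):

```sql
-- KEY_BLOCK_SIZE is chosen per table...
CREATE TABLE t (id INT UNSIGNED NOT NULL PRIMARY KEY)
  ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;

-- ...but the compression level can only be set globally,
-- affecting every compressed table on the instance.
SET GLOBAL innodb_compression_level = 9;
```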
