Why we still need MyISAM (for read-only tables)

Thursday, August 10, 2017

Why we still need MyISAM (for read-only tables)

TL;DR: we still need MyISAM and myisampack because it uses less space on disk (half of compressed InnoDB) !

In the previous post, I shared my experience with InnoDB table compression on a read-only dataset. In it, I claimed, without giving much detail, that using MyISAM and myisampack would result is a more compact storage on disk. In this post, I will share more details about this claim.

MyISAM is the original storage engine of MySQL. It is a modified version of ISAM for MySQL (My-ISAM). It was replaced by InnoDB as the default storage engine in MySQL 5.5 (first GA release in December 2010). MyISAM is sometimes qualified as "not under active development", unsupported, unmaintained, deprecated, end-of life, and/or other unflattering adjectives. There are even people writing about the end of MyISAM. However in some situations and even if it does not support good concurrency, neither transactions nor foreign keys, MyISAM (and myisampack) can still be very useful. The end of MyISAM would leave a gap in MySQL feature-set. Let's see why.

In the previous post, I described a replicated system storing thousands of read-only tables. Those tables are written as non-compressed InnoDB table on the master and they reach slaves via standard replication. After being written, those tables are converted locally (not writing to the binary logs) to compressed InnoDB tables. This post-compression significantly reduces the storage needed for this system.

MyISAM and myisampack could replace InnoDB table compression on this system. The tables would still be written on the master using the InnoDB storage engine (for replication crash safety), but MyISAM and myisampack would be used instead of InnoDB compression for reducing disk footprint. As compression is done on a copy of the table (atomically swapping/renaming tables when completed), the lack of transaction support in MyISAM is not a problem: if a crash happens during compression, the operation can be restarted from the beginning without losing any data. Also, as the tables are read-only, compressed MyISAM tables can be used (after packing a table using myisampack, the table becomes read-only).

We can use myisampack because our dataset is read-only !

In the previous post, I described a log table where I am keeping information about the size of each compression round. For testing MyISAM and myisampack, I used an extra slave where I added four fields to the compression log table: myisam_size, myisam_size2, mypack_size and mypack_size2. The structure of table becomes the following:

> show create table compression_logs\G
*************************** 1. row ***************************
       Table: compression_logs
Create Table: CREATE TABLE `compression_logs` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `schema_name` varchar(64) NOT NULL,
  `table_name` varchar(64) NOT NULL,
  `initial_size` bigint(20) unsigned NOT NULL,
  `state` enum([...]) NOT NULL DEFAULT 'copying',
  `copied_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_8_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_4_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_2_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_1_size` bigint(20) unsigned DEFAULT NULL,
  `final_size` bigint(20) unsigned DEFAULT NULL,
  `final_kbs` tinyint(4) DEFAULT NULL,
  `myisam_size` bigint(20) unsigned DEFAULT NULL,
  `myisam_size2` bigint(20) unsigned DEFAULT NULL,
  `mypack_size` bigint(20) unsigned DEFAULT NULL,
  `mypack_size2` bigint(20) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB
1 row in set (0.01 sec)

We need two fields for MyISAM and myisampack sizes because, contrarily to InnoDB that stores data in a single .ibd file, MyISAM and myisampack store data in two files: .MYD for the data and .MYI for indexes. I could have stored the sum of both values in a single field but this would have lost some information that could be useful in the future (feel free to ask me creative questions/queries on that data in the comments).

Before going further, let's look again at the InnoDB statistics for the MySQL instance analysed in the previous post:

> SELECT nb_tables, initial_gib, final_gib,
    ->   ROUND(initial_gib / final_gib, 2) AS ratio
    -> FROM (SELECT COUNT(*) AS nb_tables,
    ->              ROUND(SUM(initial_size) / POW(2,30), 2) AS initial_gib,
    ->              ROUND(SUM(final_size)   / POW(2,30), 2) AS final_gib
    ->       FROM compression_logs) AS c;
+-----------+-------------+-----------+-------+
| nb_tables | initial_gib | final_gib | ratio |
+-----------+-------------+-----------+-------+
|    248670 |    20701.84 |   4566.50 |  4.53 |
+-----------+-------------+-----------+-------+
1 row in set (0.15 sec)

This MySQL instance:

contains 248,670 tables
which would use 20.21 TiB of disk space with uncompressed
but they are only using 4.45 TiB after compression
so InnoDB table compression divides storage requirements by 4.53.

How would MyISAM and myisampack compare to above ? To answer that question, I ran a script that, for each table:

takes a MyISAM copy of the table,
saves the MyISAM table sizes in the log table (.MYD and .MYI),
runs myisampack and myisamchk on the table,
and saves the myisampack-ed sizes in the log table.

As shown by the query below and in average, InnoDB uncompressed is using 22% more disk space than MyISAM:

> SELECT nb_tables, initial_gib, myisam_gib,
    ->   ROUND(initial_gib / myisam_gib, 2) AS ratio
    -> FROM (SELECT COUNT(*) AS nb_tables,
    ->              ROUND(SUM(initial_size)             / POW(2,30), 2) AS initial_gib,
    ->              ROUND(SUM(myisam_size+myisam_size2) / POW(2,30), 2) AS myisam_gib
    ->       FROM compression_logs) AS c;
+-----------+-------------+------------+-------+
| nb_tables | initial_gib | myisam_gib | ratio |
+-----------+-------------+------------+-------+
|    248670 |    20701.84 |   16906.99 |  1.22 |
+-----------+-------------+------------+-------+
1 row in set (0.14 sec)

As we have an average compression ratio of 4.53 on this dataset with InnoDB table compression, MyISAM alone is not very interesting. The query below shows more interesting results with myisampack:

> SELECT nb_tables, initial_gib, final_gib, mypack_gib,
    ->   ROUND(initial_gib / mypack_gib, 2) AS r_mypack,
    ->   ROUND(final_gib   / mypack_gib, 2) AS r_i_vs_m
    -> FROM (SELECT COUNT(*) AS nb_tables,
    ->              ROUND(SUM(initial_size)             / POW(2,30), 2) AS initial_gib,
    ->              ROUND(SUM(final_size)               / POW(2,30), 2) AS final_gib,
    ->              ROUND(SUM(mypack_size+mypack_size2) / POW(2,30), 2) AS mypack_gib
    ->       FROM compression_logs) AS c;
+-----------+-------------+-----------+------------+----------+----------+
| nb_tables | initial_gib | final_gib | mypack_gib | r_mypack | r_i_vs_m |
+-----------+-------------+-----------+------------+----------+----------+
|    248670 |    20701.84 |   4566.50 |    1987.52 |    10.42 |     2.30 |
+-----------+-------------+-----------+------------+----------+----------+
1 row in set (0.17 sec)

So myisampack average compression ratio on the initial size (InnoDB uncompressed) is 10.42, which is 2.3 times better than InnoDB compressed. I do not have formal data on the next statement, but I also think that myisampack takes less time to compress a table, especially that with InnoDB and to obtain the best compression ratio, we have to try many KEY_BLOCK_SIZE, which takes time.

One last thing before closing: let's check if there are cases where MyISAM or myisampack is using more disk space than InnoDB and InnoDB compression:

> SELECT SUM(myisam_bigger) as nb_myisam_bigger, SUM(mypack_bigger) as nb_mypack_bigger
    -> FROM (SELECT IF(myisam_size+myisam_size2 > initial_size, 1, 0) as myisam_bigger,
    ->              IF(mypack_size+mypack_size2 > final_size,   1, 0) as mypack_bigger
    ->       FROM compression_logs) as c;
+------------------+------------------+
| nb_myisam_bigger | nb_mypack_bigger |
+------------------+------------------+
|                0 |                0 |
+------------------+------------------+
1 row in set (0.19 sec)

I think this clearly shows that we still need MyISAM and myisampack for efficient storage of read-only tables.

A disclaimer: I am not yet using MyISAM and myisampack for the storage of this particular dataset in production. However, I might switch from InnoDB compression to MyISAM and myisampack because, as shards are filling-up, it might be better to change the storage format than to allocate more shards/hardware for this system.

12 comments:

Alexey KopytovAugust 10, 2017 at 6:49:00 AM EDT
Thanks, I've been looking for a use case for MyISAM for a long time, and this one looks convincing.

Still, as a caveat for readers, MyISAM has some drawbacks that may be important even in this specific use case (compressed read-only data):

- lack of checksums (=> silent data corruptions!)
- partitioning is deprecated in 5.7, will be impossible in 8.0+
- non-efficient backups (in case the server contains a mix of read-only MyISAM tables and updatable tables in other storage engines)
- lack of clustered indexes (would be interesting to compare PK lookups between compressed InnoDB and compressed MyISAM)
ReplyDelete
Replies
UnknownAugust 10, 2017 at 10:58:00 PM EDT
If taking TokuDB and Myrocks into consider, is the data size of MyISAM still the smallest one？
ReplyDelete
Replies
Giuseppe MaxiaAugust 11, 2017 at 4:33:00 AM EDT
Very good analysis, JF!
Compressed MyISAM is going to be the only cheap alternative for read-only tables coming MySQL 8.0 (until it is deprecated and then dismissed, that is.)

The other viable case for slave-side storage was archive + partitioning, which compresses data up to 90% and allows statistical queries (i.e. aggregate queries based on the partitioning fields) to be as fast as InnoDB, if not faster. This solution is now dead, as partitioning is only supported for InnoDB.
Now archive tables are only good for queries that need to scan the whole table for statistics, or access records via PK (only auto_incremented integer keys allowed, though).

ReplyDelete
Replies
UnknownAugust 14, 2017 at 1:45:00 PM EDT
I do not dispute that MyISAM has smaller footprint than InnoDB. However, I have various comments on the topic and the comments.

* From numerous measurements, I have derived this Rule of Thumb: "The disk footprint for InnoDB is 2x-3x that of MyISAM." (There are, of course, exceptions.)

* Clustering of PK can be simulated by adding suitable composite indexes to a MyISAM table.

* Why is backup a problem? If the MyISAM tables are readonly, back them up only once.

* If you want serious compression, consider TokuDB or ColumnStore. Both tout 10x compression.

* As with most things, this entire discussion is about a niche market. Although the typical database has a few tiny readonly tables, very few production systems have lots of big readonly tables. (And hence the admonition to turn off the Query Cache.)

* 200K tables is awful on the OS. (This will be magically fixed in 8.0.)

* Partitioning is rarely useful. In most cases, some tweaking to the indexes can turn a partitioned table into a plain table with equal or slightly better performance. And less disk space.

* InnoDB already has tweaks to allow for non-transactional temp tables; seems like this could be extended to readonly tables?

-- Rick James.

ReplyDelete
Replies
UnknownAugust 15, 2017 at 4:07:00 PM EDT
I also once had a system storing thousands of write-once read-only tables. Each table contained 24hrs of log data and most of the time was unlikely to be read more than 3 times before it was purged, when it was read it would typically be to aggregate all data in the table, so a SELECT operation needing to perform a complete table scan was not a problem. I mention it because the compression ratio obtained with ARCHIVE table format was better than a myisampacked table.

.) https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html
.) https://www.percona.com/blog/2014/01/06/archive-storage-engine-expect/
.) https://www.percona.com/blog/2013/02/11/adventures-in-archiving/
.) http://web.archive.org/web/20100212100829/http://dev.mysql.com/tech-resources/articles/storage-engine.html
.) https://www.percona.com/blog/2006/11/12/trying-archive-storage-engine/

If I was to have to do it today I would test ZSTD compression ratios with different compression levels with MyRocks (LZ4 compression on RocksDB L0-1).

https://github.com/facebook/mysql-5.6/wiki/MyRocks-advantages-over-InnoDB#myrocks-requires-less-storage-space-than-innodb
ReplyDelete
Replies
RobAugust 16, 2017 at 12:59:00 AM EDT
We saw similar results at my last job with zlib lib levels with column compression: https://medium.com/@Pinterest_Engineering/evolving-mysql-compression-part-2-2c3eb0101205
ReplyDelete
Replies

Add comment