Monday, July 24, 2017

How far can you go with MySQL or MariaDB ?

MySQL theoretical limits are known and they can be found in the manual, they include:
But what can be done in practice ?  How far can you go with MySQL or with MariaDB ?

Those questions were raised while discussing with a colleague doing Professional Services and Consulting on MySQL.  He was reporting hearing too often unjustified statements like "MySQL surely cannot store that much data" for some big amount of "that much".  I told him that I would share some numbers on the big deployments I know about.  Here they are with some things to consider when managing such large deployments.

In a previous post, I talked about a table that is 846 GiB on disk.  This is a big table but I know about a bigger one:
# size_in_gib="ceil( (DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024 )"
# query="select $size_in_gib, TABLE_SCHEMA, TABLE_NAME
         from information_schema.TABLES"
# mysql -N <<< "$query" | sort -nr | head -n 1
8015   schema   table

# ls -lh /data_dir/schema/table.ibd
-rw-rw---- 1 user group 8.2T Jul 22 11:14 /data_dir/schema/table.ibd
This table size is 8.2 TiB !  It is an InnoDB table that does not use partitioning (a single ibd file).  MySQL works well with such big tables, but operations on them are tricky.  Even if MySQL is getting better at doing in-place ALTERs, I hope to never have to do a schema change on such a big table: it would take a long time to complete and it would need a lot of disk space using an online schema change from the Percona Toolkit, especially with RBR binary logs.

So MySQL can store a lot of data in a single table but can it store a lot of tables ?  I remember a test done with MySQL 8.0.0 DMR with one million tables but this does not qualify as a real production environment (yet).  I once saw a production system with about one million tables but I am unable to get an exact count because it has been decommissioned.  However I know the system below is currently running and still growing:
# exclude_list="'information_schema', 'mysql', 'performance_schema', 'sys', 'test'"
# query="select count(TABLE_NAME)
         from information_schema.TABLES
         where TABLE_SCHEMA not in ($exclude_list)"
# mysql -N <<< "$query"
336559
This system is a replicated environment containing more than 300K tables that are mostly of type compressed InnoDB.  The tables are written uncompressed on the master, and they are then never updated.  After being written, a script compresses the data locally (not writing to binary logs) finding the best KEY_BLOCK_SIZE for getting the smallest disk usage.  The uncompressed to compressed ratio for this system is about 4.3 to 1, so InnoDB compression saves a lot of disk space here.  I know that using MyISAM with myisampack might even give a better compression ratio but I am diverging from the subject of this post (I wrote a post Why we still need MyISAM dedicated to that subject) (Update 2017-08-07: I wrote about this use case of InnoDB Table Compression in another post).

The obvious pain points on systems with so many tables are on operations that needs scanning all tables, opening all table definition files (frm) and sometimes reading the header of all the table files (ibd for InnoDB).  Some operations on information_schema and crash recovery are doing that (hopefully, you do not get too nervous waiting for recovery to complete after a crash but, like for online ALTERs, things are also getting better).  Upgrading MySQL also scans all tables (mysql_upgrade) and other maintenance operations could scan all tables.  This reminds me of Bug#85830 that I opened some time ago where scanning 191K tables was delaying MySQL startup by 20 minutes for a deprecation test (having no sign-of-life for 20 minutes on startup is stressful).  The MySQL 8.0 new data dictionary will hopefully reduce some of those pain points.

Back to limits in practice: MySQL can store big tables and a lot of tables but what about the total amount of data it can store ?  Well, I know of this system:
$ df -h $(grep ^datadir /etc/my.cnf | awk '{print $3}')
Filesystem            Size   Used   Avail   Use%   Mounted on
/dev/mapper/vg-lv      xxx    37T     yyy    zzz   /data_dir1
The 37 TiB system above is a mix of InnoDB and MyISAM tables.  MyISAM is used for its compact storage of data and InnoDB is used for atomic transactions (including replication crash safety).  The MyISAM tables are not replicated: they are loaded locally on each member of the replication topology without writing to the binary logs.  This multi-write framework is interesting: it has been developed to bypass replication for LOAD DATA INFILE which was causing replication delay; I might write about it in the future.

When running such a large system in a replicated environment, creating a new slave is not simple.  Copying 37 TiB on a 1 Gigabit Ethernet network takes about 4.5 days and it is worse if you need to do this on a slow WAN.  Keeping a slave stopped for so long, or doing a hot back, might not be feasible.  Hosting the database on a filesystem or a storage system that provides efficient snapshots is probably the best thing to do.

I have heard of a system even bigger than 37 TiB but it has been replaced by a more maintainable system.  Below is a df of this system before being decommissioned:
$ df -h $(grep ^datadir /etc/my.cnf | awk '{print $3}')
Filesystem            Size   Used   Avail   Use%   Mounted on
/dev/mapper/vg-lv      xxx   205T     yyy    zzz   /data_dir2
This instance of 205 TiB was storing hundreds of thousands of uncompressed InnoDB tables without using replication.  It was replaced by a sharded replicated system (the instance with more than 300K tables above is one of those shards).

The reason this huge instance was replaced is that maintenance is tedious on such a big system (upgrades, backups, ...).  Storage snapshot might help but this is only partially reassuring.  When touching such system, the three questions of good system administration should be answered very seriously:
  1. Why am I doing this operation and is it worth the risk ?
  2. Am I able to detect that things go right or that things go wrong ?
  3. If things go wrong, am I able to fix them (which probably involves rolling back) ?
I hope all above gave you some sense on how serious MySQL can be.  Feel free to ask questions or to share your big system sizes in the comments but keep in mind that my goal is not to start a "I have a bigger system than you" discussion, I want to demonstrate that MySQL is a serious database.

10 comments:

  1. Quite interesting and helpful. Thanks for sharing.

    ReplyDelete
  2. RIP "... this system before being decommissioned"

    ReplyDelete
    Replies
    1. and the children are not as fat as the ancestor: the InnoDB compression diet did a good job, and a MyISAM diet would be even better. I will write about this soon.

      Delete
  3. interesting... I have a single MyISAM 100Gb 300 millions records table that performs terribly poorly on group by statements with constant use of tmp tables. pretty much inside the theoretical limits, but not practically useful.

    ReplyDelete
  4. Btw, just for the title of this post: MySQL Vitess lets you scale MySQL like NoSQL

    ReplyDelete
    Replies
    1. +1 for Vitess. It's incredible the way it gives you the best of both worlds. Most operations end up in a single shard, so you have no performance impact and still have the benefit of built in MySQL joins and relational integrity. It is built to be container-native, so the recommended usage is to split into 100-300 GB chunks, which simultaneously makes schema changes take 1/(n shards) of the time for a normal ALTER TABLE.

      Delete
  5. Thank you sharing this,I have never thought such thing can be possible and exists.

    ReplyDelete