I am continuing my blog post series on using indexes — or tables — as queues. In this post, I cover Row Deletion Jobs (I do not call these purge jobs, to avoid confusion with the InnoDB Purge). Such jobs are tempting to implement using an index, but this might be a wrong / suboptimal way. I write about the right / better / cheaper way below, with cheaper meaning potentially significant savings on a Cloud bill !
One way of implementing a Row Deletion Job is by scanning an index (the index is some sort of queue for the job, hence me naming such use-case using an index as a queue). The rows to delete have an indexed flag / column indicating they should be deleted, and optionally some other indexed column / information related to the deletion (like logical delete time, last used, last updated, expiration time, etc...).
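As a hypothetical illustration (table and column names are mine, not from any specific system), such a schema could look like this, with the secondary index acting as the queue:

```sql
-- Hypothetical schema : rows become deletable when expired_at is set.
CREATE TABLE messages (
  id         BIGINT NOT NULL AUTO_INCREMENT,
  payload    VARCHAR(255),
  expired_at DATETIME NULL,      -- set when the row becomes deletable
  PRIMARY KEY (id),
  INDEX idx_expired (expired_at) -- the "queue" the deletion job scans
);
-- The tempting deletion, driven by the index :
-- DELETE FROM messages WHERE expired_at < NOW() - INTERVAL 30 DAY;
```

This is the pattern the rest of the post argues against when done at scale.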
One important consideration when deleting rows via an index, and the subject of the previous post, is to Mind the InnoDB Purge. The TL;DR is to avoid using DELETE LIMIT because scanning delete-marked rows is of complexity O(n^2). The full details are in the previous post.
In annex #1, I initialize a large-enough table with an indexed column set to modulo 50 of the auto-increment primary key (the name of the column is v, as in value). Then, I delete 10% of the rows via the index (v < 5; doing this in 10 transactions to simulate doing it in small chunks). I claim this is a wrong / suboptimal way of implementing a Row Deletion Job. To show in which way it is wrong, also in annex #1, I scale this to modulo 100, and the time required to delete 10% of the rows doubles (only when the table does not fit in the InnoDB Buffer Pool; from 4:17 to 8:33; with the InnoDB Purge making things more complicated : all this is described in annex #1). Because the problem is IO related, it becomes obvious when looking at how many pages are read from disk : from 1138 K for modulo 50 to 2322 K for modulo 100 (disabling read-ahead is required to get accurate numbers). So the number of IOs doubles, like the timing, but this table is only 4.4 GiB, about 288 K pages : what is happening here ?
Deleting via an index might scan the table many times !
In my examples — modulo 50 (and 100) and deleting v < 5 (and v < 10) — I am scanning the table respectively 5 and 10 times. This is by construction and might look a little artificial, but it is close enough to the situation where DELETEs via an index are randomly distributed in the table and revisit the same InnoDB pages. Let's explain this in more detail.
InnoDB stores data in B-Trees with a fixed page size of 16 KiB by default (which can be changed at database initialization with the parameter innodb_page_size). If we need to delete more than one row in a page, and if the index is not ordering the deletion in such a way that the co-located rows are deleted at the same time, InnoDB goes through many cycles of reading the page from disk, modifying it, and flushing / writing back the page to disk. This is what I call scanning the table many times !
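The repeated read-modify-flush cycles can be illustrated with a toy model (this is NOT InnoDB itself, just counting page touches under stated assumptions : 1000 rows, 100 rows per 16 KiB page, v = id % 50, deleting v < 5 one v value at a time, in the order the index presents the rows):

```shell
#!/bin/sh
# Toy model : deleting in index order (v = 0, then v = 1, ...) touches
# every page once per v value; deleting in PK order touches each page once.
rows=1000; per_page=100; modulo=50
touches_by_index=0
v=0
while [ "$v" -lt 5 ]; do
  last_page=-1
  id=$v
  while [ "$id" -lt "$rows" ]; do
    page=$((id / per_page))
    if [ "$page" -ne "$last_page" ]; then
      touches_by_index=$((touches_by_index + 1))
    fi
    last_page=$page
    id=$((id + modulo))
  done
  v=$((v + 1))
done
touches_by_pk=$((rows / per_page))   # a single pass over the clustered index
echo "page reads via index: $touches_by_index"  # 5 passes x 10 pages = 50
echo "page reads via PK:    $touches_by_pk"     # 10
```

With pages large enough to hold rows of several v values, the index-driven deletion reads each page once per v value : a 5x blowup in this toy model, matching the "scanning the table many times" effect.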
This does not happen if the index only stores a flag about the row needing to be deleted (0 or 1, FALSE or TRUE), because the index is ordered in the same way as the table and pages will not be revisited. But it does happen if a scattered / random distribution is introduced by having a logical delete time / last used / last updated / expiration time / ... column in the deletion index.
This is a lot of IOPS,
is significant on a Cloud bill,
and will challenge SSD endurance !
Then what is the right way for implementing a Row Deletion Job ? It is simply by scanning the table only once, via its Primary Key / Clustered Index, without revisiting pages, and by deleting all co-located rows at once (if not at once, close enough to be done before flushing and eviction). This is what I show in annex #2 (I was lazy and used 10 transactions, the right way of doing this would be with range processing, like presented in annex #5 of the previous post on Minding the InnoDB Purge).
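A minimal SQL sketch of that single-pass approach, using the table of the annexes (n = 40 million rows, chunks of n/10; a real job would compute the ranges by walking the Primary Key, as in the range processing of the previous post):

```sql
-- One chunk of the single-pass deletion : restrict the deletion condition
-- to a Primary Key range, so co-located rows are deleted together and
-- each InnoDB page is read, modified, and flushed only once.
DELETE FROM t1
WHERE v < 5                      -- the deletion condition
  AND id >= 1 AND id < 4000001;  -- current PK chunk
-- Next chunk : id >= 4000001 AND id < 8000001, and so on.
```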
This is another good use-case for OFFSET !
But I need to bring your attention to Bug #111367. In this bug, a query with the below where-clause is not as fast as we would expect...
(pk1 = $pk1 AND pk2 >= $pk2 OR pk1 > $pk1)
...so it needs to be split in two queries with the below where-clauses (or many IGNORE INDEX hints need to be added to the query; forcing INDEX(PRIMARY) does not help).
Query #1: (pk1 = $pk1 AND pk2 >= $pk2)
Query #2: (pk1 > $pk1)
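With a composite Primary Key (pk1, pk2), the workaround could look like the sketch below ($pk1 and $pk2 hold the last processed position; the selected columns and LIMIT are illustrative):

```sql
-- Query #1 : rows remaining in the current pk1 value.
SELECT pk1, pk2 FROM t
WHERE pk1 = $pk1 AND pk2 >= $pk2
ORDER BY pk1, pk2 LIMIT 1000;

-- Query #2 : rows in the following pk1 values.
SELECT pk1, pk2 FROM t
WHERE pk1 > $pk1
ORDER BY pk1, pk2 LIMIT 1000;
```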
This bug is very annoying; I wish it would get more attention, because it makes implementing efficient Row Deletion Jobs more complicated.
These Jobs Done Right might allow Dropping an Index !
Back to doing things right (not scanning an index) : if this was the only use-case for that index, implementing a Row Deletion Job the right way allows dropping this index. This means less disk usage, and less index-maintenance overhead when inserting, updating and deleting rows !
You might think that if less than one row per page is deleted, then using an index is not such a bad thing. But then think again about why rows are being deleted ! Is deleting less than one row per page actually worth it ? It only generates free space in pages, and it does not usually trigger enough page-merge to generate significant free pages (I cover free space in pages and page-merge in a previous post : InnoDB Basics - Compaction: when and when not). IMHO, just deleting one row here and there is a waste of System Resources (CPU, IO, replication capacity, ...). At scale, it is worth waiting a little more to schedule a Row Deletion Job so more than one row per InnoDB Page is deleted, and so System Resources spent on the job are actually worth it. It is especially true when running in the Cloud and IOPS and IOs are expensive !
Also remember that for small rows (50 to 150 bytes), deleting just two rows per page is not enough to trigger a page-merge, so this will not generate a free page, and does not allow reusing storage. With an average row size of 100 bytes, a 16 KiB page can store 160 rows, so you need to delete 80 of these to generate a free page. Consuming CPU, IOs, replication throughput (CPU and IOs on replicas), ... for just deleting 2 or even 10 such rows per page might not be the right thing to do at scale. It might be better to leave these rows there, and schedule the Row Deletion Job less often.
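A back-of-envelope check of these numbers (ignoring page header and record overhead, which is why the post rounds down to about 160 rows per page; 50 is the default InnoDB index MERGE_THRESHOLD, in percent):

```shell
#!/bin/sh
# Rough estimate of how many deletes per page are needed before a
# page-merge can free a page (overheads ignored, so numbers are upper-ish).
page_size=16384   # default innodb_page_size, in bytes
row_size=100      # average row size from the example above
merge_pct=50      # default index MERGE_THRESHOLD, in percent
rows_per_page=$((page_size / row_size))
deletes_per_freed_page=$((rows_per_page * merge_pct / 100))
echo "rows per page: $rows_per_page"                          # 163
echo "deletes needed per freed page: $deletes_per_freed_page" # 81
```

So deleting 2 or even 10 rows per page is far below the threshold at which storage is actually reclaimed.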
Also take into account that if running a Row Deletion Job daily, there are only 24 hours to run it before the job's next scheduled run. When running it monthly, we have 30 days to run the job, and probably any table can be slowly scanned over a period of 30 days.
And as I wrote in my previous post : yes, it is a leaky abstraction. I would prefer not to have to mind the inner workings of InnoDB, but they are sometimes brought to the surface, causing performance problems, and we have to deal with them. PostgreSQL or MyRocks will have different problems, so optimizing for them is different. And remember that premature optimization is the root of all evil : only optimize when solving an actual problem ! Deleting via an index is not actually wrong in itself (I admit I am overselling this) : it becomes wrong when out-of-order DELETEs at scale exhaust resources, or force you to scale IOPS, and potentially pay a lot of money to your Cloud Provider.
Annex #1 : Batch Row Deletion Done Wrong
Below commands were run on a m6id.xlarge AWS instance (16 GiB RAM with local SSDs) running Debian 12.
For the commands below to give matching timing (delete_index_100 taking longer than delete_index_50 because of IOs), reading the 4.6 GiB table not fitting in the InnoDB Buffer Pool must generate actual IOs (local SSDs are ok, no need for an EBS volume). If changing MySQL 8.4 to 8.0, the timing might not be right, but the number of IOs should convince you of the difference (I cover the reason in a previous post : More than Flushing for innodb_flush_method). For having similar timing on 8.0 and on Linux, either change innodb_flush_method, or use my trick for simulating a server with less RAM (I have not tested these, I am making an educated guess). On MacOS and because of caching, I am only able to observe IO numbers, not the doubled timing (I did not try hard, I might have been able to reproduce with a table larger than my 24 GB RAM, but this would have taken more time than I had).
# Create a sandbox for our tests.
# (no need for binlogs, so let's skip them for not having to mind disk space)
# (the pv command is a trick to time command execution)
{
v=mysql_8.4.8; d=${v//./_}
dbda="-c skip-log-bin"
dbdeployer deploy single $v $dbda | pv -tN dbdepl. > /dev/null
cd ~/sandboxes/msb_$d
}
dbdepl.: 0:00:13

# Create a table for our tests, fill it, and save it.
# (CHAR(60) for simulating data in the table)
# (saving the table for restoring it after destructive tests)
# (two FLUSH TABLE for the second to be quick)
# (CREATE TABLE t1_bak for keeping a copy of the table structure)
# (the functions will be useful in the next steps and annex)
n=$((40 * 1000 * 1000)); {
./use <<< "
  SET GLOBAL innodb_buffer_pool_size = 12*1024*1024*1024;
  CREATE DATABASE test_jfg;
  CREATE TABLE test_jfg.t1(id BIGINT PRIMARY KEY, c CHAR(60) DEFAULT '')"
seq 1 $n | sed -e 's/.*/(&)/' |
  paste -s -d "$(printf ',%.0s' {1..1000})\n" |
  sed -e 's/.*/INSERT INTO t1 (id) values &;/' |
  ./use test_jfg | pv -tN insert
./use test_jfg <<< "FLUSH TABLE t1 FOR EXPORT" | pv -tN export
./use test_jfg <<< "
  DROP TABLE IF EXISTS t1_bak;
  CREATE TABLE t1_bak LIKE t1;
  FLUSH TABLE t1 FOR EXPORT;
  system cp data/test_jfg/t1.cfg data/test_jfg/t1.cfg.bak
  system pv -btrae data/test_jfg/t1.ibd > data/test_jfg/t1.ibd.bak"
function import_add_index_flush() {
  ./use test_jfg <<< "
    DROP TABLE IF EXISTS t1;
    CREATE TABLE t1 like t1_bak;
    ALTER TABLE t1 DISCARD TABLESPACE"
  ( cd data/test_jfg; pv -btrae t1.ibd.bak > t1.ibd; cp t1.cfg.bak t1.cfg; )
  ./use test_jfg <<< "ALTER TABLE t1 IMPORT TABLESPACE" | pv -tN import
  ./use test_jfg <<< "ALTER TABLE t1 ADD COLUMN v INT as (id % $1) VIRTUAL"
  ./use test_jfg <<< "ALTER TABLE t1 ADD INDEX v (v)" | pv -tN index
  ./use test_jfg <<< "FLUSH TABLE t1 FOR EXPORT" | pv -tN export
  ./use test_jfg <<< "
    DROP TABLE IF EXISTS t1_bak_$1;
    CREATE TABLE t1_bak_$1 LIKE t1;
    FLUSH TABLE t1 FOR EXPORT;
    system cp data/test_jfg/t1.cfg data/test_jfg/t1.cfg.bak_$1
    system pv -btrae data/test_jfg/t1.ibd > data/test_jfg/t1.ibd.bak_$1"
}
function import_index() {
  ./use test_jfg <<< "
    DROP TABLE IF EXISTS t1;
    CREATE TABLE t1 like t1_bak_$1;
    ALTER TABLE t1 DISCARD TABLESPACE"
  ( cd data/test_jfg; pv -btrae t1.ibd.bak_$1 > t1.ibd; cp t1.cfg.bak_$1 t1.cfg; )
  ./use test_jfg <<< "ALTER TABLE t1 IMPORT TABLESPACE" | pv -tN import
}
}
insert: 0:02:11
export: 0:00:00
3.60GiB 0:00:12 [ 292MiB/s] [ 292MiB/s]

# Let's delete 10% of rows in 10 transactions via an index.
# (the 10 transactions is for delete_index_100 below to be comparable)
# (the index is built on modulo 50 of id, the reason will become obvious)
{
function delete_index_50() {
  local i sql
  for i in {0..4}; do
    ./use test_jfg <<< "DELETE FROM t1 WHERE v = $i AND id < $(($n/2))"
    ./use test_jfg <<< "DELETE FROM t1 WHERE v = $i AND id >= $(($n/2))"
  done | pv -tN "DELETE 5"
}
import_add_index_flush 50
delete_index_50
}
3.60GiB 0:00:25 [ 141MiB/s] [ 141MiB/s]
import: 0:00:49
index: 0:01:22
export: 0:00:00
4.40GiB 0:00:24 [ 187MiB/s] [ 187MiB/s]
DELETE 5: 0:03:52

# Again, deleting 10% of the rows, with a variation on the index.
# (the index is now built with modulo 100, the reason will become obvious)
{
function delete_index_100() {
  local i sql
  for i in {0..9}; do
    ./use test_jfg <<< "DELETE FROM t1 WHERE v = $i"
  done | pv -tN "DELETE 10"
}
import_add_index_flush 100
delete_index_100
}
3.60GiB 0:00:19 [ 193MiB/s] [ 193MiB/s]
import: 0:00:43
index: 0:01:22
export: 0:00:00
4.40GiB 0:00:26 [ 171MiB/s] [ 171MiB/s]
DELETE 10: 0:03:58

# Let's redo both of the above with a smaller Buffer Pool and observing buffer_pool_reads.
# (shown further below, buffer_pool_reads is not reliable without disabling Read-Ahead)
# (also shown further below, timing is not reliable without blocking the purge)
# The timing for delete_index_100 is double the one of delete_index_50,
# and delete_index_100 is doing twice as many IOs as delete_index_50.
{
function wait_purge() {
  while sleep 1; do
    local sql="select COUNT from INNODB_METRICS where NAME = 'trx_rseg_history_len'"
    local trx_hist=$(./use -N information_schema <<< "$sql")
    test $trx_hist -eq 0 && break
  done | pv -tN purge
}
function block_purge() {
  (( ./use test_jfg <<< "BEGIN; SELECT * FROM t1 LIMIT 1; DO SLEEP(60*60*60)" > /dev/null &
     touch purge_blocked
     while sleep 1; do test -e purge_blocked || break; done
     kill %1 )&)
}
function unblock_purge() { rm purge_blocked; }
function init_metric() {
  local c; metric=$1
  for c in disable reset_all enable; do ./use <<< "SET GLOBAL innodb_monitor_$c = $metric"; done
  sql_metric="SELECT COUNT FROM INNODB_METRICS WHERE NAME = '$metric'"
  c1="$(./use -N information_schema <<< "$sql_metric")"
}
function show_metric() {
  local c2="$(./use -N information_schema <<< "$sql_metric")"
  echo "$metric: $c2 - $c1 = $(($c2 - $c1))"
  unset metric sql_metric c1
}
./use <<< "SET GLOBAL innodb_read_ahead_threshold = 0, innodb_buffer_pool_size = 512*1024*1024"
while sleep 1; do
  sql="SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_resize_status'"
  ./use -N <<< "$sql" | grep -q -e "Completed resizing" -e "Size did not change" && break
done | pv -tN resize
echo; import_index 50
wait_purge; block_purge; init_metric buffer_pool_reads
delete_index_50
show_metric; unblock_purge
echo; import_index 100
wait_purge; block_purge; init_metric buffer_pool_reads
delete_index_100
show_metric; unblock_purge
}
resize: 0:01:35

4.40GiB 0:00:16 [ 273MiB/s] [ 273MiB/s]
import: 0:01:11
purge: 0:00:12
DELETE 5: 0:04:17
buffer_pool_reads: 1138137 - 0 = 1138137

4.40GiB 0:00:16 [ 275MiB/s] [ 275MiB/s]
import: 0:01:11
purge: 0:02:02
DELETE 10: 0:08:33
buffer_pool_reads: 2322351 - 0 = 2322351

# To show the impact of Read-Ahead, let's redo above while it is enabled.
# (this is just out of curiosity, not actually needed for the demonstration)
# We see similar timing as above (InnoDB Read-Ahead is not helping much here,
# it might on slower storage), without the clear doubling of IOs
# (these IOs are probably accounted elsewhere; buffer_pool_read_ahead ?).
{
./use <<< "SET GLOBAL innodb_read_ahead_threshold = DEFAULT"
import_index 50
wait_purge; block_purge; init_metric buffer_pool_reads
delete_index_50
show_metric; unblock_purge
echo
import_index 100
wait_purge; block_purge; init_metric buffer_pool_reads
delete_index_100
show_metric; unblock_purge
}
4.40GiB 0:00:13 [ 326MiB/s] [ 326MiB/s]
import: 0:01:05
purge: 0:00:30
DELETE 5: 0:04:17
buffer_pool_reads: 83591 - 0 = 83591

4.40GiB 0:00:13 [ 325MiB/s] [ 325MiB/s]
import: 0:01:05
purge: 0:00:02
DELETE 10: 0:08:30
buffer_pool_reads: 300881 - 0 = 300881

# To show the impact of blocking the purge, let's redo above without blocking it.
# (this is just out of curiosity, not actually needed for the demonstration)
# We now see delete_index_50 being as slow as delete_index_100,
# probably because of contention with the purge.
# One source of contention could be IOs, the purge needing them and competing with the DELETEs.
# (we still wait for the purge to start the test without any relics of the previous test)
{
./use <<< "SET GLOBAL innodb_read_ahead_threshold = 0"
import_index 50
wait_purge; init_metric buffer_pool_reads
delete_index_50
show_metric
echo
import_index 100
wait_purge; init_metric buffer_pool_reads
delete_index_100
show_metric
}
4.40GiB 0:00:13 [ 331MiB/s] [ 331MiB/s]
import: 0:01:11
purge: 0:00:20
DELETE 5: 0:07:56
buffer_pool_reads: 2067063 - 0 = 2067063

4.40GiB 0:00:14 [ 311MiB/s] [ 311MiB/s]
import: 0:01:11
purge: 0:02:02
DELETE 10: 0:08:35
buffer_pool_reads: 2238481 - 0 = 2238481
Annex #2 : Batch Row Deletion Done Right
Below needs to be run after annex #1.
# Deleting 10% of the rows the right way.
# We see much faster execution of delete_right than delete_index_50 and _100,
# significantly less IOs, and delete_right 5 and 10 having similar timing and IOs.
# (below is done in 10 transactions, normally we would do this with OFFSET and range,
# like what was presented in the previous post, I am lazy for the demonstration)
{
function delete_right() {
  local i min sql nb_trx=10
  for i in $(seq 0 $(($nb_trx-1))); do
    min="$((1 + $i * $n/$nb_trx))"
    sql="DELETE FROM t1 WHERE v < $1 AND id >= $min AND id < $(($min + $n/$nb_trx))"
    ./use test_jfg <<< "$sql"
  done | pv -tN "DELETE $1"
}
./use <<< "SET GLOBAL innodb_read_ahead_threshold = 0"
import_index 50
wait_purge; block_purge; init_metric buffer_pool_reads
delete_right 5
show_metric; unblock_purge
echo
import_index 100
wait_purge; block_purge; init_metric buffer_pool_reads
delete_right 10
show_metric; unblock_purge
}
4.40GiB 0:00:14 [ 311MiB/s] [ 311MiB/s]
import: 0:01:12
purge: 0:02:02
DELETE 5: 0:01:00
buffer_pool_reads: 230246 - 0 = 230246

4.40GiB 0:00:13 [ 331MiB/s] [ 331MiB/s]
import: 0:01:11
purge: 0:00:36
DELETE 10: 0:00:59
buffer_pool_reads: 237609 - 0 = 237609

# For completeness and out of curiosity, redo the above without blocking the purge.
# There is something interesting about IOs in below,
# understanding it is left as an exercise to the reader
# (hint: deleting in more than 10 transactions will impact these IOs).
{
import_index 50
wait_purge; init_metric buffer_pool_reads
delete_right 5
show_metric
echo
import_index 100
wait_purge; init_metric buffer_pool_reads
delete_right 10
show_metric
}
4.40GiB 0:00:13 [ 329MiB/s] [ 329MiB/s]
import: 0:01:11
purge: 0:02:01
DELETE 5: 0:01:44
buffer_pool_reads: 422821 - 0 = 422821

4.40GiB 0:00:14 [ 310MiB/s] [ 310MiB/s]
import: 0:01:11
purge: 0:02:01
DELETE 10: 0:01:47
buffer_pool_reads: 424277 - 0 = 424277