Monday, December 13, 2021

Trick to Simulate a Linux Server with less RAM

I created the first draft of this post many years ago.  At that time, I was working with physical servers having 192 GB of RAM or more.  On such systems, doing memory pressure tests with MySQL is complicated.  I used a trick to simulate a Linux server with less RAM (it also works with vms, but probably not with Kubernetes or containers).  I recently needed the trick again, and as I will refer to it in a future post, now is a good time to complete and publish this.  TL;DR: huge pages...

Let's say you want to do a test in a memory-constrained environment and you only have a server (or a virtual machine) with a lot of RAM.  Your first thought might be to allocate a small InnoDB Buffer Pool, but this will probably not be enough.  On a server with 10+ GB of RAM, even with a buffer pool smaller than 1 GB, a 5 GB table could still "fit in RAM" because of the Linux page cache.  It is easier to understand this with an example, so let's run a few commands on a Linux vm.

The test vm has 16 GB of RAM with swap disabled (commas added for readability):
grep -e MemTotal -e SwapTotal /proc/meminfo
MemTotal:       16,284,596 kB
SwapTotal:               0 kB
After boot, most of this RAM is free and some is used as cache:
grep -e MemFree -e "^Cached" /proc/meminfo
MemFree:        15,774,808 kB
Cached:            300,068 kB
After starting a dbdeployer sandbox and running a few commands in MySQL, we have the following file (list of commands at the end of the post):
ls -lh data/sbtest/sbtest1.ibd
-rw-r----- 1 jgagne jgagne 5.9G Dec 13 21:30 data/sbtest/sbtest1.ibd
For the above file on this vm, a second scan is much faster than the first because of the page cache (less than 1 second for scanning the file with a warm cache vs. more than 4 minutes with a cold cache):
# Make sure the page cache is empty:
sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
grep "^Cached" /proc/meminfo
Cached:            42,260 kB

# Reading the whole file with a cold cache takes more than 4 minutes:
time cat data/sbtest/sbtest1.ibd > /dev/null

real    4m26.106s
user    0m0.034s
sys     0m1.858s

# But the previous command loaded the file in the page cache:
grep "^Cached" /proc/meminfo
Cached:          6,245,736 kB

# And with a warm cache, reading the file takes less than 1 second:
time cat data/sbtest/sbtest1.ibd > /dev/null

real    0m0.818s
user    0m0.004s
sys     0m0.813s
Another test with ioping shows similar results (3.99 s for doing 1024 reads after dropping the cache, and only 1.37 ms after warming up the cache):
sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
grep "^Cached" /proc/meminfo
Cached:            42,392 kB

# Do 1024 read IOs with an empty cache (-C to allow using the cache):
ioping -c 1024 -i 0 -qC data/sbtest/sbtest1.ibd

--- data/sbtest/sbtest1.ibd (ext4 /dev/nvme0n1p1) ioping statistics ---
1.02 k requests completed in 3.99 s, 4.00 MiB read, 256 iops, 1.00 MiB/s
generated 1.02 k requests in 4.00 s, 4 MiB, 256 iops, 1.00 MiB/s
min/avg/max/mdev = 196.1 us / 3.90 ms / 146.7 ms / 15.4 ms

# Load the whole file in the page cache:
cat data/sbtest/sbtest1.ibd > /dev/null
grep "^Cached" /proc/meminfo
Cached:          6,249,968 kB

# And now doing the IOs is much faster:
ioping -c 1024 -i 0 -qC data/sbtest/sbtest1.ibd

--- data/sbtest/sbtest1.ibd (ext4 /dev/nvme0n1p1) ioping statistics ---
1.02 k requests completed in 1.37 ms, 4.00 MiB read, 748.9 k iops, 2.86 GiB/s
generated 1.02 k requests in 1.43 ms, 4 MiB, 714.7 k iops, 2.73 GiB/s
min/avg/max/mdev = 790 ns / 1.33 us / 4.86 us / 580 ns
With such a configuration, doing an IO-bound benchmark would be biased because most reads would be served by the page cache.  To avoid this, we could use a bigger table, but this is inconvenient and would not be practical on a server with 192 GB of RAM or more.  This is where the trick comes in handy.

The trick is allocating Huge Memory Pages

A lot has already been written about huge pages, and I am putting two references below (there are many others, maybe better than these two; if you find something good, please share it in the comments):
An interesting behavior of huge pages in Linux is that once allocated, they cannot be used by the page cache or by standard applications (including MySQL).  So allocating huge pages simulates a Linux server with less RAM.  Let's try this.

On the test vm, the huge page size is 2048 kB:
grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB
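With 2048 kB pages, the number of huge pages needed to hide a given amount of RAM is a simple division; a quick sketch of the arithmetic in shell:

```shell
# Number of 2048 kB huge pages needed to hide 14 GB of RAM:
#   14 GB = 14 * 1024 * 1024 kB, divided by 2048 kB per page.
echo $(( 14 * 1024 * 1024 / 2048 ))  # prints 7168
```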
So if we want to hide 14 of the 16 GB of RAM, we have to allocate 7168 huge pages:
grep HugePages_Total /proc/meminfo
HugePages_Total:       0
sudo bash -c "echo 7168 > /proc/sys/vm/nr_hugepages"
grep HugePages_Total /proc/meminfo
HugePages_Total:    5362
But something above went wrong: we did not get all the huge pages we requested.  This is because Linux was not able to find enough contiguous physical memory (only a few bytes used in a 2048 kB range prevents using that range as a huge page).  In such a case, Linux only gives us a fraction of the requested huge pages.  Also, Linux will not free the page cache to allocate huge pages, but we can do this ourselves to increase our chances of getting all the huge pages we request:
grep "^Cached" /proc/meminfo
Cached:          4,344,136 kB
sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
grep "^Cached" /proc/meminfo
Cached:            42,512 kB
sudo bash -c "echo 7168 > /proc/sys/vm/nr_hugepages"
grep HugePages_Total /proc/meminfo
HugePages_Total:    7168
After purging the page cache, we successfully allocated all the requested huge pages, but we could still have been unlucky.  If this happens, we can try stopping some programs, but the best solution is to allocate huge pages just after boot.  Alternatively, we can add vm.nr_hugepages = 7168 to /etc/sysctl.conf to make the huge page allocation persistent across reboots and be almost sure they will all be allocated.
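For reference, the same change can also be made with sysctl instead of echoing into /proc (a sketch; the sysctl.conf line is what makes it persistent):

```shell
# Runtime equivalent of echoing into /proc/sys/vm/nr_hugepages:
sudo sysctl -w vm.nr_hugepages=7168

# For persistence across reboots, add this line to /etc/sysctl.conf:
#   vm.nr_hugepages = 7168
# then reload the settings with:
sudo sysctl -p
```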

Update 2021-12-27: a reader contacted me on MariaDB Zulip to share a trick for avoiding a reboot.  If you have difficulty allocating huge pages, it is possible to compact the memory by running the following command:
sudo bash -c "echo 1 > /proc/sys/vm/compact_memory"
Now that we have hidden 14 of the 16 GB of RAM on this vm, let's retry our file reading and ioping tests:
sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"

grep -e "^Cached" -e HugePages_Total /proc/meminfo
Cached:            44,400 kB
HugePages_Total:     7168

time cat data/sbtest/sbtest1.ibd > /dev/null

real    4m22.619s
user    0m0.025s
sys     0m1.877s

grep "^Cached" /proc/meminfo
Cached:           751,232 kB

time cat data/sbtest/sbtest1.ibd > /dev/null

real    4m23.514s
user    0m0.013s
sys     0m2.032s

sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"

grep -e "^Cached" -e HugePages_Total /proc/meminfo
Cached:            29,584 kB
HugePages_Total:     7168

ioping -c 1024 -i 0 -qC data/sbtest/sbtest1.ibd

--- data/sbtest/sbtest1.ibd (ext4 /dev/nvme0n1p1) ioping statistics ---
1.02 k requests completed in 4.19 s, 4.00 MiB read, 243 iops, 975.9 KiB/s
generated 1.02 k requests in 4.19 s, 4 MiB, 244 iops, 976.6 KiB/s
min/avg/max/mdev = 3.13 us / 4.10 ms / 146.4 ms / 16.1 ms

cat data/sbtest/sbtest1.ibd > /dev/null

grep "^Cached" /proc/meminfo
Cached:           742,744 kB

ioping -c 1024 -i 0 -qC data/sbtest/sbtest1.ibd

--- data/sbtest/sbtest1.ibd (ext4 /dev/nvme0n1p1) ioping statistics ---
1.02 k requests completed in 2.67 s, 4.00 MiB read, 383 iops, 1.50 MiB/s
generated 1.02 k requests in 2.67 s, 4 MiB, 383 iops, 1.50 MiB/s
min/avg/max/mdev = 1.32 us / 2.61 ms / 140.6 ms / 11.4 ms
As shown above, with less than 1 GB in cache, we do not see cache effects for the file reading tests and only some effects for the ioping test.  With this, we are ready to do benchmarks that will not be influenced by the page cache (or at least will be influenced much less by the cache).  An IO-bound read-only sysbench using the trick gives 58.10 qps with magnetic disks, while it gives 5106.17 qps without the trick (details at the end of the post).

If you want to know more about how to set up huge pages in Linux, you can read the kernel hugetlbpage documentation.
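Once the tests are done, the hidden RAM can be given back by releasing the huge pages (a sketch; pages still in use by an application are only freed once that application releases them):

```shell
# Free the huge pages and check they are gone:
sudo bash -c "echo 0 > /proc/sys/vm/nr_hugepages"
grep HugePages_Total /proc/meminfo
```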

That is it for now.  In a future post, I will use this trick to generate memory pressure in a context not involving the page cache.

Annex: How to Generate the 5.9G File and sysbench Read-Only Tests


# Create a sandbox that will work with sysbench:
#   (sysbench does not work with caching_sha2_password)
c="default-authentication-plugin=mysql_native_password"
dbdeployer deploy single mysql_8.0.27 -c "$c"

# Create a schema for sysbench:
./use <<< "CREATE DATABASE sbtest"

# Prepare sysbench:
s="--mysql-socket=/tmp/mysql_sandbox8027.sock"
s="$s --mysql-user=msandbox --mysql-password=msandbox"
s="$s --tables=1 --table_size=1000000 --mysql-db=sbtest"
time sysbench oltp_read_only $s prepare --create_secondary=off
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Creating table 'sbtest1'...
Inserting 1000000 records into 'sbtest1'

real    0m26.375s
user    0m2.188s
sys     0m0.054s

# Make the table bigger (1 of 2):
{ echo "ALTER TABLE sbtest1 ADD COLUMN c00 CHAR(255) DEFAULT ''";
  seq -f " ADD COLUMN c%02.0f CHAR(255) DEFAULT ''" 1 15; } |
  paste -s -d "," | ./use sbtest

# Make the table bigger (2 of 2):
#   (because ADD COLUMN is instant in 8.0, we need to rebuild the table to inflate it)
time ./use sbtest <<< "ALTER TABLE sbtest1 ENGINE=InnoDB"

real    5m36.178s
user    0m0.005s
sys     0m0.003s

# Show the buffer pool is small (commas added for readability):
./use -N <<< "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size'"
innodb_buffer_pool_size 134,217,728

# Run sysbench, with the result being 5106.17 qps,
#   which is more than expected for an IO-bound workload on magnetic disks:
sysbench oltp_read_only $s run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            44702
        write:                           0
        other:                           6386
        total:                           51088
    transactions:                        3193   (319.14 per sec.)
    queries:                             51088  (5106.17 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          10.0034s
    total number of events:              3193

Latency (ms):
         min:                                    1.79
         avg:                                    3.13
         max:                                   16.22
         95th percentile:                        5.18
         sum:                                 9992.35

Threads fairness:
    events (avg/stddev):           3193.0000/0.00
    execution time (avg/stddev):   9.9924/0.00

# Hide some RAM by allocating huge pages:
sudo bash -c "echo 7168 > /proc/sys/vm/nr_hugepages"
grep HugePages_Total /proc/meminfo
HugePages_Total:    7168

# Run sysbench again, this time reaching only 58.10 qps,
#   which makes more sense for an IO-bound workload on magnetic disks:
sysbench oltp_read_only $s run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            518
        write:                           0
        other:                           74
        total:                           592
    transactions:                        37     (3.63 per sec.)
    queries:                             592    (58.10 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          10.1811s
    total number of events:              37

Latency (ms):
         min:                                   49.42
         avg:                                  275.15
         max:                                  605.93
         95th percentile:                      580.02
         sum:                                10180.49

Threads fairness:
    events (avg/stddev):           37.0000/0.00
    execution time (avg/stddev):   10.1805/0.00

2 comments:

  1. Hi,
    thank you for the post, one question if I may.
    What is inside the ./use file?
    Regards
    G

    Replies
    1. The ./use file is a wrapper around the mysql client.  It is generated by dbdeployer.  More about dbdeployer here (it is a great tool for testing many MySQL versions on the same server / laptop, or for deploying a replicated test environment on a single machine):
      https://github.com/datacharmer/dbdeployer
