Saturday, May 12, 2007

Pre-fetch binlogs to speed up MySQL replication

Note: this "script" is a one liner really and not something meant for general purpose use. It is very specific to my environment so I take no responsibility of any kind. It is very raw so use at YOUR OWN risk.

Last year at the MySQL camp, I heard about Paul Tuckfield's script that pre-fetches relay logs to speed up replication. Since then I got really interested in implementing it but never got the time. This year at MySQL conference Paul's script got a lot of attention again. Since it hasn't been released and I really need to start testing this on my test server. So I started hacking on creating a shell script to dig deeper into what's involved. Once I have that I may work on porting it to other languages.

To make the process of writing this script easier for others in their favorite languages I thought I will go through what's involved in creating this script on my blog. That way when you would like to write an open-source :) port of this script, you won't have to dig through everything. I will, of course, love to get feedback specially since I know so many people are interested in it.

Paul's keynote was among the best keynotes of MySQL conference. It was so popular that even after the keynote finished, people didn't stop picking Paul's mind. The "always-looking-out-for-the-community", Jay Pipes quickly offered , I must mention that Paul was very generous in answering many questions at both the MySQL Camp and MySQL Conference and Expo.

So the concept is something like this (unless otherwise specified, I will be talking about MySQL slave and not the master since we want to speed up the slave):

- Get the current position from where the slave is currently reading binlogs. This will give us the position we need to pass to mysqlbinlog program.
- If the slave isn't replicating all databases from the master then we need to filter those databases out when reading binlog output.
- Decide the maximum amount of data we would request mysqlbinlog to read at a time. Depending on the options passed to mysqlbinlog we can request whether the reading should stop at the end of current log file or should it continue till the end.
- As we get output from mysqlbinlog, we will need to extract the UPDATE statements.
- Once the UPDATE statements have been extracted we need to get the information contained in the WHERE clause and create a SELECT statement from it.
- These SELECT statements can then be played in advance to help speed up the UPDATE processes.

One thing to note is that if you use UPDATE LOW PRIORITY ... on your master then beware that it can further increase the slave lag.

One question I have in mind is whether a slave with query cache turned off will be able to realize the same performance benefits as a slave that has query cache turned on.

Another thing I am working on determining is some kind of formula that will help optimize how much data in advance should be read. This will all become reality soon.

Finally, I would like to measure key performance indicators before and after implementing this script. Ideas are certainly welcome regarding this.

If you have had a chance to play with mysqlbinlog you probably already know that most of the work is already done, thanks to the wealth of options that are available.

Of course, not everyone who will read this blog post will know about mysqlbinlog. So, I am going to give a brief introduction.

Moving relay logs to a test environment
To start playing with mysqlbinlog command, move some relay logs to your test machine.

Say hello to mysqlbinlog
Imagine if we would have had to read the binlogs ourself. Suddenly this whole script would become very challenging for some. Luckily we don't have to do that. Within the bin directory of your MySQL installation there are several helpful programs (or utilities) that you can use to ease your everyday life as in this case.

To view the mysqlbinlog options, run it without any option. I am going to touch on some of these options below.

First, let's start with invoking mysqlbinlog with the name of a relay log file that we are interested in reading.

slave:/path/] /path/to/mysqlbinlog slave-relay-bin.000169  | more

Typical output of the above will be something similar to the following:
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 4
#070329 18:11:33 server id 1 log_pos 915813658 Query thread_id=345365808 exec_time=0 error_code=0
use db;
SET TIMESTAMP=1175206293;
# at 207
#070329 18:11:33 server id 1 log_pos 915813861 Query thread_id=345007376 exec_time=0 error_code=0
use mysql_rocks;
SET TIMESTAMP=1175206293;
UPDATE world set db='MySQL';

Let's de-construct this line by line:
# at 4

This means that the following statement is at delimiter position 4.
#070329 18:11:33 server id 1  log_pos 915813658         Query   thread_id=345365808     exec_time=0     error_code=0

First, we can see the date and time followed by server id to which the statement originally belonged.

The number following log_pos corresponds to the position of this relay log record in the original binlog.

Next is the type of statement, the thread id that caused this statement to be inserted. It seems like the next is probably time it took to execute the query and the error code generated when the query was executed..

use mysql_rocks;
SET TIMESTAMP=1175206293;
UPDATE world set db='MySQL';

The remaining lines are pretty self explanatory and contain the database name, original timestamp and the actual query.

As you can see there is a bunch of output that, while helpful, won't be needed for the purpose of creating this script. We can use the -s or --short-form option to obtain just the queries. Try it out:
slave:/path/] /path/to/mysqlbinlog -s slave-relay-bin.000169  | more

Most of my slaves are not exact replicas of the master and do not "replicate" all databases. Some databases present on the master are ignored. Generally, when this is the case you would be using the --binlog-ignore-db=db_being_ignored on your master to ignore statements that belong to db_being_ignored don't make it to the binlog. But let's assume that for some reason your relay logs contain statements belonging to databases that aren't being replicated and should be ignored for the purpose of this script. If these statements aren't filtered out, you will end up issuing SELECTs that cannot be executed due to missing database and tables.

To get the queries for a specific database only we could use an external utility such as grep or sed, but not use -d or --database=name option and let mysqlbinlog do the job for us. In these posts I am going to just focus on creating this script for a single database. Later, if time permits, I will work on multiple databases. Our "script-in-progress" now becomes:

slave:/path/] /path/to/mysqlbinlog -s -d db_to_filter_for slave-relay-bin.000169  | more

This will make mysqlbinlog output only entries that belong to db_being_ignored.

Unfortunately, I couldn't find any built-in filter to filter out UPDATE statements, therefore I am going to resort to some external utility like grep. Actually, let's use sed :)

/path/to/mysqlbinlog -s -d db_to_filter_for slave-relay-bin.000169  | sed  -e '/UPDATE.*SET.*WHERE/b' -e d | more

The -e option specifies that we will be issuing edit command. As you can see multiple -e options can be specified. The /b flag is used for specifying patterns we want to keep and d to delete the pattern space.

I couldn't find the case-insensitive option for sed on Solaris 10 so let's try this (I'll improve it later):

mysqlbinlog -s -d db_to_filter_for slave-relay-bin.000169  | sed  -e '/UPDATE.*SET.*WHERE/b' -e '/update.*set.*where/b' -e d | more

Now let's bite the bullet and convert the UPDATE statements to SELECT statements.

mysqlbinlog -s -d db_to_filter_for slave-relay-bin.000169  | sed  -e '/UPDATE.*SET.*WHERE/b' -e '/update.*set.*where/b' -e d | sed -e 's/LOW_PRIORITY//' | sed -e 's/UPDATE\(.*\)SET\(.*\)WHERE\(.*\)/SELECT \* FROM\1 WHERE \3/g' | sed -e 's/update\(.*\) set \(.*\)where\(.*\)/SELECT \* FROM\1 WHERE \3/g'

OMG!, this is looking pretty good :)

Before we go any further, let's specify exactly how much of relay binlog should be read.

We have at least two options as far the decision of where to start and stop.

First, we can specify --start-position=# and --stop-position=# to specify the start and stop positions respectively. The default values for --start-position is 4 and for --stop-position is 18446744073709551615

Second, we can use --start-datetime=# and --stop-datetime=# to specify the boundaries. The format of the datetime can be any accepted MySQL format.

I will be using the --start-position=# and --stop-position=#.

After adding this to our script, we get:
start_pos=4; offset=` expr 1024 '*' 1024 `; stop_pos=`expr $start_pos + $offset`; /path/to/mysqlbinlog -s -d db_to_filter_for --start-position=$start_pos --stop-position=$stop_pos slave-relay-bin.000169  | sed  -e '/UPDATE.*SET.*WHERE/b' -e '/update.*set.*where/b' -e d | sed -e 's/LOW_PRIORITY//' | sed -e 's/UPDATE\(.*\)SET\(.*\)WHERE\(.*\)/SELECT \* FROM\1 WHERE \3/g' | sed -e 's/update\(.*\) set \(.*\)where\(.*\)/SELECT \* FROM\1 WHERE \3/g' 

If you specify too low offset you may not get any output.

(some stats related to my system for reference:) For my system, setting the offset to 1MB
offset=` expr 1024 '*' 1024 `;
on average produced 927 SELECTs. Without any options specified, the total output was 26314 lines (not queries). On average the relay log contained 26075379 lines. The last position in relay logs on average was 1073741494. From every relay log file I was able to produce on average 1049323 SELECT statements. On an idle server, producing the SELECT statements after parsing 100MB of relay binlog took 12 seconds and about one second for processing 10MB.

To be continued...


burtonator said...

I don't think the MySQL query cache has anything to do with it.

You're seeing the boost because of the innodb buffer pull or with MyISAM the OS page buffer.

I'm skeptical about how this would perform in high disk/data ratio environments.

Anonymous said...

The query cache is not a factor since the update statement is not byte for byte identical to the select that's being run in advance. Not identical = no query cache use.

The slave thread is limited because it's single-threaded while the master is multi-threaded. This limits the slave's ability to maximally use disk and CPU. Getting the statements in advance can let the slave preload data and decrease replication lag, at the cost of reducing free CPU and disk seeks a bit. In cases where replication lag is a big issue, that's a good trade-off.

The catch is, you don't have much window for getting the queries run in advance if you're not already lagged. And you don't want to get there.

What you could do instead is have the application feed the select queries to the slaves or to a process that runs them. If the application does it, it's sure to be ahead of the replication thread. Catch here is that you must do this outside a transaction because you definitely don't want to hold InnoDB transactions that have modified data open any longer - that's grossly counter-productive for high load places that tend to be limited by concurrency issues.

But... if the application is feeding the selects to the slaves, it could feed the updates themselves and completely skip replication. That's not as crazy as it sounds. Say you have a mixed load with some updates to tables that aren't touched elsewhere. Route just the updates to that subset of tables through your custom slave update process and the single thread bottleneck is gone and you can tell replication to skip those tables.

And that's how to cheat and get ahead in replication without doing any duplicated work.

James Day