
Sunday, April 27, 2008

People I met at the conference

Every year I meet a lot of new and old friends at the MySQL conference. To highlight their involvement in the MySQL community and at the conference I have decided to start a new series: "People I met at the MySQL conference."

I probably won't be able to cover everyone I met (sorry about that), but I intend to cover as many as possible. There will be no particular order in which I cover people. Also, there is no secret agenda, and of course whatever I say is just my personal opinion. Whenever I have a few thoughts ready about someone, they will pop out :)

Monday, April 21, 2008

Back from the MySQL conference

This morning I landed back at my home airport, EWR, after spending a fun-filled week and a half at the MySQL Conference 2008.

This year's conference was the best ever for me. I have a lot of people to thank and a lot to blog about. The number of pings I have received about my lack of blogging during the conference is truly humbling. However, I did have a good reason for not being able to blog.

First, I was presenting three sessions, with two on the final day of the conference. Since I have the habit of continuously revising my presentations, that put a little bit of pressure on me. A big thanks to all those who came to my sessions.

Second, I was given a great opportunity to be a keynote panelist at the "Scaling up or out" session at the MySQL Conference. If you missed the keynote, you can watch the full video of the keynote posted by Sheeri.

Third, my wife, a few friends and I were invited on the trip of a lifetime by the hardcore community evangelists at Proven Scaling (Jeremy Cole, Eric Bergen and Mike Griffiths). We had a great time visiting Yosemite National Park (more on this later). It was the first time in nine years that I went without checking email or being on the Internet.

Now that I am back, I intend to put all my thoughts about the conference and the trip into blog posts in the coming days, so stay tuned.

Sunday, April 13, 2008

Heading to MySQL conference in Santa Clara

I am leaving in a few hours from Monterey for Santa Clara, the home of the MySQL conference. I should be in the Hyatt Regency lobby at 5:45 PM. I still have one more space in my car, so if you haven't found a ride yet to the pre-conference dinner, you can reserve the spot by calling me or sending me a text message at 5/5/1/6/5/5/5/5/9/0.

Wednesday, April 09, 2008

On my way to MySQL conference

Later today, around 5 PM EST, my wife and I will be flying to San Jose to attend the MySQL Conference happening next week. We will be staying the first two nights in Burlingame to meet family and friends.

Then on Friday evening we will be going to visit more family in Monterey. We will arrive at Hyatt Regency, Santa Clara, on Sunday afternoon.

Once at Hyatt, I will be happy to give a ride to anyone going to the Pre-Conference dinner.

After the conference, my plan is to spend time with a few friends. I will be flying back home on the Sunday night red-eye.

Like previous conferences, I can't wait to see all my old and new friends.

My passions include InnoDB, memcached, BLOBs, Latent Semantic Analysis, Ruby on Rails (why won't it scale?), SEO, monetization, Solaris 10, Sun hardware, Hadoop, Lucene, replication and Blue Moon :). I would love to meet and talk with other users passionate about similar stuff.

Monday, April 07, 2008

Pre-conference Community Dinner - MySQL Forge Registration

I like Arjen's suggestion of having a pre-conference community dinner and wanted to put my name down. I tried to register with softwareengineer99 and got this:

This kinds of usernames usually indicate spammers. If you feel like this was in err, contact the wiki administrator.

Return to MySQLConf2008CommunityDinner.

Ok, whatever. I then tried again with a "non spammer" username and multiple email addresses but kept getting this:

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function "User::addToDatabase". MySQL returned error "1062: Duplicate entry '' for key 3 (localhost)".


So there seems to be an issue with MySQL Forge registration.

Sunday, April 06, 2008

Is Backup Really Irrelevant?

Brian Aker writes in his "PostgreSQL to scale to 1 billion users" post:
Backup is irrelevant for those of you who care about this discussion. LVM/ZFS snapshots are the rule of the land.


While I agree with most of Brian's statements in the article, I respectfully disagree with the statement above, especially the bolded part. Copy-on-write snapshots are EVIL for very large databases operating in a high-I/O environment, and backup is by no means entirely irrelevant. Please correct me if I am wrong, but it is my understanding that both LVM and ZFS implement copy-on-write snapshots. Backup may be irrelevant for most sites, but not for us.

If, however, by "irrelevant" Brian meant that backup is not important in choosing one database over another, I can agree with that. Why? Because no one benchmarks backup methodologies until the backup process starts becoming a major PITA.

Backup methods can be a performance killer when dealing with very large databases. If you're interested in finding out why, and more importantly how, ask me at the conference, come to my scaling MySQL and InnoDB on Solaris session, or check this blog after the conference.

Friday, April 04, 2008

Kickfire looking to push MySQL limits

For the past few months, like Baron, Jeremy and Keith, I have been consulting for Kickfire (formerly known as C2App). There is another startup, currently in stealth mode, with some very impressive solutions for MySQL. Unlike Kickfire, this other startup isn't SSD based. I was hoping they would be ready for an announcement at the conference as well, but it seems they will need some more time. I cannot go into much detail on this startup at this point.

I have been wanting to write about Kickfire, but I certainly won't be able to beat Baron. He does a wonderful job of capturing what Kickfire is and presenting detailed insight for PlanetMySQL readers.

Like Baron, I only provided consulting and didn't get a chance to actually play with the solution. If Kickfire is able to deliver what they have been promising, I can see them becoming a major solution provider to the MySQL community.

I can't wait for Kickfire's keynote. Should be very interesting for those interested in giving MySQL scalability a whole new meaning.

Thursday, April 03, 2008

Golf with Scott McNealy?

Today, I missed out on an awesome opportunity: playing golf later this month with Scott McNealy. Scott held the title of 'best golfer among top executives' for eight straight years.

The offer was made today at our weekly managers' meeting. Why will I miss it? Because I will be in California, speaking at the MySQL Conference.

There are several interesting Sun-related events happening in New York during the time I will be in California for the MySQL conference. This would have been a great chance for me to mingle with the top executives and talent at Sun.

I feel sad about missing this opportunity, but very excited as the conference comes closer and closer. Can't wait to see old friends and make new ones.

Wednesday, April 02, 2008

Velocity Conference

O'Reilly's Velocity Conference is happening this year from June 23-24 in Burlingame, CA. The Velocity site describes this new conference as:

"Web companies, big and small, face many of the same challenges: sites must be faster, infrastructure needs to scale, and everything must be available to customers at all times, no matter what. Velocity is the place to obtain the crucial skills and knowledge to build successful web sites that are fast, scalable, resilient, and highly available."


When the call for papers was open for Velocity, I submitted a talk proposal about cutting MySQL I/O for cost-effective scaling and performance optimization.

Fotolog is one of the largest sites on the Internet. We are ranked the 13th most visited site by Alexa and the 3rd most active social network by ComScore. In the past two years, we have experienced and continue to experience incredible growth. By focusing on efficient data modeling and cutting I/O, we have literally pushed the limits of optimization and scalability when it comes to MySQL.

Learning today that my session was not accepted obviously came as a major disappointment. While I truly respect the conference chair's decision, I believe my session would have been useful for those who are experiencing strong growth but cannot afford to re-architect their database backend for one reason or another.

There is some good news as well: While Velocity rejected my proposal, I am presenting a somewhat similar session at this year's MySQL Conference. The session is called "Optimizing MySQL and InnoDB on Solaris 10 for World's Largest Photo Blogging Community". If you're attending the conference and interested in knowing how you can push the limits of your MySQL database servers on Solaris, don't forget to attend my session. It will be a lot of fun, I promise!

I am also presenting two more talks at the MySQL Conference: Disaster Is Inevitable—Are You Prepared? and The Power of Lucene.

Sunday, February 10, 2008

Someone please change mysqlconf.com redirection

MySQLConf.com uses a suboptimal temporary 302 redirect to the MySQL conference website.

This is very bad for the mysqlconf.com domain name, and equally bad for people who link to http://mysqlconf.com (instead of linking to en.oreilly.com/mysql2008/): their links then DON'T BENEFIT the conference site, and from Google's point of view they are linking to a page that engages in a temporary redirect. The result is that unless you link directly to an oreilly.com page for your conference links, your votes/links don't get passed on to the conference site.

A 302 redirect is considered bad from a search engine's point of view due to its temporary nature.

So please folks, change the redirection to a 301, or I will have to go back and change my links to "rel='nofollow'" links.
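I don't know what software serves mysqlconf.com, but if it happens to be Apache (an assumption on my part), the fix is a single mod_alias line in the vhost or .htaccess:

Redirect permanent / http://en.oreilly.com/mysql2008/

The mod_rewrite equivalent would be RewriteRule ^(.*)$ http://en.oreilly.com/mysql2008/$1 [R=301,L].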

Currently, the site gives:
wget mysqlconf.com
--14:27:44-- http://mysqlconf.com/
=> `index.html.2'
Resolving mysqlconf.com... 209.204.146.28
Connecting to mysqlconf.com|209.204.146.28|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://en.oreilly.com/mysql2008/ [following]
--14:27:44-- http://en.oreilly.com/mysql2008/
=> `index.html.2'
Resolving en.oreilly.com... 208.201.239.26
Connecting to en.oreilly.com|208.201.239.26|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://en.oreilly.com/mysql2008/public/content/home [following]
--14:27:44-- http://en.oreilly.com/mysql2008/public/content/home
=> `home'
Reusing existing connection to en.oreilly.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 16,852 [text/html]

100%[==========================================>] 16,852 45.68K/s

14:27:45 (45.63 KB/s) - `home' saved [16852/16852]


What it should give:
wget fotolog.net  
--14:33:02-- http://fotolog.net/
=> `index.html.2'
Resolving fotolog.net... 65.118.195.131
Connecting to fotolog.net|65.118.195.131|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.fotolog.com/ [following]
--14:33:03-- http://www.fotolog.com/
=> `index.html.2'
Resolving www.fotolog.com... 64.111.215.105, 64.111.215.120
Connecting to www.fotolog.com|64.111.215.105|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24,585 [text/html]

100%[==========================================>] 24,585 --.--K/s

14:33:03 (191.15 KB/s) - `index.html.2' saved [24585/24585]


It doesn't end there. MySQL is also hurting its mysqluc.com domain in the same way. Look at the scary number of 302 redirects here:
wget mysqluc.com
--14:35:29-- http://mysqluc.com/
=> `index.html.3'
Resolving mysqluc.com... 209.204.146.28
Connecting to mysqluc.com|209.204.146.28|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.mysqlconf.com/ [following]
--14:35:30-- http://www.mysqlconf.com/
=> `index.html.3'
Resolving www.mysqlconf.com... 209.204.146.28
Connecting to www.mysqlconf.com|209.204.146.28|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://en.oreilly.com/mysql2008/ [following]
--14:35:30-- http://en.oreilly.com/mysql2008/
=> `index.html.3'
Resolving en.oreilly.com... 208.201.239.26
Connecting to en.oreilly.com|208.201.239.26|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://en.oreilly.com/mysql2008/public/content/home [following]
--14:35:31-- http://en.oreilly.com/mysql2008/public/content/home
=> `home.1'
Reusing existing connection to en.oreilly.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 16,852 [text/html]

100%[==========================================>] 16,852 41.99K/s

14:35:31 (41.95 KB/s) - `home.1' saved [16852/16852]

Now I can just hope that someone actually takes action. It's small changes like this that determine whether you get the most out of your domain or waste all the effort you previously put into making your sites rank high.

For instance, just think how many links people created to mysqluc.com when the MySQL conference was known as the MySQL Users Conference. Because MySQL used an ill-advised 302 redirect, all the community's effort in linking to that domain went down the drain: Google neither transfers rank through a 302 redirect, nor does it consolidate the incoming votes/links from old domains to new domains.

Update: Why I keep talking about this: with Google's Bourbon update, I lost a very well performing site that I had worked on for many years. My investigations all pointed to the use of a 302 redirect instead of a 301 redirect. Basically, my site was wrongfully classified by Google as engaging in sneaky redirects. The site never rebounded. It was listed in Google News and was doing millions of page views a month. It had proper licensing from all content and news providers. 'Not knowing' didn't set me free in Google's court. By all means, it was a legitimate, high-quality content site. Although this may never happen to you, Google still considers a 302 a very bad form of redirecting, and whenever possible it should be avoided.

InnoDB Sessions at MySQL Conference

This year's MySQL Conference features some of the best talks on InnoDB, and I couldn't be more excited. We'll be hearing from Heikki Tuuri, Ken Jacobs, Mark Callaghan, Vadim Tkachenko, Peter Zaitsev and me :)

Kudos to the conference organizers, who have really done a great job of balancing the sessions this year.

The MySQL conference is a great venue to get up to date with what's happening with your favorite database/storage engine. Early registration ends soon, so save yourself some money and register now.

If you have known me for some time, or if you are a regular blog reader, please send me a note and I will send you a coupon code to save even more when you register for the conference. You can email me at softwareengineer99 at yahoo.

Without further ado, here are the mouth-watering InnoDB sessions scheduled for this year.

Mark Callaghan: Helping InnoDB Scale on Servers with Many CPU Cores and Disks

Ken Jacobs: InnoDB: Status, Architecture and New Features

Heikki Tuuri / Ken Jacobs: InnoDB: Fast, Reliable, Proven Transactional Storage for MySQL

Vadim Tkachenko and Peter Zaitsev: Investigating InnoDB Scaling Limits

Heikki Tuuri: InnoDB: Status, Architecture and New Features

Farhan "Frank" Mashraqi (Me): Optimizing MySQL and InnoDB on Solaris 10 for World's Largest Photo Blogging Community

Friday, January 25, 2008

When no disaster recovery plan helps

Regardless of how "prepared" and "ready" one feels for a disaster, it will, in one form or another, inevitably happen. The best thing you can do is continuously revise and test your disaster recovery plan, strengthening it each time against any kind of disaster you can think of. Things generally go wrong when you least expect them to.

I was getting chills reading about how Charter Communications, a St. Louis-based ISP, accidentally deleted 14,000 active email accounts along with any attachments they carried. All the deleted data of active customers is irretrievable. As someone responsible for the data of one of the top 15 most heavily trafficked sites in the world (according to Alexa), I know I'd HATE to be in the shoes of the person responsible for this.

As I was reading the news story, I was constantly thinking about the title of Jay and Mike's 2006 presentation: "What Do You Mean There's No Backup?"

Once a disaster happens, you can immediately think of the possible ways it could have been avoided. The real challenge is implementing ways of avoiding all types of disasters before they happen.

For instance, to protect against such a disaster, or at the very least be able to recover from its effects, Charter Communications could have:

1. fully tested the script on a QA/test box to ensure that no records of active users get deleted.
2. created a backup of the data by taking a file system snapshot just before running the script. That way the deleted data could be recovered. Depending on your operating system/storage system, there are a lot of tools available that let you take file system snapshots, such as fssnap (Solaris) and LVM (Linux). (See the sketch at the end of this list.)
3. had a recoverable backup. There are a lot of cases out there where either no backup exists or the one that does exist turns out to be corrupt. With a periodic backup, Charter could have, for instance, just announced to their customers that they had lost the email received since last week, instead of dropping the ball and saying that *all* their email was lost. Even an off-site backup would have helped in this case, if selective restore from that backup was possible.
4. kept a slave intentionally behind, if the deleted data was in a database such as MySQL; recovery from this disaster would then be possible from the delayed slave.

BTW, just a few days ago I was testing a random sample of backups and found the backups of one database to be corrupt. That triggered a system-wide check of backups. The best way I have found is to have a list of backups from all databases sent to me by email. My report contains information about backups running at the time the report was generated and the backups that were created the previous night.
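As a minimal sketch of the snapshot idea in point 2, assuming a Linux box with LVM2, the data on a hypothetical logical volume /dev/vg0/data, and enough free extents in the volume group for the snapshot's copy-on-write space:

# take a snapshot right before running the destructive script
lvcreate --size 10G --snapshot --name data_presweep /dev/vg0/data

# ... run the cleanup script ...

# if something was deleted by mistake, mount the snapshot read-only
# and copy the lost data back
mkdir -p /mnt/presweep
mount -o ro /dev/vg0/data_presweep /mnt/presweep

# once satisfied, drop the snapshot to reclaim the space
umount /mnt/presweep
lvremove -f /dev/vg0/data_presweep

On Solaris, fssnap plays a similar role for UFS file systems.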

What are some of the other ways you can think of to avoid a disaster or to execute a recovery plan?

There are many ways a disaster like this can be triggered. A few, seemingly bizarre but very real, that come to my mind:

- What if you accidentally re-run a previously executed DELETE command, stored in your mysql client history, in a hurry, on the wrong server? Or re-run a disastrous command from your shell history in the wrong directory?

- What if you used a list of IDs generated from your QA/test machine to delete users from production machines/databases? Oh and the IDs were generated from an auto-increment column?

Can you think of more?

Sure, there are ways to protect against each kind of disaster. The question then is: are you prepared against 'all' of them?

The disaster recovery plan of your company may help you steer out of such a disaster, but in Charter's case, their DR plan didn't cover this. They do, however, have plans to reimburse affected customers $50. I don't know if that will be sufficient to keep customers from switching.

If you are responsible for administering and executing disaster recovery plan(s) for your company, you may find my "Disaster Is Inevitable—Are You Prepared?" session at the MySQL Conference 2008 interesting. Plus, we can have a great conversation afterward. :)


See also: disaster recovery, disaster recovery journal, mysql conference

Tuesday, January 22, 2008

Speaking at MySQL Conference 2008



I have been meaning to blog about this for quite some time, but time seems to be the scarcest resource in my life.

This year, I will be presenting three sessions at MySQL Conference:


Disaster Is Inevitable—Are You Prepared?


What’s the worst database disaster you expect to happen? Are you prepared? Does your architecture support quicker recovery, or do you have recovery bottlenecks built throughout your architecture? What will happen if InnoDB crashes beyond repair, or if you have massive, irreparable data corruption? What can you do to better prepare for the disaster, when it does happen? Do you have data restoration tools and procedures in place in case you need to resort to extreme measures? Join us in this eye-opening, heart-racing, real-life-inspired presentation by Fotolog’s Director of Database Infrastructure, Farhan “Frank” Mashraqi, to find out the answers to these questions and more.

Optimizing MySQL and InnoDB on Solaris 10 for World's Largest Photo Blogging Community


Fotolog is a top-19 Internet destination with more than 12 million members, 315 million photos and more than 3 billion page views a month. In just a few years, Fotolog has become a social phenomenon in Europe and South America. Through modifications to its data architecture, Fotolog was able to serve four times the number of users with the same number of database servers. This is a non-conventional, hybrid presentation that conveys the importance of scalability, performance tuning and schema optimization in a practical way.

The Power of Lucene

Lucene is a powerful, high-performance, full-featured text search engine library written entirely in Java. It provides technology suitable for applications of any size that require full-text search in heterogeneous environments.

In this presentation, learn how you can use Lucene with MySQL to offer powerful search capabilities within your application. The presentation will cover installation, usage and optimization of Lucene, and how to interface a Ruby on Rails application with Lucene using a custom Java server. This session is highly recommended for those looking to add cross-platform, database-independent full-text search capability to their application.


Registration is now open. See you all soon!

Wednesday, May 23, 2007

Technocation presents MySQL conference videos and presentations

Sheeri, a good friend of mine, has been working very hard to record, process and publish hours and hours of MySQL conference videos. She has published the recordings on Technocation.

One request I want to make of everyone: if you download videos or presentations from Technocation, please consider giving a little back in the form of a donation. Everything that goes into Technocation is ultimately for the benefit of the community. And remember, every little bit counts.

Saturday, May 12, 2007

Pre-fetch binlogs to speed up MySQL replication

Note: this "script" is a one liner really and not something meant for general purpose use. It is very specific to my environment so I take no responsibility of any kind. It is very raw so use at YOUR OWN risk.

Last year at the MySQL Camp, I heard about Paul Tuckfield's script that pre-fetches relay logs to speed up replication. Since then I have been really interested in implementing it, but never got the time. This year at the MySQL conference, Paul's script got a lot of attention again. It hasn't been released, and I really needed to start testing it on my test server, so I started hacking on a shell script to dig deeper into what's involved. Once I have that, I may work on porting it to other languages.

To make the process of writing this script easier for others in their favorite languages, I thought I would go through what's involved in creating it on my blog. That way, when you would like to write an open-source :) port of this script, you won't have to dig through everything. I will, of course, love to get feedback, especially since I know so many people are interested in it.

Paul's keynote was among the best of the MySQL conference. It was so popular that even after the keynote finished, people didn't stop picking Paul's brain. The "always-looking-out-for-the-community" Jay Pipes quickly offered to help. I must mention that Paul was very generous in answering many questions at both the MySQL Camp and the MySQL Conference and Expo.

So the concept is something like this (unless otherwise specified, I will be talking about MySQL slave and not the master since we want to speed up the slave):

- Get the current position from which the slave is reading binlogs. This gives us the position we need to pass to the mysqlbinlog program.
- If the slave isn't replicating all databases from the master, we need to filter those databases out when reading the binlog output.
- Decide the maximum amount of data we will request mysqlbinlog to read at a time. Depending on the options passed to mysqlbinlog, we can choose whether reading should stop at the end of the current log file or continue to the end.
- As we get output from mysqlbinlog, we need to extract the UPDATE statements.
- Once the UPDATE statements have been extracted, we take the information in the WHERE clause and create a SELECT statement from it.
- These SELECT statements can then be played in advance to help speed up the UPDATE processes.

One thing to note is that if you use UPDATE LOW_PRIORITY ... on your master, then beware that it can further increase slave lag.

One question I have in mind is whether a slave with query cache turned off will be able to realize the same performance benefits as a slave that has query cache turned on.

Another thing I am working on is some kind of formula to help optimize how much data should be read in advance. This will all become reality soon.

Finally, I would like to measure key performance indicators before and after implementing this script. Ideas are certainly welcome regarding this.

If you have had a chance to play with mysqlbinlog you probably already know that most of the work is already done, thanks to the wealth of options that are available.

Of course, not everyone who will read this blog post will know about mysqlbinlog. So, I am going to give a brief introduction.


Moving relay logs to a test environment
To start playing with the mysqlbinlog command, move some relay logs to your test machine.

Say hello to mysqlbinlog
Imagine if we had to read the binlogs ourselves. Suddenly this whole script would become very challenging for some. Luckily we don't have to do that. Within the bin directory of your MySQL installation there are several helpful programs (or utilities) that you can use to make your everyday life easier, as in this case.

To view the mysqlbinlog options, run it without any options. I am going to touch on some of these options below.

First, let's start with invoking mysqlbinlog with the name of a relay log file that we are interested in reading.

slave:/path/] /path/to/mysqlbinlog slave-relay-bin.000169  | more

Typical output of the above will be something similar to the following:
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 4
#070329 18:11:33 server id 1 log_pos 915813658 Query thread_id=345365808 exec_time=0 error_code=0
use db;
SET TIMESTAMP=1175206293;
REPLACE INTO ...;
# at 207
#070329 18:11:33 server id 1 log_pos 915813861 Query thread_id=345007376 exec_time=0 error_code=0
use mysql_rocks;
SET TIMESTAMP=1175206293;
UPDATE world set db='MySQL';


Let's de-construct this line by line:
# at 4

This means that the following statement starts at position (byte offset) 4 in the log file.
#070329 18:11:33 server id 1  log_pos 915813658         Query   thread_id=345365808     exec_time=0     error_code=0

First, we can see the date and time, followed by the id of the server where the statement originally executed.

The number following log_pos corresponds to the position of this relay log record in the original binlog.

Next is the type of statement, then the thread id of the thread that caused this statement to be written. The remaining fields appear to be the time it took to execute the query and the error code generated when the query was executed.

use mysql_rocks;
SET TIMESTAMP=1175206293;
UPDATE world set db='MySQL';

The remaining lines are pretty self-explanatory and contain the database name, the original timestamp and the actual query.

As you can see, there is a bunch of output that, while helpful, won't be needed for the purpose of creating this script. We can use the -s or --short-form option to obtain just the queries. Try it out:
slave:/path/] /path/to/mysqlbinlog -s slave-relay-bin.000169  | more


Most of my slaves are not exact replicas of the master and do not "replicate" all databases; some databases present on the master are ignored. Generally, when this is the case, you would be using --binlog-ignore-db=db_being_ignored on your master so that statements belonging to db_being_ignored don't make it to the binlog. But let's assume that, for some reason, your relay logs contain statements belonging to databases that aren't being replicated and should be ignored for the purpose of this script. If these statements aren't filtered out, you will end up issuing SELECTs that cannot be executed due to missing databases and tables.

To get the queries for a specific database only, we could use an external utility such as grep or sed, but why not use the -d or --database=name option and let mysqlbinlog do the job for us? In these posts I am going to focus on creating this script for a single database. Later, if time permits, I will work on multiple databases. Our "script-in-progress" now becomes:

slave:/path/] /path/to/mysqlbinlog -s -d db_to_filter_for slave-relay-bin.000169  | more


This will make mysqlbinlog output only entries that belong to db_to_filter_for.


Unfortunately, I couldn't find any built-in filter to extract just the UPDATE statements, so I am going to resort to an external utility like grep. Actually, let's use sed :)

/path/to/mysqlbinlog -s -d db_to_filter_for slave-relay-bin.000169  | sed  -e '/UPDATE.*SET.*WHERE/b' -e d | more



The -e option specifies that we are issuing an edit command; as you can see, multiple -e options can be specified. The b command branches past the rest of the script for lines matching the pattern we want to keep (so they get printed), and the d command deletes the pattern space for everything else.

I couldn't find the case-insensitive option for sed on Solaris 10 so let's try this (I'll improve it later):

mysqlbinlog -s -d db_to_filter_for slave-relay-bin.000169  | sed  -e '/UPDATE.*SET.*WHERE/b' -e '/update.*set.*where/b' -e d | more


Now let's bite the bullet and convert the UPDATE statements to SELECT statements.

mysqlbinlog -s -d db_to_filter_for slave-relay-bin.000169  | sed  -e '/UPDATE.*SET.*WHERE/b' -e '/update.*set.*where/b' -e d | sed -e 's/LOW_PRIORITY//' | sed -e 's/UPDATE\(.*\)SET\(.*\)WHERE\(.*\)/SELECT \* FROM\1 WHERE \3/g' | sed -e 's/update\(.*\) set \(.*\)where\(.*\)/SELECT \* FROM\1 WHERE \3/g'


OMG, this is looking pretty good :)

Before we go any further, let's specify exactly how much of the relay binlog should be read.

We have at least two options for deciding where to start and stop.

First, we can specify --start-position=# and --stop-position=# to set the start and stop positions respectively. The default value for --start-position is 4, and for --stop-position it is 18446744073709551615.

Second, we can use --start-datetime=# and --stop-datetime=# to specify the boundaries. The format of the datetime can be any accepted MySQL format.

I will be using the --start-position=# and --stop-position=#.

After adding this to our script, we get:
start_pos=4; offset=` expr 1024 '*' 1024 `; stop_pos=`expr $start_pos + $offset`; /path/to/mysqlbinlog -s -d db_to_filter_for --start-position=$start_pos --stop-position=$stop_pos slave-relay-bin.000169  | sed  -e '/UPDATE.*SET.*WHERE/b' -e '/update.*set.*where/b' -e d | sed -e 's/LOW_PRIORITY//' | sed -e 's/UPDATE\(.*\)SET\(.*\)WHERE\(.*\)/SELECT \* FROM\1 WHERE \3/g' | sed -e 's/update\(.*\) set \(.*\)where\(.*\)/SELECT \* FROM\1 WHERE \3/g' 


If you specify too small an offset, you may not get any output.

Some stats related to my system, for reference: setting the offset to 1MB
offset=` expr 1024 '*' 1024 `;
on average produced 927 SELECTs. Without any options specified, the total output was 26314 lines (not queries). On average, a relay log contained 26075379 lines, and the last position in the relay logs was on average 1073741494. From every relay log file I was able to produce, on average, 1049323 SELECT statements. On an idle server, producing the SELECT statements took 12 seconds for parsing 100MB of relay binlog, and about one second for 10MB.
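To tie the pieces together, here is a rough sketch of how I picture the finished loop: reuse the pipeline above, then feed the generated SELECTs straight into the local mysql client to warm the caches before the SQL thread applies the corresponding UPDATEs. This is purely illustrative (it reuses the hypothetical db_to_filter_for and relay log names from above, and assumes the slave can afford the extra read load):

start_pos=4; offset=` expr 1024 '*' 1024 `; stop_pos=`expr $start_pos + $offset`; /path/to/mysqlbinlog -s -d db_to_filter_for --start-position=$start_pos --stop-position=$stop_pos slave-relay-bin.000169 | sed -e '/UPDATE.*SET.*WHERE/b' -e '/update.*set.*where/b' -e d | sed -e 's/LOW_PRIORITY//' | sed -e 's/UPDATE\(.*\)SET\(.*\)WHERE\(.*\)/SELECT \* FROM\1 WHERE \3/g' | sed -e 's/update\(.*\) set \(.*\)where\(.*\)/SELECT \* FROM\1 WHERE \3/g' | mysql db_to_filter_for

Each pass, advance start_pos to the previous stop_pos and repeat until mysqlbinlog produces no more output.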


To be continued...

Monday, May 07, 2007

MySQL conference ending thoughts and presentation files

Man, I can't believe it's been over a week since I returned from the great and exciting MySQL conference 2007. I got to meet all my old and new friends. Big kudos to Jay Pipes and all the MySQL'ers who helped make this event possible.

To me, this year's conference was the best ever, partly because I made the very wise decision of staying at the Hyatt so I wouldn't miss a lot. At conferences like these, the more you mingle with people, the more you get out of it. I had some amazing conversations with Mark Atwood, Brian Aker, Jay Pipes (I will never forget :)), Jeremy Cole, Eric Bergen, Pascal (Yahoo! France), Beat Vontobel, Markus Popp, Boel (MySQL HR), Christine Fortier, Marc Simony, Govi (Amazon), the "R"s of MySQL (Ronald and Roland), Sheeri Kritzer, Carsten (certification), Ken Jacobs (Oracle), Kaj Arno, Dean (MySQL Support head), Domas, Kerry Ancheta (MySQL sales guru :)), Baron, Paul Tuckfield, Don MacAskill, Tobias, Peter Zaitsev, Chip Turner, Mark Callaghan and many more cool people. Thank you, everyone.

The sad part is that there wasn't enough time for me to hang out with people as much as I wanted. Oh well, MySQL Camp II is just around the corner in NYC.

I will also never forget the night I went to Denny's and had a jalapeno burger with Michelle (wife), Jeremy, Adrienne (Mrs. Cole), Liam, Eric, Ronald, Domas and Pascal. It was so much fun.

At the conference, I also got a chance to be part of the MySQL Certification Technical Advisory Board (thanks to Carsten and Roland for having me). Some excellent ideas and important issues were discussed there. Everyone, including Mike Kruckenberg, Colin Charles and Sheeri, contributed some excellent suggestions.

The presentation files for my sessions are now available at http://www.mysqlconf.com. Thank you to all those who wrote to me repeatedly and kept reminding me to put the slides online. I really appreciate your patience. The slides do not make up for the talk, so if you find yourself with a question, please feel free to shoot me an email. You can find my email address in the header of my blog.

Also, big thanks to Warren Habib, my boss, who was there to provide his support.

- MySQL and Lucene
- Fotolog: Scaling the world's largest photo blogging community (We have now crossed 100 million page views a day and are ranked the 24th most visited site on the Internet by Alexa.)

For all those who have sent me an email, please bear with me as I will be sending a personal reply to everyone.

Thursday, April 26, 2007

InnoDB Performance Optimization

I am sitting in the session "InnoDB Performance Optimization" by Heikki Tuuri and Peter Zaitsev. This is going to be the last session of a very enjoyable and informative MySQL Conference and Expo 2007.

General Application Design is Paramount
Of course, you should design your schema, indexes and queries right. Storage engine aspects are often fine-tuning; storage engine selection may affect your schema layout and indexes.

Each storage engine has unique design and operating properties. An app written for one storage engine may not perform best with other storage engines. Special optimizations exist for each storage engine.

There are always transactions with InnoDB, even if you don't use them explicitly, and there is a cost to transactions. If you are running in autocommit mode, you pay the transaction commit overhead for every single statement.

You should wrap multiple updates in the same transaction for efficient operation (SET AUTOCOMMIT=0; ... COMMIT; ... COMMIT;).
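As a minimal sketch, with a hypothetical accounts table, both updates below are covered by a single commit (one log flush) instead of two:

SET AUTOCOMMIT=0;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
UPDATE accounts SET balance = balance + 10 WHERE id = 2;
COMMIT;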

HT points out that InnoDB cannot purge deleted rows if transactions are open. Transaction commit doesn't affect query caching, but if you run a SELECT inside a transaction, the query cache may not be used.

You should make sure that you're catching deadlocks and lock wait timeouts. Transactions should not be too large. Deadlocks are normal and you should expect them: try to reduce them in the app, but they will occur. If you only run SELECTs, you normally don't get deadlocks. Your application should deal with deadlocks appropriately.

Do not use LOCK TABLES, as LOCK TABLES is designed to work with table-level locking storage engines. With row-level locking storage engines, transactions are a better choice. The behavior of LOCK TABLES differs between MySQL versions and depends on --innodb_table_locks. It can cause problems for portable applications if you port from MySQL 4 to later versions, and the behavior may not be what you expect from MyISAM.

PRIMARY KEY CLUSTERING: it is very important. Accessing data by PK is faster than by other keys, and this is true for both in-memory and disk-based accesses. You should try to do most lookups by primary key.

In InnoDB, data is clustered by PK, so sequential PK values will likely have their data on the same page. PK range and prefix lookups are also very efficient. Clustering can be used to keep data that is accessed together close together.

If you are storing user messages, you can use a (user_id, message_id) primary key to keep all of a user's messages in a few pages.
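A minimal sketch of that layout (hypothetical table and columns):

CREATE TABLE messages (
  user_id    INT UNSIGNED NOT NULL,
  message_id INT UNSIGNED NOT NULL,
  body       TEXT,
  PRIMARY KEY (user_id, message_id)
) ENGINE=InnoDB;

-- one user's messages now cluster on a few adjacent pages:
SELECT * FROM messages WHERE user_id = 42;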

The PK creates a "covering index" for any set of fields in the PK, because the PK holds all the data.

What is the cost of a clustered primary key?
Inserting PKs in random order is costly and leads to table fragmentation, so PK inserts should normally be in ascending order. If you can, load data in PK order. At times, you may find that changing the primary key to AUTO_INCREMENT is a good idea.

If you don't specify a primary key, InnoDB will create a clustered key internally, so you are better off defining one and using it.

Primary key updates are very expensive: they require the row data to be physically moved from one place in the index to another. The row has to be deleted and then re-inserted, after which all secondary keys are updated. Never update a primary key; change it to AUTO_INCREMENT instead. HT: if you need to update a primary key, then there is something wrong with your schema, and generally it is not good schema/application design either.

What is a huge PK? It all depends on the access patterns of your application. Benchmark different keys.

What about a 32-bit hash? That's like shooting yourself in the foot: it's really long and will be in non-sequential order.

If you have a really large primary key, use AUTO_INCREMENT and demote your original primary key to a unique key.

Keep your primary key short, as secondary indexes use the primary key to refer to the clustered index, which makes the PK value part of every index.

A long PK makes your indexes slow and long, so you are better off reducing it.

Go easy on UNIQUE indexes, as they do not use the "insert buffer", which can otherwise speed up index updates significantly. Indexes in InnoDB are not prefix-compressed, so they can take much more space than MyISAM indexes (up to 10 times larger for the same structure). You should really avoid excessive or unneeded indexes. Remember, MySQL will use at most one index for a query, not two.

Keep your updates fully indexed or you can see unexpected locking problems. A statement such as DELETE FROM users WHERE name='peter' may have to lock all rows in the table if the name column is not indexed. HT: with READ COMMITTED in MySQL 5.1, only the rows that are actually deleted are locked.

AUTO_INCREMENT may limit scalability: AUTO-INC inserts may use table-level locks (held only to the end of the INSERT statement, not the transaction), even if you specify the AUTO-INC column value. AUTO_INCREMENT therefore limits scalability for concurrent inserts. HT: they have a patch and are deciding whether to implement it in 5.1.

MVCC: complements row-level locking to get even better concurrency. Standard SELECT statements set no locks; they just read the appropriate row version.

The LOCK IN SHARE MODE and FOR UPDATE modifiers can be used to do locking reads.

Even long-running SELECTs do not block updates to the table or other SELECTs. However, overly long queries (transactions) are bad for performance, as a lot of unpurged versions accumulate: InnoDB can only purge a row version when no open transaction can still read it. READ COMMITTED can ease the problem in such cases. HT: READ COMMITTED releases the snapshot at each SELECT statement rather than keeping it for the duration of the transaction, so it is better for long transactions, as it lets InnoDB purge row versions in the background.

How does MVCC play with indexes? When you have an index, references are stored in the index. When you update an indexed column, more and more versions accumulate in the index, which can slow index operations as well.

FOR UPDATE and LOCK IN SHARE MODE: locking SELECTs are executed in read-committed mode, because you cannot lock a row which doesn't exist. Because of this, the results of these queries can be different than for standard SELECTs.

SELECT ... FOR UPDATE always has to access the row data page to set the lock, so it can't run index-covered queries, which can further slow down queries significantly.

Reducing deadlocks: deadlocks are normal for a transactional database. Non-locking SELECT statements do not deadlock with InnoDB. You should make sure to handle deadlocks and lock wait timeouts in your application, and make sure that your transactions lock data in the same order when possible.

Your update chunks should be smaller. You can achieve that by chopping transactions.

If you are going to update most of the selected rows, then you are better off using SELECT ... FOR UPDATE. External locking can be used to avoid problems: application-level locks, SELECT GET_LOCK('mylock'), etc.

Effect of isolation modes on performance:
Isolation modes in InnoDB can be set globally, per connection, or per transaction.

READ UNCOMMITTED: use it if you are fine with dirty reads; the performance improvement is still limited.

Foreign key performance: InnoDB checks foreign keys as soon as a row is updated; no batching is performed and checks are not delayed until transaction commit. FKs have their own cost and often carry serious performance overhead, but they are ideal for maintaining data consistency.

FKs increase the amount of row-level locking done and can make it spread to many tables besides the one being updated.

Restrict the number of open transactions: InnoDB performs best with a limited number of open transactions and running queries. Many queries running at once may cause a lot of thrashing as they bump into each other; work is being done to improve performance in such cases. innodb_thread_concurrency can be used to restrict the number of threads in the InnoDB kernel. Many open transactions make lock graph building more complicated and increase other overhead as well. Whenever possible, keep a limited number of queries running at the same time by queuing on the application side.

Guidelines for setting the concurrency level: the manual says set it higher or lower; it really depends on the application and the number of hard drives.

HT: if you are worried about thread thrashing, set thread concurrency to 1.

INSERT...SELECT statements run a locking select. That is required for logical-level replication to work properly. The problem goes away with MySQL 5.1 row-level replication and the READ COMMITTED isolation level. The behavior is the same whether you have log-bin enabled or not, to keep things consistent. innodb_locks_unsafe_for_binlog helps in 5.0, but your replication may get broken, so use it only if you know what you are doing. It also disables next-key locking.

SELECT...INTO OUTFILE + LOAD DATA INFILE can often be used as a non-blocking, safe alternative.

COUNT(*) facts and myths:
Myth: "InnoDB doesn't handle count(*) queries well." Most count(*) queries are executed the same way by all storage engines if you have a WHERE clause; otherwise, InnoDB will have to perform a full table scan.

Fact: InnoDB doesn't optimize count(*) queries without a WHERE clause.

You can use triggers and a counter table to work around it, or use SHOW TABLE STATUS LIKE 'users' to get an approximation.
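Here is a minimal sketch of the trigger/counter-table workaround for a hypothetical users table (triggers need MySQL 5.0+; purely illustrative):

CREATE TABLE users_count (cnt INT UNSIGNED NOT NULL) ENGINE=InnoDB;
INSERT INTO users_count SELECT COUNT(*) FROM users;  -- seed it once

CREATE TRIGGER users_ai AFTER INSERT ON users
  FOR EACH ROW UPDATE users_count SET cnt = cnt + 1;
CREATE TRIGGER users_ad AFTER DELETE ON users
  FOR EACH ROW UPDATE users_count SET cnt = cnt - 1;

-- cheap replacement for SELECT COUNT(*) FROM users:
SELECT cnt FROM users_count;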

InnoDB and group commit: group commit lets you commit several outstanding transactions with a single log write. It can improve performance dramatically, especially if you have no RAID with a battery-backed write cache. In 5.0, group commit does not work with binary logging due to the way XA (distributed transactions) support was implemented. Watch out for it if upgrading from MySQL 4.0.

Heikki will now talk further about server tuning settings. Everything starts with innodb_buffer_pool_size, which sizes the main InnoDB buffer: data and index pages, the insert buffer and locks are stored here. Most InnoDB activity runs inside the buffer pool. Why does InnoDB have its own cache? Because it's more efficient than the OS cache. The buffer pool is very important for performance on large data sets, and much more efficient than the OS cache, especially for writes; InnoDB has to bypass OS buffering for writes. You can set the buffer pool to 70-80% of memory for a dedicated InnoDB-only MySQL server. The default value is just 8M, independent of available memory, so make sure to configure it.

innodb_additional_mem_pool_size: just stores dictionary information and is automatically increased; do not set it too high. HT just uses 10MB for it. In the future they may remove it altogether.

innodb_log_file_size: dramatically affects write performance. You should keep it high; however, high values increase recovery time, so check how large a log you can afford. The combined log file size should be around 50% of the buffer pool memory; there is a 4GB total size limit.

innodb_log_files_in_group: the number of files of the specified size that are used for the log. There is usually no need to change the default value.

innodb_log_buffer_size: you shouldn't set this over 2-8MB unless you use huge BLOBs, since the log file is flushed at least once per second anyway. Check the growth of Innodb_os_log_written to see how actively your logs are written. InnoDB logs are physiological, not page-based, so they are very compact.

innodb_flush_log_at_trx_commit: by default, logs are flushed to disk at each transaction commit. This is required for ACID guarantees and is expensive. You can set it to 2 or 0 if you can afford losing the last second or so of transactions (i.e. if you are using InnoDB as a MyISAM replacement). Value 2 is nice: if mysqld crashes, all your transactions are preserved; it is only a problem in the case of a power failure or OS crash.

InnoDB log resizing: not as simple as changing the option and restarting. You need to shut down the MySQL server and make sure it shuts down normally (check the error log to confirm). Then move the InnoDB log files ib_logfile* out of the way. Then start the MySQL server and check the error log to see whether it successfully created new log files.
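A sketch of that procedure as shell steps (the datadir path and init script are assumptions; adjust for your system and watch the error log at every step):

mysqladmin shutdown
tail /var/lib/mysql/*.err                     # confirm a clean shutdown
mkdir /var/lib/mysql/old_ib_logs
mv /var/lib/mysql/ib_logfile* /var/lib/mysql/old_ib_logs/
# edit innodb_log_file_size in my.cnf, then:
/etc/init.d/mysql start
tail /var/lib/mysql/*.err                     # confirm new log files were created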

innodb_flush_method: specifies the way InnoDB works with the OS file system. On Windows, unbuffered I/O mode is always used. On Unix you can use fsync() or O_SYNC/O_DSYNC for flushing files; fsync() is usually faster. Always use the default or O_DIRECT (Peter).

innodb_file_per_table: allows InnoDB to store each table in its own file, which makes disk management easier. The main tablespace is still needed for system needs. It can help to spread tables across multiple disks, and it allows you to reclaim space when a table is dropped. It is sometimes slower for writes, as fsync() is called sequentially, and it can increase startup/shutdown time with a large number of tables. I asked Heikki, but unfortunately there are no plans to remove the dependency on the main tablespace, as the rollback segments need to live there :((

Will there be an InnoDB tablespace per database? No plans. Flexible tablespaces would be similar to Oracle's.

Other file IO settings:
innodb_autoextend_increment: HT doesn't think this is important. It specifies the growth increment for the shared tablespace (not for per-table tablespaces). Larger values allow you to reduce fragmentation.

innodb_file_io_threads: changes the number of I/O threads, on Windows only. Note that all four threads are doing different jobs. This may change in the future.

innodb_open_files: the number of files used for per-table tablespaces. You should increase it if you have a lot of tables. No stats are available so far to show how many re-opens InnoDB needs to do.

innodb_support_xa: if set to 0, it reduces the work InnoDB must do on transaction commit; however, the binlog can get out of sync.
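Pulling the settings above together, here is an illustrative my.cnf fragment for a hypothetical dedicated InnoDB-only box with 16G of RAM (example values to adapt, not recommendations):

[mysqld]
innodb_buffer_pool_size         = 12G
innodb_additional_mem_pool_size = 10M
# combined log size ~50% of the buffer pool, capped by the 4GB total limit:
innodb_log_file_size            = 2000M
innodb_log_files_in_group       = 2
innodb_log_buffer_size          = 8M
innodb_flush_log_at_trx_commit  = 1     # 2 or 0 only if you can lose ~1s of transactions
innodb_flush_method             = O_DIRECT
innodb_file_per_table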

So how do you minimize restart times? The InnoDB buffer pool may hold a lot of unflushed data, and shutdown may take a very long time. If you need to minimize downtime, then:

SET GLOBAL innodb_max_dirty_pages_pct=0

Watch Innodb_buffer_pool_pages_dirty in SHOW STATUS. As it gets close to 0, shut down the server.
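That recipe as a small script (assumes a local mysql client that can log in without prompting; the threshold is arbitrary):

mysql -e "SET GLOBAL innodb_max_dirty_pages_pct=0"
while true; do
  dirty=`mysql -N -e "SHOW STATUS LIKE 'Innodb_buffer_pool_pages_dirty'" | awk '{print $2}'`
  [ "$dirty" -lt 100 ] && break
  sleep 5
done
mysqladmin shutdown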

InnoDB doesn't immediately remove rows on delete, or old row versions on update, because these may be needed by other transactions. The purge thread is used to clean up these unused rows. In some workloads, the purge thread may not be able to keep up, and the tablespace will grow without bounds. Check the TRANSACTIONS section in SHOW INNODB STATUS.

innodb_max_purge_lag limits the number of transactions that have updated/deleted rows; it delays inserts/updates so the purge thread can keep up. What is a good value for innodb_max_purge_lag? HT would set it to 100,000. It is measured in history-list length, which counts transactions; a million transactions here is too much.

innodb_commit_concurrency: the number of threads that are allowed in the commit stage at the same time.

Questions: Are there any plans to make InnoDB's output to the error log more defined/structured? Why does SHOW INNODB STATUS sometimes only show X transactions running and then truncate?

Google and MySQL

Chip Turner and Mark Callaghan are presenting the session "MySQL: The Real Grid Database"

Data is sharded vertically and they have lots of replicas. Resharding is a bigger pain than sharding. Make really smart software and manage with as few human resources as possible. They are going to talk about problems that matter to them.

The Grid database approach: deploy a large number of small servers.
- Use highly redundant commodity components.
- Added capacity has a low incremental cost.
- Not much capacity is lost when a server fails.
- This allows them to support many servers with a few DBAs.


I asked Chip earlier and he told me that they don't use any memcache at all. Their spindles are flying more crazily than their network bandwidth.

What is manageability?
- Make it easy to do the tasks that must be done.
- Reduce the number of tasks that must be done.
- Make all tasks scriptable.

Why does it matter? You want to support hundreds of database servers with one DBA. This gives them time to solve more interesting problems.

They used to compile the db on the db server by hand; then they learned.

Underutilize your servers!
Three servers at 50% are better than two servers at 75%: they
- require less maintenance,
- require less tuning,
- tolerate load spikes better, and tolerate bad query plans better.

In a perfect world:
- Short-running queries and transactions on master databases.
- Users kill mistaken and runaway queries.
- Accounts never use too many connections.
- Query plans are good.
- New apps increase the db workload by a small amount.
- Only things you need to store are stored.

In the real world:
- Long-running transactions.
- Servers with round-robin DNS aliases make queries hard to find.
- Apps create more connections when the db is slow.
- Some storage engines use sampling to get query plan statistics.
- New applications create new db performance problems.
- Apps use the db as a log, and rows are never deleted.
- Many long-running queries on replicas.

Solutions:
- Improve your ability to respond, because prevention is impossible.
- You need tools to make monitoring easy: determine what is happening across servers, and what happened across servers.

Mantra:
- Monitor everything you can (vmstat, iostat, error logs); you need these to reconstruct failures.

Save as much as possible: disk spindles are expensive, whereas disk sectors are cheap.

Script as much as possible: You can automate more than you thought possible and you are likely to reuse these scripts.

You don't need to store vmstat output in the same database; store it in a separate database or in text files.

Monitoring matters:
- You need to know what is happening: which table, which user is doing it?
- Many fast queries can be as much of a problem as one slow query.

Record what happened:
- Archive SHOW STATUS counters somewhere.
- Archive SHOW PROCESSLIST output somewhere. They take it every 30 seconds, compress it and save it; then they know what was running. It is very useful. (See the sketch after this list.)
- Query data from the archive and visualize it.
- Record the queries that have been run (a sample is sufficient).
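A sketch of that PROCESSLIST archiving loop (hypothetical path; appended gzip members still form a valid .gz file):

while true; do
  mysql -e 'SHOW FULL PROCESSLIST\G' | gzip >> /var/log/mysql/processlist.`date +%Y%m%d`.gz
  sleep 30
done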

Monitoring tools: display counters and the rate of change for counters. Aggregate values over many servers, visualize and rank results, and display results over time. Use innotop and mytop. Google has released mypgrep, which works across multiple servers.

Enhanced monitoring: they changed MySQL to count activity per account, table and index. Results are exported via new SQL statements:

SHOW USER STATISTICS displays, for each account:
- seconds executing commands
- number of rows fetched and changed
- total and current connections
- number of select statements
- number of row change statements (I/U/D/R)
- number of other commands
- number of commits
- number of rollbacks
- number of binlog bytes written

SHOW TABLE_STATISTICS displays, for each table, the number of rows fetched and changed.

SHOW INDEX STATISTICS displays the number of rows fetched per index. This helps find indexes that are never used. The more compact the InnoDB table, the better.

MySQL High availability:
There are many great options and an even brighter future:
-MySQL cluster
-MySQL replication
-middleware
-DRBD

They need some features right now

They are committed to InnoDB and MySQL replication, but they have the constraint that they want all the features right now. InnoDB works for them and they are hesitant to move to anything else. Same for replication: it works for them.

They favor commodity hardware. MySQL Cluster tends to like setups that guarantee latencies between nodes; they have to share the network with lots of other applications, and people running batch jobs can really saturate the network.

Desired HA functionality:
- zero transaction loss on failure of a master
- minimal downtime on failure of a master
- reasonable cost in performance and dollars
- fast and automatic failover to a local or remote server

They would be happy with achieving a downtime of only a minute or so, versus trying to keep it down to seconds as measured by MySQL Cluster.

No changes to their programming model:
- Does it support MVCC?
- Does it support long-running transactions and UPDATEs? They populate a tmp table with queries and then use it to populate other tables, etc. If something runs for more than 10 minutes, they would like to stop it. Their application programmers love to do real-time analytics; more than a minute of delay in replication will cause people to go to the master. They have queries that run for hours on replicas, with five tables at most in JOINs.

Failure happens everywhere:
- OS: kernel OOM or panic
- mysqld: failures caused by code they added (they do make mistakes)
- Disk: misdirected writes, corrupt writes (they love InnoDB checksums); disk failure rate is 5% per year
- File system: inconsistent after an unplanned hardware reboot
- Server: bad RAM
- LAN, switch: lost connectivity
- Rack: reboots (people accidentally reboot racks)
- Data center: power loss, overheating, lightning, fire
- People: things get killed or rebooted by mistake
- Replication and reporting are concurrent on a slave, and they have very little control over the queries that are run.

Paul and Chip haven't seen their servers; most people haven't. A typo can cause an operation to run on the wrong server.

They try to do without battery-backed RAID caches and the like, preferring software work-arounds (because they love commodity hardware).

HA features they want:
- synchronous replication as an option
- a product that watches a master and initiates a failover
- archives of the master's binlog stored elsewhere
- state stored in the filesystem to be consistent after a crash
-- InnoDB and MySQL dictionaries can get out of sync
-- replication can get out of sync

They couldn't wait, so they added features to 4.0.26 (they just had to make sure it would run on Linux):
-transactional replication for slaves
-semi-synchronous replication
-mirrored binlogs
-fast and automatic failover

Transactional replication:
- replication state on a slave is stored in files
- the slave SQL thread commits to the storage engine and then updates a file
- a crash between the two can make the replication state inconsistent
- transactional replication instead stores the replication state in the InnoDB transaction log

Semi-synchronous replication:
- HT has spoken about it and started building it, but there was not enough community demand.
- It blocks the return from commit on the master until at least one slave has acknowledged receipt of all replication events for the transaction.
- The slave IO thread acknowledges receipt after buffering the changes.
- They modified the MySQL replication protocol to support acknowledgements.
Configurable options for:
-- whether the master uses it
-- whether a slave uses it
-- how long the master waits for an ack

You can run a master with some semi-sync replication slaves and some regular replication slaves. This can work with any storage engine that supports COMMIT, but they only support InnoDB. The point is that a transaction is guaranteed to be on at least one other slave before the commit returns to the user: block the commit until the transaction has reached another server. They had a rack switch rebooted by mistake and lost about 15K transactions; that's why this is very important. Even if the master catches fire, nothing committed to the database is missing.

YouTube and MySQL

Paul Tuckfield of YouTube is giving a keynote. He has been a sysadmin for 25 years, an Oracle DBA for 15 years and a MySQL DBA for 8 months.

For them, MySQL is one (important) piece of scalability. The top reasons for YouTube's scalability include drinking :).

The top reasons for YouTube's database scalability include Python, memcache and MySQL replication. What you can do with Python, you can't do with other scripting languages. The fastest query is the one that is never sent to the database.

They were doing click tracking on separate MyISAM tables, whereas the read/write traffic was on InnoDB. Everything was written to one master and then replicated to slaves, and there were far fewer writes than reads. Every time replication slows down, you are screwed. Keeping replication fast was hard.

They had a lot of replicas. One of the things that worked was dedicating certain pages to certain replicas: if an entire page is built from one replica, you get cache affinity. The most important page is the watch page on YouTube. Most people come to watch a video, and if there was one thing they could make fast, it was that.

Standardizing on db boxes (they still crash daily):
4x2GHz Opteron cores
16G RAM
12x10Krpm SCSI
LSI hardware RAID 10

It is not fun to see corruption in the logs, and InnoDB doesn't recover very fast: when it crashed, they had so much in cache that recovery would take hours. The master for them was a SPOF. How many backups can you restore in 8-10 hours? When you go to switch replicas, there are a lot of questions to answer. Did you test the backup? Nobody does. Are the binlog coordinates correct? They had to promote replicas to master.

Database failures:
Replication was key to trying different hardware permutations. They got good at re-parenting/promoting replicas really fast. Then they built up ways to clone databases as fast as possible (shutdown and sync), which would get the site running again.

Lessons: go replication. It is an excellent way to test tuning changes or fixes. You can create pools for QoS and cache affinity. Get good at creating/recreating replicas fast, and get good at re-parenting replicas. Keep intentional-lag / stemcell replicas: they stop the SQL thread to keep a replica hours behind. Four kids is when people start saying "You got four kids?"

Trying out 5.0: "mystery cache hits"
They introduced a 5.0 slave under a 4.1 master. One of the 5.0 slaves was going really fast while the other replicas were lagging; they were surprised. If you dump and load in 5.0, you will get more performance than from just upgrading, because of the compact row format in 5.0. So don't just upgrade; be sure to dump and load again for 5.0.

Top Heavy:
2x1.5GHz CPU
512M RAM
2x300G disk

Dream upgrade: put in as much memory as possible and as many spindles as possible. It doesn't matter how fast they are; the number is what really matters.
4x4GHz cpu
16G Ram
10x10kRPM disk

If your db fits in RAM, that's great; otherwise here is what he says:
- Cache is king.
- Writes should be cached by the RAID controller (buffered, really), not the OS.
- Only the db should cache reads (not the RAID controller, and not even the Linux buffer cache).

The filesystem shouldn't cache reads; only the db should. A hit in the db cache means the lower caches went unused, while a miss in the db cache can only miss in the other caches, since they're smaller. Keep the FS cache smaller than the DB cache; the RAID cache is even smaller, and the DB cache is more useful. Caching reads is worse than useless: they often serialize behind writes. Do not cache reads in the RAID controller.

Write engages 2 spindles (mirrors).

Don't cache in Linux:
--innodb_flush_method=O_DIRECT
vm.swappiness = 1-5

Cache in InnoDB

He is concerned about mystery serialization.

Writes are parallel on the master but serialized on replicas. They need true horizontal scaling; they are starting to run out of bandwidth on the master. They want truly independent masters, each with its own set of replicas. How is it going to work? How do you move users between masters? You may screw it up and corrupt the data.

EMD = Extreme Makeover: Database (Even More Databases :)), the codename for horizontal scaling.

Slave writes must serialize to preserve commit order.

The "oracle" (lower case) caching algorithm: predict the future. Manually cache tables by doing SELECTs; fetch blocks into cache ahead of the changes. Replication gets great cache hits by using mysqlbinlog + Python (it got 3-4 times faster).

"The Diamond": For golive, play shards binlogs back to original master for fallback.

Some questions I have been wanting to ask Paul:
How many total database servers?
How many Memcache servers?
Read-ahead script for replication?
How many concurrent threads?

Wednesday, April 25, 2007

Fotolog: Scaling the world's largest photo blogging community

Yesterday my talk about "Scaling the world's largest photo blogging community" went very well, and I couldn't be happier. There were a lot of questions from the audience at the end, which made me really happy, as it was a clear sign that my presentation wasn't flying over their heads :)

Thank you to all those who attended. I will be posting the slides for my talk later tonight (it sucks that Blogger doesn't have file upload).