Thursday, April 26, 2007

YouTube and MySQL

Paul Tuckfield of YouTube is giving a keynote. He has been sysadmin for 25 years, Oracle DBA for 15 years and MySQL DBA for 8 months.

For them, MySQL is one (important) piece of scalability. Top reasons for YouTube Scalability includes drinking :).

Top reasons for YouTUbe database scalability include Python, Memcache and MySQL replication. WHat you can do with Python, you can't do with other scripting languages. The fastest query on the database is that is never sent to the database.

They were doing click tracking on separate MyISAM whereas the read/write was being done on InnoDB. Everything was being written to one master and then being replicated to slaves. There were far fewer writes than reads. Every time replication slows down, you are screwed. Keeping replication going fast was hard.

They had a lot of replicas. One of the things that worked was separating pages for replicas. If you let entire pages built from a replica then there is cache affinity. The most important page is the watch page on YouTube. Most people coming to watch the video and if one thing they could make fast, it was that.

Standardizing on db boxes (they still crash daily)
4x2GHz Opteron core
12x10Krpm scsi
LSI hardware raid 10

Not fun to see corrupt in logs. InnoDB doesn't recover very fast. When it crashed they had so much in cache that it would take hours to recover. Master for them was a SPOF. How many backups can you restore in 8-10 hours? When you went to switch replicas there are a lot of questions to answer. Did you test the backup? No body does. Are the binlog coordinates correct? They had to promote replicas to master.

Database Failures:
Replication was key to trying different hardware permutations. They got good at re-parenting;/promoting replicas really fast. They they built up ways to clone databases as fast as possible (shutdown and sync) which would make site run again.

Lessons: Go replication. It is an excellent way to test tuning changes or fixes. You can create pools for QOS and cache affinity. You can get good at creating/recreating replicas fast. Get good at re-parenting replicas. Keep intentional lag / stemcell replicas. It stops SQL thread to keep it hours behind. Four kids is when people start saying "You got four kids?"

Trying out 5.0 "mystery cache hits"
They introduced a 5.0 slave for 4.1 master. One of the 5.0 slave was going really fast and other replicas were lacking. they were surprised. If you dump and load in 5.0, you will get more performance than just upgrading because of the compact low format in 5.0. So don't just upgrade but be sure to dump and load again for 5.0.

Top Heavy:
2x1.5GHz CPU
512M RAM
2x300G disk

Dream upgrade: (Put as much as memory as possible and put as many spindles as possible. Doesn't matter how fast. The number is really important.
4x4GHz cpu
16G Ram
10x10kRPM disk

If your db fits in ram, that's great otherwise here is what he says:
cache is king
writes should be cache by raid controller (buffered really) not the OS
only the db should cache reads (not raid and not even linux buffer cache).

Filesystem shouldn't cache reads, only db should cache it. Hit in db cache means lower caches went unused. Miss in db cache can only miss in other caches since they're smaller. Keep FS cache smaller than DB cache. RAID cache is even smaller. DB Cache is more useful. Read caching is pointless. Caching reads is worse than useless: they often serialize behind writes. Do not cache reads in the RAID controller.

Write engages 2 spindles (mirrors).

Don't cache in linux
vm.swappiness = 1-5

Cache in InnoDB

He is concerned about mystery serialization.

Writes are parallel to master but serialized to replicas. They need true horizontal scaling. Starting to run out of bandwidth on master. They want true independent master each with their own set of replicas. How is it going to work? How do you move users from masters? You may screwed up and corrupt it.

EMD = Extreme Makeover: Database (Even more databases :)) codename for horizontal scaling

Slave exacts must serialize to preserve commit order.

oracle (lower case) caching algorithm to predict the future. Manually caching tables by doing selects. Fetch blocks into cache for changes. Replication get great cache hits by using mysqlbinlog + python (it got 3-4 times faster).

"The Diamond": For golive, play shards binlogs back to original master for fallback.

Some questions I have been wanting to ask Paul:
How many total database servers?
How many Memcache servers?
Read-ahead script for replication?
How many concurrent threads?


Anonymous said...

"Starting to run out of bandwidth on master" - I am curious to know if any numbers were mentioned ? How many users ? Are videos stored at a single master database, how does that scale ?

motiono3 said...


Robin said...

thanks for your summary!

Anonymous said...

WHat you can do with Python, you can't do with other scripting languages.

So what is it ? map/reduce maybe ?

Anonymous said...

nice post