Thursday, April 26, 2007

Google and MySQL

Chip Turner and Mark Callaghan are presenting the session "MySQL: The Real Grid Database"

Data is sharded vertically and they have a lots of replicas. Resharding is a bigger pain than sharding. Make really smart software and manage with least human resources as possible. They are going to talk about problems that matter to them.

The Grid database approach: deploy a large number of small servers.
Use highly redundant commodity components.
Added capacity has a low incremental cost.
Not much capacity lost when a server fails.
Which allows them to support many servers with a few DBAs.

I asked Chip earlier and he told me that they don't use any memcache at all. Their spindles are flying more crazily than their network bandwidth.

What is manageability?
-Make it easy to do the tasks that must be done
Reduce the number of tasks that must be done
Make all tasks scriptable

Why it matters?
You want to support hundreds of database servers with one DBA. This gives them time to solve more interesting problems.

They used to compile db on db server by hand, then they learned.

Underutilize your servers!
3 servers at 50% are better than 2 servers at 75%.
require less maintenance
requires less tuning
tolerate load spikes better and tolerate bad query plans better.

In perfect world:
Short running queries and transactions on master databases
Users kill mistake and runaway queries.
Accounts never use too many connections.
Query plans are good
New apps increase the db workload by a small amount
only things you need to store are stored.

In real world:
Long running transactions
servers with round robin DNS aliases make queries hard to find.
Apps create more connections when the db is slow
Some storage engines use sampling to get query plan statistics
New applications create new db performance problems
Apps use the db as a log and rows are never deleted
Many long running queries on replicas.

improve your ability to respond because prevention is impossible
Need tools to make monitoring easy
- determine what is happening across servers and what happened across servers.

Monitor everything you can (vmstat, iostat error logs).
- need these to reconstruct failures

Save as much as possible. disk spindles are expensive whereas disk sectors are cheap

Script as much as possible: You can automate more than you thought possible and you are likely to reuse these scripts.

You don't need to store vmstat in the same database. Store it on a separate database or store in text files.

Monitoring matters:
You need to know what is happening, what table, which user is doing?
Many fast queries can be as much of a problem as one slow query.

Record what happened: Archive SHOW STATUS counters somewhere. Query data from the archive and visualize data from the archive. Record queries that have been run (a sample is sufficient)
- Archive SHOW PROCESSLIST output somewhere. They take it every 30 seconds, compress it and save it. Then they know what was running. It is very useful.
-Query data from the archive and visualize data from the archive.

Record queries that have been run.

Monitoring tools: display counters and rate change for counters.
Aggregate values over many servers. Visualize and rank results and they display results over time. Use innotop and mytop. Google has release mypgrep. It goes across multiple servers.

Enhance monitoring: THey changed MySQL to count activity per account, table and index.
Results are exported via new SQL statements

displays for each account:
-seconds executing commands
-number of rows fetched and changed
-total and current connections
-number of select statements
-number of row change statements (I/U/D/R)
Number of other commands
Number of commits
Number of rollbacks
Number of binlog bytes written

SHOW TABLE_STATISTICS: displays for each table :
number of rows fetched and changed

displays the number of rows fetched per index
Helps find indexes that are never used. The more compact InnoDB table, the better.

MySQL High availability:
There are many great options and an even brighter future:
-MySQL cluster
-MySQL replication

They need some features right now

They are commited to InnoDB and MySQL replication. They have some constraints that they want all features right now. InnoDB works from them and they are hesitant to move to anything else. Same for replication: it works for them.

They favor commodity hardware. MySQL Cluster tends to like solutions that guarantee latencies between nodes. They have to share network with lots of other applications and people running batch jobs can really saturate the network.

Desired HA functionality:
zero transaction loss on failure of master
minimal downtime on failures of a master
reasonable cost in performance and dollars
fast and automatic failover to local or remote server.

They would be happy with achieveing only downtime of a minute or so vs. trying to keep it down to seconds as measured by the MySQL cluster.

No changes to their programming model:
-does it support MVCC
- does it support long running transactions. UPDATES? Populate tmp table with queries and then use it to populate other tables etc. If something runs for more than 10 minutes they would like to stop it. Their application programmers love to do real-time analytics. More than a minute delay in replication will cause people to go to the master. They have queries that run for hours on replicas. Five tables at most in JOINs.

Failure happens everywhere
OS-kernel OOM or panic
mysqld failure caused by code they added (they do make mistakes)
Disk: misdirected write, corrupt write (they love InnoDB checksums) Disk failure rate: 5% per year
File system: inconsistent after unplanned hardware reboot
server: bad RAM
LAN, switch: lose connectivity
Rack: reboot. People accidentally reboot racks.
Data center: power loss, overheating, lightning, fire
People: things get killed or rebooted by mistake
-replication and reporting are concurrent on a slave. They have very little control over queries that are run.

Paul and Chip haven't seen servers. Most people haven't. A typo can cause an operation to go on a wrong server.

They try to work around BBRBC (battery backed) and RAID etc and try to work with the software work-arounds (cause they love commodity hw).

HA features they want:
-Synchronous replication as an option
- a product that watches a master and initiates a failover
-archives of the master's binlog stored elsewhere
-state stored in the filesystem to be consistent after a crash
--InnoDB and MySQL dictionaries can get out of syn
--replication can get out of sync

They couldn't wait: so they added features to 4.0.26 (they just had to make sure it would run on linux)
-transactional replication for slaves
-semi-synchronous replication
-mirrored binlogs
-fast and automatic failover

transactional replication
-replication state on a slave is stored in files
-slave SQL thread commits to storage engine and then updates a file
-a crash between the two can make replication state inconsistent
-transactional replication
-- stores replication state in teh InnoDB transaction log

Semi-synchronous replication
-HT has spoken and started building it. Not enought community demand
-blocks return from commit on master unit at least one slave has acknowledged recipt of all replication events for the transaction.
-Slave IO thread acknowledges receipt after buffering the changes
-modified MySQL replication protocol to support acknowledgements
Configurable options for
-- whether the master uses it
-- whether a slave uses it
-- How long the master waits for an ack.

Can run a server with some semi-sync replication slaves and some regular replication slaves. This can work with any storage engine that supports COMMIT, but they only support InnoDB. You can run master with some semi-synch slaves. As long as transaction is guaranteed to be on atleast one another slave when they return to user. Block commit, until replication runs on another server. They had a rack switch rebooted by mistake and they lost about 15K transactions. That's why this is very important. Even in case the master catches fire, nothing committed to the database is missing.

1 comment:

roobaron said...

Thanks for the post. Much appreciated.