Thursday, April 26, 2007

InnoDB Performance Optimization

I am sitting in the session "InnoDB Performance Optimization" by Heikki Tuuri and Peter Zaitsev. This is going to be the last session of a very enjoyable and informative MySQL Conference and Expo 2007.

General Application Design is Paramount
Of course, you should design your schema, indexes and queries right. Storage engine aspects are often just fine tuning. Storage engine selection may affect your schema layout and indexes.

Each storage engine has unique design and operating properties. An app written for one storage engine may not perform best with other storage engines. Special optimizations exist for each storage engine.

There are always transactions with InnoDB, even if you don't use them explicitly. There is a cost to transactions. If running in autocommit mode, you pay the transaction commit overhead for each statement.

You should wrap multiple updates in the same transaction for efficient operation (SET AUTOCOMMIT=0; ... COMMIT; ... COMMIT).
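A minimal sketch of the pattern; the table and column names are hypothetical:

```sql
-- Disable autocommit so each statement doesn't pay its own commit (log flush)
SET AUTOCOMMIT = 0;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- one log flush covers both updates
```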

HT points out that InnoDB cannot purge deleted rows while transactions are open. Transaction commit doesn't affect query caching, but if you run a SELECT inside a transaction the query cache may not be used.

You should make sure that you're catching deadlocks and lock wait timeouts. Transactions should not be too large. Deadlocks are normal and you should expect them; try to reduce them in the app, but they will occur. Plain SELECTs normally don't deadlock. Your application should deal with deadlocks appropriately.

Do not use LOCK TABLES, as it is designed for table-level locking storage engines. With row-level locking storage engines, transactions are the better choice. The behavior of LOCK TABLES differs between MySQL versions and depends on --innodb_table_locks. It can cause problems for portable applications if you port from MySQL 4 to later versions, and the behavior may not be what you expect from MyISAM.

PRIMARY KEY CLUSTERING: It is very important. Accessing data by PK is faster than other keys and is true for both in-memory and disk based accesses. You should try to do most lookups by primary key.

In InnoDB data is clustered by PK and sequential PK values will likely have data on the same page. PK range and prefix lookups are also very efficient. It can be used to cluster data that is accessed together.

If storing user messages, you can use a (user_id, message_id) primary key to keep all of a user's messages in a few pages.
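A sketch of that layout, with hypothetical table and column names:

```sql
-- All of a user's messages cluster together on adjacent pages
CREATE TABLE messages (
  user_id    INT UNSIGNED NOT NULL,
  message_id INT UNSIGNED NOT NULL,
  body       TEXT,
  PRIMARY KEY (user_id, message_id)
) ENGINE=InnoDB;

-- Efficient: a PK prefix range scan over a few pages
SELECT message_id, body FROM messages WHERE user_id = 42;
```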

The PK acts as a "covering index" for any set of fields in the PK, because the PK holds all the row data.

What is the cost of clustered primary key:
PK inserts in random order are costly and lead to table fragmentation; PK inserts should normally be in ascending order. If you can, load data in PK order. At times, you may find that changing the primary key to AUTO_INCREMENT is a good idea.

If you don't specify one, then InnoDB will create a clustered key internally, so you are better off defining one and using it.

Primary key updates are very expensive and require the row data to be physically moved from one place in the index to another: the row has to be deleted and then re-inserted, after which all secondary keys are updated. Never update the primary key; change it to AUTO_INCREMENT instead. HT: If you need to update the primary key, there is something wrong with your schema, and it is generally not a good schema/application design either.

What is a huge PK? It all depends on the access patterns of your application. Benchmark different keys.

What about a 32-bit hash? That's like shooting yourself in the foot: it's really long and values arrive in non-sequential order.

If really large primary key then use AUTO-INC and demote your original primary key to unique key.
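A sketch of the demotion; the table and key names are hypothetical:

```sql
-- Before: a long natural primary key clusters the table on a wide value
CREATE TABLE users_wide_pk (
  email VARCHAR(255) NOT NULL,
  name  VARCHAR(64),
  PRIMARY KEY (email)
) ENGINE=InnoDB;

-- After: short surrogate AUTO_INCREMENT PK; original key demoted to UNIQUE
CREATE TABLE users (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
  email VARCHAR(255) NOT NULL,
  name  VARCHAR(64),
  PRIMARY KEY (id),
  UNIQUE KEY (email)
) ENGINE=InnoDB;
```

Secondary indexes now carry a 4-byte integer instead of the long string.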

Keep your primary key short, as secondary indexes use the primary key to refer to the clustered index, which makes the PK value part of every index.

A long PK makes your indexes large and slow, so you are better off keeping it short.

Go easy on UNIQUE indexes, as they do not use the "insert buffer", which can speed up index updates significantly. Indexes in InnoDB are not prefix compressed, so they can take much more space than MyISAM indexes (up to 10 times larger for the same structure). You should really avoid excessive or unneeded indexes. Remember MySQL will at most use one index for a query, not two.

Keep your updates fully indexed or you can see unexpected locking problems. A statement such as DELETE FROM users WHERE name='peter' may have to lock all rows in the table if the column name is not indexed. HT: With MySQL 5.1 and READ COMMITTED, only the rows that are actually deleted are locked.
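A sketch of the fix for the example above (assuming the hypothetical users table):

```sql
-- Without an index on name, this DELETE may lock every row it scans
DELETE FROM users WHERE name = 'peter';

-- With an index, the locks stay confined to the matching rows
ALTER TABLE users ADD INDEX (name);
```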

AUTO-INCREMENT may limit scalability: AUTO-INC inserts may use table-level locks (held only to the end of the INSERT statement, not the transaction) even if you specify the AUTO-INC column value.

AUTO-INCREMENT limits scalability for concurrent inserts. HT: they have a patch and they are deciding on whether to implement it in 5.1.

MVCC: Complements row level locking to get even better concurrency. Standard SELECT statements set no locks, just reads appropriate row version.

LOCK IN SHARE MODE, FOR UPDATE modifiers can be done to do locking reads.
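A sketch of the three read modes side by side, with hypothetical table names:

```sql
-- Plain SELECT: non-locking, reads the appropriate row version
SELECT qty FROM items WHERE id = 7;

-- Shared lock: others can still read the row but not modify it
SELECT qty FROM items WHERE id = 7 LOCK IN SHARE MODE;

-- Exclusive lock: reserve the row for an upcoming UPDATE
SELECT qty FROM items WHERE id = 7 FOR UPDATE;
```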

Even long-running SELECTs do not block updates to the table or other SELECTs. However, overly long queries (transactions) are bad for performance, as a lot of unpurged versions accumulate: InnoDB can only remove a row version when no open transaction can still read it. READ COMMITTED can ease the problem in such cases. HT: READ COMMITTED releases the snapshot at each SELECT statement instead of keeping it for the duration of the transaction, so it is better for long transactions as it lets InnoDB purge old versions in the background.

How does MVCC play with indexes? When you have an index, a reference is stored in the index. When you update an indexed column, more and more versions accumulate in the index, which can slow index operations as well.

FOR UPDATE and LOCK IN SHARE MODE: Locking selects are executed in READ COMMITTED mode, because you cannot lock a row which doesn't exist. Because of this, the results of these queries can be different than for standard SELECTs.

SELECT ... FOR UPDATE always has to access the row data page to set the lock, so it can't use covering-index reads, which can further slow down queries significantly.

Reducing deadlocks: Deadlocks are normal for a transactional database. Non-locking SELECT statements do not deadlock with InnoDB. You should make sure to handle deadlocks and lock wait timeouts in your application. You should also make sure that your transactions lock data in the same order when possible.

Keep your update chunks small; you can achieve that by chopping transactions into pieces.

If you are going to update most of the selected rows, then you are better off using SELECT ... FOR UPDATE. Application-level ("external") locking, e.g. SELECT GET_LOCK('mylock'), can also be used to avoid such problems.

Effect of isolation modes on performance:
Isolation modes in InnoDB can be set globally, per connection, or per transaction.
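A sketch of the three scopes:

```sql
-- Server-wide default for new connections
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Current connection only
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Next transaction only
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
```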

READ-UNCOMMITTED: use it if you are fine with dirty reads. The performance improvement is still limited.

Foreign key performance: InnoDB checks foreign keys as soon as a row is updated; no batching is performed, nor are checks delayed until transaction commit. FKs have their own cost and often carry serious performance overhead, but they are ideal for maintaining data consistency.

FKs increase amount of row level locking done and can make it spread to a lot of tables besides the one being updated.

Restrict the number of open transactions: InnoDB performs best with a limited number of open transactions and running queries. Many running queries may cause a lot of thrashing as they bump into each other; work is being done to improve performance in such cases. innodb_thread_concurrency can be used to restrict the number of threads inside the InnoDB kernel. Many open transactions make lock graph building more complicated and increase other overhead. Whenever possible, keep a limited number of queries running at the same time by queuing on the application side.

Guidelines for setting concurrency level: Manual says set it higher or lower. It really depends on application and number of hard drives.

HK: if you are worried about thread thrashing then set thread concurrency to 1.

INSERT...SELECT runs a locking select. This is required for logical (statement-level) replication to work properly. The problem goes away with MySQL 5.1 row-level replication and the READ COMMITTED isolation level. The behavior is the same whether you have log-bin enabled or not, to keep things consistent. innodb_locks_unsafe_for_binlog helps in 5.0, but your replication may get broken, so use it only if you know what you are doing. It also disables next-key locking.

SELECT...INTO OUTFILE + LOAD DATA INFILE can often be used as a non-blocking, safe alternative.
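A sketch of the alternative; file path and table names are hypothetical:

```sql
-- Dump the rows without the locking an INSERT...SELECT would take
SELECT * INTO OUTFILE '/tmp/users_dump.txt'
FROM users WHERE created < '2007-01-01';

-- Reload them into the destination table
LOAD DATA INFILE '/tmp/users_dump.txt' INTO TABLE users_archive;
```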

COUNT(*) facts and myths:
MYTH: "InnoDB doesn't handle count(*) queries well." Most count(*) queries are executed the same way by all storage engines if you have a WHERE clause; otherwise InnoDB will have to perform a full table scan.

FACT: InnoDB doesn't optimize count(*) queries without a WHERE clause.

You can work around it with triggers and a counter table, or use SHOW TABLE STATUS LIKE 'users' to get an approximation.
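A sketch of the counter-table workaround; the table and trigger names are hypothetical:

```sql
-- Seed a counter table once, then keep it current with triggers
CREATE TABLE users_count (cnt INT UNSIGNED NOT NULL) ENGINE=InnoDB;
INSERT INTO users_count SELECT COUNT(*) FROM users;

CREATE TRIGGER users_ins AFTER INSERT ON users
  FOR EACH ROW UPDATE users_count SET cnt = cnt + 1;
CREATE TRIGGER users_del AFTER DELETE ON users
  FOR EACH ROW UPDATE users_count SET cnt = cnt - 1;

-- Or accept an estimate: the Rows column is approximate for InnoDB
SHOW TABLE STATUS LIKE 'users';
```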

InnoDB and GROUP COMMIT: group commit lets you commit several outstanding transactions with a single log write. It can improve performance dramatically, especially if you have no RAID with BBU. In 5.0, group commit does not work with binary logging due to the way XA (distributed transactions) support was implemented. Watch out for it if upgrading from MySQL 4.0.

Heikki will now talk further about server tuning settings. Everything starts with innodb_buffer_pool_size, which specifies the main InnoDB buffer: data and index pages, the insert buffer and locks are stored here, and most InnoDB activity runs inside the buffer pool. Why does InnoDB have its own cache? It's more efficient than the OS cache, especially for writes; InnoDB has to bypass OS buffering for writes. The buffer pool is very important for performance on large data sets. You can set it to 70-80% of memory for a dedicated InnoDB-only MySQL server. The default value is just 8M, independent of available memory, so make sure to configure it.

innodb_additional_mem_pool_size: just stores dictionary information and is automatically increased; do not set it too high. HT just uses 10MB for it. In the future they may remove it altogether.

innodb_log_file_size: dramatically affects write performance. You should keep it high; however, high values increase recovery time. Check how large a log you can afford. The combined log file size should be about 50% of the buffer pool memory; there is a 4GB total size limit.

innodb_log_files_in_group: the number of files of the specified size that are used for the log. There is usually no need to change the default value.

innodb_log_buffer_size: you shouldn't set it over 2-8MB unless you use huge BLOBs. The log is flushed at least once per second anyway. Check Innodb_os_log_written growth to see how actively your logs are written. InnoDB logs are physiological, not page-based, so they are very compact.

innodb_flush_log_at_trx_commit: by default logs are flushed to disk at each transaction commit. This is required for ACID guarantees and is expensive. You can set it to 2 or 0 if you can afford losing transactions from the last second or so (i.e. if you are using InnoDB as a MyISAM replacement). Value 2 is nice: if mysqld crashes, all your transactions are preserved; it is only unsafe in case of a power failure or OS crash.

InnoDB log resizing is not as simple as changing the option and restarting. You need to shut down the MySQL server and make sure it shuts down normally (check the error log to confirm). Then move away the InnoDB log files ib_logfile*. Then start the MySQL server and check the error log to see if it successfully created the new log files.

innodb_flush_method: specifies the way InnoDB works with the OS file system. On Windows, unbuffered I/O mode is always used. On Unix you can use fsync() or O_SYNC/O_DSYNC for flushing files; fsync() is usually faster. Stick with the default or use O_DIRECT (Peter).

innodb_file_per_table: lets InnoDB store each table in its own file, which makes disk management easier. The main tablespace is still needed for system purposes. It can help spread tables over multiple disks, and it allows you to reclaim space when a table is dropped. It is sometimes slower for writes, as fsync() is called sequentially, and it can increase startup/shutdown time with a large number of tables. I asked Heikki, but unfortunately there are no plans to remove the dependency on the main tablespace, as the rollback segments need to live there :((

Will there be InnoDB tablespace per database? No plans. Flexible tablespaces are similar to Oracle.

Other file IO settings:
innodb_autoextend_increment: HT doesn't think this is important. It specifies the growth increment for the shared tablespace (not for per-table tablespaces). Larger values help reduce fragmentation.

innodb_file_io_threads: changes the number of I/O threads, on Windows only. Note that the 4 threads each do a different job. This may change in the future.

innodb_open_files: the number of files used for per-table tablespaces. You should increase it if you have a lot of tables. No stats are available so far to show how many re-opens InnoDB needs to do.

innodb_support_xa: if set to 0, it reduces the work InnoDB does on transaction commit; however, the binlog can get out of sync.

So how do you minimize restart times? The InnoDB buffer pool may hold a lot of unflushed data, and shutdown may take a very long time. If you need to minimize downtime, then:

SET GLOBAL innodb_max_dirty_pages_pct=0

Watch Innodb_buffer_pool_pages_dirty in SHOW STATUS. As it gets close to 0, shut down the server.

InnoDB doesn't remove rows on delete, or old row versions on update, because they may still be needed by other transactions. The purge thread cleans up these unused rows. In some workloads the purge thread may not be able to keep up, and the tablespace will grow without bounds. Check the TRANSACTIONS section in SHOW INNODB STATUS.

innodb_max_purge_lag limits the number of transactions which have updated/deleted rows, delaying inserts/updates so the purge thread can keep up. What is a good value for innodb_max_purge_lag? HT would set it to 100,000. It is measured in history-list length, which counts transactions; a million transactions here is too much.

innodb_commit_concurrency: the number of threads that are allowed at the commit stage at the same time.

Questions: Are there any plans to make InnoDB's error log output more defined/structured?
Why does SHOW INNODB STATUS sometimes show only X transactions running and then truncate?

Google and MySQL

Chip Turner and Mark Callaghan are presenting the session "MySQL: The Real Grid Database"

Data is sharded vertically and they have lots of replicas. Resharding is a bigger pain than sharding. Make really smart software and manage with as few human resources as possible. They are going to talk about problems that matter to them.

The Grid database approach: deploy a large number of small servers.
- Use highly redundant commodity components.
- Added capacity has a low incremental cost.
- Not much capacity is lost when a server fails.
This allows them to support many servers with a few DBAs.

I asked Chip earlier and he told me that they don't use any memcache at all. Their spindles are flying more crazily than their network bandwidth.

What is manageability?
- Make it easy to do the tasks that must be done
- Reduce the number of tasks that must be done
- Make all tasks scriptable

Why does it matter?
You want to support hundreds of database servers with one DBA. This gives them time to solve more interesting problems.

They used to compile db on db server by hand, then they learned.

Underutilize your servers!
3 servers at 50% are better than 2 servers at 75%. They:
- require less maintenance
- require less tuning
- tolerate load spikes better and tolerate bad query plans better

In a perfect world:
- Short-running queries and transactions on master databases
- Users kill mistaken and runaway queries
- Accounts never use too many connections
- Query plans are good
- New apps increase the db workload by a small amount
- Only things you need to store are stored

In the real world:
- Long-running transactions
- Servers with round-robin DNS aliases make queries hard to find
- Apps create more connections when the db is slow
- Some storage engines use sampling to get query plan statistics
- New applications create new db performance problems
- Apps use the db as a log and rows are never deleted
- Many long-running queries on replicas

Improve your ability to respond, because prevention is impossible.
You need tools to make monitoring easy:
- determine what is happening across servers, and what happened across servers

Monitor everything you can (vmstat, iostat, error logs); you need these to reconstruct failures.

Save as much as possible: disk spindles are expensive, whereas disk sectors are cheap.

Script as much as possible: You can automate more than you thought possible and you are likely to reuse these scripts.

You don't need to store vmstat output in the same database. Store it in a separate database or in text files.

Monitoring matters:
You need to know what is happening: which table, which user is doing what?
Many fast queries can be as much of a problem as one slow query.

Record what happened:
- Archive SHOW STATUS counters somewhere
- Archive SHOW PROCESSLIST output somewhere. They take it every 30 seconds, compress it and save it, so they know what was running. It is very useful.
- Query data from the archive and visualize it
- Record queries that have been run (a sample is sufficient)

Monitoring tools: display counters and the rate of change for counters.
Aggregate values over many servers, visualize and rank results, and display results over time. Use innotop and mytop. Google has released mypgrep, which works across multiple servers.

Enhanced monitoring: they changed MySQL to count activity per account, table and index.
Results are exported via new SQL statements.

Displays for each account:
- seconds executing commands
- number of rows fetched and changed
- total and current connections
- number of SELECT statements
- number of row-change statements (I/U/D/R)
- number of other commands
- number of commits
- number of rollbacks
- number of binlog bytes written

SHOW TABLE_STATISTICS: displays for each table :
number of rows fetched and changed

Displays the number of rows fetched per index, which helps find indexes that are never used. The more compact the InnoDB table, the better.

MySQL High availability:
There are many great options and an even brighter future:
-MySQL cluster
-MySQL replication

They need some features right now

They are committed to InnoDB and MySQL replication. They have the constraint that they want all features right now. InnoDB works for them and they are hesitant to move to anything else. Same for replication: it works for them.

They favor commodity hardware. MySQL Cluster tends to like solutions that guarantee latencies between nodes. They have to share network with lots of other applications and people running batch jobs can really saturate the network.

Desired HA functionality:
- zero transaction loss on failure of a master
- minimal downtime on failure of a master
- reasonable cost in performance and dollars
- fast and automatic failover to a local or remote server

They would be happy with achieving a downtime of only a minute or so, vs. trying to keep it down to seconds as MySQL Cluster aims to.

No changes to their programming model:
- does it support MVCC?
- does it support long-running transactions and UPDATEs? They populate a tmp table with queries and then use it to populate other tables, etc. If something runs for more than 10 minutes, they would like to stop it. Their application programmers love to do real-time analytics; more than a minute of replication delay will cause people to go to the master. They have queries that run for hours on replicas. Five tables at most in JOINs.

Failure happens everywhere:
- OS: kernel OOM or panic
- mysqld: failures caused by code they added (they do make mistakes)
- Disk: misdirected write, corrupt write (they love InnoDB checksums); disk failure rate: 5% per year
- File system: inconsistent after unplanned hardware reboot
- Server: bad RAM
- LAN, switch: lose connectivity
- Rack: reboot. People accidentally reboot racks.
- Data center: power loss, overheating, lightning, fire
- People: things get killed or rebooted by mistake
- Replication and reporting are concurrent on a slave; they have very little control over the queries that are run

Paul and Chip haven't seen servers. Most people haven't. A typo can cause an operation to go on a wrong server.

They try to work around the need for battery-backed RAID caches and the like with software workarounds (because they love commodity hardware).

HA features they want:
- synchronous replication as an option
- a product that watches a master and initiates a failover
- archives of the master's binlog stored elsewhere
- state stored in the filesystem to be consistent after a crash
  - InnoDB and MySQL dictionaries can get out of sync
  - replication can get out of sync

They couldn't wait: so they added features to 4.0.26 (they just had to make sure it would run on linux)
-transactional replication for slaves
-semi-synchronous replication
-mirrored binlogs
-fast and automatic failover

Transactional replication:
- replication state on a slave is stored in files
- the slave SQL thread commits to the storage engine and then updates a file
- a crash between the two can make replication state inconsistent
- transactional replication instead stores replication state in the InnoDB transaction log

Semi-synchronous replication:
- HT has spoken about it and started building it, but there was not enough community demand
- blocks return from commit on the master until at least one slave has acknowledged receipt of all replication events for the transaction
- the slave IO thread acknowledges receipt after buffering the changes
- they modified the MySQL replication protocol to support acknowledgements
- configurable options for:
  - whether the master uses it
  - whether a slave uses it
  - how long the master waits for an ack

You can run a server with some semi-sync replication slaves and some regular replication slaves. This can work with any storage engine that supports COMMIT, but they only support InnoDB. A transaction is guaranteed to be on at least one other slave before commit returns to the user: commit blocks until replication has reached another server. They had a rack switch rebooted by mistake and lost about 15K transactions; that's why this is very important. Even if the master catches fire, nothing committed to the database is missing.

YouTube and MySQL

Paul Tuckfield of YouTube is giving a keynote. He has been sysadmin for 25 years, Oracle DBA for 15 years and MySQL DBA for 8 months.

For them, MySQL is one (important) piece of scalability. Top reasons for YouTube Scalability includes drinking :).

Top reasons for YouTube database scalability include Python, Memcache and MySQL replication. What you can do with Python, you can't do with other scripting languages. The fastest query is the one that is never sent to the database.

They were doing click tracking on a separate MyISAM table, whereas the read/write traffic went to InnoDB. Everything was written to one master and then replicated to slaves. There were far fewer writes than reads. Every time replication slows down, you are screwed. Keeping replication fast was hard.

They had a lot of replicas. One thing that worked was dedicating pages to replica pools: if entire pages are built from one replica pool, you get cache affinity. The most important page is the watch page on YouTube; most people come to watch a video, and if there was one thing they could make fast, it was that.

Standardizing on db boxes (they still crash daily)
4x2GHz Opteron core
12x10Krpm scsi
LSI hardware raid 10

It is not fun to see corruption in the logs. InnoDB doesn't recover very fast: when it crashed, they had so much in cache that it would take hours to recover. The master was a SPOF for them. How many backups can you restore in 8-10 hours? When you go to switch replicas, there are a lot of questions to answer. Did you test the backup? Nobody does. Are the binlog coordinates correct? They had to promote replicas to master.

Database Failures:
Replication was key to trying different hardware permutations. They got good at re-parenting/promoting replicas really fast. Then they built up ways to clone databases as fast as possible (shutdown and sync), which would make the site run again.

Lessons: Go replication. It is an excellent way to test tuning changes or fixes. You can create pools for QoS and cache affinity. Get good at creating/recreating replicas fast, and at re-parenting replicas. Keep intentionally lagged "stemcell" replicas: stop the SQL thread to keep a replica hours behind.

Trying out 5.0 "mystery cache hits"
They introduced 5.0 slaves under a 4.1 master. One of the 5.0 slaves was going really fast while the other replicas were lagging; they were surprised. If you dump and load in 5.0, you will get more performance than from just upgrading, because of the compact row format in 5.0. So don't just upgrade: be sure to dump and reload for 5.0.

Top Heavy:
2x1.5GHz CPU
512M RAM
2x300G disk

Dream upgrade: put in as much memory as possible and as many spindles as possible. It doesn't matter how fast they are; the number is what really matters.
4x4GHz cpu
16G Ram
10x10kRPM disk

If your db fits in RAM, that's great; otherwise here is what he says:
- cache is king
- writes should be cached by the RAID controller (buffered, really), not the OS
- only the db should cache reads (not the RAID controller and not even the Linux buffer cache)

The filesystem shouldn't cache reads; only the db should. A hit in the db cache means the lower caches went unused, and a miss in the db cache can only miss in the other caches, since they are smaller. Keep the FS cache smaller than the DB cache; the RAID cache is even smaller, and the DB cache is more useful. Caching reads below the db is worse than useless: reads often serialize behind writes there. Do not cache reads in the RAID controller.

Write engages 2 spindles (mirrors).

Don't cache in linux
vm.swappiness = 1-5

Cache in InnoDB

He is concerned about mystery serialization.

Writes are parallel on the master but serialized on replicas. They need true horizontal scaling: they are starting to run out of bandwidth on the master. They want truly independent masters, each with its own set of replicas. How is it going to work? How do you move users between masters? You may screw up and corrupt data.

EMD = Extreme Makeover: Database (Even more databases :)) codename for horizontal scaling

Slave execution must serialize to preserve commit order.

"oracle" (lower case): a caching algorithm to predict the future. They manually warm the cache by doing SELECTs, fetching blocks into cache ahead of the changes. Replication gets great cache hits by using mysqlbinlog + a Python read-ahead script (it got 3-4 times faster).

"The Diamond": For golive, play shards binlogs back to original master for fallback.

Some questions I have been wanting to ask Paul:
How many total database servers?
How many Memcache servers?
Read-ahead script for replication?
How many concurrent threads?

Wednesday, April 25, 2007

Fotolog: Scaling the world's largest photo blogging community

Yesterday my talk about "Scaling the world's largest photo blogging community" went very well and I couldn't be happier. There were a lot of questions from the audience at the end, which made me really happy, as it was a clear sign that my presentation wasn't flying over their heads :)

Thank you to all those who attended. I will be posting the slides to my talk later tonight (it sucks that Blogger doesn't have file upload).

Using Hibernate for MySQL

Charles Lee is presenting the session "Using Hibernate to Ease the Migration to Open Source Databases."

Most of the talk so far is Hyperic related.

EJBQL is a poor query language with bad transaction handling, pessimistic locking and a lack of database tools. Entity Beans are proxy objects; managers interface with entity beans.

Why migrate to Hibernate?

HQL and Criteria-based queries supported by Hibernate are fully featured. Hibernate does lazy fetching and doesn't load objects until you need them. There is straightforward transaction demarcation, it has secondary cache integration, and it is a pretty popular framework.

Hibernate POJOs and Transactions: Hibernate POJOs are actually detached from the framework and don't have to inherit the framework API. They are plain objects that can be passed around. Managers look up the POJOs, and POJOs travel through the transaction boundaries. Hibernate sessions use optimistic locking; nothing is locked until you flush the session and reconnect. Integration with a secondary cache caches both POJOs and query results.

The query languages:
EJBQL: lacks trivial functions like "ORDER BY"

JBossQL: Some additional functions like "ORDER BY", "LCASE" etc

DeclaredSQL: direct-SQL-like queries declared as EJB finder methods, but not database-specific.

Direct SQL: when they hit a wall in the application, they go to direct SQL.

HQL (and the Criteria API): close to SQL, object oriented, and allows query results to be paged. You get all the functions you expect, like ORDER BY, and you can take advantage of database-specific SQL.

Some slides about HQL vs. Criteria API.

Hibernate supports MySQL out of the box. You can use Hibernate's hbm2ddl to create the database schema; it hooks into Hibernate's dialects to generate database-specific DDL. You can also use your own DBUtil to populate the database initially.

Amazon S3 Storage Engine

Mark Atwood, a good friend of mine, is presenting a session on Amazon S3 Storage engine for MySQL.

The primary key has to be UTF-8. Results come back in less than 70ms. "If you build it, they will come."

A bucket is fully virtually hosted; it gives you a virtual SQL CMS for S3. You can put a billion photographs in it.

Is it a transactional engine? No. The S3 service is not transactional, and there are no temporal guarantees in S3: if you start doing writes on a table and someone else reads, it may take some time for them to show up. The reason is that S3 is very distributed; Amazon replicates it all over.

If you put no WHERE clause on a query, you can "make money fast" for Amazon, since a full scan fetches every row from S3.

There are many opportunities. The engine keeps an MD5 hash of the contents of each BLOB, and there are ACLs. S3 handles HTTPS, but it isn't implemented yet in the storage engine.

There are other Amazon services that can use storage engines like replication distribution via S3 and cluster persistent storage via S3.

Where to get it? It uses MySQL 5.1 storage engine interface and is available at

He is disappointed about little feedback from community on his efforts.

Federation at Flickr: Doing Billions of queries a day

I am sitting in Dathan's full room session.

In 2005 they came up with Federation.

Key components of federation.

Shards each hold a single master's data containing a fraction of the user content. The global ring is the ability to get to a shard; think of it as DNS. They calculate where your data is at that moment in time, then PHP logic connects to the shards (10 lines of code).

Shards are slices of the main database, set up in active master-master ring replication, which works by sticking a user to one server in a shard. Shard assignments come from a random number for new accounts. Migration is done from time to time, and shards can run on any hardware grade; they just adjust the number of users per shard. They have issues with auto-increment and 4.1. External processes balance data across the shards; they move people all the time.

The global ring (aka the lookup ring) is used for anything that can't be federated, i.e. to find where stuff is.

It makes sense for them to hash by user name, but then what about groups, where many users contribute to the same group?

Owner_id -> Shard_id
Photo_id -> Owner_id
Group_id -> Shard_id

These mappings are then cached in memcache for half an hour.

There are two ticket servers (MySQL 5), each using MySQL 5's auto-increment offset feature.

MySQL 4.1 doesn't let you set an auto-increment offset.
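The two-ticket-server scheme works because MySQL 5's auto_increment_increment and auto_increment_offset let one server hand out odd ids and the other even ids, so they can never collide. A toy simulation (class and variable names are mine, not Flickr's):

```python
class TicketServer:
    """Simulates a MySQL 5 ticket server configured with
    auto_increment_increment=2 and a per-server auto_increment_offset."""

    def __init__(self, offset, increment=2):
        self.next_id = offset      # first id this server will hand out
        self.increment = increment # step between consecutive ids

    def ticket(self):
        tid = self.next_id
        self.next_id += self.increment
        return tid

odd_server = TicketServer(offset=1)    # hands out 1, 3, 5, ...
even_server = TicketServer(offset=2)   # hands out 2, 4, 6, ...
```

Either server can be asked for the next id, so one can die without stopping id generation, and their sequences stay disjoint.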

When you click a favorite, the photo owner's id is pulled from cache to get the shard location, which may be shard 5. Then your own information is pulled from cache to get your shard, which may be shard 13. A "distributed transaction" is then started to answer questions such as "who favorited this person's photo?" and "what are your favorites?" That way they can recover data in case of a catastrophe.

See slide for "distributed transaction."

What if replication breaks between the two servers in a shard? On every page load the user is assigned to a bucket:

$id = intval(substr($user_id, -10));
$host = $id % $count_of_hosts_in_shard;

If a host goes down, they go to the next host in the list; if all hosts are down, they display an error page. On every page load they check whether they can connect.
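Combining the bucket math above with that failover rule, a Python sketch might look like this (pick_host and is_up are hypothetical names; Flickr's real code is PHP):

```python
def pick_host(user_id, hosts, is_up):
    """Map a user to a host in the shard, walking down the list when
    hosts are down. `hosts` is the shard's server list and `is_up` is
    a liveness check callable (both names are illustrative)."""
    # Mirrors the PHP: intval(substr($user_id, -10)) % count of hosts.
    start = int(str(user_id)[-10:]) % len(hosts)
    for i in range(len(hosts)):
        host = hosts[(start + i) % len(hosts)]
        if is_up(host):
            return host
    return None  # all hosts down -> caller shows the error page
```

The same user always lands on the same bucket while all hosts are healthy, which is what keeps a user stuck to one server in the master-master pair.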

Each server in a shard runs at 50% load, i.e. one server in a shard can take the full load if the other is down or in maintenance mode.

If a photo gets really popular, traffic spikes and they have to break the 50% rule to keep serving it.

For maintenance, they shut down half the servers in each shard, bring them back up, and then do the other half.

An average page runs 27 to 35 SQL statements, and they support any type of query.

When load on a shard grows to 2-6K queries, they add a new shard. Each shard holds 400K+ users' data and handles it all.

Search: they have two search back-ends, the shards themselves and Yahoo's search. An owner's single-tag search goes to the shards due to real-time requirements; if you search your own tags, that goes to the shards too. All other searches go to a Yahoo search back-end, which stores the text fields and descriptions and returns photo ids.

Hardware: EM64T boxes running RHEL4 with 18GB of RAM and six 15K RPM disks in RAID-10. Any class of server can be used; the number of users per shard just has to be adjusted. The cost per query is so little that "it should be considered 0." Data size is 12TB of user data, and being able to serve that traffic is what he calls "priceless."

What would help Flickr: multi-master replication, the InnoDB thread bug fixed in the 4.1 release line, and optimization for OR queries, bitwise operations, sets, etc.

They use MySQL 5.0 in shards that hold auxiliary data and for generating ticket ids (no replication, a single row per server containing the last auto-increment id that was generated).

Backup procedure and aggregation for data mining: they run ibbackup from a cron job across the various shards at different times.

How many max connections you have? Schema rollout? 400 connections per shard, 800 connections across both servers in shard.

innodb_thread_concurrency is 32 concurrent threads, with about 5% of the 12TB in the buffer pool.

Where do you store photos? On a filer, in multiple data centers.

How much data per shard? At most 1/2 TB for large shards. They use innodb_file_per_table.

Tuesday, April 24, 2007

Getting Best out of MySQL on Solaris

I am sitting in the session "Getting Best out of MySQL on Solaris 10" by Yufei Zhu (yufei.zhu at

She is going to focus on InnoDB performance tips and how to monitor performance on Solaris using Dtrace examples.

When we try to use mysqldump to back up inside zones, it ends prematurely.

MySQL is a single-process, multi-threaded application. Thousands of user connections mean thousands of threads, and this becomes a problem for some customers. MySQL and Sun Microsystems are close partners (I really doubt the "close" part).

She is now going over the characteristics of InnoDB, like storing data physically in primary key order. You don't want your primary key to be too long (haha, come to my session for a different point of view).

When MySQL is first started, 10 threads are created.

Thread 1 handles network connections and creates new threads for user connections.

Threads 2 to 5 are the read-ahead, log write, insert, and write threads.

Thread 6 is the rollback thread; threads 7 and 8 are monitor threads.

Thread 9 is the master thread, and thread 10 is the signal handler thread.

Theoretically, a raw device will give you the best performance.

Use "show innodb status", "show processlist", "mysqladmin extended-status" for performance monitoring.

You cannot have IO bottlenecks. prstat is much more useful than top. She encourages the use of forcedirectio for UFS.

If there are a lot of write transactions, make sure to configure enough disks with very small latency for the log files, since MySQL serializes pwrite calls.

You need low latency for log writes.

innodb_buffer_pool_size: how much data you want to cache in memory.
innodb_log_buffer_size:
innodb_thread_concurrency: don't set it too high, as there can be locking contention.
innodb_flush_log_at_trx_commit = 0 (default is 1): a big gain, but at the risk of losing the last second of data.
innodb_flush_method: the default is as good as it gets.
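Pulled together in my.cnf form, those settings might look like the following (values are illustrative placeholders, not recommendations from the session; size them for your own workload):

```ini
[mysqld]
# Cache as much hot data as RAM allows.
innodb_buffer_pool_size = 4G
innodb_log_buffer_size = 8M
# Too high a value can cause locking contention.
innodb_thread_concurrency = 16
# 0 trades up to ~1 second of data on a crash for a big speed gain.
innodb_flush_log_at_trx_commit = 1
# innodb_flush_method is left at its default, per the session's advice.
```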

MyISAM performance is fairly good, but it is non-transactional. In 5.1 there is row-level locking for MyISAM too (she thinks). Each MyISAM table is stored on disk in three files.

MyISAM doesn't have a special data buffer cache; it relies on the filesystem buffer cache. The default filesystem cache setting is 12% of memory, so be sure to tune that up.

MySQL 5.1 has started using mmap instead of malloc, which gives huge performance gains.
Each pread is about 200 bytes. She has seen a 50% performance jump.

General performance tips:
key_reads / key_read_requests gives you the miss rate for the key cache (the fraction of index reads that went to disk; the lower, the better). Increase key_buffer_size if necessary, as this is the index cache.
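As a quick sanity check of that ratio (the counter values below are made up, not from the session):

```python
def key_cache_miss_rate(key_reads, key_read_requests):
    """Key_reads / Key_read_requests from SHOW STATUS: the fraction of
    index read requests that had to go to disk. Lower is better; if it
    is high, raise key_buffer_size."""
    if key_read_requests == 0:
        return 0.0
    return key_reads / key_read_requests

# Made-up SHOW STATUS numbers: 1,000 disk reads out of 100,000 requests.
miss = key_cache_miss_rate(1_000, 100_000)   # 0.01, i.e. a 1% miss rate
```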

read_rnd_buffer_size improves "order by" performance.

Since MyISAM doesn't have its own data cache, setting segmap_percent=50 in /etc/system will use 50% of memory as filesystem buffer cache (if using MyISAM, make sure you tune this up from the default of 12%).

Use the query cache only if needed; there are locking issues with some customer applications.

Only release 5.1 has the mmap change. It needs myisam_use_mmap=1 in my.cnf to take effect.

Don't use forcedirectio for MyISAM, as it bypasses the filesystem cache that MyISAM depends on.

MySQL system tables are MyISAM, so should we really turn on forcedirectio? Yes; then build separate filesystems, one with forcedirectio and one without for MyISAM, since MyISAM has to rely on filesystem caching.

MySQL does a lot of malloc.

If using 64-bit MySQL: setenv LD_PRELOAD_64 /usr/lib/sparcv9/ before MySQL is started. It gives better performance.

On 32-bit servers: setenv LD_PRELOAD /usr/lib/ before starting MySQL. 4GB is obviously the limit.

You have to look at IO and disk utilization. Always separate log files from data files and put them on different disks; eliminating the IO bottleneck is the first step. Putting log and data together can drive response times up.

Analyze MySQL performance counters by tuning mysqld startup options, optimizing SQL statements, and examining queries with EXPLAIN.

Another thing is to analyze system performance statistics, like CPU and memory utilization. Are there lock contentions? Is the CPU idle during high load?

Examples of performance analysis using Solaris performance tools.
You can get hooked to them :)

Why is the performance bad? Her examples are read-only workloads.
mpstat 5
smtx: system-level mutex contention.

vmstat 5
Shows how much memory is being used. Watch the swap and free memory columns.

ps -ef | grep mysql
gives you the PID; then run a bare prstat.
mysqld/74 means 74 threads; subtract the 10 threads created when MySQL starts with InnoDB, and you can see 64 connections are open. Also shown are the resident memory size and who consumes the most CPU on the system.

prstat -Lmc -p 16613
Show all the top threads (who is using the most CPU).

mysqld/65 means the 65th thread is using the most processor. Also look at user time and system time.

LCK is not accurate, but it gives a rough indication of how much time is spent waiting on user-level locks.

SLP is sleep: how much time was spent sleeping.

SCL: System calls (percentage)

pstack 16613/65 [PID/thread id]: what is the thread doing? Where is it stuck? If it is writing, where is it writing to?

Do you have IO bottleneck?
iostat -xtcnz 4

The key thing is the service and response times of the disks; don't look only at %b. Don't go over 25ms service time, and the less the better. If you see 60ms, you have an IO bottleneck.

Use iostat and dtrace combined to work out IO issues.

#!/usr/sbin/dtrace -s
syscall::pwrite:entry
{ self->t1 = timestamp;
  printf("hirestime %u, tid=%d, length=%d\n", timestamp, tid, arg2); }
syscall::pwrite:return
/self->t1/
{ printf("elapsed %d ns\n", timestamp - self->t1); self->t1 = 0; }

Look at its output: tid is the thread id, and offset is the block number on the disk. You capture everything with dtrace and can see the timestamps; elapsed time shows how much time went into pwrite, and length tells how much was written. Apache also serializes log writes: when running ab, turn off the log file to see a huge performance gain.

truss -c -p 16613
truss can solve about 90% of real-life problems; it's that powerful. -c counts calls, so you can see the read, write, and time calls, etc.

lwp_park indicates user-level sleep queues: it means there are a lot of user-level locks in this application.

truss -p 16613

See what a process is doing: how many time() calls there are and when write is being called. It also shows socket numbers.

pfiles 16613
Shows what files a process has open; cross-reference the socket numbers seen in truss. There are generic dtrace scripts available to count user-level function calls; here memcpy was the most called, followed by mutex_unlock, etc. This gives an indication of what user function calls are being made.

dtrace -F -p 2185 -n 'pid$target:::entry,pid$target:::return{trace(timestamp);}'

MySQL dtrace provider: they are still working with MySQL to get it integrated, and did some probes in-house. They integrated with Chime (a graphical tool), which lets you look at MySQL IO wait time, InnoDB index scans, system calls, table scans, etc. Start MySQL, fire off Chime, and watch for goodness.

In demo booth they are showing how to use Chime. You can even have your own probes to work with Chime.

Consolidation with Solaris containers
Virtualization with very, very low performance overhead; configuring and administering Solaris containers is easy. You can run multiple mysqld instances (as many as 7) to reduce lock and IO contention. Performance-wise, the overhead of virtualization is around 1%.

On their website they have a graphical tool to build a container, plus scripts to do it yourself. Containers improve system utilization and scalability.

The SAMP stack can be downloaded from . It has Apache 2, MySQL 5, PHP 5, and even memcached, and is optimized for the Sun Solaris OS environment. Just install it with the pkg command; it reduces installation time.

At the MySQL Conference

We (Michelle and I) arrived late Sunday night in Santa Clara to attend the MySQL Conference, and already it's so much fun that I haven't been blogging at all.

Yesterday night we had dinner with some friends including Mark Atwood, Marc Simony, Ronald Bradford, Sheeri Kritzer, Mr and Mrs. Cole, Eric Bergen, Pascal and Christine.

Sheeri also got an award for being the top community advocate.

Marten's and Guy Kawasaki's keynotes were very interesting. Ronald has done a good job of jotting down the interesting points from Guy Kawasaki's keynote.

More to follow...