Wednesday, April 25, 2007

Federation at Flickr: Doing Billions of queries a day

I am sitting in Dathan's packed session.

In 2005 they came up with Federation.

Key components of federation:

Shards each contain a fraction of the user content, with a single master for that slice. The global ring is what lets them get to a shard; think of it as DNS: they calculate where your data lives at that moment in time. PHP logic then connects to the right shard (about 10 lines of code).
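
A minimal sketch of what that connect logic might look like, assuming a lookup_shard_id() global-ring helper and a pick_host() bucketing helper (both sketched further down); none of these names are Flickr's actual code:

// Hypothetical sketch: resolve the user's shard via the global ring,
// then connect to one of that shard's masters.
function connect_to_shard_for_user($user_id) {
    $shard_id = lookup_shard_id($user_id);   // global ring lookup, cached (see below)
    $host = pick_host($shard_id, $user_id);  // bucket the user onto a live master
    return mysqli_connect($host, 'flickr', 'secret', 'shard_' . $shard_id);
}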

Shards are slices of the main database, set up in active master-master ring replication; conflicts are avoided by sticking a user to one server within the shard. New accounts get their shard assignment from a random number. Migration is done from time to time, and shards can run on any grade of hardware; they just adjust the number of users per shard. They have issues with auto-increment on MySQL 4.1. External processes balance data across the other shards; they move people all the time.
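
A sketch of that random assignment for a new account; the $OPEN_SHARDS list and the global_ring_set() helper are assumptions of mine:

// Hypothetical sketch: give a new account a random shard and record
// the mapping in the global ring.
function assign_new_account($user_id) {
    global $OPEN_SHARDS;                             // assumed: shard ids accepting new users
    $shard_id = $OPEN_SHARDS[array_rand($OPEN_SHARDS)];
    global_ring_set('user:' . $user_id, $shard_id);  // assumed write to the lookup ring
    return $shard_id;
}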

The global ring, also known as the lookup ring, is used for things that can't be federated; it's how they find where things live.

It makes sense for them to hash by user, but then what about groups? Many users contribute to the same group.

Owner_id -> Shard_id
Photo_id -> Owner_id
Group_id -> Shard_id

These lookups are then cached in memcache, where they last for half an hour.
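
A sketch of that cached lookup, assuming PHP's Memcached extension and a global_ring_query() helper of my own naming:

// Hypothetical sketch: map an owner id to a shard id, caching the
// answer in memcache for half an hour (1800 seconds).
function lookup_shard_id($owner_id) {
    $mc = new Memcached();
    $mc->addServer('memcache1', 11211);           // assumed cache host
    $key = 'shard_of:' . $owner_id;
    $shard_id = $mc->get($key);
    if ($shard_id === false) {
        $shard_id = global_ring_query($owner_id); // assumed query against the lookup ring
        $mc->set($key, $shard_id, 1800);          // cache lasts 1/2 hour
    }
    return $shard_id;
}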

Two ticket servers (MySQL 5), each using MySQL 5's auto-increment offset feature so the two never hand out the same id (one generates odd ids, the other even).

MySQL 4.1 doesn't let you set an auto-increment offset, which is why the ticket servers run MySQL 5.
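
The single-row-per-server detail comes up in the Q&A below; one common way to implement it, plus the settings that make the two servers interleave ids, looks roughly like this. The table name Tickets and the connection details are assumptions:

// my.cnf on ticket server 1 (server 2 would use auto_increment_offset = 2):
//   auto_increment_increment = 2
//   auto_increment_offset    = 1
// Hypothetical fetch of a new globally unique id. REPLACE INTO keeps the
// table at a single row while still advancing the auto-increment counter.
$db = mysqli_connect('ticket1', 'flickr', 'secret', 'ticketdb');
mysqli_query($db, "REPLACE INTO Tickets (stub) VALUES ('a')"); // stub is a unique column
$ticket_id = mysqli_insert_id($db);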

Click a favorite: they pull the photo owner's id from cache to get that owner's shard location, which may be shard 5. Then they pull your information from cache to get your shard, which may be shard 13. Then they start a "distributed transaction" to answer questions such as "who favorited this person's photo?" and "what are your favorites?". Writing to both shards means they can recover the data in case of a catastrophe.

See slide for "distributed transaction."
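
A rough sketch of that two-shard favorite write; this is paired transactions rather than true two-phase commit, and every table and helper name here is mine, not Flickr's:

// Hypothetical sketch: write the favorite on both the owner's shard and
// the faver's shard so each side can answer its question locally.
function add_favorite($user_id, $photo_id, $owner_id) {
    $owner_db = connect_to_shard_for_user($owner_id);  // e.g. shard 5
    $user_db  = connect_to_shard_for_user($user_id);   // e.g. shard 13
    mysqli_begin_transaction($owner_db);
    mysqli_begin_transaction($user_db);
    mysqli_query($owner_db, "INSERT INTO PhotoFavoritedBy (photo_id, user_id) VALUES ($photo_id, $user_id)");
    mysqli_query($user_db,  "INSERT INTO UserFavorites (user_id, photo_id) VALUES ($user_id, $photo_id)");
    mysqli_commit($owner_db);  // if one side fails, the surviving copy is
    mysqli_commit($user_db);   // enough to rebuild the other: the recovery story
}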

What if replication breaks between the two servers in a shard? On every page load the user is assigned to a bucket:

$id = intval(substr($user_id, -10));       // last 10 digits of the user id
$bucket = $id % $count_of_hosts_in_shard;  // which host in the shard to use


If a host goes down, they go to the next host in the list; if all hosts are down, they display an error page. On every page load they calculate whether they are able to connect.
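
Putting the bucket math and the failover rule together, a sketch; the walk-the-list loop and the can_connect() check are my reading of the description, not their actual code:

// Hypothetical sketch: start at the user's bucket, walk the shard's host
// list, and only give up with an error page if every host is down.
function pick_host($shard_id, $user_id) {
    global $SHARD_HOSTS;                  // assumed map: shard_id => host list
    $hosts = $SHARD_HOSTS[$shard_id];
    $n = count($hosts);
    $bucket = intval(substr($user_id, -10)) % $n;
    for ($i = 0; $i < $n; $i++) {
        $host = $hosts[($bucket + $i) % $n];
        if (can_connect($host)) {         // assumed cheap connectivity check
            return $host;
        }
    }
    die('Error: all hosts in this shard are down.');  // the error-page case
}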

Each server in a shard runs at 50% load, i.e. one server in a shard can take the full load if the other server in that shard is down or in maintenance mode.

If a photo gets really popular, traffic spikes and they have to break the 50% rule to keep serving it.

For maintenance, they shut down half the servers in each shard, bring them back up, then do the other half.

Pages average 27 to 35 SQL statements each, and they support all query types.

When a shard grows to 2 to 6K queries, they add a new shard. Each shard holds 400K+ users' data and handles all of their load.

Search: they have two search back-ends and use Yahoo!'s search. An owner's single-tag search on their own photos goes to the shards due to the real-time requirement; all other searches go to a Yahoo! search backend, which stores the text fields and descriptions and returns photo ids.
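
A sketch of that routing decision; shard_tag_search() and yahoo_search() are placeholder names of mine, not real APIs:

// Hypothetical sketch: owner single-tag searches need real-time data and
// go to the shards; everything else goes to the Yahoo! search backend.
function route_search($searcher_id, $owner_id, $tags, $text) {
    if ($searcher_id === $owner_id && count($tags) === 1 && $text === '') {
        return shard_tag_search($owner_id, $tags[0]);  // query the owner's shard
    }
    return yahoo_search($owner_id, $tags, $text);      // returns photo ids
}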


Hardware: EM64T boxes running RHEL 4 with 18GB of RAM and six 15K RPM disks in RAID 10. Any class of server can be used; the number of users per shard just has to be adjusted. The cost per query is so small that "it should be considered 0". Data size is 12TB of user data, and being able to serve that traffic is what he calls "priceless."

What would help Flickr: multi-master replication; the InnoDB thread bug fixed in a 4.1 release already; optimization for OR queries, bitwise operations, sets, etc.

Questions:
They use 5.0 on shards that hold auxiliary data and for generating ticket ids (no replication; a single row per server holds the last auto-increment id that was generated).

Backup procedure and aggregation for data mining: they use ibbackup on a cron job that runs across the various shards at different times.

How many max connections do you have? How do schema rollouts work? 400 connections per server, i.e. 800 connections across both servers in a shard.

innodb_thread_concurrency: 32 concurrent threads. Roughly 5% of the 12TB of data is held in the buffer pools.

Where do you store photos? On a filer, in multiple data centers.

How much data per shard? At most 1/2 TB for the large shards. They use innodb_file_per_table.
