Saturday, May 31, 2008

Michael Arrington Asks Twitter a Few Tough Questions

Michael Arrington of TechCrunch asks Twitter a few questions. I have included only a sample below, but you should read his blog post for the full list:
  • Is it true that you only have a single master MySQL server running replication to two slaves, and the architecture doesn’t auto-switch to a hot backup when the master goes down?
  • Do you really have a grand total of three physical database machines that are POWERING ALL OF TWITTER?
  • Is it true that the only way you can keep Twitter alive is to have somebody sit there and watch it constantly, and then manually switch databases over and re-build when one of the slaves fail?

A 'yes' answer to any of these questions by Twitter would be disturbing, to say the least. However, it wouldn't be surprising, as many companies expect databases to just somehow magically work without creating and supporting a proper architecture. High availability doesn't come cheap, and for a company, reputation is everything.

I find it amusing that Twitter isn't even looking for a DBA. Maybe that's considered a job for the SA over there :)

5 comments:

Mark Callaghan said...

How do you do auto-failover from one master to one of N slaves without possibly losing the other slaves or skipping transactions or losing transactions? It can be very difficult to determine the correct offset for the slave on the new master.

Frank said...

I agree with your point. How about automatic failover from one master to another in a multi-master setup? Sure, it's very difficult, but is it impossible?

Mark Callaghan said...

Will that scale beyond the 2 nodes? Or, if you attach slaves to each of the master nodes, can you tolerate loss of half of the slaves when a master fails? If not, then the problem hasn't gone away.

Anonymous said...

I agree with Mark. From what I understand, Twitter's problems seem to stem from write-related issues. I think a partitioned (sharded) solution is their only hope Obi-wan.

At least in a sharded solution, if they didn't have failover within the shard, only that shard's users would be affected.

-jay

Frank said...

Jay and Mike, I agree with you both.

I guess the point I am trying to make is to minimize the impact.

Depending on a single un-partitioned master means that when it fails, 100% of the site goes down.
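
To put rough numbers on "minimize the impact", here is a minimal Python sketch. It is only an illustration: the shard count, the CRC32-based routing, and the names `shard_for` and `affected_fraction` are all my own assumptions, not anything Twitter is known to use.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for(user: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user name to a shard number using a stable hash (CRC32)."""
    return zlib.crc32(user.encode("utf-8")) % num_shards


def affected_fraction(users, down_shards, num_shards=NUM_SHARDS):
    """Return the fraction of users whose shard is currently down."""
    users = list(users)
    if not users:
        return 0.0
    hit = sum(1 for u in users if shard_for(u, num_shards) in down_shards)
    return hit / len(users)


# Made-up user names, just to have something to route.
users = [f"user{i}" for i in range(10_000)]

# One shard out of four is down: only the users on that shard are affected.
print(affected_fraction(users, {0}))

# A single un-partitioned master is the num_shards=1 case: everyone is affected.
print(affected_fraction(users, {0}, num_shards=1))
```

With one master, losing it takes out every user. With N shards and no failover inside a shard, losing one master takes out only the users routed to that shard, roughly 1/N of them if the hash spreads users evenly.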