Thursday, June 10, 2010

Large-scale Storage/Computation at Google

Stu Hood pointed me to an interesting keynote today, given by Jeffrey Dean, a Fellow at Google, as part of the ACM Symposium on Cloud Computing. In it, he talks about the large-scale storage and computation infrastructure currently used at Google. It starts off kind of slow but picks up (for me) after he talks a bit about MapReduce.

The presentation with slides is available here (Silverlight required).

Some interesting bits to me:
  • He talked about several patterns for distributed systems that they have found useful.
  • Google currently MapReduces through about an exabyte of data per month.
  • Interesting example of how they use MapReduce - to return the relevant map tiles in Google Maps for a given query
  • He pointed out that they have Service Clusters of BigTable so that each group doesn't have to maintain its own. This relates to what Stu and I are doing at Rackspace: creating multi-tenant Hadoop and Cassandra clusters for similar reasons.
  • They use ranged distribution of keys for BigTable, saying that consistent hashing is good in some ways, but they wanted to be able to have locality of key sequences.
  • He talked about something I've been looking at recently - how to do custom multi-datacenter replication by table (or for Cassandra by keyspace).
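
To make the point about ranged key distribution versus consistent hashing a bit more concrete, here is a minimal sketch in Python. It is not how BigTable or Cassandra actually implement placement; the class names (`RangePartitioner`, `HashRing`), the split points, and the node names are all made up for illustration, and the hash ring skips virtual nodes to stay short. The idea it tries to show: with range partitioning, keys that sort next to each other land in the same range, while a hash ring places each key wherever its hash happens to fall.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring (MD5 is only used for spreading)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class RangePartitioner:
    """Hypothetical range partitioner: sorted split points divide the key space."""

    def __init__(self, split_points):
        self.split_points = sorted(split_points)

    def owner(self, key: str) -> int:
        # Index of the range that contains `key`; adjacent keys share a range
        # unless a split point falls between them.
        return bisect.bisect_right(self.split_points, key)


class HashRing:
    """Hypothetical consistent-hash ring with one token per node (no virtual nodes)."""

    def __init__(self, nodes):
        self.ring = sorted((_hash(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.ring]

    def owner(self, key: str) -> str:
        # First node token at or after the key's hash, wrapping around the ring.
        index = bisect.bisect_left(self.tokens, _hash(key)) % len(self.ring)
        return self.ring[index][1]


# Keys that form a contiguous sequence, e.g. rows for consecutive users.
keys = ["user100", "user101", "user102", "user103"]

ranges = RangePartitioner(["m", "user200"])
range_owners = [ranges.owner(k) for k in keys]

nodes = ["node-a", "node-b", "node-c"]
ring = HashRing(nodes)
ring_owners = [ring.owner(k) for k in keys]

print("range owners:", range_owners)  # all four keys fall in the same range
print("ring owners: ", ring_owners)   # placement depends on each key's hash
```

With range partitioning, a scan over `user100` through `user103` only has to touch one range, which is the locality of key sequences mentioned above. On the hash ring, the same scan may have to visit several nodes, though adding or removing a node only moves the keys adjacent to that node's token.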

