Wednesday, June 16, 2010

Installing Sun Java on Lucid Lynx

First I had to install python-software-properties to be able to run add-apt-repository:

sudo apt-get install python-software-properties

Then I added the partner repository that carries Sun's JDK:

sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo aptitude update
sudo aptitude install sun-java6-jdk

Then, if there are multiple jdk alternatives on the system, choose which one you want with:

sudo update-alternatives --config java

Thursday, June 10, 2010

Large-scale Storage/Computation at Google

Stu Hood pointed me to an interesting keynote today given by Jeffrey Dean, a Fellow at Google. It's part of the ACM Symposium on Cloud Computing. In it, he talks about the current large-scale storage and computation infrastructure used at Google. It starts off kind of slow but picks up (for me) once he gets into MapReduce.

The presentation with slides is available here (Silverlight required).

Some interesting bits to me:
  • He talked about several patterns for distributed systems that they have found useful.
  • Google currently MapReduces through about an exabyte of data per month.
  • An interesting example of how they use MapReduce: returning the relevant map tiles in Google Maps for a given query.
  • He pointed out that they run service clusters of BigTable so that each group doesn't have to maintain its own. This relates to what Stu and I are doing at Rackspace: creating multi-tenant Hadoop and Cassandra clusters for similar reasons.
  • They use ranged distribution of keys for BigTable, saying that consistent hashing is good in some ways, but they wanted locality of key sequences.
  • He talked about something I've been looking at recently: how to do custom multi-datacenter replication by table (or, for Cassandra, by keyspace).
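The ranged vs. hashed key distribution tradeoff is easy to see in a few lines of Python. This is a toy illustration, not Google's actual placement scheme; the keys, range bounds, and node count are made up for the example:

```python
import bisect
import hashlib

KEYS = ["user:100", "user:101", "user:102", "user:200"]

# Ranged placement (BigTable-style): each node owns a contiguous span
# of the sorted key space, so adjacent keys land on the same node.
RANGE_BOUNDS = ["user:150", "user:300"]  # upper bounds for nodes 0 and 1

def ranged_node(key):
    return bisect.bisect_left(RANGE_BOUNDS, key)

# Hashed placement: position depends on the key's hash, so adjacent
# keys scatter, and a range scan has to touch many nodes.
NUM_NODES = 3

def hashed_node(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_NODES

# Under ranged placement, a scan of sequential keys hits one node.
nodes_for_scan = {ranged_node(k) for k in KEYS[:3]}
```

The point is that a scan over `user:100..user:102` touches a single node under ranged placement, while hashing spreads those same keys around the cluster.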

Wednesday, June 2, 2010

Presentation on Cassandra+Hadoop

Last night I gave a presentation at the Austin Hadoop User Group about Cassandra + Hadoop. It was a great group of people in this relatively new user group; probably around 20-30 people were there.

My slides are available in Keynote form on SlideShare, linked here:

Steve Watt from IBM's BigSheets and Emerging Technologies team pointed out that Cassandra has an edge over the native Hadoop technologies in that you can query output in Cassandra immediately. Using just Hadoop, especially HDFS for the output, you have to export the MapReduce output to another system if you want to do any kind of reporting. That can be a significant extra step.

Stu Hood also pointed out that even though it is possible to run MapReduce over data in Cassandra, HDFS and HBase are built to stream large chunks of data. So Cassandra will be slower from that perspective at this point. Work can be done to optimize that though. I think no matter what, you're choosing your data store for a variety of reasons - if your data fits better in Cassandra, now you have an option of running MapReduce directly over it. I think that's a significant advance.
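For anyone new to the pattern being discussed, the map/shuffle/reduce flow can be sketched in a few lines of Python. This is a toy in-memory word count, not how Hadoop or Cassandra actually execute the phases:

```python
from collections import defaultdict

def map_phase(records):
    # Emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Group intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["cassandra hadoop", "hadoop mapreduce hadoop"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"cassandra": 1, "hadoop": 3, "mapreduce": 1}
```

The streaming point above is about the shuffle and the reads feeding the map phase: HDFS and HBase hand mappers large sequential chunks, whereas reading the same data out of Cassandra is (today) a less sequential path.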

Lots of thanks to Stu Hood as well as Jeff Hammerbacher (on #hadoop IRC) for help on some of the details.

With this done, it's back to doing what I can to help with Cassandra dev and looking forward to the Hadoop Summit at the end of the month.