Wednesday, June 2, 2010

Presentation on Cassandra+Hadoop

Last night I gave a presentation at the Austin Hadoop User Group about Cassandra + Hadoop. It was a great group of people in this relatively new user group; probably 20-30 people were there.

My slides are available in Keynote form on SlideShare, linked here:

Steve Watt from IBM's BigSheets and Emerging Technologies team pointed out that Cassandra has an edge over the native Hadoop technologies in that you can query output stored in Cassandra immediately. With plain Hadoop, especially when the output lands in HDFS, you have to export the MapReduce output to another system before you can do any kind of reporting. That can be a significant extra step.

Stu Hood also pointed out that even though it is possible to run MapReduce over data in Cassandra, HDFS and HBase are built to stream large chunks of data, so Cassandra will be slower from that perspective at this point. That can be optimized over time, though. I think no matter what, you're choosing your data store for a variety of reasons - if your data fits better in Cassandra, you now have the option of running MapReduce directly over it. I think that's a significant advance.
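
For anyone curious what that looks like in practice, here's a rough sketch of the job setup, modeled loosely on the word_count example in Cassandra 0.6's contrib directory. The keyspace, column family, and column names are placeholders, and the ConfigHelper method names and mapper signature have been moving around between releases, so treat it as illustrative rather than copy-paste ready.

// Sketch: a Hadoop job whose input comes straight from a Cassandra column family
// via ColumnFamilyInputFormat (0.6-era API). Keyspace/column family names below
// are placeholders.
import java.io.IOException;
import java.util.Arrays;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CassandraColumnCount {

    // Counts how many columns each Cassandra row has.
    public static class ColumnCountMapper
            extends Mapper<String, SortedMap<byte[], IColumn>, Text, IntWritable> {
        @Override
        public void map(String key, SortedMap<byte[], IColumn> columns, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(key), new IntWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cassandra-column-count");
        job.setJarByClass(CassandraColumnCount.class);
        job.setMapperClass(ColumnCountMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Read rows from Cassandra instead of HDFS files.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
        ConfigHelper.setSlicePredicate(job.getConfiguration(),
                new SlicePredicate().setColumn_names(Arrays.asList("text".getBytes())));

        // In 0.6 the reduce output still lands on HDFS (or another FileSystem).
        FileOutputFormat.setOutputPath(job, new Path("/tmp/column_count_output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}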

Lots of thanks to Stu Hood as well as Jeff Hammerbacher (on #hadoop IRC) for help on some of the details.

With that done, it's back to doing what I can to help with Cassandra dev and looking forward to the Hadoop Summit at the end of the month.

2 comments:

jeremy said...

One other note - I found out recently that Hadoop needs HDFS or some distributed filesystem in order to propagate the distributed cache. So you may need to run HDFS on your Cassandra nodes in order for your MapReduce code's jar file to get copied to all of the nodes.

Previously I had thought that you just needed task trackers on each of your Cassandra nodes.
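
For what it's worth, here's a rough sketch of the shared-filesystem dependency: anything you push through the distributed cache (extra jars and the like) gets staged on HDFS and pulled down by the tasktrackers, so the Cassandra nodes running tasks need to reach a working HDFS. The paths and jar name below are placeholders.

// Sketch of shipping a dependency jar via the distributed cache (0.20-era API).
// Assumes fs.default.name points at the cluster's HDFS; paths are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShipJarSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Stage the jar on the shared filesystem (HDFS) first...
        FileSystem fs = FileSystem.get(conf);
        Path onHdfs = new Path("/libs/some-dependency.jar");
        fs.copyFromLocalFile(new Path("lib/some-dependency.jar"), onHdfs);

        // ...then register it so each tasktracker pulls it onto the task classpath.
        DistributedCache.addFileToClassPath(onHdfs, conf);
    }
}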

jeremy said...

"you may need to run HDFS on your Cassandra nodes"

that is, datanode daemons