Last night I gave a presentation at the Austin Hadoop User Group about Cassandra + Hadoop. It was a great group of people in this relatively new user group, with probably around 20-30 people there.
My slides are available in Keynote form on SlideShare, linked here:
Steve Watt from IBM's BigSheets and Emerging Technologies team pointed out that Cassandra has an edge over the native Hadoop technologies in that you can query output in Cassandra immediately. Using just Hadoop, especially HDFS for the output, you have to export the MapReduce output to another system if you want to do any kind of reporting. That can be a significant extra step.
Stu Hood also pointed out that even though it is possible to run MapReduce over data in Cassandra, HDFS and HBase are built to stream large chunks of data, so Cassandra will be slower from that perspective at this point. Work can be done to optimize that, though. I think no matter what, you're choosing your data store for a variety of reasons - if your data fits better in Cassandra, you now have the option of running MapReduce directly over it. I think that's a significant advance.
Lots of thanks to Stu Hood as well as Jeff Hammerbacher (on #hadoop IRC) for help on some of the details.
With this done, it's back to doing what I can to help with Cassandra dev, and I'm looking forward to the Hadoop Summit at the end of the month.