jeromatron - dev lux

Wednesday, January 26, 2011

Simple python script for generating Cassandra initial tokens

When using a RandomPartitioner, it is recommended that you specify the initial tokens. On the Cassandra Operations wiki page, it says:

Using a strong hash function means RandomPartitioner keys will, on average, be evenly spread across the Token space, but you can still have imbalances if your Tokens do not divide up the range evenly, so you should specify InitialToken to your first nodes as i * (2**127 / N) for i = 0 .. N-1. In Cassandra 0.7, you should specify initial_token in cassandra.yaml.

Here is a simple python script for generating them:

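#! /usr/bin/python
import sys

# The node count comes from the first command-line argument, or from a prompt.
if len(sys.argv) > 1:
    num = int(sys.argv[1])
else:
    num = int(raw_input("How many nodes are in your cluster? "))

# The token for node i is i * 2**127 / num; // keeps the division integral.
for i in range(num):
    print 'node %d: %d' % (i, i * (2**127) // num)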

The script takes the number of nodes as a command-line argument, or prompts for it if none is given. For three nodes, it gives the following output:

node 0: 0
node 1: 56713727820156410577229101238628035242
node 2: 113427455640312821154458202477256070485
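
The script above is written for Python 2. As a rough sketch of the same calculation for Python 3 (where raw_input is gone and / returns a float that cannot hold 127-bit values exactly), something like the following should work. The function name initial_tokens is my own choice for illustration, not anything from Cassandra:

```python
def initial_tokens(num_nodes):
    """Return evenly spaced tokens: i * 2**127 // N for i = 0 .. N-1."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Integer (floor) division keeps every token an exact int.
    return [i * (2 ** 127) // num_nodes for i in range(num_nodes)]


# Print the tokens for a three-node cluster in the same format as above.
for node, token in enumerate(initial_tokens(3)):
    print('node %d: %d' % (node, token))
```

For three nodes this produces the same values as the output listed above.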

This post was adapted from this; I just updated the script and corrected the formula.

Wednesday, June 16, 2010

Installing Sun Java on Lucid Lynx

First I had to install python-software-properties to be able to run add-apt-repository:

sudo apt-get install python-software-properties

Then I added the repo that has Sun's JDK:

sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo aptitude update
sudo aptitude install sun-java6-jdk

Then, if there are multiple jdk alternatives on the system, choose which one you want with:

sudo update-alternatives --config java

Thursday, June 10, 2010

Large-scale Storage/Computation at Google

Stu Hood pointed me to an interesting keynote today given by Jeffrey Dean, a Fellow at Google. It's part of the ACM Symposium on Cloud Computing. In it, he talks about the current large-scale storage and computation infrastructure used at Google. It starts off kind of slow but picks up (for me) after he talks a bit about MapReduce.

The presentation with slides is available here (Silverlight required).

Some bits I found interesting:

He talked about several patterns for distributed systems that they have found useful.
Google currently MapReduces through about an exabyte of data per month.
Interesting example of how they use MapReduce - to return the relevant map tiles in Google Maps for a given query.
He pointed out that they have Service Clusters of BigTable so that each group doesn't have to maintain their own - this relates to what Stu and I are doing at Rackspace - creating multi-tenant Hadoop and Cassandra clusters for similar reasons.
They use ranged distribution of keys for BigTable, saying that consistent hashing is good in some ways, but they wanted to be able to have locality of key sequences.
He talked about something I've been looking at recently - how to do custom multi-datacenter replication by table (or for Cassandra by keyspace).
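
To make the locality point about ranged key distribution a bit more concrete, here is a toy sketch in Python 3. It is not how BigTable or Cassandra actually place data: the split points, node count, key names, and the functions range_node and hash_node are all invented for illustration, and hash_node is a plain hash-modulo stand-in rather than a real consistent-hashing ring.

```python
import bisect
import hashlib

NUM_NODES = 4

# Range placement (assumed split points): node 0 holds keys below 'g',
# node 1 keys from 'g' up to 'n', node 2 keys from 'n' up to 't',
# and node 3 holds everything from 't' onward.
SPLIT_POINTS = ['g', 'n', 't']


def range_node(key):
    """Pick a node by where the key sorts relative to the split points."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def hash_node(key):
    """Pick a node from an MD5 digest of the key (hash-modulo stand-in)."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % NUM_NODES


keys = ['apple', 'apricot', 'avocado', 'melon', 'zebra']
for key in keys:
    print('%-8s range node: %d  hash node: %d'
          % (key, range_node(key), hash_node(key)))
```

With range placement, the three keys starting with 'a' all land on node 0, so a scan over that key sequence touches a single node. With hashed placement, neighboring keys can end up on any node, which spreads load but gives up that locality.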

Wednesday, June 2, 2010

Presentation on Cassandra+Hadoop

Last night I gave a presentation at the Austin Hadoop User Group about Cassandra + Hadoop. It was a great group of people in this relatively new user group here, with probably around 20-30 people there.

My slides are available in Keynote form on SlideShare, linked here:

Cassandra+Hadoop

View more presentations from Jeremy Hanna.

Steve Watt from IBM's BigSheets and Emerging Technologies team pointed out that Cassandra has an edge over the native Hadoop technologies in that you can query output in Cassandra immediately. Using just Hadoop, especially HDFS for the output, you have to export the MapReduce output to another system if you want to do any kind of reporting. That can be a significant extra step.

Stu Hood also pointed out that even though it is possible to run MapReduce over data in Cassandra, HDFS and HBase are built to stream large chunks of data. So Cassandra will be slower from that perspective at this point. Work can be done to optimize that though. I think no matter what, you're choosing your data store for a variety of reasons - if your data fits better in Cassandra, now you have an option of running MapReduce directly over it. I think that's a significant advance.

Lots of thanks to Stu Hood as well as Jeff Hammerbacher (on #hadoop IRC) for help on some of the details.

With this done, it's back to doing what I can to help with Cassandra dev and looking forward to the Hadoop Summit at the end of the month.

Sunday, May 9, 2010

Facebook and Privacy

I recently decided to deactivate my Facebook account. I did that based on several reports of Facebook disregarding the privacy of users in order to further monetize their platform. To be sure, they have a fantastic platform and I've liked being able to connect with people I haven't seen in a long time. However, hosting user data comes with the responsibility of keeping the trust of the user. To me, for now they've broken that trust. Anyway, I just thought I would post what I sent to Facebook when I deactivated my account:

According to several reports lately from the EFF, Wired, and several online publications, Facebook has consistently changed privacy terms out from underneath users. Part of the reason that I felt safe on Facebook and not in other networks was that I trusted Facebook to some extent. Based on these new changes and the seeming disregard for its users, I would rather not support Facebook any longer. Thank you for the remarkable service, but right now I don't feel that Facebook is trustworthy. They seem like they will do anything in order to further monetize the network/platform, including compromise the trust of its users. It's an unfortunately short-sighted gamble and I hope you will reconsider.

See:
http://www.wired.com/epicenter/2010/05/facebook-rogue/
http://www.eff.org/deeplinks/2010/05/things-you-need-know-about-facebook
http://www.eff.org/deeplinks/2010/04/facebook-timeline
http://www.pcworld.com/article/195888/facebooks_antiprivacy_backlash_gains_ground.html

Friday, March 26, 2010

NOSQL

As I've started getting up to speed at my new job at Rackspace down here in Texas, I've come into a new world called NoSQL. NoSQL is a term that Eric Evans re-coined relatively recently, and he's since clarified it to mean Not only SQL. The term loosely describes a set of distributed databases that share some properties.

Some of the suspects include Google's BigTable, Hadoop's HBase, Amazon's Dynamo, Apache's Cassandra, CouchDB, MongoDB, Voldemort, and others.

It seems to be based on the notion that if you have really, really, really large data sets, you run into the limits that a relational database imposes through ACID properties and transactions, as well as the unattainable triforce of Consistency, Availability, and Partition-tolerance (from the CAP Theorem). Jonathan Ellis blogged about deciding whether you should consider a NoSQL solution here.

So I've started drinking from a firehose of sources to try to understand more about them. We've been looking heavily into pieces of the Hadoop project for its distributed filesystem and Map/Reduce implementation (not exactly NoSQL but siblings to HBase), as well as the Cassandra project because of how it brings together useful features of BigTable and Dynamo and allows for completely horizontal scaling - no single point of failure.

More about the subject:
http://www.royans.net/arch - a blog about scalable web architectures, often talking about big data and NoSQL
http://nosql.mypopescu.com - a blog called myNoSQL that deals with all things NoSQL

Wednesday, January 6, 2010

list comprehensions in python

One of my favorite features of Python is a functional language feature that Python itself borrowed - list comprehensions.

I just think it is wonderfully elegant if a language can do something like this:

Example 1:


lines = ['now is the time\r\n']
lines.append('   for all good men ')
lines.append(' to come to the aid of their country\n')

# Does not modify list, returns a new list
lines = [line.strip() for line in lines]

print lines

>>>['now is the time', 'for all good men', 'to come to the aid of their country']

Example 2:


lines = ['<act> is the next act']
lines.append('performing at our show')
lines.append('please give <act> a big round of applause')

print [line.replace('<act>', 'Go Dog Go') for line in lines]
print [line.replace('<act>', 'The Beatles') for line in lines]

>>>['Go Dog Go is the next act', 'performing at our show', 'please give Go Dog Go a big round of applause']

>>>['The Beatles is the next act', 'performing at our show', 'please give The Beatles a big round of applause']

You can apply simple operations like strip and replace, or even an inline lambda, to every element of a list and get back a new list... all in one line.
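
As a rough sketch of the lambda case (written for Python 3, so print is a function here; the names words, shout, shouted and lengths are just made up for illustration):

```python
words = ['go', 'dog', 'go']

# A lambda bound to a name, then applied to every element.
shout = lambda s: s.upper() + '!'
shouted = [shout(w) for w in words]

# A lambda defined and called inline, inside the comprehension itself.
lengths = [(lambda s: len(s))(w) for w in words]

print(shouted)
print(lengths)
print(words)  # the original list is left untouched
```

Each comprehension builds a new list, so words still holds its original values afterwards.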

I love Python and its functional cousin languages.