Saturday, March 24, 2012

Cassandra vs MongoDB

Big data has become a common topic of discussion among enterprises and developers, and everyone is trying to solve this puzzle. Choosing the right product for an application has become a real challenge for developers and architects. There are a few outstanding solutions available in the industry: Cassandra, Hadoop, MySQL, MongoDB and Riak, to name a few. All of them have some great features and are used by enterprises like Twitter, Facebook and Netflix. I’m not trying to promote one over another, but I thought I would share what I have learned so far; I have played with some of these solutions and read about others. Here I’m trying to list some features and benefits of Cassandra, MongoDB and Riak. Each of them has some advantages and some disadvantages. Here you go.


Cassandra is an open source distributed database management system written in Java. It’s designed to be a highly scalable second-generation distributed database. In Cassandra, the basic unit of data is known as a “column”, which is really just a single key and value. It also has Bigtable-like features: columns and column families. You can query by column or by a range of keys. Each column also carries a timestamp field, which is used internally for replication and consistency. The value can be a single value but can also contain another “column”. These columns exist within column families, which order data based on a specific value in the columns, referenced by a key.
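
To make the column idea a bit more concrete, here is a minimal sketch in plain Python. It is not Cassandra's API or storage format; the names `Column`, `ColumnFamily`, `insert` and `slice` are my own, and I am assuming columns within a row are ordered by column name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Column:
    """One column: a name (the key), a value and a timestamp."""
    name: str
    value: str
    timestamp: int


class ColumnFamily:
    """Toy column family: row key -> columns, read back ordered by name."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Column]] = {}

    def insert(self, row_key: str, column: Column) -> None:
        row = self._rows.setdefault(row_key, {})
        current = row.get(column.name)
        # Assumption: the column with the newer timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def get(self, row_key: str, name: str) -> Optional[Column]:
        return self._rows.get(row_key, {}).get(name)

    def slice(self, row_key: str, start: str, end: str) -> List[Column]:
        """Return the columns whose names fall in [start, end], in order."""
        row = self._rows.get(row_key, {})
        return [row[n] for n in sorted(row) if start <= n <= end]


users = ColumnFamily()
users.insert("alice", Column("email", "alice@example.com", timestamp=1))
users.insert("alice", Column("city", "Austin", timestamp=1))
users.insert("alice", Column("email", "alice@new.example.com", timestamp=2))
```

After these inserts, `users.get("alice", "email")` holds the value written with timestamp 2, and `users.slice("alice", "a", "f")` returns the `city` and `email` columns in that order.
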
In Cassandra, nodes represent ranges of data. By default, when a new machine is added, it receives half of the largest range of data. You can change this behavior by choosing different configuration options during node start-up. There are certain configuration requirements to ensure safe and easy balancing, and there is a rebalance command that can perform the work across all the data ranges. Cassandra comes with a monitoring tool that lets you track the progress of the rebalancing. Cassandra is much lighter on memory requirements, especially if you don’t need to keep a lot of data in cache. On the other hand, Cassandra requires a lot more metadata for indexes, and it requires secondary indexes if you want to do range queries.
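
The "half of the largest range" behavior can be sketched roughly like this. This is a simplification I made up for illustration: ranges are plain integer intervals that do not wrap around a ring, and `add_node` is a hypothetical helper, not a Cassandra command.

```python
from typing import Dict, Tuple

# Assumption: each node owns one half-open range (start, end] of token values.
Range = Tuple[int, int]


def add_node(ranges: Dict[str, Range], new_node: str) -> Dict[str, Range]:
    """Give the new node the lower half of the largest existing range."""
    owner = max(ranges, key=lambda node: ranges[node][1] - ranges[node][0])
    start, end = ranges[owner]
    mid = start + (end - start) // 2

    updated = dict(ranges)
    updated[new_node] = (start, mid)  # the new node takes the lower half
    updated[owner] = (mid, end)       # the old owner keeps the upper half
    return updated


cluster = {"node1": (0, 100), "node2": (100, 160)}
cluster = add_node(cluster, "node3")
```

Here `node1` owned the largest range, so it is split at 50: `node3` ends up with `(0, 50)`, `node1` keeps `(50, 100)` and `node2` is untouched.
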
One more advantage of Cassandra is its much more advanced support for replication. The server can be set to use a specific consistency level to ensure that queries are replicated locally or to remote data locations. This means you can let Cassandra handle redundancy across nodes, since it is aware of which rack and data center each node is in. Cassandra can also monitor nodes and route queries away from slow-responding nodes. You can choose between synchronous and asynchronous replication for each update, and its asynchronous operations are highly available.
Having discussed the advantages of Cassandra, it is only fair to point out a disadvantage. Cassandra replication settings are set at the node level in configuration files, whereas MongoDB allows very granular ad-hoc control down to the query level through driver options that can be set in code at run time.
Best used: When you have more writes than reads, for example when logging events. Financial institutions are a great example, as they care about every activity and log a lot of data.


Unlike Cassandra, which is written in Java, MongoDB is written in C++ and is provided in binary form for Linux, OS X, Windows and several other platforms. It’s extremely easy to “install” – download, extract and run.
MongoDB provides master/slave replication and automatic failover with replica sets. It has sharding built in. It uses memory-mapped files for data storage.

In MongoDB, replication is achieved through replica sets. This is an enhanced master/slave model in which one node in the set is the master. Data is replicated to all nodes so that if the master fails, another member takes over. There are configuration options to determine which nodes have priority, and you can set options like sync delay to have nodes lag behind.
This might be very important for you if you plan to use the data for auditing or, in other words, if you can’t afford to lose data. You might want to think again before choosing Mongo, because writes in MongoDB are “unsafe” by default: data isn’t written to disk right away, so a write operation can return success and still be lost if the server fails before the data is flushed to disk. This is how Mongo attains high performance. If you need increased durability, you can specify a safe write, which guarantees the data is written to disk before returning. Further, you can require that the data also be successfully written to n replication slaves.
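
Here is a toy model of why an acknowledged write can still be lost. It is only a sketch of the idea described above, not MongoDB's implementation or driver API: the `ToyServer` class, its `safe` flag and the `crash` method are all invented for illustration, and replication to n slaves is left out.

```python
from typing import Dict, List


class ToyServer:
    """Toy server that buffers writes in memory before flushing to disk."""

    def __init__(self) -> None:
        self.memory: List[Dict] = []  # acknowledged but not yet on disk
        self.disk: List[Dict] = []    # survives a crash

    def write(self, doc: Dict, safe: bool = False) -> bool:
        self.memory.append(doc)
        if safe:
            # A "safe" write waits for the flush before acknowledging.
            self.flush()
        return True  # success is reported in both cases

    def flush(self) -> None:
        self.disk.extend(self.memory)
        self.memory.clear()

    def crash(self) -> None:
        # Anything still buffered in memory is gone after a crash.
        self.memory.clear()


server = ToyServer()
fast_ok = server.write({"event": "login"})              # fast, but only in memory
server.crash()
safe_ok = server.write({"event": "payment"}, safe=True)  # flushed before returning
server.crash()
```

Both writes report success, but only the `payment` document is left in `server.disk`. That is the trade-off: the default write is faster because it does not wait for the disk.
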
MongoDB drivers also support reading from slaves. This can be done at the connection, database, collection or even query level, and the drivers handle sending the right queries to the right slaves, but there is no guarantee of consistency (unless you are using the option to write to all slaves before returning). In contrast, a Cassandra query can consult multiple replicas (how many depends on the consistency level), and the most up-to-date column is returned (based on the timestamp value).
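
The "most up-to-date column wins" rule on the Cassandra side can be sketched as follows. This is my own simplified picture, assuming each replica answers with a (value, timestamp) pair or nothing at all; `reconcile` is a hypothetical helper, not a Cassandra function.

```python
from typing import Iterable, Optional, Tuple

# Assumption: a replica reply is (value, timestamp), or None if it has no data.
Versioned = Tuple[str, int]


def reconcile(replies: Iterable[Optional[Versioned]]) -> Optional[Versioned]:
    """Pick the reply with the highest timestamp among the replicas asked."""
    present = [reply for reply in replies if reply is not None]
    if not present:
        return None
    return max(present, key=lambda reply: reply[1])


# Three replicas answer; one is stale and one has never seen the column.
latest = reconcile([("old@example.com", 1), ("new@example.com", 5), None])
```

Here `latest` is the pair with timestamp 5, so the stale replica's answer is ignored.
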
MongoDB is a mix-and-match solution that draws from both worlds, NoSQL and RDBMS. Indexing in MongoDB works very similarly to relational databases. You create single or compound indexes at the collection level, and every document inserted into that collection has those fields indexed. Querying by index is extremely fast as long as all your indexes fit in memory.
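
To show what a collection-level compound index buys you, here is a rough sketch in plain Python. It is not how MongoDB stores indexes; a dictionary stands in for the real index structure, and the `Collection` class with its `insert` and `find` methods is something I made up for this example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


class Collection:
    """Toy collection with one compound index over the given fields."""

    def __init__(self, index_fields: Sequence[str]) -> None:
        self.index_fields = tuple(index_fields)
        self.docs: List[Dict[str, Any]] = []
        self.index = defaultdict(list)  # index key -> positions in self.docs

    def insert(self, doc: Dict[str, Any]) -> None:
        # Every inserted document gets its indexed fields added to the index.
        self.docs.append(doc)
        key = tuple(doc.get(field) for field in self.index_fields)
        self.index[key].append(len(self.docs) - 1)

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        if set(criteria) == set(self.index_fields):
            # Indexed path: a single lookup instead of scanning everything.
            key = tuple(criteria[field] for field in self.index_fields)
            return [self.docs[i] for i in self.index.get(key, [])]
        # Fallback: full scan over all documents.
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in criteria.items())]


people = Collection(index_fields=["last_name", "city"])
people.insert({"last_name": "Smith", "city": "Austin", "age": 30})
people.insert({"last_name": "Smith", "city": "Boston", "age": 41})
indexed_hits = people.find(last_name="Smith", city="Austin")
scanned_hits = people.find(age=41)
```

The first query matches the indexed fields and goes straight to the matching document; the second has to walk every document, which is the case you want to avoid on a big collection.
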
Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. 

I'm going to leave this here and will try to cover Riak in my next post.