Things we don't like about CouchDB

We love CouchDB. It is a workhorse, and we use it on a lot of projects. Obviously it does not fit everywhere, and there are some places where it just does not cut it. Here I try to document a few issues we've been having with CouchDB.

Building Views

Views are built at request time. We think that should be an option: there should be a way to amortise the cost of building views either when adding documents or at usage time. We get a load of documents into the system every day, and for a while we watched the system come to a grinding halt while the first request indexed the views. At other times the view request simply timed out.

The problem here is that when a view is being built it just blocks any operation on that design document, which means other requests are blocked as well (unless they go to a different design doc).

We work around this problem by pre-warming the views on all the instances after the documents are loaded into CouchDB.
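Pre-warming can be as simple as querying one view per design document after each load, since all the views in a design document are indexed together. What follows is only a minimal sketch: the server URL, database name, design document names and view names are all hypothetical, and it assumes a runtime with a global fetch (for example Node 18 or later).

```javascript
// Build one warm-up URL per design document. Querying a single view is
// enough, because every view in a design document shares one index.
// limit=0 asks for no rows, but the request still waits for the index.
function warmupUrls(baseUrl, db, designDocs) {
  return Object.keys(designDocs).map(function (ddoc) {
    const firstView = designDocs[ddoc][0];
    return baseUrl + "/" + encodeURIComponent(db) +
      "/_design/" + encodeURIComponent(ddoc) +
      "/_view/" + encodeURIComponent(firstView) + "?limit=0";
  });
}

// Hit every URL in turn so that the indexing cost is paid here
// rather than by the first real user request.
async function warmViews(baseUrl, db, designDocs) {
  for (const url of warmupUrls(baseUrl, db, designDocs)) {
    const res = await fetch(url);
    if (!res.ok) {
      throw new Error("warm-up failed for " + url + ": " + res.status);
    }
  }
}

// Hypothetical design documents and the views they contain.
const designDocs = { schedule: ["by_day", "by_network"], billing: ["by_month"] };
const urls = warmupUrls("http://localhost:5984", "programs", designDocs);
```

Calling warmViews("http://localhost:5984", "programs", designDocs) right after a bulk load would then run the warm-up against a live server.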

Query API

The view query API is quite powerful, but solving every problem with map-reduce is not trivial. Recently we had to implement a feature to find whether a date range overlaps with another one. Here is the gist that does something like that.

As you can see, we had to emit all the days for the network and query the view with a startkey and an endkey. This is far more complicated than a simple SQL query that achieves the same result. I am lucky that my granularity is days; if you want to find overlaps on an hourly basis, it is much harder. The views will get much slower and (likely) take more space.
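As a rough sketch of the idea (not the gist itself), assume documents carry hypothetical network, start_date and end_date fields, with dates written as YYYY-MM-DD. The map function emits one row per day, and an overlap check becomes a key range query.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Turn "YYYY-MM-DD" into a UTC timestamp (JavaScript months are 0-indexed).
function dayToUtc(text) {
  const parts = text.split("-").map(Number);
  return Date.UTC(parts[0], parts[1] - 1, parts[2]);
}

// View key for one day: [network, year, month, day].
function keyFor(network, timestamp) {
  const d = new Date(timestamp);
  return [network, d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()];
}

// Map function in CouchDB style. Inside CouchDB, emit is a global; here
// it is passed in so that the sketch can run on its own.
function mapSchedule(doc, emit) {
  const last = dayToUtc(doc.end_date);
  for (let t = dayToUtc(doc.start_date); t <= last; t += MS_PER_DAY) {
    emit(keyFor(doc.network, t), doc._id);
  }
}

// Key range for "is anything scheduled on this network between two days?".
function overlapQuery(network, fromDate, toDate) {
  return {
    startkey: keyFor(network, dayToUtc(fromDate)),
    endkey: keyFor(network, dayToUtc(toDate)),
  };
}

// Run the map function over one hypothetical document.
const rows = [];
mapSchedule(
  { _id: "slot-1", network: "cnn", start_date: "2012-01-30", end_date: "2012-02-02" },
  function (key, value) { rows.push({ key: key, value: value }); }
);
const query = overlapQuery("cnn", "2012-02-01", "2012-02-10");
```

A document spanning several matching days comes back once per day, so the caller still has to de-duplicate by document id.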

The view collation is quite useful and powerful, but it is very restrictive. Array or string keys are used far more often than hash (object) keys, which are pretty much useless for retrieving a result set through view collation. Even array keys have limitations. For example, if your view emits

['abcnews','christiane'], ['cnn', 'christiane'], ['cnn', 'fionnuala']

then startkey: ['cnn'],  endkey : ['cnn', {}] would match ['cnn', 'christiane'], ['cnn', 'fionnuala'] 

But you will have to write a new view if you want to search by the newscaster. i.e. you can't do

startkey: [{}, 'christiane'],  endkey : [{}, 'fionnuala'] or startkey: [{}, 'christiane'],  endkey : [{}, 'christiane']
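To see why, here is a simplified imitation of how CouchDB orders keys, applied to the rows above. It is a sketch under assumptions: real CouchDB compares strings with ICU collation rather than plain code-point order, and this version treats all objects as equal. The point it illustrates is that {} is not a wildcard; it simply sorts after every string, so the usual fix is a second view with the key order swapped.

```javascript
// Rank of each JSON type in CouchDB's view collation order:
// null < false < true < numbers < strings < arrays < objects.
function typeRank(v) {
  if (v === null) return 0;
  if (v === false) return 1;
  if (v === true) return 2;
  if (typeof v === "number") return 3;
  if (typeof v === "string") return 4;
  if (Array.isArray(v)) return 5;
  return 6;
}

// Compare two keys: negative if a sorts first, positive if b sorts first.
function collate(a, b) {
  const ra = typeRank(a);
  const rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 3) return a - b;
  if (ra === 4) return a < b ? -1 : a > b ? 1 : 0; // simplification, see above
  if (ra === 5) {
    // Arrays compare element by element; a shorter prefix sorts first.
    for (let i = 0; i < Math.min(a.length, b.length); i++) {
      const c = collate(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length;
  }
  return 0; // objects are treated as equal in this sketch
}

// Keep the keys that fall between startkey and endkey, inclusive.
function range(keys, startkey, endkey) {
  return keys.filter(function (k) {
    return collate(k, startkey) >= 0 && collate(k, endkey) <= 0;
  });
}

const emitted = [["abcnews", "christiane"], ["cnn", "christiane"], ["cnn", "fionnuala"]];

// Works: every key whose first element is "cnn".
const cnnRows = range(emitted, ["cnn"], ["cnn", {}]);

// Does not work: every emitted key sorts before [{}, ...].
const byCaster = range(emitted, [{}, "christiane"], [{}, "fionnuala"]);

// A second view that emits [newscaster, network] makes the search possible.
const swapped = emitted.map(function (k) { return [k[1], k[0]]; });
const christianeRows = range(swapped, ["christiane"], ["christiane", {}]);
```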

Date Support

Date support is pretty primitive. I cringe every time I have to tell my team that dates in CouchDB have to be in a specific format. It may be more of an issue with JavaScript date handling, but the SpiderMonkey engine that CouchDB uses supports only one date format. I do not think Couch should go towards the MongoDB Date data type, but having Date.parse support multiple formats, or even just ISO 8601, would be really useful.

As you can see in the gist above, we manually parse the date and build the date object. This is error prone: there are at least two JavaScript subtleties there. (Hint: parseInt(date, 10), and JavaScript months being 0-indexed.)
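Both subtleties fit in a few lines. This is a minimal sketch of manual parsing, assuming dates arrive as strict YYYY-MM-DD strings; the function name is hypothetical and it is not the code from the gist.

```javascript
// Parse a strict "YYYY-MM-DD" string into a UTC Date, or return null.
function parseIsoDay(text) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text);
  if (!match) return null;

  // Subtlety 1: always pass the radix. Older engines read a leading
  // zero as octal, so parseInt("08") could come back as 0.
  const year = parseInt(match[1], 10);
  const month = parseInt(match[2], 10);
  const day = parseInt(match[3], 10);

  // Subtlety 2: JavaScript months run from 0 (January) to 11 (December).
  const date = new Date(Date.UTC(year, month - 1, day));

  // Reject impossible dates such as 2012-02-30, which Date would
  // otherwise silently roll over into March.
  if (date.getUTCMonth() !== month - 1 || date.getUTCDate() !== day) {
    return null;
  }
  return date;
}
```

Returning null for anything unexpected keeps a badly formatted date from turning into a wrong view key.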

Others

There are some other issues that do not climb high on my radar.

  1. Paging. It is slightly complicated in Couch and has caused a lot of confusion, but once you get the idea, it's pretty simple to implement. We struggled with it initially, but the pattern fell into place quickly.
  2. Performance. The current release of Couch (1.1.1) seems to perform much better than the older versions. But then we would never choose CouchDB for its performance alone. There are other alternatives out there.
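For paging, the pattern that usually falls into place is to ask for one row more than the page size and use that extra row as the starting point of the next page. A minimal sketch, assuming view rows shaped like CouchDB's { id, key, value } and hypothetical helper names:

```javascript
// Query string for one page. `next` is null for the first page,
// otherwise the { key, id } pair returned by splitPage for the previous page.
function pageQuery(pageSize, next) {
  const params = new URLSearchParams({ limit: String(pageSize + 1) });
  if (next) {
    // Keys are sent as JSON. startkey_docid tells apart rows that
    // share the same key.
    params.set("startkey", JSON.stringify(next.key));
    params.set("startkey_docid", next.id);
  }
  return params.toString();
}

// Split the rows that came back into the page to show and the pointer
// to the next page (null when this was the last page).
function splitPage(rows, pageSize) {
  const extra = rows.length > pageSize ? rows[pageSize] : null;
  return {
    rows: rows.slice(0, pageSize),
    next: extra ? { key: extra.key, id: extra.id } : null,
  };
}

// Hypothetical response rows for a page size of 2.
const fetched = [
  { id: "doc1", key: ["abcnews"], value: 1 },
  { id: "doc2", key: ["cnn"], value: 1 },
  { id: "doc3", key: ["cnn"], value: 1 },
];
const page = splitPage(fetched, 2);
const nextQuery = pageQuery(2, page.next);
```

This avoids skip, which gets slower the deeper you page.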

Filed under  //   couchdb   databases   nosql  

CouchDB: The Honda Accord of databases

We have been using CouchDB at Activesphere for some of our customer projects.
We looked at a few options before we eventually plunged into CouchDB's 'Relax'. In hindsight, I'm trying to list why CouchDB has worked out to be a good choice for us.

Powerful Replication

CouchDB has amazing replication support. Period. It supports all kinds of scenarios you can think of, and best of all, it is trivial to set up continuous replication between two CouchDB instances.
 
On my current project we needed elastic scaling. We let Amazon ELB distribute the load for us and, depending on the load, add or remove machines from our "cluster". So it is extremely important that all nodes have exactly the same software stack and data. We run CouchDB on every machine of the cluster, and each machine replicates with the master, so all the data spreads over the cluster in a short time. This means no big, beefy, dedicated central database.
An accidental (and big) advantage of this architecture is that the database and the application server are always on the same machine. It takes away the network latency of accessing a central database.
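Setting up continuous replication comes down to one POST to the _replicate endpoint of the local CouchDB. A minimal sketch, in which the host names and database name are hypothetical, replication in both directions is an assumption about our setup rather than a requirement, and a global fetch is assumed to be available:

```javascript
// Body for POST /_replicate. "continuous: true" keeps the replication
// running, so later changes keep flowing from source to target.
function replicationBody(source, target) {
  return { source: source, target: target, continuous: true };
}

// Ask the CouchDB at localUrl to start the replication.
async function startReplication(localUrl, body) {
  const res = await fetch(localUrl + "/_replicate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    throw new Error("replication request failed: " + res.status);
  }
  return res.json();
}

// Hypothetical names: pull from the master, and push local changes back.
const masterDb = "http://master.example.com:5984/programs";
const pull = replicationBody(masterDb, "programs");
const push = replicationBody("programs", masterDb);
```

A freshly launched node would call startReplication("http://localhost:5984", pull), and the same again with push, as part of its boot script.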
 
Fast View Queries
 
We have a large amount of data that is mostly read-only and is primarily used for searching. Couch views give us an incredibly simple and performant solution. Even with a few million documents in CouchDB, we get search results in a few hundredths of a second.
One thing that is fundamentally different in CouchDB, and a source of perpetual confusion, is the lack of SQL finders. There is no ability to run the dynamic queries on your database that we are so used to.
What we get instead is a concept called MapReduce. CouchDB MapReduce is only conceptually similar to Google/Hadoop MapReduce, but it is an easy switch for people coming from the big data world, where MapReduce is de rigueur.
By giving up the ability to run arbitrary SQL, CouchDB can guarantee that every query uses an index, which is the reason for its query performance. Views are stored on disk in a B-tree based index. A view is built in full the first time it is queried, and after that it is always updated incrementally. This explains the consistent query times even with a few million documents, and the slightly slower write times. In our system the read-write ratio is about 90% to 10%, so this works brilliantly for us.

HTTP Access to the Views
 

CouchDB provides a simple HTTP API to access the views. We use this extensively for our internal applications. We just document the ways of using these views and let users hit the database from the Ruby console or via their apps. For internal apps this has become the standard, and it avoids the layers of code that build up to support exposing a SQL reporting structure. I still dread the thought of writing multiple pages of SQL for reporting and turning the results into objects that an API could use. All our billing information and metrics are exposed via views. We use the myriad ways of combining startkey, endkey and group_level to allow the different usages of the information. That is a topic I plan to address in a later blog post.
And no, it does not increase the load on our database server, because we run analytics off a replicated analytics database.
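To give a flavour of how startkey, endkey and group_level combine, here is a small sketch. The billing design document, the view name and the [customer, year, month] key layout are hypothetical, and groupAndSum is only a local imitation of what a summing reduce returns at a given group_level, not CouchDB code.

```javascript
// Build a view URL. CouchDB expects key parameters as JSON, so key,
// startkey and endkey are JSON-encoded before being URL-encoded.
function viewUrl(baseUrl, db, ddoc, view, options) {
  const params = new URLSearchParams();
  Object.keys(options).forEach(function (name) {
    const isKey = name === "key" || name === "startkey" || name === "endkey";
    params.set(name, isKey ? JSON.stringify(options[name]) : String(options[name]));
  });
  return baseUrl + "/" + db + "/_design/" + ddoc + "/_view/" + view +
    "?" + params.toString();
}

// Imitate a summing reduce: rows are grouped by the first `level`
// elements of their array key and the values in each group are added up.
function groupAndSum(rows, level) {
  const totals = new Map();
  rows.forEach(function (row) {
    const groupKey = JSON.stringify(row.key.slice(0, level));
    totals.set(groupKey, (totals.get(groupKey) || 0) + row.value);
  });
  return Array.from(totals, function (entry) {
    return { key: JSON.parse(entry[0]), value: entry[1] };
  });
}

// Hypothetical billing rows keyed by [customer, year, month].
const billingRows = [
  { key: ["acme", 2012, 1], value: 10 },
  { key: ["acme", 2012, 2], value: 5 },
  { key: ["zeta", 2012, 1], value: 7 },
];
const perCustomer = groupAndSum(billingRows, 1);
const perMonth = groupAndSum(billingRows, 3);

// All of acme's 2012 billing, summed per year.
const acmeUrl = viewUrl("http://localhost:5984", "billing", "billing", "by_month", {
  startkey: ["acme", 2012],
  endkey: ["acme", 2012, {}],
  group_level: 2,
});
```

The same view therefore answers per-customer, per-year and per-month questions just by changing group_level.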
There are of course some cons to using CouchDB, some of which are easily addressed. I plan to write more about them in the next few posts.

Filed under  //   couchdb   databases   mapreduce