Scaling and Large numbers

Let's take a pop Quiz.

An alter table on a MySQL database, takes 10 seconds to execute on a table of 30,000 rows.

 

How much time will it take on a  table that has 80 million rows?

Try and take an educated guess of what the real time would have been?

Scroll down for the  answer.

So here goes a real story.

We needed to add a column to this table in one of my projects, and I intuitively assumed, it probably won't take more than couple of hours.

For somebody who's been working with large amounts of data, it was a stupid call to take. Even a back of the envelop calculations assuming linear scaling reveals it will take about 8 hours.

One of the biggest thing you learn when dealing with such big numbers ( 100 million is not by any extent a large number these days) is that intuition is really bad at figuring these things. It is the similar to the situation described by George Gamow's (Hottentots) Tribe, that has only 4 numbers one, two three and many.

Also to remember that a lot of things especially at large numbers do not scale linearly. (Of course I can't say these of all systems)

This is one of the reasons why I've seen a lot of apps that work very well with say 100,000 users  fail spectacularly when numbers increase to even 1 million or 50 million.

This graph shows that Couchdb performance seems to get worse especially at the 1 million mark, and even increase steadily afterwards. #couchdb (1.1) write performance over 5 million docs on my la... on Twitpic

A project that ignored to do this basic benchmarking failed pretty badly to get beyond a few million documents in the database.

The only way  you can build this intuition is to do it and figure it out over and over again. There is no replacement for real metrics that use the size and the shape of your data.These sorts of data is not amenable for "Oh! I can do it again" agilistic way of solving problems. Because of the times involved. Most big data problems have to be thought through upfront, otherwise it causes a lot of time wastage and downtime.

In Summary, Discard your intuitive thinking when dealing with large numbers. Create a model with the data similar in shape and size to base your estimates on. Otherwise there is a lot of wasted time and unplanned downtime for your application.

Answer

It took 1 day and 16 hours for our alter table to complete. Almost 5 times the estimated time

 

 

 

 

 

 

Filed under  //   couchdb   intuition   mysql   scaling  

Node.js Patterns Presentation

I spoke at the JsFoo a month ago, It was a brief introduction to Node and things we dealt with when building Activenode( from the NodeKnockout).

It was meant for a beginer Node programmer who knows javascript but is still trying to wrap his head around the asyncronicity of Node. 

I hope to update the presentation with all the stuff we've been learning, soon enough.

Here it is on iWork

 

 

Node toolbox

Node-toolbox

While searching for a node client for CouchDb, we came across quite a few node.js packages. 6 to be precise, to figure out the best fit for our project, we had to do a fair bit of Googleing and look at Github to see which ones are active and better fit for the project.

A lot of projects are defunct, have very little documentation or just experimental. It would be good to go to one place to find the right module and a general feeling of how good the module is. Turns out there is one, the Node Modules Wiki page. It shows simple wiki page  that lists modules by categories.

Coming from the Ruby world which has a similar vibrant community and tons of gems, we felt there needed to be a version of the awesome Ruby Toolbox 

So over a weekend we build a simple version of it and called it Node toolbox.

We pulled all the information from the Node Modules page  and  built the category mapping of each NPM Module. Whereas NPM Registry has almost 4500 packages, the Node Modules Wiki just lists  about 1000 of them.

NPM Registry does contain tags filled in by the package developers, but again only a few of the packages had this information.

There were more than 500 tags, from these we created about 100 categories, but merging them into one category, adding new ones or completely eliminating things we thought did not make sense. Along the way we removed duplicate tags like test/tests/Testing and so on.

Packages_for_categories

After putting the modules into their categories, we looked up github for the repository information, stuff like Watchers and Forks. Which in our opinion is a good indicator of the Project itself. And are currently using that to build our module ranking .

Ranked_packages

There are a lot of things missing now, The categorization and the categories themselves is arbitrary. We’d like owners of modules to categorize them correctly.

The choice of ranking the modules on github watchers/forks is also arbitrary. We could if it were available somewhere, leverage npm module downloads. User Feedback, repository information such as Open Issues Last commit time, Availability of tests, build status and so on.

As usual, feedback is appreciated and can be tweeted towards @nodetoolbox.

Filed under  //   node-toolbox   node.js   nodetoolbox   toolbox   toolbox.no.de  

NodeKnockout and Activenode.no.de

Last weekend we participated in the community driven 48 hour programming contest called Node Knockout. The idea is to go from an idea to production in exactly 48 hours using the wonderful Node.js. Noders from all over the world compete for the prizes that are awarded based on Innovation, Design, Utility and Completeness

We'd signed up long time ago, but we had no idea what we would be building till the morning 5.30AM( 0:00 GMT). Initially I was thinking of building  slick visualization over the Bitcoin transactions, but because of the limited scope and technical complexity, iI was wary about it. Eventually we decided to build a node.js equivalent the Rails' awesome New Relic's RPM. It also looked like something that could be build progressively, and be done quickly.

 

Another choice I had to make in the morning was the choice of deployment, We had to choose better the 3 official alternatives

  1. Joyent's No.de platform : It's easy to deploy and we get ssh access to the machines, but it uses Solaris and it's weird package manager, that we had very little idea about.
  2. Linode: Very flexible, and we get the choice of OS. But much harder to get the deploys right. We had to script our own, there was good documentation though
  3. Heroku: The an in between option, good deployment support but lack of ssh access and apps was a turn off.

We chose Joyent's No.de which remained our final deployment platform. We briefly switched to Linode, because for the first day of our usage No.de had broken web socket support, so the real time updates to the map was not possible. This issue was fixed the next day and we switched back to no.de for our deployment. It sucked a bit of time away but it was not really that much.

It is currently hosted here.

We used Redis for our storage, and later it turned out that it was a good choice, simpler storage format was a win. We used the Redis's monitor feature to build the live map. We'd pick the new request's Remote host and translate it to the city of origin which was plotted on the map. This gave us an awesome integration with socket.io. I am still at a loss of how we would have implemented it in any other database( couchdb's changes would also work)

We also used the awesome CoffeeScript to write the code, both of us were not CoffeeScript experts, but it was never a cause of any weird issue.

We published a npm module activenode-monitor (npm modules are equivalent of ruby gems) for usage in the express applications. But since it needed to talk to a central Redis on the no.de machine, which was not accessible from outside the machine(no.de has weird ssh setup) we could not get it to work. It took us a lot of time to realize that we could just use the free Redis plan of redistogo. And it just worked wonderfully well once we integrated it.

We store a lot of interesting client/server information into redis(though we did not get the time to implement views for all of it) and we are filling up the 512K of memory available on the free plan in about a day. So I need to flush out all the data almost every day.

It's an amazing experience to compress so much learning into 48 hrs, sometimes we were dazed and confused and sometimes in a wonderful flow. Having a usable app at the end of it is definitely a bonus.  Right now the judges are rating the final 180 entries out of the 302 that started. Some of the feedback we've got is pretty encouraging. We seem to be scoring high on utility and design and low on innovation and completeness. So any chance of our winning or being in the high ranks is unlikely.

I can excuse myself for the completeness, because of a conincidence of devops conference occuring on the same weekend and i had signed up to talk about Devops on Cloud. Finishing the slides, preparing and delivering the session sucked a lot of time out of my Sunday.

All in all it was good fun and a wonderful learning experience. We are looking forward to NodeKnockout 2012.

If you have reached the end of this. Please show us some love by voting for the app here.

 

Filed under  //   activenode   coffeescript   javascript   monitoring   node.js   nodeknockout   redis  

From the begining programmer's bookshelf

We were looking for some book recommendations for some of the new people we hired recently. And here is an opinionated list of books and some of my thoughts around it.

These are some of my favorite books that I wish i had read, when I started programming. But I did start with enviable The C++ Programming Language, which though my personal favourite, I would not recommend to any newbie.

 

A few caveats

There are programmers and not Managers or Agile Coaches(whatever that means)

They are early in their software development careers.

They are going to be in a Application development/Consulting Shop (no Don Knuth for them)

 

 General 

 Andy Hunt and Dave Thomas'     From Journeyman to Master  I wish i had read this book earlier, This changed the way i thought about programming.

 Kernighan & Pike's  Practice of Programming   A very dated C/Unix based book, but still the book that makes us realize we are standing on the shoulders of giants

Erich Gamma et. al's Design Patterns This might be hard to read, but the practice of using some of the pattern rightly or wrongly in code will make you a much better programmer.

Ruby


Dave Thomas' Programming Ruby 1.9 This was the best book when i started picking up ruby, and it is still the best book now.
 

Russ Olsen's Eloquent Ruby Ruby programmers are very opinionated, and this book helps newbie programmer to fit into that world much faster than reading other's code and practice( thought there is no substitute for that)

 

Agile

Kent Beck's Test Driven Development by example The book that launched a million Test driven developers. It needs no introduction.

Martin Fowler's  Refactoring The book that should be on the side of every programmer. Learnt so may techniques from this book.

 

Let me know if you have books that might be better?

Things we don't like about Couchdb

We love Couchdb, it is a workhorse, and we use it on a lot of projects. Obviously it does not fit everywhere. There are some places where it just does not cut it. I am trying to document a few issues we've been having with couchdb.

Building Views

Views are built at run request time. We think that should be an option - there should be a way to amortise the cost of building views either when adding documents or at usage time. We get a load of documents everyday into the system, and for a while we watched the system come to a grinding halt while the first request indexed the views, some other times the view request returns timeout.

The problem here is that when a view is being built it just blocks any operation on that design document, which means other requests are blocked as well (unless they go to a different design doc).

We work around this problem by pre-warming all the instances, after the documents are loaded into couchdb.

Query API

View query API is quite powerful but solving all problems with map-reduce is not trivial. Recently we had to implement a functionality to find if a date range overlaps with something else. Here is the gist that does something like that.

As you can see we had to emit all the days for the network and query the view with a startkey and endkey. This is way too complicated than a simple SQL query to acheive the same result. I am lucky that my granularity is days, if you want to find the range over a hourly basis, then it is much harder. The views will get much slower and (likely) take larger space.

The view collation is quite useful and powerful, but is it very restrictive. Array or String keys are used more often than Hash Keys(pretty much useless) for view collation to retrieve a result set. Even Array keys have limitations. for example if your view emits

['abcnews','christiane'], ['cnn', 'christiane'], ['cnn', 'fionnuala']

then startkey: ['cnn'],  endkey : ['cnn', {}] would match ['cnn', 'christiane'], ['cnn', 'fionnuala'] 

But you will have to write a new view if you want to search by the newscaster. i.e. you can't do

startkey: [{}, 'christiane'],  endkey : [{}, 'fionnuala'] or startkey: [{}, 'christiane'],  endkey : [{}, 'christiane']

Date Support

Date Support is pretty primitive. I cringe everytime I have to tell my team that dates in couchdb have to be in specific format. It may more be an issue with javascript date handling. But CouchDB spidermonkey supports only one date format. I do not think couch should go towards the MongoDB Date data-type, but having Date.parse support multiple formats, or even iso8601 would be really useful.

As you can see in the gist above, we manually parse the date and build the date object. This is error prone, there are atleast 2 javascript subtleties there.(Hint: parseInt(date, 10) and javascript month being 0 indexed)

Others

There are some other issues that do not climb high on my radar.

  1. Paging. It is slightly complicated in Couch and has caused a lot of confusion, but once you get the idea, it's pretty simple to implement. We struggled with it initially, but the pattern fell into place quickly
  2. Performance. The current release of Couch(1.1.1) seems to be much better on performance than the older versions. But then we would never choose Couchdb for it's performance alone. There are other alternatives out there.

Filed under  //   couchdb   databases   nosql  

Goodbye Rails, Hello Node.js

A few months ago, I decided to play with node.js. The new server side JavaScript hotness and now I am hooked onto it. I use Ruby on Rails for most of my work. Though I am not a big fan of Rails, it is better than most things around and it's in Ruby which I really like.

Off-late most projects that I have been working on, have much less ruby and a lot more of JavaScript. My typical application consists of thin RESTful API transporting data as JSON, which is consumed by JavaScript in the browser. Can we not use JavaScript for the whole application instead? And get significant other benefits as well?

This is where Node.js fits in. Sitting on top of the highly performant Google V8 JavaScript engine, it gives you a bunch of libraries to write your networking code. In other words, it is JavaScript on the server side. FTW

As a side note, I used to be extremely scared of the JavaScript funkiness, till I apprenticed with some JavaScript masters. Over a period of time working on more and more JavaScript, I have gotten pretty proficient. I still don't have the same control with JavaScript as I have on Ruby and  I still forget to add the semicolons to the statements, but it's not that often now.

Asynchronous event driven API that leads to high scaling and lower memory footprint, we all know that (The now famous C10K problem). I have used EventMachine in Ruby and it always felt like working against the grain in ruby. Blocks typically are thrown for instant execution and rarely as a callback. The lack of Fibers in Ruby 1.8.7 does not help. Turning non-evented ruby code into evented ones is mind-bending exercise till you get it.

On Node.js it's the most natural way of programming, it doesn't even feel like a shift in style. JavaScript programming is all about callbacks invoked later in time. There is no such thing as EM.defer and EM.run, and checking if you are in a reactor loop.

There is some serious innovation happening around Node.js. Some are being built over experiences of other Languages. There is no such thing as Rails for node.js (which I consider a smashing thing). Most people use Express which is similar to Sinatra. Coffee script became popular because of and in the Node.js community. Now the Rails community is trying to make coffee script the default . RVM became popular only in recent times in the Ruby Community, We already have NVM doing the same thing for node versions. I can't forget to mention the fantastic Socket.io for realtime push applications and it is truly trivial to use.

The mushrooming of cloud hosting for node.js is also a proof that a lot of vendors think it's interesting. Just off the top of my head, there is Joyent's no.de, Cloudfoundry, Duostack and Nodester all in some sort of beta stage. Although from my brief experiments, I prefer the CloudFoundry to no.de and Duostack.

And Rails definitely does not have a rap theme song as cool as this. http://soundcloud.com/marak/marak-the-node-js-rap. (Listen with headphones, please)

 

 

Filed under  //   architecture   evented   javascript   node.js   rails   ruby  

Rise of Symmetric Systems

For a long time we have been building software in a very standard fashion, put a web server in the front, a big application server and a standard backing database. This is the standard layered architecture. Most application stacks Rails or .Net for example assume a single database server, a few application servers, some web servers and a load balancer.

But as 'webscale' and 'elastic scaling', 'commodity hardware' have become buzzwords, there has been a quiet change in the way systems are being designed. Some people are calling it 'Symmetric Systems'.

In a nutshell, 'Symmetric Systems' is fitting the whole application stack on a single machine. The database, application server and web server are running on the same machine, hence it is completely self contained. If the application stack is crammed into one machine, then there is only limited amount of 'juice' left on the machine to service thousands of requests we all aspire to service. The scaling in these cases are achieved by adding a large number of such instances.

Symmetry

photo credit: southerncomfort

One of the systems I am currently working on, is symmetric. It has a very simple model as you can see from the figure below, each request is passed by the ELB to Nginx on any machine and passed all the way to Rails that interacts with CouchDB. The data is synchronized by CouchDB instances via replication.

Onemachinearchitecture1

This has been  a thought shift for me, where in my previous life, I've been scaling systems by breaking it down into smaller components, that can be hosted individually. For example, in the ticketing system that I had worked on before, the booking service was hosted on one machine, the customer service was on another machine and the ticketing service on another. That way the load could be distributed across the application. Really? What about the network call overhead across the services? What about complexity of network of interactions? Remember Metcalfe's law. There are ways to deal with those things but that leads to inevitable increase of complexity.

Benefits

Elastic scaling:

This should come as no surprise. Most of the times, choosing an hardware is a matter of guess about how many people are going to use the system and how the number of people will increase. These guesses are expensive to get it wrong. Instead there has been a large scale move to the cloud where you get instant instances and as load increase it is trivial to increase the number of machines serving out the application. In Theory, such systems can scale linearly to handle load.

No Single point of failure: 

If a machine goes down, it just reduces our load bearing capacity, it does not cripple the system in case of  Non Symmetric design, where if the ticketing service goes down, the application is down. This has a direct impact on the simplicity of the system. There are no backup machines no slaves  to manage and deal with.

Locality of code and data:

Since the whole stack on the same machine, there are no expensive network calls to talk to the database and to other components. This means faster and highly predictable response times.

Deployment:

Because all the stack is on one machine deployment is trivial, Copy the code, and start the services. Our deploy.rb (other than standard sets for rvm and bundler)

after "deploy:restart" , "bluepill:restart"

That's it. We also have ZERO deployment downtime, To redeploy we have the following instructions

  • cap elb:unregister
  • cap deploy
  • rake production_tests
  • cap elb:register

Rinse and repeat till all the machines are upgraded.

This personally is one of the single biggest win as developer(#devops). On a similar scale .Net application I worked on we had downtime window of 4hrs in the wee hours of morning to deploy and verify stuff was working on each component. And after a few hours of deployment of each component, we'd realize the whole system was not working because of a step forgotten on one component. I really really dread those deployments, so did everybody else. Eventual result we deployed every 3 months and that just made it worse.

Consistent Environments:

Developers QA's and the production environments are exactly the same. the number of machines in each  environment. The only thing missing in the development environment is ELB. This means the app is being tested by everybody and in the same way. Unlike the weird bugs you see only in testing and production environments, where the components are wired up differently or have incorrect wiring up.

No Versioning issues:

One of big problems with the divide an conquer approach is versioning of components in the system. Such systems inevitably need to version individual components and that means incompatible versions won't work together properly and break in rather weird ways. Add one more layer of complexity to manage all of this.

Symmetric systems don't have this issue, all of it is released in one go on a single machine.

Drivers

Low footprint databases, application and web servers:

Oracle or Sql Server need dedicated hardware and needs to run on separate machine, I don't know if that is strictly true, but definitely the pitch i have heard from the vendors. These days we have some serious options for running small databases on commodity hardware. No, i'm not talking about Sqlite or Prevayler. CouchDB, Redis, MongoDB all fit the bill for such databases and they are really fast but still battle hardened databases.

Amazon EC2 and friends:

The Amazon Dynamo paper is one of the key proponents of Symmetric systems. Amazon's architecture is based on dynamo, and it is no surprise that a lot of features it offers expect systems to be symmetric. For example, Cloudwatch is easiest to use if a system is symmetric.

Simplified Clustering/Replication of data:

Every NoSql database worth it's salt provides either clustering or replication or both of data. This means data does not have to live in a central database, it can be distributed over a cluster.

The hard stuff

All things can't be Distributed -  We have a Resque process that runs expiry of certain type records in couchdb every day. We have not found a simple way of distributing this on each machine, without duplication and conflicts. Such things are on a random instance picked up from a central redis queue.

Deployments are slightly hard - We use Capistrano, the default model is to deploy remotely to a set of machines labelled web, app and database, what we really wanted was to deploy the whole thing to one machine, that we just created by Amazon Auto Scaling. We have a hacky solution to this, and it took me a few days of head scratching to get right. I'd interested in seeing how other people are doing it.

Monitoring has to deal with lot more machines - The simple things are generally covered by your cloud provider, but application specific stuff is harder to deal with. For example debugging failures are much harder, because you never know which machine the request went to. We consolidate logs from each machine in one place, and review them periodically.

Can we fit the whole app on the same machine? - There are apps that just won't fit into a single machine, because of the earlier architectural decisions. We have been very aggressive with our choice of software to run on our machines. Apache is out because it takes a lot of memory, Nginx is our choice of Web server because of it's efficiency. Any other way we'd be putting out Bloatware.

Filed under  //   architecture   cloud   distributed   nosql  

Restful design, Open formats and an Outlook Add-In

We mostly work with startups, and most prefer using dynamic languages like Ruby and Python. Occasionally though, we have worked on .NET/Java projects because they were really interesting. One such project was to build an Outlook add-in to synchronize mails, contacts and calendar to a CRM system. There was already an existing add-in we were replacing, because it had lots of bugs, and its performance was bad. We analyzed the problems with the existing design and moved towards a more RESTful design. Using a RESTful design helped us in several ways:

1. Crisp protocol between server and client

The existing plugin and the server communicated over a SOAP based protocol. The protocol was RPCish, and this resulted in the protocol being very chatty. We were pretty sure that this would have caused performance issues for even moderate data. This also meant that there was a tight coupling between the server and the client - more service methods under contract makes it difficult to change either of the service or the client code independently. 

To avoid these problems, we exposed the server code as a RESTful web service. We identified resources (Appointments, Contacts and Emails) and standard operations available on these - using HTTP verbs (GET/POST/DELETE/PUT). The behaviour of these methods are well-understood. The fact that the service was over HTTP, makes it easy to implement the client - most libraries come with a standard HTTP client class. .NET exposes a very good HTTP Client as well.

We did break the REST in a couple of ways - we used POST as a “patch” method equivalent and had to introduce a “keep” method to explicitly synchronize the appointments due to 2 reaons:  a) Outlook add-in APIs don’t provide hooks to deletions on the client in a clean way and b) avoid client-side performance issues.

2. Server client responsibilities

One of the nice side-effects of REST is that, when you identify the resources correctly, the responsibility of the server and client falls in place - server lists the contacts/appointments when requested for the list of these and updates the resources when it receives a POST. The server is not responsible for making sure that the sync happens correctly, it is the responsibility of the client. Keeping track of the synchronization - last synchronization time, time differences between the server and client etc, are again the responsibilities of the client. 

The older add-in code stored the IDs of the outlook appointments/contacts/emails on the server. First of all, this information does not belong to the server, and then this also ties the server to a particular client. We avoided this to allow for synchronizations across multiple clients at the same time. This enables the users to install the same client on more than one machine, and still keeping all different systems in sync.

3. Open formats and standards

This was one of the first decisions we made, as this would help in better interoperability. Individual contacts were represented using vCard, appointments using iCalendar formats. This made it possible to subscribe to the appointments from any device that supports the iCalendar over HTTP - iPhone, Android and other smart phones, Mac OS X and Linux. We looked at using SyncML and ActiveSync protocols for the synchronizations. ActiveSync was out of question, as it was a proprietory. We finally ended up using a custom XML based protocol for emails and contacts, as there was no easily implementable standards for them.

Challenges

Testing in general was a challenge on this project. Outlook add-in APIs are basically wrappers over COM based APIs. These COM objects are inherently untestable. Mocking wasn’t of much help, as we were starting to see unmaintainable tests with very little value. We tried to get some automated functional tests using a library called White. But Outlook UI turned out to be very difficult to automate, and the effort in writing these tests was too much to be beneficial. 

Another challenge was the limitations of the Outlook APIs (there is no easy way to get notified of deletions), the fact that we had different synchronization strategies for Contacts, Appointments and Emails didn't help the cause either.

 

Filed under  //   outlook   projects   rest  

Introducing Couchup. An interactive Couchdb Console

Recently a friend asked me how to fetch multiple documents from couch using a single request. Couch 0.90 and above provided the ability to post to couch the array of keys to return the matching documents. But he was having trouble getting curl to work correctly. I had to dig through the code to give him the actual curl request.

One thing that I could never figure about the CouchDB tools ecosystem, was the total lack of a simple console to interact with CouchDb. MongoDb has a mongo command line client. Even Cassandra has a simple cassandra-cli.

I know Futon is nice, but as a developer i need more power. And even though HTTP is ubiquitous, i need simple ways of sending requests to Couchdb, without dealing with setting the headers and json encoding the parameters.

So I wrote a simple, irb based Couchdb Console, we call it Couchup. It gives a simple way of accessing couchdb via command line. The source is hosted on here .

It's quite simple and based on the awesome couchrest gem

You can install it on your machine by running

 

It definitely is not a replacement to Futon, but is a simple way of interacting with couchdb. And it does some nifty things that Futon cannot do. Like, using the endkey/startkey to query views, IMHO one of the most powerful couchDB features.

Couchup can do a lot more, and i am using some of my 20% time to make those changes. The documenation will be kept up to date on the github pages.

One design choice I had to make about couchup was the choice of basing it on irb, which means that it has to follow ruby syntax. This can be extremly powerful for people who are familier with ruby  and easily write scripts based on it. The flip side though is that it has to follow the ruby syntax which leads to some verbosity in the commands.

eg.  

>> create :database, :foo

is the correct way of creating databases, instead of conventional(from mysql client)

>> create database foo

There is a bunch of stuff that needs to be done on couchup and it is rough not just on the edges but also in the core. But i hope to finetune it to something that could be a useful tool in you couchdb toolbox.

And yeah, any feedback on what could be done better would be great. Feel free to send me an email sreeix@gmail.com.

Filed under  //   console   couchdb   couchup   nosql   tech