This week was dominated by running up against some limitations in my datastore. In brief:
- CouchDB inserts are very slow. This became a problem because I needed to do several rounds of inserts with upwards of 10 million rows per pass. I spent a good deal of time optimizing my insert program and managed to speed it up by almost a factor of 2. In the end, however, CouchDB imports still seem to run an order of magnitude slower than those in Postgres or MySQL. I’m not sure whether this is due to the “everything over HTTP” model or the underlying data structure, but it became painful for a large dataset under a tight time constraint.
- CouchDB view generation is also slow. I knew this from the outset but didn’t fully understand how challenging it would be in the context of my project. Since a large part of my project is data exploration, slow view generation means slow data exploration: it takes hours to try a new view on a large dataset, whether or not it turns out to be a useful avenue of exploration.
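For readers wondering what insert optimization over HTTP can look like, here is a minimal sketch of batching documents through CouchDB's `_bulk_docs` endpoint, so that each HTTP request carries many rows instead of one. This is an illustration, not the import program described above: the function names, the batch size of 1000, and the use of Python's standard library are all assumptions.

```python
import json
from itertools import islice
from urllib import request


def batches(docs, size):
    """Yield successive lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_payload(chunk):
    """Encode one batch as the JSON body that _bulk_docs expects."""
    return json.dumps({"docs": chunk}).encode("utf-8")


def bulk_insert(db_url, docs, size=1000):
    """POST documents to <db_url>/_bulk_docs, one HTTP request per batch.

    `db_url` is a hypothetical database URL such as
    http://localhost:5984/mydb; the batch size is an arbitrary guess.
    """
    for chunk in batches(docs, size):
        req = request.Request(
            db_url.rstrip("/") + "/_bulk_docs",
            data=bulk_payload(chunk),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req) as resp:
            resp.read()  # per-document statuses; ignored in this sketch
```

Because `batches` accepts any iterable, the rows can be streamed from a file or generator without holding all 10 million in memory. Batching reduces per-request overhead, but it does nothing about view generation time.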
I see now that CouchDB, while appropriate for the end result of my project (an HTML5 app with JSON data), seems unsuited to exploring large datasets. That being the case, I’ve shifted my immediate goal to using the Unix command line to put together a suitable subset of summary data that I can use in a demonstration with HTML5 and JavaScript.
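As a rough sketch of what building summary data outside the database might look like, the snippet below streams delimited rows and counts them per key, much like a `cut | sort | uniq -c` pipeline, then emits JSON that a browser app could load. The tab-delimited input, the column index, and the `key`/`count` field names are assumptions made for illustration, not details of my actual data.

```python
import csv
import json
from collections import Counter


def summarize(lines, key_column, delimiter="\t"):
    """Count rows per distinct value of one column, reading one row at a time."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > key_column:  # skip short or blank rows
            counts[row[key_column]] += 1
    return counts


def to_json(counts, top=None):
    """Serialize the counts, largest first, as a JSON array of objects."""
    return json.dumps(
        [{"key": k, "count": n} for k, n in counts.most_common(top)]
    )
```

Calling `to_json(summarize(sys.stdin, 0), top=100)` after `import sys` would turn a stream piped in from the shell into a small JSON file. Only the counts are kept in memory, so the input can be far larger than RAM as long as the number of distinct keys stays modest.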