
This week I ran up against some limitations in my datastore. In brief:

  • CouchDB insert performance is very slow. This became a problem because I needed to do several rounds of inserts with upwards of 10 million rows per pass. I spent a good deal of time optimizing my insert program and improved its performance by almost a factor of 2. In the end, however, CouchDB imports still seem to run an order of magnitude slower than those in Postgres or MySQL. I’m not sure whether this is due to the “everything over HTTP” model or the underlying data structure, but it became painful for a large dataset under a tight time constraint.
  • CouchDB view generation is also slow. I knew this from the outset but didn’t fully understand how challenging it would be in the context of my project. Since a large part of my project has to do with data exploration, slow view generation amounts to slow data exploration; it takes hours to try a new view on a large dataset, whether or not it turns out to be a useful avenue of exploration.
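
For illustration, one common way to cut per-document HTTP overhead is to send documents in batches to CouchDB's `_bulk_docs` endpoint. The post does not say whether the insert program already did this, so the following is only a minimal Python sketch under that assumption. The batch size, the function names, and the example database URL are all hypothetical.

```python
import json
import urllib.request
from itertools import islice
from typing import Iterable, Iterator, List


def batched(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_payloads(docs: Iterable[dict], size: int = 1000) -> Iterator[bytes]:
    """Encode each batch as a request body for POST /{db}/_bulk_docs."""
    for chunk in batched(docs, size):
        yield json.dumps({"docs": chunk}).encode("utf-8")


def post_bulk(db_url: str, payload: bytes) -> int:
    """POST one batch to CouchDB and return the HTTP status code.

    `db_url` is a hypothetical database URL such as
    "http://localhost:5984/downloads".
    """
    req = urllib.request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Each payload carries many rows in a single request, so the number of HTTP round trips drops by roughly the batch size. Whether that closes the gap with Postgres or MySQL is not something this sketch can show.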

 

I see now that CouchDB, while appropriate for the end result of my project (an HTML5 app with JSON data), seems unsuited to exploring large data sets. That being the case, I’ve shifted my immediate goal to using the Unix command line to put together a suitable subset of summary data that I can use in a demonstration with HTML5 and JavaScript.
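
As a rough idea of what such a summary step could look like, here is a small Python stand-in for a command-line pipeline. It assumes the staged data holds one JSON record per line (an assumption, not something stated in the post) and uses the `date` and `artist` fields from the JSON record format shown further down this page. The function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def downloads_per_artist_per_day(lines: Iterable[str]) -> Counter:
    """Count downloads keyed by (date, artist) from JSON-lines input."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        counts[(rec.get("date"), rec.get("artist"))] += 1
    return counts
```

Because it only needs an iterable of lines, the function can be fed an open file or `sys.stdin`, which keeps it usable at the end of a shell pipeline.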

This week I’ve been continuing work on data imports and exports (currently I’ve processed ~33 million rows of the MySQL database), and have done some test imports on the JSON data from S3. As of now I have a subset of about 30,000 rows in CouchDB that I’m starting to use for testing visualization.

I also had some success with getting my HTML documents into CouchDB. It’s been a short week, but I hope to have a couple of working prototypes of visualizations with real data by the end of next week.

A few updates this week. First, I have an updated JSON format:

{
  "referrer": {
    "url":"-",
    "host_ip":null,"host":"-"
  },
  "release":"Happy Birthday",
  "title":"Girls FM",
  "artist":"Happy Birthday",
  "format":"mp3",
  "source":"s3",
  "date":"2010-01-13",
  "request_location": {
    "city":null,
    "country_code":"US",
    "state":null,
    "lat":null,
    "lng":null
  },
  "time":"23:01:53"
}

This format encapsulates a couple of changes. Most importantly, I’ve stopped trying to track the location data for referrer URLs. I found that the data was generally incomplete. Even more problematically, the data was relatively meaningless: referrer location was based on a host lookup via domain name, and the resulting IP was then geolocated. Since the hostname-to-IP mapping is not guaranteed to be consistent over time, it was quite possible that the IP addresses were not the same as when the download request was made. Furthermore, the IP address in almost all cases points to the location of the referring server, which, in practice, has very little to do with the location of the creator of a web site or its content.

Second, I spent quite a bit of time generating JSON data. I was able to churn through all the S3 logs (~) and I’ve made significant, if only partial, progress on the older MySQL data from ogami. As of this writing I’ve generated JSON for roughly 16 million of the 47 million rows in the table. The remainder of the processing should be done sometime this weekend.

Lastly, I’ve been doing some experiments with graphical elements in HTML5. Using a small JavaScript library that I wrote, I created the following page, which generates random orbiting circles that follow the mouse and settle down after a few seconds:

http://fi.ero.com/dLib/mouse.html

This should work in all modern browsers and IE9.

Mid-term presentation:

http://prezi.com/ackfew7iahrp/visualization-of-digital-asset-consumption-data-in-html5-and-other-web-technologies/

Latest commit is here:

https://github.com/deanh/DXARTS-511/commit/640a6c9742936300704fe2e09877242573bd46ae

An interesting article by Lev Manovich popped up, “What is Visualization?”

http://manovich.net/2010/10/25/new-article-what-is-visualization/

The article discusses the differences between scientific visualization and “InfoViz” and is relatively apropos to the conversations that we had in class.

Manovich is particularly concerned with the difference between the reduction which has been used historically and the concept of “visualization without reduction.” The latter has been made possible by current technology and may allow for richer, more interesting visualizations.

Also of interest is his concept of media visualization, which may or may not (depending on the medium) actually be visual. In media visualization, elements of the actual media are used in their representation; the classic example is the tag cloud, where each word represents itself but the font size is used to represent relative frequency.
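
To make the tag cloud example concrete, here is a minimal Python sketch that maps word frequency to font size. The linear scaling and the pixel bounds are assumptions chosen for illustration; real tag clouds often use other scalings, and nothing here comes from Manovich's article.

```python
from collections import Counter
from typing import Dict, Iterable


def tag_cloud_sizes(words: Iterable[str], min_px: int = 12, max_px: int = 48) -> Dict[str, int]:
    """Map each distinct word to a font size proportional to its frequency."""
    counts = Counter(words)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for word, n in counts.items():
        # When every word is equally frequent, give them all the largest size.
        scale = (n - lo) / span if span else 1.0
        sizes[word] = round(min_px + scale * (max_px - min_px))
    return sizes
```

The word itself is the mark and only its size carries the data, which is the sense in which the media element "represents itself."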

Accomplished Tasks:

  • Research on data analysis, including working through the first 4 chapters of Head First: Data Analysis, by Michael Milton. This is a basic overview of data analysis techniques; it helps contextualize my work so far.
  • Further processing of S3 data: most of the MySQL and S3 download data has now been staged to disk, ready to be sent into CouchDB.

Next Steps:

  • Doing a closer analysis of the variables I’m mapping data to, and working out where I could pull further data from if the current set proves uninteresting
  • Looking more deeply at existing InfoViz approaches as templates for approaching my data

Accomplished Tasks:

  • Major refinement of MySQL data extractor: handles edge cases better and retrieves geo data more reliably
  • First pass at S3 log parser: I have a working regex to extract the necessary data
  • Re-organization to put shared code in a library
  • Test runs with generated JSON
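
The regex from the S3 log parser is not shown in the post. As a hedged illustration of the approach, the Python sketch below matches the leading fields of an S3 server access log line (bucket owner, bucket, time, remote IP, requester, request ID, operation, key, request URI, status, error code, bytes sent, object size, total time, turnaround time, referrer, user agent) and pulls out the ones that appear in the JSON records. The function name and the choice of output fields are assumptions.

```python
import re
from datetime import datetime
from typing import Optional

# Leading fields of an S3 server access log line; trailing fields are ignored.
S3_LINE = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'(?:"(?P<request_uri>[^"]*)"|-) (?P<status>\S+) (?P<error>\S+) '
    r'(?P<bytes_sent>\S+) (?P<object_size>\S+) (?P<total_time>\S+) '
    r'(?P<turnaround>\S+) (?:"(?P<referrer>[^"]*)"|-) (?:"(?P<user_agent>[^"]*)"|-)'
)


def parse_s3_line(line: str) -> Optional[dict]:
    """Return the fields of interest from one log line, or None if it does not parse."""
    m = S3_LINE.match(line)
    if m is None:
        return None
    try:
        when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    referrer = m.group("referrer")
    return {
        "key": m.group("key"),
        "ip": m.group("ip"),
        "status": m.group("status"),
        "date": when.strftime("%Y-%m-%d"),
        "time": when.strftime("%H:%M:%S"),
        "referrer": referrer if referrer is not None else "-",
        "source": "s3",
    }
```

Returning `None` for lines that fail to match, rather than raising, makes it easy to count and inspect rejects separately from good rows.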

Red Flags / Notes:

  • Now that I have adequate disk, I’m changing the data import to a two-step process: export to JSON text files on disk, then load the JSON into CouchDB. This will allow me to work on smaller chunks of data and better isolate loading issues
  • The MySQL data has consistency issues. I’m still working through it and fixing it where possible.
  • MySQL data has only referrer data (by hostname) and not request IP information. I’ve separated this out into two data fields: referrer_location and request_location
  • Hostname info in the MySQL database is from the request date; doing IP lookups now may yield different location data than was accurate when the request was made
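
The two-step import described in the first note above could be sketched roughly as follows in Python. The chunk size, file naming scheme, and one-record-per-line layout are assumptions made for the example, not details from the project.

```python
import json
from pathlib import Path
from typing import Iterable, List


def export_chunks(records: Iterable[dict], out_dir: Path, rows_per_file: int = 100_000) -> List[Path]:
    """Step 1: write records to numbered JSON-lines files of bounded size."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    try:
        for i, rec in enumerate(records):
            if i % rows_per_file == 0:
                if handle is not None:
                    handle.close()
                path = out_dir / f"chunk-{i // rows_per_file:05d}.json"
                handle = path.open("w", encoding="utf-8")
                paths.append(path)
            handle.write(json.dumps(rec) + "\n")
    finally:
        if handle is not None:
            handle.close()
    return paths


def load_chunk(path: Path) -> List[dict]:
    """Step 2: read one chunk back, ready to be sent to the datastore."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

If a load fails, only the affected chunk file needs to be inspected and retried, which is the isolation benefit the note describes.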

Commit info is located here:

http://github.com/deanh/DXARTS-511/commit/af9aa2eca56ea513059f0539c3640ce37ca54032

An example of the current JSON data format is as follows:

{
  "release":"In Name and Blood",
  "format":"",
  "date":"2000-10-11",
  "artist":"Murder City Devils",
  "referrer_location": {
    "lng":72.9926986694,
    "country_code":"US",
    "zip":null,
    "state":"CT",
    "city":"Waterbury",
    "lat":41.5514984131
  },
  "request_location": {
    "lng":null,
    "country_code":null,
    "zip":null,
    "state":null,
    "city":null,
    "lat":null
  },
  "title":"Murder City Devils / At the Drive In Tour e-card"
}

Accomplished Tasks:

  • The MySQL and S3 data has been consolidated on a single terabyte hard drive
  • CouchDB storage and hosting is set up via CouchOne and they have confirmed that they will be able to host ~50 million rows
  • The MySQL data has been cleaned up so that downloads correctly reference the multimedia metadata in the multimedia table
  • An initial data loading script has been written to import the appropriate subset of the MySQL data into CouchDB

Red Flags:

  • We’re missing referrer data for a surprisingly large number of entries in the historical MySQL data; I may not be able to rely on geolocating those requests.

Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, Conn: Graphics Press.

Tufte, E. R. (1990). Envisioning information. Cheshire, Conn: Graphics Press.

Tufte, E. R. (1997). Visual explanations: Images and quantities, evidence and narrative. Cheshire, Conn: Graphics Press.

Tufte, E. R. (2006). Beautiful evidence. Cheshire, Conn: Graphics Press.

Hawk, B., Rieder, D. M., & Oviedo, O. O. (2008). Small tech: The culture of digital tools. Minneapolis: University of Minnesota Press.

Lam, H. (2008). A framework of interaction costs in information visualization. Visualization and Computer Graphics, IEEE Transactions on, 14(6), 1149-1156.

Title

Visualization of Digital Asset Consumption Data in HTML5 and Other Web Technologies

Description

The primary purpose of this project is to explore the viability of HTML5 and other web-based tools as a platform for building data visualizations which operate on large datasets. Specifically, I hope to normalize a dataset representing digital asset consumption patterns of users of subpop.com from 2003 (the earliest period for which we have data available) to present, and provide a web-based tool to explore the data visually.

Dataset

Due to web server reconfigurations over the time period in question, the log data exists in several different formats: 2003-2007 data exists in a table called “downloads” in the main subpop.com MySQL database, consisting of millions of rows; Nginx and Apache log files exist in various formats on subpop.com web servers from 2007-2010 and on the newer, cloud-based server; AWS S3 logs exist for most of the downloads hosted in S3 since 2008; and web usage data has been stored in Google Analytics since mid-2006. Each of these disparate sources shares the following common information: asset identifiers (database IDs or track titles which can be linked to track metadata), IP addresses, and timestamps.
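
Since every source carries an asset identifier, an IP address, and a timestamp, normalization can target one small common record. The Python sketch below is only an illustration: the `Download` class, its field names, the timestamp formats, and the assumption that the SQL timestamps are in UTC are all hypothetical and would need checking against the real data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Download:
    """Common shape for one download event, whatever its source."""
    asset: str      # database ID or track title
    ip: str         # requester IP address
    when: datetime  # timezone-aware, normalized to UTC

    def date_time_fields(self) -> dict:
        """Split the timestamp into separate date and time strings."""
        return {
            "date": self.when.strftime("%Y-%m-%d"),
            "time": self.when.strftime("%H:%M:%S"),
        }


def parse_sql_timestamp(ts: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS'; assumes the database stored UTC."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def parse_log_timestamp(ts: str) -> datetime:
    """Parse an Apache/Nginx style '13/Jan/2010:15:01:53 -0800' timestamp."""
    return datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").astimezone(timezone.utc)
```

Converting everything to UTC at parse time means rows from different servers can be compared and sorted without tracking each source's local offset.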

Task List

  1. Move log data from multiple sources to external hard drive
    • Pull 2003-2007 data from SQL file on S3 and import to local MySQL server
    • Collect Apache/Nginx logs from S3 and current servers
    • Retrieve S3 data
  2. Normalize and import data into single data store
    • Set up normalized database schema
    • Build data importer for MySQL data
    • Build data importer for Apache/Nginx logs
    • Build data importer for S3 logs
    • Import dataset
  3. Expose data via web API
    • Wrap data in ActiveRecord objects
    • Build Sinatra app to handle controller layer, and return JSON over HTTP
    • Set up hosting for Sinatra app on Heroku
    • OR: Set up public CouchDB host with datastore
  4. Design visualization
    • Identify information vectors
    • Identify possible narratives
    • Explore appropriate representations
  5. Create interface
    • Write HTML5 template (using Canvas and SVG tags)
    • Write JavaScript code to poll data from server via AJAX calls
    • Implement visualization in JavaScript using direct calls to Canvas API or via a library

Tools / Resources

  • Hosting
    • Heroku: Rails Hosting
    • Couchone.com: CouchDB hosting
    • Github.com: Source code repository
  • Software
    • Protovis: JS data visualization software
    • Ruby/Perl
    • MySQL (old database)
    • Postgres
    • git: version control
    • jQuery
    • ExplorerCanvas

I spent the weekend looking through some of the non-proprietary data that I have access to via my work at Sub Pop, in an attempt to figure out what of it could be a good fit for my project.

I’ve settled on asset download data; while the data from our recent digital consumption survey is interesting, the dataset is small, and for the purposes of this exercise I want to make sure that I have to address issues that only arise when dealing with larger quantities of data. In addition to server logs and Google Analytics data from our current web app (2007-present), I have an SQL database with download data from 2002-2007, which represents tens of millions of downloads from subpop.com over that period.

This historical download data can be mapped to artist, release, and track metadata via Sub Pop’s catalog database and should, in most cases, include both requester IP addresses and download timestamps. The IP addresses will enable us to generate location data (via GeoKit or something similar). Hopefully, this will provide a few options when I begin to design the visualization.

Also of longer-term interest, but possibly beyond the scope of the immediate project, is data exposed via the YouTube and Twitter APIs. I’ve been looking into the GData API for YouTube, and there is a ton of data on media consumption patterns to be had in real time. Similarly, Twitter’s search API could provide an interesting window into media popularity and sentiment.

My initial plan is to normalize and pull the available data into a centralized Postgres or CouchDB datastore. The datastore will sit behind a thin web server which will provide a REST API capable of responding to data queries with JSON objects. CouchDB already exposes its data as JSON via a REST API, so this may end up being the most straightforward approach. If CouchDB is somehow unsuitable, however, my plan is to build a simple REST API with Ruby/Sinatra or Node.js. The server will also be responsible for exposing an API for summary data.
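
To show the shape of such a thin JSON layer, here is a minimal sketch written as a Python WSGI app rather than the Sinatra or Node.js server the plan mentions. The routes `/downloads` and `/summary/artists`, the `make_app` factory, and the in-memory record list are all assumptions made for the example; a real server would query the datastore instead.

```python
import json
from collections import Counter
from typing import Callable, Iterable


def make_app(records: Iterable[dict]) -> Callable:
    """Build a WSGI app that serves raw rows and a per-artist summary as JSON."""
    rows = list(records)

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path == "/downloads":
            body, status = rows, "200 OK"
        elif path == "/summary/artists":
            counts = Counter(r.get("artist") for r in rows)
            body = [{"artist": a, "downloads": n} for a, n in counts.most_common()]
            status = "200 OK"
        else:
            body, status = {"error": "not found"}, "404 Not Found"
        payload = json.dumps(body).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(payload))),
        ])
        return [payload]

    return app
```

For local experiments the app can be served with the standard library, for example `wsgiref.simple_server.make_server("", 8000, make_app(records)).serve_forever()`. Keeping the summary route on the server means the client only downloads aggregated rows rather than the full dataset.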

Data manipulation and rendering will happen on the client side via JavaScript. Rendering will (tentatively) be handled by a custom library that sits on top of SVG or the HTML5 Canvas API.

Here are some JavaScript visualization libraries which may provide starting points for research:

http://vis.stanford.edu/protovis/

http://raphaeljs.com/

http://github.com/michael/unveil

http://processingjs.org/

Flare is a Flash library for data visualization:

http://flare.prefuse.org/
