Node.js Analytics - Part 1 : State of the Art

Thursday, Jun 2, 2011

Part 1 in a series of articles exploring real-time and historical analytics for Node.js, covering the analytics options currently available in the Node.js ecosystem. I think there's a need for better analytics in the Node.js ecosystem to support engineers and business folks. I'm researching and writing to stimulate feedback on this, and to explore what, if anything, needs to be developed in terms of open source modules, APIs, etc.

I've been working with server-side JavaScript and Node.js for about 6 months, including some work with server-side DOM manipulation and static site generation. I've been working on consumer internet startups for about 4 years, and before that I was in the business intelligence and analytics world. Having not quite rid myself of the analytics bug, I've been thinking a lot about how the practice of analytics might evolve in the Node.js world and tinkering with two open source contributions toward analytics:

Let's go through what's available currently to implement analytics in Node.js.

Logging

If you're developing with Node.js, chances are you're doing some kind of persistent logging. You might be logging critical events to make troubleshooting and development testing easier, logging outcomes of test suites like Vows, or keeping logs of requests and responses to understand how your server(s) are being used. You might be using a logging module like Nodejitsu's Winston, a hosted service designed specifically for logging like Splunk or Loggly, or a customized approach of your own. Hosting and deployment platforms like Heroku, Joyent, Nodejitsu, etc. often provide additional logging capabilities.
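As a rough illustration of the request/response logging described above, here is a minimal sketch that uses only the Node.js standard library. The names `formatLogEntry` and `withRequestLog` are made up for this example, as are the field names in the log entry; a module like Winston or a hosted service would take the place of the `write` function.

```javascript
'use strict';

// Hypothetical helper: turn a request/response pair into one JSON log line.
// The field names are an assumption, not a standard.
function formatLogEntry(req, res, startedAt, now = Date.now()) {
  return JSON.stringify({
    time: new Date(now).toISOString(),
    level: 'info',
    method: req.method,
    url: req.url,
    status: res.statusCode,
    durationMs: now - startedAt
  });
}

// Hypothetical wrapper: log one line each time a response finishes.
// `write` defaults to console.log; swap in a file stream or a logging client.
function withRequestLog(handler, write = console.log) {
  return (req, res) => {
    const startedAt = Date.now();
    res.on('finish', () => write(formatLogEntry(req, res, startedAt)));
    return handler(req, res);
  };
}

// Usage (not executed here):
//   const http = require('http');
//   http.createServer(withRequestLog((req, res) => res.end('ok\n'))).listen(3000);
```

Writing each entry as one JSON object per line keeps the log greppable during development while leaving the metadata easy to parse later.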

How logs are retrieved and analyzed is dictated by how they are written. Developers analyze filesystem-based logs in an ad hoc way during development using grep or other basic but powerful tools. Hosting environments may come with analysis tools, like Heroku's revamped logging and Joyent's Cloud Analytics. Cloud logging services provide APIs that are specialized for log retrieval and analysis, like Loggly's Retrieve API. Analysis tools for log data primarily focus on serving engineering needs, like troubleshooting, performance tuning, provisioning, testing, etc., with two basic functions:

search - search-engine-style query to retrieve log entries containing keywords
find - database-style query to find or count log entries based on metadata such as date, server, severity level, etc.
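To make the difference between the two functions concrete, here is a small sketch over an in-memory array of log entries. The entries, their field names, and the `search` and `find` signatures are all assumptions made for illustration; they do not reflect the API of Loggly or any other service.

```javascript
'use strict';

// Hypothetical log entries; real ones would come from a file or a logging service.
const entries = [
  { time: '2011-06-01T10:00:00Z', server: 'web1', level: 'error', message: 'upstream timeout on /api/users' },
  { time: '2011-06-01T10:05:00Z', server: 'web2', level: 'info', message: 'GET /api/users 200' },
  { time: '2011-06-02T09:00:00Z', server: 'web1', level: 'error', message: 'disk quota exceeded' }
];

// search: keep entries whose message contains every keyword (case-insensitive).
function search(log, keywords) {
  const words = keywords.map((w) => w.toLowerCase());
  return log.filter((e) => {
    const text = e.message.toLowerCase();
    return words.every((w) => text.includes(w));
  });
}

// find: keep entries whose metadata equals every field in `criteria`,
// optionally restricted to the half-open time range [range.from, range.to).
function find(log, criteria = {}, range = {}) {
  const from = range.from ? Date.parse(range.from) : -Infinity;
  const to = range.to ? Date.parse(range.to) : Infinity;
  return log.filter((e) => {
    const t = Date.parse(e.time);
    return t >= from && t < to &&
      Object.keys(criteria).every((k) => e[k] === criteria[k]);
  });
}

// Usage:
//   search(entries, ['timeout'])                    -> entries mentioning "timeout"
//   find(entries, { level: 'error' }).length        -> count of error entries
```

Search works on the free-text message, while find works on structured metadata, which is why counting and grouping fall naturally to the second function.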

The value of logs isn't limited to engineering. There's a long history of log data being used more broadly to make marketing and other business decisions, e.g. Apache logs providing an early foundation for what would become the multi-billion dollar web analytics industry. While this industry has evolved to rely more on client-side analytics and less on server-side data like logfiles, there's an interesting case to be made for the importance of server-side approaches in Node.js applications, supported most obviously by widespread use of Node.js in API servers (e.g. serving JSON to other servers via REST) and other situations where "clients" don't support client-side analytics.

Client-side Analytics on the Server

Because Node.js supports reuse of code between browser and server, JavaScript developers can now embed some of the most popular client-based web analytics services into server applications to take advantage of many of the powerful features these services offer.

This approach does not seem to have taken off yet, judging by the apparently limited adoption of these two modules. The case for using Google Analytics in this way is diminished by the extent to which some of its key features rely on cookies that aren't available in this server-side context. I'm curious to learn more about why there doesn't seem to be more interest in using Mixpanel, a more generalized approach to capturing JavaScript-defined events that seems readily applicable to the server.

Server-side Analytics on the Server

As far as I know, the analytics (software that helps engineers and business people understand what's happening with their product) available for Node.js is limited to the two categories above. If there are other options out there, I'd love to hear about them.

A new category that makes sense to me is software that's designed specifically for the Node.js server with the following characteristics:

Makes the collection of metadata about the sending and receiving of data as automatic as possible, e.g. add three lines of code and it's done
Instruments this metadata in a way that's standardized and optimized for analytics
Exposes a REST API on top of this metadata to support its use in a variety of analytics clients
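The three characteristics above could be sketched roughly as follows. This is not the instrumentation module mentioned in this article; `toEvent`, `instrument`, the event fields, and the `/_analytics/events` path are all hypothetical names chosen for illustration, and the in-memory array stands in for whatever storage a real module would use.

```javascript
'use strict';

// Hypothetical in-memory store; a real module would persist events somewhere.
const events = [];

// Build a standardized, analytics-friendly record from a request/response pair.
function toEvent(req, res, startedAt, finishedAt) {
  const headers = req.headers || {};
  return {
    timestamp: new Date(startedAt).toISOString(),
    method: req.method,
    path: String(req.url).split('?')[0], // drop the query string
    status: res.statusCode,
    durationMs: finishedAt - startedAt,
    userAgent: headers['user-agent'] || null,
    referrer: headers['referer'] || null
  };
}

// Wrap a request handler so that metadata is collected automatically, and
// serve the collected events as JSON at GET /_analytics/events (made-up path).
function instrument(handler, store = events) {
  return (req, res) => {
    if (req.method === 'GET' && req.url === '/_analytics/events') {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      return res.end(JSON.stringify(store));
    }
    const startedAt = Date.now();
    res.on('finish', () => store.push(toEvent(req, res, startedAt, Date.now())));
    return handler(req, res);
  };
}

// Usage (not executed here), roughly the "three lines" of integration:
//   const http = require('http');
//   const app = (req, res) => res.end('hello\n');
//   http.createServer(instrument(app)).listen(3000);
```

Keeping the event shape fixed in one function is what would let different analytics clients consume the same REST endpoint without knowing anything about the application behind it.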

I've made a small amount of progress toward this by taking a stab at the instrumentation of a Node.js request object for analytics.

I'd love to hear feedback, corrections, etc. on my take on the state of analytics in the Node.js ecosystem and on this specific proposal for an addition to it.

Stay tuned for coverage of real-time analytics next week.

Mark Soper Software Engineer Cambridge, MA
