Tuesday, August 28, 2012

Measure Everything

Measuring and acting on stats is an essential part of building successful products. There are many direct and indirect benefits to pervasive measurement and tracking of stats:

  • Accurate, real-time data enables better and faster decisions.
  • It empowers a data-driven culture where ideas can come from anyone -- ideas can be easily tested, and the best ones can be chosen based on their merit, instead of pure intuition, or because it’s the HiPPO.
  • Tracking stats and watching them improve in response to our iterations helps teams stay focused and is incredibly motivating.
  • Historical stats are a great way to keep an eye on how code changes affect the health of a product and processes.

Early on in the life of Polyvore, we decided to weave stats into everything we did. One of our engineering philosophies is to invest a fair amount of our resources into making ourselves more efficient. This approach introduces some latency into our projects but this is made up in the future by an increase in overall team bandwidth. Consistent with thie philosophy, the first thing we did was to invest in building a system that made it easy to collect stats and made it immediately useful.

We started by identifying different types of stats that people were already collecting in ad hoc ways or wished that they could collect. Based on these requirements, we came up with the following, super simple API:

my $stats = Polyvore::Stats->new({ 
    name => 'db_layer' , 
    roleup => MINUTE 

# observe a value for a given key
$stats->observe(45, ‘file_size’); 

# observe the occurrence of events
pre>stats->add(5, ‘facebook_disconnects’);

# utility hi-res timing functions that observe elapsed time for a given key

To measure anything, a developer on our team would instantiate a stats object with a given name, granularity. Arbitrary stats can be recorded using this object. Once the measurement was set up and running, we could access the collected data via a generic web dashboard, pragmatically through an API or graph them on a dashboard.

Some Examples

Polyvore's stats collection system is integrated into almost every subsystem (except itself!)

One of the earliest applications was in instrumenting our page generation times. Site performance is very important to us and we needed to profile where we were spending time in generating the response to each request. Tracking the # of calls and the time spent on each call allowed us to identify bottlenecks and to prioritize our time on optimizing the most important ones (via tuning SQL, caching, parallelization, pre-computation, etc). We make all our back end calls through an abstraction layer and this allowed us to easily instrument our DB reads by collecting stats in just a few places:

sub _read_sql {

# retrieve a stats object from a factory
# (these are reused across persistent server processes)
$stats = $instance->stats({ name => 'backend_stats' })

# find the calling method that is requesting the read so that we can log it.
# eg: select_user_by_id
my $caller = Polyvore::Util::first_caller_inside_pkg();

# begin hi-res timer
$stats->timer_begin($caller, 'read');

# execute the query
my $result = $self->_read($sql, $data, $shard_name);

# end hi-res timer
$stats->timer_end($caller, 'read');

return $result;

We did the same thing for our other back-ends -- Memcached, Solr and Cassandra.

Collecting this data for each request allows us do other cool things, like dumping it at the end of every page as an HTML comment, so that we could view source and see what had happened during a particular request:

<!-- 2368 : 0.180855 p1 
shard 'read'
shard main read: db17.polyvore.comcntmaxminsum
… some stats omitted …

This request took 180ms to generate (somewhat average for a typical page) and performed 25 reads, including some memcached hits.

We can also look at historical data in the web UI for each of the calls (past 24 hours):


select_collection_comments on average takes 1.7ms to execute.



We started out by storing the stats archives in mysql using a very simple auto-generated table schema. This approach got us off the ground quickly and worked reasonably well until the stats system became a victim of its own success. We started to collect stats in every aspect of our system and as the volume of data grew, we ran into scalability issues on the storage system. Given the simple schema (key, timestamp, value) we decided to use Cassandra as the storage system.

Today, we are logging and archiving 35M data points / day from over 100 different subsystems.

Write Performance

We have been using the stats system to collect data in our production environment and very early on it became clear that we could not simply write out all the stats without overwhelming our storage system.

Fortunately, we had already started working on a distributed job queue system. Our stats objects internally buffer the stats and occasionally flush to a RabbitMQ based job queue. The data is picked up by a bank of workers that write it to Cassandra. This approach has allowed us to not worry about the volume of data we are collecting and let the job queue smooth out and manage the write volume. The other benefit is that writing to a queue is essentially async and very fast. Our front end code never has to wait while stats are being flushed.

Getting Everyone Onboard

We made it easy, simple and useful to collect stats. But we’ve also made a point of celebrating stats. Having actual data that showed how a metric was improved was very gratifying, and we made a point of encouraging everyone to present their work in the context of the metrics they had improved.

A work in Progress

As with everything else we do, our stats system is a work in progress.

The first area for improvement is keep track of distributions in values. Because of the way we’re rolling up data, we’re only looking at average measurements over a given time period. Looking at averages alone can be misleading because it hides patterns in how the values are distributed. For example, if we’re measuring the performance of a single database select over time in order to spot slow queries (something that we do!), it would be useful to know that 99% run in under 4ms, but that 1% take 200ms, vs an overall average value of 6ms. Knowing that the queries is generally fast but dramatically slows down for 1% of cases would suggest a need for a fix where the average of 6ms might seem just fine.

Another interesting possibility to our status system is integration with Splunk. We’ve recently started using Splunk, an amazing tool for slicing and dicing log data in real time. and it would be interesting to tee off of our stats to Splunk for real-time monitoring and ad hoc queries.


Measuring and collecting stats is an essential tool for building successful products. The key to success is to make it easy, immediately useful, ensure you’re making decisions based on the data and to celebrate the results.

Also See