Tuesday, January 22, 2013

Worker Queue System

In the good old days, Polyvore started out with a very simple, traditional LAMP architecture. Everything more or less directly accessed a bank of MySQL servers, both to service web requests and also batch housekeeping jobs. But our traffic and data set has kept growing and growing. We started to experience massive spikes in our DB load when nightly batch jobs were kicked off. Jobs that would complete in 3 minutes started to take 5 hours.

Fortunately, our infrastructure team has a ton of experience dealing with scalability issues. Their solution was to use a worker queue system to break up massive jobs into smaller chunks, executed by a bank of workers. This approach allows us to utilize our machines better by spreading load throughout the day and also scale jobs by parallel execution of chunks on different worker machines.

Since we used RabbitMQ before, we considered using it as the building block of our worker queue system as well. However, we quickly found out that RabbitMQ falls short in two important aspects:

  • RabbitMQ is essentially a generic message bus. This means we needed to extend its functionality to make it a full fledged worker queue system.
  • If a task is added to a RabbitMQ queue, there is no native way of inspecting that task while it is queued. That means we won’t be able to check which worker is handling that task, dedupe tasks and more.