Tuesday, January 22, 2013

Worker Queue System

In the good old days, Polyvore started out with a very simple, traditional LAMP architecture. Everything more or less directly accessed a bank of MySQL servers, both to service web requests and also batch housekeeping jobs. But our traffic and data set has kept growing and growing. We started to experience massive spikes in our DB load when nightly batch jobs were kicked off. Jobs that would complete in 3 minutes started to take 5 hours.

Fortunately, our infrastructure team has a ton of experience dealing with scalability issues. Their solution was to use a worker queue system to break up massive jobs into smaller chunks, executed by a bank of workers. This approach allows us to utilize our machines better by spreading load throughout the day and also scale jobs by parallel execution of chunks on different worker machines.

Since we used RabbitMQ before, we considered using it as the building block of our worker queue system as well. However, we quickly found out that RabbitMQ falls short in two important aspects:

  • RabbitMQ is essentially a generic message bus. This means we needed to extend its functionality to make it a full fledged worker queue system.
  • If a task is added to a RabbitMQ queue, there is no native way of inspecting that task while it is queued. That means we won’t be able to check which worker is handling that task, dedupe tasks and more.

After a bit of research we decided to use Gearman. Gearman is a generic framework to farm out work to other machines or processes. It was a great fit for our needs, especially since we use Perl extensively here in Polyvore, and Gearman has a Perl client and APIs. In addition, we already deployed and use Cassandra in production, and Gearman integrates well with Cassandra as its persistent storage.

Our implementation ended up being a light wrapper for Gearman. Our API is very simple:

A way to push a new task onto a named queue:

$queue->send_task({ channel => 'xzy', payload => $payload });

And a worker for consuming tasks in a given queue:

package Polyvore::Worker::XYZ;
use base qw(Polyvore::Worker);

# process is called with each task on the queue
sub process {
    my ($self, $payload) = @_;

    # do stuff

    # driver will auto-retry a few times
    die $exception if ($error);
}

# instantiate a worker and attach to channel to start consuming tasks.
Polyvore::Worker::XYZ->new({ channel => 'xyz'})->run();

The driver for workers will automatically retry the task a few times (with progressive back-off) if an exception occurs. This is very handy in the world of finicky Facebook, Twitter, etc… APIs. The worker system is integrated with our stats collection system which keeps track of jobs processed, time per task, exceptions and more.

We implemented other useful features as well; we have a worker manager that distributes the worker processes/tasks based on the workers cluster load and queue lengths. It is preferable to have a longer pending task queue than overloading the workers cluster. We also implemented a dependency protocol where a worker task can declare itself as dependent on other tasks in the system. That worker task won’t execute until all of its dependencies are complete.

We use the worker queue system both for scaling our backend processes and also for performing asynchronous front-end tasks. For example, our users post a ton of content to external services. Done synchronously, these post operations can hold up the response anywhere from 5 to 30 seconds (or fail entirely and have to be retried). Using the worker queue system we are able to perform these tasks asynchronously in the background and deliver a very responsive user experience.

Today, we have over 40 worker processes and handle over 18 million tasks per day.

Challenges

Sharding batch jobs

We started out by writing simple jobs that got all their data in one SQL statement. That worked for a while until the number of rows in the DB grew to the point that the select would make our read slaves keel over. We have since been sharding our jobs so that they can operate in smaller chunks, typically by operating over id ranges. Instead of processing all 1M rows, we break up the job into 1000 1K id ranges and treat each range as a task for the worker system.

Sharding to even out load based on data density

Some of the problems we solve using our worker queue system have interesting characteristics. These problems require us to split the data into buckets which are not necessarily equal in size in order to maximize efficiency. For example, we use machine learning to categorize items that we import into our index. We do incremental categorization for new items, but we also re-categorize older items that have been changed recently. Since the distribution of updates is biased toward newer items, we create inverse-log sized data buckets to even out the processing time for each group of items. This gives us larger buckets (~10M items) of old items (with few changes), and smaller buckets (~10K items) of new items (with more changes).

Testing

We have a great development environment which allows us to have multiple checkouts to work in. Each checkout can be previewed against development and production databases. We also have a per checkout test environment which allows us to run our test suite against a particular checkout, with its own isolated mysqld instance, Cassandra instance, etc… We also have per user worker queue in unit test.

Worker Queues vs. Map/Reduce

We use Hadoop for big data analysis. Currently we use it for a specific analysis we do on a subset of the data we have. However, we plan to expand our Hadoop deployment so that we can do more batch analysis on a larger portion of our data and in a lot of other use cases. Obviously Hadoop allows us to analyze our data in ways we couldn’t have done before. Using Hadoop also raises the question of which tasks are best suited for our worker queue system and which tasks will benefit more from map/reduce.

Our worker queue system is great for procedural tasks such as user notifications, user emails, posting on Facebook wall, or extracting meta-data from uploaded images. In addition, we use our worker queue system for scheduling Hadoop jobs. All of those tasks are asynchronous, independent and sometimes require a retry mechanism.

Summary

Worker queues are a great way to scale batch jobs, and increase the utilization of computation resources by spreading load to avoid spikes; thus it helps in designing and implementing a scalable architecture. It also lets you provide a better user experience by performing long blocking tasks asynchronously. As we expand our usage of Hadoop, we will continue to assess which tasks are better suited for our worker queue system and which ones can benefit from Hadoop’s map/reduce design pattern. Using the right tool for the job is an important principle. Designing and implementing simple, scalable tools allows us to uphold that principle.

Also See

42 comments:

  1. This is really fantastic advice, thank you so much
    Pabx Systems

    ReplyDelete
  2. After reading you site, Your site is very useful for me .I bookmarked your site! kim kardashian hermes

    ReplyDelete
  3. I really appreciate the kind of topics you post here. Thanks for sharing us a great information that is actually helpful. Good day!
    similar online programs

    ReplyDelete
  4. Tailor made School Article Producing companies are generally carefully readily available over the web presently. When a person browse through the net, you would run into a new world-wide-web site that may be selling and also advertising write my essays to be able to unwary scholars everywhere in the earth.

    ReplyDelete
  5. I recently came across your blog and have been reading along. I thought http://essay-writings-services.com/I would leave my first comment. I don’t know what to say except that I have enjoyed reading. Nice blog. I will keep visiting this blog very often.I’ll use this information for my essays.:-)

    ReplyDelete
  6. This works well for my case. This can also the solution for MN internet marketing.

    ReplyDelete
  7. Designing and applying simple, scalable resources allows us to maintain that concept.google





    ReplyDelete
  8. His dentist’s workplace is covered with before-and-after images of several kids with excessive orthodontic issues, set by his handiwork. visit my site

    ReplyDelete
  9. Tailor made School Article Producing companies are generally carefully readily available over the web presently. When a person browse through the net, you would run into a new world-wide-web site that may be selling.
    free people search

    ReplyDelete
  10. Tailor made School Article Producing companies are generally carefully readily available over the web presently.
    The Best SEO Services

    ReplyDelete
  11. It seems that many company and personal marketing emails are going on digitally these days. It is often much easier to create a quick email or deliver a brief written text than to get in touch with someone on the cellphone. criminal lawyer

    ReplyDelete
  12. Shows deal with all segments of the industry (like AHR Expo) or concentrate on particular segments. here

    ReplyDelete
  13. However, we plan to expand our Hadoop deployment so that we can do more batch analysis on a larger portion of our data and in a lot of other use cases. Obviously Hadoop allows us. Victory Martial Arts

    ReplyDelete
  14. Tailor made School Article Producing companies are generally carefully readily available over the web presently. When. shemale cams

    ReplyDelete
  15. To be present at academic classes and earn certification.Sweet sorghum is a hardy crop that can extend the ethanol production season by up to 60 days in Brazil. microsoft points

    ReplyDelete
  16. As we age this careful system changes. Moreover to monitoring moment-to-moment threats such as an beginning car or a decrease banister, our threat verifying starts to intuit a distant but progressively approaching dark thinking — the approaching end, the biggest boundary. Klik4D

    ReplyDelete
  17. The driver for workers will automatically retry the task a few times (with progressive back-off) if an exception occurs. free netflix account

    ReplyDelete
  18. What other contractor can say they were involved in the design process of an all-electric service vehicle. click to investigate

    ReplyDelete
  19. As we age this careful system changes. Moreover to monitoring moment-to-moment threats such as an beginning. bounce house rentals pensacola

    ReplyDelete
  20. Thanks a lot for being our mentor on this niche. I enjoyed your current article very much and most of all cherished how you really handled the aspect I considered to be controversial. www.traveltims.org |

    www.travelredsea.org |

    www.phototravels.org |

    www.traveldomain.org |

    www.denver-travel.org |

    www.zesttravel.org |

    www.mytraveldna.org |

    www.texastravelfreedom.org |

    www.foodietravel.org |

    www.socaltravelsurvey.org |

    ReplyDelete
  21. It is then our intention to start the transition at two per month. pinterest

    ReplyDelete
  22. As we age this careful system changes. Moreover to monitoring moment-to-moment threats such as an beginning car or a decrease banister. swissphysio.co.uk

    ReplyDelete
  23. Disappointed that it did not include a miter guide. I have even made adjustments so I can throw in a little ground flavored coffee while still using the grinder. rowlett carpet cleaning

    ReplyDelete
  24. I think many business owners have often noticed about the key advantages of going natural, but are still reluctant to make the modify. agen bola terpercaya 

    ReplyDelete
  25. Ellen Israel, Pathfinder's Mature Technological Consultant for Females Health insurance coverage Privileges, said. more learn

    ReplyDelete
  26. Method exterior the actual corporals within your collection, moreover restoration real for every single package malfunction regularly be just one. blog here

    ReplyDelete
  27. These storms were forecastMichael Kors Bags Outlet to spread, bringing downpours to Georgia, South Carolina and up Michael Kors Outlet Onlinethe East Coast into Monday. blog here

    ReplyDelete
  28. This is quite wonderful post. The article affects a lot of urgent challenges of our society. We can not be indifferent to these challenges. Your post gives the light in which we can observe our real life. Very professional. venus factor negative reviews

    ReplyDelete
  29. After conference with Brownish lots of periods, Robichaud was marketed on the advantage to the surroundings, his business and the group. Robichaud also employed Accogliente swan Energy of Louisville, Colo., to set up residential solar sections on Perfection head office to help renew the battery power for eight time in the evening after specialists generate them 120 kilometers during the day. send him a message

    ReplyDelete
  30. But Little wasn’t even near to completed securing horns with LG&E. Actually, she was getting ready to take on other coal-burning causes, as well, journeying outside of Louisville to help areas experiencing identical situations. Tattoo Shops Chicago

    ReplyDelete
  31. That’s a terrible sensation. So what do you do if you can’t secure your child? You go out there and increase some terrible. And you get individuals to pay interest. So that’s what I do. including major wine dealers

    ReplyDelete
  32. You (need to) get individuals to recognize the effect, the struggling that individuals withstand every day residing near these fossil fuel ash features, the fear that you have every day, the pressure that you experience because you can’t secure your kid,” Little says. help redefining the platform

    ReplyDelete
  33. In fact, deficiency of way of lifestyle is an everpresent reality; it way of lifestyle at the border of self and ego and describes our way of lifestyle. All Voices

    ReplyDelete
  34. The next community meeting in Bedford will likely be organised in beginning 2012, and Little is already getting ready for the fight. George Washington University

    ReplyDelete
  35. The electric motor is two to five times more efficient than a diesel engine. Higher efficiency means less energy consumption. Less energy consumption means lower costs and less pollution. for more about him

    ReplyDelete
  36. Syngenta will offer its significant agronomy sources to assess its profile of plants security items together with Ceres compounds, and Ceres will offer both seeds and research assistance. Both companies will organize outreach to ethanol generators and create market training programs. bought his first building

    ReplyDelete
  37. Often it is the surprising lack of way of life of someone near that provides this home; and then the frequency of hospital visits and memorials progressively starts to select up amount, like a drumbeat in the woodlands. Stansberry & Associates

    ReplyDelete
  38. Participants must be more motivated to promote the Show to their concentrate on audiences and offer them with benefits for arriving to their device. The display organizer’s job is to get people to the Show, not to a particular company’s device. having donated to a Super PAC

    ReplyDelete
  39. One nice aspect about being the first organization to buy the vehicles is that Robichaud is fairly much engaged in any style changes that may be created to your vehicle. www.findthebest.com

    ReplyDelete
  40. Disappointed that it did not include a miter guide. I have even made adjustments so I can throw in a little ground flavored coffee while still using the grinder. bmw repair

    ReplyDelete
  41. It is then our objective to begin the conversion at two monthly. Logistically it needs a chance to exchange out an experienced to a new van and we do not want to overcome ourselves, although the earlier we change the earlier we begin preserving. buy truth about cellulite by joey atlas

    ReplyDelete