We've got issues. Oh boy, do we have issues. To keep Polyvore's databases filled with up-to-date information about the best and latest products you want to see and buy, we crawl hundreds of partner websites (with their permission, of course) and load feed files from dozens more. Every time one of those retailers changes their website design, our crawler configuration has to change. Every time one of those sites adds a "New Summer 2014 Collection" section, our crawler needs to know about it. And every time one of those stores has a website problem, our crawler has a problem. With hundreds of websites to crawl, we run into 5-10 errors a day, plus a bunch of warnings. Somebody should really fix all of those things...
Over the past couple of years, Polyvore's approach to issue tracking has evolved from countless hours of human labor into an indispensable tool that multiplies our attention tenfold. I'd like to walk you through our process, in the hope that you'll see something familiar, and maybe even something enlightening.
Triage, human style
At first, our crawlers numbered in the dozens, so we could just watch the output logs every day. After a while, this got to be too much, so we created scripts to read those log files for us and send email when they showed errors. This actually worked for quite a long time.
But after another growth spurt, we realized we were dropping error reports on the floor and not fixing them all. We got tired of having our inboxes fill up with error notices every morning. So we made our scripts smarter. That helped with the flood, but we would still occasionally forget about something that really needed immediate attention.
Triage, bug-tracking style
So we found a bug tracking system and changed our scripts to send error reports to it. That system (which shall remain nameless) let us keep a list of open bugs, but was decidedly simple when it came to workflow. Each bug had a status and an assignee, but that was about it. This let us search for open bugs for each member of the team and work on them. It was a huge leap above email in that regard. Now we could tell who was working on what and which crawlers had errors. We could count the number of bugs and make sure that number decreased over time. And we could count the number of closed bugs every month. Yay, metrics! (We love metrics.)
We felt this was a golden age, for a while, but the system had one major flaw: when you had to ask someone else to work on a bug, you had to assign it to them and change the status at the same time. As you can imagine, when things got busy, someone would change one field but forget the other. Bugs got stuck in indeterminate states. Things got dropped. Our team got bigger, which helped, but we all realized that the tool was not helping us as much as we wanted. So we took on a quest to find a better issue tracking tool.
We found JIRA. It had been around for a long time, but we'd always thought, "That's too big for our needs. It's too expensive. We don't want that much administrative overhead." When we looked at it again, we realized that the reason our existing system was floundering was that it didn't do enough for us -- it needed more structure, better reporting, and better automation.
So we took on two major projects to overhaul our crawler error tracking system:
- Define workflows for all our major types of errors, so that we never have to stop and think about what to do with them next.
- Build an automation system that did as much as possible for us without human intervention.
Triage, workflow style
Let's say a partner's website moves their "price" field to a different section of the page, but doesn't tell us about it. Suddenly the crawl output has no prices in it. Fixing that takes a series of steps:

- Someone investigates the error and modifies the crawler configuration for that site to pick up the price from its new location.
- The change is committed and pushed out to the crawler machines.
- We wait until the site is crawled again, to make sure that it picked up the price correctly this time.
- We make sure that the new prices are loaded into the database properly.
- Finally, someone reviews the code and data and pronounces them "Good."
This process can easily take a week of real time. During that time you may have a couple of different people looking at it, plus waiting time. You don't want anyone to forget about it and you want to make sure that every step is followed. So we created a series of workflows in JIRA that look like flowcharts:
Each transition between states in the diagram can have custom functions added to it. So, for example, when we start a new Issue, we automatically assign it to the person on triage duty. When the changes to the crawler are ready for review, the workflow automatically assigns the task to a reviewer. And when we're ready to verify that everything is fixed properly, we automatically assign the Issue to one of our Data Editors, who know what "right" looks like. As the Issue moves through the workflow, assignees are notified via email and on their JIRA Dashboard that they have new Issues to work on. And if someone later in the chain realizes that someone earlier missed something, a single button sends the Issue back to a previous step, with a comment explaining what's wrong.
Using JIRA's existing web-based tools and reports, keeping track of Issues is easy. We have saved searches to tell us when something is being ignored, what's "stuck" in a step, and who is overloaded. And we have nifty graphs to show just how awesome we are at keeping up with all the Issues!
Triage, automation style
What's better than being able to track all the issues? Having the computer automatically handle them for you! JIRA has a very thorough REST API that lets you do everything you can do through their GUI in a script. We realized that we could use this to our advantage and reduce the amount of human time our issues take up.
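Our production scripts are Perl, but the REST calls themselves are simple HTTP. Here's a minimal sketch in Python of creating an Issue through JIRA's REST API (the URL, project key, and credentials are placeholders, not our real configuration):

```python
import json
import urllib.request

# Hypothetical endpoint -- substitute your own JIRA server's URL.
JIRA_ISSUE_URL = "https://jira.example.com/rest/api/2/issue"


def make_issue_payload(project_key, summary, description, issue_type="Bug"):
    """Build the JSON body that JIRA's REST API expects for issue creation."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "description": description,
            "issuetype": {"name": issue_type},
        }
    }


def create_issue(payload, auth_header):
    """POST the payload to JIRA and return the new Issue's key (e.g. 'CRAWL-123')."""
    req = urllib.request.Request(
        JIRA_ISSUE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["key"]
```

Everything the GUI can do -- creating, commenting, transitioning, closing -- maps onto calls like this one.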
First, we added code to our crawler and loading scripts to automatically create issues every time we could recognize problems with a site. For example, when you get several 500 HTTP responses in a row, you know that the site has stopped responding. In such a case, we can have the system create its own Issue, without waiting for a human to look at a report the next day.
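The detection logic amounts to watching for a run of server errors. A sketch of that check, again in Python with an illustrative threshold rather than our production value:

```python
def site_is_down(status_codes, threshold=5):
    """Return True if the crawl saw `threshold` consecutive 5xx responses,
    which we treat as the site having stopped responding."""
    run = 0
    for code in status_codes:
        run = run + 1 if 500 <= code < 600 else 0
        if run >= threshold:
            return True
    return False
```

When this returns True, the crawler files an Issue immediately instead of waiting for the morning report.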
Then we added code to our processing script that examines the crawled data to check for things like "a lot fewer pages crawled today than yesterday". This script compares the new crawled data with what we already have in our database and can discover problems that you can't see when just looking at the site alone. Again, when problems are detected, a JIRA Issue is created automatically.
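The heart of that comparison is a simple ratio test against the previous crawl. A sketch (the 50% cutoff here is illustrative; our real thresholds are tuned per check):

```python
def crawl_dropped_sharply(pages_today, pages_yesterday, min_ratio=0.5):
    """Flag a crawl whose page count fell below `min_ratio` of the previous run's."""
    if pages_yesterday == 0:
        return False  # nothing meaningful to compare against
    return pages_today < pages_yesterday * min_ratio
```

A drop like this is invisible if you only look at today's crawl output; it only shows up against the history in the database.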
Finally, to keep from being completely overwhelmed, we wrote a "close tickets" script that checks each auto-created Issue against our current database and closes it if the problem has been solved. There are only a few cases where the system can tell that the problem is fixed, but they happen often enough that it's a big help. For example, if we have an empty crawl one day, we'll create a JIRA Issue for it. If the crawl works fine the next day, we can close that Issue because we know it was a temporary problem. No human involved!
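For the empty-crawl case, the close-tickets logic boils down to: if the site produced pages again on the latest crawl, the old Issue can be closed. A sketch, with hypothetical field names:

```python
def issues_to_close(open_issues, latest_page_counts):
    """Given auto-created 'empty crawl' Issues and the latest per-site page
    counts, return the keys of Issues whose problem has resolved itself."""
    closable = []
    for issue in open_issues:
        if latest_page_counts.get(issue["site"], 0) > 0:
            closable.append(issue["key"])
    return closable
```

Each key returned gets closed through the same REST API, with a comment noting that the problem cleared up on its own.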
Our web crawling system, like many of its kind, is written in Perl. Conveniently, there are even a few JIRA client libraries on CPAN to make scripting the system easier. Unfortunately, all of the ones that existed when we started our project were either outdated or missing features we needed, like attaching files to Issues, so we wrote our own.
We've also released it back to the community so that everyone can automate their Issue tracking like Polyvore does. JIRA::Client::Automated is available for everyone now. It's explicitly designed to be used in automated scripts, so that JIRA Issues can be created, transitioned through their workflow, and closed without any human intervention (sometimes). We've found it extremely useful since we rolled it out a year ago, and we hope that others will find it useful too.
In the beginning, triaging errors was simple, but a very manual process. Once we started automating reporting, things got better, but it wasn't until we went full-bore with JIRA that we felt like our tools were really pulling their weight.
These two projects, proper workflow design and full automation, have led us to our current utopia: we have one person triaging all the crawler errors each day and he can do other work too. There's always something that requires a smart and good-looking person to investigate, but we've taken a lot of the drudgery out of the process.
I hope seeing our evolution in this area is helpful to others. Maybe your team will be the one who skips straight to the end and never thinks of batch system errors as daily drudgery. Wouldn't that be something?