Wednesday, November 28, 2012

Crawler System

Polyvore's product index spans millions of items. The bulk of these arrive via our awesome user community, whose members are constantly scouring the web for interesting products using our clipper bookmarklet.

Our clipper is quite smart -- it auto-detects the correct price, landing page, etc. We also use a background task to scrape the Facebook Open Graph meta information to glean the correct description and title for each product. However, this information is essentially a snapshot taken at the time of clipping. We don't get notified when a product's price or availability changes afterwards. Since Polyvore is a social commerce platform, we felt it was important to have up-to-date price and availability information for the products in our index.
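To make the Open Graph step a bit more concrete, here is a minimal sketch of pulling the title and description out of a product page's meta tags. This is an illustration only, not our production code: the names OpenGraphParser, extract_open_graph and SAMPLE_PAGE are made up for this example, and it uses only Python's standard html.parser module on an inline sample page rather than fetching a real URL.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Hypothetical helper: collects <meta property="og:..."> values."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Strip the "og:" prefix and keep the first value seen per property.
            self.properties.setdefault(prop[3:], content)


def extract_open_graph(html):
    """Return a dict of Open Graph properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Made-up sample page standing in for a retailer's product page.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="Leather Tote Bag" />
<meta property="og:description" content="A roomy tote in soft leather." />
<meta name="viewport" content="width=device-width" />
</head><body></body></html>
"""

product = extract_open_graph(SAMPLE_PAGE)
print(product["title"])
print(product["description"])
```

Note that a result like this is only as fresh as the moment the page was read, which is exactly the snapshot problem described above.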

To augment our product index, we started by integrating data feeds directly from retailers that offered them. But we soon found that these feeds were constantly breaking, out of date, and missing useful metadata. So we decided to write our own crawlers to regularly crawl retail sites and extract accurate, up-to-date product catalogue data.