Wednesday, April 2, 2014

6 Simple Steps to Mobile-Friendly Email

Why Mobile Emails Matter

While our mobile website is becoming a larger and larger portion of our traffic, we still see more users on our desktop website. So when we pulled data around our emails (and we love data), we were surprised to learn that most of the emails we send are actually opened on mobile devices.

Just as our mobile site is designed and built to be a great experience on the phone, our emails needed the same special thought and treatment. It sucks to open an email on your phone with tiny text, or images that stretch off the page, or links that are too tiny to click on a touchscreen.


We set out to ensure our emails, whether opened on a desktop client or a mobile phone, were as delightful as the rest of our site. Along the way, we found many guides and blog posts on the Internet, but nothing that laid out the whole process. We also ran into several pitfalls during development that we had to work around, so we've gathered them here in hopes of saving you the pain of having to discover them yourself!
A note about media queries
There are already plenty of posts on the web about using media queries to build responsive emails. The basics are that you can and should use media queries to target your CSS to certain device attributes, like screen width and retina displays. For this post, our focus will be on information that was hard to find or that we didn't come across at all. But if you're not at all familiar with media queries or mobile-friendly email design, reading these articles first will make this blog post much more useful to you.
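
To give a sense of what that targeting looks like, here is a minimal, illustrative pair of queries (the breakpoint, class names, and image path are placeholders, not our production values):

    /* Narrow screens: let the layout collapse to a single column */
    @media only screen and (max-width: 480px) {
      .container { width: 100%; }
      .headline  { font-size: 22px; }
    }

    /* Retina displays: swap in a 2x logo */
    @media only screen and (-webkit-min-device-pixel-ratio: 2) {
      .logo {
        background-image: url(logo@2x.png);
        background-size: 120px 40px;
      }
    }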
Step 1: Desktop or mobile?
While media queries make mobile-optimizing emails a whole lot easier, keep in mind that many desktop and webmail clients do not support them. They will simply be ignored, so you need to choose your defaults wisely. We decided our default experience would be designed for desktop, since the vast majority of our mobile users are on iOS, which supports media queries and CSS3.
Step 2: Split your CSS into two files
Chances are good that you will be supporting Gmail (you can skip this step if you're not), so you will need to split your CSS into two files: inline.css and external.css. Gmail strips all <style> nodes from your document, so you will need to use an inlining tool to move all of your CSS styles into the style attribute of each node. Using an inlining tool simplifies your workflow, because you can develop the email as though all clients supported style nodes and have those styles transparently converted to inline style attributes. However, be aware that, like any inline style, they cannot use pseudo-class or pseudo-element selectors like :hover, :after, etc.

For other clients, you can include your external.css in a <style> node in the <head> of your document. However, because your inline.css styles were moved to inline styles, you'll have to add an !important to every rule in external.css to overcome the specificity of the inline styles.
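
As a concrete (and simplified) illustration of the split, with made-up selectors and sizes:

    /* inline.css: the inliner turns this into style="..." on each matching node */
    .headline { font-size: 28px; color: #333333; }

    /* external.css: stays in a <style> node; !important lets it beat the inlined styles */
    @media only screen and (max-width: 480px) {
      .headline { font-size: 22px !important; }
    }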
Step 3: Reset your CSS
You should use a standard reset CSS in your inline.css, but most reset rules do not work exactly the same when inlined. Styles like font-size when inlined do not follow the same inheritance chain, so move any default-inherited styles like font-size to a more specific selector like #body.
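
For example, a reset that sets inherited properties on a broad selector behaves differently once inlined, so we keep those defaults on the wrapper instead (selectors and values here are illustrative):

    /* inline.css */
    table, td, p, a, img { margin: 0; padding: 0; border: 0; }   /* safe to inline on every node */
    #body { font-size: 14px; line-height: 20px; }                /* inherited defaults live on the wrapper */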
Step 4: Wrap your email content in tables
For consistency, you should wrap all of your email's content with a table. Rather than displaying your content in an iframe, many clients will add a prefix to all of your CSS selectors and insert the HTML of your document's body into their page's content. This means that you cannot set any styles on the body tag. We wrapped our document with a table#body and specified a background-color to have our content visually different from the client's interface. This table is also where you can set default font-size / line-height and have it inherit properly across all clients.
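
A minimal sketch of that wrapper (colors and sizes are placeholders):

    <body>
      <table id="body" width="100%" cellpadding="0" cellspacing="0" border="0"
             style="background-color: #f4f4f4; font-size: 14px; line-height: 20px;">
        <tr>
          <td align="center">
            <!-- all email content goes in here; never style the <body> tag itself -->
          </td>
        </tr>
      </table>
    </body>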
Step 5: Use tables for layout
Most of the styles used to lay out web experiences (float, position, display: table/inline-block, etc.) either do not work or are not supported in all clients. Unfortunately, even in the age of HTML5, you'll still have to use tables for consistent layout in emails. To smooth over table-rendering differences between clients, always set the following attributes on table nodes (see the sketch after the list):

  • table: cellspacing, cellpadding, border
  • table: table-layout: fixed (in CSS + inlined in style)
  • table and td: align
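
Put together, a basic two-column layout table starts out looking roughly like this (the widths are placeholders, and table-layout: fixed would also live in your stylesheet, not just the style attribute):

    <table width="600" cellspacing="0" cellpadding="0" border="0" align="center"
           style="table-layout: fixed;">
      <tr>
        <td width="300" align="left">Left column</td>
        <td width="300" align="right">Right column</td>
      </tr>
    </table>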

Step 6: Test, test and test some more
Testing across multiple devices and clients can quickly get tiresome, especially since we love to do quick iterations of design and development. Fortunately, services like Email on Acid allow you to send one email and see it rendered across many clients. It was invaluable, especially for clients we don't commonly use around the office.

We also built our own test environment so we could have one place to see all of our emails and their variations. With one click, you can see an approximation of what the email will look like on a desktop, phone, or tablet client.


Results

And that's it! After you get one mobile-friendly email under your belt, the rest become much easier. After we switched to our new mobile emails, we saw increases in click-through rates and now we have a full arsenal of tools to make beautiful emails like this one:


Tips & Tricks

Some final do's and don'ts that can make a huge difference in the final rendering:
Specify a doctype
If you don't, it can trigger strange quirks modes in many clients. For the most part, the actual doctype you provide will be ignored--it just needs to be specified.
Don't comment your CSS
We are huge fans of well-commented code (including CSS), but we found that including CSS comments could trigger spam filters to treat our emails as suspicious.
Set width on <img> tags, but not height
Gmail does not allow you to change the height in CSS, but if only the width is set, the height will be calculated from the aspect ratio of the original image and the image will be sized correctly.
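
In other words, a product image ends up marked up like this (the dimensions are made up):

    <img src="product.jpg" width="300" alt="Product photo">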
Avoid these CSS attributes
  • margin: many clients do not support margin of any kind, and especially not negative margins
  • display: many clients do not support any display type other than the default, so use a naturally block-level (<p>) or inline (<span>) element where necessary
  • float: many clients do not support float as you would expect; use tables to do your layout
  • box-sizing: because you're using tables for layout anyway, this isn't as useful as you'd expect, and you probably won't miss it
  • background-image: this does work in many clients, but don't require it for the email to be usable
  • position: not supported at all in most clients; they will simply throw it away or ignore it
  • list-style-type: ignored by some clients, so only use it if falling back to the element's default is okay



Tuesday, February 25, 2014

Squash Bugs While You Sleep: Automating JIRA

We've got issues. Oh boy, do we have issues. To keep Polyvore's databases filled with up-to-date information about the best and latest products you want to see and buy, we crawl hundreds of partner websites (with their permission, of course) and load feed files from dozens more. Every time one of those retailers changes their website design, our crawler configuration has to change. Every time one of those sites adds a "New Summer 2014 Collection" section, our crawler needs to know about it. And every time one of those stores has a website problem, our crawler has a problem. With hundreds of websites to crawl, we run into 5-10 errors a day, plus a bunch of warnings. Somebody should really fix all of those things...

Over the past couple of years, Polyvore's approach to issue tracking has evolved from countless hours of human labor into an indispensable tool that multiplies our attention tenfold. I'd like to walk you through our process, in the hopes that you'll see something that looks familiar, and maybe even something enlightening.

Triage, human style

At first, our crawlers numbered in the dozens, so we could just watch the output logs every day. After a while, this got to be too much, so we created scripts to read those log files for us and send email when they showed errors. This actually worked for quite a long time.

But after another growth spurt, we realized we were dropping error reports on the floor and not fixing them all. We got tired of having our inboxes fill up with error notices every morning. So we made our scripts smarter. That helped with the flood, but we would still occasionally forget about something that really needed immediate attention.

Triage, bug-tracking style

So we found a bug tracking system and changed our scripts to send error reports to it. That system (which shall remain nameless) let us keep a list of open bugs, but was decidedly simple when it came to workflow. Each bug had a status and an assignee, but that was about it. This let us search for open bugs for each member of the team and work on them. It was a huge leap above email in that regard. Now we could tell who was working on what and which crawlers had errors. We could count the number of bugs and make sure that number decreased over time. And we could count the number of closed bugs every month. Yay, metrics! (We love metrics.)

We felt this was a golden age, for a while, but the system had one major flaw: when you had to ask someone else to work on a bug, you had to assign it to them and change the status at the same time. As you can imagine, when things got busy, someone would change one field but forget the other. Bugs got stuck in indeterminate states. Things got dropped. Our team got bigger, which helped, but we all realized that the tool was not helping us as much as we wanted. So we took on a quest to find a better issue tracking tool.

Enter JIRA

We found JIRA. It had been around for a long time, but we'd always thought, "That's too big for our needs. It's too expensive. We don't want that much administrative overhead." When we looked at it again, we realized that the reason our existing system was floundering was that it didn't do enough for us -- it needed more structure, better reporting, and better automation.

So we took on two major projects to overhaul our crawler error tracking system:

  • Define workflows for all our major types of errors that would let us not even have to think about what to do on them next.
  • Build an automation system that did as much as possible for us without human intervention.

Triage, workflow style

Let's say a partner's website moves their "price" field to a different section of the page, but doesn't tell us about it. Suddenly the crawl output has no prices in it. You need to have someone investigate the error and then modify the crawler configuration for that site to pick up the price from the new location. You have to commit the change and wait for it to be pushed out to the crawler machines. You have to wait until the site is crawled again to make sure that it picked up the price correctly this time. Finally, you have to make sure that the new prices are loaded into the database properly. At the end, you want someone to review the code and data and pronounce them "Good."

This process can easily take a week of real time. During that time you may have a couple of different people looking at it, plus waiting time. You don't want anyone to forget about it and you want to make sure that every step is followed. So we created a series of workflows in JIRA that look like flowcharts:

Each transition between states in the diagram can have custom functions added to it. So, for example, when we start a new Issue we automatically assign it to the person on triage duty. When the changes to the crawler are ready for review, the workflow automatically assigns the task to a reviewer. And when we're ready to verify that everything is fixed properly, we automatically assign the Issue to one of our Data Editors, who know what "right" looks like. As the Issue moves through the workflow, assignees are notified via email and on their JIRA Dashboard that they have new Issues to work on. And if someone later in the chain realizes that someone earlier missed something, a single button sends the Issue back to a previous step with a comment explaining what's wrong.

Using JIRA's existing web-based tools and reports, keeping track of Issues is easy. We have saved searches to tell us when something is being ignored, what's "stuck" in a step, and who is overloaded. And we have nifty graphs to show just how awesome we are at keeping up with all the Issues!

Triage, automation style

What's better than being able to track all the issues? Having the computer automatically handle them for you! JIRA has a very thorough REST API that lets you do everything you can do through their GUI in a script. We realized that we could use this to our advantage and reduce the amount of human time our issues take up.

First, we added code to our crawler and loading scripts to automatically create issues every time we could recognize problems with a site. For example, when you get several 500 HTTP responses in a row, you know that the site has stopped responding. In such a case, we can have the system create its own Issue, without waiting for a human to look at a report the next day.

Then we added code to our processing script that examines the crawled data to check for things like "a lot fewer pages crawled today than yesterday". This script compares the new crawled data with what we already have in our database and can discover problems that you can't see when just looking at the site alone. Again, when problems are detected, a JIRA Issue is created automatically.

Finally, to keep from being completely overwhelmed, we wrote a "close tickets" script that checks each auto-created Issue against our current database and closes it if the problem has been solved. There are only a few cases where the system can tell that the problem is fixed, but they happen often enough that it's a big help. For example, if we have an empty crawl one day, we'll create a JIRA Issue for it. If the crawl works fine the next day, we can close that Issue because we know it was a temporary problem. No human involved!

JIRA, Perl-style

Our web crawling system, like many of its kind, is written in Perl. Conveniently, there are even a few JIRA client libraries on CPAN to make scripting the system easier. Unfortunately, all of the ones that existed when we started our project were either outdated or missing features we needed, like attaching files to Issues, so we wrote our own.

We've also released this back to the community so that everyone can automate their Issue tracking like Polyvore does. JIRA::Client::Automated is available for everyone now. It's explicitly designed to be used in automated scripts so that JIRA Issues can be created, transitioned through their workflow, and closed without any human intervention (sometimes). We've found this tool extremely useful since we rolled it out a year ago, and we hope that others will find it useful too.
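
Here's roughly what those automation scripts boil down to. This is a simplified sketch rather than our production code: the project key, issue fields, and wording are invented for illustration, and you should check the module's documentation for the exact method signatures.

    use strict;
    use warnings;
    use JIRA::Client::Automated;

    my $jira = JIRA::Client::Automated->new('https://jira.example.com/', 'robot', 'secret');

    # A crawler that keeps getting 500s files an Issue straight into the triage queue.
    my $issue = $jira->create_issue(
        'CRAWL', 'Bug',
        'partner-site.example.com returning HTTP 500',
        "Got 5 consecutive 500 responses during tonight's crawl.",
    );

    # The nightly "close tickets" pass resolves Issues whose problem has gone away.
    $jira->close_issue($issue->{key}, 'Fixed', 'Crawl succeeded on the next run.');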

Conclusion

In the beginning, triaging errors was easy, but a very manual process. Once we started automating reporting, things got better, but it wasn't until we went full-bore with JIRA that we felt our tools were really pulling their weight.

These two projects, proper workflow design and full automation, have led us to our current utopia: we have one person triaging all the crawler errors each day and he can do other work too. There's always something that requires a smart and good-looking person to investigate, but we've taken a lot of the drudgery out of the process.

I hope seeing our evolution in this area is helpful to others. Maybe your team will be the one who skips straight to the end and never thinks of batch system errors as daily drudgery. Wouldn't that be something?

Wednesday, February 5, 2014

Instant MySQL Restores (with just this script and a ton of disk)

I don’t think that I need to sell you on the concept of backing up your data. But backup strategies and coverage vary widely from organization to organization. This blog post will detail Polyvore's approach to MySQL Backups.

We see 3 basic uses for database backups.

  • Recover from hardware failure
  • Recover from human or application error
  • Create a new replica or slave

Each of these use cases provides a different set of challenges, so we want a comprehensive policy that covers each use case.

Recover from hardware failure

So a mouse crawled into your server and got stuck in the main cooling fan in the middle of the night, frying the server 2 minutes later. I’ll ignore the obvious question of what you were doing running your datacenter in a barn, because we want to get your site back up and running.

Like everybody else, we use standard MySQL replication to protect ourselves from physical hardware failures. If it was a slave that died, no problem. We just take it out of rotation and provision a new one at our convenience. If it was a master that died, there are slaves waiting in the wings to take over.

For the unlikely case where our entire datacenter goes away, we also maintain a set of replicating database instances on Amazon EC2. We have a powerful EC2 instance with SSDs, plus several EBS volumes with generous provisioned IOPS. It is capable of running 3 simultaneous MySQL instances, each replicating a different database from our datacenter. So, if we have to do a fully remote restore, we always have up-to-date replicas ready to use.

So, in the case of a hardware failure on any of our database servers, standard replication ensures that we have other up-to-date copies, both within and outside of the datacenter.

Recover from human or application error

Have you ever run a delete statement and forgotten the where clause? How about a bug in your application that silently corrupted data in a way that took you a few days to notice? By the time the sinking feeling reaches all the way to the bottom of your stomach, all of your slaves have happily replicated your error...and now you have N copies of borked data. You are likely in a panic to get the data back as soon as possible, and you definitely don't want to wait hours to download the backup off S3 and prepare it for use. You are probably going to selectively retrieve the data that was lost and manually restore it back into the main database.

One approach would be to use pt-slave-delay to keep a slave an hour behind. However, it might take you several hours, or several days, to notice and track down the bug, overrunning the delay buffer.

We keep five days' worth of backups in a ready-to-use format in our datacenter. The backup directories that we create can be treated as self-contained instances. They contain their own data dirs, socket files, temp dirs and management scripts, and they co-exist nicely with other mysql instances running on the same server. So when we need access to old data, we can log into the backups server, cd into the backup directory from 3 days ago, and run a start script to fire up a local instance with that snapshot of data. All within 2 minutes.

Occasionally we get requests from the engineers to see what the database looked like 2 weeks ago, or 6 months ago. So we keep periodic snapshots going all the way back to when we started doing backups. Unfortunately, we do not have infinite disk space on our backups server, so we built our backup tool to automatically tar/gzip/encrypt the backed-up database and upload it to S3. We also do not have infinite bandwidth to Amazon EC2/S3 to be blasting full-sized DB backups nightly. But as I just mentioned, we do have replicating instances on EC2. So, on a less frequent schedule, we run a second backup against our EC2 replicas and push the encrypted copy into S3 for long-term storage.

Create a new replica or slave

Since polyvore.com keeps growing, we keep growing our database cluster by adding servers to the different slave pools and by replacing aging servers with fancy new high-powered ones. When the time comes to deploy a new database instance, you have a choice: restore from a previous backup, or clone an existing server. Since our backup tool (based on innobackupex from Percona, www.percona.com) gives us a ready-to-use instance directory from a running server, there is not much difference between these approaches. We can either copy a ready-to-go directory from the backups server or run the tool against an existing slave. Restoring from an existing snapshot is faster, but could leave the new server a day behind in replication. Cloning an existing server takes longer, but the resulting server will be more up-to-date when it comes online.

Our Tool

Written in Perl, mysql-backup-manager coordinates the various steps in backing up a MySQL database. It operates in several phases:

You can see the latest revision of the tool on github at: https://github.com/polyvore/mysql-backup-manager or download it with git clone git@github.com:polyvore/mysql-backup-manager.git

Copy Phase

Use the innobackupex tool from Percona to take a snapshot of a running database. You run the tool on the destination server. It supports several transport options: local (if you don't need to copy over the network), netcat (for fast copies over your trusted LAN) and ssh (for encryption, security, and to more easily get through firewalls).

Apply Logs Phase

When you use the innobackupex tool, you must prepare the backup before you can use it. Our tool automatically does this for you at backup time, in order to simplify and speed up a restore process.
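
If you were doing this step by hand, it boils down to a single innobackupex invocation (the path is illustrative):

    innobackupex --apply-log /srv/db-backups/databasename/2014-1-18-13-10/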

Deploy Phase

At Polyvore, we have a standard filesystem layout for MySQL that is slightly different from the Debian layout. Data files, log files, pid files, and sockets all reside within a defined directory structure. This structure allows us to easily run multiple mysqld instances on the same server and to have self-contained restore directories. The deploy phase moves files into the locations where we expect them to be. It also applies a template to create a local my.cnf file and start and stop scripts for running mysqld isolated to this directory.

Tar Phase

If you are archiving the backup somewhere, create a tarball that you can upload. You have the option to encrypt the backup using OpenSSL DES3, and to split the tarball into 1G chunks for upload.
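
Done by hand, this phase is essentially the mirror image of the restore pipeline shown later (the directory, filename prefix, and password are placeholders):

    tar czf - deploy/ \
      | openssl des3 -k CHANGEME \
      | split -b 1G - databasename.tgz.encrypted-split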

Upload Phase

Upload the file to S3. We use the s3cmd tool to actually do the upload. The tool relies on s3cmd being properly configured with your Amazon keys.
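
For a split backup, that upload amounts to something like this (bucket and paths are placeholders):

    s3cmd put databasename.tgz.encrypted-split* s3://mybucket/path/databasename/2014-1-18-13-10/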

Error Detection and Reporting

Our backup tool logs to syslog at various stages of the backup process, as well as upon any error. We ingest these logs with Splunk (www.splunk.com) and use Splunk to alert us if we encounter any errors or if an expected backup fails to run.

Example Usage

To see full usage details

    ./mysql-backup-manager.pl --help

The following example will create a backup from a remote host, encrypt it, upload it to S3 and keep a copy for instant restores

    ./mysql-backup-manager.pl --dbtype=databasename --dir=/srv/db-backups/databasename 
       --tardir=/srv/db-tars/databasename --timestamp --tarsplit --upload --encrypt 
       --password=CHANGEME --s3path=s3://mybucket/path --deletetar --transport=ssh 
       --host=hostname.domain.com --verbose

Once the backup is completed, if someone accidentally changes the password of every user on our website to "kitten" and I need a quick query of yesterday's user table to use in restoring the data, I can do it with the following commands (a purely hypothetical situation; I assure you this exact scenario has never happened at a company I used to work for):

    cd /srv/db-backups/databasename/2014-1-18-13-10/deploy
    ./start.sh
    ./connect.sh
    USE myschemaname
    SELECT username, encryptedpassword FROM users;

If our entire datacenter has slid into the Pacific Ocean with the rest of California after the next big earthquake, I can do a restore from S3 with the following commands (the important commands are also spelled out in the "status" file created during the backup):

    s3cmd get --recursive s3://mybucket/path/databasename/2014-1-18-13-10/
    cd 2014-1-18-13-10/tar
    cat databasename.tgz.encrypted-split* | openssl des3 -d -k CHANGEME | tar zxivf -

If we used the template dir to generate start and stop scripts, those files would need to be edited to reflect the new path. Then we can run start.sh to launch the instance, or move the directories and files into their normal locations.

Possible Improvements

  • We should write a script that cleans up our S3 backups and implements a retention policy, such as: “Keep daily backups for the last 2 weeks in S3. Keep weekly backups for the last 2 months in Glacier. Keep monthly backups forever in Glacier.”
  • Automate the retrieval and preparation of backups from S3
  • Make paths in templated configs relative - or add a command to adjust them to a new location

    Friday, September 20, 2013

    Launching Home: The Good, The Bad, and The Ongoing

    This week we unveiled Polyvore for your Home, a major milestone in our 6-year history! Even though Polyvore is the largest fashion community on the web, our original prototype started in interior design, so we’ve always built our technology as a platform, designed to scale beyond fashion. That said, some parts of launching home were easy, some were hard, and there’s still plenty more work to do!

    The Easy Parts

    Because we had planned ahead (really far ahead, in some cases), some things were easy to build on top of our existing infrastructure.

    Getting products in

    We needed to bootstrap the home category with enough great products for our users to play with. Our data pipeline starts with a large network of crawlers running on EC2 that pull products from popular retailers like Nordstrom, West Elm and Fab. This is when we retrieve brand, price, availability, and more. The same technology we use on fashion sites works just as well on home sites, so it was just a matter of creating new crawler scripts to pull products from home retailer sites and feed them into our existing infrastructure.

    Ranking home items

    Every object in Polyvore is ranked. However, our scoring algorithm depends heavily on engagement data from our user community of tastemakers. In order to generate enough data to tell us what’s popular, we ran contests. That data allows us to generate our daily Top Home Sets and Top Home Products collections. As the home category grows, our ranking will improve.

    Delightful details

    We pride ourselves on delivering a delightful user experience (it’s one of our core values), which means spending time on getting even small details right. Luckily, for the Home launch, features like removing the background from an image and sending sale notifications were already built to handle multiple verticals.

    The Hard Parts

    Of course, like most things worth building, not everything was easy. We started laying groundwork in 2011 because we knew some pieces of the launch would require a lot of time.
    Classifying difficulties

    Our existing framework for classifying products uses a training set of items for each category, e.g. dresses, pants, furniture, rugs. Sounds like this would work for home products too, but we soon discovered the results for home categorization were not up to our high standards. Why? Because the home taxonomy is so much larger and more diverse than fashion’s. A shirt is pretty easy to categorize by keyword because most will have the word “shirt” or one of a few synonyms (“top”, “blouse”, “tee”) somewhere in the product title or description. But home items range from collectible frog figures to hardwood flooring to dog-shaped pillows to sleigh beds. Lots of wacky stuff! We ended up using much larger training sets than we do for fashion to reduce the noise and categorize more accurately.

    New ambiguities in search

    When we brought fashion and home into one experience, search queries that only had one meaning before became ambiguous. “Glasses” used to always mean eyewear, but now that query could be referring to drinking glasses.

    Home products can also be made up of separate buyable parts. The phrase “brass knob” doesn’t tell us if the user is searching for the knobs themselves or for furniture pieces with brass knob details, like cabinets or dressers.

    When a user has an ambiguous query, selecting the correct category to pull results from becomes more difficult. We ended up returning products depending on which category had the best results, but this is something we will continue to tune as we get more data.

    The Ongoing

    Tuesday’s announcement was merely the launch of Polyvore Home v1.0, and great products should continuously evolve and get better. This means our work is hardly done!

    New ways to browse and filter

    Discovering new products by category worked great for fashion, but home shoppers expect to be able to browse by room, so we’re working on associating products with the rooms they belong in. There is also an additional complication of home items having a range of prices--the same sheet set in California King size is going to cost more than the Twin. This wasn’t as big of a problem in fashion since most items have one price and when there is a range (regular vs. petite sizes, for example), the difference is smaller. Being able to show our users the correct price ranges will be a better shopping experience, so we’re extending our platform to support it.

    Growing a community

    Our fashion-focused users have grown to 20 million strong, but the home community still has a ways to go. We’re leveraging our experience growing our fashion user base to grow our home community as well as trying out some new experimental methods to (hopefully) accelerate our expansion. But that’s a whole ‘nother blog post, so stay tuned!

    Tuesday, May 21, 2013

    Under the Hood: How we Mask our Images

    Polyvore users create over 3 million sets every month, mixing and matching their favorite products to express their style. They clip in images from all over the web, like this teal T-shirt from nordstrom.com:


    As you can see, the shirt image has a light gray background that creates an eyesore when layered with other items in a set:


    How do we strip this background away in order to achieve a cleaner look? More importantly, how can we do it for not just this T-shirt, but for any of the 2.2 million images our users clip in every month?

    Magic!

    Ok, not quite. But we do use ImageMagick software as well as some nifty tricks to do the transformation. Let’s walk through the basic steps.

    First, since the background is usually a neutral color, we use ImageMagick's Modulate to boost the color saturation, which highlights the difference between the product and the background. In this shirt, we see the teal turn bright blue:


    Next, we replace the background with white, starting at the pixel at position (2,2). We start there since it's almost certainly part of the background and not the product. ImageMagick has a Draw command that replaces pixels that are the same color as the start pixel with a new color, so we use Draw to replace background pixels with white ones. We also pass a fuzz factor so that Draw also replaces pixels that are merely similar in color to the starting pixel, to account for subtle gradients and variations in the background.


    Negate flips all the colors in the image, which renders our white background black.


    Threshold changes lighter-colored pixels to white and darker-colored pixels to black, giving us a black and white mask:


    Laying this mask on top of our original image using ImageMagick's Composite with CopyOpacity will keep the parts of the original image that are white in the mask and discard the parts that are black. In essence, the mask acts as a stencil to "cut out" the shirt from the original image:


    To demonstrate the background removal effect, here is that same final image laid over a dark gray background:


    Neat, right?
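
    If you want to experiment with this yourself, the whole original pipeline can be approximated with stock ImageMagick commands along these lines (filenames, the saturation boost, and the fuzz/threshold percentages are illustrative, not our exact parameters):

        # Boost saturation so the product separates from the neutral background
        convert shirt.jpg -modulate 100,200 saturated.png
        # Flood-fill the background with white, starting from pixel (2,2), with some fuzz
        convert saturated.png -fill white -fuzz 10% -draw 'color 2,2 floodfill' flooded.png
        # Invert, then threshold down to a black-and-white mask
        convert flooded.png -negate -threshold 50% mask.png
        # Keep the parts of the original that are white in the mask
        convert shirt.jpg mask.png -alpha off -compose CopyOpacity -composite cutout.png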

    This technique works great at smaller image sizes, and this was sufficient when the maximum Polyvore set image was 500x500. But with a general trend towards UIs with larger images and retina displays, we needed to render sets at 1024x1024. At this size, that jagged white outline became much more visible.


    We figured there had to be a better way. After some research and a lot of trial and error, here’s what we came up with:

    We start with the same teal T-shirt, and we do the same Modulate and Draw steps as before:


    Here is where we start to stray from our original masking algorithm. The step after this one works better with grayscale input, so we apply Threshold first:


    Then we use Potrace, an open-source utility which can convert bitmaps into vectors (SVG in our case). We use it to "trace" our mask, which produces smooth edges instead of jagged ones. It’s already looking better!
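
    In command-line terms, that tracing step is roughly the following; Potrace wants a plain bitmap (PBM) as input and emits an SVG, which you can rasterize again (here via ImageMagick's SVG delegate) before compositing. Filenames are illustrative:

        convert mask.png mask.pbm
        potrace --svg mask.pbm -o mask.svg
        convert mask.svg traced-mask.png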


    Negate to get a mask with white shirt and black background:


    And here's what the masked image looks like on a gray background:


    We probably could have stopped here. The white outline around the T-shirt is much smoother than in the original, but wouldn't it be better to not have that outline at all?

    Let's go back to the mask we made with Potrace.


    Potrace draws such perfect lines that the transition between the black background and white foreground is very sharp. When we use Composite with a mask, ImageMagick interprets the black pixels as areas that should be transparent, white pixels as areas that should be opaque, and gray pixels as areas that should be semi-transparent. If we can replace the sudden black to white transition at the mask's edges with a gradient, we should be able to get the white outline in the final image to fade away.

    Using the feathering technique described here, we can smooth the boundary between the black and white by blurring the edges, then applying a gradient to the blurred regions:


    It's hard to see the difference in the final mask, so here's a closeup of the right sleeve area before and after the feathering:


    Notice that the jagged edges have been replaced with smoother lines and curves and the transition between black and white is more gradual.

    And the final payoff, our original masking on the left, our current masking on the right:


    We're constantly looking for ways to improve every aspect of our product. Even though this may seem like a subtle feature, it represents the kind of detail and polish that we strive for and the lengths to which we go to make Polyvore great.

    Got any of your own image processing tips to share?

    Monday, March 4, 2013

    Polyvore Style Tips: CSS, Javascript and HTML

    Shhhh! Don't tell anyone, but the engineers at Polyvore are ... nerds. Most people have the misconception that Polyvore is full of fashion models and editors, but the truth is that Polyvore's a technology company wrapped in a Gucci dress.

    Our core web technology enables people to mix and match products from around the web to create works of art or whimsy.
    Our front ends have always been heavy in HTML, JS, and CSS. And over the years, we've learned a ton in building those features and trying to keep up with all the latest and greatest HTML5 features in modern browsers. I wanted to share some of our favourite tricks.

    Unless otherwise noted, these all work on the last couple of Safari, Chrome, and Firefox releases (obviously), but also on IE8 and higher. These are Polyvore's supported browsers, and it's a decision based on traffic -- 5% or higher. YMMV if you are committed to supporting older versions.



    Overflow: hidden/auto

    We thought we understood what this does: it clips a node's content that falls outside its bounds or fixed size, right?
    We were surprised when we found a secondary behaviour: it also causes a block to size itself to the space available next to a float, without clearing the float or wrapping underneath it.

    I can't count the number of times we've done this:
    CSS:
    .leftImg { float: left; padding: 4px; border: 2px solid blue; }
    .leftImg + .txt { margin-left: 82px; /* 50px for the img + 2*4px padding + 2*2px border + 20px actual margin -- God I hope no one changes this... */ }
    
    HTML:
    <img class="leftImg" src="bacon.jpg" width="50" height="50">
    <div class="txt">
    Fatback spare ribs tri-tip, corned beef andouille bresaola swine meatball biltong short ribs. Corned beef brisket tail kielbasa rump cow t-bone biltong ham. Hamburger turkey corned beef beef ribs swine drumstick.

    Fatback spare ribs tri-tip, corned beef andouille bresaola swine meatball biltong short ribs. Corned beef brisket tail kielbasa rump cow t-bone biltong ham. Hamburger turkey corned beef beef ribs swine drumstick.
    </div>


    But with overflow:hidden ...
    CSS:
    .leftImg { float: left; margin-right: 20px; }
    .filling-block { overflow: hidden; display: block; }
    
    HTML:
    <img class="leftImg" src="bacon.jpg" width="50" height="50">
    <div class="filling-block">
    Fatback spare ribs tri-tip, corned beef andouille bresaola swine meatball biltong short ribs. Corned beef brisket tail kielbasa rump cow t-bone biltong ham. Hamburger turkey corned beef beef ribs swine drumstick.
    Fatback spare ribs tri-tip, corned beef andouille bresaola swine meatball biltong short ribs. Corned beef brisket tail kielbasa rump cow t-bone biltong ham. Hamburger turkey corned beef beef ribs swine drumstick.
    </div>


    As with relying on any secondary behaviour, you have to take the primary behaviour with it:
    Caveat: filling-block can't have "overhanging" content (i.e. content outside the regular flow) or else it'll clip / get scrollbars.



    A more sane box model

    When I say 100%, I mean "including my padding and border".
    Booooo


    * { box-sizing: border-box; -moz-box-sizing: border-box; }
    Yay



    localStorage for storing locally

    We often want to store things locally without enduring the network costs of cookie storage.
    One example for Polyvore is storing signed-out users' current composition, potentially a relatively large (~20k) JSON string.

    While localStorage may not have the scalability or flexibility of SQLite or IndexedDB, we like it for its ease of use (simple key-value pairs) and widespread support. We also have not seen it perform as poorly as some have experienced. Still, to be safe, localStorage access shouldn't be part of an interactive operation (e.g. mouse move) or a tight loop.

    One piece missing from the localStorage API is cookie-style expiration:
    var LocalStorageCache = (function() {
        var WEEK = 1000 * 60 * 60 * 24 * 7;
        var metaKeyPrefix = 'meta_';
        function getMetaKey(key) {
            return metaKeyPrefix + key;
        }
    
        var lastCleanup = Number(localStorage.getItem('last_cleanup')) || 0;
        if (lastCleanup + WEEK < new Date().getTime()) {
            // Preventative clean once per week.                                                                            
            window.setTimeout(function() { LocalStorageCache.cleanup(); });
        }
    
        return {
            WEEK: WEEK,
            set: function(key, value, expires) {
                var jsonValue = '(' + JSON.stringify(value) + ')';
                try {
                    LocalStorageCache._set(key, jsonValue, expires);
                } catch(e) {
                    // Clean up and try again. This may still fail, but gives us a better chance.                           
                    LocalStorageCache.cleanup();
                    LocalStorageCache._set(key, jsonValue, expires);
                }
            },
            _set: function(key, jsonValue, expires) {
                localStorage.setItem(key, jsonValue);
                if (expires) {
                    localStorage.setItem(getMetaKey(key), '(' + JSON.stringify({
                        'createdon': new Date().getTime(),
                        'expires': new Date().getTime() + expires
                    }) + ')');
                } else {
                    localStorage.removeItem(getMetaKey(key));
                }
            },
            get: function(key) {
                var metaDataStr = localStorage.getItem(getMetaKey(key));
                if (metaDataStr) {
                    var metaData;
                    try { metaData = eval(metaDataStr); } catch(e) {}
                    if (!metaData) {
                        return this.remove(key);
                    }
    
                    // Check the expiration.                                                                                
                    var expires = Number(metaData.expires);
                    if (expires && expires < new Date().getTime()) {
                        return this.remove(key);
                    }
                }
    
                var dataStr = localStorage.getItem(key);
                if (!dataStr) {
                    return this.remove(key);
                }
                var data;
                try { data = eval(dataStr); } catch(e2) {} // parse error?                                                  
                return data ? data : this.remove(key);
            },
            remove: function(key) {
                localStorage.removeItem(key);
                localStorage.removeItem(getMetaKey(key));
                return null;
            },
            cleanup: function() {
                localStorage.setItem('last_cleanup', new Date().getTime());
                var keysToDelete = [];
                for (var i = 0, len = localStorage.length; i < len; i++) {
                    var key = localStorage.key(i);
                    if (key.indexOf(metaKeyPrefix) < 0) {
                        keysToDelete.push(key);
                    }
                }
                // If it has an expiration & it's expired, getting the item will clear that key.                            
                keysToDelete.forEach(this.get, this);
            }
        };
    })();
    
    

    And using the local storage API is easy:
    LocalStorageCache.set('a', { foo: 'bar' }, 1000 * 5); // Expires in 5 seconds
    LocalStorageCache.get('a'); // returns { foo: 'bar' } (the JS hash, not the JSON)
    
    Caveat: You can only store ~5MB into localStorage





    Using JS to edit CSS

    We typically change a node's style using node.setAttribute('style', 'width:50px') or similar. But sometimes, it would be nice to change an underlying CSS rule. Some uses we've encountered include:
    • Fluid / dynamic interfaces may need to update multiple nodes at once. If you edit CSS rules, you reflow the entire document but you only need to do so once.
    • We don't want to change every matching element's style attribute directly; just do it once and let CSS handle it -- now and for all future nodes that may be dynamically added
    • We want to dynamically add/change/remove :before/:after pseudo-elements
    Here's a demo of it in action.
    We do use this technique sparingly, though. It can be hard to track down the source of CSS rules when the selectors are generated dynamically (it's not easy to grep for '.' + myClass + ':before'), and as mentioned above:
    Caveat: Touching CSS causes a full page reflow and repaint -- don't do this during user interaction or in a tight loop!
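
    For the curious, here's a minimal sketch of the technique using the standard CSSOM APIs. This is illustrative code, not our production helper; old IE needs the rules/addRule variants shown in the fallbacks.

    function updateRule(selector, cssText) {
        // Look through every stylesheet for an existing rule to edit in place.
        for (var i = 0; i < document.styleSheets.length; i++) {
            var sheet = document.styleSheets[i];
            var rules = sheet.cssRules || sheet.rules; // .rules for old IE
            if (!rules) { continue; }
            for (var j = 0; j < rules.length; j++) {
                if (rules[j].selectorText === selector) {
                    rules[j].style.cssText = cssText;
                    return;
                }
            }
        }
        // Not found: append a new rule to the last stylesheet.
        var last = document.styleSheets[document.styleSheets.length - 1];
        if (last.insertRule) {
            last.insertRule(selector + ' { ' + cssText + ' }', last.cssRules.length);
        } else if (last.addRule) {
            last.addRule(selector, cssText); // IE8
        }
    }

    // One rule edit restyles every current and future .thumb node at once.
    updateRule('.thumb', 'width: 120px; height: 120px;');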




    pointer-events: none

    This is a CSS style that suppresses mouse and pointer interaction on an element. With this style, mouse-based CSS selectors like :hover and :active do not activate and you get no JS mouse events.

    At Polyvore, we use this to keep nodes from capturing mouse behavior simply because they have a higher z-index or sit higher in the stacking order. This gives us the freedom to use nodes to help with layout, without interfering with interaction.

    CSS:
    .controls { pointer-events: none; }
    .interactive { pointer-events: auto; }
    
    /* Only hovering over the .interactive parts will make .controls turn grey; other parts have no effect */
    .controls:hover { background-color: #ccc; }
    
    HTML:

    Demo:

    Caveat: not supported in IE9 & below. So it's great for polish, but not for cross-browser correctness.



    Summary

    At Polyvore, we're always looking for new and interesting ways to let the browser do the heavy lifting. We can't wait to see what the browser vendors will support next!

    Tuesday, January 22, 2013

    Worker Queue System

    In the good old days, Polyvore started out with a very simple, traditional LAMP architecture. Everything more or less directly accessed a bank of MySQL servers, both to service web requests and also for batch housekeeping jobs. But our traffic and data set have kept growing and growing. We started to experience massive spikes in our DB load when nightly batch jobs were kicked off. Jobs that would complete in 3 minutes started to take 5 hours.

    Fortunately, our infrastructure team has a ton of experience dealing with scalability issues. Their solution was to use a worker queue system to break up massive jobs into smaller chunks, executed by a bank of workers. This approach allows us to utilize our machines better by spreading load throughout the day and also scale jobs by parallel execution of chunks on different worker machines.

    Since we had used RabbitMQ before, we considered using it as the building block of our worker queue system as well. However, we quickly found that RabbitMQ falls short in two important respects:

    • RabbitMQ is essentially a generic message bus. This means we needed to extend its functionality to make it a full fledged worker queue system.
    • If a task is added to a RabbitMQ queue, there is no native way of inspecting that task while it is queued. That means we wouldn't be able to check which worker is handling a given task, dedupe tasks, and more.

    After a bit of research we decided to use Gearman. Gearman is a generic framework for farming out work to other machines or processes. It was a great fit for our needs, especially since we use Perl extensively here at Polyvore and Gearman has Perl APIs. In addition, we had already deployed Cassandra in production, and Gearman integrates well with Cassandra as its persistent storage.

    Our implementation ended up being a light wrapper for Gearman. Our API is very simple:

    A way to push a new task onto a named queue:

    $queue->send_task({ channel => 'xyz', payload => $payload });
    

    And a worker for consuming tasks in a given queue:

    package Polyvore::Worker::XYZ;
    use base qw(Polyvore::Worker);
    
    # process is called with each task on the queue
    sub process {
        my ($self, $payload) = @_;
    
        # do stuff
    
        # driver will auto-retry a few times
        die $exception if ($error);
    }
    
    # instantiate a worker and attach to channel to start consuming tasks.
    Polyvore::Worker::XYZ->new({ channel => 'xyz'})->run();
    

    The driver for workers will automatically retry the task a few times (with progressive back-off) if an exception occurs. This is very handy in the world of finicky Facebook, Twitter, etc… APIs. The worker system is integrated with our stats collection system which keeps track of jobs processed, time per task, exceptions and more.
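
    Conceptually, the retry loop in the driver amounts to something like this (a simplified sketch with made-up constants, not the actual driver code; $worker and $payload stand in for whatever the driver holds):

    my $max_attempts = 3;
    for my $attempt (1 .. $max_attempts) {
        my $ok = eval { $worker->process($payload); 1 };
        last if $ok;                             # success -- stop retrying
        die $@ if $attempt == $max_attempts;     # out of retries -- surface the exception
        sleep(2 ** $attempt);                    # progressive back-off: 2s, 4s, 8s
    }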

    We implemented other useful features as well; we have a worker manager that distributes the worker processes/tasks based on the worker cluster's load and queue lengths. It is preferable to have a longer queue of pending tasks than to overload the worker cluster. We also implemented a dependency protocol whereby a worker task can declare itself as dependent on other tasks in the system. That worker task won't execute until all of its dependencies are complete.

    We use the worker queue system both for scaling our backend processes and also for performing asynchronous front-end tasks. For example, our users post a ton of content to external services. Done synchronously, these post operations can hold up the response anywhere from 5 to 30 seconds (or fail entirely and have to be retried). Using the worker queue system we are able to perform these tasks asynchronously in the background and deliver a very responsive user experience.

    Today, we have over 40 worker processes and handle over 18 million tasks per day.

    Challenges

    Sharding batch jobs

    We started out by writing simple jobs that got all their data in one SQL statement. That worked for a while, until the number of rows in the DB grew to the point that the select would make our read slaves keel over. We have since been sharding our jobs so that they can operate in smaller chunks, typically over id ranges. Instead of processing all 1M rows in one go, we break the job up into 1,000 ranges of 1K ids each and treat each range as a task for the worker system.
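
    Using the send_task API from earlier, sharding a big job is just a loop that enqueues one task per id range (the chunk size, channel name, table, and the $dbh handle are illustrative):

    my $chunk = 1_000;
    my ($min_id, $max_id) = $dbh->selectrow_array('SELECT MIN(id), MAX(id) FROM items');
    for (my $start = $min_id; $start <= $max_id; $start += $chunk) {
        $queue->send_task({
            channel => 'recategorize_items',
            payload => { start_id => $start, end_id => $start + $chunk - 1 },
        });
    }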

    Sharding to even out load based on data density

    Some of the problems we solve using our worker queue system have interesting characteristics. These problems require us to split the data into buckets which are not necessarily equal in size in order to maximize efficiency. For example, we use machine learning to categorize items that we import into our index. We do incremental categorization for new items, but we also re-categorize older items that have been changed recently. Since the distribution of updates is biased toward newer items, we create inverse-log sized data buckets to even out the processing time for each group of items. This gives us larger buckets (~10M items) of old items (with few changes), and smaller buckets (~10K items) of new items (with more changes).

    Testing

    We have a great development environment which allows us to have multiple checkouts to work in. Each checkout can be previewed against development and production databases. We also have a per-checkout test environment which allows us to run our test suite against a particular checkout, with its own isolated mysqld instance, Cassandra instance, etc. And in unit tests, each user gets their own worker queue.

    Worker Queues vs. Map/Reduce

    We use Hadoop for big data analysis. Currently we use it for one specific analysis on a subset of our data, but we plan to expand our Hadoop deployment so that we can do batch analysis on a larger portion of our data and in a lot of other use cases. Obviously Hadoop allows us to analyze our data in ways we couldn't have before. Using Hadoop also raises the question of which tasks are best suited for our worker queue system and which tasks will benefit more from map/reduce.

    Our worker queue system is great for procedural tasks such as user notifications, user emails, posting to a Facebook wall, or extracting metadata from uploaded images. In addition, we use our worker queue system for scheduling Hadoop jobs. All of those tasks are asynchronous, independent, and sometimes require a retry mechanism.

    Summary

    Worker queues are a great way to scale batch jobs and increase the utilization of computation resources by spreading load to avoid spikes, which helps in designing and implementing a scalable architecture. They also let you provide a better user experience by performing long blocking tasks asynchronously. As we expand our usage of Hadoop, we will continue to assess which tasks are better suited for our worker queue system and which ones can benefit from Hadoop's map/reduce design pattern. Using the right tool for the job is an important principle, and designing and implementing simple, scalable tools allows us to uphold it.
