[nolan@nprescott.com] $>  cat blog archive feed

Image Capturing the 100 Most Popular Websites

2016-04-20

I was recently asked what conclusions I was able to draw from my analysis of web trends and color palettes and if the same could be done for the top 100 most popular sites.

Automating Image Capture

I have previously used the Alexa top websites data set, and I will be using a subset of it again here.

head -n100 top-1m.csv > top-100.csv
Rank Site
1 google.com
2 facebook.com
3 youtube.com
4 baidu.com
5 yahoo.com
6 amazon.com
7 wikipedia.org
8 qq.com
9 google.co.in
10 twitter.com
... ...

A Basic Crawler

I'm using Selenium and PhantomJS for headless browsing and image capture. A few things worth mentioning:

import logging
import csv
from selenium import webdriver

logging.basicConfig(level=logging.INFO)

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any'])
driver.set_window_size(1024, 768)
driver.set_page_load_timeout(120)

It turns out Selenium/PhantomJS don't get along well with the protocol-less web addresses in the CSV file so I have to build a "full" URL before fetching the page.

Selenium has, for reasons I still don't understand, at least three different methods to invoke screen capture. So far as I can tell they all do the same thing and capture the full page height, which for some pages is an order of magnitude above the previously specified window height of 768.

with open('top-100.csv') as top_sites_csv:
    site_reader = csv.reader(top_sites_csv)
    for i, url in site_reader:
        try:
            full_url = "http://{}".format(url)
            logging.info("fetching: {}".format(full_url))
            driver.get(full_url)
            driver.save_screenshot("images/{}.png".format(url))
        except Exception as exc:
            logging.info(exc)

driver.quit()

Cropping to Standardize Image Size

To fix the issue of variable image sizes I'm once again using imagemagick:

  for i in images/*.png; do
      convert $i -crop x768+0+0 cropped-images/$i;
  done

where the x768 is not specifying a width (think of it as _x768) and the +0+0 is specifying the start position of the crop operation in pixels. Failure to include the start position results in n-images where n is the number of images necessary to include the full original image (making it more of a slice operation).

Trying To Visualize It

While it is possible to deconstruct the images into their component parts or RGB values with imagemagick, it provides little insight in and of itself.

I'll be using the GitHub homepage for this example:

GitHub homepage

convert cropped-images/github.com.png txt: | head
    # ImageMagick pixel enumeration: 1024,768,255,srgba
    0,0: (255,255,255,1)  #FFFFFFFF  white
    1,0: (255,255,255,1)  #FFFFFFFF  white
    2,0: (255,255,255,1)  #FFFFFFFF  white
    3,0: (255,255,255,1)  #FFFFFFFF  white
    4,0: (255,255,255,1)  #FFFFFFFF  white
    5,0: (255,255,255,1)  #FFFFFFFF  white
    6,0: (255,255,255,1)  #FFFFFFFF  white
    7,0: (255,255,255,1)  #FFFFFFFF  white
    8,0: (255,255,255,1)  #FFFFFFFF  white

I've tried plotting the RGB values in 3 dimensional space, but even at a reduced size it's still a matter of visualizing more than 30,000 points per image in a system that has no direct application to our perception of color.

3D plot of RGB colors

So once again, it means trying to extract the meaningful colors from the sea of data that's present in each image.

Imagemagick's Histogram

R G B
248 248 248
90 72 63
32 22 17
121 142 118
153 161 171

K-Means Analysis

Disappointed with the muddled results given by imagemagick's histogram I set about doing a K-means analysis of the per-pixel RGB values in the same test image. I'm pulling most of my background on K-means from John Foreman's Data Smart, which is a decent enough introduction to (exploratory) data analysis though all of the examples are worked in Excel; and Philipp Janert's Data Analysis with Open Source Tools.

I won't bother explaining too much of what's going on here because the results were ultimately pretty lackluster. You'll note I didn't bother correcting for exceeding the RGB range in the third row. At that point it was clear that this was a more intensive solution with a worse result.

  import numpy as np
  from scipy.cluster.vq import kmeans2

  clusters = 5
  pixel_array = np.loadtxt('pixel-values.txt')
  centroids, pixels = kmeans2(pixel_array, clusters)
R G B
31.674 22.019 15.089
247.560 247.800 247.489
277.554 300.70 301.190
89.965 70.016 61.269
134.312 149.269 143.781

I think the real issue here is the limited k value forces a worse approximation of the cluster centers, possibly due to the inherent "shape" of the RGB data points. Janert claims K-means works best for globular or star-convex clusters. I haven't been able to define the RGB points' shape, but I think it may be a function of the colorspace they are expressed in. An interesting experiment would be to convert each point to HSL or XYZ and try re-plotting.

So Why Is It Ugly?

Of course, in retrospect, it makes some sense that the K-means would result in less than idea identification of prominent colors in an image such as the example given. Because of the hero image in the jumbotron there are a huge number of colors on-screen, and the K-means is simply finding a best approximation of a local minimum for the Euclidean distance in RGB values. This means the flesh-tones and the dark greys have to be factored in as much as the bright green "Sign up for GitHub" button; resulting in a more muted palette.

One Step Back

It occurred to me, browsing through all of the captured images, that I would necessarily have the same issue with each site that heavily relied on images (of products, of users, of advertisements). What originally prompted me to examine these sites was to try and view the palettes they were designed with, and while the images are sometimes a part of this (tumblr.com for instance) more often they are just noise.

So I re-crawled each of the sites, this time omitting images from pages with an additional parameter to PhantomJS:

  driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true',
                                             '--ssl-protocol=any',
                                             '--load-images=no'])

Which results in the following:

GitHub homepage no images

So taking a histogram of an image-less screen capture results in the following histogram from imagemagick:

While it is better without the hero image, it still fails to (in my mind) adequately capture the saturation in the green call to action. This has lead me to conclude that I'm approaching the color extraction wrong and need to reassess the algorithmic approach.

Next Steps

Ultimately, I think this was a roundabout way to find I have been approaching the problem incorrectly. I think the next logical approach to test will be implementing a color quantization (Median Cut looks promising).

[nolan@nprescott.com] $> █