Image Capturing the 100 Most Popular Websites

2016-04-20

I was recently asked what conclusions I was able to draw from my analysis of web trends and color palettes and if the same could be done for the top 100 most popular sites.

Automating Image Capture

I have previously used the Alexa top websites data set, and I will be using a subset of it again here.

head -n100 top-1m.csv > top-100.csv

Rank	Site
1	google.com
2	facebook.com
3	youtube.com
4	baidu.com
5	yahoo.com
6	amazon.com
7	wikipedia.org
8	qq.com
9	google.co.in
10	twitter.com
…	…

A Basic Crawler

I'm using Selenium and PhantomJS for headless browsing and image capture. A few things worth mentioning:

- The call to set_window_size is necessary to define a reasonable (to me) browser window width for the screenshot.
  - The height parameter is ignored by PhantomJS and the screenshot is of the full page height.
- set_page_load_timeout prevents the whole script from hanging (a bigger problem on a poor internet connection). With an upper bound of 2 minutes, I timed out on a single page.
- The service_args are included to avoid possible issues with bad TLS connections; they were more of a preventative measure than absolutely required.

import logging
import csv
from selenium import webdriver

logging.basicConfig(level=logging.INFO)

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any'])
driver.set_window_size(1024, 768)
driver.set_page_load_timeout(120)

It turns out Selenium/PhantomJS don't get along well with the protocol-less web addresses in the CSV file, so I have to build a "full" URL before fetching each page.

Selenium has, for reasons I still don't understand, at least three different methods to invoke screen capture. So far as I can tell they all do the same thing and capture the full page height, which for some pages is an order of magnitude above the previously specified window height of 768.

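with open('top-100.csv') as top_sites_csv:
    site_reader = csv.reader(top_sites_csv)
    # Each row is "rank,domain"; only the domain is needed here.
    for rank, url in site_reader:
        try:
            # The CSV has bare domains, so prepend a scheme before fetching.
            full_url = "http://{}".format(url)
            logging.info("fetching: {}".format(full_url))
            driver.get(full_url)
            # Assumes the images/ directory already exists.
            driver.save_screenshot("images/{}.png".format(url))
        except Exception as exc:
            # Log the failure (e.g. a page-load timeout) and move on to the next site.
            logging.warning(exc)

# Shut down the PhantomJS process once every site has been visited.
driver.quit()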

Cropping to Standardize Image Size

To fix the issue of variable image sizes I'm once again using imagemagick:

  for i in images/*.png; do
      convert "$i" -crop x768+0+0 "cropped-images/$(basename "$i")";
  done

where x768 leaves the width unspecified (think of it as _x768) and +0+0 gives the start position of the crop operation in pixels. Omitting the start position results in n images, where n is the number of slices needed to cover the full original image (making it more of a slice operation).
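To make the slicing behaviour concrete, here is a small arithmetic sketch. It assumes, as described above, that a crop without a start position tiles the image into consecutive 768-pixel-high strips. The function name slice_count and the sample page heights are made up for illustration.

```python
import math


def slice_count(image_height, crop_height=768):
    """Number of strips produced when a crop has no start offset."""
    return math.ceil(image_height / crop_height)


print(slice_count(768))   # a page exactly one window tall
print(slice_count(7680))  # a page ten windows tall
print(slice_count(1000))  # a partial last strip still counts as an image
```

With +0+0 in place, only the first of those strips is written, which is what standardizes every capture to 768 pixels high.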

Trying To Visualize It

While it is possible to deconstruct the images into their component parts or RGB values with imagemagick, it provides little insight in and of itself.

I'll be using the GitHub homepage for this example:

GitHub homepage

convert cropped-images/github.com.png txt: | head

    # ImageMagick pixel enumeration: 1024,768,255,srgba
    0,0: (255,255,255,1)  #FFFFFFFF  white
    1,0: (255,255,255,1)  #FFFFFFFF  white
    2,0: (255,255,255,1)  #FFFFFFFF  white
    3,0: (255,255,255,1)  #FFFFFFFF  white
    4,0: (255,255,255,1)  #FFFFFFFF  white
    5,0: (255,255,255,1)  #FFFFFFFF  white
    6,0: (255,255,255,1)  #FFFFFFFF  white
    7,0: (255,255,255,1)  #FFFFFFFF  white
    8,0: (255,255,255,1)  #FFFFFFFF  white
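The enumeration above is easy enough to turn into numeric rows. The following is a minimal sketch that assumes every pixel line looks exactly like the sample output (integer channel values followed by an alpha value). The name parse_pixel_line is hypothetical, and this is only one way a whitespace-separated file of RGB rows could be produced.

```python
import re

# Matches lines such as "0,0: (255,255,255,1)  #FFFFFFFF  white"
PIXEL_LINE = re.compile(r"^\s*(\d+),(\d+):\s*\(([^)]*)\)")


def parse_pixel_line(line):
    """Return (x, y, (r, g, b)) for a pixel line, or None for anything else."""
    match = PIXEL_LINE.match(line)
    if match is None:
        return None  # e.g. the "# ImageMagick pixel enumeration" header
    x, y = int(match.group(1)), int(match.group(2))
    channels = [float(value) for value in match.group(3).split(",")]
    r, g, b = (int(round(c)) for c in channels[:3])  # drop alpha if present
    return x, y, (r, g, b)


sample = [
    "# ImageMagick pixel enumeration: 1024,768,255,srgba",
    "0,0: (255,255,255,1)  #FFFFFFFF  white",
    "1,0: (255,255,255,1)  #FFFFFFFF  white",
]
pixels = [p for p in map(parse_pixel_line, sample) if p is not None]
for x, y, (r, g, b) in pixels:
    print(r, g, b)  # one whitespace-separated row per pixel
```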

I've tried plotting the RGB values in three-dimensional space, but even at a reduced size it's still a matter of visualizing more than 30,000 points per image in a system that has no direct application to our perception of color.

So once again, it means trying to extract the meaningful colors from the sea of data that's present in each image.

Imagemagick's Histogram

R	G	B
248	248	248
90	72	63
32	22	17
121	142	118
153	161	171

K-Means Analysis

Disappointed with the muddled results given by imagemagick's histogram, I set about doing a K-means analysis of the per-pixel RGB values in the same test image. I'm pulling most of my background on K-means from two books: John Foreman's Data Smart, a decent enough introduction to (exploratory) data analysis, though all of the examples are worked in Excel, and Philipp Janert's Data Analysis with Open Source Tools.

I won't bother explaining too much of what's going on here because the results were ultimately pretty lackluster. You'll note I didn't bother correcting for exceeding the RGB range in the third row. At that point it was clear that this was a more intensive solution with a worse result.

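  import numpy as np
  from scipy.cluster.vq import kmeans2

  clusters = 5
  # One row per pixel, three whitespace-separated columns (R G B)
  pixel_array = np.loadtxt('pixel-values.txt')
  # kmeans2 returns the centroids and, for each pixel, the index of its cluster
  centroids, labels = kmeans2(pixel_array, clusters)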

R	G	B
31.674	22.019	15.089
247.560	247.800	247.489
277.554	300.70	301.190
89.965	70.016	61.269
134.312	149.269	143.781
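A mean of values that all lie between 0 and 255 cannot fall outside that range, so the out-of-range third row is probably not a real cluster average. One plausible explanation, offered as an assumption rather than something checked against the original run, is that kmeans2 by default seeds its centroids from a Gaussian fitted to the data and leaves a centroid at its starting value if no pixels are ever assigned to it. The sketch below is a plain-Python version of Lloyd's algorithm, not scipy's implementation. It seeds the centroids from actual pixels (similar in spirit to minit='points' in kmeans2), and the sample pixels are made up.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm with centroids seeded from the data itself."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from real pixels
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(axis) / len(group) for axis in zip(*group))
    return centroids


# Hypothetical pixels: mostly near-white, some dark browns, a few greens.
pixels = ([(248, 248, 248)] * 50 + [(32, 22, 17)] * 30
          + [(90, 72, 63)] * 15 + [(40, 200, 80)] * 5)
centroids = kmeans(pixels, k=3)
for centroid in centroids:
    print(tuple(round(c, 1) for c in centroid))
```

Because every starting centroid is a real pixel and every update is a mean of real pixels, each centroid stays inside the 0 to 255 range even when a cluster ends up empty.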

I think the real issue here is the limited k value forces a worse approximation of the cluster centers, possibly due to the inherent "shape" of the RGB data points. Janert claims K-means works best for globular or star-convex clusters. I haven't been able to define the RGB points' shape, but I think it may be a function of the colorspace they are expressed in. An interesting experiment would be to convert each point to HSL or XYZ and try re-plotting.
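As a starting point for that experiment, the conversion to HSL needs nothing beyond the standard library. This is a minimal sketch, not a finished analysis. The helper name rgb_to_hsl is hypothetical, and the input rows are the five colors from the imagemagick histogram table earlier in the post.

```python
import colorsys


def rgb_to_hsl(r, g, b):
    """Convert 0-255 RGB to (hue in degrees, saturation, lightness in 0-1)."""
    # colorsys returns the components in HLS order, so reorder them here.
    h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, l


# Rows from the imagemagick histogram table earlier in the post.
histogram_rows = [(248, 248, 248), (90, 72, 63), (32, 22, 17),
                  (121, 142, 118), (153, 161, 171)]
for rgb in histogram_rows:
    hue, saturation, lightness = rgb_to_hsl(*rgb)
    print(rgb, round(hue), round(saturation, 2), round(lightness, 2))
```

Printing saturation next to each color makes it easier to see how muted a palette is, which is hard to judge from raw RGB triples.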

So Why Is It Ugly?

Of course, in retrospect, it makes some sense that the K-means would result in a less than ideal identification of prominent colors in an image such as the example given. Because of the hero image in the jumbotron there are a huge number of colors on-screen, and the K-means is simply finding a best approximation of a local minimum for the Euclidean distance in RGB values. This means the flesh-tones and the dark greys have to be factored in as much as the bright green "Sign up for GitHub" button, resulting in a more muted palette.

One Step Back

It occurred to me, browsing through all of the captured images, that I would necessarily have the same issue with each site that heavily relied on images (of products, of users, of advertisements). What originally prompted me to examine these sites was to try and view the palettes they were designed with, and while the images are sometimes a part of this (tumblr.com for instance) more often they are just noise.

So I re-crawled each of the sites, this time omitting images from pages with an additional parameter to PhantomJS:

  driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true',
                                             '--ssl-protocol=any',
                                             '--load-images=no'])

Which results in the following:

GitHub homepage no images

Taking an imagemagick histogram of the image-less screen capture results in the following:

While it is better without the hero image, it still fails to (in my mind) adequately capture the saturation in the green call to action. This has led me to conclude that I'm approaching the color extraction wrong and need to reassess the algorithmic approach.

Next Steps

Ultimately, I think this was a roundabout way to find that I have been approaching the problem incorrectly. I think the next logical approach to test will be a color quantization algorithm (Median Cut looks promising).
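For reference, the textbook idea behind Median Cut fits in a few lines. This is a minimal sketch under simplifying assumptions (split each box at the median of its widest channel, then average each box), not a tested replacement for anything above. The function name and the sample pixels are illustrative only.

```python
def median_cut(pixels, depth):
    """Split pixels into up to 2**depth boxes and return each box's mean color."""
    if not pixels:
        return []
    if depth == 0 or len(pixels) == 1:
        count = len(pixels)
        return [tuple(sum(channel) / count for channel in zip(*pixels))]
    # Find the channel (0=R, 1=G, 2=B) with the widest range in this box.
    ranges = [max(channel) - min(channel) for channel in zip(*pixels)]
    widest = ranges.index(max(ranges))
    # Sort along that channel and split the box at the median pixel.
    ordered = sorted(pixels, key=lambda pixel: pixel[widest])
    middle = len(ordered) // 2
    return (median_cut(ordered[:middle], depth - 1)
            + median_cut(ordered[middle:], depth - 1))


# Hypothetical pixels: near-white background, dark text, a little green.
pixels = [(248, 248, 248)] * 6 + [(32, 22, 17)] * 4 + [(40, 200, 80)] * 2
palette = median_cut(pixels, depth=2)
for color in palette:
    print(tuple(round(c, 1) for c in color))
```

Because each split is made by pixel count, a small group of saturated pixels can still end up averaged with its neighbours, so whether this captures a call-to-action color better than the histogram is something to test rather than assume.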