
Yak Shaving

2015-08-02

A short post chronicling an evening of yak shaving, working out which CLI tools are available for web scraping, printing, and handling PDF files.

I found my way back to Ian Taylor's series of posts documenting linkers and got it into my head that what I really needed was to be able to read them all offline. This led to an almost comical weekend edition of yak shaving 1.

I tend to use the Fish Shell 2 on my home computer, but I often have to drop back to Bash for any kind of scripting because of differences in parameter expansion and substitution that I haven't willed into my brain yet. This time everything was pretty straightforward though, and Fish really does have saner scripting than Bash.
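
For a small taste of the difference (a throwaway illustration, not part of the script below): command substitution and loop syntax are spelled differently in the two shells.

    # fish: command substitution is bare parentheses and blocks end with 'end'
    for i in (seq 1 3)
        echo "page $i"
    end

    # bash: the same loop needs $(...) plus do/done
    # for i in $(seq 1 3); do echo "page $i"; done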

Although curl is capable of iterating over a range of URLs with a simple syntax: http://www.airs.com/blog/archives/[38-57], what I ended up doing was writing a loop myself in order to refer to the page index later in the pipeline.
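
For reference, the built-in globbing on its own would look something like the line below; when globbing, curl makes the current value from the range available as #1 for naming the output file.

    # save each page as <index>.html, i.e. 38.html through 57.html
    curl -s "http://www.airs.com/blog/archives/[38-57]" -o "#1.html"

And the loop itself: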

    for i in (seq 38 57)
        # fetch each archive page, render it to plain text, strip the
        # trailing comment section, and convert the result to RTF
        curl -s "http://www.airs.com/blog/archives/"$i \
          | w3m -dump -T text/html \
          | awk -f scrape.awk \
          | textutil -stdin -convert rtf -font 'Courier New' -fontsize 9 -output $i.rtf
        # convert the RTF to PDF; -D deletes the intermediate RTF when done
        cupsfilter -D $i.rtf > $i.pdf
    end

Looping with curl this way may well be slower than the built-in range handling, but the shell pipeline breaks down around textutil without a named output file: it overwrites itself with the default out.rtf. I hadn't used w3m in this manner before and was pleasantly surprised to find the text-based web browser capable of writing its formatted output to a file (or to stdout, in my case). One thing I didn't like was the inclusion of each article's comments in the dumped text, but I quickly realized it would be a simple thing to identify and strip them from the output. I used the AWK script below (or rather, I typed it out inline and have since broken it into its own script for the sake of readability).

    BEGIN {
        SKIP=0                  # print everything until the permalink is seen
    }

    $0 ~ /^[Pp]ermalink$/ {
        SKIP=1                  # the permalink marks the end of the article
    }

    {
        if (SKIP==1)
            next                # skip the permalink and everything after it
        else
            print $0
    }

The script simply looks for the one place on the page I figured could be relied upon to signify the end of the article: the article's permalink. I can safely ignore everything after that, so I call next until EOF is reached.
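
Since nothing after that marker is wanted anyway, the same effect could probably be had by bailing out entirely rather than skipping line by line, with something like this one-liner standing in for awk -f scrape.awk:

    awk '/^[Pp]ermalink$/ { exit } { print }'

Here exit stops reading the input the moment the permalink line turns up, so the rest of the page is never even considered.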

Once that is done, I have a directory full of PDF files formatted more or less how I want, and all that's left is to merge the separate files into one. Honestly, I had no idea how to do this, but I hit on a simple solution using ghostscript:

    gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=linkers.pdf *.pdf
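
One thing to note: this relies on the shell glob handing the files over in the right order. If it ever didn't, the inputs could be spelled out explicitly in the same fish style, something like:

    # (seq 38 57).pdf expands to 38.pdf 39.pdf ... 57.pdf, in order
    gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=linkers.pdf (seq 38 57).pdf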

The only thing left then is to clean up the lingering per-page PDF files after checking that everything is in order and looks good (enough). I was again pleasantly surprised to find that ghostscript kept the file/page ordering intact and I didn't need to fiddle with anything. I probably spent more time than was strictly necessary down this particular rabbit hole, but I perform something like this sequence of actions often enough that it was worth scripting to some extent.


  1. http://catb.org/jargon/html/Y/yak-shaving.html
  2. http://fishshell.com/