A short post chronicling an evening of yak-shaving, working out which CLI tools are available for web scraping, printing, and handling PDF files.
I found my way back to Ian Taylor's series of posts documenting linkers and got it in my head that what I really needed was to be able to read them all offline. This led to an almost comical weekend edition of yak shaving 1.
I tend to use the Fish Shell 2 on my home computer, but I often find I have to drop back to Bash for any kind of scripting due to differences in parameter expansion and substitution that I haven't willed into my brain yet. This time everything was pretty straightforward, though, and Fish really does have saner scripting than Bash.
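As a rough, hypothetical illustration of the kind of difference I mean (the variable names here are made up): Bash's ${var%suffix} expansion has no direct counterpart in Fish, which leans on the string builtin instead, and command substitution is written with bare parentheses rather than $( ).

# Bash
base="${page%.html}"     # strip a suffix with parameter expansion
indexes=$(seq 38 57)     # command substitution

# Fish
set base (string replace -r '\.html$' '' $page)   # string builtin instead of ${...%...}
set indexes (seq 38 57)                           # command substitution with plain parens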
Although curl is capable of iterating over a range of URLs with a simple syntax, http://www.airs.com/blog/archives/[38-57], what I ended up doing was writing a loop myself in order to refer to the page index later in the pipeline.
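For comparison, as far as I know the built-in range syntax can at least name each download after its index using the #1 placeholder, but that index isn't available to later stages of a pipeline:

curl -s "http://www.airs.com/blog/archives/[38-57]" -o "#1.html"

The loop I settled on instead: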
for i in (seq 38 57)
    # fetch each archive page, render it to text, strip the trailing comments,
    # and convert the result to an RTF named after the page index
    curl -s "http://www.airs.com/blog/archives/"$i \
        | w3m -dump -T text/html \
        | awk -f scrape.awk \
        | textutil -stdin -convert rtf -font 'Courier New' -fontsize 9 -output $i.rtf
    # convert the RTF to PDF via CUPS; -D deletes the input RTF afterwards
    cupsfilter -D $i.rtf > $i.pdf
end
I'm not sure that looping with curl this way isn't slower than its built-in range mechanism, but the pipeline breaks down around textutil anyway: without a named output file, each iteration overwrites the default out.rtf. I hadn't used w3m in this manner before and was pleasantly surprised to find the text-based web browser capable of writing its formatted output to a file (or to stdout, in my case). One thing I didn't like was the inclusion of comments in the dumped text of each article, but I quickly realized it would be a simple thing to identify and strip them from the output. I used the AWK script below (or rather, I typed it out inline and have broken it out into its own script here for the sake of readability).
BEGIN {
    SKIP = 0
}

$0 ~ /^[Pp]ermalink$/ {
    SKIP = 1
}

{
    if (SKIP == 1)
        next
    else
        print $0
}
The script simply looks for the one place on the page I figured could be relied upon to signify the end of the article: the article's permalink. Everything after that can safely be ignored, so it calls next on every subsequent line until EOF is reached.
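The same behaviour can be expressed more tersely, if less readably, as a single AWK pattern; a sketch equivalent to the script above:

awk '/^[Pp]ermalink$/ { skip = 1 } !skip'

The bare !skip pattern prints every line by default until the permalink is seen, after which nothing else is printed.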
Once that is done, I have a directory full of PDF files, basically formatted how I want, and all that's left is to merge the separate files into one. Honestly, I had no idea how to do this, but hit on a simple solution using ghostscript:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=linkers.pdf *.pdf
-dBATCH: causes ghostscript to exit after processing, skipping any user interaction
-dNOPAUSE: disables the pause and prompt after each file
-q: quiet startup
-sDEVICE=pdfwrite: selects PDF output
-sOutputFile=linkers.pdf: names the single output file

All that's left then is to clean up the lingering per-page PDF files after checking that everything is in order and looks good (enough). I was again pleasantly surprised to find that ghostscript kept the file/page ordering intact and I didn't need to fiddle with anything. I probably spent more time than was strictly necessary down this particular rabbit-hole, but I perform something like this sequence of actions often enough that it was worth scripting to some extent.
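For that last cleanup step, a fish one-liner over the same index range does the job, assuming the per-page PDFs are still named as in the loop above:

for i in (seq 38 57); rm $i.pdf; end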