Optimizing Static Site Generation
2017-01-07
I previously identified and fixed a design flaw in my own static site generator as part of a profiling effort. That same effort turned up another avenue for reducing the generation time of this site: skipping posts that have not changed.
Complicating Stated Goals
When I first created Static, it was designed to do the least surprising thing possible. Consequently, posts were regenerated on each run, regardless of whether they had been generated previously. This made sense at the time of writing, and expedited development, but it doesn't make sense if this is to be a long-lived project. Generation times that scale linearly with the number of posts do not inspire me to keep writing. Seemingly, the only way forward is to bite the bullet and eliminate this shortcoming.
The Simplest Possible Solution
With a goal of eliminating re-renderings, the most obvious solution is to track posts as they are generated and check if they have been generated previously. To date, Static has relied mostly on simple data structures within Python, defining only a single class for post objects; I opted to pursue a similar line with this initial "caching" mechanism.
I first laid out the amount of information needed to recreate this site, assuming a previous successful run of the generator. What it largely came down to was the information necessary for the "archive" page, which boils down to:
- post title
- post date
- relative path to the generated post
Since those three items are collected for every post in a list
(all_posts), I settled on simply serializing that list to JSON and writing it
out to disk. From there it was a simple matter of "loading" and "dumping" that
list before and after each generation cycle.
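The load-and-dump cycle described above can be sketched roughly as follows. This is a minimal illustration, not Static's actual code: the cache file name, the record keys, and the generate and render names are all assumptions made for the example; only the all_posts list and the three stored fields come from the description above.

```python
import json
import os

CACHE_PATH = "cache.json"  # hypothetical file name, not taken from Static


def load_cache(path=CACHE_PATH):
    """Return the post records saved by a previous run, or [] on a first run."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def dump_cache(all_posts, path=CACHE_PATH):
    """Write the full list of post records back to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(all_posts, f, indent=2)


def generate(sources, render, path=CACHE_PATH):
    """Render only the posts that are missing from the cache.

    sources maps each output path to a (title, date) pair; render is
    whatever callable actually writes a post (both are assumptions).
    """
    all_posts = load_cache(path)
    seen = {post["path"] for post in all_posts}
    for out_path, (title, date) in sources.items():
        if out_path in seen:
            continue  # generated on an earlier run, so skip it
        render(out_path)
        all_posts.append({"title": title, "date": date, "path": out_path})
    dump_cache(all_posts, path)
    return all_posts
```

Keying the check on the output path keeps the sketch simple, but it also means an edited post is never re-rendered until its record is removed from the file.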
So - Did It Work?
Generally speaking, yes, it did work. Generating a single new post takes approximately two tenths of a second, about a third of the time taken previously. More importantly, this change should hold generation times more or less constant from here on, since only new posts are rendered on each run. The single JSON file will keep growing, but the time spent de-serializing JSON is probably negligible.
It occurs to me, now that I've had some time to work with the new generation
mechanism, that there isn't yet a good workflow for either ignoring the JSON
file in order to regenerate every post (short of deleting it entirely) or
forcing a regeneration of a single post. The latter is a pretty common
operation when I'm drafting a post, as I write, read and check for cogency. In
the future, rather than fiddling with the JSON file manually, I'd like to
extend the available command-line arguments with a --no-cache option or
similar. Ultimately, I
think refinements will shake out of further use, in much the same way that I