
Optimizing Static Site Generation

2017-01-07

I previously identified and fixed a design flaw in my own static site generator as part of a profiling effort. I similarly identified potential avenues for optimizing generation times of this site, specifically by avoiding unchanged posts.

Complicating Stated Goals

When I first created Static, it was designed to do the least surprising thing possible. Consequently, posts were regenerated on each run, regardless of whether they had been generated previously. This made sense at the time of writing, and expedited development, but ultimately doesn't make sense if this is to be a long-lived project. Generation times that scale linearly with the number of posts do not inspire me to keep writing. Seemingly, the only way forward is to bite the bullet and eliminate this shortcoming.

The Simplest Possible Solution

With a goal of eliminating re-renderings, the most obvious solution is to track posts as they are generated and check if they have been generated previously. To date, Static has relied mostly on simple data structures within Python, defining only a single class for post objects; I opted to pursue a similar line with this initial "caching" mechanism.

I first laid out the amount of information needed to recreate this site, assuming a previous successful run of the generator. What it largely came down to was the information necessary for the "archive" page, which boils down to:

Since those three items are already collected for every post in a list (all_posts), I settled on simply serializing that list to JSON and writing it out to disk. From there it was a simple matter of "loading" and "dumping" that list before and after each generation cycle.
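The load-and-dump cycle described above might look roughly like the following. This is a minimal sketch rather than Static's actual code: the function names, the render callback, and the "source" key used to recognize an already-generated post are all assumptions made for illustration, and the entry fields are placeholders, not the three items listed above. Only the all_posts list and the .prerender file name come from this post.

```python
import json
import os

CACHE_FILE = ".prerender"  # file name from the post; the JSON layout is assumed


def load_cache(path=CACHE_FILE):
    """Return the list saved by the previous run, or [] if there is none."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def dump_cache(all_posts, path=CACHE_FILE):
    """Write the full list of post entries back to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(all_posts, f, indent=2)


def generate(sources, render, path=CACHE_FILE):
    """Render only the sources missing from the cache; return those rendered.

    `render` is a hypothetical callback that turns a source name into a
    JSON-serializable dict containing at least a "source" key.
    """
    all_posts = load_cache(path)
    seen = {entry["source"] for entry in all_posts}
    rendered = []
    for source in sources:
        if source in seen:
            continue  # generated on an earlier run, so skip it
        all_posts.append(render(source))
        rendered.append(source)
    dump_cache(all_posts, path)
    return rendered
```

Keeping the cache as one flat list means the archive page can be rebuilt from the loaded entries alone, without re-reading any post that was skipped.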

So - Did It Work?

Generally speaking, yes, it did work. Generating a single new post takes approximately two tenths of a second, about a third of the time taken previously. More importantly, this change should keep generation times more or less constant from here on, since each run is limited to the posts that have not been generated before. There is the growth of the single JSON file, but time spent de-serializing JSON is probably negligible.

Potential Improvements

It occurs to me, now that I've had some time to work with the new generation mechanism, that there isn't a good workflow yet for either ignoring the .prerender file in order to regenerate every post (short of deleting it entirely) or forcing a regeneration of a single post. The latter is a pretty common operation when I'm drafting a post, as I write, read and check for cogency. In the future, rather than fiddling with the JSON file manually, I'd like to extend the command-line arguments available to add a --no-cache option or similar. Ultimately, I think refinements will shake out of further use, in much the same way that I arrived here.
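As a rough idea of what that could look like, here is a sketch using Python's argparse. None of this exists in Static yet: the --force flag, the function names, and the way cached posts are represented (a set of source names) are all hypothetical, and only the --no-cache name comes from the paragraph above.

```python
import argparse


def parse_args(argv=None):
    """Parse the two hypothetical cache-related options."""
    parser = argparse.ArgumentParser(description="Generate the site.")
    parser.add_argument(
        "--no-cache", action="store_true",
        help="ignore the .prerender file and regenerate every post")
    parser.add_argument(
        "--force", metavar="POST", action="append", default=[],
        help="regenerate this post even if it is cached (repeatable)")
    return parser.parse_args(argv)


def posts_to_render(sources, cached, args):
    """Pick which sources need rendering, given the cached source names."""
    if args.no_cache:
        return list(sources)  # cache ignored: everything is regenerated
    forced = set(args.force)
    return [s for s in sources if s not in cached or s in forced]
```

With something like this, drafting a post would mean passing --force with that post's name on each run, while --no-cache would cover the "regenerate everything" case without deleting the file.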