Thinking About Another Static Site Generator

2021-11-26

I have been reading more about C programming and casting about for a project I am thinking of writing another static site generator. It is a small, self-contained project with a well known scope and I have real data.

Goals

no changes to existing documents
no broken links
keep full post body in feed entries
few dependencies

With several years of use I no longer want nor use the various (mis)features intended to make the current program reusable. I am perfectly happy to bake the few pieces of configuration into the program and require a recompilation to use. I am willing to trade compilation-time for run-time because I use the program more often than I modify it. Moreover editing a source file is as easy as editing a configuration file in the typical case.

Auto-generated content has been intentionally limited to just a "recent posts" main page, an archive page, and an Atom feed. While they were a notable inclusion in previous iterations of the project templates have not proven to be very interesting or useful.

Post size information

This is all done as an exercise, the amount of data is so small as to be trivially done in almost any language:

find /home/nolan/sources/idle-cycles/posts -type f -name '*.post' | wc -l

du -h --exclude "*/static" /home/nolan/sources/idle-cycles/posts | sort -k2

1.4M	/home/nolan/sources/idle-cycles/posts
16K	/home/nolan/sources/idle-cycles/posts/2014
88K	/home/nolan/sources/idle-cycles/posts/2015
252K	/home/nolan/sources/idle-cycles/posts/2016
136K	/home/nolan/sources/idle-cycles/posts/2017
160K	/home/nolan/sources/idle-cycles/posts/2018
172K	/home/nolan/sources/idle-cycles/posts/2019
212K	/home/nolan/sources/idle-cycles/posts/2020
340K	/home/nolan/sources/idle-cycles/posts/2021

Did you know that gnuplot can do rudimentary statistical analysis? I love finding out new things!

reset
!stat -c '%s' /home/nolan/sources/idle-cycles/posts/*/*.post > /tmp/data.txt

set key left top
unset xtics
set ylabel "File Size (kB)"

f(x) = mean_y
fit f(x) '/tmp/data.txt' via mean_y
stddev_y = sqrt(FIT_WSSR / (FIT_NDF + 1 ))

plot mean_y-stddev_y title "standard dev." with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd", \
     mean_y+stddev_y notitle with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd", \
     mean_y title "mean file size" w l lc 6 lw 4, \
     '/tmp/data.txt' title 'single file' w p pt 7 lt 1

File Tree Traversal

I have stitched the following together based on a few man pages:

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <libgen.h>

int neat_callback(const char *fpath, const struct stat *sb, int tflag, struct FTW *ftwbuf)
{
  const char *ext = strrchr(fpath, '.');
  if (ext && (strcmp("post", ext+1) == 0)) {
    size_t n = strlen(fpath)+1;
    char fpathCopyOne[n];
    char fpathCopyTwo[n];
    memcpy(fpathCopyOne, fpath, n);
    memcpy(fpathCopyTwo, fpath, n);
    printf("path: %s file: %s\n", dirname(fpathCopyOne), basename(fpathCopyTwo));
  }
  return 0;
}

int main(int argc, char *argv[])
{
  if (nftw("/tmp/idle-cycles/posts", neat_callback, 10, 0) == -1) {
    perror("junk program");
    exit(EXIT_FAILURE);
  }

  exit(EXIT_SUCCESS);
}

path: /tmp/idle-cycles/posts/2017 file: outlook-calendaring.post
path: /tmp/idle-cycles/posts/2017 file: belated-spring-cleaning.post
...
path: /tmp/idle-cycles/posts/2021 file: bikeshare-data.post

It appears to be sufficiently fast (meaning instantaneous for the current set of data):

$ /usr/bin/time /tmp/a.out
...
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 1416maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps

How Long Should The Whole Program Take?

This is a difficult question for me because I'm not really sure how much work it is to open 100+ files and read their entirety. Based on the file traversal it would seem to be at least 1 or 2 orders of magnitude faster than I might have guessed. A current non-negligible amount of work is spent shuffling all of the media associated with the different post entries from their source to a destination sibling directory. This is entirely useless work and I am considering dropping the idea of a generated output directory and instead creating the output alongside the input. This would get links and references correct for free and removes the work of copying around a lot of bytes:

du -h /home/nolan/sources/idle-cycles/posts/{static,*/static}

8.0K	/home/nolan/sources/idle-cycles/posts/static
8.0K	/home/nolan/sources/idle-cycles/posts/2015/static
4.5M	/home/nolan/sources/idle-cycles/posts/2016/static
208K	/home/nolan/sources/idle-cycles/posts/2017/static
924K	/home/nolan/sources/idle-cycles/posts/2018/static
1.4M	/home/nolan/sources/idle-cycles/posts/2019/static
7.3M	/home/nolan/sources/idle-cycles/posts/2020/static
1.4M	/home/nolan/sources/idle-cycles/posts/2021/static

Post Titles to Slugs

I actually forgot I did this in my Python version of the site generator but I've helpfully left myself this note:

def slugify(text, escape=True):
    '''
    Build hyphenated post slugs from raw text. RFC3986 requires percent
    encoding for UCS/unicode points.

    >>> slugify("Wow, 2015 has \"Come and Gone\" already! It's amazing.")
    'wow-2015-has-come-and-gone-already-its-amazing'

    >>> slugify("λ is a lambda")
    '%CE%BB-is-a-lambda'
    '''

It looks like the following pieces of syntax have been used in post titles and will require handling:

:
(
)
!
,
?
'
π

Now that I am actually looking closely, I think I've inadvertently managed to disable escaping for Unicode characters. Looking at Leibniz Formula for π there is no translation of the Unicode character (in UTF-8, #xCF #x80) to the percent-encoded form in the build output. I guess the browser is smarter than RFC3986 gives them any right to be.

Something like iswalnum (is wide alpha numeric) looks promising, but I don't really understand the full implications of Unicode in C. Some basic searching indicates wchar_t is a historical accident and should not be used. I'm a little flummoxed by that but the following is a little gross looking and appears equivalent for my use case.

I think it makes sense to use the following test cases for interesting titles, each has a trailing character that should be stripped:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <ctype.h>
#include <string.h>

void checkOne(char *wideChar);

int main(void)
{
    char *titles[] = {
        "Back to Org?",
        "Leibniz Formula for π",
        "Why Bother With Markdown?",
        "Yak Shaving and Data Munging (with J)",
        "Poor Man's System Monitoring", // small hiccup, don't turn thing's into thing-s
        "Linear Congruence Generator (in Forth)",
        "Wow, 2015 has \"Come and Gone\" already! It's amazing."
    };

    for (int i = 0; i < (sizeof(titles) / sizeof(titles[0])); i++) {
        checkOne(titles[i]);
    }
}

void checkOne(char *s)
{
    size_t n = strlen(s) + 1;
    char out[n];
    int i = 0;
    bool previouslyIgnored = false;

    while (*s) {
        if (ispunct(*s) || isspace(*s)) {
            if (!previouslyIgnored && (*s != '\'')) {
                out[i++] = '-';
                previouslyIgnored = true;
            }
        } else if (isspace(*s)) {
            out[i++] = '-';
            previouslyIgnored = true;
        } else {
            out[i++] = tolower(*s);
            previouslyIgnored = false;
        }
        s++;
    }
    out[i] = '\0';

    while ((i > 0) && (out[--i] == '-')) {
        ;
    }
    out[i + 1] = '\0';

    printf("%s\n", out);
}

back-to-org
leibniz-formula-for-π
why-bother-with-markdown
yak-shaving-and-data-munging-with-j
poor-mans-system-monitoring
linear-congruence-generator-in-forth
wow-2015-has-come-and-gone-already-its-amazing

I am more than a little surprised that with the above poorly considered code I seem to have reproduced the existing "slugify" use cases with 100% compatibility.

Post Headers and Footers

I'm thinking of taking the easiest possible way out on this one and simply baking the header into a string within the program and writing it out to a file before the post body. After all, why not?

The result would be totally gross but I'm trying to disabuse myself of some ideas I have about "clean code". I imagine it would look something like this:

char* firstHtmlPart = "<html><head>...<title>";
char* title = post.title;
char* LastHtmlPart = "</title></head><body>...";

FILE *fp = fopen("...", "w");
fprintf(fp, firstHtmlPart);
...

It is difficult to see why this would be especially burdensome when, to my recollection, I have changed the site header perhaps twice in five years. Further, there doesn't seem to be any real need to materialize the whole output as a string in memory. I don't see why incremental writes wouldn't work (yet).

Write Post to File

The above actually encapsulates another necessary piece of the puzzle which is writing the post body. There is no need to materialize the entire generated content into memory so just sling it out to disk piecemeal!

Structured Post Data

Sorting Posts By Date

This one should be relatively easy because I have been using a lexographically sortable date format for years:

#include <string.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char *date;
} Post;

int post_comparator_chronological(const void *v1, const void *v2)
{
    const Post *p1 = (Post *) v1;
    const Post *p2 = (Post *) v2;
    return strcmp(p1->date, p2->date);
}

int post_comparator_chronological_reverse(const void *v1, const void *v2)
{
    const Post *p1 = (Post *) v1;
    const Post *p2 = (Post *) v2;
    return strcmp(p2->date, p1->date);
}

int main()
{
    Post p0 = { "2021-11-30" };
    Post p1 = { "2020-12-03" };
    Post p2 = { "2020-11-30" };

    Post p[] = { p0, p1, p2 };

    printf("ascending:\n-----\n");
    qsort(p, 3, sizeof(Post), post_comparator_chronological);
    for (int i = 0; i < 3; i++) {
        printf("%s\n", p[i].date);
    }
    printf("descending:\n-----\n");
    qsort(p, 3, sizeof(Post), post_comparator_chronological_reverse);
    for (int i = 0; i < 3; i++) {
        printf("%s\n", p[i].date);
    }

    return 0;
}

ascending:
-----
2020-11-30
2020-12-03
2021-11-30
descending:
-----
2021-11-30
2020-12-03
2020-11-30

Parsing HTML

On the surface this sounds pretty onerous and I am not particularly excited about it. While I have considered pulling in a dependency like html-tidy or gumbo, I don't yet feel like I really understand how they are intended to be used.

The current HTML parsing is necessary for:

extracting title
extracting date
extracting first paragraph

I am currently considering the (generally) regular nature of all of the HTML that make up those three pieces of information. Surveying all of the posts written to date reveals the following structure:

  <h1>text</h1>
  <span>YYY-MM-DD</span>
  (whitespace)?
  (div)?
  <p>text</p>

The instances of whitespace are unimportant in HTML and can be ignored. The div encloses a table of contents in two different posts^1,2. While the paragraph tag does not nest, the div does and the entire contents of the div should be skipped in capturing the first paragraph.

Extracting Text into Structured Data

So far I've got something like the following laid out:

#include <stdio.h>

struct Post {
    char *title;
    char *date;
};

int main()
{
    struct Post p[] = {
        {.title = "first post",
         .date = "2021-11-30" },
        {.title = "second post",
         .date = "2020-11-30" },
        {.title = "third post",
         .date = "2020-12-03" }
    };

    for (int i = 0; i < (sizeof(p) / sizeof(struct Post)); i++) {
        printf("%s\n%s\n---\n", p[i].title, p[i].date);
    }

    return 0;
}

first post
2021-11-30
---
second post
2020-11-30
---
third post
2020-12-03
---

Just a title string and a date string, stuffed into a structure with no other consideration. Additionally, I need the first paragraph for the index page and the body of the post (which is really the entire file contents). It occurs to me though that the first paragraph (or "leader") is really a substring of the body and I might avoid duplicating data by instead tracking the span of such a substring:

<h1>Bulk Data Generation in SQLite</h1>
<span>2021-07-18</span>
<p>
  Browsing the internet this afternoon I
  saw <a href="https://avi.im/blag/2021/fast-sqlite-inserts/">an
  article about fast SQLite inserts</a> posted online. As a fun
  exercise I thought to try things out myself.
</p>

<h2>Premise</h2>
<p>
…

#include <stdio.h>

typedef struct Span {
    int start;
    int end;
} Span;

typedef struct Post {
    char *title;
    char *date;
    char *body;
    Span leader;
} Post;

int main()
{
    Post p[] = {
        {.title = "first post",
         .date = "2021-11-30",
         .body =
         "a title string\n2020-01-02\n<p>Imagine this is drawn from a file</p>\nPretend it is very long",
         .leader = { 26, 66} },
        {.title = "second post",
         .date = "2020-11-30",
         .body =
         "another title\n2020-03-04\n<p>It might be multiple lines long \nand of indeterminate length</p>Maybe it goes on for pages and pages",
         .leader = { 25, 92} },
        {.title = "third post",
         .date = "2020-12-03",
         .body =
         "the last title\n2020-05-06\n<p>But even so the span or extent should be able to index into the body of the whole</p>\n\n\nwho knows",
         .leader = { 26, 114} }
    };

    for (int i = 0; i < (sizeof(p) / sizeof(Post)); i++) {
        printf("%s\n%s\n---\n", p[i].title, p[i].date);
        for (int j = p[i].leader.start; j < p[i].leader.end; j++) {
            printf("%c", p[i].body[j]);
        }
        printf("\n\n");
    }

    return 0;
}

first post
2021-11-30
---
<p>Imagine this is drawn from a file</p>

second post
2020-11-30
---
<p>It might be multiple lines long 
and of indeterminate length</p>

third post
2020-12-03
---
<p>But even so the span or extent should be able to index into the body of the whole</p>

It seems as though I will obviously have another level of indirection for the strings but I'm not yet sure whether this is a premature move to optimize things or a truly better solution. I guess I'll try it out and see how much pain it causes (presumably in getting the "leader" out later).

How Do You "Right Size" an Array?

This seems totally sufficient, just set some starting size and double it in the case you reach that point.

#include <stdio.h>
#include <stdlib.h>

int main()
{
    size_t someSize = 10;
    int *arr = calloc(someSize, sizeof(int));

    if (!arr) {
        exit(EXIT_FAILURE);
    }

    for (int i = 0; i < 50; i++) {
        if (i >= someSize) {
            printf
                ("resizing array from %zu to %zu based on current index %d\n",
                 someSize, someSize * 2, i);

            someSize *= 2;
            int *tmp = realloc(arr, someSize * sizeof(int));
            if (tmp) {          /* what is the correct way to do this? */
                arr = tmp;
            } else {
                exit(EXIT_FAILURE);
            }

        }
        arr[i] = i;
    }
}

resizing array from 10 to 20 based on current index 10
resizing array from 20 to 40 based on current index 20
resizing array from 40 to 80 based on current index 40

Creating Index and Archive Pages

#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <ctype.h>

void slugify(char *s, char *out);

typedef struct Span {
    int start;
    int end;
} Span;

typedef struct Post {
    char *title;
    char *date;
    char *body;
    Span leader;
} Post;

int main()
{
    Post p[] = {
        {.title = "First Post",
         .date = "2021-11-30",
         .body =
         "a title string\n2020-01-02\n<p>Imagine this is drawn from a file</p>\nPretend it is very long",
         .leader = { 26, 66} },
        {.title = "SECOND POST",
         .date = "2020-11-30",
         .body =
         "another title\n2020-03-04\n<p>It might be multiple lines long \nand of indeterminate length</p>Maybe it goes on for pages and pages",
         .leader = { 25, 92} },
        {.title = "Third post",
         .date = "2020-12-03",
         .body =
         "the last title\n2020-05-06\n<p>But even so the span or extent should be able to index into the body of the whole</p>\n\n\nwho knows",
         .leader = { 26, 114} }
    };

    printf("<h1>Index Page</h1>\n\n");

    for (int i = 0; i < (sizeof(p) / sizeof(Post)); i++) {
        size_t n = strlen(p[i].title) + 1;
        char out[n];
        slugify(p[i].title, out);

        printf
            ("<h2><a href=\"FIXME/%s.html\">%s</a></h2>\n<span>%s</span>\n<p>\n",
             out, p[i].title, p[i].date);
        for (int j = p[i].leader.start + 3; j < p[i].leader.end - 4; j++) {
            printf("%c", p[i].body[j]);
        }
        printf("</p>\n\n");
    }

    printf("<h1>Archive Page</h1>\n\n");
    printf("<ul>\n");
    for (int i = 0; i < (sizeof(p) / sizeof(Post)); i++) {

        size_t n = strlen(p[i].title) + 1;
        char out[n];
        slugify(p[i].title, out);

        printf("<li>%s <a href=\"FIXME/%s.html\">%s</a></li>\n", p[i].date,
               out, p[i].title);
    }
    printf("</ul>\n");

    return 0;
}

void slugify(char *s, char *out)
{
    int i = 0;
    bool previouslyIgnored = false;

    while (*s) {
        if (ispunct(*s) || isspace(*s)) {
            if (!previouslyIgnored && (*s != '\'')) {
                out[i++] = '-';
                previouslyIgnored = true;
            }
        } else if (isspace(*s)) {
            out[i++] = '-';
            previouslyIgnored = true;
        } else {
            out[i++] = tolower(*s);
            previouslyIgnored = false;
        }
        s++;
    }
    out[i] = '\0';

    while ((i > 0) && (out[--i] == '-')) {
        ;
    }
    out[i + 1] = '\0';
}

Index Page

First Post

2021-11-30

Imagine this is drawn from a file

SECOND POST

2020-11-30

It might be multiple lines long and of indeterminate length

Third post

2020-12-03

But even so the span or extent should be able to index into the body of the whole

Archive Page

2021-11-30 First Post
2020-11-30 SECOND POST
2020-12-03 Third post

Obviously I have overlooked a detail of the post hierarchy within my site which is that each post is nested within a directory named for the year in which the post was written. I think the easiest thing to do is stack allocate a fixed size character array derived from the full date:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

typedef struct Span {
    int start;
    int end;
} Span;

typedef struct Post {
    char *title;
    char *date;
    char *body;
    Span leader;
} Post;

int main()
{
    Post p[] = {
        {.title = "First Post",
         .date = "2021-11-30",
         .body =
         "a title string\n2020-01-02\n<p>Imagine this is drawn from a file</p>\nPretend it is very long",
         .leader = { 26, 66} },
        {.title = "SECOND POST",
         .date = "2020-11-30",
         .body =
         "another title\n2020-03-04\n<p>It might be multiple lines long \nand of indeterminate length</p>Maybe it goes on for pages and pages",
         .leader = { 25, 92} },
        {.title = "Third post",
         .date = "2020-12-03",
         .body =
         "the last title\n2020-05-06\n<p>But even so the span or extent should be able to index into the body of the whole</p>\n\n\nwho knows",
         .leader = { 26, 114} }
    };

    char year[5];
    printf("<ul>");
    for (int i = 0; i < (sizeof(p) / sizeof(Post)); i++) {

        for (int index = 0; index < 4; index++) {
            year[index] = p[i].date[index];
        }
        year[4] = '\0';

        printf("<li>%s <a href=\"%s/%s.html\">%s</a></li>\n", p[i].date,
               year, "placeholder", p[i].title);
    }
    printf("</ul>");

    return 0;
}

2021-11-30 First Post
2020-11-30 SECOND POST
2020-12-03 Third post

Warnings

I tried using clang-tidy to get a better sense of where I was misunderstanding C or relying on things working only incidentally. While it did catch two places I was over-allocating memory (allocating for sizeof Post instead of sizeof Post*, it has generally proven more annoying than productive. Specifically, every instance of this message is not particularly actionable:

warning: Call to function 'memcpy' is insecure as it does not provide security checks introduced in the C11 standard. 
Replace with analogous functions that support length arguments or provides boundary checks such as 'memcpy_s' in case of C11 
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]

Or this one:

warning: Call to function 'snprintf' is insecure as it does not provide security checks introduced in the C11 standard. 
Replace with analogous functions that support length arguments or provides boundary checks such as 'snprintf_s' in case of C11 
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]

Neither of the suggested replacements are available, as far as I can see. I could probably work out how to either include a 3rd party implementation or suppress the warning but I am not overly concerned enough to bother. I do wonder if I am using the tool incorrectly, although compiling under -Wall seems sufficient with the level of (manual) testing things have undergone.

Thoughts

Obviously the newer version of the program is faster owing mostly to the fact that it has to do less work. Not bothering to duplicate static assets and directory structures means the program really only smashes together a couple of files and strings. It is interesting to have spent time working on a problem-space I know well and finally having the tools to measure things more meaningfully. With an understanding of when and how files are read, and the memory allocations involved it is possible to observe the effect of the operating system caches in practice. Writing to /proc/sys/vm/drop_caches allows for purging inodes and pages in a way that provides a (reasonably) clean slate to test against. In a non-cache run with the C program there are hundreds of context switches and the CPU is largely idle, presumably waiting on memory accesses. With a cached run the CPU runs around 90% and has single-digit context switches.

Interestingly the Python program does not suffer as greatly from cache misses, presumably because of how poor the memory access is in the first place. Regardless of the cache state there are always thousands of context switches.

Overall I am surprised to find there was no single point that brought progress to a halt. Laying out all the pieces I required and simply reading documentation lead to a workable solution in about 500 lines of code. The dependencies are (to my knowledge) all basic system header files. The program probably only runs on UNIX-like systems (or perhaps POSIX systems?) and I'm perfectly content with that limitation.

I understand why you wouldn't munge strings together to form HTML and XML in a serious or real project but I am coming around to the idea that fewer projects fit such a description. Is it worth pulling in half the internet for dependencies when you can just write only what you need yourself? I'm not sure.

Bonus Thought

I didn't bother calling free on any of the memory allocations in my program. The whole thing is a one-shot and done which requires the operating system clean things up afterwards anyhow. This might not be standard practice but it is novel to have the tradeoff available.