I have been reading more about C programming and, casting about for a project, I am thinking of writing another static site generator. It is a small, self-contained project with a well-known scope, and I have real data.
After several years of use I no longer want or use the various (mis)features intended to make the current program reusable. I am perfectly happy to bake the few pieces of configuration into the program and require a recompile to change them. I am willing to trade compile time for run time because I use the program more often than I modify it. Moreover, editing a source file is as easy as editing a configuration file in the typical case.
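To be concrete, the kind of baked-in configuration I have in mind is nothing more than a handful of constants in a header; the names and values below are hypothetical stand-ins, not drawn from the real program:

/* config.h -- a minimal sketch; these names and values are hypothetical.
 * Changing any of them means recompiling, which is the whole point. */
#ifndef CONFIG_H
#define CONFIG_H

#define SITE_TITLE   "Idle Cycles"
#define POSTS_DIR    "/home/nolan/sources/idle-cycles/posts"
#define RECENT_POSTS 10  /* entries shown on the main page */

#endif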
Auto-generated content has been intentionally limited to just a "recent posts" main page, an archive page, and an Atom feed. While they were a notable inclusion in previous iterations of the project, templates have not proven to be very interesting or useful.
This is all done as an exercise; the amount of data is so small as to be trivially handled in almost any language:
find /home/nolan/sources/idle-cycles/posts -type f -name '*.post' | wc -l
116
du -h --exclude "*/static" /home/nolan/sources/idle-cycles/posts | sort -k2
1.4M	/home/nolan/sources/idle-cycles/posts
16K	/home/nolan/sources/idle-cycles/posts/2014
88K	/home/nolan/sources/idle-cycles/posts/2015
252K	/home/nolan/sources/idle-cycles/posts/2016
136K	/home/nolan/sources/idle-cycles/posts/2017
160K	/home/nolan/sources/idle-cycles/posts/2018
172K	/home/nolan/sources/idle-cycles/posts/2019
212K	/home/nolan/sources/idle-cycles/posts/2020
340K	/home/nolan/sources/idle-cycles/posts/2021
Did you know that gnuplot can do rudimentary statistical analysis? I love finding out new things!
reset
!stat -c '%s' /home/nolan/sources/idle-cycles/posts/*/*.post > /tmp/data.txt
set key left top
unset xtics
set ylabel "File Size (kB)"
f(x) = mean_y
fit f(x) '/tmp/data.txt' via mean_y
stddev_y = sqrt(FIT_WSSR / (FIT_NDF + 1))
plot mean_y-stddev_y title "standard dev." with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd", \
     mean_y+stddev_y notitle with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd", \
     mean_y title "mean file size" w l lc 6 lw 4, \
     '/tmp/data.txt' title 'single file' w p pt 7 lt 1
I have stitched the following together based on a few man pages:
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <libgen.h>

int neat_callback(const char *fpath, const struct stat *sb,
                  int tflag, struct FTW *ftwbuf)
{
    const char *ext = strrchr(fpath, '.');
    if (ext && (strcmp("post", ext + 1) == 0)) {
        /* dirname() and basename() may modify their argument, so work on copies */
        size_t n = strlen(fpath) + 1;
        char fpathCopyOne[n];
        char fpathCopyTwo[n];
        memcpy(fpathCopyOne, fpath, n);
        memcpy(fpathCopyTwo, fpath, n);
        printf("path: %s file: %s\n", dirname(fpathCopyOne), basename(fpathCopyTwo));
    }
    return 0;
}

int main(int argc, char *argv[])
{
    if (nftw("/tmp/idle-cycles/posts", neat_callback, 10, 0) == -1) {
        perror("junk program");
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}
path: /tmp/idle-cycles/posts/2017 file: outlook-calendaring.post
path: /tmp/idle-cycles/posts/2017 file: belated-spring-cleaning.post
...
path: /tmp/idle-cycles/posts/2021 file: bikeshare-data.post
It appears to be sufficiently fast (meaning instantaneous for the current set of data):
$ /usr/bin/time /tmp/a.out
...
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 1416maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps
Whether this is fast enough is a difficult question for me because I'm not really sure how much work it is to open 100+ files and read them in their entirety. Based on the file traversal above, it would seem to be at least one or two orders of magnitude faster than I might have guessed. A non-negligible amount of the current work is spent shuffling all of the media associated with the different post entries from their source to a destination sibling directory. This is entirely useless work, and I am considering dropping the idea of a generated output directory and instead creating the output alongside the input. This would get links and references correct for free and remove the work of copying around a lot of bytes:
du -h /home/nolan/sources/idle-cycles/posts/{static,*/static}
8.0K	/home/nolan/sources/idle-cycles/posts/static
8.0K	/home/nolan/sources/idle-cycles/posts/2015/static
4.5M	/home/nolan/sources/idle-cycles/posts/2016/static
208K	/home/nolan/sources/idle-cycles/posts/2017/static
924K	/home/nolan/sources/idle-cycles/posts/2018/static
1.4M	/home/nolan/sources/idle-cycles/posts/2019/static
7.3M	/home/nolan/sources/idle-cycles/posts/2020/static
1.4M	/home/nolan/sources/idle-cycles/posts/2021/static
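As a sketch of what "output alongside the input" would mean, the .html path is just the .post path with the extension swapped in place. The helper below is hypothetical, not from the current program:

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map foo/bar.post to foo/bar.html in the same
 * directory. Assumes `out` is large enough for the result. */
static void output_path(const char *post_path, char *out, size_t outlen)
{
    const char *dot = strrchr(post_path, '.');
    size_t stem = dot ? (size_t)(dot - post_path) : strlen(post_path);
    snprintf(out, outlen, "%.*s.html", (int)stem, post_path);
}

int main(void)
{
    char out[256];
    output_path("/tmp/idle-cycles/posts/2021/bikeshare-data.post", out, sizeof out);
    printf("%s\n", out);  /* /tmp/idle-cycles/posts/2021/bikeshare-data.html */
    return 0;
}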
I actually forgot I did this in my Python version of the site generator but I've helpfully left myself this note:
def slugify(text, escape=True):
    '''
    Build hyphenated post slugs from raw text.

    RFC3986 requires percent encoding for UCS/unicode points.

    >>> slugify("Wow, 2015 has \"Come and Gone\" already! It's amazing.")
    'wow-2015-has-come-and-gone-already-its-amazing'

    >>> slugify("λ is a lambda")
    '%CE%BB-is-a-lambda'
    '''
It looks like the following pieces of syntax have been used in post titles and will require handling: question marks, parentheses, apostrophes, double quotes, exclamation points, commas, and at least one Unicode character (π).
Now that I am actually looking closely, I think I've inadvertently managed to disable escaping for Unicode characters. Looking at "Leibniz Formula for π", there is no translation of the Unicode character (in UTF-8, #xCF #x80) to the percent-encoded form in the build output. I guess the browser is smarter than RFC3986 gives it any right to be.
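For reference, if I ever want the RFC3986 behaviour back, percent-encoding is just a byte-level transformation of the UTF-8 encoding. A minimal sketch of my own (not the old Python logic):

#include <stdio.h>

/* A minimal sketch of RFC3986 percent-encoding, assuming the input is
 * UTF-8: each byte outside the unreserved set becomes %XX. */
static void percent_encode(const unsigned char *s)
{
    for (; *s; s++) {
        if ((*s >= 'a' && *s <= 'z') || (*s >= 'A' && *s <= 'Z') ||
            (*s >= '0' && *s <= '9') ||
            *s == '-' || *s == '.' || *s == '_' || *s == '~') {
            putchar(*s);
        } else {
            printf("%%%02X", *s);
        }
    }
    putchar('\n');
}

int main(void)
{
    percent_encode((const unsigned char *) "λ-is-a-lambda");  /* %CE%BB-is-a-lambda */
    percent_encode((const unsigned char *) "π");              /* %CF%80 */
    return 0;
}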
Something like iswalnum (is wide alphanumeric) looks promising, but I don't really understand the full implications of Unicode in C. Some basic searching indicates wchar_t is a historical accident and should not be used. I'm a little flummoxed by that, but the following is a little gross looking and appears equivalent for my use case.
I think it makes sense to use the following test cases for interesting titles; each has trailing characters that should be stripped:
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <ctype.h>
#include <string.h>

void checkOne(char *title);

int main(void)
{
    char *titles[] = {
        "Back to Org?",
        "Leibniz Formula for π",
        "Why Bother With Markdown?",
        "Yak Shaving and Data Munging (with J)",
        "Poor Man's System Monitoring", // small hiccup, don't turn thing's into thing-s
        "Linear Congruence Generator (in Forth)",
        "Wow, 2015 has \"Come and Gone\" already! It's amazing."
    };
    for (int i = 0; i < (sizeof(titles) / sizeof(titles[0])); i++) {
        checkOne(titles[i]);
    }
}

void checkOne(char *s)
{
    size_t n = strlen(s) + 1;
    char out[n];
    int i = 0;
    bool previouslyIgnored = false;

    while (*s) {
        /* cast to unsigned char so the ctype calls are defined for the
           UTF-8 bytes in titles like the one containing π */
        unsigned char c = (unsigned char) *s;
        if (ispunct(c) || isspace(c)) {
            /* collapse runs of punctuation/whitespace into a single dash,
               dropping apostrophes entirely */
            if (!previouslyIgnored && (*s != '\'')) {
                out[i++] = '-';
                previouslyIgnored = true;
            }
        } else {
            out[i++] = tolower(c);
            previouslyIgnored = false;
        }
        s++;
    }
    out[i] = '\0';
    /* strip any trailing dashes */
    while ((i > 0) && (out[--i] == '-')) {
        ;
    }
    out[i + 1] = '\0';
    printf("%s\n", out);
}
back-to-org
leibniz-formula-for-π
why-bother-with-markdown
yak-shaving-and-data-munging-with-j
poor-mans-system-monitoring
linear-congruence-generator-in-forth
wow-2015-has-come-and-gone-already-its-amazing
I am more than a little surprised that with the above poorly considered code I seem to have reproduced the existing "slugify" use cases with 100% compatibility.
I'm thinking of taking the easiest possible way out on this one and simply baking the header into a string within the program and writing it out to a file before the post body. After all, why not?
The result would be totally gross but I'm trying to disabuse myself of some ideas I have about "clean code". I imagine it would look something like this:
char *firstHtmlPart = "<html><head>...<title>";
char *title = post.title;
char *lastHtmlPart = "</title></head><body>...";

FILE *fp = fopen("...", "w");
fputs(firstHtmlPart, fp);  /* fputs, not fprintf: the fragment is data, not a format string */
...
It is difficult to see why this would be especially burdensome when, to my recollection, I have changed the site header perhaps twice in five years. Further, there doesn't seem to be any real need to materialize the whole output as a string in memory. I don't see why incremental writes wouldn't work (yet).
The above actually encapsulates another necessary piece of the puzzle, which is writing the post body. There is no need to materialize the entire generated content in memory, so just sling it out to disk piecemeal!
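Concretely, the piecemeal version is just a sequence of writes; a small self-contained sketch with placeholder strings:

#include <stdio.h>

/* A minimal sketch of the piecemeal write: nothing is built up in memory,
 * each fragment goes straight to the file. All strings are placeholders. */
int main(void)
{
    const char *firstHtmlPart = "<html><head><title>";
    const char *title = "Some Post Title";
    const char *lastHtmlPart = "</title></head><body>\n";
    const char *body = "<h1>Some Post Title</h1>\n<span>2021-11-30</span>\n<p>...</p>\n";

    FILE *fp = fopen("/tmp/example.html", "w");
    if (!fp) {
        return 1;
    }
    fputs(firstHtmlPart, fp);
    fputs(title, fp);
    fputs(lastHtmlPart, fp);
    fputs(body, fp);  /* in the real program: the whole .post file contents */
    fputs("</body></html>\n", fp);
    return fclose(fp) == 0 ? 0 : 1;
}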
This one should be relatively easy because I have been using a lexicographically sortable date format for years:
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char *date;
} Post;

int post_comparator_chronological(const void *v1, const void *v2)
{
    const Post *p1 = (Post *) v1;
    const Post *p2 = (Post *) v2;
    return strcmp(p1->date, p2->date);
}

int post_comparator_chronological_reverse(const void *v1, const void *v2)
{
    const Post *p1 = (Post *) v1;
    const Post *p2 = (Post *) v2;
    return strcmp(p2->date, p1->date);
}

int main()
{
    Post p0 = { "2021-11-30" };
    Post p1 = { "2020-12-03" };
    Post p2 = { "2020-11-30" };
    Post p[] = { p0, p1, p2 };

    printf("ascending:\n-----\n");
    qsort(p, 3, sizeof(Post), post_comparator_chronological);
    for (int i = 0; i < 3; i++) {
        printf("%s\n", p[i].date);
    }

    printf("descending:\n-----\n");
    qsort(p, 3, sizeof(Post), post_comparator_chronological_reverse);
    for (int i = 0; i < 3; i++) {
        printf("%s\n", p[i].date);
    }
    return 0;
}
ascending:
-----
2020-11-30
2020-12-03
2021-11-30
descending:
-----
2021-11-30
2020-12-03
2020-11-30
On the surface this sounds pretty onerous and I am not particularly excited about it. While I have considered pulling in a dependency like html-tidy or gumbo, I don't yet feel like I really understand how they are intended to be used.
The current HTML parsing is necessary for three things: the post title, the publication date, and the first paragraph. I am currently considering the (generally) regular nature of all of the HTML that makes up those three pieces of information. Surveying all of the posts written to date reveals the following structure:
<h1>text</h1>
<span>YYYY-MM-DD</span>
(whitespace)?
(div)?
<p>text</p>
The instances of whitespace are unimportant in HTML and can be ignored. The div encloses a table of contents in two different posts [1, 2]. While the paragraph tag does not nest, the div does, and the entire contents of the div should be skipped when capturing the first paragraph.
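Since the structure is this regular, the skip can be a simple scan that counts div depth. A rough sketch of what I mean (mine, and assuming bare, attribute-free tags):

#include <stdio.h>
#include <string.h>

/* A rough sketch: find the first <p> that is not inside a <div>, assuming
 * the regular structure above (no attributes on div/p, p does not nest). */
static const char *first_paragraph(const char *html)
{
    int depth = 0;  /* current <div> nesting depth */
    for (const char *s = html; *s; s++) {
        if (strncmp(s, "<div>", 5) == 0) {
            depth++;
        } else if (strncmp(s, "</div>", 6) == 0) {
            depth--;
        } else if (depth == 0 && strncmp(s, "<p>", 3) == 0) {
            return s;
        }
    }
    return NULL;
}

int main(void)
{
    const char *post =
        "<h1>title</h1>\n<span>2021-11-30</span>\n"
        "<div><p>table of contents</p></div>\n"
        "<p>the real leader</p>";
    const char *p = first_paragraph(post);
    if (p) {
        printf("%.22s\n", p);  /* prints: <p>the real leader</p> */
    }
    return 0;
}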
So far I've got something like the following laid out:
#include <stdio.h>

struct Post {
    char *title;
    char *date;
};

int main()
{
    struct Post p[] = {
        {.title = "first post", .date = "2021-11-30" },
        {.title = "second post", .date = "2020-11-30" },
        {.title = "third post", .date = "2020-12-03" }
    };
    for (int i = 0; i < (sizeof(p) / sizeof(struct Post)); i++) {
        printf("%s\n%s\n---\n", p[i].title, p[i].date);
    }
    return 0;
}
first post
2021-11-30
---
second post
2020-11-30
---
third post
2020-12-03
---
Just a title string and a date string, stuffed into a structure with no other consideration. Additionally, I need the first paragraph for the index page and the body of the post (which is really the entire file contents). It occurs to me though that the first paragraph (or "leader") is really a substring of the body and I might avoid duplicating data by instead tracking the span of such a substring:
<h1>Bulk Data Generation in SQLite</h1>
<span>2021-07-18</span>
<p>
  Browsing the internet this afternoon I saw
  <a href="https://avi.im/blag/2021/fast-sqlite-inserts/">an article about fast SQLite inserts</a>
  posted online. As a fun exercise I thought to try things out myself.
</p>
<h2>Premise</h2>
<p>
…
#include <stdio.h>

typedef struct Span {
    int start;
    int end;
} Span;

typedef struct Post {
    char *title;
    char *date;
    char *body;
    Span leader;
} Post;

int main()
{
    Post p[] = {
        {.title = "first post", .date = "2021-11-30",
         .body = "a title string\n2020-01-02\n<p>Imagine this is drawn from a file</p>\nPretend it is very long",
         .leader = { 26, 66 } },
        {.title = "second post", .date = "2020-11-30",
         .body = "another title\n2020-03-04\n<p>It might be multiple lines long \nand of indeterminate length</p>Maybe it goes on for pages and pages",
         .leader = { 25, 92 } },
        {.title = "third post", .date = "2020-12-03",
         .body = "the last title\n2020-05-06\n<p>But even so the span or extent should be able to index into the body of the whole</p>\n\n\nwho knows",
         .leader = { 26, 114 } }
    };

    for (int i = 0; i < (sizeof(p) / sizeof(Post)); i++) {
        printf("%s\n%s\n---\n", p[i].title, p[i].date);
        for (int j = p[i].leader.start; j < p[i].leader.end; j++) {
            printf("%c", p[i].body[j]);
        }
        printf("\n\n");
    }
    return 0;
}
first post
2021-11-30
---
<p>Imagine this is drawn from a file</p>

second post
2020-11-30
---
<p>It might be multiple lines long 
and of indeterminate length</p>

third post
2020-12-03
---
<p>But even so the span or extent should be able to index into the body of the whole</p>
It seems as though I will obviously have another level of indirection for the strings, but I'm not yet sure whether this is a premature move to optimize things or a truly better solution. I guess I'll try it out and see how much pain it causes (presumably in getting the "leader" out later).
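Since the body really is just the raw file contents, filling these structures starts with slurping whole files. A minimal sketch of the read I have in mind, under the assumption the files stay as small as they are:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A minimal sketch (not the final program): read an entire file into a
 * freshly allocated, NUL-terminated buffer. Returns NULL on failure. */
static char *slurp(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) {
        return NULL;
    }
    fseek(fp, 0, SEEK_END);
    long n = ftell(fp);
    rewind(fp);

    char *buf = (n >= 0) ? malloc((size_t) n + 1) : NULL;
    if (buf && fread(buf, 1, (size_t) n, fp) == (size_t) n) {
        buf[n] = '\0';
    } else {
        free(buf);
        buf = NULL;
    }
    fclose(fp);
    return buf;
}

int main(void)
{
    char *body = slurp("/tmp/idle-cycles/posts/2021/bikeshare-data.post");
    if (body) {
        printf("%zu bytes\n", strlen(body));
    }
    return 0;
}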
This one seems totally sufficient: just set some starting size and double it whenever you reach that point.
#include <stdio.h>
#include <stdlib.h>

int main()
{
    size_t someSize = 10;
    int *arr = calloc(someSize, sizeof(int));
    if (!arr) {
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < 50; i++) {
        if (i >= someSize) {
            printf("resizing array from %zu to %zu based on current index %d\n",
                   someSize, someSize * 2, i);
            someSize *= 2;
            int *tmp = realloc(arr, someSize * sizeof(int));
            if (tmp) {  /* what is the correct way to do this? */
                arr = tmp;
            } else {
                exit(EXIT_FAILURE);
            }
        }
        arr[i] = i;
    }
}
resizing array from 10 to 20 based on current index 10
resizing array from 20 to 40 based on current index 20
resizing array from 40 to 80 based on current index 40
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <ctype.h>

void slugify(char *s, char *out);

typedef struct Span {
    int start;
    int end;
} Span;

typedef struct Post {
    char *title;
    char *date;
    char *body;
    Span leader;
} Post;

int main()
{
    Post p[] = {
        {.title = "First Post", .date = "2021-11-30",
         .body = "a title string\n2020-01-02\n<p>Imagine this is drawn from a file</p>\nPretend it is very long",
         .leader = { 26, 66 } },
        {.title = "SECOND POST", .date = "2020-11-30",
         .body = "another title\n2020-03-04\n<p>It might be multiple lines long \nand of indeterminate length</p>Maybe it goes on for pages and pages",
         .leader = { 25, 92 } },
        {.title = "Third post", .date = "2020-12-03",
         .body = "the last title\n2020-05-06\n<p>But even so the span or extent should be able to index into the body of the whole</p>\n\n\nwho knows",
         .leader = { 26, 114 } }
    };

    printf("<h1>Index Page</h1>\n\n");
    for (int i = 0; i < (sizeof(p) / sizeof(Post)); i++) {
        size_t n = strlen(p[i].title) + 1;
        char out[n];
        slugify(p[i].title, out);
        printf("<h2><a href=\"FIXME/%s.html\">%s</a></h2>\n<span>%s</span>\n<p>\n",
               out, p[i].title, p[i].date);
        /* the +3 and -4 skip the leader's enclosing <p> and </p> tags */
        for (int j = p[i].leader.start + 3; j < p[i].leader.end - 4; j++) {
            printf("%c", p[i].body[j]);
        }
        printf("</p>\n\n");
    }

    printf("<h1>Archive Page</h1>\n\n");
    printf("<ul>\n");
    for (int i = 0; i < (sizeof(p) / sizeof(Post)); i++) {
        size_t n = strlen(p[i].title) + 1;
        char out[n];
        slugify(p[i].title, out);
        printf("<li>%s <a href=\"FIXME/%s.html\">%s</a></li>\n",
               p[i].date, out, p[i].title);
    }
    printf("</ul>\n");
    return 0;
}

void slugify(char *s, char *out)
{
    int i = 0;
    bool previouslyIgnored = false;

    while (*s) {
        unsigned char c = (unsigned char) *s;  /* keep ctype calls defined for UTF-8 bytes */
        if (ispunct(c) || isspace(c)) {
            if (!previouslyIgnored && (*s != '\'')) {
                out[i++] = '-';
                previouslyIgnored = true;
            }
        } else {
            out[i++] = tolower(c);
            previouslyIgnored = false;
        }
        s++;
    }
    out[i] = '\0';
    while ((i > 0) && (out[--i] == '-')) {
        ;
    }
    out[i + 1] = '\0';
}
Imagine this is drawn from a file
It might be multiple lines long and of indeterminate length
But even so the span or extent should be able to index into the body of the whole
Obviously I have overlooked a detail of the post hierarchy within my site, which is that each post is nested within a directory named for the year in which it was written. I think the easiest thing to do is stack-allocate a fixed-size character array derived from the full date:
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

typedef struct Span {
    int start;
    int end;
} Span;

typedef struct Post {
    char *title;
    char *date;
    char *body;
    Span leader;
} Post;

int main()
{
    Post p[] = {
        {.title = "First Post", .date = "2021-11-30",
         .body = "a title string\n2020-01-02\n<p>Imagine this is drawn from a file</p>\nPretend it is very long",
         .leader = { 26, 66 } },
        {.title = "SECOND POST", .date = "2020-11-30",
         .body = "another title\n2020-03-04\n<p>It might be multiple lines long \nand of indeterminate length</p>Maybe it goes on for pages and pages",
         .leader = { 25, 92 } },
        {.title = "Third post", .date = "2020-12-03",
         .body = "the last title\n2020-05-06\n<p>But even so the span or extent should be able to index into the body of the whole</p>\n\n\nwho knows",
         .leader = { 26, 114 } }
    };

    char year[5];
    printf("<ul>");
    for (int i = 0; i < (sizeof(p) / sizeof(Post)); i++) {
        for (int index = 0; index < 4; index++) {
            year[index] = p[i].date[index];
        }
        year[4] = '\0';
        printf("<li>%s <a href=\"%s/%s.html\">%s</a></li>\n",
               p[i].date, year, "placeholder", p[i].title);
    }
    printf("</ul>");
    return 0;
}
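As an aside, that byte-copying loop could also be a single snprintf with a precision specifier; a small equivalent sketch:

#include <stdio.h>

/* The same year extraction using a precision specifier instead of a loop. */
int main(void)
{
    const char *date = "2021-11-30";
    char year[5];
    snprintf(year, sizeof year, "%.4s", date);  /* copies "2021" plus the NUL */
    printf("%s\n", year);
    return 0;
}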
I tried using clang-tidy to get a better sense of where I was misunderstanding C or relying on things working only incidentally. While it did catch two places I was over-allocating memory (allocating for sizeof Post instead of sizeof Post*), it has generally proven more annoying than productive. Specifically, every instance of this message is not particularly actionable:
warning: Call to function 'memcpy' is insecure as it does not provide security checks introduced in the C11 standard. Replace with analogous functions that support length arguments or provides boundary checks such as 'memcpy_s' in case of C11 [clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
Or this one:
warning: Call to function 'snprintf' is insecure as it does not provide security checks introduced in the C11 standard. Replace with analogous functions that support length arguments or provides boundary checks such as 'snprintf_s' in case of C11 [clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
Neither of the suggested replacements is available, as far as I can see. I could probably work out how to either include a third-party implementation or suppress the warning, but I am not concerned enough to bother. I do wonder if I am using the tool incorrectly, although compiling under -Wall seems sufficient given the level of (manual) testing things have undergone.
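For completeness, if I ever do bother, my understanding is that clang-tidy honors NOLINT markers naming the check to suppress, something like this (a sketch I have not verified against my setup):

#include <string.h>

/* A sketch of silencing one clang-tidy finding in place; the NOLINT
 * marker names the exact check being suppressed on that line. */
int main(void)
{
    char src[] = "hello";
    char dst[sizeof src];
    memcpy(dst, src, sizeof src); // NOLINT(clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling)
    return (int) dst[0];
}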
That's everything! At this point, if you're reading this then I've got things working to the point where I can (not to say I absolutely will) replace my prior Python program with a C version. I did "cheat" in a few places, replicating a problem with datetimes that the prior version had as well. The issue is only that I don't really handle the time portion of dates and ignore UTC offsets. I don't write frequently enough for it to matter much in the first place, so sue me.
Obviously the newer version of the program is faster, owing mostly to the fact that it has to do less work. Not bothering to duplicate static assets and directory structures means the program really only smashes together a couple of files and strings. It is interesting to have spent time working on a problem space I know well and to finally have the tools to measure things more meaningfully. With an understanding of when and how files are read, and of the memory allocations involved, it is possible to observe the effect of the operating system caches in practice. Writing to /proc/sys/vm/drop_caches allows for purging inodes and pages in a way that provides a (reasonably) clean slate to test against. In a non-cached run the C program incurs hundreds of context switches and the CPU is largely idle, presumably waiting on memory accesses. In a cached run the CPU sits around 90% and there are single-digit context switches.
Interestingly, the Python program does not suffer as greatly from cache misses, presumably because its memory access patterns are poor in the first place. Regardless of cache state there are always thousands of context switches.
Overall I am surprised to find there was no single point that brought progress to a halt. Laying out all the pieces I required and simply reading documentation led to a workable solution in about 500 lines of code. The dependencies are (to my knowledge) all basic system header files. The program probably only runs on UNIX-like systems (or perhaps POSIX systems?) and I'm perfectly content with that limitation.
I understand why you wouldn't munge strings together to form HTML and XML in a serious or real project, but I am coming around to the idea that fewer projects fit such a description. Is it worth pulling in half the internet for dependencies when you can write only what you need yourself? I'm not sure.
I didn't bother calling free on any of the memory allocations in my program. The whole thing is one-shot and done, which leaves the operating system to clean things up afterwards anyhow. This might not be standard practice, but it is novel to have the tradeoff available.