Teknikal's_Domain


Teknikaldomain.me Website Architecture Overview

2020-06-22 16 min read Behind the scenes Teknikal_Domain

So I just checked in with initialcommit.io, the website run by Jacob Stopak, the same person I collaborated with to help explain the internals of version control systems, not once, but twice even. He published an overview of how he made the site and what tools he used. As I was looking, I noticed that we took very different approaches to get to two similar endpoints.

If you just want to see what I did, read on. If you want to see the differences, or are just curious about the various ways that sites can be built, read his first, then come back with that knowledge in mind.

A Foreword

This is… just a blog. Jacob is running a lot more than just a blog, so his needs are different than mine. The blog-focused portions, however, can be compared on a more eye-to-eye level… though not completely. There is one major difference: mine is a static site. It has no backend processing, and there is nothing to it other than the content that you see.

My Setup

Goals, Tools, and Reasoning

See, I knew from the beginning what I wanted to build: just a place for me to put fun facts, findings, and other trivia that gives me the nickname “Thornpedia” in some friend circles (one can never have too many nicknames), and “that guy that I swear just knows everything” in others. Really, if I wanted, I’d grab MediaWiki core and make a wiki of everything I know or everything someone has asked me. Eventually it’d get so populated that searching anything on there is just opening the Pandora’s Box of knowledge.

But no, I didn’t want something like that. While possible, I’d be using the wrong tools. Plus, I like my learning experiences, I wanted my chance to try something I hadn’t before, see where it led.

My main goals were to have something that’s 100% customizable, lightweight, and didn’t require that much backend work, if any. I wanted something mainly targeted at blog-type sites, also because that means they don’t have many back-end requirements.

For a while, I had tried Octopress, the sort of hacker version of WordPress, but, well, take one look at that homepage: the last post is about the “upcoming” version 3, a little over 5.5 years ago. The GitHub repo was last touched in early 2016. This project is, to me, officially dead.

I wanted a CMS tool, not a platform. Things like Wix, Squarespace, all the good ones, are targeted at people who, for the most part, have no clue how this works, want to type some text, and look good. Aesthetically they may have me beat with modern templates and the like, but in terms of making my site mine? Oh no, they’re nowhere near what I can do with a little tool that takes easy-to-write content in, and puts good and ready HTML out.

I could’ve used another standard CMS (Content Management System) like Drupal, or any of the few that Cloudflare has rules for (more on that later), but I really didn’t want to. PHP-based ones are… well, not the best, mainly because keeping PHP in check and up to date gets to be a real pain, and for many (especially WordPress), it gets obvious quick that they’re built with a basic out-of-the-box CMS. Additionally, a common CMS will have known exploits: I see a lot of firewall hits for known WP and Drupal exploits.

Or maybe go start-to-finish with a custom back and front, say a standard web JavaScript framework (oh god no), and a backend like a Node.js server, something where the backend program is its own server. Besides being massively complex, I decided to eliminate these the moment I decided I wasn’t having a backend-reliant system.

Hugo

I did not strictly require a static generator, but then Hugo fell into my lap. Hugo describes itself as “The world’s fastest framework for building websites,” and in my experience I can concur that it’s fast - most builds take less than one second, and always less than 3. The most recent one looked like this:

Building sites …
                   | EN
-------------------+------
  Pages            | 295
  Paginator pages  |  23
  Non-page files   |  66
  Static files     |  77
  Processed images |  34
  Aliases          | 199
  Sitemaps         |   1
  Cleaned          |   0

Total in 836 ms

0.8 seconds to build out 295 pages, 66 other files, 77 static files to just copy over, process 34 images, and generate a full sitemap. Friends, that is what I call fast. Hugo generates static content, which means nothing more than HTML, CSS, and JavaScript. Hugo takes its input in the form of Markdown files with a bit of YAML, TOML, or JSON front-matter, and a directory full of templates: HTML with special markup that tells it where to put things, repeat things, the like. Hugo’s templating system is very robust, but those templates are the core of Hugo. Without a good set of templates, there is nothing.

A template set is called a theme. I use a (modified) theme called Bilberry, and there’s always a link in the very bottom-right corner to my fork of it.

Here’s the raw file for the index page, the one that shows if you go to teknikaldomain.me itself:

{{ define "main" }}

{{ $paginator := .Paginate (where (where .Site.RegularPages "Type" "ne" "page" | intersect (where .Site.RegularPages "Params.excludefromindex" "==" nil)) ".Params.draft" "ne" true) (index .Site.Params "paginate" | default 7) }}

    {{ if .Site.Params.pinnedPost }}
        {{ if (and .Site.Params.pinOnlyToFirstPage (ne $paginator.PageNumber 1)) }}
            {{/* Do nothing if the pinOnlyToFirstPage flag is set and we're not on page 1. */}}
        {{else}}
            {{ range first 1 (where .Data.Pages "URL" .Site.Params.pinnedPost) }}
                {{ partial "article-wrapper" . }}
            {{end}}
        {{end}}
    {{end}}

    {{ range $paginator.Pages }}
        {{ partial "article-wrapper" . }}
    {{ end }}

    {{ partial "paginator" . }}
{{ end }}

That’s an HTML file… or it would be.

The template for a “gallery” post is this:

<a class="bubble" href="{{ .Permalink }}">
    <i class="fas fa-fw {{ or .Params.icon "fa-camera" }}"></i>
</a>

<article class="gallery">
    {{ if and (isset .Params "gallery") (ne .Params.gallery "") }}
    <div class="flexslider">
        <ul class="slides">
            {{ range .Params.gallery }}
            <li><img src="{{ . | relURL }}" /></li>
            {{ end }}
        </ul>
    </div>
    {{ else if ne .Params.imageSlider false }}
    <div class="flexslider">
        <ul class="slides">
            {{ if and (.Site.Params.resizeImages | default true) (.Params.resizeImages | default true) }}
            {{ range .Resources.ByType "image" }}
            <li><img src="{{ (.Fill "700x350 q95").RelPermalink }}" /></li>
            {{ end }}
            {{ else }}
            {{ range .Resources.ByType "image" }}
            <li><img src="{{ .RelPermalink }}" /></li>
            {{ end }}
            {{ end }}
        </ul>
    </div>
    {{ else}}
    {{ partial "featured-image" . }}
    {{ end }}

    {{ partial "default-content" . }}
    {{ partial "article-footer" . }}
</article>

I could keep going, but you get the point: everything is a template. Hugo is kinda like a programming language: it has a lot of functionality in the core, but you need to write the right templates, and give them the right input, to get some amazing results out.

Hugo’s functionality is pretty much near infinitely-extendable, only limited by what the templates allow. And uh… I can just change those, modify my colors, modify the search plugin (which is just a JS library), modify comments (another JS library), modify whatever. Hugo’s “preview” mode uses LiveReload and WebSockets, so that every time I save a document in the tree when it’s enabled, my localhost:1313 window will automatically trigger a page refresh with the new content.

So here I can write my articles, with a block of “front-matter” at the beginning, a lot of key-value data for things like date, author, when to publish, title, are comments allowed, the works, and the folder I put them in determines their type, of which I have a few to play with in this theme, and I can always make more.
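To make the front-matter idea concrete, here is a rough Python sketch of how a post file splits into its key-value block and its Markdown body. This is not how Hugo itself parses anything; the sample post and its values are made up, and the parser only understands flat YAML-style key: value lines between two --- markers.

```python
def split_front_matter(text: str):
    """Split a post into (front-matter dict, Markdown body).

    Only handles flat `key: value` pairs between two `---` lines; real
    YAML parsing (lists, nesting, quoting rules) is deliberately left out.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front-matter block at all
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, text  # unterminated block, treat as plain body
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)  # split on the first colon only
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


# Hypothetical post, for illustration only
SAMPLE = """---
title: "Example Post"
date: 2020-06-22T18:00:00-04:00
author: Teknikal_Domain
draft: false
---

Post body in *Markdown*.
"""

meta, body = split_front_matter(SAMPLE)
print(meta["title"], "|", body)
```

Everything above the second --- is metadata for the templates to use, and everything below it is the content that gets rendered.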

The next step of development is, after writing everything in pretty standard Markdown, getting stuff to the server. Now I could, if I wanted, just rsync the data and only transfer what I need. But since I need Hugo to run every time a scheduled post is due for release, I need something more aware. For this I have my own Git repository on my servers, and the web side can git pull. I’m in the process of turning that into a web hook, so it’ll pull automatically every time I push.
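Since that web hook doesn't exist yet, here is only a minimal sketch of what one could look like, using nothing but the Python standard library. Everything specific in it is an assumption: the shared secret, the script path, the port, and the X-Hub-Signature-256 header name (borrowed from GitHub's webhook convention; another Git server may sign requests differently or not at all).

```python
import hashlib
import hmac
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

SECRET = b"change-me"                        # hypothetical shared secret
UPDATE_SCRIPT = "/home/jack/update-blog.sh"  # assumed location of the update script


def signature_ok(body: bytes, header, secret: bytes = SECRET) -> bool:
    """Check an HMAC-SHA256 signature of the request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison, done on bytes so odd header values can't raise
    return hmac.compare_digest(expected.encode(), (header or "").encode("utf-8"))


class PushHook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        if not signature_ok(body, self.headers.get("X-Hub-Signature-256")):
            self.send_response(403)  # unsigned or badly signed push
            self.end_headers()
            return
        subprocess.Popen([UPDATE_SCRIPT])  # start the rebuild, don't wait for it
        self.send_response(202)
        self.end_headers()


def serve(host: str = "127.0.0.1", port: int = 9000) -> None:
    """Listen for push notifications until interrupted."""
    HTTPServer((host, port), PushHook).serve_forever()
```

Calling serve() would start the listener. Checking the signature matters here because anything that can reach the endpoint could otherwise trigger a rebuild.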

Anyways, onto how your browser gets what you read.

Request Ingest

All requests go through a Cloudflare Pro plan, providing caching, DNS, a Web Application Firewall, rate-limiting, and all the good Cloudflare tools like image optimization, Argo routing, Workers, the fun stuff.

Cloudflare holds authority over the DNS zone, but they cannot be a registrar for the .me TLD, that’s Ionos’s job. They still call me asking who my web provider is… I pause for a second, then say “you’re speaking to the head of the development team for teknikaldomain.me's provider, how can I help you?” They catch on pretty quick that the sales pitch I was about to hear would be 100% wasted. Be nice, they do things for you.

Cloudflare also provides the TLS cert with certificate transparency monitoring, and is set to require a valid cert from my origin, which they can provide… or I can generate for free with Let’s Encrypt like I have everything else. Just looking, over the last 24 hours, 64% of my traffic was served over TLSv1.3, 29% was TLSv1.2, and 7% was… insecure.

Their firewall is configured with plenty of common rules. They have pre-defined sets for the Drupal, Joomla, Magento, Plone, and WordPress CMSes, and for things like WHMCS, PHP, Flash (wait really, Adobe Flash?!), and other stuff, as well as a number of sets of OWASP rules.

Next, I use Polish to help compress my images and serve them as WebP where possible, and Cloudflare will on-the-fly minify HTML, CSS, and JavaScript responses to make them smaller if possible. Cloudflare will also meddle with your HTTP/2 priorities, to make sure the most important resources are delivered first. They also provide Rocket Loader, a little script that defers running JavaScript until everything else is in place, so the page appears faster. In practice, it’s not that bad, it just means some things take a fraction of a second to pop up.

Their cache is set to respect the Cache-Control headers I send, and taking a look at how they manage it, 58% of all requests are “dynamic”, meaning cache bypass, because they do not cache the HTML content itself. Most responses that can be cached are misses, because the requests are either infrequent enough or from different enough locations that the cached copy has expired, but hey, it’s trying. I save on average 25-33% bandwidth letting Cloudflare do this without me optimizing things manually.

Argo is enabled, allowing smart routing of requests. I’m usually just below the traffic threshold for it to actually tell me how effective it is, but it’s working on around 80% of connections, nice.

Finally, workers.

I’ve talked about them before, but any “backend” work I do is likely going to have to be a worker on an endpoint, transforming data. My image CDN for example, is a worker.

Of Note

A paid option gives your workers a Key/Value store, and believe it or not, it’s possible to host Hugo serverless with just a Worker and this K/V store. Really cool but not what I want to pay for right now, though this is something that they have a guide on how to do if you don’t want to pay for a hosting provider, just a few bucks a month for that, maybe more if you go over the daily request limit (100,000).

HAProxy

After Cloudflare gets your request, assuming it was not cached, it gets delivered to my network, which, disregarding the rest of the infrastructure I have in my house, is bounded by an HAProxy instance, that routes, filters, and TLS-terminates all my domains and services. This will eventually route to an Ubuntu Server VM, where the web server lives.

Nginx

The server VM is running Nginx with a slightly modified configuration, serving content straight out of Hugo’s public/ output directory.

Hugo itself is triggered off a script that I run via SSH for now, that updates the repo, runs Hugo, uploads the search index data, cleans that up, and queues any future posts with at, so the same process gets run again when they come due. In theory, as long as I have one post still in the future, I never need to trigger a build manually, since the next upcoming one will do that for me, and add all pending ones to the list. If I need to make a correction that’s more immediate though, I’ll have to.

The CDN

The only extra part about this is the image CDN. Just about any image that’s not part of an ImageSlider or FeaturedImage is removed from the repo to save on space, it’s found inside an Amazon S3 bucket. This is the only case of AWS usage on this entire domain, storing large pieces of data. Really it’s images, but anything larger than 1 MB that is not required to be here in the project is offloaded (the irony? my server is set to upload LFS objects to another S3 bucket that I have, so it’s the same place either way really, just one is also saved locally). Before uploading I run a WebP convert, then push them, so that the CDN script can also auto-upgrade PNG requests to WebP if possible, since Polish won’t on Worker routes.

From there I just replace any large files or images with a special link, and it gets fetched from there and into the Cloudflare cache.
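The real CDN script is a Cloudflare Worker, so take the following as a Python sketch of just the “upgrade PNG to WebP if possible” decision, not the actual code. How “possible” gets decided is my assumption here: the client has to advertise image/webp in its Accept header, and a converted copy has to exist in the bucket (represented by a plain set of keys). The function name and paths are made up.

```python
from pathlib import PurePosixPath


def pick_object_key(path: str, accept_header: str, available: set) -> str:
    """Return the bucket key to serve for a requested image path.

    PNG requests are upgraded to a sibling .webp object when the client
    says it accepts WebP and that object actually exists.
    """
    p = PurePosixPath(path)
    wants_webp = "image/webp" in (accept_header or "").lower()
    if p.suffix.lower() == ".png" and wants_webp:
        candidate = str(p.with_suffix(".webp"))
        if candidate in available:
            return candidate
    return path  # fall back to exactly what was asked for


bucket = {"img/diagram.png", "img/diagram.webp", "img/photo.png"}
print(pick_object_key("img/diagram.png", "image/webp,image/*", bucket))
print(pick_object_key("img/photo.png", "image/webp,image/*", bucket))
```

The first request can be upgraded, the second cannot because no WebP copy was uploaded for it, which is why the conversion step before uploading matters.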

Search is provided by Algolia, which takes a giant JSON array of my posts, tags, categories, and stuff, and makes it instantly searchable at, so far, no extra cost to me. Every build regenerates this file, which I can then upload and compare to Algolia’s set to find any items I need to remove before the dead links start to pile up.

Comments

Comments are provided by Disqus, which I recently wrote about. I kinda don’t like their business model, but they’re everywhere, they do the job, and both Algolia and Disqus are built into this theme, so I’m using them for now.

Conclusion

In the end I have a nice, simple static site, and despite being laid out at length like this, it’s rather simple - a web server, a service or two, and my laptop with a Git repository. Nothing needs to keep state, or user data, or anything that needs any form of database or backend, and it’s easy enough to hugo server -D when I want to write something, take Atom (or emacs) and edit away, and then git add, git commit, and git push once I’m ready with everything. A quick ssh and ./update-blog.sh on the server-side, and we’re ready. And of course, if I try to do something that I do not have the functionality for… I can just write it.

Side Notes

HAProxy

HAP is also what’s responsible for setting cache TTLs and my Content Security Policy. Everything here I’ve talked about before, press “s” and then start typing to search for them, you’ll find an article pretty quickly if you check the name of something. Really, it’s the gatekeeper, or in my case, where almost all of the headers will come from.

Update Scripts

If you were curious, here’s what runs when I trigger a build:

update-blog.sh:

#!/bin/sh

cd ~/teknikaldomain.me/tekpro-blog
git pull
git submodule update
~/regen-blog.sh
~/check-futures.py

Ok, so the real magic is regen-blog.sh:

#!/bin/sh

cd ~/teknikaldomain.me/tekpro-blog
rm -r public/
hugo
for file in `grep localhost:1313 public/* -l -r`; do sed -i -e 's/http:\/\/localhost:1313/https:\/\/teknikaldomain.me/g' $file; done
for file in `grep http://teknika public/* -l -r`; do sed -i -e 's/http:\/\/teknikal/https:\/\/teknikal/g' $file; done
~/update-search-index.py

Those two for loops are cleaning up two mistakes I kept noticing in the output: one was that it was still using localhost:1313 for internal links, and the other was that it was giving insecure links. Not a direct problem, but why waste the redirect when you don’t need it? This also does the index updating:
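As an aside, the same cleanup the two for loops do could be sketched in Python. This is only an equivalent I'm offering for comparison, not what runs on the server, and the fix_links name is made up. It works on raw bytes so that images and other binary files pass through without any decoding trouble.

```python
from pathlib import Path

# (old, new) pairs, applied in order, mirroring the two sed expressions
REPLACEMENTS = [
    (b"http://localhost:1313", b"https://teknikaldomain.me"),
    (b"http://teknikal", b"https://teknikal"),
]


def fix_links(public_dir: str) -> int:
    """Rewrite stray URLs in every file under public_dir; return files changed."""
    changed = 0
    for path in Path(public_dir).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        fixed = data
        for old, new in REPLACEMENTS:
            fixed = fixed.replace(old, new)
        if fixed != data:  # only touch files that actually needed a fix
            path.write_bytes(fixed)
            changed += 1
    return changed
```

Running fix_links("public") after a build would do the job of both loops in one pass. The index update script that regen-blog.sh calls last, update-search-index.py, comes next: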

#!/usr/bin/python3

import json
import sys
from algoliasearch.search_client import SearchClient

# Initialize Algolia client
print("Connecting")
client = SearchClient.create("XXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
index = client.init_index("blog_index")

# Read local search objects from index.json
print("Reading local data")
with open("/home/jack/teknikaldomain.me/tekpro-blog/public/index.json") as idat:
    local_dat = json.load(idat)

# Push local data to Algolia
ldat_len = len(local_dat)
print(f"Got {ldat_len} records")
print("Saving records to index")
index.save_objects(local_dat, {"autoGenerateObjectIDIfNotExist": False})
print("Done")

# Read Algolia's object set
print("Checking index for dead records")
print("Reading index data")
idx_dat = list(index.browse_objects({"query": ""}))
idat_len = len(idx_dat)
print(f"Got {idat_len} records")

# Exit if no data (no more processing to do)
if not idx_dat:
    print("Index is empty")
    sys.exit(0)

print("Now cross-checking datasets")
print(f"This may need a maximum of {ldat_len * idat_len} computations")

# Dead object calculation
local_recs = set(local_rec["objectID"] for local_rec in local_dat)
index_recs = set(rec["objectID"] for rec in idx_dat)
dead_list = list(index_recs - local_recs) # Objects that THEY have that WE do not (outdated)

# Prune dead records, if there are any
if dead_list:
    for object_id in dead_list:
        print(f"Dead record: {object_id}")
    print(f"Pruning {len(dead_list)} records")
    index.delete_objects(dead_list) # Deleting all at once batches the calls
else:
    print("Index up to date")
print("Done")

This uses the Algolia Python client to connect and upload my index data, then downloads a copy of every record that they have, packs both into sets, subtracts one from the other, leaving me with just the ones they have that I don’t, and then I send a delete request for all of them.
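Stripped of the API calls, the dead-record check is just set subtraction. A tiny worked example, with made-up object IDs:

```python
# Hypothetical object IDs, for illustration only
local_ids = {"post/a", "post/b", "post/c"}          # what the fresh build produced
index_ids = {"post/a", "post/b", "post/old-draft"}  # what the search index holds

dead = index_ids - local_ids     # they have it, we don't: prune these
missing = local_ids - index_ids  # we have it, they don't: the upload covers these

print(sorted(dead))     # ['post/old-draft']
print(sorted(missing))  # ['post/c']
```

Because sets are hash-based, the subtraction takes roughly one lookup per indexed record on average, so the “maximum computations” figure the script prints is a very pessimistic upper bound.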

Algolia has a per-month API request limit, so I need to keep my request counts as small as possible.

..And yes, I know I need to work on my variable names, it was a script I made in 5 minutes that’s worked since and I’ve never found a reason to touch it again or refine it.

The only other script in here is check-futures.py, and security-conscious people may want to turn away now:

#!/usr/bin/python3

import re          # RegEx
import subprocess  # `at` and `hugo`
import datetime    # ....Guess
import sys         # Exit

### REGEX SETUP ###

# `atq` output example: 185       Fri Apr  3 18:00:00 2020 a jack
#         Matched area: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#        Captured area:           ^^^^^^^^^^^^^^^^^^^^^^^^
job_reg = re.compile(r'\d+\s+(\w{3} \w{3} +\d?\d \d{2}:\d{2}:\d{2} \d{4})')

# `hugo list future` output example: content/post/tcp-udp-and-sctp.md,2020-03-18T18:00:00-04:00
#                      Matched area:                                 ^^^^^^^^^^^^^^^^^^^^
#                     Captured area:                                  ^^^^^^^^^^^^^^^^^^^
future_reg = re.compile(r',(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})')

### DATE PARSING FORMAT STRINGS ###
job_date_fmt = '%a %b %d %H:%M:%S %Y'  # Wed Mar 11 21:19:00 2020
future_date_fmt = '%Y-%m-%dT%H:%M:%S'  # 2020-03-18T21:18:21
at_submit_fmt = '%H:%M %Y-%m-%d'       # 05:04 2020-03-13 (used for sending jobs to `at`)

### SHELL COMMAND DATA GATHER ###
at_proc = subprocess.Popen(['atq'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)
hugo_proc = subprocess.Popen(['hugo', 'list', 'future'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)

at_out, _ = at_proc.communicate()
hugo_out, hugo_err = hugo_proc.communicate()

### ABORTS ###

# Exit if `hugo` produces error
if hugo_err:
	print(f"Hugo error:\n{hugo_err}\nProcess aborted")
	sys.exit(1)

# Exit if no output (no futures, therefore nothing to worry about)
if not hugo_out:
	print("No future posts, process complete")
	sys.exit(0)

### MAIN PROCESSING ###

dates = []

# Existing jobs in queue?
if at_out:
	jobs = job_reg.findall(at_out)
	print(f"{len(jobs)} jobs in `at` queue")

	# Append all current jobs to known job dates
	for job in jobs:
		dates.append(datetime.datetime.strptime(job, job_date_fmt))

futures = future_reg.findall(hugo_out) # Get all posts with future publish date
print(f"{len(futures)} posts with a publishDate in the future")
for future in futures:
	date = datetime.datetime.strptime(future, future_date_fmt)
	if date in dates:
		print(f"Job on {date} is already queued") # Job will already trigger at that date
	else:
		print(f"Queueing job for {date}")
		subprocess.call(['at', '-f', '/home/jack/regen-blog.sh', date.strftime(at_submit_fmt)], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
		dates.append(date) # Add new job to list

This one uses a lot of subprocesses to interact with two commands: at and hugo. We take the output of hugo list future, fail if hugo errored, and RegEx that to get a list of dates. This is cross-checked with the result of atq, which lists every time at has a job scheduled to run. If there are unscheduled times, we use at to schedule them. The queued job only runs regen-blog.sh: there is no need to do another git pull, just to run hugo again to add the now in-date article to the post feed.
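To see the parsing in isolation, here are the two regular expressions and format strings from the script applied to the sample lines from its own comments. No subprocesses are involved, so this runs anywhere (assuming an English locale for the day and month names).

```python
import datetime
import re

# Same patterns and formats as check-futures.py
job_reg = re.compile(r'\d+\s+(\w{3} \w{3} +\d?\d \d{2}:\d{2}:\d{2} \d{4})')
future_reg = re.compile(r',(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})')
job_date_fmt = '%a %b %d %H:%M:%S %Y'
future_date_fmt = '%Y-%m-%dT%H:%M:%S'
at_submit_fmt = '%H:%M %Y-%m-%d'

# Sample lines taken from the comments in the script
atq_line = "185       Fri Apr  3 18:00:00 2020 a jack"
hugo_line = "content/post/tcp-udp-and-sctp.md,2020-03-18T18:00:00-04:00"

queued = datetime.datetime.strptime(job_reg.findall(atq_line)[0], job_date_fmt)
future = datetime.datetime.strptime(future_reg.findall(hugo_line)[0], future_date_fmt)
submit = future.strftime(at_submit_fmt)

print(queued)            # 2020-04-03 18:00:00
print(future)            # 2020-03-18 18:00:00
print(future == queued)  # False, so this post would get a new job
print(submit)            # 18:00 2020-03-18
```

Note that the second pattern stops before the -04:00 offset, so both sides are compared as naive local times. That works as long as the server's clock is in the same timezone the posts are dated in.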

And that, ladies and gentlemen, is the longest post I’ve created here, estimating a 16 minute read, 418 lines of source, 3455 words, and 22404 total characters.
