Automating Algolia Search Indexing

If you haven’t noticed by this point, there’s a brand new search bar at the top of this blog, and you can even get to it from anywhere by pressing “s”. Isn’t that neat? Anyways, there’s a bit of a story behind that, because getting those results in place, well… isn’t exactly easy.

Algolia

Algolia is a service that provides a backend for search systems. In essence, they’re the engine: you give them the content and put up a little search box, and they figure out what to return when someone types something in. Their content storage (the “index”) is a JSON-based data store that items (“records”) are retrieved from. The free tier gives you 10,000 live records and 50,000 API calls per month to play with before you have to upgrade to a paid plan. (Disclaimer: not sponsored. I just use it.) Querying is really easy. The hard part is loading data in, since you have to format it correctly yourself (at least on the free plan). Luckily, Hugo can do this.
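
For a sense of how easy the query side is, here’s a minimal sketch using Algolia’s official Python client. The app ID and API key are placeholders (a search-only key is enough for this), and a real site would normally query straight from the browser rather than a script:

#!/usr/bin/python3

from algoliasearch.search_client import SearchClient

# Placeholder credentials: a search-only API key is enough for queries
client = SearchClient.create("APP_ID", "SEARCH_ONLY_API_KEY")
index = client.init_index("blog_index")

# One call to Algolia; it handles the ranking, typo tolerance, and so on
results = index.search("raw photos")
for hit in results["hits"]:
    print(hit["title"], hit["url"])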

Some Example Records

{
  "iconClass": "fa-folder",
  "title": "Hardware",
  "type": "category",
  "url": "https://teknikaldomain.me/categories/hardware",
  "objectID": "https://teknikaldomain.me/categories/hardware"
}
{
  "author": "Teknikal_Domain",
  "categories": [
    "Photography",
    "Tech explained"
  ],
  "iconClass": "fa-pencil",
  "language": "en",
  "tags": [
    "DSLR",
    "Raw Photos"
  ],
  "title": "DSLR Raw Photos Explained",
  "type": "post",
  "url": "https://teknikaldomain.me/post/dslr-raw-photos-explained/",
  "objectID": "9e594f813dfa431d94687166fb1f7b5a"
}

Hugo

Hugo can, with a configuration tweak and the help of an index.json layout, output an index.json containing an array of Algolia-ready JSON objects. Those can either be copied and pasted in by hand, or uploaded automatically with a few API calls and a simple program. And today… we discuss that simple program.
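
First, though, a quick sketch of what that output should look like once parsed. The path here assumes Hugo’s default public/ directory, and the checks just confirm the shape the uploader below relies on: one flat array of records, each already carrying its own objectID like the examples above:

#!/usr/bin/python3

import json

# Path assumes Hugo's default output directory
with open("public/index.json") as idat:
    records = json.load(idat)

# One flat, top-level array; every record brings its own objectID
assert isinstance(records, list)
assert all("objectID" in record for record in records)
print(f"{len(records)} Algolia-ready records")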

That Program

Fun fact: I have a script that automatically builds and deploys this blog from the latest changes in the master Git repository. Currently it has to be run manually, but I’m working on that. Here’s what it looks like:

#!/bin/sh

cd ~/teknikaldomain.me/tekpro-blog
git pull
rm -r public/
hugo
for file in $(grep -rl 'localhost:1313' public/); do sed -i -e 's|http://localhost:1313|https://teknikaldomain.me|g' "$file"; done
for file in $(grep -rl 'http://teknikal' public/); do sed -i -e 's|http://teknikal|https://teknikal|g' "$file"; done
cd ~
./update-search-index.py

What this does, in order:

  1. Changes to the blog root
  2. Pulls the latest content from Git
  3. Deletes the previously generated content (Hugo doesn’t remove old files by itself)
  4. Builds the site HTML into public/
  5. Removes all references to the localhost development server left over from hugo server (why does it even keep these in?)
  6. Rewrites all HTTP links to self as HTTPS
  7. Runs the update script

So what we really want is in update-search-index.py. OK then:

#!/usr/bin/python3

import json
import sys

from algoliasearch.search_client import SearchClient

print("Connecting")
client = SearchClient.create("XXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
index = client.init_index("blog_index")

print("Reading local data")
with open("teknikaldomain.me/tekpro-blog/public/index.json") as idat:
    local_dat = json.load(idat)

ldat_len = len(local_dat)
print(f"Got {ldat_len} records")
print("Saving records to index")
index.save_objects(local_dat, {"autoGenerateObjectIDIfNotExist": False})
print("Done")

print("Checking index for dead records")
print("Reading index data")
idx_dat = list(index.browse_objects({"query": ""}))
idat_len = len(idx_dat)
print(f"Got {idat_len} records")

if not idx_dat:
    print("Index is empty")
    sys.exit(0)

print("Now cross-checking datasets")
print(f"This may need a maximum of {ldat_len * idat_len} computations")

local_recs = {rec["objectID"] for rec in local_dat}
index_recs = {rec["objectID"] for rec in idx_dat}
dead_list = list(index_recs - local_recs)
if dead_list:
    for object_id in dead_list:
        print(f"Dead record: {object_id}")
    print(f"Pruning {len(dead_list)} records")
    index.delete_objects(dead_list)
else:
    print("Index up to date")
print("Done")

I feel that the print()s in there mostly serve as comments, and I had a little external help making it run (closer to) optimally.1

The client and index objects hold a reference to my particular account (and API key) and the exact index I use for this blog; this is the trivial part.

Next, the index.json file is loaded with Python’s json library. Since the file is one big array at the top level, it parses directly into a list of objects. Once that’s loaded, save_objects() is called to push them all to the index. One API call is still consumed per record written, but passing the entire array lets the client batch the actual REST requests, saving latency and round-trips.
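
To make that concrete, here’s a contrived contrast with placeholder credentials and the same path assumptions as the real script; the loop spends a network round-trip on every record, while the single call lets the client batch them behind the scenes:

#!/usr/bin/python3

import json

from algoliasearch.search_client import SearchClient

# Placeholder credentials, mirroring the real script
client = SearchClient.create("APP_ID", "ADMIN_API_KEY")
index = client.init_index("blog_index")

with open("public/index.json") as idat:
    records = json.load(idat)

# Slow way: one request (and one round-trip) per record
for record in records:
    index.save_objects([record], {"autoGenerateObjectIDIfNotExist": False})

# What the deploy script does: one call, batched into as few requests as possible
index.save_objects(records, {"autoGenerateObjectIDIfNotExist": False})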

Once that’s done, browse_objects() is called with an empty query, which in effect makes one API call that returns everything in the index. The result is checked for the empty case, which is what you’d hit the first time the script runs: with no data in the index to compare against, it can exit immediately without further processing.

For verbosity’s sake, I calculate the maximum, worst-case number of comparisons a naive dead-record check would need. Sure, Python can make thousands of comparisons a second, but that number grows roughly as n^2, where n is the number of records (posts, plus category and tag pages), so it can climb quite a lot as I add content. The set-based approach below sidesteps most of that work; the number is just fun to print.

With everything prepped, we can begin the comparison. First, two sets of objectIDs are built, one from the local data and one from the data in the index, and then we just… subtract them. The index set will always contain every item the local set has (we just pushed them up!), so the subtraction leaves exactly the items that exist in the index but not locally… AKA dead items, the ones that would lead to a 404 if you clicked on them.
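
As a toy illustration (these objectIDs are made up, not real records from my index), the subtraction keeps only what’s stale:

#!/usr/bin/python3

# Hypothetical objectIDs, purely for illustration
index_recs = {"posts/keep-me", "posts/also-keep", "categories/old-stuff"}
local_recs = {"posts/keep-me", "posts/also-keep"}

# Everything the index has that the local build no longer does
dead_list = list(index_recs - local_recs)
print(dead_list)  # ['categories/old-stuff'] -- this one would 404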

Because this runs on every site deploy, every time new content goes up, the search box up there is immediately up to date: new items appear, and old ones cease to exist.

While it would be a lot less computationally heavy to just delete every object and upload a completely fresh dataset, that would also burn roughly 2n API calls (n deletes plus n re-uploads) out of my (not really) limited pool, compared to the n + 1 + m (m = number of records deleted) I use now. With, say, 150 records and 3 stale ones, that’s 154 calls instead of about 300.


  1. Special thanks to my friend 2ndBillingCycle for helping run some optimizations with language features that I wasn’t quite familiar with. ↩︎