A first for me: this post is going to come in two parts. This one will cover the tech behind everything, and the second will actually be a new network tour. I’ve finally finished one of the biggest projects, if not the biggest, that I’ve undertaken in a while: overhauling pretty much my entire network.
Well the major thing I just did was actually a hypervisor change: I moved from ESXi to Proxmox. I redid a few VM mappings and other things to sort out the mess that comes with an organically grown network, but the major change was that OS upgrade. The rest was, oh, not much, just moving from VMs to containers. This means that all my stuff is now on Ubuntu, not CentOS. And a minor note, I know when you’re seeing this. I actually finished all this work long before, before COVID, even, so I find it funny that had I not switched to Ubuntu more or less by force at this point, I’d have done it after Red Hat announced their plans with CentOS 8, basically violating the trust of the entire community and discouraging them from using the product completely. I’ve also got a new UPS up, and now it’s one that, get this, was actually meant for the kind of equipment and kind of load that I’m placing on it, imagine that!
Now I’ve never really talked too much about virtualization before, so let’s go over some basic terminology. Virtualization, in this sense, is effectively emulating physical hardware devices in software. Instead of having a closet full of computers, I have just one that acts like several. Instead of real hardware, it’s virtual hardware, which, as it turns out, is a lot easier to manage, and also a lot more power efficient.
And of course we’ve made specialized software packages for this very task. These are called hypervisors, and they come in one of two types, aptly named type 1 and type 2.
Type 1 Hypervisors
Type 1 hypervisors are usually entire operating systems, but the defining fact is that they run directly on top of the physical hardware. That means they have direct access to hardware devices, and can perform better simply by not having to deal with anything except their own tasks. As a good example, ESXi is a Type 1 hypervisor.
Type 2 Hypervisors
Type 2 are closer to a conventional software package running as a standard process on the host hardware’s OS. While they’re a bit easier to deal with and usually easier to play with, competing with other programs and abstracted hardware access can create a performance impact. QEMU and VirtualBox are two good examples here.
Note that the lines here are kinda blurred. For example, Hyper-V is a Windows service that communicates at a somewhat lower level, meaning it has characteristics of both type 1 and type 2, and KVM for Linux uses the running kernel to provide virtualization, effectively acting like type 1 despite otherwise bearing all the classifications of type 2.
Now I used to run VMware ESXi as my hypervisor of choice, but now I’ve switched over to Proxmox Virtual Environment, also called Proxmox VE, PVE, or just… Proxmox. Proxmox is a Debian-based Linux distro that’s built off a modified Ubuntu kernel, allowing for not only conventional virtual machines, but also containers using LXC, which, okay, let’s talk about that.
A conventional virtual machine is running a complete set of virtual hardware: a virtual motherboard with virtual CPUs, connected to a virtual hard drive over a virtual SCSI, IDE, SATA, whatever you prefer interface adapter… I think you get the point. When done correctly, a virtual machine’s OS would have very little to actually tell it that it’s, well, virtual.
Of course, it takes a fair bit of work from the hypervisor to run everything like this, and as such, we created a concept called paravirtualization. Paravirtualization is where parts of the “hardware” interface have their true, virtual nature exposed as such, so that if the guest OS is compatible with that, it can “help”, of sorts, by acknowledging that it’s virtual and co-operating with the hypervisor.
Well, it turns out there’s something fundamentally different from virtualization altogether, and in some senses fundamentally better: containerization. Instead of working with virtual machines here, we work with containers. So far, there have been two major ways of working with containers: Docker and Linux Containers.
Probably the more well-known of the two, Docker containers are really meant to be tight little self-contained boxes meant to do one thing and one thing only. If your app needs a web server and a database, make two containers, one for each, and link them together into their own little isolated network. In this sense a Docker container is really just running any regular command inside an isolated space where it can’t interact with anything other than what it’s given and what’s been explicitly allowed. And, well, that’s not the architecture I’m running on.
Docker uses a tiny little bit of a runtime, containerd, that makes a slight bit of an abstraction layer.
Each container is formed from an image, which is a filesystem and some extra configuration data.
That filesystem is a series of layers, each representing modifications (deltas) to the previous.
Each container also has an “entrypoint”, an executable program in the container namespace to use as process 1.
This can be a shell like /bin/bash, but it can also be an app wrapper that does nothing else except start the service.
The two main ways a container can interact with the outside world are through volumes and ports.
Do note: Docker is mainly built to use Linux based systems as the host and guests, but it can support Windows in there too, albeit not as efficiently.
A volume is a named location in the container filesystem that the host’s filesystem can be mounted at, either as a raw path, or a named volume managed by Docker.
For example, to give a container access to the Docker runtime, you can map the host’s /var/run/docker.sock into the container.
A port is a network port that the container image has stated it expects traffic on.
An image for a web server might, say, call out ports such as 443/tcp as ones that it’s going to use.
These can be mapped to any available host port (through some Linux networking magic), but when Docker is left to pick the host port itself, they land in the ephemeral port range of 32768–60999 (at least for Linux). [1]
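To see where that ephemeral range comes from, here is a small Python sketch. It is not Docker's own code: it just asks the kernel for its configured range (assuming a Linux host that exposes /proc/sys/net/ipv4/ip_local_port_range) and then asks the OS for "any free port" by binding to port 0, which is the general mechanism for getting a port out of that range.

```python
import socket
from pathlib import Path

# Linux exposes its ephemeral range here; the file is absent on other systems.
RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def ephemeral_range():
    """Return (low, high) as configured in the kernel, or None when unavailable."""
    try:
        low, high = RANGE_FILE.read_text().split()
    except (OSError, ValueError):
        return None
    return int(low), int(high)


def pick_free_port():
    """Bind to port 0 so the OS assigns a free port, then report which one."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


print("kernel ephemeral range:", ephemeral_range())
print("OS-assigned port:", pick_free_port())
```

On a stock Linux box the first line should show the same range quoted above, and the second should fall inside it, though a sysadmin can change that range.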
Each image in Docker is defined by a Dockerfile, a list of steps used to build an image. As an example:
FROM ubuntu:latest
RUN apt-get update
RUN apt-get upgrade -y
COPY . /app
WORKDIR /app
VOLUME /app/config
EXPOSE 80/tcp
ENTRYPOINT ["/app/main"]
This starts with a base image, being the official Ubuntu Docker image (and the latest version thereof), RUNs a few commands, copies the current build context into the folder /app, sets that as the WORKDIRectory, states that /app/config is a mount point for a volume and that TCP port 80 is going to be used, and, finally, states that when the container is started, the file /app/main is launched as its main process.
When you run the docker build command, you give it a directory; this is called the build context. Everything the Docker engine does in relation to COPY operations and the like is relative to that context.
Also, each step in the build is a new image layer.
You have one for the base image, one as the result of the first RUN, another for the second, another layer after files are copied in… you get the picture.
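As a rough illustration of the "one step at a time" idea, this Python sketch splits the example Dockerfile into its build steps. The parser is a toy of my own (no line continuations, no parser directives), and the split into "filesystem" versus "config" steps is an assumption based on Docker's documentation describing RUN, COPY and ADD as the instructions that add filesystem content, with the rest mostly recording configuration.

```python
DOCKERFILE = """\
FROM ubuntu:latest
RUN apt-get update
RUN apt-get upgrade -y
COPY . /app
WORKDIR /app
VOLUME /app/config
EXPOSE 80/tcp
ENTRYPOINT ["/app/main"]
"""

# Assumed classification: these instructions write files into the image.
FILESYSTEM_INSTRUCTIONS = {"RUN", "COPY", "ADD"}


def build_steps(text):
    """Split a simple Dockerfile into (instruction, arguments) pairs."""
    steps = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        instruction, _, arguments = line.partition(" ")
        steps.append((instruction.upper(), arguments.strip()))
    return steps


for number, (instruction, arguments) in enumerate(build_steps(DOCKERFILE), start=1):
    kind = "filesystem" if instruction in FILESYSTEM_INSTRUCTIONS else "config"
    print(f"step {number}: {instruction:<10} {kind:<10} {arguments}")
```

Every line comes out as its own step, in order, which is the same order Docker works through them during a build.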
Advanced Docker usage involves multi-stage builds that reduce layer counts by a fair margin, but do remember that each layer of a final, built image is a separate “thing” Docker keeps track of with that image.
When starting a container from an image, one additional layer is added, that container’s modifications to its image, meaning multiple containers can share the same image (obviously).
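The layer idea is easier to see with a toy model. This Python sketch treats each layer as a dictionary of changes and stacks them in order; it is only a conceptual model (real Docker uses a union filesystem, not dictionaries), and all the paths and contents are made up for illustration.

```python
# A deletion is recorded as a marker in an upper layer; lower layers are never edited.
DELETED = object()


def merge_layers(layers):
    """Apply each layer's changes in order and return the resulting file tree."""
    tree = {}
    for layer in layers:
        for path, content in layer.items():
            if content is DELETED:
                tree.pop(path, None)
            else:
                tree[path] = content
    return tree


# Hypothetical image: a base layer plus two deltas on top of it.
image_layers = [
    {"/etc/os-release": "ubuntu", "/tmp/cache": "stale"},
    {"/app/main": "binary v1"},
    {"/tmp/cache": DELETED, "/app/config/default.ini": "port=80"},
]

# Each container gets its own writable layer above the shared image layers.
container_a = {"/app/config/default.ini": "port=8080"}
container_b = {"/var/log/app.log": "started"}

view_a = merge_layers(image_layers + [container_a])
view_b = merge_layers(image_layers + [container_b])

print(sorted(view_a))
print(view_a["/app/config/default.ini"])
print(view_b["/app/config/default.ini"])
```

Container A sees its own edited config while container B still sees the image's original, and neither one has touched the three shared image layers.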
After being built, an image can be published to a registry like Docker Hub, or, yes, a container running the registry image.
This is just a specialized HTTP API that allows for a Docker process to exchange images and layers.
Docker Hub is used as the general central location for public Docker images like, as the example showed before, the ubuntu image.
Multiple versions of an image can be told apart with a tag, which is latest in the example up there, but can be anything, really.
latest is the convention to use, and also the expected default if you didn’t actually name one yourself (docker pull ubuntu would pull ubuntu:latest).
One final note: Dockerfile directives like RUN are actually executed in a short-lived ephemeral container, whose resulting filesystem layer, once it’s finished running, is then committed to the in-progress image.
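Going back to tags for a second, the "no tag means latest" convention can be sketched in a few lines of Python. The function name is my own and this is a simplification, not Docker's actual reference parser: it ignores digests (name@sha256:...) and assumes any registry host is followed by a slash.

```python
def split_image_reference(reference):
    """Return (repository, tag), falling back to "latest" when no tag is given."""
    # Only a colon after the last slash separates a tag; an earlier colon
    # belongs to a registry host and port such as localhost:5000.
    last_slash = reference.rfind("/")
    colon = reference.rfind(":")
    if colon > last_slash:
        return reference[:colon], reference[colon + 1:]
    return reference, "latest"


for name in ("ubuntu", "ubuntu:20.04", "localhost:5000/ubuntu"):
    print(name, "->", split_image_reference(name))
```

The third case is the one that trips up naive parsers: the colon there is part of the registry address, so the tag still defaults to latest.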
LinuX Containers (LXC)
This one, if you couldn’t tell by the name, is made specifically for Linux. [2]
LXC is commonly paired with one additional program, lxd, and uses native features of the Linux kernel to orchestrate everything (kinda like Docker, but more extreme).
A Linux container, conceptually, is meant more as a general-purpose Linux environment, and is also, conceptually, simpler: a filesystem archive and a configuration metadata file.
Yes, that’s all there is to it.
Every container is a full Linux userland: same systemd, same file tree, same everything.
Unlike Docker images which are more meant to be specific to one “thing” at a time, a Linux Container is more like an entire VM that shares its kernel with its host.
In much the same way as Docker, LXC leverages some features of the Linux kernel, namely cgroups and namespaces. The difference is that LXC has almost no other overhead or runtime besides that, meaning it’s even lighter than Docker. Combined with the extra flexibility of not having a core idea of containers being specific pre-packaged apps, this gives it a lot more potential usage.
A Control Group (cgroup) is a way of limiting the amount of resources that processes can access.
systemd makes fairly heavy use of them.
For example, if you make a cgroup with a memory limit of 512 MiB, then start a process in that cgroup, that process (and any child processes) cannot use more than 512 MiB between them; push past that and the kernel reclaims what it can, then starts killing processes in the cgroup.
Disk IOPS, RAM limits, and even CPU cycles can be controlled with a cgroup.
Basically any resource that can be monitored and limited can be… monitored and limited.
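If you want to poke at this yourself, here is a hedged Python sketch that reads cgroup information the way the kernel exposes it. It assumes a Linux host with cgroup v2 mounted at /sys/fs/cgroup and a memory.max file per cgroup; the function names are my own. Since creating a real cgroup needs root, the 512 MiB case is demonstrated against a throwaway directory standing in for the real hierarchy.

```python
import tempfile
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # usual cgroup v2 mount point (assumed)


def parse_cgroup_line(line):
    """Split one /proc/<pid>/cgroup line into (hierarchy_id, controllers, path)."""
    hierarchy_id, controllers, path = line.strip().split(":", 2)
    controller_list = controllers.split(",") if controllers else []
    return int(hierarchy_id), controller_list, path


def memory_limit(cgroup_path, root=CGROUP_ROOT):
    """Return a cgroup's memory.max in bytes, or None if unlimited or unreadable."""
    limit_file = root / cgroup_path.lstrip("/") / "memory.max"
    try:
        raw = limit_file.read_text().strip()
    except OSError:
        return None
    return None if raw == "max" else int(raw)


# Stand-in directory so the 512 MiB case can be shown without root access.
with tempfile.TemporaryDirectory() as fake_root:
    group = Path(fake_root) / "demo"
    group.mkdir()
    (group / "memory.max").write_text(f"{512 * 1024 * 1024}\n")
    demo_limit = memory_limit("/demo", root=Path(fake_root))
print("demo limit in MiB:", demo_limit // (1024 * 1024))

# The real thing, for whatever cgroup this process is in (Linux only).
try:
    own_lines = Path("/proc/self/cgroup").read_text().splitlines()
except OSError:
    own_lines = []
for line in own_lines:
    if not line.strip():
        continue
    hierarchy_id, controllers, path = parse_cgroup_line(line)
    print(hierarchy_id, controllers, path, memory_limit(path))
```

Run inside a container with a memory cap, the last loop should print that cap in bytes; on an uncapped desktop session it will most likely print None.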
And a namespace is more of a limitation in scope. There’s a few types, but, put simply, a process namespace can limit what processes or filesystem nodes a process (tree) is capable of viewing. Put a process in its own, limited, namespace, and it’ll think it’s the only process in the system. Specifically, namespaces can isolate:
- Entire process trees: a process in one namespace can see processes in all child namespaces, but none in its sibling or parent namespaces
- Networks and network interfaces
- Hostnames, meaning one set of processes may believe it’s operating on a host of a different name
- Mount points and entire chunks of the filesystem
- User and group IDs
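The list above maps fairly directly onto what the kernel shows under /proc. This Python sketch lists the namespaces the current process lives in, assuming a Linux host where /proc/self/ns holds one symlink per namespace with a target like pid:[4026531836]; the helper names are hypothetical.

```python
import os
import re

NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Turn a link target like 'pid:[4026531836]' into ('pid', 4026531836)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespaces_of(pid="self"):
    """Map each namespace entry of a process to its identifier; empty if unavailable."""
    directory = f"/proc/{pid}/ns"
    found = {}
    try:
        entries = os.listdir(directory)
    except OSError:
        return found  # not Linux, or no permission to look
    for entry in entries:
        try:
            _kind, inode = parse_ns_link(os.readlink(os.path.join(directory, entry)))
        except (OSError, ValueError):
            continue
        found[entry] = inode
    return found


for entry, inode in sorted(namespaces_of().items()):
    print(f"{entry:<20} {inode}")
```

Two processes are in the same namespace of a given kind when those identifiers match, so running this on the host and again inside a container is a quick way to see which ones actually differ.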
So, for LXC, a container is a cgroup/namespace combination: the cgroup to set up the container’s resource limits, and a namespace that defines the container’s boundaries and filesystem access and limitations.
All a container really is, then, is a specified filesystem mount and a configuration specifying what to allow.
Everything is going to be built off a template, and Canonical provides the one (and usually only) list of those.
A template (or image, if you will) is a .tar.gz of two files: a metadata.yaml, and a .tar.gz of the initial filesystem.
When a container is made, this initial filesystem is decompressed into a specified location on disk, which the container will then run in.
With how I have this set up, each container gets its own LVM LV to itself. [3]
This makes managing container filesystem access easy, since you can just restrict a container to accessing its own LV, and only its own LV, and that is all that it requires.
Containers literally use the same running kernel as the host, meaning for the most part, they’re free to interact with the outside world as long as it’s within the bounds of their namespace. The only real control to give is extra filesystem mounts that are allowed into said namespace, and what network interfaces and network abilities are permitted within said namespace. One advantage of using kernel features like that is that resource allocations can be changed live, unlike a VM, or, for that matter, a Docker container, which, by default, has no upper limits unless explicitly stated.
Making your own templates? Previously, on ESXi, not happening.
You can’t pack up an entire virtual machine into something portable like that.
Now, if you tarball up the metadata file and filesystem contents, you can roll your own.
All a template is is just that filesystem, and the YAML file containing the expected CPU architecture (x86_64), creation date, OS name, and a description; again, that’s the metadata.yaml that was mentioned earlier. [4]
With Proxmox, at least, it’s a simple matter to tar czvf the entire file tree, then drop that file and the container’s config in a folder, tar czvf those, and put the result in Proxmox’s image directory on whatever datastore you’re using for images. It will then just recognize that as a valid template that can be used when creating a new container.
You don’t even need to do any special prep, it will manage all that for you.
I made a container off the base Ubuntu 20.04 template, updated everything with apt update && apt upgrade && apt autoremove, then stopped the container, zipped everything up like I just described, dropped that in the folder, and started creating others off of that.
Works perfectly fine, no issues.
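For the curious, the two-step tarball described above can be sketched with Python's tarfile module. This follows the layout as I described it (a metadata.yaml next to a compressed filesystem archive) and is not checked against what Proxmox itself expects; the rootfs.tar.gz file name, the metadata keys, and the function names are all assumptions for illustration, and the demo runs against a throwaway directory rather than a real container.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(source, archive_path):
    """Write the contents of `source` into a gzip-compressed tar archive."""
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for child in sorted(Path(source).iterdir()):
            archive.add(str(child), arcname=child.name)
    return Path(archive_path)


def build_template(rootfs, metadata_text, output):
    """Bundle metadata.yaml and a filesystem archive into one template tarball."""
    with tempfile.TemporaryDirectory() as staging:
        staging = Path(staging)
        (staging / "metadata.yaml").write_text(metadata_text)
        pack_directory(rootfs, staging / "rootfs.tar.gz")
        return pack_directory(staging, output)


# Demo against a throwaway directory standing in for a container's file tree.
with tempfile.TemporaryDirectory() as work:
    work = Path(work)
    rootfs = work / "rootfs"
    (rootfs / "etc").mkdir(parents=True)
    (rootfs / "etc" / "hostname").write_text("demo\n")

    # Hypothetical metadata keys, loosely matching the fields listed earlier.
    metadata = "architecture: x86_64\ndescription: hypothetical demo template\n"
    template = build_template(rootfs, metadata, work / "demo-template.tar.gz")

    with tarfile.open(str(template)) as archive:
        member_names = sorted(archive.getnames())

print(member_names)
```

The outer archive ends up holding exactly two entries, the metadata file and the inner filesystem archive, which is the same shape as the hand-rolled tar czvf approach.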
[1] Why does it not go to the maximum allowable port number of 65,535? No idea.
[2] It’s also made by Canonical, the same company behind Ubuntu (and Launchpad. PPAs, anyone?)
[3] If you’re not aware, in LVM, multiple Physical Volumes (PVs) make up a Volume Group (VG). This VG may have one or many Logical Volumes (LVs) defined on it, each acting like a fully independent drive (meaning it’s an M:N mapping of M physical drives to N logical drives). It’s more of a JBOD solution than, like, RAID or anything, but it’s still neat for managing things like this.
[4] Not an exhaustive list of possible options.