CVS Is Dumb

2020-07-15 9 min read Programmer stuff Rants Software Tech Tech explained Teknikal_Domain

So I was going back through my list of version control systems, and something hit me. CVS isn’t just kinda old and worn out, it’s downright backwards and dumb. Heck, CVS over the network is something that, in all seriousness, should not exist.

A Quick History Lesson

CVS is one of the earliest version control systems. It started as a front-end to a tool called RCS (Revision Control System), and even uses the same file format. RCS is file-based, meaning it has no real concept of directories and operates pretty much local to one folder: for every file, say, foo.txt, there is a foo.txt,v either right next to it or in the RCS/ directory next to it. That’s it. This, of course, is from back when multiple users shared the same central system, so we didn’t need to communicate changes between systems, we just needed to know which user changed a file. Eventually CVS (Concurrent Versions System) came along, with the massive feature of allowing concurrent work. RCS required that a user ‘lock’ the file to prevent anyone else from editing it, make their changes, then ‘check in’ the modified file, thus ‘unlocking’ it. CVS allowed multiple users to work on the same file at the same time by requiring that a user has seen the most up-to-date copy before checking in new work, somewhat similar to how Git functions.

For a while, CVS was just a wrapper around RCS, until it got pulled apart and made into a client/server architecture. CVS is centralized, meaning that without a connection to the repository server, you can’t do anything. You can’t revert your changes, check in new ones, or even compare what’s changed and what’s not.

CVS came to have Kerberos (called :kserver:) and GSSAPI (called :gserver:) support, and allowed server connections across RSH and, later, SSH, but CVS has always had its own protocol with nothing special on it: the CVS Password Server, or :pserver: protocol. But we’ll get to that later.

Repository Design

The CVS server that you’re working with is the CVS repository, which can have multiple “modules” within it that can be checked out and modified separately. In CVS terms, the “repository” is the upstream server, and a “module” is what we would now call a repository, or a project. A module stores all the ,v files in the exact same structure as a working copy, meaning that just looking in on a module will pretty much tell you exactly how it’s laid out. There is one exception: deleted files (or other special cases where a file doesn’t exist on the main line) are moved into a folder called Attic/ inside the directory they lived in. One special module, CVSROOT, has all the “administrative files” for things like module definitions, hooks (scripts to run at certain points of operations), user passwords, and the like.

This design really does reflect the tech that CVS is built on: it really doesn’t ‘understand’ folders. It knows that they’re there, and they organize files, but doesn’t really care about them at all.

For example, each RCS ,v file stores the revisions of that particular file and nothing else, in reverse order, starting with the most recent and working back through time. The most recent revision is stored as the full file, and each older revision is stored as a delta (only the differences, listed as a series of ‘add’ and ‘delete’ commands) that turns the revision after it back into itself. In this way, the farther back in history you want to see, the longer CVS and RCS will take to calculate, since they need to layer a delta transform on top of a delta transform for each revision crossed.
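To make the "delta on top of a delta" cost concrete, here is a minimal sketch of walking backwards from the newest revision. It assumes the edit-script shape that `diff -n` emits (`aN M` adds M lines after line N, `dN M` deletes M lines starting at line N, both counted against the text being transformed). The function names and the sample data are mine, and this is not a parser for real ,v files.

```python
def apply_delta(lines, script):
    """Apply one RCS-style edit script to a list of lines and return the result."""
    out = []
    pos = 0  # index of the next line of `lines` not yet copied or skipped
    i = 0
    while i < len(script):
        op = script[i][0]
        start, count = (int(n) for n in script[i][1:].split())
        i += 1
        if op == "d":
            # Copy everything before line `start`, then skip `count` lines.
            out.extend(lines[pos:start - 1])
            pos = start - 1 + count
        elif op == "a":
            # Copy everything up to and including line `start`,
            # then splice in the `count` lines that follow the command.
            out.extend(lines[pos:start])
            pos = start
            out.extend(script[i:i + count])
            i += count
        else:
            raise ValueError(f"unknown command: {script[i - 1]!r}")
    out.extend(lines[pos:])
    return out


def checkout(head, deltas, steps_back):
    """Rebuild an older revision: one delta application per revision crossed."""
    text = list(head)
    for script in deltas[:steps_back]:
        text = apply_delta(text, script)
    return text


# Hypothetical history: the newest text, plus deltas leading back in time.
head = ["alpha", "beta", "gamma"]
deltas = [
    ["d2 1", "a2 1", "BETA"],  # newest -> one revision back
    ["a3 1", "delta"],         # one back -> two back
]

print(checkout(head, deltas, 0))  # ['alpha', 'beta', 'gamma']
print(checkout(head, deltas, 2))  # ['alpha', 'BETA', 'gamma', 'delta']
```

The loop in `checkout` is the whole point: getting the newest revision is free, and every step further back is another full pass over the file.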

CVS adds almost no information on top of this whatsoever. What this means is that it doesn’t even record which files were part of the same commit. All you, the user, can do is notice that the time, log message, and user all line up, and therefore assume they were checked in together.
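That guessing game can be automated, roughly. The sketch below groups per-file revisions into probable commits when the user and log message match and the timestamps fall close together. The `FileRevision` record, the five-minute window, and the sample data are all assumptions for illustration. CVS itself stores none of this grouping.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRevision:
    # Hypothetical record of one per-file check-in, as you might scrape from logs.
    path: str
    revision: str
    author: str
    message: str
    when: datetime


def infer_changesets(revs, window=timedelta(minutes=5)):
    """Guess which per-file check-ins belonged to the same commit."""
    groups = []
    # Sorting puts same-author, same-message revisions next to each other in time order.
    for rev in sorted(revs, key=lambda r: (r.author, r.message, r.when)):
        last = groups[-1] if groups else None
        if (last
                and last[-1].author == rev.author
                and last[-1].message == rev.message
                and rev.when - last[-1].when <= window):
            last.append(rev)
        else:
            groups.append([rev])
    return groups


t = datetime(2020, 7, 15, 12, 0, 0)
revs = [
    FileRevision("foo.txt", "1.7", "tkdmn", "Fix typo", t),
    FileRevision("bar.txt", "1.3", "tkdmn", "Fix typo", t + timedelta(seconds=2)),
    FileRevision("foo.txt", "1.8", "tkdmn", "Add section", t + timedelta(hours=1)),
]
for group in infer_changesets(revs):
    print([(r.path, r.revision) for r in group])
```

It is still only a guess: two unrelated check-ins with the same message a minute apart would be merged, and one slow commit could be split in two.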

This design decision right here means that CVS has to work completely backwards from just about every other VCS in existence.

Atomicity

This is a deliberate design decision, mind you. Modern VCSes have what are called atomic commits: either the entire commit is processed and recorded, or it fails with an error. There is no halfway point, meaning you can never be left stuck with a half-completed commit.

CVS doesn’t have this. CVS works on a file-by-file basis, processing them one at a time, meaning that if one transfer fails, or your connection drops, then that file will not be committed (though you can try again), but everything that already went through will be. And it’s not just commits, it’s every operation. You want to pull down the latest changes? Better hope nobody is checking in, because you might get half of their commit and half of the state just before it, and the only way to know is to keep sending the update command until it reports back that nothing is new.
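Here is a toy model of the difference, with a plain dictionary standing in for the repository. Nothing in it is CVS code; the function names, the `fail_on` switch, and the file names are made up to show what a dropped connection leaves behind in each style.

```python
class ConnectionDropped(Exception):
    """Stand-in for the network failing partway through a commit."""


def commit_file_by_file(repo, changes, fail_on=None):
    # Each file is written as soon as it arrives, so a failure
    # leaves the earlier files committed and the rest untouched.
    for path, content in changes.items():
        if path == fail_on:
            raise ConnectionDropped(path)
        repo[path] = content


def commit_atomically(repo, changes, fail_on=None):
    # Build the new state on the side, and only swap it in
    # once every file has gone through.
    staged = dict(repo)
    commit_file_by_file(staged, changes, fail_on)
    repo.clear()
    repo.update(staged)


changes = {"a.c": "v2", "b.c": "v2", "c.c": "v2"}

cvs_style = {"a.c": "v1", "b.c": "v1", "c.c": "v1"}
try:
    commit_file_by_file(cvs_style, changes, fail_on="b.c")
except ConnectionDropped:
    pass
print(cvs_style)  # {'a.c': 'v2', 'b.c': 'v1', 'c.c': 'v1'}

atomic_style = {"a.c": "v1", "b.c": "v1", "c.c": "v1"}
try:
    commit_atomically(atomic_style, changes, fail_on="b.c")
except ConnectionDropped:
    pass
print(atomic_style)  # {'a.c': 'v1', 'b.c': 'v1', 'c.c': 'v1'}
```

Note that the atomic version copies the whole state before touching anything, which is exactly the kind of overhead a small machine would struggle with.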

I can admire the reason it’s like this though: many machines at the time had little memory, and holding an entire commit’s state in memory to make sure it’s all correct could (and would) crash them. By opting not to make operations atomic, CVS could actually run without some heavy requirements. However, this can also lead to some very bad consequences, and, while done with good intentions, it’s something that I think really should have undergone more consideration. Then again, my opinion comes from a time when we’ve had years to figure this stuff out, and CVS is already mostly old and dead at this point.

Protocols

CVS stores very little information client side, pretty much just the repository and module that a working copy belongs to. This means that literally no CVS command will work if you don’t have a connection. This also means that for comparisons, you have to send everything to the server. To diff a file, you send the server your file, it runs the diff, and sends back the diff output. To check in, you send every file up, the server calculates everything, and writes it. To check the status of a file, you send the file up, it checks, and it tells you what’s going on. CVS is very, very network intensive because of this. Just about every command that operates on a file will need to transmit that file to the server and wait for a response to come back before acting on it, or showing it to the user. Even better, most commands output the server response raw, meaning that the massively verbose cvs status output lands in your terminal because that’s verbatim what the server returned. WHAT

Even better than that, the password server protocol only “trivially encrypts” the password transmitted, using a simple lookup table. For example, here is a verification request from cvs login:

BEGIN VERIFICATION REQUEST
/srv/cvs/tkdmn
tkdmn
A}yZZ30 e4
END VERIFICATION REQUEST

The password here is }yZZ30 e4 (ignore the A at the beginning, always). } = P, y = a, Z = s, and so on. This is Password1. Anyone who has seen the CVS source (or literally searches for “cvs password encryption”) would be able to look that up easily.
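To show how little work that is, here is a sketch that undoes the scrambling for the request above. The table only holds the eight mappings that can be read off this one example (the full table lives in the CVS source), so any other character comes back as `?`. The names `KNOWN` and `descramble` are mine.

```python
# Scrambled character -> real character, taken only from the example above.
KNOWN = {
    "}": "P",
    "y": "a",
    "Z": "s",
    "3": "w",
    "0": "o",
    " ": "r",
    "e": "d",
    "4": "1",
}


def descramble(scrambled):
    """Undo the character-for-character substitution on a pserver password."""
    if not scrambled.startswith("A"):
        raise ValueError("expected the leading 'A' marker")
    # Skip the marker, then translate one character at a time.
    return "".join(KNOWN.get(ch, "?") for ch in scrambled[1:])


print(descramble("A}yZZ30 e4"))  # Password1
```

There is no key and no state: the same character always maps to the same output, so anyone who has seen one scrambled password has part of the table.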

Okay, yes, the protocol was not designed to be security hardened, but come on, you’re already masking it a little, what’s wrong with an MD5 or SHA-1 hash or something?

Revisions

We all know how Git handles revision identifiers, right? It’s the hexadecimal (base 16) encoded SHA-1 hash of the commit object, and we usually only need the first 6-8 characters to uniquely identify a commit. If you’ve used Subversion, you know that revisions start at 1 and go from there. Rev 1, Rev 2, Rev 3 … Rev 48145, and so on.

CVS… well, this has roots in RCS. First, revisions are tracked per file, since a commit is just the operation where each file is individually checked in. You can, at the same time, check in version 1.7 of one file and 1.16.1.2.3.43 of another. And if just seeing that notation doesn’t scare you off, let’s explain it. CVS uses dotted-decimal notation for revisions. An odd number of terms is a branch identifier, and an even number is a revision. For example, 1.7 is the 7th revision on branch 1 (the default). 1.7.2.3 is the 3rd revision on branch 2, created off the 7th revision of the default branch. CVS hands out even branch numbers starting at 2, so if I branch off from 1.8, I create 1.8.2.1. Then the next revision is 1.8.2.2, and so on. If we did another branch off 1.8, that’s now 1.8.4.1, you get the picture? Every branch appends two numbers to its base rev: the branch number, and the 1st revision. Yes, this does uniquely identify each revision, and yes, this does have the nice ability of being able to trace back the graph of a branch’s history just by reading the numbers backwards, but it’s just… not portable or easy to communicate. And yes, these can be force-changed, like starting over from a base of 2 or whatever.
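Reading the numbers backwards is mechanical enough to write down. This sketch splits a revision into its branch, its position on that branch, and the revision the branch was created from, then follows that chain to the trunk. The function names are mine, and it only does string arithmetic on the notation described above.

```python
def describe(revision):
    """Split a revision number into (branch, position on branch, branch point)."""
    parts = revision.split(".")
    if len(parts) % 2:
        raise ValueError(f"{revision} has an odd number of terms, so it names a branch")
    branch = ".".join(parts[:-1])
    # Dropping the last two terms gives the revision the branch grew from.
    branch_point = ".".join(parts[:-2]) if len(parts) > 2 else None
    return branch, int(parts[-1]), branch_point


def ancestry(revision):
    """Follow branch points back until the default branch is reached."""
    chain = []
    while revision is not None:
        chain.append(revision)
        revision = describe(revision)[2]
    return chain


print(describe("1.7"))            # ('1', 7, None)
print(describe("1.7.2.3"))        # ('1.7.2', 3, '1.7')
print(ancestry("1.16.1.2.3.43"))  # ['1.16.1.2.3.43', '1.16.1.2', '1.16']
```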

At least with Git, it’s known that the IDs encode no value in and of themselves, and their only use is being supplied to commands. In CVS, no, they do have a meaning when standing alone, and that, just… no. This gets unwieldy quickly, and is probably why we started using symbolic names for branches. Even Subversion, a program that fundamentally doesn’t even have a concept of ‘branches’ like CVS does, does it better than CVS: you create a (shallow) copy of the main development ‘trunk’ folder (the default ‘branch’, so to speak) into some other folder. For example, I can create the branch testbranch by copying (relative to the Subversion repository root) trunk to branches/testbranch. Furthermore, if you’re anything like me, you can type 1.8.1.2 when you mean to type 1.8.2.1, and likely end up breaking a lot if that was a file-editing command that you just mixed your numbers up on.

So in the end, CVS… is dumb. Its custom protocol is dumb, its revision identifiers are dumb, some of its designs are dumb, and, at least in this day and age, most of the world has decided that its single core concept, centralized version control, is, yes, dumb.

Tek's Domain