Teknikal's_Domain

#<NTA:NnT:SSrgS:H6.6-198:W200-90.72:CBWg>

How Git Efficiently Transmits Your Changes

2020-09-04 9 min read Tech explained Version control Teknikal_Domain Unable to load comment count

So here’s a fun one. Have you ever noticed that even for huge changes to a repository, a git push only sends over a few kilobytes, maybe a few megabytes at most? If you’re familiar with the internals of git , you know that git stores an entire copy of the new file on commit. So how are these changes so small?

While the simple answer is “delta compression”, the in-depth answer is a bit more complicated. Let’s take this one step by step

Git Object Database

The main backend that Git manages is a database of objects, which are linked together to form your commit history graph. These objects have a type, like commit, tree, blob, and so on. Those three are the most common though, and the ones you need to know about. A commit object links up with another commit object as it’s parent (or multiple if this is a merge), and then points to a tree object as the root of the actual committed data. A tree object is, effectively, a folder, containing references to multiple blobs. And finally, a blob is a file, containing the file’s contents. blobs do not contain references to other objects.

Object Storage

On disk (meaning, in the .git/objects/ folder), all of these are stored as loose objects, where each one is a separate file, the file name being the SHA-1 hash of the object’s contents, I’m going to skip over that point for now. . Every time you git commit, Git will make one commit object, one tree object, as many blob objects as you have staged files, and as many additional tree objects as are required to place files in the correct directory structure. So that leads to a lot of files.

These files are compressed with zlib, which is the same compression (algorithm) used in zip files. However, that’s relatively trivial. A commit object that’s 1,191 bytes raw is, in this case, 902 bytes on disk. That’s not the level of compression that you’re seeing when network transfers are involved, something else must be at play.

There is, However, a second method of storing objects: packfiles. A packfile is a single file on disk that contains many objects at once, and these objects can reference each other to take up less space. When git packs objects up (usually a periodic git gc call), it will perform delta compression on them.

Delta Compression

Because science reasons, we use the Greek letter delta, Δ to represent a changing quantity, for example, if you were measure the temperature, $T$, of an object, then taking two measurements and comparing them will tell you the change from one to the other, called $\Delta T$. As a mathematical example, $T_2 - T_1 = \Delta T$. In this manner, Δ is pronounced “change in”, as in “change in temperature”.1

How does this relate to Git in the slightest? Well some nerds created something called delta compression, which is a very efficient method of storing changes from one file to another. Delta compression (also called delta encoding), is where only the differences to a known base file is stored, discarding any similarities. To decompress this, you apply the stored changes (also called “diffs”) to the base file, leaving you with the new file.

As a trivial example, the linux command diff -u will show this in a pretty basic format, and so will git diff. For example, if you change something, one line will be marked as a removal, and one line will be marked as an addition.

When Git packs objects, it looks for “similar” objects (usually in size), and attempts to extract a delta. If you ever manually did a git gc --aggressive on a repository and it took longer, that’s because aggressive mode will delta anything with anything if it produces sufficient results. If this check succeeds, then the first (relatively speaking) object will only store the difference between its original self and the second object, and a reference to the second object as the base to apply the diffs to. And this can repeat more than once, with there being deltas of deltas of deltas. In the repository for this blog, there is an object that has a chain length of 28 — That means that said object references a delta, which references a delta, which references a delta… meaning that in total, there’s 28 deltas all told. Note that this “first is delta of second” method is traditionally backwards, but, like many a VCS before it (even RCS, the oldest of old, did this), Git assumes that the most recent version of something is what you’d want to access the quickest. The farther back in history you go, the more deltas you need to apply, resulting in a slower access time, a fair trade-off given the infrequent use of old objects (usually), and the space savings.

Counting Up

If you’re curious, you can run git count-objects -vH, which will show you some statistics, like this:

count: 1835
size: 7.54 MiB
in-pack: 0
packs: 0
size-pack: 0 bytes
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

Note that it’s taking 7.54 MiB to store 1,835 objects, about 4.3 KiB per object. After a git gc --aggressive (this is after I nearly killed the repo unpacking everything), I see this:

count: 0
size: 0 bytes
in-pack: 2868
packs: 1
size-pack: 10.99 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

The count of 0 means that there’s no loose objects, and even though my repository recovery operations added a lot of objects, but not that much overall size, because the delta encoding.

Network Operations

Also known as: explaining the numbers you see on a git push or git pull. Namely, this:

Enumerating objects: 33, done.
Counting objects: 100% (33/33), done.
Delta compression using up to 4 threads.
Compressing objects: 100% (14/14), done.
Writing objects: 100% (22/22), 5.32 KiB | 5.32 MiB/s, done.
Total 22 (delta 15), reused 12 (delta 5)

Unless you’re running the most basic of systems, just a static HTTP server, Git is actually doing some proper communication. Both SSH and the “smart” HTTP protocol use two sets of complimentary processes, send-pack and receive-pack for doing a git push, and fetch-pack / upload-pack for a git pull. In either case, the processes determine what needs to be shared, only what needs to be shared, and then crafts a custom packfile, doing the entire transfer in one file upload or download.

git push

In a perfect world (which is what Git assumes), when you’re pushing data, there is nothing the remote has that you do not. Therefore, we can calculate the data to send without needing to talk to the remote, by just following the refs . Tracing back from, say master, until it crosses the commit pointed to by origin/master, gives you every commit that you need to push. And by walking through the commits, you can count up all the additional trees and blobs, thus quickly obtaining a list of everything you need to cram into a pack. When you start the transfer, you tell the remote what refs your updating, their old SHA, and their new SHA. The remote will stop you and reject it if the “old” SHA does not match what it currently has, as that means that your copy is likely out-of-date.

git pull

With a pull, Git needs to tell the remote what it has and what it wants. On connect, the remote tells it the SHA-1 of the current HEAD and all known refs. You will tell the remote what you want (the ref to pull), and what you have (the ref you already have)2, and the remote will run the same counting process on its end, then give you a pack containing everything that you’re missing.

Packing and Transfer

Either case, the sending side has now calculated a list of objects to send. They’re going to all be delta compressed, and in packing for transfer, git is always aggressive, and will delta anything against anything. Once packed, the file is uploaded to the other side, who then unpacks it, confirms everything is valid (no broken objects, dangling refs, and the file itself isn’t corrupt), then the transfer ends. This is when the Total 22 (delta 15), reused 12 (delta 5) line gets spat out.

Now, the statistics are hard enough to understand as it is since none of the numbers seem to make sense, but you only need to understand that final line for me to get my point across. The total number represents the amount of data transferred in who even knows what, but that’s not the important one. The two delta counters indicate how much of the data transferred was delta-compressed (for their respective main counter), and the reused counter shows how many objects got reused for delta compression. And that is something we have yet to talk about.

Git reuses objects during transfers when it’s possible to do so. Say, for example, that two (or more) files committed can actually be represented as two (or more) different deltas from the same base file. In that case, Git will reuse the base object, applying the multiple deltas separately, and saving repeating the exact same base over and over again. The reverse can also happen — the same delta, multiple different bases. Store the one delta, multiple bases, apply them all, and save the results as the output files.

Conclusion

It’s possible (through a script I have yet to write, nor will I, likely) to parse through a commit object, tallying up the size of all its trees and blobs (and itself), and generate an estimate of the “uncompressed” (minus zlib) transfer size, which you could compare to the actual data transfer counter in the git push or git pull status to see savings… But that’s just getting pedantic. As long as the reused and delta numbers are anything other than 0, that means that Git has found a way to save data when transmitting changes.

At the end of the day though, as interesting as this may be, it’s not important for most people. Unless you’re GitHub, where the minutia are actually critical, you’re not likely to be dealing with pushes and pulls that take anything more than a few seconds (okay, minus clones. Clones are another story). At most, you only have a few megabytes to send in the first place, so with as aggressive as Git can be about compression, it’s still effectively only shaving fractions off the time, most of this is in place for transfer efficiency, not raw speed.


  1. Fun fact for any KSP fans, or just space / rocket people in general. The term “delta-v” ($\Delta \text{v}$) is usually used in spacecraft flight dynamics to mean how much impulse is required to perform a maneuver. Literally “change in velocity”. For example, if an object is traveling 5 m/s away from you and you want to match speed with it, that maneuver requires a $\Delta \text{v}$ of… 5 m/s. You need to change your velocity that much to have the desired effect. ↩︎

  2. The documentation says that you’ll send a have for every object that exists locally, but in my experience I doubt this is the case. At worst, you’d send over a have for each ref, since everything can be calculated from there, and at best, you send just one, the current object for origin/master or whatever. ↩︎

comments powered by Disqus