How has BitKeeper changed since it was used for the Linux kernel
We haven’t been idle so I thought it would be a good idea to highlight some of the changes in BitKeeper that have occurred since the open source community last used it.
I figure that bk-3.2.3 was the last version of bk released before the Linux kernel switched to git, so bk has had about 63 releases since then.
This is just a very high-level summary. The RELEASE-NOTES files at the top of the repository have the full details, but those per-release summaries are over 4500 lines.
The changes include lots of bugfixes and polish, but I am going to ignore all of those and focus on major new features.
Layout
Originally BitKeeper kept revision history files stored in SCCS files at SCCS/s.file.c
right next to the actual files. This caused lots of problems with various software tools that assume file extensions can be used to know about file contents. BitKeeper now puts all revision history in a .bk
directory at the root of the repository.
Nested
The bk-5.0 release introduced Nested collections. The idea is that sub-repositories (known as components) can be checked into a top-level repository (the ‘product’) and these are revision controlled in a similar manner to files. Any workflow that works with a single repository works the same way with a nested collection. Clones can include a subset of these components and changes to these components are propagated. Updates are atomic, they either all work or all fail, just like updates to files in single repository.
The bk source tree only includes some components by default:
$ bk here -v
PRODUCT
default:
./src/gnu/diffutils
./src/gnu/patch
./src/gui
./src/gui/bkgui
./src/gui/tcltk/pcre
./src/libc
./src/t
./src/tomcrypt
./src/tommath
PRODUCT
And other components are tracked but only populated when needed.
$ bk comps -m # -m == missing
./src/gui/tcltk/bwidget
./src/gui/tcltk/tcl
./src/gui/tcltk/tk
./src/gui/tcltk/tkcon
./src/gui/tcltk/tktable
./src/gui/tcltk/tktreectrl
./src/win32/dll/scc
./src/win32/dll/shellx
./src/win32/dll/shellx/src
./src/win32/msys
./src/win32/vss2bk
There is an aliasing system so you don’t have to remember all the component names:
$ bk alias
MSYS
TCLTK
WIN32_DLL
default
add -v to see what’s in those aliases:
$ bk alias -v
MSYS:
./src/win32/msys
TCLTK:
./src/gui/tcltk/bwidget
./src/gui/tcltk/pcre
./src/gui/tcltk/tcl
./src/gui/tcltk/tk
./src/gui/tcltk/tkcon
./src/gui/tcltk/tktable
./src/gui/tcltk/tktreectrl
WIN32_DLL:
./src/win32/dll/scc
./src/win32/dll/shellx
./src/win32/dll/shellx/src
default:
./src/gnu/diffutils
./src/gnu/patch
./src/gui
./src/gui/bkgui
./src/gui/tcltk/pcre
./src/libc
./src/t
./src/tomcrypt
./src/tommath
PRODUCT
BAM
The bk-4.1 release introduced BAM.
BAM is a large binary file manager. The SCCS weave is optimized for text files. (Aside: the weave is cool and I will post about that soon) So for binary files we have an alternative storage mechanism that can link large binary files from the revision history. These files can be backed by a cloud of separate “BAM” servers. If a repository uses a BAM server, then only the tip revisions that are actually checked out need to be transferred between repositories.
The cloud of BAM servers are managed such that any repository has access to any BAM data it needs.
This allows centralized storage of binary data, and replication to servers that are geographically close to the users. It is possible, and practical, to have terabytes of BAM data.
Performance
Performance is a feature
In bk-3, BitKeeper used pretty much the standard ASCII SCCS file format with some added metadata. The file had a whole-file checksum at the top and so the entire file needed to be read in order to validate the checksum. The file also contained a whole data structure for revisions that had to be parsed line by line, allocated piece by piece, and loaded into a linked list data structure in memory in order to process the file. When any change occurred the entire file needed to be rewritten to disk to replace the old file. This was especially harmful to the ChangeSet file at the top of the repository that linked the histories of all the individual files together.
In the new version of bk, the file formats have completely changed. The data is saved in a compressed, paged, data store where each block has a CRC so that data can be validated as it is read. This data store also has redundant blocks to recover from minor data corruption events. The revision data structure in memory was reworked so it exactly matches the on-disk file format. It can be read directly from the files into memory without any parsing. This in-memory data store is paged on demand so only those portions of the data structure that are actually accessed are loaded in memory. Everything is compressed with lz4 to save space. LZ4 is faster than not compressing at all when reading from disk.
Graph traversal algorithms throughout the system were optimized to walk the minimal portion of the file history DAGs to extract information. When combined with the paging datastore, many operations touch at least 1/10th the data compared to old versions of BitKeeper.
For the ChangeSet file, the format includes an append-only data store so new information can be added to the file very quickly and the cset data is stored in a custom data structure optimized for the common access patterns rather than using the original SCCS weave.
In bk-3.x, push and pull operations would make multiple passes over a file to add data to the SCCS weave. In current versions of BitKeeper, an arbitrary number of changes can be incorporated into the data weave in a single pass over the file.
Past versions of BK used to have to scan the entire repository to find changes (or extras, aka untracked files); in a large tree this can be quite slow. Current BK gives you a reason to use the checkout:get where files are read only, bk edit is required to lock the file for editing. When in that mode, BK remembers each directory where you locked or delta-ed a file and scans only those directories when looking for modified/extra files.
Also in current versions of BitKeeper, multiple CPU cores are used for expensive operations as needed. This is dynamic depending on the type of storage media used for the repository. SSD likes lots of random accesses, rotation disk is faster if you schedule all accesses in a predictable pattern. This is especially noticeable on NFS where parallel access patterns can be used to cover the large network file server latency. Larry has some NFS benchmark results that shows off the performance work.
Windows
Quite a bit of stability work also happened on the Windows platform. Windows does not provide POSIX file system semantics where files can be deleted or moved while being accessed by another process on the same machine. This is especially problematic on Windows where virus scanners and disk indexing utilities can access files at random times. A lot of work happened on BitKeeper Windows compatibility layer to allow BitKeeper to function reasonably in the face of a normal Windows machine. Requiring virus scanners to be disabled for correct operation is not acceptable.