NTP’s repo stopped working
Just fixed a somewhat interesting bug that impacted NTP’s repositories at bk://bk.ntp.org/ntp-dev
They have been using the commercial version of bk for many years and recently started looking at switching to using the open source version.
[This post is a somewhat experimental just to see if it is interesting to people, but I thought it might be helpful to see how I found and fixed a bug that a customer hit this week.]
Symptoms
They said that the last commercial release of bk worked fine, but the recent bk-7.3ce release fails like this:
$ ../dev-oss/src/bk -r check -ac
Missing delta (bk help chk2) in ntpdc/ntpdc-layout.c
ntpdc/ntpdc-layout.c: marked delta should be in ChangeSet but is not.
ro@xayide.techfak.uni-bielefeld.de|ntpdc/ntpdc-layout.c|20030804182811|24716
Missing delta (bk help chk2) in ntpdc/nl_in.c
ntpdc/nl_in.c: marked delta should be in ChangeSet but is not.
ro@xayide.techfak.uni-bielefeld.de|ntpdc/nl_in.c|20030804182811|03817
Missing delta (bk help chk2) in ntpdc/nl.pl.in
ntpdc/nl.pl.in: marked delta should be in ChangeSet but is not.
ro@xayide.techfak.uni-bielefeld.de|ntpdc/nl.pl|20030804182811|44146
Interestingly, the bk-7.2.1ce release worked fine. That error message basically says that some deltas in those files that are “marked” as being included in a cset don’t actually appear in the ChangeSet file’s weave. So some bookkeeping changed.
Finding the problem
So the bug was introduced during the 7.3 release cycle. I used ‘bk bisect’ to find the problem.
$ cat ntp-bug.sh
#!/bin/sh
cd src
make clean
make -j6 bk bk.script || {
echo build failed
exit 2
}
./bk --cd=$HOME/bk/ntp-dev -r check -a || exit 1
exit 0
$ bk bisect -rbk-7.2.1ce..+ --cmd=ntp-bug.sh
...
# Found it! [57604833FV7FTAfIMDCpCBhSzfEIKA]:
ChangeSet@1.2831.1.5, 2016-06-14 14:08:51-04:00, wscott@x99.wscott.bitkeeper.com
Fix some potential bugs with the USER() and HOST() macros
- HOST() could include [importer] at the end
(we removed $BK_IMPORTER since it was never really used)
- USER/HOST could return "user/realuser" and a number of the
callers were not really expecting that.
- Add new delta_user() and delta_host() functions that return
just the plain user and host.
This includes some user visible changes:
- annotations will only be username not user/realuser
- changes -uUSER will now match just the user name
- limiting by user in bkweb works more often
- findkey can now search for user in the user/realuser case.
That cset was removing some old cruft. Originally our CVS import scripts would record the user@host
of imported csets but we also wanted to record the user that did the actual import. We hid this information in an optional extension to the hostname field in the form of example.com[importuser]
.
In later versions of bk, we record the real username of the person that created every cset so this importer field was mostly unused. In the open source codebase, we have been all about removing cruft and simplifying the code and so the cset above was one of those code cleanups.
Why did that cset break NTP?
So it turns out that NTP has not upgraded their BitKeeper repositories in a long time and it still running a bk repository in the original repository SCCS compatible format. BitKeeper is fully backward compatible and can read and write old format repositories and so they didn’t notice I guess. The old format stores data in SCCS files with some metadata extensions (most of which are ^ac<char><data>
where the <char>
indicates what metadata this is an the <data>
is, well, the data. We could do this because SCCS didn’t look at the <char>
, it just assumed it was always a space). Current versions of bk use a newer binary format that is optimized to be loaded without parsing.
As a result, they are running a compatibility code path that reads the old file format and translates it into the new memory layout. That compat layer has a bug and is incorrectly setting the hostname field for some deltas. Unfortunately compat code gets less testing and the [importuser]
extension is pretty rare so we missed this case in our original code reviews ([importuser] didn’t follow the other pattern for extensions so it’s one more thing we have to get right).
In SCCS, the hostname is one of those metadata extensions and is only listed if is different that the parent, so this file looks like this (look for ^acH for the hostname lines and look carefully to see [ro] at the end of the hostname):
^AH16582
^As 1/0/21
^Ad D 1.2 10/01/27 18:45:26 davehart 3 2
^Ac #include <config.h> from all .c files and do not include it
^Ac from any .h files.
^AcC
^AcHshiny.ad.hartbrothers.com
^AcK26435
^AcZ+00:00
^Ae
^As 21/0/0
^Ad D 1.1 03/08/04 20:28:10 ro 2 1
^Ac BitKeeper file /amnt/figaro/volumes/adm-src/adm/src/ntp/ntp-dev-local/ntpdc/ntpdc-layout.c
^AcC
... no hostname line here so we use the 1.0 value below ...
^AcF1
^AcK24716
^AcO-rw-rw-r--
^Ae
^As 0/0/0
^Ad D 1.0 03/08/04 20:28:10 ro 1 0
^Ac BitKeeper file /amnt/figaro/volumes/adm-src/adm/src/ntp/ntp-dev-clone/ntpdc/ntpdc-layout.c
^AcBstenn@whimsy.udel.edu|ChangeSet|19990526004811|57482|8983e65c737bb465
^AcHxayide.techfak.uni-bielefeld.de[ro]
^AcK58440
^AcPntpdc/ntpdc-layout.c
^AcR5e5cedc7aa2e99ed
^AcV4
^AcX0x821
^AcZ+02:00
The fix
The code to propagate missing data when reading SCCS files is slib.c:sccs_inherit()
and looks like this:
if (HAS_USERHOST(s, p)) {
if (!strchr(USERHOST(s, d), '@') &&
(phost = HOSTNAME(s, p)) && *phost) {
/* user but no @host, so get host from parent */
sprintf(buf, "%s@%s", USERHOST(s, d), phost);
USERHOST_SET(s, d, buf);
}
}
The code in the problem cset was to change the HOSTNAME()
macro to actually return the hostname and not the contents of the old hostname field in the file. So it would strip the importer information from the hostname field. ([ro]
in this case)
So we were not propagating the right information and needed to send the raw data and that looks something like this.
if (HAS_USERHOST(s, p)) {
if (!strchr(USERHOST(s, d), '@') &&
(phost = strchr(USERHOST(s, p), '@'))) {
/* user but no @host, so get host from parent */
sprintf(buf, "%s@%s", USERHOST(s, d), phost+1);
USERHOST_SET(s, d, buf);
}
}
Once I create a testcase that reproduces this bug and fix and we have passed all tests on our build cluster, the fix will be pushed out the to -dev tree.
Workaround
Meanwhile if the NTP project just upgrades their repositories to use the current file format this problem will go away and everything will be faster.