Bk-7.3 broke ntp's repository


#1

NTP’s repo stopped working

Just fixed a somewhat interesting bug that impacted NTP’s repositories at bk://bk.ntp.org/ntp-dev

They have been using the commercial version of bk for many years and recently started looking at switching to using the open source version.

[This post is a somewhat experimental just to see if it is interesting to people, but I thought it might be helpful to see how I found and fixed a bug that a customer hit this week.]

Symptoms

They said that the last commercial release of bk worked fine, but the recent bk-7.3ce release fails like this:

$ ../dev-oss/src/bk -r check -ac
Missing delta (bk help chk2) in ntpdc/ntpdc-layout.c
ntpdc/ntpdc-layout.c: marked delta should be in ChangeSet but is not.
	ro@xayide.techfak.uni-bielefeld.de|ntpdc/ntpdc-layout.c|20030804182811|24716
Missing delta (bk help chk2) in ntpdc/nl_in.c
ntpdc/nl_in.c: marked delta should be in ChangeSet but is not.
	ro@xayide.techfak.uni-bielefeld.de|ntpdc/nl_in.c|20030804182811|03817
Missing delta (bk help chk2) in ntpdc/nl.pl.in
ntpdc/nl.pl.in: marked delta should be in ChangeSet but is not.
	ro@xayide.techfak.uni-bielefeld.de|ntpdc/nl.pl|20030804182811|44146

Interestingly, the bk-7.2.1ce release worked fine. That error message basically says that some deltas in those files that are “marked” as being included in a cset don’t actually appear in the ChangeSet file’s weave. So some bookkeeping changed.

Finding the problem

So the bug was introduced during the 7.3 release cycle. I used ‘bk bisect’ to find the problem.

$ cat ntp-bug.sh 
#!/bin/sh
cd src
make clean 
make -j6 bk bk.script  || {
    echo build failed
    exit 2
}
./bk --cd=$HOME/bk/ntp-dev -r check -a || exit 1
exit 0
$ bk bisect -rbk-7.2.1ce..+ --cmd=ntp-bug.sh
...
# Found it! [57604833FV7FTAfIMDCpCBhSzfEIKA]:
ChangeSet@1.2831.1.5, 2016-06-14 14:08:51-04:00, wscott@x99.wscott.bitkeeper.com
  Fix some potential bugs with the USER() and HOST() macros
  - HOST() could include [importer] at the end
    (we removed $BK_IMPORTER since it was never really used)
  - USER/HOST could return "user/realuser" and a number of the
    callers were not really expecting that.
  - Add new delta_user() and delta_host() functions that return
    just the plain user and host.
  
  This includes some user visible changes:
    - annotations will only be username not user/realuser
    - changes -uUSER will now match just the user name
    - limiting by user in bkweb works more often
    - findkey can now search for user in the user/realuser case.

That cset was removing some old cruft. Originally our CVS import scripts would record the user@host of imported csets but we also wanted to record the user that did the actual import. We hid this information in an optional extension to the hostname field in the form of example.com[importuser].

In later versions of bk, we record the real username of the person that created every cset so this importer field was mostly unused. In the open source codebase, we have been all about removing cruft and simplifying the code and so the cset above was one of those code cleanups.

Why did that cset break NTP?

So it turns out that NTP has not upgraded their BitKeeper repositories in a long time and it still running a bk repository in the original repository SCCS compatible format. BitKeeper is fully backward compatible and can read and write old format repositories and so they didn’t notice I guess. The old format stores data in SCCS files with some metadata extensions (most of which are ^ac<char><data> where the <char> indicates what metadata this is an the <data> is, well, the data. We could do this because SCCS didn’t look at the <char>, it just assumed it was always a space). Current versions of bk use a newer binary format that is optimized to be loaded without parsing.

As a result, they are running a compatibility code path that reads the old file format and translates it into the new memory layout. That compat layer has a bug and is incorrectly setting the hostname field for some deltas. Unfortunately compat code gets less testing and the [importuser] extension is pretty rare so we missed this case in our original code reviews ([importuser] didn’t follow the other pattern for extensions so it’s one more thing we have to get right).

In SCCS, the hostname is one of those metadata extensions and is only listed if is different that the parent, so this file looks like this (look for ^acH for the hostname lines and look carefully to see [ro] at the end of the hostname):

^AH16582
^As 1/0/21                                  
^Ad D 1.2 10/01/27 18:45:26 davehart 3 2
^Ac #include <config.h> from all .c files and do not include it
^Ac   from any .h files.
^AcC
^AcHshiny.ad.hartbrothers.com
^AcK26435
^AcZ+00:00
^Ae
^As 21/0/0
^Ad D 1.1 03/08/04 20:28:10 ro 2 1
^Ac BitKeeper file /amnt/figaro/volumes/adm-src/adm/src/ntp/ntp-dev-local/ntpdc/ntpdc-layout.c
^AcC
... no hostname line here so we use the 1.0 value below ...
^AcF1
^AcK24716
^AcO-rw-rw-r--
^Ae
^As 0/0/0
^Ad D 1.0 03/08/04 20:28:10 ro 1 0
^Ac BitKeeper file /amnt/figaro/volumes/adm-src/adm/src/ntp/ntp-dev-clone/ntpdc/ntpdc-layout.c
^AcBstenn@whimsy.udel.edu|ChangeSet|19990526004811|57482|8983e65c737bb465
^AcHxayide.techfak.uni-bielefeld.de[ro]
^AcK58440
^AcPntpdc/ntpdc-layout.c
^AcR5e5cedc7aa2e99ed
^AcV4
^AcX0x821
^AcZ+02:00

The fix

The code to propagate missing data when reading SCCS files is slib.c:sccs_inherit() and looks like this:

	if (HAS_USERHOST(s, p)) {
		if (!strchr(USERHOST(s, d), '@') &&
		    (phost = HOSTNAME(s, p)) && *phost) {
			/* user but no @host, so get host from parent */
			sprintf(buf, "%s@%s", USERHOST(s, d), phost);
			USERHOST_SET(s, d, buf);
		}
	}

The code in the problem cset was to change the HOSTNAME() macro to actually return the hostname and not the contents of the old hostname field in the file. So it would strip the importer information from the hostname field. ([ro] in this case)

So we were not propagating the right information and needed to send the raw data and that looks something like this.

	if (HAS_USERHOST(s, p)) {
		if (!strchr(USERHOST(s, d), '@') &&
		    (phost = strchr(USERHOST(s, p), '@'))) {
			/* user but no @host, so get host from parent */
			sprintf(buf, "%s@%s", USERHOST(s, d), phost+1);
			USERHOST_SET(s, d, buf);
		}
	}

Once I create a testcase that reproduces this bug and fix and we have passed all tests on our build cluster, the fix will be pushed out the to -dev tree.

Workaround

Meanwhile if the NTP project just upgrades their repositories to use the current file format this problem will go away and everything will be faster.


#2

AFAIK there’s no plan to switch to BK CE in general for the NTP project infrastructure. But I can run a self-made build of BK CE on platforms where the commercial version is not available – in my case, a banana PI SBC running the Bananian distro, a modified Debian/Arm7/gnueabihf. I use that for ARM specific stuff and regression tests. Without BK available there I had to go through some contortions that are no longer necessary.

And thanks again for the quick fix and the hint that the repos at bk.ntp.org use the old format!


#3

BTW, I pushed a collection of csets to the bk://bkbits.net/u/bk/dev repository that includes the fix to handle the ntp repository in the old format.

If you haven’t upgraded then you can build a bk using that version. (or build bk-7.2.1ce)


#4

BTW, I pushed a collection of csets to the bk://bkbits.net/u/bk/dev repository that includes the fix to handle the ntp repository in the old format.

Great, thanks a bunch!

If you haven’t upgraded then you can build a bk using that version. (or build bk-7.2.1ce)

That’s what I did ASAP after reading your wiki entry when you showed
that the regression happened after 7.2.1ce, but thanks for the hint
anyway :wink:

Relocation of the PSP.NTP.ORG machines from a rack at ISC to a farm of
VMs hosted somewhere else caused other trouble, too, which is not
related to BK but blocks me there.

I’ll keep you (and/or the WIKI) updated, though I don’t expect much
trouble from the BK side.

Greetings,
Pearly