Bk fast-import (ALPHA release)

bk fast-import

This post introduces the new fast-import command for importing git repositories. This top post will stay updated with current usage instructions, and the discussion below covers what is coming next.

This command is very fast: roughly 100x faster than the previously posted git2bk script.

pre-alpha

Currently, this code is not integrated and should be considered pre-alpha. It is a great way to play with BitKeeper using real data, but the results should be validated very carefully before being used in production.

i.e., I just got this working and am letting people take a look to help me find bugs.

Repository

The code for this is hosted at:

bk://bkbits.net/u/wscott/dev.fast-import

Typically you would clone a local copy of bk and then pull from that location.
Or try this:

bk clonemod bk://bkbits.net/u/wscott/dev.fast-import LOCALBK dev.fast-import

(That command will fetch just that change without transferring the whole repository.)

Usage

mkdir bk-repo
cd bk-repo
git -C ../git-repo fast-export master | bk fast-import

Limitations and known bugs

  • This only imports a single git branch.
  • It doesn’t do incremental imports yet.
  • Doing the import twice generates different metadata, so you can’t currently import two different git branches and have them talk to each other.
  • This currently takes git’s data directly without trying to enable rename detection.
    • Since bk has first class rename events that propagate across merges this is actually very difficult. That is a planned feature.

How to validate an import

A quick way to prove to yourself that the bk import was correct is to turn around and use bk fast-export to feed bk’s data back into your git repository. Something like this:

cd bk-repo
bk fast-export --branch=bkcmp --no-bk-keys | git -C ../git-repo fast-import

Then the commits won’t be exactly the same, but the tree objects should match exactly:

# bk has 2 new csets at the beginning we need to skip
BKROOT=`git log --pretty=%H bkcmp | tail -2 | head -1`
FMT="tree: %T%ncommitter: %ae%ndate: %cD%n%B"
git log --date-order --pretty="$FMT" --raw $BKROOT..bkcmp > BK.log
git log --date-order --pretty="$FMT" --raw master > GIT.log
diff -u BK.log GIT.log

Right now it needs a bunch of testing so I can find other problems.

Future plans

  • The importer needs to support incremental updates so you can re-export the same git repository and only transfer new objects.
  • The incremental code will also correctly handle importing multiple git branches and correctly sharing csets.
  • The code is written to be threaded in the future; when multi-threading is added it could be quite a bit faster.
  • We really need a ‘bk git2bk’ driver command to make the incremental and validation steps automatic. But I started with the fast-import plumbing since people might want to use it for other systems.
  • git2bk should also support importing directly from a git URL

OK. The current code has problems with git fast-forward (--ff) merges. It works in some cases, but not all.
I know how to fix this but won’t have time to do that before the weekend.

Pretty sure I saw that too. If it’s reproducible then I can take a look.

It is a graph transform doing an all-pairs reachability test. Traditionally a Rick problem. At the same time, I also want to re-lay out the csets to account for the bogus timestamps that are common with git.

This will take some thinking so I won’t get there until Tuesday.
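For the curious, the all-pairs reachability idea above can be sketched roughly like this. This is a toy Python illustration, not bk's actual code; the `parents` map and the node names are made up:

```python
# Toy sketch of all-pairs reachability on a cset DAG: give each node a
# bitmask of everything it can reach, so the per-node union is cheap.
# 'parents' maps each cset id to the ids of its parent csets.

def reachability(parents):
    # Build a topological order (parents before children) with a DFS.
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for p in parents.get(n, ()):
            visit(p)
        order.append(n)
    for n in parents:
        visit(n)

    index = {n: i for i, n in enumerate(order)}
    reach = {}
    for n in order:
        bits = 1 << index[n]          # every node reaches itself
        for p in parents.get(n, ()):
            bits |= reach[p]          # inherit everything a parent reaches
        reach[n] = bits

    def reaches(a, b):
        """True if b is an ancestor of (or equal to) a."""
        return bool(reach[a] & (1 << index[b]))
    return reaches

# Example DAG:  A <- B <- D  and  A <- C <- D  (D merges B and C)
reaches = reachability({'A': (), 'B': ('A',), 'C': ('A',), 'D': ('B', 'C')})
```

With per-node bitmasks the whole table costs one union per edge, which is why this shape of problem is tractable even on big histories.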

In our git importers, I usually add a “GIT: <sha1>” line to the end of the imported cset comments to mark that the cset was imported and to provide some bookkeeping to use when doing an incremental update later. I just started to add that feature and realized that fast-export doesn’t actually tell you the rev of the exported csets.

I may need to add that feature to a ‘bk git2bk’ command that uses ‘git fast-export --export-marks=file’ and ‘bk fast-import’ and then goes back and edits the cset comments to add that information. Then an incremental run can pass --import-marks to only export part of a repository.

Another option would be to formalize that marks file and save it for future imports. That is, remember the mapping mark -> md5key, and when git is done, use it to create a mapping sha1 -> md5key that lives outside of the comments (perhaps BitKeeper/etc/import-marks?) and have fast-import load it for incremental imports.
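A rough Python sketch of that mapping idea, for illustration only. The git side is real: ‘git fast-export --export-marks=FILE’ writes lines like ‘:1 <sha1>’. The bk side (a mark -> md5key table produced by fast-import) and the output file layout are assumptions:

```python
# Join git's export-marks file with a hypothetical bk mark -> md5key
# table to produce 'sha1 md5key' lines for future incremental imports.

def load_git_marks(path):
    """Parse git's export-marks file (':1 <sha1>' per line) into {mark: sha1}."""
    marks = {}
    with open(path) as f:
        for line in f:
            mark, sha1 = line.split()
            marks[int(mark.lstrip(':'))] = sha1
    return marks

def build_import_marks(git_marks_path, bk_marks, out_path):
    """Write 'sha1 md5key' lines (e.g. to BitKeeper/etc/import-marks).

    bk_marks is the assumed {mark: md5key} table from bk fast-import.
    """
    git_marks = load_git_marks(git_marks_path)
    with open(out_path, 'w') as out:
        for mark, sha1 in sorted(git_marks.items()):
            if mark in bk_marks:
                out.write('%s %s\n' % (sha1, bk_marks[mark]))
```

An incremental run would load this file, hand git the known sha1s via --import-marks, and only see the new objects.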

OK. Pushed another round of changes to the fast-import repository.

Now I am working on a bug where the import appears to work successfully but check fails:

...
Writing new ChangeSet file...
Renumber...
Generate checksums...
Rename sfiles...
14535 marks in hash (10803 blobs, 3732 commits)

$ bk -r check -a
BitKeeper/deleted/31/SCCS/s.autogen.sh~dd805b1bf9dc52e6: unmerged leaf 1.1.10.1
BitKeeper/deleted/31/SCCS/s.autogen.sh~dd805b1bf9dc52e6: unmerged leaf 1.1.9.1
BitKeeper/deleted/31/SCCS/s.autogen.sh~dd805b1bf9dc52e6: unmerged leaf 1.1.8.1
BitKeeper/deleted/31/SCCS/s.autogen.sh~dd805b1bf9dc52e6: unmerged leaf 1.1.7.1
BitKeeper/deleted/31/SCCS/s.autogen.sh~dd805b1bf9dc52e6: unmerged leaf 1.1.6.1
...

(going to push a related fix to not use the SCCS/s. filename here)

Basically, this is a file convergence issue, where multiple files come together and want to occupy the same spot in the file system. I need to keep track of the files that lose and make sure they are properly merged together and committed.

@wscott How do I remove the 1970 change sets from imported git repos?

Well, you can’t. That was kind of a placeholder.
I will work on that part next.

OK @hasanihunter. Pushed a cset to set the date of the cset that creates your imported repository to 5 minutes before the first git cset.


Today I also pushed a big rewrite that fixed the ‘unmerged leaf’ bugs I reported last week. So again if people find some examples of smaller git repositories that fail to import correctly, send them to me.

Tomorrow I will mess with the git2bk command:

usage: bk git2bk [--no-validation] <git url> <bkrepo>

example:
  bk git2bk https://github.com/borgbackup/borg.git borg

This will create a local bk repository that is a copy of a remote git repo and do the validation steps to prove that the bk repository is an exact match. The validation is tricky because I often find git repos with all kinds of CRLF ending combinations and bk will normalize text files.
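To illustrate the line-ending half of that validation problem, here is a tiny Python sketch of the kind of normalization a validator has to do before comparing file contents. This is an assumption about how such a check might work, not bk's actual code:

```python
# Before diffing a file exported from git against the bk copy, collapse
# CRLF (and stray CR) line endings to LF, mimicking text normalization,
# so that ending differences alone don't flag a mismatch.

def normalize(data):
    """Collapse CRLF and lone CR line endings to LF in a byte string."""
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

def same_content(git_bytes, bk_bytes):
    """Compare two file bodies, ignoring line-ending differences."""
    return normalize(git_bytes) == normalize(bk_bytes)
```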

In the future, that same command line will do an incremental import to pull in changes.

On IRC I told people I would work on a correct import first. Then I asked what people wanted me to do next: faster, incremental, renames (aka smaller), or submodules. The vote I heard was incremental, so that is where I am headed next.

Thanks @wscott , I will give this a shot and let you know if I see any issues

Pushed another huge speedup that appears to be about 2-3X faster.
It eliminates the pause at the beginning while reading the incoming data.

I still have some correctness corners to track down.

Another day, another pile of bugs.

The current version is working very nicely on a number of repositories. I still need to add octopus merges so I can take a shot at importing the kernel.

Please, try it on your favorite repos or suggest github URLs I should use for my debugging targets.

OK. A milestone of sorts.

I am pretty happy with the current state of the fast-import code and so it has been pushed to bk://bkbits.net/u/bk/dev and will be included in the next release.

Still no incremental imports, but it does a pretty good job of handling most git repositories.