Yes, so that should be pretty easy to replace.
When you say ‘index’ BAM files, I’m guessing that you mean: BitKeeper hashes the contents of files synced into BAM, and uses that hash as an index to lookup, store/download BAM files later. Is that idea correct?
If it is, then would you say MD5 is actually speed sensitive in this case? As in, you may need to hash large (multi-GB) files. If so, it’s worth spending a bit of extra time on an optimized implementation. But I think replacing it is still preferable.
(Alternatively, if that’s true, you could choose a faster and stronger hash function like BLAKE2, which could also double as a MAC if you wanted to keep your HMAC-based verification code below. However, if my guess is correct, I’m also guessing that’s a backwards-incompatible change, so it’s off the table.)
That’s easy enough to replace and I imagine fast-import probably isn’t bottlenecked very much on this.
Do these numbers require high entropy and secure generation, e.g. for key material? If so, I’ll spare you a whole bunch of hand-wringing and future complaints from the peanut gallery: this is very easy to do on both platforms. On Windows, you want to use RtlGenRandom, while on Unix, you want to use /dev/urandom. These are the same methods that libraries like libsodium generally use, and people seem to be standardizing on them.
I bring up hand-wringing and peanut galleries because, as with many security matters, people are very dramatic and stern about how to generate random values. But it’s widely accepted to be a good idea to use /dev/urandom and stick with it on almost every modern Unix. So, my suggestion is to just call read on an fd to get the bytes you need, and get word values out of that with some twiddling.
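To make the read-on-an-fd suggestion concrete, here is a minimal sketch of what I have in mind. The names urandom_bytes and urandom_u32 are hypothetical helpers made up for illustration, not anything that exists in BK, and error handling is kept to the bare minimum.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: fill buf with len bytes read from /dev/urandom.
 * Returns 0 on success, -1 on failure. */
int urandom_bytes(void *buf, size_t len)
{
    unsigned char *p = buf;
    int fd;

    assert(buf != NULL);

    /* Retry the open if a signal interrupts it. */
    do {
        fd = open("/dev/urandom", O_RDONLY);
    } while (fd < 0 && errno == EINTR);
    if (fd < 0)
        return -1;

    /* read() may return fewer bytes than requested, so loop until done. */
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;            /* interrupted by a signal: retry */
        if (n <= 0) {            /* real error or unexpected EOF */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    close(fd);
    return 0;
}

/* Hypothetical helper: the "twiddling" step, producing one 32-bit word.
 * Returns 0 on success, -1 on failure. */
int urandom_u32(uint32_t *out)
{
    unsigned char b[4];

    assert(out != NULL);
    if (urandom_bytes(b, sizeof b) != 0)
        return -1;
    /* Assemble from bytes so the code does not rely on host endianness. */
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 0;
}
```

A real version would probably keep the fd open across calls rather than reopening it each time, but that is a design choice I am only guessing at here.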
On top of that, on systems like OpenBSD and Linux 3.17+, you can use direct syscalls (see man 2 getrandom for more) to avoid even having to open a file descriptor.
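For the syscall route on Linux, a rough sketch might look like the following. It assumes a glibc new enough to ship <sys/random.h> (2.25 or later), and sys_random_bytes is a hypothetical name of my own, not an existing function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>   /* getrandom(); assumes glibc 2.25+ on Linux 3.17+ */
#include <sys/types.h>

/* Hypothetical helper: fill buf with len random bytes via getrandom(2),
 * with no file descriptor involved. Returns 0 on success, -1 on failure. */
int sys_random_bytes(void *buf, size_t len)
{
    unsigned char *p = buf;

    assert(buf != NULL);

    while (len > 0) {
        /* flags == 0: same source as /dev/urandom, blocking only until
         * the kernel pool has been initialized once after boot. */
        ssize_t n = getrandom(p, len, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;           /* e.g. ENOSYS on kernels older than 3.17 */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

If this returns -1 with ENOSYS, the sensible fallback is reading /dev/urandom. On OpenBSD the analogous call is getentropy(2), which as far as I know is limited to 256 bytes per call.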
And anyway, even if you don’t need cryptographically secure entropy, /dev/urandom and RtlGenRandom are easily available, good quality, and require no external dependencies. I doubt this is a bottleneck. This is all a very small amount of code, which I’ve written before (for my own libraries in Haskell, where I needed to access the system entropy source, it was ~50 lines of extra C).
HMAC is easy enough to replicate, but just to make sure I’m on the same page: should this be removed or kept? If it’s sort of a workaround for the error suppression, it’s never too late to rip that bandaid off and then drop HMAC. BK does advise strong data consistency checks though, so maybe you want to keep it!
OK, so it could just be nuked in that case?
As you may be able to tell, I’d be willing to write a few patches for this, if the above sounds like an amenable way to drop the libtom dependencies and make the codebase leaner, and you all agree. I probably won’t get to it for a little while, but it sounds like an easy enough way to start contributing.