[bkd] HTTPS, cloning, separating the web UI from the repository daemon

thoughtpolice · June 27, 2016, 8:24pm

Hi,

I’ve been playing with the bitkeeper daemon on my server, and I had a few semi-related questions on how the daemon operates (including what may be considered a feature request).

Right now, when you run the BitKeeper daemon, it binds to a port and serves repositories over that port as specified in the manual. If you visit this port in your browser, you also get a web view similar to https://bkbits.net

What I’m wondering is this: is there a way to split the daemon, so that it binds the Web interface to one port, while binding the actual daemon allowing clones to a separate port?

My use case is that I’d like to serve only the repository for clone operations, and front the actual Web UI with nginx. Not only is nginx more suitable for this (with features like Geo-IP and load balancing – not to mention proxying all my other HTTP services already), but I’d like to use Let’s Encrypt to add certificates to serve the repository browser over HTTPS. Essentially, I’m already running one web server and know how to configure it, so just being able to shove it in there is ideal.

Relatedly: does BitKeeper allow HTTP/HTTPS clones?

One of the more useful features of git is allowing HTTPS clones, actually. The major bonus of doing this is that it allows git to work well in firewalled environments since port 443 is basically always allowed. It also has two other minor knock-on benefits, I think:

You can use a separate authentication mechanism on the web server if you want to provide authenticated web access (e.g. you can use HTTP authentication over HTTPS, where the server may use e.g. LDAP to field login requests with a custom backend, before passing on the request – with github you have separate passwords for HTTPS authentication, vs ssh)
This scheme also works well with providers like CloudFlare – over HTTP you never have to reveal any information since the CDN proxies you and shadows your real IP (and by cloning over HTTPS, you get authentication/encryption as well for privacy)

Finally, separating the web UI means I can put another web UI in its place (for example, while git ships with gitweb, many people prefer cgit for its nicer UI and awesome performance), e.g. one that supports syntax highlighting (which, IMO, falls outside the scope of BitKeeper itself and what it should do).

thoughtpolice · June 27, 2016, 8:28pm

From a quick browse and read of the source in clone.c and bkd{,_http}.c: no, I don’t think any of this is possible yet (admittedly I don’t understand everything yet). So I guess consider this a giant feature request with some motivation!

wscott · June 27, 2016, 9:01pm

Some of these are possible. You can run multiple bkd’s and put them on different ports. And you can use the -xCMD option on your different bkd’s to have certain commands disabled on the different ports. So it is a bit awkward, but you can set up a bkd that only serves the web API and another on a different port that will talk to repos. Or a bkd that only provides read-only access.

Also on bkbits, we are serving the web interface to the bkd via https by configuring apache as a reverse proxy. (nginx could do the same thing)

But the reverse proxy won’t work for cloning because the bk client side doesn’t have https support.
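For anyone who wants to try the reverse-proxy half of this with nginx, here is a rough sketch: a shell function that prints an nginx server block. It assumes a web-only bkd listening on 127.0.0.1:8080, a hypothetical hostname bk.example.org, and Let's Encrypt certificates in the usual certbot location. None of these values come from the bkbits setup, and this thread does not say whether bkd needs any extra proxy headers, so treat it as a starting point only.

```shell
#!/bin/sh
# Print an nginx server block that terminates TLS and proxies to a local bkd.
# Hostname, port and certificate paths are illustrative assumptions.
bkd_nginx_conf() {
    host=$1   # public hostname, e.g. bk.example.org (hypothetical)
    port=$2   # local port of the web-only bkd, e.g. 8080 (assumed)
    cat <<EOF
server {
    listen 443 ssl;
    server_name $host;

    ssl_certificate     /etc/letsencrypt/live/$host/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$host/privkey.pem;

    location / {
        # Forward browser requests to the bkd web interface.
        proxy_pass http://127.0.0.1:$port;
        proxy_set_header Host \$host;
    }
}
EOF
}

bkd_nginx_conf bk.example.org 8080
```

This only covers browsing. Clones still go to the bkd port directly, since the bk client has no https support.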

thoughtpolice · June 27, 2016, 9:29pm

Thanks, you’re right. I started bkd with -xpush -xhttpget -xget and it seems I can still clone, but there’s no HTTP interface. Here’s my full command for running bkd under systemd on Ubuntu:

/usr/local/bin/bk bkd -D -C -p14690 -xpush -xhttpget -xget

Can I ask what the difference between get and httpget is? It seems I need both, but it’s unclear why.

Also, for the inverse, just serving the HTTP server (to be served by nginx), should I exclude everything else besides -xget -xhttpget -xcd? That is, like

-xabort -xcheck -xclone -xpull -xpush -xpwd -xrclone -xrootkey -xstatus -xsynckeys -xversion
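To keep the two flag lists straight, a small helper can build the -x options from plain command names. This is only a sketch: the helper name bkd_exclude_flags is made up, the command names are the ones already listed in this thread, port 8080 for the web-only instance is an arbitrary choice, and the script prints the bkd command lines rather than running them.

```shell
#!/bin/sh
# Turn a list of bkd command names into "-xCMD" flags.
bkd_exclude_flags() {
    flags=""
    for cmd in "$@"; do
        # Append "-x<cmd>", separated by a space after the first entry.
        flags="${flags:+$flags }-x$cmd"
    done
    printf '%s\n' "$flags"
}

# Commands to disable on the clone-only daemon (no web interface, no push).
CLONE_ONLY="push httpget get"
# Commands to disable on the web-only daemon (everything except get/httpget/cd).
WEB_ONLY="abort check clone pull push pwd rclone rootkey status synckeys version"

# The lists are deliberately unquoted so they split into separate arguments.
echo "bk bkd -D -C -p14690 $(bkd_exclude_flags $CLONE_ONLY)"
echo "bk bkd -D -C -p8080 $(bkd_exclude_flags $WEB_ONLY)"
```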

Also, one final Q: I start bkd with -C to prevent bkd from moving upwards from where it’s started (in my case, inside /srv/repos/bk), but I want to serve a set of repositories under a nested folder structure. In the docs, -xcd is mentioned for individual repositories, as a way of stopping any cd command. Just to be clear:

-C stops any moves upwards, meaning you can start it at the root of a nested set of folders and it won’t ever go above that.
-xcd stops all moves to any other directory, which means it’s strictly a superset of -C, and only meant for daemons serving a single repository (no nested structure possible: you can’t cd into subdirs)

Is this interpretation roughly correct?

(Finally, I have a nice little systemd service file I’ll share sometime shortly if I can get this working. It might be nice to have a single service file for systemd based distros, perhaps inside BitKeeper’s repo itself.)

mcvoy · June 27, 2016, 10:10pm

If you are planning on running a bkd on a public machine where you are concerned about security, I wouldn’t. I would use ssh:// urls instead, they are secure. The bkd was never designed to be used on a public machine where there is secret stuff.

On the other hand, we used it for years on bkbits.net and had zillions of projects hosted there back in the day, it worked fine. We allocated a Unix user for each project, did 755 mode on the directories, never had a problem.

Your interpretation of -x/-xcd is correct but I wouldn’t lean on those too heavily. It’s open source, I’m sure someone could find some exploit in there.

For a long time we had both a cmd_get and a cmd_httpget. Whatever the get command did is long gone; both now map to httpget. I suspect we had/have code that uses both so we just left them there. You might tinker and see if you need both.

Love to see your systemd file, we can stick that in src/contrib/systemd?

mcvoy · June 27, 2016, 10:11pm

On https, that would be something we’d love to see contributed at this point. Any chance of that?

thoughtpolice · June 27, 2016, 11:01pm

Right, so I’m not really worried about secrecy, I just want a read-only mirror anyone can clone from, similar to bkbits.net. Any authenticated pushes would be behind ssh; but right now I’m just playing with fast-import mirrors so I haven’t set that up yet. So, the scheme would be:

Let users read-only clone from bk://...
You can read the source in your browser at https:// (since bk’s default port is 14690 this is more ‘obvious’ for users coming from gitweb, hgweb, etc where you can just replace the bk URI scheme with https).
Any authenticated pushes are behind ssh with a user/authorized_keys file associated (not so worried about this at the moment)

So I think that basically answers most of my questions. The above should be OK, yes?

Also, yes, I will share my unit files shortly, but I’m trying to reorganize them a bit so they can handle these cases pretty easily. On the note of security, one convenient thing is that systemd can put several restrictions on daemons under its control, including things like prctl(PR_SET_NO_NEW_PRIVS), but the more important one is it can put bkd inside a private namespace/cgroup, and do things like:

give bkd a private /tmp directory that is unavailable to any other process (this hardens a bit against accidental /tmp race conditions, and should be fine unless bkd uses /tmp for some shared IPC nonsense. But on modern Linux, you should use /run for shared IPC, only in the very narrow, security audited scope that is necessary, and /tmp is for all the other stupid junk.)
Restrict /dev access by the same idea (the setting closed disallows access to any devices, except /dev/null, /dev/zero, /dev/full, /dev/random, /dev/urandom. If bk doesn’t even need those, it can be further restricted).
Stop directory changes; for example, the bk read-only daemon can be run with a policy saying ReadOnlyDirectories=/srv/ (let’s say your repos are in /srv/repos/bk), meaning that systemd forces bkd into a namespace where /srv/ is non-writeable. You can even further restrict this by banning /etc, /boot, and anything else not necessary as completely unreadable, even.

I’ll work out some of these changes and report back in a separate thread. Also, I could in theory contribute HTTPS support, but there are some open questions (what SSL libs to use, etc). Also, as Yet Another Random Inline complaint, I’m not entirely sure how to share patches. It seems there needs to be yet another yak shaved: a simple code-review tool for BitKeeper in the meantime where I can just post .diff files. But that bridge can be crossed later.

thoughtpolice · June 27, 2016, 11:03pm

Side note: since this is enforced with a namespace, I guess it really depends on bkd not needing to call write(2) at all on the specified path, or things will go wrong. But assuming I specify bk bkd -xpush, and bkd in this mode never needs to call write(2) - you can at least get the Linux kernel to enforce the read-only property for you.

wscott · June 27, 2016, 11:24pm

Even read-only operations in bk need the ability to write to the repositories in order to acquire read locks. We use file based locking for compatibility. (locks in a shared NFS directory with ancient unix hosts)

One TODO idea that has been on my wish list forever is to add tags to that bkd command table so you can do -xWRITE to turn off all write operations. (You did -xpush but did you remember -xrclone?) And -xWEB to turn off the web interface. That sort of thing.

Running the bkd in a container with limited access is a good idea. Nice to have that all wrapped up.
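As a rough illustration of what that wrapping could look like, the hardening ideas from earlier in the thread map onto systemd directives roughly as below. This is a sketch, not the contents of any unit file posted here: the function name is made up, the directive names are the ones used by systemd releases of that era (newer releases prefer ReadOnlyPaths=, ReadWritePaths= and InaccessiblePaths=), and the repository root is left writable because of the lock files mentioned above.

```shell
#!/bin/sh
# Print a systemd drop-in with the hardening directives discussed above.
# repo_root is the only path under /srv left writable, since bkd creates
# lock files inside repositories even for read-only operations.
bkd_hardening_dropin() {
    repo_root=$1
    cat <<EOF
[Service]
# Same effect as prctl(PR_SET_NO_NEW_PRIVS).
NoNewPrivileges=yes
# Private /tmp that no other process can see.
PrivateTmp=yes
# Only /dev/null, /dev/zero, /dev/full, /dev/random and /dev/urandom.
DevicePolicy=closed
# Everything under /srv is read-only, except the repository root.
ReadOnlyDirectories=/srv/
ReadWriteDirectories=$repo_root
# Hide paths bkd has no reason to read (extend with care).
InaccessibleDirectories=/boot
EOF
}

bkd_hardening_dropin /srv/repos/bk
```

The output could be saved as a drop-in file for whatever the bkd unit is called, followed by a systemctl daemon-reload.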

SSH access to the bkd has two options (see bk help url)

ssh://user@host/dir/repo
This does a ssh connection to a normal user shell account and then runs 'bk bkd'
bk://user@host/dir/repo
This does a ssh connection but assumes the login shell just exec’s bk bkd directly. We often use a shell script for the login shell like this:

#!/bin/sh

exec bk bkd -l.log 2>.errors

The manpage says the second form is deprecated, but I don’t think that is really true.

thoughtpolice · June 29, 2016, 11:11pm

FWIW, here are some systemd unit files for BitKeeper: a read-only web daemon and a read-only bkd instance. These are hardened against unwarranted file system access, and assume a bk user with read/write permissions only on /srv/repos/bk. Everything else on the system is completely denied write access in both cases.

Unfortunately, working in similar ‘jail’ features for the login shell is a bit trickier and probably not possible with systemd without doing something strange like using a container, which would be slow and weird. Login shells/SSH imply a level of trust already though, so maybe that’s OK.


https://gist.github.com/thoughtpolice/6c096b6666532893a1354b46f7b64ef9

bitkeeper-read.service

[Unit]
Description=BitKeeper Daemon (read-only access)
Documentation=https://www.bitkeeper.org/man/bkd.html
After=network.target

[Service]
ExecStart=/usr/local/bin/bk bkd -D -C -p14690 -xpush -xrclone -xhttpget -xget
User=bk
Group=bk
WorkingDirectory=/srv/repos/bk

(Truncated here; the full file is in the gist linked above.)

bitkeeper-web.service

[Unit]
Description=BitKeeper Daemon (Web browser)
Documentation=https://www.bitkeeper.org/man/bkd.html
After=network.target

[Service]
ExecStart=/usr/local/bin/bk bkd -D -C -p0.0.0.0:8080 -xabort -xcheck -xclone -xpull -xpush -xpwd -xrclone -xrootkey -xstatus -xsynckeys -xversion
User=bk
Group=bk
WorkingDirectory=/srv/repos/bk

(Truncated here; the full file is in the gist linked above.)

Also, if you haven’t seen it before, I heartily recommend taking a look at gitolite (which I can describe in more detail). Briefly, it allows you to do things like have multiple repositories served by a single Unix user. It uses a similar approach to a login shell, but instead of merely relying on SSH, it also keeps a plain-text database that maps SSH keys to abstract user IDs. These user IDs can have more granular enforcement, like being able to write only to a subset of repositories.

That means you only need a single unix user allocated for all incoming push requests, and the daemon is mapped only to a single root directory containing all repositories. The login shell reads the database and only allows access as specified by the config language.
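To make the key-to-user mapping concrete: one common way to get it with OpenSSH is to give every key in authorized_keys a forced command that names the abstract user. The sketch below generates such lines from a directory of .pub files. The function name and the wrapper path are hypothetical, and treating name@N.pub as another key for user "name" is an assumption borrowed from gitolite's naming convention; this is not gitolite's actual code.

```shell
#!/bin/sh
# Emit authorized_keys lines that pin each public key to an abstract user ID.
# Usage: gen_authorized_keys KEYDIR WRAPPER
gen_authorized_keys() {
    keydir=$1
    wrapper=$2   # hypothetical login wrapper that enforces the access rules
    for pub in "$keydir"/*.pub; do
        [ -f "$pub" ] || continue
        name=$(basename "$pub" .pub)
        user=${name%%@*}   # alice@1.pub and alice@2.pub both map to alice
        # The forced command replaces whatever the client asked to run.
        printf 'command="%s %s",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty %s\n' \
            "$wrapper" "$user" "$(cat "$pub")"
    done
}

# Example (paths are placeholders):
#   gen_authorized_keys ./keydir /usr/local/bin/repo-shell > ~/.ssh/authorized_keys
```

With a forced command in place, sshd passes the command the client originally requested in the SSH_ORIGINAL_COMMAND environment variable, so the wrapper can check it against the rules before running anything.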

There are a lot of ways to do this, but it would be awesome if bk supported something like this easily through bkd (and I think it also aligns with the ideas of providing strong access control policies and security for repos).

thoughtpolice · June 29, 2016, 11:14pm

Oh, and those unit files are only tested on Ubuntu 16.04, systemd v229. They have also had relatively light testing, so it’s possible the access controls are too restrictive. Caveat emptor, etc.

mcvoy · June 30, 2016, 1:38am

On the gitolite approach, maybe I’m not getting it. Years ago we had a bk hostme command that was clunky as heck but it somehow created a new unix user, installed the ssh keys, and cloned your repo up to bkbits. We had thousands of users (I think) and it worked fine. uid_t is 32 bits on Linux (at least on my laptop) and I’d love to have the problem where we overflowed 32 bits of users. Github is at 15 million last I checked.

So I get that you could do it differently but why reinvent the wheel? Linux already has a concept of a user.

That said, I’d love to have this: a bk rti (aka pull request) that worked like

cd my_new_feature
bk rti http://bkbits.net/u/bk/dev

what that would do is create a clone of dev on bkbits, figure out the repo gca and undo to that, push my_new_feature to that clone.

It’s a way to do a pull request without registering or creating an account. There is some crud that needs to be done, there needs to be some info filled out like what is this, who did it, etc, and some verification so that there is a valid email associated with the request. How I imagine that is there is a web page associated with the repo at http://bkbits.net/u/bk/dev/ that shows you all the active RTIs. In order for your RTI to be on the list you have to do the little form, it sends you one of those confirmation links, you click on that, now we know you are real and your stuff gets added to the list.

To me, that seems like the big problem with creating an account, etc. It’s too heavy weight if all you want to do is say “here’s my cool stuff, consider it”.

thoughtpolice · June 30, 2016, 3:33am

It’s not so much about overflowing unix user IDs, but more about having fine grained access control and centralized management of permissions that’s convenient. In my experience, Gitolite is a lot more robust and convenient than sort of DIY Unix management utilities for user accounts.

There’s also a lot of philosophical principles tied up in these sorts of ideas, so I’ll try not to sound too weird.

One thing that immediately stands out is: the gitolite approach does not require any level of unix permission on the machine to manage repositories, meaning it can be done completely out of band from system administration of user accounts.

When you install gitolite, you give it the initial “administrator” public key. Then, it creates a “configuration” repository with a default config, lets your key have “admin access”, and lets you clone from it. This means the configuration is directly managed by gitolite itself. Again, this is always under a single actual unix account, something like git. So you clone:

$ git clone ssh://git@hostname.domain/gitolite-admin.git
$ find gitolite-admin -type f | grep -v '/\.git/'
gitolite-admin/conf/gitolite.conf # configuration
gitolite-admin/keydir/austin.pub  # my pubkey

Now there are a few things here: one, I’m only allowed to clone it because by default, gitolite’s configuration restricts clones of the configuration to people with ‘administrator’ pubkeys only. (During install, you say “Here is the pubkey for austin who is the default admin”). Two, the pubkeys are managed as direct files in version control, making them much more easily auditable, etc.

So if I want to add any more administrators, all I need to do is add a line giving them read/write capability to gitolite-admin, and they can manage the configuration and their own keys. Git management is completely locked down to a single user account. As a consequence, all the repositories are under that user’s $HOME, owned by the git gid/uid, which is all that’s needed.

As the system administrator of both the box that hosted GHC repositories (with 30 user accounts and some level of automation) and its gitolite setup, I would take gitolite every day over managing individual accounts directly with some sort of devops/deployment tool for the vast majority of deployments, and it has made management a lot easier.

In the bkd documentation, one suggestion is to run multiple copies of bkd if you want to have anonymous read-only repositories depending on your directory layout. But this is a bit weird, especially because if you want to make it ‘easily’ transparent, you need a frontend load balancer. Indeed, for many version control systems, the scheme thing clone thing://host.fqdn/thing/repository is very familiar, almost muscle memory, so hosting on different ports and multiplexing between them is an odd hosting requirement unless the daemon itself makes it “almost transparent”.

Instead, you can just put everything in a single directory, run one instance of ssh, and the login shell keys access control based on A) the rules matched with B) the incoming SSH key. This doesn’t require any sort of fiddling with the overall init system, either - the rules and matching key alone determine access.

gitolite also allows a level of access control that I don’t think unix permissions can replicate easily, at least not without a weird amount of integration with the underlying machine (which may not be OK).

So for example, in GHC, we don’t let people delete any branches in a repository, unless they have the prefix wip/. Anyone can collaborate but you can’t delete it if it doesn’t have that prefix, unless you’re an admin.

OK, so in bk, what you’d do is create a separate user who could just use their own ‘wip’ fork, right? After all, since bitkeeper uses directories as branches, you can form these as a real hierarchy. Then someone else forks them and the original person pulls back, etc etc.

But a lot of times you may have more than a single person working in one repository, even a fork. How do you manage the permissions then? If both Bob and Alice are working on alice/cool-feature (they might sit next to each other in the office, so it’s easy enough to just work on the same upstream, and alice is leading), and Greg wants to join, how do you add permission? You’d have to have a group for alice/cool-feature with all the right users; but it still doesn’t model the right level of granularity: maybe greg and all his keys should be able to clone (because he’s in QA), but can’t push (bob or alice have to, since they’re developers). But he does need to push to other repositories. So having bkd outright restrict writes for every one of the keys greg has is wrong. He needs to read some, and write only to a limited subset of those.

All of this is very easily expressible with gitolite as a series of read/write permissions, and it doesn’t require any integration with the overall Unix system. It’s a simple, declarative ruleset that is fairly easy to understand and audit, and version controlled.

Changing group or unix permissions may be actively more difficult to automate depending on the outside system. Such things would be disallowed in a system like NixOS - in NixOS, you write a configuration file, and ‘compile’ your system to a known state. It’s completely reproducible, deterministic, etc. Including group and user management - that’s all in a text file. Automating BitKeeper permissions on top of the unix model when this is in play is much more difficult; you have to actually commit a text file with new code to build a new group, then run ‘the compiler’ to build everything so the user is part of the new groups, etc.

Basically, at least for git, collapsing actual user management into the repository management tool itself gives you a lot of freedom and simplicity that abstracts you away from the underlying host system. It’s much, much easier to use a declarative ruleset to describe things like “the group foo is only allowed to write to refs foobar/dev*, interpreted as a regex” than script the unix permission model for these things. At least IME.

EDIT: This also seems to be made more complicated by the fact that BitKeeper permissions and unix permissions don’t align. In the previous example, Greg needs “unix write” permissions for read locks, as indicated earlier by @wscott. But he does NOT need “bitkeeper cset write” capability, i.e. the ability to push csets. So how do you have one directory, with three people who have unix read and write capabilities (again, for read locks), but different permissions at the BitKeeper level? Greg can only read, not write. Alice/bob can read and write. These rules are also per repository, or per groups of repositories. You can sort of square this circle somehow I’m sure, but it all feels very weird.

EDIT TWO: As an example in comparison, in Gitolite you’d say something vaguely like:

# alice, bob and greg can have multiple keys, in the files
# alice@1.pub, alice@2.pub, etc, alongside this config
# file
@devs = alice bob
@qa = greg

# rules for @devs-repos, which may be a single repository,
# or some grouping of repositories
repo @devs-repos
    R = @qa    # QA has no write ability
    RW = @devs # developers have full capability
    ... other rules here ...

And that’s it: this use case is completely satisfied.
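For readers who have not used gitolite, the effect of those rules can be modelled in a few lines of shell. This is a toy model of the semantics only (who may read, who may write), with made-up function names; it is not how gitolite is implemented and it ignores per-repository matching.

```shell
#!/bin/sh
# Toy model of the ruleset above: map a user to a group, then a group to rights.
group_of() {
    case "$1" in
        alice|bob) echo devs ;;
        greg)      echo qa ;;
        *)         echo none ;;
    esac
}

# check_access USER OP, where OP is "read" or "write"; prints allow or deny.
check_access() {
    case "$(group_of "$1"):$2" in
        devs:read|devs:write) echo allow ;;   # RW = @devs
        qa:read)              echo allow ;;   # R  = @qa
        *)                    echo deny ;;    # everything else is refused
    esac
}

for who in alice greg mallory; do
    echo "$who read=$(check_access "$who" read) write=$(check_access "$who" write)"
done
```

The point is that Greg's right to clone and his lack of a right to push are decided by this table, not by unix file permissions on the repository directory.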

In fact, gitolite does not go far enough, IMO. I see no reason why gitolite has to read its database from a flat file; indeed, it could easily query some “oracle” every time it sees a new git push, and the oracle (a script) might query MySQL, a REST API, etc to assess authorization - is it the right key, can it push here, etc. Indeed, this is exactly how modern self-management UIs mostly work. The login shell (or ssh itself, using AuthorizedKeysCommand in OpenSSH 6.2+) is what queries for access to SSH keys and rules about who can do what, from some endpoint.

Why wouldn’t the theoretical bk-olite allow that? For example, it could read a flat file in the basic mode, but there’s no reason an exec mode couldn’t run some external program. It expects that program to output some information on stdout to assess authorization. So someone tries to push, theoretical bitkeeper-olite invokes /usr/bin/test-bkolite-creds.sh, which might call a script to query MySQL and return a ruleset. I think this would be pretty nice, actually! Wire that database up to an account UI, and boom - you can do things like self-manage SSH keys!
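A minimal sketch of what such an exec mode could look like, under an invented protocol: the oracle gets the user, repository and operation as arguments and prints a single line, allow or deny. Nothing like this exists in bk today; the function names and the protocol are assumptions for illustration, and the stub oracle stands in for a script that would really query a database.

```shell
#!/bin/sh
# Ask an external "oracle" whether USER may perform OP on REPO.
# Assumed protocol (invented for this sketch): the oracle receives the three
# values as arguments and prints one line, "allow" or "deny".
# Anything else, including a failing oracle, is treated as deny.
authorize() {
    oracle=$1 user=$2 repo=$3 op=$4
    answer=$("$oracle" "$user" "$repo" "$op" 2>/dev/null) || answer=deny
    [ "$answer" = allow ]
}

# Stub oracle standing in for a script that would query MySQL, a REST API, etc.
stub_oracle() {
    case "$1:$3" in
        alice:*|bob:*) echo allow ;;
        greg:read)     echo allow ;;
        *)             echo deny ;;
    esac
}

if authorize stub_oracle greg dev write; then
    echo "push accepted"
else
    echo "push rejected"
fi
```

Failing closed matters here: if the oracle crashes or the database is down, the safest answer for a push is no.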

However, all that said, it might be best if bk itself doesn’t get into this business. It’s somewhat open ended. Note that this all comes from the role of “I help 30+ developers work on this project, we need management/ACL tools to help them and stop us from shooting ourselves”. At the same time, I also deal with the “we need to merge drive by patches, without too much activation energy” problem. But using a tool like gitolite in spirit, you could definitely build out the core of more flexible user/role control mechanism, on top of which you could build something like bkbits.net.

Maybe I should write my own bk-olite as an exercise!

Indeed, as the above comes from experience managing our own project with 30+ developers who needed rulesets: access controls are really useful. BitKeeper actually does several things right here, like letting you check in triggers with the code itself (huuuuugely useful for things like precommit hooks to lint your code).

Anyway, I’ll stop talking now before I go in circles and you stop listening. I hope this isn’t too inane sounding. I’m afraid the rti is something I’d have to think about more, but I do have opinions on “pull request” stuff, too, so I’ll spare you those.