PROPOSAL: Add field for email addresses

ob · May 17, 2016, 4:53pm

Currently BitKeeper records only two pieces of information about the user:

The UNIX username.
The hostname of the machine where the change was made.

Maybe in the past this could be used for deriving a useful email address, but in modern times it’s very likely these fields will be vagrant@localhost.localdomain or something equally useless.

BitKeeper should have a config variable called email that is stored in the ~/.bk directory and recorded as part of the delta/commit operation. I think it could be recorded in the same way that BitKeeper records the host information.

Implementation Proposal

BitKeeper records the username in a format compatible with the original AT&T SCCS format, that is, as part of the delta header in the graph portion of the SFILE.

dirac src $ bk _scat zone.c
H24854
bk-filever-5
s 16/0/37                                 
d D 1.20 16/02/25 14:20:36 ob 23 22
c Apache License 2.0
c ---
c Use proper date range for copyright
cC
cHwork.bitkeeper.com
cK39679
cZ-08:00
e

The d prefix that represents a delta has a field for username (in this case ob), and the host is recorded in the cH section.

I propose adding a cE section for the email. This should be optional (as existing repositories will not have it) and I think it would be backwards compatible since it appears existing BitKeeper will ignore unknown fields.
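
To make the backwards-compatibility argument concrete, here is a minimal Python sketch of a reader for the delta block shown above. It is not BitKeeper's actual parser (that is written in C); the function name `parse_delta` and the `cE` field are assumptions taken from this proposal, and the example address is made up. The point is only that a reader which skips unrecognized `c?` lines is unaffected by a new optional field.

```python
def parse_delta(lines):
    """Parse one SCCS-style delta block into a dict (illustrative only)."""
    delta = {"comments": [], "email": None}
    for line in lines:
        if line.startswith("d "):
            # d <type> <rev> <date> <time> <user> <serial> <parent>
            parts = line.split()
            delta["rev"] = parts[2]
            delta["user"] = parts[5]
        elif line.startswith("cH"):
            delta["host"] = line[2:]
        elif line.startswith("cE"):
            # Hypothetical field from this proposal; absent in existing files.
            delta["email"] = line[2:]
        elif line.startswith("c "):
            delta["comments"].append(line[2:])
        # Anything else (cC, cK, cZ, e, ...) is skipped by this sketch.
    return delta


SAMPLE = [
    "d D 1.20 16/02/25 14:20:36 ob 23 22",
    "c Apache License 2.0",
    "c ---",
    "c Use proper date range for copyright",
    "cC",
    "cHwork.bitkeeper.com",
    "cK39679",
    "cZ-08:00",
    "e",
]

old = parse_delta(SAMPLE)
new = parse_delta(SAMPLE[:-1] + ["cEob@example.com", "e"])
print(old["user"], old["host"], old["email"])
print(new["user"], new["host"], new["email"])
```

An old repository simply yields `email = None`, which is what "optional" means here.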

wscott · May 17, 2016, 7:09pm

That sounds about right. I would make a tweak that we record email only in the ChangeSet file.

Note that currently BK_USER and BK_HOST can be used to override user and host in the environment, but what is stored is the overridden name AND the original name, separated by a slash (/). We refer to those as :USER: and :REALUSER:.

When discussing the file format change we also have to nail down the binary file format difference. This will be per-delta, so it would be stored in the d2_t type in src/sccs.h. If we make this ChangeSet file only then we can reuse something that isn’t used for the ChangeSet file, like perhaps ‘mode’ or ‘symlink’. We would need to consider what old versions of bk would do if one of those fields were non-zero.

Git has 2 configs user.name and user.email that are combined together to yield something like this:

Wayne Scott <wscott@bitkeeper.com>

Should we do the same with ‘email’ and ‘username’, or just have a single ‘email’ config option that is expected to look like the above?

Perhaps this is a hard file-format change with a new repository feature bit and so older versions of bk would be blocked.

ob · May 17, 2016, 8:47pm

I’d just break compatibility and provide an upgrade path where you can give it an “Authors” file. I’d also use two fields, one for email and one for name, since that would avoid having to parse when you only want one of the fields. Parsing emails/names is notoriously complicated.

wscott · May 17, 2016, 9:01pm

Not sure I buy the parsing argument. On all our performance critical paths we are not parsing this at all. And having a second offset doubles the space used for this in the delta table. If you strictly use NAME <EMAIL> and disallow <> in either field, then it is pretty easy to extract. Or we store “NAME|EMAIL” in the heap then extracting email is even easier as we don’t need to make a copy. That second one is more bk-like.
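
A rough Python sketch of the two encodings, just to compare how much work extraction takes. The function names are made up for illustration, and the validation rules are only the ones stated above (no <> inside either field).

```python
def split_angle(ident):
    """Split 'NAME <EMAIL>', assuming <> never appears inside either field."""
    name, sep, rest = ident.partition("<")
    if not sep or not rest.endswith(">"):
        raise ValueError("expected 'NAME <EMAIL>': %r" % ident)
    email = rest[:-1]
    if "<" in email or ">" in email or ">" in name:
        raise ValueError("stray angle bracket in %r" % ident)
    return name.strip(), email


def split_pipe(ident):
    """Split 'NAME|EMAIL' at the first '|'."""
    name, sep, email = ident.partition("|")
    if not sep:
        raise ValueError("expected 'NAME|EMAIL': %r" % ident)
    return name, email


print(split_angle("Wayne Scott <wscott@bitkeeper.com>"))
print(split_pipe("Wayne Scott|wscott@bitkeeper.com"))
```

In the pipe form the email is the tail of the stored string, so in C a pointer just past the '|' is already a usable NUL-terminated string. The angle form has a trailing '>' to drop, which is where the copy comes from.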

(we do need to define how this extends the patch format, but I think that is also just 'E Name '. (see slib.c:do_patch()))

That is one part Larry will want to bring up. When running ‘bk changes’ or looking at revtool, what do you use for ‘username’ or the user@host.

People who have ‘first.last@company.com’ may prefer keeping username as the small identity of the user.

wscott · June 21, 2016, 11:29am

The more I think about making this change the less I am sure about it.

Current Implementation

Above, @ob shows a sample SCCS output, but that is not the actual file format. Currently, we save a single string per file delta to record the user and host. This string looks like this:

$ echo hi > foo
$ BK_USER=user BK_HOST=host.com bk new foo
foo revision 1.1: +1 -0 = 1
$ bk log -nd:FULLUSERHOST: -r+ foo
user/wscott@host.com/x99.wscott.bitkeeper.com

Without BK_USER and BK_HOST, just wscott@x99.wscott.bitkeeper.com would have been saved.

So BitKeeper records both the actual local username and hostname where the commit was made and the requested user@host from the environment overrides. The BK_USER name is the one usually displayed when the history is being browsed.

We normally use BK_USER=name when creating a commit that was actually written by another person. So in a way this is like the committer/author split used by git. But not really, as I hope to explain.

Internally these are called HOST and REALHOST (same for USER):

$ bk log -nd:HOST: -r+ foo
host.com
$ bk log -nd:REALHOST: -r+ foo
x99.wscott.bitkeeper.com
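
For readers following along, the :FULLUSERHOST: string from the example can be taken apart as in the Python sketch below. This only mirrors the string layout shown in this post; the function name is made up, and the assumption that the REAL values equal the displayed ones when no override was set is inferred from the non-overridden example above, not from bk's source.

```python
def split_fulluserhost(s):
    """Split 'USER[/REALUSER]@HOST[/REALHOST]' into its four parts."""
    user_part, _, host_part = s.partition("@")
    user, _, realuser = user_part.partition("/")
    host, _, realhost = host_part.partition("/")
    return {
        "USER": user,
        # Assumption: with no override, the real name is the same as the shown one.
        "REALUSER": realuser or user,
        "HOST": host,
        "REALHOST": realhost or host,
    }


overridden = split_fulluserhost("user/wscott@host.com/x99.wscott.bitkeeper.com")
plain = split_fulluserhost("wscott@x99.wscott.bitkeeper.com")
print(overridden)
print(plain)
```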

The REALUSER@REALHOST is very important internally because the delta uniqueness guarantee is based on the assumption that a hostname is a unique name for the current machine and that user’s home directory is the same for all csets made by this user in any repository with this hostname.

So while USER@HOST could be a valid email address and corresponds to git’s ‘author’, the REALUSER@REALHOST is unlikely to be a valid email address and certainly not the canonical name for the ‘committer’ field.

Proposal

Above, @ob proposed that we add a new email field in addition to the existing :FULLUSERHOST: field. I don’t think that is really necessary. We can just embrace the fact that we already store two names and use USER@HOST as the email address of the user who created the cset.

We don’t need to save the user’s name with each cset since the email is a unique key for that user. We can have a BitKeeper/etc/authors file that is automatically maintained giving the mapping from email addresses to names if we want to include the user’s name in some reports or have a place to import/export data from git repositories. (Yes there will be some inaccuracy as git could have multiple names for the same person.)
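
As a sketch of what such a mapping could look like, the Python below assumes a purely hypothetical file layout of one "email Name" entry per line with '#' comments; no format for BitKeeper/etc/authors has been defined in this thread. When the same email appears with several names (the git import case), this sketch keeps the first one seen.

```python
def load_authors(text):
    """Build an email -> name map from 'email Name' lines (hypothetical format)."""
    authors = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        email, _, name = line.partition(" ")
        # First name seen for an email wins; later duplicates are ignored.
        authors.setdefault(email, name.strip())
    return authors


def display_name(authors, email):
    """Name for reports, falling back to the bare email when unmapped."""
    return authors.get(email) or email


AUTHORS = """\
# email name
wscott@bitkeeper.com Wayne Scott
wscott@bitkeeper.com W. Scott
"""

table = load_authors(AUTHORS)
print(display_name(table, "wscott@bitkeeper.com"))
print(display_name(table, "unknown@example.com"))
```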

So in the $HOME/.bk/config file we can record the Name/email for csets that are created on this machine. And perhaps make bk require that this be set in normal operation. The existing BK_USER and BK_HOST could also be set in the environment, but I would probably extend BK_USER so it can take the whole email address if needed.
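
One possible lookup order, sketched in Python. Everything about the precedence here is an assumption for discussion: environment overrides first, then the configured email, then the local user and host. The function and parameter names are invented, and treating a BK_USER value that contains '@' as a full address is the extension suggested above, not current behavior.

```python
def resolve_email(env, config_email, realuser, realhost):
    """Pick the address to show for a new cset (assumed precedence)."""
    bk_user = env.get("BK_USER")
    bk_host = env.get("BK_HOST")
    if bk_user and "@" in bk_user:
        # Proposed extension: BK_USER carries the whole email address.
        return bk_user
    if bk_user or bk_host:
        # Existing style of override: combine whichever parts were given.
        return "%s@%s" % (bk_user or realuser, bk_host or realhost)
    if config_email:
        # Value read from the user's config file (key name not yet defined).
        return config_email
    return "%s@%s" % (realuser, realhost)


local = ("wscott", "x99.wscott.bitkeeper.com")
print(resolve_email({"BK_USER": "user", "BK_HOST": "host.com"}, None, *local))
print(resolve_email({"BK_USER": "first.last@company.com"}, None, *local))
print(resolve_email({}, "wscott@bitkeeper.com", *local))
```

If bk ends up requiring the config value in normal operation, the final fallback would become an error instead.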

Questions

Do we need a separate committer email identifier other than just the UNIX user and hostname?
I think that is from Linus’ model of committing email patches.
What about dual credit for csets developed by multiple people? In the past we have done stuff like BK_USER=ob+wscott, but that only works if all the tools expect that.
Unlike git, in BitKeeper you can have a different author for each file in the cset and that works pretty well. File changes are owned by the person who made most of those changes and the overall cset has a single owner.