Reflections on going distributed

It's nearly traditional at this stage in an introduction to DVCS to demonstrate several different workflow scenarios that you can build with a DVCS. This makes the important point that a DVCS can be adapted to your workflow in a way that is at best unwieldy with a CVCS. I intend, though, to break with tradition here.

By this stage, I hope you can see that distributing version control works by introducing branches where development takes place in parallel. Mercurial treats these branches as arising naturally from the commits made and transferred between repositories. Both Git and Bazaar take a slightly different viewpoint, and explicitly generate a fresh branch for work in a particular repository. But in both cases the underlying principle of identifying changes by a globally unique identifier and resolving parallel development by merges between overlapping changes is the same. And all three can be used in a truly distributed manner, with full history and the ability to commit being available locally.

So instead of chattering on about workflows, I want to reflect on the consequences all this has for that all-important question of whether a DVCS is a suitable vehicle for your data.

The first is a minor and rather obvious point. If you want to store files that are very large and which change often in your DVCS, then all the compression in the world is unlikely to stop the storage requirements for the full project history from becoming uncomfortably large, particularly if the files are not very compressible to start with.

The second, and main, point is that there is an important question you need to ask about your data. We've seen that a DVCS relies on branching and merging to weave its magic. So take a close look at your data, and ask:

Will It Merge?

The subset of plain old text which comprises program source code requires some human oversight, but will merge automatically well enough for the process to be well within the bounds of the possible.

Unfortunately, when we move further afield, mergeability becomes a rarer commodity. I nearly began the previous paragraph by stating that plain old text will merge well enough. Then Doubt set in -- what about XML? Or Base64-encoded content?
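
To put a number on that Doubt, here is a minimal Python sketch. It applies the same one-character edit to a small text file and to a Base64 encoding of that file, then counts how many lines a line-based comparison can still pair up between the two versions. The sample file, the edit and the helper `matching_lines` are all illustrative assumptions; the standard library's `difflib` stands in for the line matching that a textual merge relies on, and is not the algorithm any particular DVCS uses.

```python
import base64
import difflib


def matching_lines(old, new):
    """Count the lines a line-based comparison can pair up between versions."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


# A small plain-text "file" of 20 distinct lines (illustrative data).
original = [f"line {i}: value = {i * i}\n" for i in range(20)]

# The edit: one extra character on the first line only.
edited = list(original)
edited[0] = "line 0: value = 10\n"

# The same two versions stored as Base64, wrapped at 76 characters per line.
original_b64 = base64.encodebytes("".join(original).encode()).splitlines()
edited_b64 = base64.encodebytes("".join(edited).encode()).splitlines()

text_unchanged = matching_lines(original, edited)
b64_unchanged = matching_lines(original_b64, edited_b64)

print(f"plain text: {text_unchanged} of {len(original)} lines unchanged")
print(f"Base64:     {b64_unchanged} of {len(original_b64)} lines unchanged")
```

With this sample the plain text version keeps 19 of its 20 lines, while none of the 7 Base64 lines survive: the extra byte shifts everything after it, so every wrapped line changes. A merge that works line by line has nothing left to anchor on.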

Of course, merge doesn't necessarily have to be textual merge. I am told that Word can be used to diff and merge two Word .doc files, a data format notorious for its binary impenetrability. As long as some suitable merge agent is available, and the DVCS can be configured to use it for data of a particular type [12], then there is no bar to successful DVCS use.
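
As a rough illustration of what such a configuration amounts to, here is a minimal Python sketch of choosing a merge agent by file extension. Everything in it is hypothetical: the registry `MERGE_AGENTS`, the `.properties` key=value format and the function names are inventions for the example, not the interface of Mercurial, Git or Bazaar. Files with no registered agent fall back to keeping one candidate whole.

```python
from pathlib import PurePath


def take_local(base, local, other):
    """Fallback: no real merge, keep one candidate in its entirety."""
    return local


def merge_key_value(base, local, other):
    """Three-way merge of 'key=value' lines, decided one key at a time."""
    def parse(text):
        return dict(line.split("=", 1) for line in text.splitlines() if line)

    b, l, o = parse(base), parse(local), parse(other)
    merged = {}
    for key in sorted(set(b) | set(l) | set(o)):
        bv, lv, ov = b.get(key), l.get(key), o.get(key)
        if lv == ov or ov == bv:
            value = lv          # both sides agree, or only local changed
        elif lv == bv:
            value = ov          # only the other side changed
        else:
            raise ValueError(f"conflict on key {key!r}")
        if value is not None:   # None means the key was deleted
            merged[key] = value
    return "".join(f"{k}={v}\n" for k, v in merged.items())


# Hypothetical registry: file extension -> merge agent.
MERGE_AGENTS = {".properties": merge_key_value}


def merge_file(name, base, local, other):
    """Pick the merge agent registered for this file's extension."""
    agent = MERGE_AGENTS.get(PurePath(name).suffix, take_local)
    return agent(base, local, other)


merged = merge_file(
    "app.properties",
    base="a=1\nb=2\n",
    local="a=1\nb=3\n",          # local changed b
    other="a=1\nb=2\nc=4\n",     # other added c
)
print(merged)
```

The point of the sketch is the dispatch, not the merge itself: a format-aware agent can combine both sets of changes, whereas the fallback can only pick a winner.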

Before this reliance on mergeability causes you to dismiss DVCS out of hand, reflect. A CVCS can only handle non-mergeable data by acting as a versioned file store; in other words, having as the only available merge option the use of one or other of the merge candidates in its entirety. Useful though a versioned file store can be, it cannot be considered a full-featured version control system. By treating the offending unmergeable files as external to the DVCS, or with careful workflow -- disabling the distributed and mergeable potentials -- a DVCS can deal with these files, but only at a cost of its distributedness or its version control system-ness. In this it differs little from a CVCS.

So, for all data you want to version control, let your battle cry be:

Will It Merge?

At this point, I have an urge to don lab coat and safety goggles and be videoed attempting to mechanically merge data in a variety of different formats. Frankly, this is unlikely to be as exciting as blending iPhones [13], but from a system development point of view it's rather more important. And, I think, it gives us a large clue as to one of the reasons for the continuing popularity of Plain Old Text as a source code representation mechanism.


... type12
Mercurial can have the merge and diff tools specified with reference to the file extension on which they operate -- I assume Bazaar and Git are similar.
... iPhones13
Jim Hague 2009-05-22