Mercurial Project

LESLIE: Mercurial revision
control systems, sans slides for the moment, but
soon with slides. By Bryan’s own admission,
he builds large distributed stuff. BRYAN O’SULLIVAN: Not large
by your standards. LESLIE: Not large by our
standards, though. And I just wanted to let you
know this talk is going to go up on Google Video, so if you
have any questions that you think might contain
information that’s particularly Google-y, let’s
hold those until the end after the camera is off. BRYAN O’SULLIVAN: OK,
thank you, Leslie. Let me see what a good
distance is here. All right. So about 12 months ago, I was
casting about for a revision control system that I could use
to write the next late, great desktop email system that
nobody was going to use. And unfortunately, during my
passing around the place, I found that there was nothing
entirely suitable to my needs. So I found a piece of software
that was almost suitable, which had just gotten started
on by a guy that I happen to know, and this was a tool
called Mercurial. And the origins of Mercurial
are steeped in the great BitKeeper debacle of 2005,
when Larry McVoy took his marbles and left
the playground. So at that time, Linus Torvalds
was left without a revision control system, and
started writing his own. At the same time, Matt Mackall
started working on a revision control system. And they converged on fairly
similar designs, although in substantially different ways. And so now the world has two 14
month old revision control systems, instead of one, to join
a field of a huge pile of free revision control
systems already. So why am I actually
interested in this particular one? Why am I in here to talk
to you about this? Well, there’s a couple of
different reasons that it’s interesting to me, here. And one is that Mercurial kind
of matches a few aspects of the Google state religion, as I understand it from the outside. And those are that it’s written
in Python, it is distributed, and it does things
very fast. So these are kind of nice properties
to have. Now, it’s not completely
Python. It’s only 95% Python. There’s a couple of
core routines that are written in C. But what we’ve done in the 12
months or so since people have started actually using this
stuff is a fair number of third party people, both open
source and commercial projects, have actually
started using it. Which is really kind of
an interesting vote of confidence, right? Normally, revision control
holds the crown jewels of whatever it is that you’re
doing, so it had, by God, better work properly, or else
you’re going to have serious problems. But in that time we’ve had a
couple of interesting people start to use Mercurial. We have the XenSource people,
who are doing the open source Linux hypervisor. We have the One Laptop per Child
project, the OpenSolaris project, the MoinMoin wiki. A pile of other people who are
playing around to various different extents, some
of them large, some of them small. But it’s pretty fun to be
working on something that’s young, and yet, has
interesting and active users already. So from a developer’s point of
view, you’re sitting down in front of your computer, and
you’re about to start your magnum opus. You’re going to work on the next
great novel, or the next great huge source base, or
whatever it’s going to be. You’re looking for criteria to
choose one of these revision control systems. There’s got
to be at least 50 of them extant at the moment. Ah, thanks, Chuck. AUDIENCE: We’ll see if this
actually does the job. BRYAN O’SULLIVAN: So what
are some criteria? Well, I can tell you what I
used, at least, for my own personal purposes. I wanted something that was
straightforward to understand, so I was going to be able to
spend my time thinking about my problem instead of my
revision control problem. I wanted something that was
pretty quick, so that my tea would still be warm when I was
finished with a particular revision control operation. And I wanted something that
would help me to work efficiently with other people,
that being kind of a motivating factor. Now, you want to get this
onto a Windows box? I think we’re going
to be slide-less. MALE SPEAKER: It’ll work. BRYAN O’SULLIVAN: It’ll work? MALE SPEAKER: Yeah, Open
Office, you say? BRYAN O’SULLIVAN: Yup. MALE SPEAKER: It’ll work. BRYAN O’SULLIVAN: OK. So the Mercurial revision
control system has a conceptual model that’s
very simple. I’ve had it described to me as,
you can carry it around in your head, and I like to think
of it that way myself. There are basically three
different things that you need to pay attention to if you’re
thinking about revision control, in Mercurial and in
many other distributed revision control systems. The
first is that you have a repository, which is where
your stuff lives. The second is you have a working
directory, which is where the stuff you’re working
on lives, and the last one is a change set, which is a
snapshot of what’s going on. So in Mercurial terms, a
repository is not a heavyweight thing. It does not have a database
behind it, it’s just a pile of files. So these are things that are
easy to create, they’re easy to administer, you could put one
together and blow one away in a matter of moments. And that’s how people
prefer to work. They are, as I mentioned,
lightweight, and they are pretty much everywhere. Everywhere that you happen to
be doing some work in a working directory, there is a
repository that happens to be wedded to it. So inside a repository, there
are only really three different things. There’s a change log, which
says, I’ve done this work. There is a manifest, which says,
I did the work on these versions of these
files, and then there’s the file metadata. That’s all you need to know to
understand the underpinnings of the revision control
software. It’s very, very straightforward. Now you contrast that model with
the internals of a more traditional large scale revision
control system, like ClearCase, or Perforce, or
something like that. And if the internals are
visible to you at all– they’re typically not– they’re
going to be a big pile of different things. The Mercurial source code is, I
think it might have grown to almost 12,000 lines. But it’s a fully functional
system, and it’s 12,000 lines of Python. So you can keep all of it in
your head as a developer, you can keep all that you need to
know in your head as a user. It’s got a couple of very good
properties in that respect. Now a change set, as I
mentioned, is a snapshot of your project as it stands on
a particular point in time. It corresponds to a revision in
Perforce, or Subversion, or any of these other tools. And the terminology that we use
for creating a change set is committing it. Now unfortunately, I have some
pretty graphics here that you’re not able to see at the
moment about how you go about creating these things. And what starts being
interesting then is how you go about merging and branching
with people. So I’ll have to hand wave since
I have no graphics to offer here. When two people work, they will
create clones of each other’s repositories. Yes? AUDIENCE: Would a
whiteboard help? BRYAN O’SULLIVAN: A whiteboard
would actually help, yeah. When I start working, I create
a repository, and it’s just a little directory called
.hg somewhere. And outside that repository
is the actual directory. Let’s say I’m working on
a project called foo. So I have foo/.hg, that’s where
all my metadata lives. I don’t need to actually
care about anything that’s in there. The working directory, where I
have my files, like COPYING, and README, and foo.c, lives
around this .hg directory, and that’s where I do all
my actual work. So somebody else will make a
clone of this, and they’ll have an identical copy of the
working directory, and an identical copy of the
.hg directory. And they start working
there, and they start creating a revision. And let’s say they create
revision one. And I create something, and I
call it revision one, too, because I wouldn’t know about
their changes, because this is a distributed system, right? So we go on, and we create
revision two, say, as well. Now the next thing that we need
to do here is we need to be able to communicate with
each other to say, we’re working on the same project,
and now we have diverging views of the world. How do we cause them
to reconverge? Well, what I do is, I pull my
changes from one repository into the other, and that
literally means I just take the direct [UNINTELLIGIBLE]
graph that is all of my revisions, and I plop
them in here. And after I do a pull,
I do a merge. And what the merge does is it
just creates a structure in the working directory such that
I have my revision one, your revision one, my revision
two, your revision two. And then, in the working
directory, the working directory has a notion of
there being parents. So you can think of the
working directory as a floating revision. It’s the last stuff that I had,
the merger of these two guys, and the stuff that I’m
about to commit as a result of the merge. So the working directory gets
all these things, I do my commit, and I’m done. And that’s all there is to it. So a merge in Mercurial is just
a revision that has two different parents. Now I haven’t actually
drawn a branch here. Look, here’s a branch. This is a revision that has two
children, and this, being a merge, is a revision
that has two parents. That’s all a merge is. So there’s no special
sauce there. If you think of a branch in
Subversion as being something very simple, it’s just a copy,
the analog in Mercurial is also very simple, it’s just
two revisions that happen to have the same parent. And of course, you can have an
arbitrary number of branches. We only allow merges one at a
time, because nobody’s really found a good way to explain
multi-way merge to people, and their heads explode. Heads exploding, not so good. So when people work in parallel,
and they make these changes, and they commit them,
and then they merge after they change, that’s kind of a nice
property to have. Because if you think back to what you used
to have to do, in the days of CVS, history
was linear. People would do a change,
they’d do a change, they’d do a change. And if you won that commit race
before I managed to make a commit, I had to merge with
your changes before I could do a commit. And that meant that there
was no permanent record of my changes. Now with this model, you
don’t end up having that kind of risk. With CVS, it was very
straightforward to shoot yourself in the foot by
screwing up a merge. It happened all the time. You’d see changes that got
checked in with conflict markers, badness ensued. Subversion sort of avoids this,
but the default policy is to work the same
way that CVS does. In Subversion, you have to
explicitly choose to work on a separate branch. It’s the same thing
with Perforce. The distributed tools
necessarily don’t work that way, and it’s a slightly safer
way to do things, because one default policy is less
safe than the other. In practice, I don’t know that it
makes a huge difference. You don’t hear people saying
that I threw away three weeks of work in Subversion or in
Perforce, or in any of these other tools very frequently. So with many of these things
what you get is a matter of degree in terms of loss or
gain of functionality. OK, a couple of things that are
interesting to know about Mercurial for getting going
and keeping going quickly. If you’re working in a small,
lightweight environment, it’s useful to have, for example, a
built-in web server, which we have. So this is a web server
that you can both interact with as a person, and view
revisions in your tree, and annotate things, and
download tarballs. But it’s also what Mercurial
uses to fetch data and to, very soon now, push data over
the network as well. It also works over SSH if you
happen to prefer security. In addition, Mercurial has
essentially a single way to stream bytes onto the disk and
to stream them off the disk. We call that abstraction
a bundle. So you can put bundles onto USB
drives, or you can send them around by email, and
you get a complete set of change history that you can
transfer without having to be online at the time. This is kind of a useful
property for people doing distributed development. So sharing is also a, what
would you call it, a symmetrical operation. In other words, when I make a
clone of a repository, and I start making changes, and I push
my changes back to the parent repository– I don’t know where this buzzing
is coming from– I end up with both repositories
being identical afterwards. OK, the buzz has stopped. I can step back and look
at my slides again. MALE SPEAKER: You’re probably
standing on a wire. BRYAN O’SULLIVAN: Ah. So the final thing, and the
thing that is sort of the special sauce in Mercurial, is that
Mercurial is, albeit written in Python, an extremely
fast system. We’ve benchmarked it under
various different scenarios, and there are only a few other
revision control systems that compare in terms
of performance. Now performance is interesting
in its own right, because it means that you don’t get
distracted while you’re waiting for the tool
to do things. But it also enables you to do
certain kinds of operations that are not necessarily
otherwise possible. So in order to give a little
bit of an apples-to-apples comparison, the other day
I went to the Subversion self-hosting repository, where
they’ve had Subversion developed for the past four or
five years, and I just did a Subversion check out of the
Subversion source tree. And the head of the source tree
as a working directory is about 72 megabytes in size. So out of curiosity, I sucked
all of the Subversion history into Mercurial, and created a
local Mercurial repository that’s an identical
copy of what’s on the Subversion website. And it turns out that the
Mercurial repository plus working directory is about the
same size as the Subversion working copy. So I have 50,000 revisions and
a working directory in 76 megabytes, versus just a working
copy in 72 megabytes, which I thought was kind
of interesting. It means that you don’t have to
pay much in order to get a complete history of everything
onto your machine, where you are no longer talking
to the network. One of the things that you’ve
probably noticed if you’ve been dealing with Perforce
servers here is that things take an awfully long time once
your servers are busy. Central servers don’t
scale terribly well. We prefer not to talk to them
if you don’t have to. In a distributed tool, you tend
to not talk to them very often at all, maybe a couple
of times a week. And so also out of curiosity,
I ran a couple of very simple performance tests. Now the Subversion repository
is very small. It’s only about 1,200 files, and
the actual working copy, when you ignore all the .svn
directories and so on, it’s only 25 megabytes in size. And pound for pound, Mercurial
and Subversion worked out about the same. There were a few instances where
one was faster and a few instances where the other
one was faster. But what’s interesting about one
being faster or the other being faster on a small test
case is primarily that you can write something in Python,
and have it be as fast as something that’s written in pure
C, if you’re clever about how you do it. And I assert without immediate
proof that in fact, Mercurial will scale to large projects
better than many revision control systems that are written
in C, because of the underlying abstractions
that we use. And the implementation
techniques, might I add. So there are a couple of
different things that we do in order to make the implementation
go fast. The primary one is that we’ve
desperately tried, at every step, to avoid seeks. Disk seeks are not
your friends. Disk seeks are things that cause
you to just sit around and wait for good things
to happen. Good things that happen
are streaming I/O linearly off your disk. That’s what we really like. So in order to stream I/O
linearly off your disk, there are a couple of things
that you would like to be able to do. The first is you don’t want to
write more than you have to, and the second is that you don’t
want to read more than you have to. So what are the necessary
properties for revision control systems to not write
more than you have to? Well, it’s been dogma in
revision control for a while that what you really want to
do is you want to store the most recent revision of your
file as the very first thing on the disk, and then everything
else wants to be reverse deltas based on that. Because that means you get the
nice property of something that you’ve accessed recently
you can just read with a single read. Now, Mercurial sort of turns
that a little bit on its head. We do forward deltas from
the very first revision. But that sounds like it’s a
terrible implementation plan. It’s actually rather better than
that, because what we do is we have 0 of 1 retrieval
properties. So instead of having a very
first revision and 10,000,000 little tiny deltas on top
of that that you have to reconstruct the final revision
out of, what we do is we have two different techniques. One is that every so often, when
the accumulated quantity of stuff that you’ve deltaed
gets to be too big, we store a full text. So you pay something of an extra
space cost on disk, but you end up with O(1)
retrieval properties. The other thing that we do is,
rather than applying each delta as we go, we compose our
deltas, and then we just apply one single union delta
at the end. And that gets rid of some of the
nasty properties that you have when you’re composing
deltas and [? munching ?] strings in Python. So these are two things that
make it quite fast and quite efficient to get nice
linear accesses. Another thing that you can
do that gives you linear accesses, at something of a cost
in space is, you might think that if you’re doing a
delta, you want to do a delta against your parent. So if I’m at revision three,
and I have a child that is revision seven, that is really
based on revision three, I want to do a delta against
revision three? Not necessarily. Because if you’re doing a delta
against revision three and you’re revision
seven, there might be a seek involved. So what if at revision seven,
you do a delta against revision six, and you don’t
actually care whether revision six was related to you at all? Well, in that case, you pay
something of a greater space penalty, but you end up with a
linear space, or pardon me, a linear disk access, and
a guaranteed lower probability of seeks. So again, it’s another case of
make the seeks go away, do linear stuff instead, even if
it costs me slightly more space, even if it looks
like it ought to be a less good choice. Many times these are not the
case when you subject them to a little bit of inspection. Right, back to my hints
as to where I am. The file formats that we use
are very straightforward. They’re binary files, but
they’re easy to parse in python, and the reason that
they’re easy to parse in python is that we use
the [? struct ?] unpack and pack methods a lot,
and we use the string splitting and unsplitting
a lot. And the reason that these are
good things to you is that they don’t go through the python
interpreter at all. They go straight from the python
interpreter into c. Fast things occur, and you get
dropped back into python land. We also try and avoid things
like conditionals in our inner loops, because those tend to
cost in performance as well. So if you look at the Mercurial
inner loops, they tend to be quite tight. They’re still written in Python,
but they do very, very simple things, and then they
deal in terms of tuples or arrays, or
whatever happens to come out the other side. We’re out of luck? MALE SPEAKER: I can open the
file, but it doesn’t show me any content. BRYAN O’SULLIVAN: You have
Open Office 1, and I have Open Office 2. I’m sorry. MALE SPEAKER: That’s OK. BRYAN O’SULLIVAN: OK, it
continues to be a mime show for the rest of the
presentation, I’m afraid. The final thing that’s kind
of interesting about implementation techniques
is that I mentioned avoiding reads. Well, why read at all
if you can stat? What if you were to store,
instead of just spewing your files out onto disk, you also
stored the information as to what the stat of each file was
when you wrote the file? Now, what do I mean by that? I mean, you store the
modification time, the size, and the access time of each
file, and the owner. And then, when you’re looking
through a tree to see what was the last thing that happened to
my tree, instead of reading the file, you do a
stat of the file. You look at the last time you
stat’ed the file, you see if they are different
in any way, and then you do a read to see if you really need
to say yes, this file has been modified, or no. So what this means is in a
40,000 file tree, Mercurial does a stat 40,000 times,
but it doesn’t read and reconstruct each file 40,000– or it doesn’t really reconstruct
40,000 files. Again, this is a significant
space win when compared to doing things by hand. So I’ve mentioned that
things are fast. Yes? AUDIENCE: It sounds like you are
designing for local disks, not an NFS home directory. BRYAN O’SULLIVAN: Things
are not as fast over NFS, this is true. So an obvious place that you can
make things fast over NFS is by using an implementation
technique like Perforce, where you have to tell a server
everything that you’re doing. That’s a potentially less
friendly thing to do. I was able to live with p4 edit
back when I had to use Perforce, but having to not do
it, and being able to do things on a local disk happens
to suit me pretty well. So speed is an end in itself. All other things being equal,
it’s nice to have something fast instead of something
slow. But what speed also lets you
do is it lets you do things that are not straightforward
using other tools. So Mercurial was originally
developed by kernel hackers, and the kernel is a reasonably
large tree. It’s not super large, it’s
like 20,000 files, or so. But one of the things that
kernel developers tend to have to do a lot is deal with
patches, because there are certain gatekeepers for pieces
of subsystems, and you may be developing something that
isn’t ready to go off to somebody else yet, so you
maintain a pile of patches that your software
has to work with. And working with patches is
traditionally a kind of a painful thing, because revision
control tools don’t normally have a concept
of patches. But what if you did have
a concept of patches? Well, Mercurial has
an extension called Mercurial Queues. If you’ve been working with
open source tools that you have to patch in order to get
working, you may have come across a tool called Quilt. And Quilt basically lets you
maintain a stack of patches on top off a source tree. It doesn’t care what the source
tree is, it has no notion of there being an
underlying revision control system of any kind. So Quilt has the nice property
that it will sit on top of anything. It doesn’t matter whether
it’s Perforce, CVS, or an exploded tarball. And what Quilt lets you do is
it lets you push a patch onto your stack, edit some files,
refresh the patch, push another patch on top of your
stack, refresh the files, et cetera, et cetera,
et cetera. You can pop and push, and
work on the top of stack arbitrarily. Mercurial Queues works the same
way, but it’s integrated into Mercurial. So what this means is after
you’ve pushed a pile of stuff, patches, onto your stack, you
can now use the regular log commands, or the annotate
command, to find out which revision that has turned into a
change set in the tree made a particular change. When you pop your patches, the
change sets go away, and once you’re done, you can push the
change sets off to somebody else as regular Mercurial
revisions, and they see them as regular Mercurial revisions,
and then they start distributing your changes,
and goodness occurs. But the nice thing
about this is the integrated nature of it. So for example, if you’re doing
a bug search, one of the things that people have started
using in the past year or two with tools like Git and
Mercurial is dichotomic search of your revision graph. Now what that means is you’re
essentially doing a bisectional search. I know I have a revision
that was bad. I know I have a revision
that was good. And I want to narrow my way down
to the revision that was the thing that caused the
bit to flip, as quickly as I can. So if you can do this using your
revision control tools rather than having to worry
about the join, the boundary, between where my patches start
and where my revisions end, if everything is just seamless and
you don’t have to worry, that’s kind of a benefit. As an example of how well
Mercurial Queues works, Andrew Morton maintains a pile of
patches against the Linux kernel that is, I think it’s
about 1,500 patches at the moment that are just a single
Quilt patch series. I can apply those on top of a
Linux kernel repository in, I think, something on the order
of three minutes. So I can create seven change
sets per second for three minutes to get all of these
patches into my tree. And then, as far as Mercurial
is concerned, it’s just a regular old tree that I’m
working in that has 1,500 new revisions in there that I can
deal with, that I can serve up, that I can communicate with
other people, but that I can also modify after
the fact by popping, editing, and repushing. And then I can continue to share
those changes with a Quilt user, because Mercurial
Queues is Quilt compatible. This is a fairly tremendous win,
not just for dealing with upstream software projects, but
also for prototyping and developing your own code. Quite often, if I’m working on
a feature, what I’ll do is I won’t actually start committing
changes, because I’m kind of an idiot. I don’t tend to know where I’m
going a lot of the time. I’ll start off, and I’ll go down
a blind alley, and I’ll go down another two blind alleys
along the way, and after a couple of days I’ll have
converged on something that looks like a solution. But along the way, what I’ll
have done is thought in terms of, here is the
underpinnings of my work, one patch. Here is another thing that
I need to have in place, another patch. And at each point when I’m
refactoring my code, or I’m moving something from one
layer to another layer, I just move hunks from one patch
into another patch, and I push and pop my context at
various different times to be working on a different
patch at each point. So this is not just for working
with other people. It’s for collaborating with
yourself as your clue evolves over time. A very good property to have. So finally, there are a couple
of things that are very interesting to me about
distributed revision control that are not specific to
Mercurial, but that might be of interest to people who do
things with free software and open source tools in general. I have this idea that choosing
a particular revision control tool is actually making a
statement about how you want your project to evolve, right? So if you work in a large
company, and everybody has essentially a level playing
field, everybody is more or less likely to have commit
access to much the same stuff. And everybody can pull the same
changes, everybody can push the same changes, or
integrate changes, or whatever the particular tool’s language
lets you do. But out in the open source
world, that’s really not the case, right? If you’re using a tool like CVS,
or you’re using a tool like Subversion, there’s a world
of haves, and there’s a world of have-nots. There are the people who have
commit access to the one central repository that
everybody has to use, and then there are the people who can
maybe read from that repository, they can check
out a working copy, but they can’t commit. And they may not be able to earn
the right to commit until they’ve proven themselves, over
the course of a number of patches that they’ve submitted
and had to maintain. Now if you’ve been in the
position of having to maintain a patch against an upstream
source tree, it’s kind of a painful thing to have to do. Tools like Quilt will make it
more straightforward, but really, what you would like to
be able to do is work with other people, speaking
the same language, using the same tools. With a distributed tool,
you can do that. With a centralized tool, unless
they’re willing to create a little sandbox that
people can go wild in, which the history of wikis, and the history
of other collaborative commons, have proven
is not necessarily a very scalable thing, they can’t. So you think of it as being
analogous to the ascent of man, right? In the days of RCS and SCCS,
everybody crawled on all fours and they had to be on the
same machine in order to get any work done. And then, suddenly, everybody’s
tail fell off and they started going around with
hunched backs, and they were able to talk to the central
repository over the network. But they had to be on the
network in order to get any work done. Right? If I unplug your workstation
from the Perforce server, there’s nothing you can do. If I unplug your workstation
from a Subversion server, you can run diff, and
nothing else. With the distributed tool,
that’s no longer the case. I can work on the train, I can
work on a mountaintop. So long as I have history,
and I can access my hard disk, I’m set. So I can work anywhere, I can
contribute with, to any project, I can work
with anybody with a distributed tool. The tools don’t make the
boundary, it’s the social norms of your project that you
explicitly choose that make the boundary. You choose the model yourself
with a distributed tool. You don’t have it imposed on
you by the technology. So I would encourage you to
give this stuff a try. Mercurial, as I said, it lets
you work as a centralized system, if that’s
what you prefer. If you want to work in a
distributed fashion, you can. And one of the reasons that
people have cited to me a few times for using centralized
tools is that they’re afraid of forks. So Guido works here, for
example, and he does not like Stackless Python at all. Stackless Python is this project
that started about eight years ago, that went off
in a direction that was fundamentally different to
the way he wanted to bring Python itself. And for example, the GCC people
have had the same problem, right? EGCS forked off from GCC
many years ago, and they eventually managed to reconcile
their differences and start working together. But what’s interesting about
using a tool that has good support for merging and good
support for branching is that, everybody forks all
the time, right? Forking is just what you do. And merging is what you do
when you’re done forking. So once somebody has decided
that they want to play nice again, and they want
to cooperate with you, you just merge. Whereas in a central tool, what
somebody’s got to do is they’ve had to suck all the
history of your central repository. They’ve had to shove it all
back into another central repository. If you want to reconcile your
differences, you’ve got a serious problem on your hands
all of a sudden, because there’s no way to make
the two communicate. That’s just not the case with
a distributed tool. So what would you say, it makes
it easier to reconcile your differences. A final couple of comments
that I have before we finish off. There’s a number of people who
work on Subversion here at Google, and I’ve been very
conscious along the way, as I’ve been trying to shepherd
people into sending patches into Mercurial, and so on, of
the great job the Subversion people have done in terms of
building a good community around their tool. If you’re working on open source
software, the only thing you have going for you is
A, technical merit, and B, credibility. And Karl Fogel and Jim Blandy,
and Brian Fitzpatrick, and Ben Colinsussman, and Garret Rooney,
and all those other people who’ve worked on
Subversion over the years, they’ve done a very good job
of making themselves accessible, and making the
Subversion community a place that is a good place
to contribute to. You send a patch in, somebody’s
going to review it and say, yeah, could you
tweak this, yeah, could you tweak that. It’s a nice property
to have. And I’ve been very explicit
in trying to emulate their example as we’ve been building
Mercurial, because it’s always nice to try and learn from
somebody else’s good examples, rather than to try and blaze
a trail of your own. And that’s, I think, stood
us in good stead. I actually ran a survey of our
users a couple of months ago, just to get a sense of where
people thought we were. It’s very easy when you’re a
developer to stay down in the trenches and look at the next
line of code you need to write, or look at the next patch
you need to issue, or the next bug that
you need to fix. But it’s nice to get a sense of
what your users think about you, and people have been pretty
complimentary about us. People are also quite happy with
the software, in terms of the fact that it’s easy to
install, easy to use, not too different from tools like
CVS and Subversion. And it’s been very rewarding to
actually be able to talk to people and say, look, here’s
this nice shiny toy that we have, that you can use
for something, be it small or be it large. It’ll scale, and it’ll work for
you across all of these different sizes of thing. That comes to the end of
my prepared comments. Thanks very much for bearing
with me as I’ve had to essentially hand wave
my way through. And I’d be happy to take any
questions people have. Yeah? AUDIENCE: So [INAUDIBLE PHRASE] BRYAN O’SULLIVAN: So the
question is, what’s the computational cost of a merge
when one person does a large amount of work, and another
person does a small amount of work? And the answer is that there’s
not very much cost to it. So bringing in changes from the
outside is essentially a linear operation in the number
of changes that you’ve made. So cloning a repository, all
20,000 revisions or just pulling two changes, they have
approximately linear costs. So one costs about 10,000 times
as much as the other. So if you’ve done 10,000 things
and I’ve done two, and I pull in your 10,000 things, it
takes me about 10,000 units of time to process
those changes. The actual merge afterwards is
primarily a matter of updating the working directory. Most of your changes are not
going to conflict with most of my changes, and then what
happens with the few conflicts that there are is really, it’s
almost not a matter for Mercurial itself. We have a couple of different
merge strategies that people can use when there
are conflicts. AUDIENCE: But can you
tell the common ancestor between the branches? BRYAN O’SULLIVAN: Yes. Yes, can you tell the original
common ancestor? Yes you can. I’m sorry if I’m repeating your
questions, but I’m just trying to make sure that
the folks back home can hear what you say. Yes? AUDIENCE: You talked about using
patch sets as sort of a local mechanism for yourself. Why wouldn’t you just clone the
repository, and actually use the real Mercurial revisions
to commit, and then remerge them [UNINTELLIGIBLE]? BRYAN O’SULLIVAN: So the
question was about using a set of patches for doing development
of your own stuff rather than using regular
Mercurial tools in order to capture all of the history. And what would you say, that’s
partly a style thing, and it’s partly a wanting to not clutter
the history thing. So you may have heard me allude
to the fact that I go down blind alleys? Well, I don’t necessarily want
my idiocy caught on the permanent record if I can
necessarily avoid it, right? So it’s nice, for example,
particularly if you’re dealing with an environment like the
Linux kernel, where there is quite a high standard for your
changes to meet in order for them to get in. You really want things to
be packaged cleanly. That’s one environment where
it makes sense to think in terms of patches, because you
want to submit the pristine final thing, not the 45
different idiot things that you did along the way. For my own purposes as well, it
helps me to think in terms of patches because then,
I’m both putting layers on my thought. I’m linearizing my thought in
space and in time, right? So each patch captures a layer
that I’m worrying about. I can actually revision
control the patches themselves, so I do capture the
history of the changes, but just not the actual
repository that they’re eventually going to end up in. But I also have the ability to
go back and erase history and make myself look better,
which is very nice. Yeah? AUDIENCE: How well does
it handle refactoring [UNINTELLIGIBLE] and file
renames and moving code, lots of code? BRYAN O’SULLIVAN: The question
is how well Mercurial handles refactoring, that
handles renames, and moving code around. So there are three
answers to that. One is that it doesn’t, the
second is that it does it really well, and the third
is that it’ll be really all there soon. And that these are all true
at the same time. So right now, Mercurial
has a shell script that handles merging. So it knows how to figure out
the basis for doing a three-way merge, for example. And it will hand those off
to the shell script. Now, we have a couple
of different shell scripts in place. One is a shell script
that will run a three-way merge tool. But three-way merge is not very
satisfying when you’re doing distributed development. Because you frequently have
cases where there are crisscrossed merges, right? I pull your changes at the same
time that you’ve pulled my changes, and we both commit
the results of our merges, we end up with this thing where
we have to merge again. And that can iterate
a few times. In those kinds of cases, you
would really like to have some sort of more history sensitive
merge that will cause us to converge more quickly. There is a branch of
Mercurial that has that facility available. In terms of handling renaming,
though, right now we track rename information, and Matt
Mackall, who is the guy who wrote much of the Mercurial
code, who originally started the project, is working on
actually having your changes follow across renames. And by the way, the question
that is sort of implicit there is if I make an edit to a file,
and you’ve renamed the file to a different name, you
really want the changes to show up under the different
name, after we’ve resolved our differences. So that’s not there yet, but
the actual machinery underlying it that’s necessary
is there, and the feature will be present soon. Sorry, I’ll give somebody else
a chance, first. Yeah? AUDIENCE: What platforms
are supported? BRYAN O’SULLIVAN: What platforms
are supported? Pretty much anything that python
runs on, that has a file system behind it that
looks vaguely Unix-y. So Windows, you name the brand
of Unix, and Macs, and so on. Yeah? AUDIENCE: How do your shell
scripts work on Windows? BRYAN O’SULLIVAN: The shell
scripts work on Windows like they run as .bat files. Of course, they’re not actual
shell scripts there, they’re .bat files. But nevertheless, they work. Yes. Yes? AUDIENCE: Do you support the
partial [UNINTELLIGIBLE] bring over, whatever? Partially bringing over? BRYAN O’SULLIVAN: Do we support
partially bringing stuff over? Somebody is working on
that at the moment. And there are two different
kinds of partially bringing stuff over, right? One is, you want maybe only the
last 2,000 revisions of your project history, because
you don’t want to be carrying around the [UNINTELLIGIBLE] gigabytes of earlier stuff that
was done ten years ago. And the other thing is that
perhaps you’re only interested in working in a certain portion
of the tree, so you don’t have to check out the
other ten gigabytes of stuff that you don’t care about. And by the way, when I say
gigabytes, some people are actually using Mercurial
to work on multi-gigabyte source trees. For example, the FreeBSD ports
tree has a port that sits in Mercurial instead
of in Perforce. And it contains, I think,
something on the order of 150,000 files and 150,000
change sets. So that’s a reasonably large
amount of stuff, and you really want to be able to focus
on a certain aspect of it rather than do all of it. It’s in progress. Yes? AUDIENCE: A generalization of
the rename scenario is if you have a block file, let’s say you
start splitting it up into [UNINTELLIGIBLE] files. Will those scripts handle
that as well? BRYAN O’SULLIVAN: Yes. So the question is, if you split
a file up into multiple files, will Mercurial
handle that? And I can’t speak for Matt,
because it turns out that I’m not actually him. But I believe that his plan of
record is to, when a rename or copy has been detected, do
a merge into each of the children, the descendants
of the original file. So if you copy one file into
three different files, and then you whack off the first
third, middle third, and the final third in those three
different files, somebody else makes an edit in the original
file, the logically appropriate thing
ought to happen. Now, I’m not actually doing
the implementation. So I can say yes, it will be a
better world, and everybody will be happy. I don’t know how hard
it’s going to be. Yes. AUDIENCE: On a smaller scale,
can you customize Mercurial so that a small number of
[INAUDIBLE PHRASE] so that they can, when
they do updates, [UNINTELLIGIBLE PHRASE] if people are writing to it,
and people are reading what other people have written, but
they don’t need to worry about merge, or anything like that? BRYAN O’SULLIVAN: Sure. So the question is, can you
extend Mercurial so that it essentially behaves like CVS, so
that when you do an update, it does the logical equivalent
of a pull, so that you don’t have to worry about merging
so that you have a straightforward, simple
way for people to get their feet wet. And the answer is, yes,
Mercurial is extensible in those terms. No, nobody
has explicitly done the work to do that. I do know that there’s at least
one other distributed revision control system that
does exactly what you described, because they want to
give people who have used CVS essentially training
wheels. So it is possible to do that,
and would be quite straightforward. Contributions of code
to do things like that are always welcome. Yes? AUDIENCE: Can [UNINTELLIGIBLE]
push a patch back into history? BRYAN O’SULLIVAN: Can I push
a patch back into history? AUDIENCE: [UNINTELLIGIBLE] I released version
2.1, and now I’m working on version 2.10. And then I found a bug that’s
already existed in 2.1. I pushed [UNINTELLIGIBLE PHRASE] BRYAN O’SULLIVAN: OK. So the question here is, let’s
say I’m working on a revision 2, and a revision 2.1. And revision 2 I’ve frozen,
because I’ve released it, and there are CD-ROMs out in the
wild, or tarballs, or whatever the bits the kids use
these days are. And I found a bug in revision
2, I’ve fixed it in my 2.1 branch, and I want to
backport that fix. This is something that revision
control weenies tend to call cherry-picking. I’m a self-labelled weenie,
by the way. This is not a pejorative term. The answer is you can
do it using a patch. As something that you would
support as a first-class operation, cherry-picking is a
very difficult thing to do. There are maybe two or three
revision control systems that handle it relatively well. Perforce is one. Another would be, actually,
Subversion almost handles it well, because Subversion
has no notion of merging at all, right? Another one would be arch, which
is explicitly built in those terms, but I wouldn’t
recommend that anybody use arch. A final one would be Darcs,
which is one of the theoretically interesting but
not practical ones, that is written in Haskell,
of all languages. And Darcs has this wonderful
quantum mechanical, I kid you not, theory of patches that it
is built up on, so that you can talk in terms of patches
commuting with each other, and boundaries beyond which
they cannot go, and so on and so forth. And it tends to go exponential
in space and time quite frequently. So it’s got some fundamental
theoretical problems that are not addressable. Yes? AUDIENCE: So now it seems with
Mercurial that first, an engineer does some work in his
local area, passes it out, and in a corporate setting, or even
in a project setting, you have to then publish your
changes out to the world. But since it’s now, it feels
like it’s a two-stage commit, you have to make your changes,
and then you have to actually publish them, it seems like it’s
easier to accidentally forget to publish them. Is there any way to make
that less painful? BRYAN O’SULLIVAN: The question
is, is there an easy way to publish your stuff with
Mercurial, or presumably, by extension, with other
distributed tools so that other people can find them? And the answer is that right
now, you have to do stuff by hand, because we’ve been
focusing on the core of the software rather than on these
larger usability questions. That sort of thing, where you
want to be able to see, oh, I haven’t actually published this,
even though I wanted to, or oh, this repository that I’ve
made changes based on is actually, has diverged
from me by this much. You want to be able to tell
those kinds of things without having to explicitly do it
by hand all the time. Those are things I would
really love to see, but they’re not quite there yet
because we’ve been preoccupied with just getting the core
functionality into 1.0 form so far. Do remember that we’ve only
been around for about 14 months, and that
the set of core developers is quite small. Yes, more questions? Yeah? AUDIENCE: Do you have any ideas
on how you would use Mercurial to supplement other
sorts of control systems? BRYAN O’SULLIVAN: The question
is, would it be possible to supplement an existing
revision control system using Mercurial? And the answer is there are
various different ways that you can do that. So somebody has, for example,
written an incremental Perforce importer
for Mercurial. So there exists a proof that it
is possible to do what you want today. I don’t know how
well it works. AUDIENCE: One directional? BRYAN O’SULLIVAN: I imagine
that it is one directional, yes. There is also a tool called
Tailor, written by a guy in Italy whose name is
Emanuele Gaifax. And Tailor is sort of the
Rosetta Stone of revision control tools. It will convert between
arbitrary revision control tools up to a point. It doesn’t have a very good
notion of branching or merging, so it tends to lose
information when talking between distributed revision
control tools. But if what you’re looking
to do is an incremental conversion, and then stuff some
things back into a host revision control tool, it is
actually pretty nice, and it’s relatively straightforward
for maintaining patches and submitting them, whether it
would be suitable for that. In that kind of a case, probably
the easiest thing to do would be to use a Perforce
importer to pull your stuff into Mercurial, maintain things
as patches, then commit them back to the native revision
control tool, perhaps by hand or perhaps
by automating it. I don’t speak Perforce very much
anymore, so I can’t say whether it would be completely
trivial or not. My imagination tells me that
would be a relatively small amount of scripting to do. More questions? MALE SPEAKER: For those of you
who have 12:00 meetings, we’re running over our time right now,
which is not a problem, or currently not a problem in
this room, but you may have your own priorities. BRYAN O’SULLIVAN: Yes? AUDIENCE: So in the format of
Mercurial, is python objects [UNINTELLIGIBLE]? BRYAN O’SULLIVAN: The question
is, is the Mercurial data stream python objects? The answer is no. And the reason that it’s not
is that python has been somewhat willing to change the
data stream format, and that’s not a terribly good thing. Also, it’s not very efficient
storage mechanism. Instead, what we do is we
explicitly lay out the bytes ourselves, using tools like
struct.pack, and using string operations, and just
plain old write. So we know exactly what the
bytes are supposed to be. Yes? AUDIENCE: How does it deal
with authorization? BRYAN O’SULLIVAN: How
do we deal with authorization is the question. And there are two or three
answers to that, depending on how you want to look at it. The first is that we have no
notion of authorization at all, because Mercurial
doesn’t care. The second, which is a more
satisfactory answer, is that if you want to be able to
share changes with other people, you can push to a shared
repository, which you can use, using, for example, if
you’re all on the same file system, Unix groups, or
Windows permissions. You can also tunnel over SSH,
so that you can do that over the insecure internet. Somebody is in the process of
adding support for pushing changes over HTTP, which will
use, I presume, some form of user authentication, whatever
patch he happens to provide, and will be secured over SSL. So Mercurial itself doesn’t
have to care, but it has various different transports
that do allow you to specify things in different ways. And for example, there is an
extension to Mercurial available that will let you
lock down individual user accounts, and put [? ackles ?] on the subtrees that people
are allowed to push to. So if you have changes that push
stuff into a tree that you’re not allowed to push to,
you will be forbidden from doing that, and other people
won’t be able to pull those changes, because they won’t
get in in the first place. So there are various different
ways that you can lock things down. Yes? AUDIENCE: Is SSH tunneling
built-in in Mercurial like it is in Subversion, or is it
manual to open up SSH? BRYAN O’SULLIVAN: The question
is, is SSH tunneling built-in, and the answer is yes. We use ssh://
blah-de-blah-de-blah URLs. Any more questions? OK, thank you all very
much for listening.
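
The answer above about the on-disk format (laying the bytes out explicitly with struct.pack rather than serializing Python objects) can be illustrated with a small sketch. The record layout here is hypothetical and chosen only for illustration; it is not Mercurial's actual storage format, and the names RECORD, pack_record, and unpack_record are assumptions.

```python
import struct

# Hypothetical fixed-size record, NOT Mercurial's real on-disk format.
# ">"  big-endian, no padding between fields
# Q    8-byte unsigned offset
# I    4-byte unsigned length
# i    4-byte signed parent revision (-1 meaning "no parent")
# 20s  20 raw bytes, for example a SHA-1 digest
RECORD = struct.Struct(">QIi20s")


def pack_record(offset, length, parent_rev, node_hash):
    """Serialize one record to exactly RECORD.size bytes."""
    if len(node_hash) != 20:
        raise ValueError("node_hash must be exactly 20 bytes")
    return RECORD.pack(offset, length, parent_rev, node_hash)


def unpack_record(data):
    """Inverse of pack_record: returns (offset, length, parent_rev, node_hash)."""
    return RECORD.unpack(data)


raw = pack_record(4096, 120, -1, b"\x00" * 20)
```

Because every field has a declared width and byte order, the bytes do not depend on the Python version or the machine that wrote them, which is the property the answer was after.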

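The exchange above about common ancestors and crisscrossed merges can also be sketched in a few lines. This is a naive set-based illustration, not Mercurial's algorithm; the revision names and the functions ancestors and merge_bases are assumptions made for the example. It shows why a plain three-way merge is awkward in the crisscross case: there is more than one candidate base.

```python
def ancestors(parents, rev):
    """Return rev plus every revision reachable from it through parent links."""
    seen = set()
    stack = [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents.get(r, ()))
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


# Simple fork: both sides start from "base".
simple = {"base": [], "mine": ["base"], "yours": ["base"]}

# Crisscross: each side merged the other's earlier head.
crisscross = {
    "base": [],
    "a1": ["base"],
    "b1": ["base"],
    "a2": ["a1", "b1"],  # I merged your b1
    "b2": ["b1", "a1"],  # you merged my a1
}
```

With the simple fork, merge_bases(simple, "mine", "yours") yields the single base that a three-way merge needs. With the crisscross history, merge_bases(crisscross, "a2", "b2") yields both "a1" and "b1", so the merge tool has to pick one or use a more history-sensitive strategy.
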
11 Replies to “Mercurial Project”

  1. I prefer Mercurial to git, but I wish his presentation is as inflammatory^W interesting as Linus though… =)

  2. @hasenj I agree that it might be a little more boring, but I actually found this more generally informative, and personally I think Linus is pretty irritating so this was much more watchable. Plus, I love hg 🙂

  3. maybe a bit boring, but this is an actual tech talk 😉 this guy makes relatively long introduction, but then explains in detail how hg works, i like it.

  4. I personally like both Mercurial and Git depending on the project type. I work at a small game company and like using Mercurial because it handles large binaries pretty well and we can push revision by revision easily via revision numbers. So, if I have 10 100MB revisions we can push 100MB at a time instead of 1GB. If it fails somewhere, we continue pushing from that revision. We use Git for our web based projects since it works at super sonic speed and fits in with our auto deploy system.

  5. hacking hg right now. the guy said this was boring must be silly. this is tech talk, how he used mercurial core.

  6. Though I prefer Git, Mercurial is still a great SCM. It's also MUCH better than CVS or Subversion.

  7. What I find really amazing is that merging in Hg usually makes you lose some changes. That never happened with TortoiseSVN. I've read that Git merge is not as good as Hg merge.

    Another thing that amazes me is that if you keep 2 branches, one for development and one for production, you usually need to merge and resolve the same changes and conflicts over and over.

    A third thing that amazes me is that Git is blazing fast, even when accessing the network. I would say it is 10 times faster than Hg. I haven't measured it though, because both are pretty fast, so I see no point in seeing which is faster. What I'm concerned about, though, is that merge has improved because you can merge the same branches several times, but if you create a branch for each issue, and then merge to production and development, you will end up resolving the same conflicts several times. That's weird given that both claim merge has already been solved.

    BTW, I'm not complaining, but it seems there is a conceptual confusion in the merge design. They fixed one thing SVN couldn't do, namely to merge the same branch several times, but they missed one thing that was already fixed in SVN, once you resolve a conflict, it never asks you to resolve the same conflict again.
