Related link: http://www.linuxsymposium.org/2005/
It’s been another Ottawa Linux Symposium, and before it fades into a
daze, let’s see whether I can extract some themes and threads.
The appeal of large systems and Xen
The type of Linux deployment that dominated this symposium was that of
a large, mission-critical system. Xen virtualization in particular was
heavily featured and attracted large numbers of attendees. More than
one attendee joked that Friday was “Xen day.”
Xen lets one piece of hardware run multiple operating systems,
controlling their access to the hardware through a kind of
meta-operating-system called the Hypervisor.
There is nothing new about the idea of virtualization, of course. It
was associated with the IBM 360 further back than most of us can
remember (in fact, IBM executives have told me they think Linux has a
major role to play keeping old 360-series computers going by running
in their virtual machines), and it’s now making good money for
VMWare. (More about them a bit later.)
As with VMWare, the major uses of Xen seem to be server consolidation
(which means running several instances of Linux on one piece of
hardware, a useful deployment because Linux seems to work best running
only one server daemon) and virtual hosting. Speaker Mike D. Day also
showed that Xen could be used to deploy Linux quickly to a large array
Ian Pratt gave a comprehensive overview of Xen’s goals and
implementation. He defined Xen’s main achievements as two-fold
(although his talk really focused on the first): isolating different
processes in a secure manner, and controlling resources so different
Quality of Service options could be offered to different processes.
Pratt then laid out some of the ingenuity that makes Xen more
efficient than VMWare or User Mode Linux. For instance, Xen divides
page tables among its guest operating systems and gives each guest
full control over its page tables, so that the hardware doesn’t slow
down under the load of two levels of page management (one by the guest
and one by the Hypervisor).
There is one necessary exception to Xen’s practice of handing full
control over paging to a guest: the guest is not allowed to write the
pages that contain its page tables. If it could do that, it could give
itself access to the other guests’ pages. On the other hand, the guest
must be able somehow to indicate that it needs a new page. So the Xen
team has found some tricks to make it easy for Xen to trap a guest’s
writes to the page tables, make sure they’re legitimate, and let the
guest go ahead with the writes.
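To make that validation step concrete, here is a minimal sketch in
Python of the kind of check a hypervisor might apply before letting a
page-table update through. It is a toy model, not Xen's actual
interface: the Hypervisor class and its give_frame and update_mapping
methods are names I made up for illustration. The two rules it
enforces (a guest may map only frames it owns, and frames that hold
page tables may be mapped only read-only) are a simplification of what
Pratt described.

```python
class Hypervisor:
    """Toy model: every machine frame has one owner and may hold a page table."""

    def __init__(self):
        self.frame_owner = {}            # frame number -> guest name
        self.page_table_frames = set()   # frames currently holding page tables
        self.mappings = {}               # (guest, virtual page) -> (frame, writable)

    def give_frame(self, guest, frame, is_page_table=False):
        """Assign a frame to a guest (hypothetical setup helper)."""
        self.frame_owner[frame] = guest
        if is_page_table:
            self.page_table_frames.add(frame)

    def update_mapping(self, guest, vpage, frame, writable):
        """Validate a guest's requested page-table write, then apply it."""
        if self.frame_owner.get(frame) != guest:
            raise PermissionError("frame belongs to another guest")
        if writable and frame in self.page_table_frames:
            raise PermissionError("page-table frames may only be mapped read-only")
        self.mappings[(guest, vpage)] = (frame, writable)
```

In this model a guest can still read its own page tables and can ask
for new mappings whenever it likes. What it cannot do is obtain a
writable mapping that would let it edit the tables behind the
hypervisor's back.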
Another tour de force Pratt illustrated was how Xen eases
failover. Whether scheduled or in a panic situation, hardware
sometimes has to go down. Xen can make it easier to migrate processes
to new systems with minimal downtime.
This is done by doing a series of pre-copies while the guest system
keeps running and updating its state. The first copy takes a long time
because it starts from scratch, but each subsequent copy has less
state information to transfer and thus takes less and less time. (One
weakness of Xen is that it maintains a lot of state and therefore has
a lot of information to copy.) The amount of CPU time devoted to the
copying can also be titrated to leave plenty of time for the process
to continue handling incoming requests. Pratt actually drew applause
when he showed the CPU utilization of a highly loaded Apache server
during transfer to another node, and added it was down for only 164
milliseconds during the transfer.
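The shrinking rounds are easy to see in a small simulation. What
follows is a sketch under simplifying assumptions, not Xen's migration
code: the function name precopy_rounds and its parameters are
hypothetical, the guest is assumed to dirty pages at a constant rate,
and the numbers in the usage example are invented rather than taken
from Pratt's talk.

```python
def precopy_rounds(total_pages, dirty_rate, copy_rate, stop_threshold, max_rounds=30):
    """Simulate iterative pre-copy.

    dirty_rate and copy_rate are in pages per second (assumed constant).
    Returns (pages sent in each live round, pages left for the final
    stop-and-copy while the guest is paused).
    """
    to_send = total_pages          # the first round copies everything
    rounds = []
    for _ in range(max_rounds):
        if to_send <= stop_threshold:
            break                  # small enough: pause the guest and finish
        rounds.append(to_send)
        # Pages the guest dirtied while this round was being copied
        # (integer arithmetic: pages * dirty_rate / copy_rate).
        to_send = min(total_pages, to_send * dirty_rate // copy_rate)
    return rounds, to_send


rounds, final = precopy_rounds(100_000, dirty_rate=1_000,
                               copy_rate=10_000, stop_threshold=50)
# rounds → [100000, 10000, 1000, 100], final → 10
```

Only the final handful of pages has to be copied while the guest is
stopped, which is why the visible downtime can be so short. If the
guest dirties pages as fast as they can be copied, the rounds never
shrink, so the sketch gives up after max_rounds.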
As I mentioned, Pratt really concentrated on Xen’s goal of isolating
processes, but the second goal of doling out resources was touched on
by Rik van Riel in a BOF that evening. He divided the resources
worth tracking into four types: CPU utilization, memory, data I/O, and
network I/O. CPU utilization, he said, was easy to track without
intruding on the guest operating system. So was I/O. Memory was a much
harder nut to crack, because a Hypervisor could easily give a guest
operating system too little or too much of it. He suggested clever
ways that patterns of waiting for reads, waiting for writes, and the
length of request queues (way-stations for reads and writes) could
tell the Hypervisor whether an operating system was underprovisioned
or overprovisioned with memory. But more experimentation is needed to
see whether these are valid measures.
Xen offers a lot of features already on 32-bit x86 hardware, with 64-bit
x86 and AMD coming along too. A number of operating systems can be
guests, including Linux, Solaris, FreeBSD, and OpenBSD. An upcoming
Intel hardware facility called VT-x should allow Xen to run operating
systems that
haven’t been instrumented for it–so even Windows will someday show up
on the list.
VMWare is not passively accepting the limits that have long been
assumed on virtualization. They know that if they can break down the
barrier between Hypervisor and guest operating system, and learn just
a bit about what the operating system is doing (taking a spin lock,
for instance, or releasing a disk block), they can achieve fantastic
speed-ups in virtualization. An unannounced speaker from VMWare
presented some of their innovations at van Riel’s BOF, under the name
para-virtualization.
Para-virtualization’s goal is to blur the barrier between operating
system and Hypervisor enough to obtain useful information, while
minimizing engineering costs and the risk of breaking the operating
system. VMWare knows that Linux developers would have little tolerance
for a development process that required them to slow down so VMWare
could keep up, and that Linux distributors would push back if VMWare
slowed down performance or introduced risk.
Their solution is to introduce a new layer (named VMI) that would
cause some 30 to 50 instructions in the operating system to trap into
the Hypervisor instead of executing as normal. This is reminiscent of
the trap instructions introduced by a debugger into a binary, but
would be even less intrusive, requiring no change to the binary of the
The solution is unique to each processor being emulated, but could
apply to any operating system compiled for that processor. The speaker
claimed that para-virtualization had been easy to introduce into the
Linux development tree and could be maintained as open source with a
typical open source development and testing process.
A narrow range of other topics
Clustering–which, as speaker Bruce J. Walker pointed out, is the
converse of virtualization because it makes many systems act as
one–also turned up a lot at the symposium. Walker said the topic goes
well with virtualization, because if sites want to use virtualization
as an aid to handling failover, they need to coordinate the computer
nodes between which the operating system crosses. He presented a
proposal at the symposium for adding a pointer to the kernel’s task
structure that (with a very small footprint and no impact on
non-clustered systems) would help clustering systems handle the need
to discover which processes were running on remote systems and to
communicate with them.
The wide range of sessions on kernel changes, including solutions to
improve storage management such as multipath device access, was part
of the evidence that every kernel task (caching, filesystems, etc.) is
being examined under a microscope to determine how it can scale
better, adapt to future evolution, and shave off waste.
Other Linux deployments received less attention at this symposium. I
noticed nothing about interesting but arcane deployments such as
robotics or carrier-grade (telephony) applications. Unlike the
symposium I attended four years ago, this one gave just a nod to the
desktop. Instead, the desktop formed the subject of its own two-day
conference preceding this one, as reported in
a recent blog of mine.
The symposium offered a bit more on embedded systems. The
developers give a lot of attention to power management, which I
suspect is driven by the needs of embedded developers but which also
benefits desktop users who have laptops.
Concerns about power management have a major impact on support for
hyperthreading, as discussed by Suresh Siddha in his talk on Chip
Multi Processing. The driving factor is that power consumption is the
same on a chip regardless of whether just one thread is active, or
both. If optimal performance is your goal, you want processes
distributed among all processors, even if only one thread is active on
each. But this maximizes power consumption. So if power management is
a concern as well, the algorithm must be quite different, and must try
to fill the threads of each active chip while leaving some chips idle.
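The two goals lead to opposite placement rules, which a few lines of
Python can illustrate. This is only a sketch of the trade-off as I
understood it, not the kernel's scheduler: the place function, its
policy names, and the assumption of two threads per chip are mine.

```python
def place(num_tasks, num_chips, threads_per_chip=2, policy="performance"):
    """Return a list giving the number of busy threads on each chip."""
    if num_tasks > num_chips * threads_per_chip:
        raise ValueError("more tasks than hardware threads")
    busy = [0] * num_chips
    for i in range(num_tasks):
        if policy == "performance":
            chip = i % num_chips          # spread: round-robin across chips
        else:
            chip = i // threads_per_chip  # pack: fill one chip before the next
        busy[chip] += 1
    return busy


def active_chips(busy):
    """Count chips with at least one busy thread (each draws full power)."""
    return sum(1 for n in busy if n > 0)


spread = place(4, 4, policy="performance")   # → [1, 1, 1, 1]
packed = place(4, 4, policy="power")         # → [2, 2, 0, 0]
```

With four tasks on four chips, spreading keeps all four chips powered,
while packing leaves two of them idle at the cost of making tasks
share a chip.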
I was told that the 2.6 kernel is much larger than the 2.4 version
because developers honored the feature request list of sites running
big iron. The losers in this exchange are embedded developers, many of
whom insist on sticking with 2.4. The 2.6 version’s slow boot-up is
particularly detrimental to adoption for embedded systems.
Several talks offered practical advice on the use of debugging and
instrumentation tools to make developers more effective.
I attended the Fedora and Gentoo BOFs partly to see whether I could
detect any demographic or cultural differences in the attendees, but
they seemed pretty comparable. The Fedora BOF was much larger, of
course. Gentoo has an impressive following, though; its BOF was led
by two IBM employees who say it’s gaining adherents among developers
at IBM, and someone pointed out that the Mozilla Project runs its
servers on Gentoo.
To capture some of Red Hat’s and SUSE’s followers, the Gentoo
developers are considering a slower moving, more stable Enterprise
edition, but an Enterprise edition seems like an oxymoron to me for a
distribution that is known as the most adventurous and cutting-edge of
the popular distributions, and for a team that prides itself on
letting each user customize his or her installation.
Some of the talks at this conference were the most information-rich
I’ve ever attended. When I look over my notes, I am amazed how much
valuable information each speaker conveyed in just one hour.
The detail could sometimes become tiresome, to be sure. I don’t think
an audience is well served by a description of a feature that goes
field by field or function by function. What made the talks useful
were their summaries of a feature’s requirements, history, alternative
or rejected implementations, and subtle implications of the chosen
I sensed less concern at this conference about political trends that
could have an impact on Linux and open source. I just didn’t hear much
talk about them. Perhaps the vote of the European Union against
patents eased the worries of attendees. New respect in the business
community and larger public for open source (note the groundswell of
praise for Firefox) and the gradual receding of the SCO case may also
contribute to this lull in political hyperalertness–although more
threats are likely to arise.
A lot of the folks here are extraordinarily intelligent and capable of
extreme levels of dedicated effort. We’re lucky they’re obsessed with
such things as reverse engineering old video games or getting every
feature of power management to work on Linux. If one of them set his
mind on evil, he could take over the world. (On the other hand, he
couldn’t be as evil as the people who are taking over the
The deeper theme at this symposium is that open source is constantly
being revitalized by the astonishing energy and intelligence of those
drawn to it for whatever reason. And it’s making inroads in
little-known places. I mentioned earlier the reverse engineering of a
game that runs on Windows: the hacker discovered along the way that
this game uses Ogg Vorbis files for audio and Python scripts to
implement many of its rules. It’s hard to imagine a computer field
without open source–but then, no such field will ever exist.
Earlier blog on this symposium: