October 2005 Archives

Andy Oram


Douglas Engelbart, pioneer of the GUI and of computer-supported
cooperative work, has received a couple of awards of late. About 35 years
late, in fact. But he hasn’t let neglect (and perhaps worse, empty lip
service to his accomplishments) curb his spontaneous love of
exploration. Spending a few minutes with him–at a ceremony celebrating an
award given last Saturday by Computer Professionals for Social
Responsibility–convinced
me that he remains an American original with a vast scope of
interests, a bit like Edison or Feynman. I wonder what the modern net
of sensors and cameras and GPS devices and wireless networks would be
like if we had integrated his 1960s-era insights from the start.

I had a chance to tell Engelbart how I had first come across his
reputation and ground-breaking work; it was at a conference about
human communication that I attended in the early 1980s. A keynoter
managed to get his hands on a film of Engelbart’s famous 1968
demonstration of computers as augmentations of human intellect, and
showed us five or ten minutes I will never forget.

The part of the demonstration we viewed started with some voice
conversation. Then a piece of paper came up on the screen and some
marks appeared as someone drew on it. More marks appeared as the other
person drew comments on the first one’s marks. Then a small video
opened up in one corner and the head of an assistant, looking very
much like a Californian grad student of the 1960s, popped up and
started talking. The merger of voice, video, and whiteboarding was a
creative implementation of the kind of teleconferencing I suggested
recently in my article
A googol of teleconferences.

This demonstration, as I saw it in the early 1980s, blew my mind. That
the demonstration actually took place in 1968 was almost beyond ken.
But that was how far ahead of us Engelbart was, and remains.

The 1968 demo reportedly cost $100,000 (which in 1968 dollars is also
nearly beyond ken). The computing world of 2005 has much more of the
infrastructure that could make Engelbart’s vision a common experience.

But suppose we had listened to Engelbart back when he began? I can
imagine what the modern network would be like if we had integrated his
humanistic approach to augmenting intellect into technology each step
of the way:

The dignity and capacity of humans would remain central.

Modern sensor systems, such as smart dust and MIT’s
Project Oxygen,
scoop up data somewhat indiscriminately, while the projected uses of
this new “Internet of things” suggest computers switching each other
to new tasks without human intervention. It’s all kind of scary,
suggesting a world out of our control, a kind of science-fiction
Terminator reign of the machine. (To be fair, the designers of Project
Oxygen claim their mission is to be human-centered.) If we already
had a highly interactive network centered on human interaction, with
people tied closely by wire, we could build the new capabilities
asking at every turn, “How are we enabling people to do more of what
they are best at doing?”

Protocols might be in place for the integration of new instruments.

Click around your computer system–or the web sites of many
organizations, including the internal ones they establish purportedly
to increase the productivity of their employees–and you’ll come
across a lot of information with no apparent use. Even the experts
will admit that some of it is pointless. Sensors and other
participants in the Internet of things are likely to suffer from the
same problem. If we had established a human-centered network over the
years, we might know more about what knowledge we need and provide
frameworks for incorporating valuable new devices.

User interfaces would be richer and more subtle.

Right now we’re stuck with the legacy of the Bell telephone system and
that of the typewriter, a nineteenth-century mechanical device with a
bias toward the characters of the English language. Had we stressed
communication and the contributions of individuals to each other’s
endeavors, we might have a plethora of different ways to interact with
the computer by now.

Perhaps we might even have adaptive interfaces, which watch what users
do and change over time to present each user with the functions he or
she is more likely to want. I’d be very reluctant to use an adaptive
interface in our current state of computing, because our knowledge of
human-computer interaction hasn’t achieved the sophistication an
adaptive system needs to be productive rather than annoying.

We might have new solutions to the storage and retrieval of massive
amounts of data.

For a long time, the focus of the computer field was on providing
applications. Now we’ve shifted toward a focus on
services, which are more fine-grained and can be combined in
innovative ways by the users. I tracked this evolution in an article
titled Applications, User Interfaces, and Servers in the Soup.

Each shift in the use of data–as well as the amount collected and
searched–has brought with it sophisticated research into databases
and storage. We’re seeing another leap in size and search requirements
as people become used to storing images and videos. The Internet of
sensors will lead to perhaps the biggest scaling problem we’ve ever had.
But a network based on communication might have given us a head start
in understanding and adapting to the onslaught of data.

I think Engelbart’s vision involves a move beyond applications and
services to a new focus: support for the most distinctive features of
human intellect, including communication with other people and
devices. Engelbart’s vision has remained a beacon for us over the
decades, and I am hopeful that a future decade will instantiate it.

Geoff Broadwell


Related link: http://www.oreillynet.com/pub/wlg/8097

Over the last two weeks,
I’ve been talking about optimizing code, starting with the information
gathering phase (choosing an optimization target, profiling, benchmarking,
and so on), and continuing with theoretical reasons code is slow (such as
trying to solve a bigger problem than the one at hand, or choosing a poor
algorithm). This week we’ve finally gotten down to the wires and chips:
why real hardware makes seemingly good code run dog slow.

As I mentioned last week, big-O notation ignores the constants that
determine how long each primitive operation takes, and even how many
primitive operations go into each “high level” operation. Even worse,
these constants aren’t really constant. (Osborne’s law: “Variables won’t,
constants aren’t.”) Why? To answer that, let’s take a little tour
of a modern computer.

The heart of each computer is of course its processors. I use the
plural there because the average desktop or video game console contains
at least two powerful processors — the CPU doing most of the system
management and application logic, and the GPU slinging pixels with
abandon. Workstations (and even high end desktops) often contain two
of each, and servers may have dozens of CPUs. There are also various
smaller processors scattered through the system, including DSPs, I/O
processors, network processors, and so on.

All of these processors need to be able to talk to each other, so
the computer is also filled with communication links, either in the
form of shared channels such as PCI, or point to point links such as
HyperTransport. The processors also need external I/O, so another
set of communication channels exists for that; once again, some are
shared channels (SCSI), and some are point to point (SATA).

These communication channels bring us to the first big bottleneck:
communication is slow. Every channel has a maximum bandwidth capacity
as well as a latency involved in transferring data from one end of the
channel to the other. There is also a certain amount of overhead to
initiate a transfer, and often additional overhead to keep a large
transfer going. These three limitations add up to big problems:

  • Transferring a large chunk of data will be slowed significantly
    by the bandwidth limits of the channel. Transferring 2 GB over
    a 1 GB/s link will take (at least) 2 seconds.
  • On shared channels, when more than one device wants to transfer
    data, they not only have to share bandwidth, but often pay additional
    overhead to switch channel users, so each only gets a fraction
    of the bandwidth of the channel, adding up to well under 100%
    of the total available.
  • Transferring small chunks of data will be limited by both the
    latency of the channel and the overhead to set up each transfer,
    especially if that overhead requires a handshake (a protocol in
    which multiple endpoints have to agree to set up the transfer).
    In the handshake case, the latency cost will be paid several
    times over while the various devices reach agreement.
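
The three limits above can be folded into a rough back-of-the-envelope model. The sketch below is only an illustration: the function name, its parameters, and the idea of charging every chunk one setup overhead plus one latency (and two more latencies per handshake round trip) are simplifying assumptions, not a description of any particular bus.

```python
import math

GB = 10**9  # the article's "GB" is taken here as 10^9 bytes (an assumption)

def transfer_time(total_bytes, bandwidth_bytes_per_s, latency_s=0.0,
                  setup_overhead_s=0.0, chunk_bytes=None,
                  handshake_round_trips=0):
    """Estimate the seconds needed to move total_bytes over one channel.

    Every chunk pays the setup overhead and one channel latency, plus two
    more latencies (there and back) for each handshake round trip.
    """
    if chunk_bytes is None:
        chunk_bytes = total_bytes          # one big transfer
    chunks = math.ceil(total_bytes / chunk_bytes)
    per_chunk_fixed = setup_overhead_s + latency_s * (1 + 2 * handshake_round_trips)
    return chunks * per_chunk_fixed + total_bytes / bandwidth_bytes_per_s

# The bulk case from the first bullet: 2 GB over a 1 GB/s link.
print(transfer_time(2 * GB, 1 * GB))       # 2.0

# The same megabyte sent in one piece, then in 64-byte pieces
# (1 microsecond of latency is an assumed figure).
bulk = transfer_time(10**6, 1 * GB, latency_s=1e-6)
small = transfer_time(10**6, 1 * GB, latency_s=1e-6, chunk_bytes=64)
print(small / bulk)                        # the chunked transfer is many times slower
```

With these made-up numbers the bandwidth term is identical in both cases; the whole difference comes from paying the fixed per-transfer cost thousands of times.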

The data flowing across all these channels needs to be stored somewhere,
so each processor has a chunk of memory to call its own. Slow devices
produce (and consume) little data, so this may only need to be a few KB
or even just a few bytes, and it can be stored right on the processor chip.
More powerful processors, including the CPUs and GPUs, can produce and
consume truly massive piles of data. These processors have sizeable
local memories, from a few dozen MB to a few GB.

That’s too much memory
to put on the processor itself, so it gets separated onto other
devices which — you guessed it — have to communicate with the processor
over a slow channel. Worse yet, big memory is slow to access anyway;
channel latency adds to memory chip latency and memory transaction overhead
to make off-chip memory quite slow.

Because big off-chip memory is slow, fast processors have on-chip cache
memory as well, which tries to keep copies of data that is likely to be
used again soon, so that the processor will not have to go across the
channel to get it. But cache is not a panacea. It must be
smaller than the main memory, of course, and the heuristics used to pick
which pieces of main memory to keep copies of, and which data can be tossed
to make room because it won’t be used again soon, are often wrong.

Processor designers can compensate for this weakness by making the
cache bigger, but that makes it slower (big memory is slow, remember).
Pretty soon the silicon spent on cache takes as much room as the
processor itself, so the processor and cache are once again forced to
communicate over a channel that will slow things down. (Sure, the
channel happens to sit right on the processor chip or perhaps right
next to it in the same plastic package, but even this is a long
enough channel to notice the slowdown.) On very fast processors,
the big caches themselves have smaller caches to try to hide the
performance problems inherent in a large cache. Lather, rinse, repeat.

Even if the heuristics were perfect, cache can’t help at all in some
situations, such as when streaming large chunks of data. In this case,
data is only used once, and then either sent to main memory or to another
processor; keeping a copy of the data won’t speed that up. It’s worse
than that, of course; the cache itself adds latency to every memory
access, so keeping a copy of streamed data actually slows everything
down. Also, in order to simplify and speed up the cache design, caches
typically only copy data in chunks of a certain minimum size. Algorithms
that jump around memory grabbing small bits of data tend to be slowed
significantly by caches, as the cache unnecessarily copies more data
from main memory than it needs to (sometimes a lot more), which chews
up much of the bandwidth of the main memory channel.
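
To put a number on that wasted bandwidth, here is a small sketch that compares the bytes a program actually uses with the bytes the cache would pull across the channel. The 64-byte chunk size, the 4-byte accesses, and the assumption that each chunk is fetched exactly once are all illustrative choices, not figures from any specific processor.

```python
def line_traffic(addresses, bytes_per_access=4, line_bytes=64):
    """Return (bytes_used, bytes_fetched) for a list of byte addresses.

    Assumes aligned accesses and that every distinct cache-sized chunk
    touched is copied from main memory whole, exactly once.
    """
    lines = {addr // line_bytes for addr in addresses}
    return len(addresses) * bytes_per_access, len(lines) * line_bytes

# 1024 accesses walking memory in order: everything fetched gets used.
sequential = range(0, 4096, 4)
print(line_traffic(sequential))   # (4096, 4096)

# 1024 accesses jumping 256 bytes each time: one small value per chunk.
strided = range(0, 1024 * 256, 256)
print(line_traffic(strided))      # (4096, 65536)
```

Both loops use the same 4 KB of data, but under these assumptions the jumpy one moves sixteen times as much across the main memory channel.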

Still, there are enough cases in which cache is a win (and sometimes a
significant win) that virtually every processor has one or more
of them. These strata of caches of different sizes and speeds create
what is known as the memory hierarchy. At one end are the
ultra-fast registers, tiny chunks of memory that add up to perhaps 1KB
or less. A little slower and an order of magnitude or two larger is the
L1 (level one) cache; slower and larger still is the L2 cache, and some
processors even have an L3. Beyond that comes the processor’s own main
memory, and then memory belonging to other processors, which must be
reached over yet more channels inside the computer.

Relatively often, all the memory belonging to all the processors in the
computer still isn’t enough to handle all the work the user demands at
once, so every modern operating system supports virtual memory,
in which some of the data is stored on a local disk or even across a
network on another computer. Unfortunately, disks and networks are
even slower than the slowest communications channel inside the computer.
Both have moderate to severe bandwidth limitations (especially networks),
and both have severe to insane latency issues. Processors count time
in fractions of a billionth of a second; disks and cross-country networks
take several milliseconds to set up a data transfer. This is a difference
of 7 or 8 orders of magnitude, and amounts to an eternity from the point
of view of the processor. Sadly, these limitations come partly from
physics and mechanical engineering issues, such as the maximum spin speed
of a disk drive platter using current materials and fabrication techniques,
or the speed of light in a long-haul fiber optic cable. There’s room
for improvement, but not orders of magnitude.
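
The "7 or 8 orders of magnitude" figure is easy to check. The two timings below are assumed stand-ins for "fractions of a billionth of a second" and "several milliseconds"; the helper name is made up for this sketch.

```python
import math

def orders_of_magnitude(slow_seconds, fast_seconds):
    """How many powers of ten separate two durations."""
    return math.log10(slow_seconds / fast_seconds)

cpu_cycle_s = 0.3e-9    # assumed: one cycle of a roughly 3 GHz processor
disk_setup_s = 5e-3     # assumed: a 5 ms seek or cross-country round trip

print(orders_of_magnitude(disk_setup_s, cpu_cycle_s))  # a little over 7
print(round(disk_setup_s / cpu_cycle_s))               # cycles spent waiting, in the millions
```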

Of course, disks and network devices have caches of their own to hide
some of these issues. In reality, these are mostly just buffers; in
other words, their primary duty is to impedance match a fast, high
bandwidth channel to the processor with a slow channel to the disk or
network, so that the processor can spend as little time as possible
waiting for data to trickle in or out. Disk buffers are actually fairly
sophisticated, reordering incoming and outgoing data requests to
minimize the time spent physically moving disk heads back and forth.
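
As a rough illustration of why reordering helps, the sketch below compares head movement when requests are served in arrival order against a single up-then-down sweep. This is the textbook "elevator" idea, used here as an assumption; what real drive firmware does is more involved and is not described in this post. Track numbers are arbitrary.

```python
def head_travel(start, requests):
    """Total distance the head moves when serving requests in the given order."""
    travel, pos = 0, start
    for track in requests:
        travel += abs(track - pos)
        pos = track
    return travel

def one_sweep_order(start, requests):
    """Serve everything at or above the head on the way up, the rest on the way down."""
    up = sorted(t for t in requests if t >= start)
    down = sorted((t for t in requests if t < start), reverse=True)
    return up + down

requests = [95, 10, 60, 5, 80, 20]   # tracks, in arrival order
print(head_travel(50, requests))                        # 370
print(head_travel(50, one_sweep_order(50, requests)))   # 135
```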

Caching and buffering are great things, but essentially they
are kludges, attempts to hide some of the problems some of the time.
As I mentioned before, there are data use patterns that cannot
possibly be sped up by caching, no matter how sophisticated. And there
will always be data use patterns that will be slow in reality, even
if they could be cached in theory. These tend to trip over tradeoffs
in cache design in which the chip designer chose a different tradeoff
than the use pattern would prefer.

Knowing that processors will be forced to wait, the next
technique for keeping them busy is to do more than one thing at once.
That way, anytime the processor has to wait on a data channel (either
sending or receiving), it can switch to another task and try to get
work done on it. This of course only works until that task
has to wait, at which time the processor tries to go to a third task,
or perhaps just back to the first task to check if its data is ready.

As you might imagine, this works much better if there are lots of
tasks, so that there is likely to always be at least one ready to go.
The downside is that all of those tasks have to share the
fixed resources of the system. Each will want to use some of the
bandwidth of the various communications channels, each will take some
room in various caches, and so on. Eventually there will be enough
tasks that the system slows down significantly because of contention
on these shared resources.

When the contention is over space in some
layer of the memory hierarchy, it’s known as thrashing, and
it is one of the biggest performance walls out there. There’s a
noticeable drop in performance as each layer of the memory hierarchy
is overfilled; the system essentially slows to the speed of the
next layer in the hierarchy. When this gets all the way to thrashing
virtual memory, it not only slows the machine to a crawl, but is
actually audible — even the quietest disk begins to make a good
deal of noise when it is forced to read and write data in many
different places on the disk as fast as possible.

Of course, thrashing can be caused by a single application, too; as
its data set grows, the cacheable portion will often grow too large
and overwhelm the smaller caches all by itself. Still, a programmer
may be able to control the cache friendliness of his own application;
trying to make many applications play nicely together is a much bigger
problem.
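
The performance wall is easy to reproduce in a toy model. The sketch below simulates a cache that evicts the least recently used item (one common heuristic, assumed here; real caches vary) and loops repeatedly over a data set. The function names and sizes are made up for illustration.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Fraction of accesses served from a least-recently-used cache."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[key] = True
    return hits / len(accesses)

def looping_accesses(working_set, passes):
    """Touch items 0..working_set-1 in order, over and over."""
    return [i for _ in range(passes) for i in range(working_set)]

print(lru_hit_rate(looping_accesses(100, 10), capacity=100))  # 0.9
print(lru_hit_rate(looping_accesses(101, 10), capacity=100))  # 0.0
```

Growing the data set by a single item past the cache size takes this access pattern from nine hits in ten to none at all, because each item is evicted just before it is needed again.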

Processor designers have more tricks up their sleeve, however. In
particular, they take advantage of more kinds of parallelism than just
the multitasking described above. Current processors have
at least three more ways to get more work done. Simplest (and most recently),
they can simply add additional processor cores to each processor package.
Each core is essentially a complete processor of its own, able to handle
its own set of tasks. Adding cores has an advantage over adding more
processor packages, because the communications channel between the cores
can be very short and wide, and hence very fast. Multiple cores also
usually have their own copies of the fastest caches, but share the largest
caches. This sharing may raise contention issues, but often is a win
because some tasks will tend to be more cache-intensive than others, and
the same total amount of cache can be apportioned in a better way.

Before there was room for multiple cores, processor designers added
more functional units to each processor core. This allowed the
processor to perform several different calculations
within the same task at the same time. Unfortunately, it’s usually
not helpful to simply add dozens of additional functional units. In
most applications, later calculations depend on the results of earlier
ones, and pretty soon most of the extra functional units are idle
waiting for each other’s results. (It happens that computer graphics
is one very large exception to this, so GPUs have literally hundreds
of functional units in them.)
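
A toy scheduler shows how dependencies cap the benefit of extra functional units. It assumes every operation takes one cycle and greedily issues whatever is ready; both are simplifications, and the function name is invented for this sketch.

```python
def cycles_needed(deps, units):
    """deps maps each operation to the operations whose results it needs.

    Every operation takes one cycle; at most `units` operations run per
    cycle, and an operation can run only after all its inputs are done.
    """
    done = set()
    remaining = set(deps)
    cycles = 0
    while remaining:
        ready = sorted(op for op in remaining
                       if all(d in done for d in deps[op]))
        if not ready:
            raise ValueError("dependency cycle")
        issued = ready[:units]
        done.update(issued)
        remaining.difference_update(issued)
        cycles += 1
    return cycles

independent = {i: [] for i in range(8)}                   # graphics-like work
chain = {i: ([i - 1] if i else []) for i in range(8)}     # each step needs the last

print(cycles_needed(independent, units=1))  # 8
print(cycles_needed(independent, units=4))  # 2
print(cycles_needed(chain, units=4))        # 8
```

Four units quarter the time for the independent work and do nothing for the chain, where three of them sit idle every cycle.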

Before processors were made “wider” by adding more functional units,
they were made “longer”. By splitting every instruction handled by the
processor into a sequence of suboperations, and using dedicated (and
highly optimized) hardware for each suboperation, it is possible to form a
pipeline within the processor through which instructions flow.
At any given time, each piece of dedicated hardware, or pipeline
stage, has a different instruction in it. The processor may be
decoding one instruction, fetching input data for another, performing
various stages of calculation for others, and finally writing the
output data for yet more.

Pipelining has two huge benefits — first, older processors still had
to perform these suboperations, but they could only do one at a time,
so every instruction took several clock cycles. A full pipeline can
overlap these suboperations, and so can complete a new instruction every
clock cycle. Second, breaking instructions down even more, into
smaller and smaller suboperations, allows the processor’s clock speed
to be increased significantly. Of course, this only works up to a
certain point; as Intel discovered, pretty soon physics and various
types of overhead get in the way of achieving faster processors this way.
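
The first benefit can be counted directly. The sketch assumes an idealized pipeline with no stalls, where every stage takes one clock cycle; the five-stage depth is just an example.

```python
def unpipelined_cycles(instructions, stages):
    """Each instruction runs all its suboperations before the next one starts."""
    return instructions * stages

def pipelined_cycles(instructions, stages):
    """Fill the pipeline once, then finish one instruction per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1

print(unpipelined_cycles(1000, 5))  # 5000
print(pipelined_cycles(1000, 5))    # 1004
```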

Pipelines also have a major weakness, in the form of stalls,
also known as bubbles. These occur when one stage of the
pipeline has to wait on another stage, and so must sit idle for one or
more clock cycles. The problem is that since a pipeline is always
flowing, idleness moves downstream too; each stage will be idle at
least as long as its predecessor. If the first stage of the pipeline
stalls at cycle number 1, for example (waiting on data perhaps), then
every stage of the pipeline will waste a cycle later on.
The second stage will be idle at cycle 2, the third stage at cycle 3,
and so on.
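
The way a bubble travels can be written down in a few lines. This sketch assumes a simple in-order pipeline where a stall in the first stage is passed to the next stage one cycle later; the function name is hypothetical.

```python
def idle_cycles(stages, stall_cycle, stall_length=1):
    """Map each stage (numbered from 1) to the cycles it sits idle
    because of one stall in stage 1 starting at stall_cycle."""
    return {
        stage: [stall_cycle + (stage - 1) + k for k in range(stall_length)]
        for stage in range(1, stages + 1)
    }

print(idle_cycles(stages=5, stall_cycle=1))
# {1: [1], 2: [2], 3: [3], 4: [4], 5: [5]}
```

One lost cycle at the front costs one cycle in every stage, so a five-stage pipeline gives up five stage-cycles of work in total.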

There are a great many ways that pipelines can stall. Waiting for
data is an obvious one, but two more are extremely common as well. The
first is when a young instruction in an early stage of the pipeline has
to wait for a calculation being performed by a later stage for an older
instruction. The stages in between will sit idle until the calculation
is completed and the earlier stage can continue performing useful work.

The second common stall is when a branch occurs; by the time the
processor realizes that it must take a branch, the branch instruction
will be near the end of the pipeline. The instructions that were in
the pipeline behind the branch were from the wrong code path, so they
all have to be thrown away. Then there’s usually some extra time spent
figuring out where the branch destination is in virtual memory, trying
to find the proper next instruction in one or more caches, and so on.
During all of this, the whole pipeline sits idle. Ouch.

It should come as no surprise that processor designers try to hide these
problems as well. The first case is partially hidden by reordering
program instructions on the fly in a way that will calculate the same
results, but prevent as many of these cross-stage waits as possible.
The branching problem is hidden with a host of tricks, including trying
to guess the correct destination from previous history and executing both
paths after a binary branch and then keeping the results from the right
path once the final branch decision is made (this at least keeps the
pipeline half busy, instead of completely idle). Once again,
these tricks make the average case somewhat better, but there are many
reasons they can fail, and even make things worse rather than better.
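
One textbook way of guessing from previous history is a two-bit saturating counter. It is used below purely as an assumed example; the post does not say which schemes any particular processor uses. The counter moves toward "taken" or "not taken" with each outcome and predicts whichever side it currently leans to.

```python
def two_bit_predictor_accuracy(outcomes, state=2):
    """outcomes: sequence of booleans, True meaning the branch was taken.

    state is a counter from 0 to 3; values 2 and 3 predict "taken".
    Returns the fraction of branches predicted correctly.
    """
    correct = 0
    total = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        correct += predicted_taken == taken
        total += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / total

loop_branch = ([True] * 9 + [False]) * 100   # a ten-iteration loop, run 100 times
alternating = [True, False] * 500            # taken, not taken, taken, ...

print(two_bit_predictor_accuracy(loop_branch))   # 0.9
print(two_bit_predictor_accuracy(alternating))   # 0.5
```

The loop branch is mispredicted only on exit, while the alternating branch defeats this particular predictor half the time, which is the kind of failure the paragraph above warns about.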

I’ve really only scratched the surface of all the ways that real computers
can run much slower than algorithmic analysis might indicate. For
example, I’ve almost completely ignored the huge class of performance
gotchas that arise from processor designers trying to save a little space
here and there; these often appear as special cases that can drive you mad.
(“You can do two multiplies and an add, or two adds and a multiply, at
full speed; but it takes an extra cycle to do three in a row of either one.”)

So now what? If there are so many ways that code can run slowly on real
hardware, what can you do to minimize these problems? It turns out that
there are some broad techniques that should help you reduce the effects
of most of what I’ve written; those techniques will be the subject
of my next post.

What’s the most infuriating performance gotcha you’ve come across?

Christopher Diggins


Related link: http://www.artima.com/weblogs/viewpost.jsp?thread=134186

Here is a reprint of my blog entry at Artima.com:

I frequently encounter open-source code which reimplements code which exists elsewhere (and usually does so badly). When everyone is busy reinventing the wheel, no one has the time to build a cart.

Even though some developers are guilty of simply not doing research, part of the problem is that finding open-source code for a particular purpose is hard. Search engines are well suited for finding text, but not source code. This is because:

  • Source code documents are not often distributed directly on the web, but rather as part of compressed packages
  • Documentation and source-code are often separated. Robots have trouble creating hard-links between documentation and the source code.
  • Comments in source code are treated with the same level of priority as function names and variables. This means that they aren’t indexed with the proper level of priority.

So how does this get solved? Well I can see two ways:

  1. Search engines start applying specialized techniques for parsing and indexing source code.
  2. Open-source developers come up with a new standardized language-independent format for distributing source code (perhaps Open-Source-XML?).
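
To make the first option a little more concrete, here is a minimal sketch of what "specialized parsing" could look like for one language. It uses Python's standard ast and tokenize modules to pull function names, class names, docstrings, and comments into separate fields, so that an indexer could weight each one differently. The field names and the function index_python_source are hypothetical, and a real search engine would need a parser per language.

```python
import ast
import io
import tokenize

def index_python_source(source):
    """Split Python source text into separately weightable fields."""
    fields = {"functions": [], "classes": [], "docstrings": [], "comments": []}
    documented = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)

    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            fields["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            fields["classes"].append(node.name)
        if isinstance(node, documented):
            doc = ast.get_docstring(node)
            if doc:
                fields["docstrings"].append(doc)

    # Comments are not part of the syntax tree, so read them from the tokens.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            fields["comments"].append(tok.string.lstrip("#").strip())
    return fields

SAMPLE = '''
# Parse a config file.
def load_config(path):
    """Read settings from path."""
    return {}
'''

print(index_python_source(SAMPLE))
```

A query for "load config" could then be matched against the function-name field first, with comments and docstrings as weaker signals.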

I think either (or both) of these technologies could have a significant impact on moving software technology forward.

How can we improve searching for source-code?


State Controller Steve Westly kicked off a campaign Wednesday to find the owners of $4.8 billion worth of items in the vault including checks, jewelry and antique gold coins. (From CNN)

In my modest searching of California’s Unclaimed Property Search, I’ve already found some property the state owes to various family members.

CNN’s article on the State of California’s effort to return property was a nice introduction, but oddly enough was missing the actual link to the Unclaimed Property Search. Given it was an online article, it seemed rather odd to be missing the link.

So far, the State of California’s online initiatives have really impressed me. The California DMV has done a great job of putting services up. They had their Hybrid car-pool sticker applications online the day after Federal regulation allowing it became law (Though I still haven’t found the time to go get mine).

I’d suggest searching the Unclaimed Property database for family members who aren’t too Internet-savvy; it’s amazing how many people have property being held by the state for various reasons. If banks, trusts, etc. can’t locate the person, they typically turn the property over to the state, which happens for a lot of reasons and more frequently than many might imagine.

Have your state’s online services been up to par? Find any surprises in the Unclaimed Property database?


Last week I had to reinstall Linux on my work laptop. I’m not entirely certain what happened, but I had some filesystem corruption on my JFS formatted root partition and though I was able to repair it and didn’t lose any data (thank you for saving my bacon yet again Knoppix) my disk started thrashing more often afterwards and I just didn’t feel comfortable with that. So, I backed up my data, wiped everything clean, and decided to move from the Arch distro which I had been using since Dec 2004 to the flavor of the past year, Ubuntu.

But, this article isn’t really about that. It is about my first attempt at printing from the Firefox web browser today. After sending several pages to the printer (a somewhat old HP 5000GN) I walked over to it and found that I had not printed the most recent sales for Linux Desktop Hacks, but rather, I had three pages of the same error message.

This message was about 11 lines long, and told me in a straightforward manner that the problem was that the Postscript interpreter on my printer was version 2014.18 and that the printout requires version 2015 or greater. Now, I sorta had an idea what that meant, but I really didn’t know how to go about fixing it…my first thought was that I would have to upgrade the firmware of the printer which is something that I had to do in the past when some of our Macs were having a hard time printing PDFs from InDesign. Anyway…

…I didn’t need to figure out what to do. The rest of the error message told me exactly how to fix the problem. It involved changing the Firefox print command from the lpr gibberish that is there by default, to include some preprocessing with Ghostscript. I’m in the amateur ranks when it comes to lpr statements and Ghostscript is even more of an unknown, but I could certainly type in what I was told. So I did…

…and it worked.

If you encounter this problem and error message you could say that it’s irksome to have to make any adjustments from the default. Or you could just be happy that some programmer had the forethought to put in a useful error message for something that would possibly be a common problem. Having this information right there in the printout is much more useful than having to search for a FAQ somewhere, particularly if all you know is your document didn’t print.

If only there were more good error messages like this.


Related link: http://www.onlamp.com/pub/a/onlamp/2005/10/13/what_is_rails.html?page=last#threa…

In comments on Curt Hibbs’s What is Ruby on Rails?, he and Aaron Trevena, maintainer of Perl’s similar Maypole project, have debated whether Ruby or Rails are doing anything particularly new.

For people who’ve only ever seen complex “enterprise-class” frameworks, libraries, and designs as usable, watching any of the Rails movies might give some evidence that there is another way. About 95% of all possible web programming problems don’t need huge application servers or complex transactional and messaging systems. Solving them with a fraction of the effort, and perhaps with fewer lines of code than the complex system requires lines of XML in its configuration files, is a good thing.

Of course, anyone using a decent set of libraries in Perl, Python, Ruby, or PHP probably already knew this.

Ruby does bring certain advantages; I much prefer the ActiveRecord syntax and introspection over that of Perl’s Class::DBI, but they’re both fantastically useful. They’re equivalent enough that neither offers an order-of-magnitude improvement over the other.

Where something like Python’s Django might invent and polish a new idea, the amount of time and work necessary to do something similar in Perl or Ruby isn’t large either. I don’t have enough practical experience with PHP 5 to judge there, but I’m sure it’s also flexible and dynamic enough to work.

In my mind, the issue isn’t “Ruby on Rails is more flexible and capable than standard J2EE or .NET for any project under a (very high) threshold of complexity”. The real point is that the simplicity, flexibility, and abstraction possibilities offered by dynamic languages and well-designed libraries — as well as a talent for exploiting radical simplicity, extracting commonalities from actual working code, and knowing when too much flexibility makes you less agile — offer a huge advantage over languages and libraries and frameworks and platforms that assume you need a lot of hand-holding to solve a really hard problem.

Yes, Ruby on Rails does what it does very well. It’s not the only thing that does, though. I wonder perhaps if some of the buzz and glow is that it’s new and shiny (in comparison), so that people haven’t already formed their own opinions about it, as they may have with Perl (oh, you can’t write readable and maintainable code), Python (all the fun of the Lisp community without half the things that make Lisp special), and PHP (a language that needs to grow up).

Fortunately, a lot of smart people already understand this. It would be nice to have the right debate, though.

Am I wrong? Is it really Ruby and Rails, or is it the dynamism, flexibility, and better opportunities for abstraction of dynamic languages that provide so much of the benefit?

Nitesh Dhanjani


I’ve been spending a considerable amount of time auditing web applications, and I’ve come to realize that a large number of developers do not understand the root cause of Cross Site Scripting (XSS) vulnerabilities. The most common mistake committed by developers (and many security experts, I might add) is to treat XSS as an input validation problem. Therefore, I frequently come across situations where developers fix XSS problems by attempting to filter out meta-characters (<, >, /, ", ', etc.). At times, if an exhaustive list of meta-characters is used, it does solve the problem, but it makes the application less friendly to the end user – a large set of characters is deemed forbidden. The correct approach to solving XSS problems is to ensure that every user-supplied parameter is HTML Output Encoded (Example: < is replaced with &lt;). Most frameworks (.NET for example) provide APIs that help with HTML encoding, but I have come across instances where such APIs don’t encode certain characters that can lead to XSS when more complicated variants of input are attempted. Therefore, if you are a web developer or tester, I highly recommend testing your web-based applications and services with the inputs suggested in RSnake’s XSS cheat-sheet.
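
As a minimal sketch of output encoding, the example below uses html.escape from Python's standard library in place of the .NET APIs mentioned above. The render_comment function and the surrounding paragraph tag are made up for illustration. Note the assumption that the value lands in ordinary HTML element content; values placed inside scripts, URLs, or unquoted attributes need encoding suited to those contexts.

```python
import html

def render_comment(user_text):
    """Build an HTML fragment, encoding the user-supplied text on output."""
    return "<p>" + html.escape(user_text, quote=True) + "</p>"

# Angle brackets and quotes are all allowed as input; they are simply
# rendered as text instead of being interpreted as markup.
print(render_comment('<script>alert("x")</script>'))
# <p>&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</p>
```

Nothing is rejected or stripped here, so the user keeps the full character set and the browser never sees the input as a tag.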

Andy Oram

In recent weeks the state of Massachusetts announced, to cheers on one
side and alarm on the other, that it would start writing all new
memos, spreadsheets, and other documents in the OpenDocument format
standardized by
OASIS.

Now there’s a spiffy new web site by the
OpenReader
activists, promoting this format for ebooks.

These are two sides of the same coin, one that buys us freedom in
document formats. Getting your document’s content accurate and
readable is enough of a hassle without worrying about whether a change
in computer platform or tools will render the document ugly–or worse
yet, gibberish.

OpenDocument is an input format, OpenReader an output format.
OpenDocument provides freedom for writers, ensuring that they can
switch production tools as better ones become available. It also
promotes compatibility over time (less chance that upgrades will render
documents unreadable) and protection against bugs.
OpenOffice.org
and
KOffice
are among the projects adopting the format. If a number of states and
countries follow Massachusetts’s lead (which seems likely), Microsoft
may give up its current carping and jump on board.

Within Massachusetts, opponents of the move to OpenDocument are
reduced to about the weakest argument they can find–saying that
conversion would cost a lot of money. The whole impetus behind the
OpenDocument movement is to free us from such short-term thinking.

As for OpenReader, it promotes freedom for readers. It means that for
the first time there’s a feature-rich, multimedia format that allows
publishers to offer ebooks in confidence and that multiple device
manufacturers can support.

The Web offers much room for innovation, but it tends to be weak in
certain areas, particularly for large documents. It doesn’t let you
bookmark arbitrary points in documents, for instance. (XPath would
support that, but Web users don’t have access to tools using XPath.)
OpenReader addresses such needs.

As a proof of concept,
OSoft
is converting its free-software ThoutReader browser to OpenReader. So
books will hopefully start appearing in that format in 2006.

Probably there will always be elements of communication that are
non-standard. Standards bodies can’t keep up with innovation; they
usually must follow it. Free software implementations will promote
innovation without limiting access. So OpenDocument and OpenReader,
along with their free implementations, are foundations for future
document freedom.

Sid Steward

Related link: http://www.alwayson-network.com/comments.php?id=12541_0_11_0_C

Vint Cerf says the revolution is on the edges of the internet and cites VoIP as a good example. The edges? Client applications? That’s Microsoft’s turf.

While many eyes watch for new, life-changing web services such as Google search and Google Maps, they might be missing the next revolution.

VoIP, P2P, RSS, tagging and blogging are all decentralizing forces. VoIP and P2P have largely dispensed with centralized infrastructure. As bandwidth and computing power grow, I expect we’ll see more action on the desktop, not less.

Jonathan Bruce

Related link: http://blogs.datadirect.com/bin/mt-tb.cgi/40

Back in June, I talked about Microsoft’s public disclosures that detail some of their plans for Office 12. Finally we can look forward to XML file formats for the stable Office applications: Word, Excel and PowerPoint.

Since then, I’ve watched with interest as Open Document discussions have increased in volume. Foremost among the voices are bloggers such as Jonathan Schwartz, COO of Sun; David Berlind from the ZDNET Tech Blog portal; and let’s not forget the ever-growing community behind OpenOffice.org.

To add spice to this interesting recipe, mix in the very public moves made by the Commonwealth of Massachusetts to adopt Open Document as its standard file format. Add some furious speculation as to what the recent Sun-Google alignment really means, and then quickly serve up a full-scale debate on whether or not we are on the cusp of a new revolution.

Let me try and cut through what I think is likely to emerge as the difference between reality and (to some degree) spin. First some key facts:

  • Google has successfully executed around the AJAX model, giving a compelling web experience for email, maps and more recently blogs.
  • Open Office 2.0 recently shipped for general release. If you’ve not tried it, it is worth a look as the 2.0 version is a vast improvement on earlier versions.
  • Sun and Google have announced a broad technology partnership.
  • Open Document is seeing traction and serious consideration from the Commonwealth of Massachusetts.

Now let’s look at the evidence before us. First let me reiterate this quote from Jonathan Schwartz’s blog:

    “Could these apps I mention, above, be enhanced with better network connectivity, more collaboration, and better integration into your daily life? Absolutely…..So if you want to know what the future portends for OpenOffice.org, that’s a fine place to start (and AJAX will likely play a role).”

In this case I agree with David Berlind of ZDNET. The true meaning here is in what Schwartz omits. Large companies like Sun and Google always have an eye out for technologies that trigger something magical (the halo-like effect). The trick is to do it in such a way that developers will happily invest hours of their time with this technology and innovate freely around it.

If Google and Sun come up with something really compelling, I am very excited by the prospects, but I think it is important to add one note of caution: Microsoft will work feverishly to protect its golden Microsoft Office franchise. While we can look forward to XML formatted office documents, history has shown us that Microsoft is unlikely to rely entirely on spurring developer activity. They have the advantage of an estimated 95% market share.

From my perspective, opportunities abound for technologies like XQuery to become integral parts of engines that integrate distributed XML and relational data sources beneath an AJAXified Open Office front end. The introduction of Microsoft’s Office 12 formats presents similar opportunities.

From a developer’s standpoint it will be important to understand how to participate in the different strategies as they emerge. Should you opt for the halo effect pursued by the underdogs (Google/Sun), who seek to establish mindshare for Office applications? Or should you work with the incumbent (Microsoft), who will more than likely follow the upgrade route, incorporating less inclusive developer approaches that ensure their continued dominance?

Ladies and gentlemen, please place your bets.

brian d foy

Chris Albritton of Back To Iraq does a bit of investigative journalism using the revision tracking features of a Word document. He can see the changes to the Mehlis report on the assassination of Rafik Hariri. Additionally, he can match up the time of the revisions to the time Special Representative Mehlis met with UN Secretary-General Kofi Annan. Several names were redacted, but it’s too late for that because the Word document Chris got still has the revision history, so it still has the names.

He’s posted the relevant section in his entry Names Deleted from Mehlis Report.

Related link: http://www.zoomerang.com/survey.zgi?p=WEB224KLPXJUHE

We run a short survey every year to understand our readers. We use this information to change the topics we cover and to present our content more effectively. Last year, who could have predicted that Ruby on Rails and Ruby in general would grow so tremendously in popularity? (Okay, a few people knew it was good, but this popular this quickly?) This year, what are you reading on our sites?

So far, 97% of all respondents read articles. I’ve always thought that this is the primary draw and the statistics so far back it up. Also, slightly under half read weblogs — perhaps we should find a way to present them more prominently.

Half of the respondents find articles by browsing the home page, while over a third use a feed reader. It could be interesting to correlate the reading patterns of the groups (but I don’t have any statistical analysis of this at the moment).

Some questions allow multiple answers. Nearly 38% of respondents use BSD of some sort, with 82% using Linux, 33% using Mac OS X, and 60% using Windows. More interestingly, 73% of respondents develop on Linux, 53% on Windows, 26% on a BSD, 23% on Mac OS X, and 13% on Solaris. Deployment is a bit different, with 80% deploying on Linux, 50% on Windows, 32% on a BSD, 17% on Mac OS X, and 19% on Solaris.

These numbers obviously differ from the desktop market as a whole and probably reflect the bias of the site and the nature of our audience.

So far, the largest job category among respondents is software developer (17%), with applications developer (14%) and system administrator (12%) not far behind. I don’t know what the difference is between the first two.

Nearly half of the respondents work for small companies of 50 people or fewer, though the rest of the responses fall pretty evenly between 50 and over 2500 people.

Finally, there’s a heavy industry bias. 20% of all respondents describe their business or industry as computer software or Internet and e-commerce.

The survey closes this Friday, 28 October, so please take it before then. We’ll enter you in a drawing for some nifty swag. More importantly, we’ll use what you tell us to plan for the next year of the site. (Note that the survey uses cookies only for its duration.)

I’ll be back after the survey ends to report on the results as a whole.

brian d foy

The Office of the Inspector General reports on the Transportation Security Administration’s computer network security, and it isn’t pretty.

Remember, the TSA are the same people who violated the Privacy Act by collecting airline passengers’ personal information without notifying them.

People worry about identity theft from shopping online. I worry about my government virtually giving it away.

Sid Steward

Google Library could pipe print publishers’ works into millions of homes; it’s a natural marriage of new and old media. So why are publishers fighting Google’s embrace? More importantly, how could they patch things up?

I think the main issue is money. Publishers want a piece of the action, but Google doesn’t want to pay.

Google could coax publishers into Google Library by allowing readers to buy pages of content online. Just like the old library photocopier, but on your desktop. To that end, Google would need to fashion a PayPal-like service.

So, I search and Google gives me an excerpt. I pay, and I see the page scan. Google and the publisher split my money, and everybody is happy. I dubbed this idea PageSense.

Fair Use Isn’t Enough — We Want Content

If Google wins the fair use battle, it will have license to index anything. (* Crowds Cheer *) Problem is, what good is a search result if it points me to an out-of-print book hidden in some monastery? Sound unlikely? Google suggests that 60% of the Google Library would be inaccessible content. The solution is to let me pay a quarter to see the page scan Google made. This quarter gets split between Google and the publisher. I’m happy, Google’s happy, and the publisher is happy.

Encourage Valuable Online Content

PageSense would also be helpful to web publishers. Today, web publishers can make a few bucks from ad revenue. So they publish content that maximizes this indirect reward — even web spam.

By creating a system that directly rewards web publishers for good, relevant content, you could expect to see an increase in valuable content online. It wouldn’t all be free. But you could see excerpts using Google and then pay for what you want. Win-win-win.

Jonathan Wellons

Abraham Lincoln knew the principles of Web 2.0 when he said, “… [virtual communities] of the [users], by the [users], for the [users], shall not perish from the [Web].”

Jeremy Jones

Related link: http://geekmuse.net/blog/comments.php?y=05&m=10&entry=entry051017-133741

I’m running a little behind on my podcast listening lately. I’m trying different podcasts to see which ones I want to fill my 2 hours of daily commute with. So far, my regulars are O’Reilly’s Distributing the Future, This Week in Tech, GeekMuse, and some of IT Conversations.

The second topic on Episode 11 of GeekMuse was about TurboGears, apparently a discussion of the recent Slashdot article which compared TG with Ruby on Rails. The GeekMuse discussion quickly turned away from TurboGears and toward Python itself. Some of the GeekMusers made comments of disgust about the semantic value of white space in Python. There was the obligatory comparison to Cobol. There was a disdainful mention of forcing users to adopt a consistent coding style. And there was also a question of flexibility, noting that sometimes it is more convenient or readable to just have an if statement on one line (which, by the way, Python supports).
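For anyone who hasn't seen the one-line form, here is a tiny sketch of it. The `describe` function is a made-up example; the point is only that Python accepts a simple statement on the same line as the `if`, after the colon.

```python
def describe(n):
    """Classify a number using single-line if statements."""
    if n < 0: return "negative"   # a simple statement may follow the colon
    if n == 0: return "zero"
    return "positive"


print(describe(-3), describe(0), describe(7))
```

Whether this is more readable than the indented form is a matter of taste, and many style guides discourage it, but the language does not forbid it.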

All of these arguments are pretty entertaining to me. The discussion never moved past the whitespace issue. And from their discussion, this issue alone would keep the GeekMusers from ever adopting Python as their respective language of choice. No problem. It’s a big world. There are plenty of good languages to choose from. Diversity makes the world go ’round.

It’s just funny to me because the whitespace issue nearly kept me away from Python. And it’s typically the first issue non-Python types bring up to explain why they wouldn’t adopt Python. When I first looked into Python, I had been working with Perl and had tinkered a bit with C, so I was accustomed to curly brackets identifying beginnings and ends of code blocks. The whitespace in Python turned me off. I could never give the rest of the language enough of a glance to properly appreciate it. I don’t remember what happened, but whitespace became less of an issue for me. I dug into Python and quickly became attached to the simplicity of the syntax and how natural thinking about problems in this new language became.

Now, I enjoy the whitespace. It reminds me that I should be indenting code blocks anyway for readability. Yes, there are problems if one developer uses tabs (bad programmer!) and another uses four spaces like he’s supposed to. But it makes the code so much easier to read, understand, and troubleshoot.

So, how much of an issue should significant whitespace be? I personally think that every development organization and every development project (open source or not) should establish coding conventions which must be adhered to. Some of those conventions should entail the use of whitespace, even for Python where whitespace is meaningful. And it is my preference for whitespace to follow code blocks for readability. Maybe I’m wrong, but I think most developers would agree with me on this point. And if they do, what’s the big deal? If everyone should be indenting properly, anyway, what’s the problem with including whitespace as part of the syntax of the language? But, like I said earlier, it’s a big world. There are plenty of good languages to choose from, most of which don’t regulate where or how whitespace is scattered through source files. If mandatory whitespace gives you the coding heebie-jeebies, you can find a language other than Python which will suit you well.

Jeremy Jones

Related link: http://www.cherrypy.org/

From cherrypy.org:

After 6 months of intense development since the last stable release, CherryPy-2.1.0 is finally released. Grab the release from the download page and make sure you read “What’s new in CherryPy 2.1” for instructions on how to upgrade from 2.0. You can also have a look at the ChangeLog.

On the what’s new in CP 2.1 wiki page, there is a reference to a Session authenticate filter, but the page it links to is nonexistent.

Authentication was something that drew me to CherryPy back before 1.0, but has since been removed. I’ve been looking for some authentication mechanism with the blessing of the CP team or, preferably, something of the sort in the CP library. I hope this is it.

Regardless of whether that little desire of mine has come to fruition, great work, folks! CherryPy is fantastic! I wish you many more years of releases!

Derek Sivers

Related link: http://ferret.davebalmain.com/trac

If you are planning to use Ruby in any project that will need to search anything, pay close attention to Ferret - a Ruby port of Lucene.

Back in January, I started my rewrite of CD Baby in Rails - but one of the biggest unsolved problems was search :

  • over 100,000 albums (and adding 200 new albums a day!)
  • over 1,000,000 songs
  • need to be searchable not just by exact-match, but partial-match and mis-spelling
  • results need to be weighted so that exact-match result comes before partial-match
  • every search must search these six fields: artist, album, style, description, mis-spellings, similar-artists
  • result matches need to be weighted in this order of fields: artist, mis-spellings, similar-artists, album, style, description
  • all this has to happen in under 1 second

I used to have great search results, but it took TEN queries to do it (4 exact-match queries followed by 6 LIKE ‘%string%’ queries). This was fine before CD Baby got popular, but once we started growing, my old reliable search was taking 30 SECONDS to return results! Live! On the website! Intolerable!

I switched to MySQL’s fulltext search. Fast, yes. But disappointing results. Too many results. Search for “Bob Dylan” and you’ll get EVERY artist with any mention of “Bob” OR “Dylan” in their name or album name.

I asked on my blog here, and got some good advice, including a recommendation for Lucene. My good friend Robert Kaye also told me about Lucene. No - he RAVED about Lucene - about how it could wildcard-search a million strings and return properly-weighted results in a few milliseconds. We talked about his Lucene experience for an hour, and I was convinced that this was the way to go. If you’re interested in learning more about Lucene, download the Lucene book : Lucene in Action. It’s great.

Only one problem : it’s in Java. Fucking Java. I’ve never tried Java. I was hoping to not have to. I don’t hear nice things about it. It’s on my coffee list. But I was considering learning it a bit, just to get Lucene going.

RUBY BINDINGS TO LUCENE?
I asked around the Ruby list, and found out that Brian McCallister had been given a small grant to write Ruby bindings to Lucene. This looked very promising at first, but it eventually became apparent that it just wasn’t going to happen. At all. Sigh….

LUCENE WEB SERVICE:
Robert Kaye wrote the Lucene Web Service for me. Tomcat. Java. A good start. Open source. Even has some other contributors. But it still would mean I’d need to install Java on my servers and maintain a Tomcat server, and do all this Java stuff I was really really hoping not to have to do, just to search my catalog! But it seemed like the only way, so I was going to dedicate next week to setting it all up and getting to know it.

ANNOUNCED THIS WEEK : LUCENE FOR RUBY! HOLY SHIT!
Then just a few days ago, David Balmain announced a full port of Lucene to Ruby - called Ferret. A full port! No Java needed! Oh man what perfect timing.

Marcus Whitney

Related link: http://www.zend.com/collaboration/

I took some time to reflect on the idea of a Zend PHP Framework.  I’ve come to understand that Zend is interested in furthering PHP in the enterprise, and has first hand experience with the questions that Fortune 100 and even Fortune 500 companies have about adopting PHP.  The Zend PHP Framework is not intended to threaten Solar, Prado, Yawp, Cake, Mojavi etc.  Fortune 100 companies would never adopt any of these frameworks on a large scale (no offense).  There just isn’t enough support behind them.  Too much risk.  Zend also can not be expected to fully support a framework that they haven’t been in on from day one.

Zend is for-profit, and I understand that mindset.  Their motives are often questioned because they have such a close relationship with a community that is anything but for-profit.  But I think they actually do a pretty good job of focusing so high in their company goals, that the community and the aspiring PHP greats have little to worry about.  You will still be able to get people to use your framework, .vimrc quickies, eclipse plug-ins etc.  I have no intention of switching to the Zend Framework in my existing projects, but I very well may use Zend’s framework in future projects.  It’s just less for me to support and worry about when I’m trying to keep a close eye on ROI.

As for anyone expecting Zend to collaborate on their existing project or framework, good luck.  It’s clear that they have an agenda, and they are moving on their roadmap with great expediency.  To their credit, this is really their vision, so the framework’s name is appropriate.  Taking PHP to the level of adoption that Java has achieved is a serious undertaking.  It’s like there are almost two tracks here, which is the beauty of PHP.  As an individual, you can choose to go a completely non-Zend route and still be very effective.  But now, the organization with dev teams of 100 or more (besides Yahoo, which has PHP pioneers in its fold) can pick up some Zend Studio and Platform licenses, get the framework going and have some standards as well as Java integration going fairly quickly.  It’s a good thing.

Derek Sivers

Related link: http://www.postgresql.org/docs/7.2/static/queries.html

I spent so long in MySQL without the option of subselects that I got used to JOINing tables as the only way of doing things. Tonight (in PostgreSQL) I replaced a JOIN query with the sublime power of subselects.

PREVIOUS:
SELECT DISTINCT items.id, items.cache_sold
FROM item_subgenre_links isl
INNER JOIN catalogs_items ci ON isl.item_id=ci.item_id
INNER JOIN items ON isl.item_id=items.id
INNER JOIN subgenres ON isl.subgenre_id=subgenres.id
WHERE ci.catalog_id=1
AND subgenres.genre_id = 9
ORDER BY cache_sold DESC LIMIT 10;

SUBSELECT:
SELECT items.id, items.cache_sold
FROM items
WHERE id IN (
    SELECT item_id
    FROM catalogs_items
    WHERE catalog_id = 1
    AND item_id IN (
        SELECT item_id
        FROM item_subgenre_links isl
        WHERE isl.subgenre_id IN (
            SELECT id
            FROM subgenres
            WHERE genre_id = 9)))
ORDER BY cache_sold DESC LIMIT 10;

As far as my non-developer-brain understands it, the reason that the subselect approach is more efficient is that you’re limiting the available choices first, instead of joining all four tables and finding the intersection.
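Before trusting a rewrite like this, it is worth checking that both queries really return the same rows. Here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table layouts and sample rows are invented for illustration (only the table and column names come from the queries above), and SQLite stands in for PostgreSQL. It checks equivalence on toy data only; which form is actually faster depends on the database's query planner and the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, cache_sold INTEGER);
    CREATE TABLE subgenres (id INTEGER PRIMARY KEY, genre_id INTEGER);
    CREATE TABLE item_subgenre_links (item_id INTEGER, subgenre_id INTEGER);
    CREATE TABLE catalogs_items (catalog_id INTEGER, item_id INTEGER);
""")
# Made-up sample data: item 1 sits in two genre-9 subgenres, so the JOIN
# produces it twice before DISTINCT collapses the duplicates.
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, 50), (2, 40), (3, 30), (4, 20)])
conn.executemany("INSERT INTO subgenres VALUES (?, ?)",
                 [(100, 9), (101, 9), (200, 5)])
conn.executemany("INSERT INTO item_subgenre_links VALUES (?, ?)",
                 [(1, 100), (1, 101), (2, 200), (3, 101), (4, 100)])
conn.executemany("INSERT INTO catalogs_items VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 4)])

JOIN_QUERY = """
    SELECT DISTINCT items.id, items.cache_sold
    FROM item_subgenre_links isl
    INNER JOIN catalogs_items ci ON isl.item_id = ci.item_id
    INNER JOIN items ON isl.item_id = items.id
    INNER JOIN subgenres ON isl.subgenre_id = subgenres.id
    WHERE ci.catalog_id = 1 AND subgenres.genre_id = 9
    ORDER BY cache_sold DESC LIMIT 10
"""
SUBSELECT_QUERY = """
    SELECT items.id, items.cache_sold
    FROM items
    WHERE id IN (SELECT item_id FROM catalogs_items
                 WHERE catalog_id = 1
                 AND item_id IN (SELECT item_id FROM item_subgenre_links isl
                                 WHERE isl.subgenre_id IN (SELECT id FROM subgenres
                                                           WHERE genre_id = 9)))
    ORDER BY cache_sold DESC LIMIT 10
"""

join_rows = conn.execute(JOIN_QUERY).fetchall()
subselect_rows = conn.execute(SUBSELECT_QUERY).fetchall()
print(join_rows)
print(subselect_rows)
```

To compare speed on a real database, running each query under the database's own `EXPLAIN` facility will tell you more than guessing from the shape of the SQL.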

Geoff Broadwell

Related link: http://www.oreillynet.com/pub/wlg/8097

Last week I talked
about how to determine how an application performs in different
scenarios, and generally where the code has bottlenecks. It’s
time to talk about why the code might be slow.

There are a great many particular reasons why code could be performing
slowly, so I’ll start off by painting some broad strokes and go for more
detail later on. Let’s break these issues into some big categories:

  1. The application is attempting to solve a more difficult problem
    than the one the user actually has.
  2. The code does more work than is necessary to solve the problem
    it was designed for.
  3. The code is great in theory, but runs slowly on real hardware.

That first one may seem silly, but I’d warrant it’s one of the biggest
causes of performance complaints. The mistake could be as simple as
running a complex statistical analysis program on a huge dataset, when
the user only needs to know how many data points lie within a certain
range. Or perhaps the application computes all possible answers to a
problem, when knowing just one answer is enough.

Or it could be a more subtle issue, like requiring an exact answer when an
approximation will easily do. There are many problem spaces for which
the only known ways to calculate an exact answer are vastly slower than
making a very good guess. Modern computer graphics would not even exist
if not for the fact that, most of the time, gross approximations are
just fine. There are even problems for which it is infeasible or
impossible to calculate an exact answer at all; the only decision to
make is how loose an approximation you can accept.

In other words, before looking at the code at all, think about what you
are really trying to achieve. Does the code try to compute an exact answer
to a difficult problem when all you really have to determine is “Yes,
no, or maybe?”, or “More than a few?”, or a similar fuzzy result?
Does the application compute many different things, when only a few
are actually necessary?

For example, let’s say you need to determine which hours of the week
your web server is most heavily loaded, and how much difference there is
between the heaviest and lightest loads. You could analyze all of the
log files, counting hits per second, bytes sent per connection, and
so on. Unfortunately, for a heavily loaded site this would require a
lot of work, perhaps taking an hour to churn through all the log data.
This makes a casual query turn into a huge production, and leads to
silly behavior like running a whole slew of similar reports every night,
just in case anyone wants to know what happened the previous day without
waiting an hour or two to get an answer for a simple question.

Alternately, if your web server rotates logs every hour, you could
just compare the log file sizes. A small script could even
chart a rough answer before the user could take his finger off the
enter key. Sure, the answers this method gives will only be an
approximation. But if you really only need to know whether your
traffic is fairly smooth or spikes up a lot on Monday mornings,
that’s good enough. Better yet, what was once a production is now
something that anyone could find out on a whim. Want to know whether
that big breaking news story brought in heavier traffic today? You
can have the answer now, rather than tomorrow morning when
the standard reports are delivered.
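Here is a rough sketch of that "just compare the file sizes" script in Python. The `access.log.*` naming pattern, the function names, and the example directory in the comment are all assumptions for illustration; adjust them to however your server names its rotated logs.

```python
from pathlib import Path


def sizes_by_file(log_dir, pattern="access.log.*"):
    """Map each rotated log file name to its size in bytes (a rough traffic proxy)."""
    return {p.name: p.stat().st_size for p in sorted(Path(log_dir).glob(pattern))}


def chart(sizes, width=40):
    """Return one text bar per file, scaled so the largest file fills `width`."""
    biggest = max(sizes.values(), default=0) or 1
    return [f"{name:<24} {'#' * round(width * size / biggest)} {size}"
            for name, size in sizes.items()]


def spread(sizes):
    """Ratio of the heaviest to the lightest hour; inf if some hour logged nothing."""
    lightest = min(sizes.values())
    return max(sizes.values()) / lightest if lightest else float("inf")


# Example use (the directory is hypothetical):
#   sizes = sizes_by_file("/var/log/httpd")
#   print("\n".join(chart(sizes)))
#   print("heaviest/lightest:", spread(sizes))
```

No log line is ever parsed; the script only asks the filesystem for file sizes, which is why it can answer almost instantly.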

If you can make this kind of qualitative change in your solution,
do so. You’ll get more of a performance boost (and happier users) this
way than through anything else I’ll cover. Go ahead, spend some time
on this. I’ll wait.

At this point I’ll assume you’ve either determined that your solution
really is a good match to the problem, or have already gone back and
simplified your solution, but still need to do better. Next up is to
check whether the code does more work than needed for the chosen solution.
The big culprits here are poor choice of algorithm and poor use of decent
algorithms.

Let’s get the “poor use of decent algorithms” case out of the way first.
Even good algorithms can perform poorly if misused, especially in
combination. Chances are pretty decent, for example, that grepping a
few items out of a big list and then sorting the results will be faster
than sorting the big list first and then grepping. It may not always
be obvious that this is happening; sometimes the various steps are in
widely separated code written by different teams, or even in different
applications. This is particularly a problem when modules do a mountain
of work implicitly, either to “save the user the trouble”, or because
the module requires specific properties of its inputs (such as being
sorted), and would rather enforce those properties than merely detect
whether they have been violated.
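The grep-versus-sort ordering is easy to try for yourself. The sketch below builds a made-up list of random numbers, picks a handful of "wanted" items, and times both orderings; the data, sizes, and function names are arbitrary choices for illustration, and the timings will vary by machine.

```python
import random
import timeit


def sort_then_filter(items, wanted):
    """Sort the whole list, then keep only the wanted items."""
    return [x for x in sorted(items) if x in wanted]


def filter_then_sort(items, wanted):
    """Keep only the wanted items, then sort the (much smaller) result."""
    return sorted(x for x in items if x in wanted)


random.seed(0)
big_list = [random.randrange(10**9) for _ in range(200_000)]
wanted = set(big_list[:5])  # the few items we actually care about

# Both orderings give the same answer; only the amount of work differs.
assert sort_then_filter(big_list, wanted) == filter_then_sort(big_list, wanted)

for fn in (sort_then_filter, filter_then_sort):
    seconds = timeit.timeit(lambda: fn(big_list, wanted), number=3)
    print(f"{fn.__name__}: {seconds:.3f}s for 3 runs")
```

Both versions still scan the whole list once to filter it; the difference is whether the sort has to handle every item or only the few that survive.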

A system administrator friend likes to tell the tale of a particular
script that was the subject of many performance complaints. After a
bit of research, he determined that the performance of the
script was completely dominated by one command pipe. It turned out
that the commands in that pipe were implicitly sorting a large data
set twice, then explicitly sorting it again, and
then finally grepping the result for just four or five desired items.
Even the best algorithms can’t prevent this from being slow. Happily,
simply reordering the operations and turning off the implicit sorts
turned a horrendously slow script into a decently fast one.

Once this kind of misbehavior has been rooted out, it’s time to look
at the algorithms used. Most problems can be solved a great many different
ways, each with its own performance characteristics. For any given
problem (sorting, say), each available algorithm has certain strengths
and weaknesses. For example, Quicksort does not need much extra memory
to operate, and usually performs quite well on random inputs.
Unfortunately, it’s an unstable sort, can degenerate badly in particular
cases, and isn’t well suited to sorting data in any form other than an
in-memory random-access array. Merge sort, on the other hand, usually
requires more memory and is a bit slower than Quicksort on random inputs,
but it’s a stable sort, won’t degenerate, and works well with
sequential-access data.
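To make "stable sort" concrete, here is a small top-down merge sort sketch in Python. It is a textbook illustration, not tuned code, and the `orders` sample data is invented. Note the extra lists it builds while merging: that is the additional memory mentioned above.

```python
def merge_sort(items, key=lambda x: x):
    """Top-down merge sort; stable because ties are taken from the left half first."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # "<=" rather than "<" keeps equal keys in their original order.
        if key(left[i]) <= key(right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two tails is non-empty
    merged.extend(right[j:])
    return merged


# Made-up records: (customer, quantity). Sort by quantity only.
orders = [("carol", 2), ("alice", 1), ("bob", 2), ("dave", 1)]
by_qty = merge_sort(orders, key=lambda o: o[1])
print(by_qty)
# → [('alice', 1), ('dave', 1), ('carol', 2), ('bob', 2)]
```

Records with the same quantity come out in the order they went in (alice before dave, carol before bob), which is exactly the property an unstable sort does not promise.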

Unfortunately, there are also algorithms in common use that are weak
from just about every angle, except that they’ve been around a long
time or are exceptionally easy to understand. In either case, these
poor algorithms still get used because they were the first thing that
came to the implementor’s mind.

The next step, then, is to determine whether the algorithm the program
uses is the best one for the situation. For this there is no substitute
for knowledge. The more algorithms you are aware of, the better the
chances that you will be able to determine a more suitable choice,
especially if you know the general performance features of each approach.

I’m not of course suggesting that you memorize your copy of
Donald E. Knuth’s The Art of Computer Programming. The detail you’ll
find in those volumes could easily make you an expert in the analysis
of algorithms, but that’s probably another case of trying to solve a
much larger problem than the one you actually face. For most purposes,
a book such as Robert Sedgewick’s Algorithms in C is probably
a better choice, assuming it hasn’t changed its style too much since
my dog-eared first edition came out. That book quickly became one of
my favorites because for every algorithm it laid out in plain wording
several key things: what its strengths and weaknesses were, how it
actually worked, and the standard optimizations to its basic design.

In some cases, an online reference such as Wikipedia can give you an
idea of your options, and in the best cases will even give you a
performance overview for various algorithms; more specialized sites
such as the NIST Dictionary of Algorithms and Data Structures can
point you to various implementations.

To understand these resources, you’ll need to understand big-O
(asymptotic performance) notation. Put simply, big-O notation tells you
broadly how an algorithm performs on large input sizes, generally
with a simple expression on one or more variables that define a
characteristic size for a problem. Typical examples are O(n³),
O(n log n), and O(m + n). By choosing suitable values for
the variables, you can compare the performance of several algorithms
in various situations.

There are two big gotchas, however. First, big-O notation ignores
constants; O(n) might seem to be always better than
O(n²), but if the actual run times are n seconds
and n² microseconds, then the better choice depends
on how big n is. This is particularly an issue for algorithms
that have the same big-O performance; in that case, constants make
all the difference no matter how big the input. Second, big-O notation
only gives information about asymptotic (large input) performance. For
small inputs, you can pretty much ignore big-O comparison, as startup
overhead and other details will often completely dominate runtime
performance.
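
To see how much the ignored constants can matter, here is a minimal
sketch with two hypothetical cost models: a linear algorithm that takes
n seconds and a quadratic one that takes n² microseconds. The function
names and the cost figures are assumptions chosen to make the point,
not measurements of any real code.

```python
def linear_seconds(n):
    """Hypothetical O(n) algorithm costing one second per item."""
    return float(n)

def quadratic_seconds(n):
    """Hypothetical O(n^2) algorithm costing one microsecond per pair."""
    return n * n * 1e-6

def faster(n):
    """Name the cheaper of the two hypothetical algorithms for input size n."""
    if quadratic_seconds(n) < linear_seconds(n):
        return "quadratic"
    return "linear"

for n in (1_000, 100_000, 10_000_000):
    print(n, faster(n), linear_seconds(n), quadratic_seconds(n))
```

Under these made-up costs the quadratic algorithm stays ahead until n
reaches about a million, which is larger than many real inputs ever get.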

Even with a good understanding of the use and limitations of the notation,
you’ll find that most of the time algorithms are discussed in a fairly
general sense, and often without reference to special situations that
may make one algorithm particularly appropriate. For example, efficient
sorting on a specialized, highly parallel GPU requires a different
algorithm than efficient sorting on a less parallel but more general
CPU. The unfortunate truth is that choosing a good algorithm requires
both general knowledge of available algorithms, and a good understanding
of how these are applied to your problem space, whether it be data
analysis, computer graphics, physical simulation, or what have you.

At this point, I could just suggest going off to read yet more books
and articles. You’d certainly learn a lot, and may even be able to
synthesize a few patterns out of the morass of data. Before you do
that, there’s that last big group of performance problems: things that
make code that’s great in theory perform horribly on real hardware.
Understanding those issues will make a world of difference in how you
approach all that reading, so they’re my subject for next week.

What’s the worst misuse of otherwise good code that you’ve seen?

Sid Steward

Related link: http://radar.oreilly.com/archives/2005/10/it_seems_to_be_working_for_jes.html

Reading Tim’s post today got me thinking and raised some questions. I love the idea of searching the full text of books, but I wonder about Google’s execution on Google Print Library.

Opt-In vs. Opt-Out

Tim says: “(publishers) no longer even know if they own the rights to be able to opt in.” Then: “Google’s opt out approach is the only way to cut the Gordian knot of forgotten rights and permissions.” To understand this, I had to dwell on exactly what opt-in and opt-out mean here.

Opt-in versus opt-out almost sounds like a silly technicality. In fact, the important issues are the assumptions behind these terms. “Opt-in” means that Google can’t index works without publishers’ permissions. “Opt-out” means that Google can index any books it wants under fair use. The opting-out part is just a courtesy extended by Google to publishers. And “Opt-out” sounds better than “SOL.”

After figuring this out, I understood Tim’s position better. Under the opt-in assumptions, he argues that even publishers sometimes have their hands tied. Under the opt-out assumptions, publishers, Google, and most anybody could index, well, anything they wanted.

I can also see how this battle is bigger than books. Copyrighted magazines, newspapers, audio and video would be open for indexing under these “opt-out” assumptions. And hold the opting-out part — that was just a temporary courtesy designed to lull publishers into a sense of control.

Where’s the Money?

As to why publishers are tussling with Google, egos and precedents might be involved, but we mustn’t overlook money. Google wouldn’t bother with this project if it wasn’t valuable to them somehow — remember, they must answer to their shareholders. Publishers probably want a piece of that action, and there’s nothing wrong with that. Google probably doesn’t want to pay them, so they pay lawyers instead.

Tim notes: “The works that are most in question are those that are likely unavailable except in libraries or used bookshops.” If that is so, then how could a publisher make money by allowing users to search books it can’t sell? The publisher asks: “where’s the money?” It must come from Google, since it won’t come from selling such books.

Imminent Domain for Intellectual Property?

Rights aside, there seems to be an interesting, underlying debate over the value of old, out of print, inaccessible books. Some seem to suggest they have no value since they are inaccessible. If they have no value, then why should Google want to index them? Because Google can unlock the value and turn it into profit. That’s good.

If it has potential for value, maybe Google should just pay the publishers fair value? Or maybe even allow publishers to share the risk and profit of this endeavor somehow? Such a move would fall on the better side of the good/evil spectrum.

OTOH, that could set a bad precedent for Google. When they go to index magazines, newspapers, video and audio, then all those publishers will want a piece of the action, too.

Disclaimer and Invitation

I must admit I haven’t followed the Google Print Library drama blow-by-blow, so some of my premises might be off. Please feel free to correct me.

1/30/06 Update

I cleaned this up a little, so the ‘imminent domain’ comments to this post now appear out of context.

Derek Sivers

Related link: http://www.google.com/search?q=html+fieldset

Are you being good and using <label> tags to match your <input> and <select> fields in forms?

Now what if you break up the date into separate year, month, and day selects? You don’t want THREE <label> tags - you really just want one - but its “for” attribute will not match the ID of any of the three selects. Broken HTML! Oh no! What to do? Use <fieldset> around all three selects and assign the fieldset the ID you need.

EXAMPLE:
<form action="something" method="post">
<label for="name">name</label>
<input type="text" id="name" name="custname" />
<label for="birthday">birthday</label>
<fieldset id="birthday">
<select name="bdyear"><option>…</option></select>
<select name="bdmonth"><option>…</option></select>
<select name="bdday"><option>…</option></select>
</fieldset>
<label for="sendit">then click…</label>
<input type="submit" id="sendit" name="submit" value="submit" />
</form>

Andy Oram

(Originally printed in the American Reporter)

It’s an unlikely matter for the United States and other nations to
lock horns over: the administration of names and numbers used to reach
Internet sites. But this seemingly trivial function is occupying a lot
of time among government representatives traveling from continent to
continent. A United Nations body wants to wrest power over these
things from their current master, the Internet Corporation For
Assigned Names and Numbers (ICANN). The United States says that with
ICANN in charge, things are running just fine (which they
aren’t). Many people condemn one side or the other for trying to carry
out a power grab, or call the engagement a lot of hot air.

But to me the first question to ask is: who called the
names-and-numbers issue one of “governance” in the first place? Why
did this ever become something for corporations, governments, and
international bodies to wrestle over? Why isn’t it going on quietly,
cheaply, and with universal acceptance in the background, like so many
other aspects of Internet operation?

The matter of Internet names and numbers is actually a sad and sordid
history that goes back over ten years. It never should have come to
this point, where it is wasting millions upon millions of dollars,
along with time and energy of some top Internet thinkers–and where
real initiatives that could improve Internet access get shoved to the
corners and left to gather dust.

Domain names get hot

ICANN has control over three types of addressing, sometimes called
resources. The very term “resources” biases its listeners right off
the start, because a resource is usually something valuable that’s
limited and needs to be managed and bargained over. Addressing is
being treated that way, but it didn’t have to be.

The first type of address is the number assigned to each computer for
the purposes of routing traffic. This is called the IP address (where
IP stands for “Internet Protocol”), and in its current form it tends
to be printed in four parts, such as 209.204.175.65 (the current
address of the site where you’re reading this,
www.american-reporter.com).

There used to be fears that we were running out of IP addresses, and
accusations that some sites were receiving more than they needed, were
hoarding them, etc. There may be pockets where scarcity exists, but to
address the problem (pun intended), the Internet body that designed
the IP address (and which has nothing to do with ICANN) created a new
version, IPv6, that vastly expands the number of available addresses.
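
To give a rough sense of the difference in scale, here is a small
sketch using Python’s standard ipaddress module. It treats the
four-part address quoted above as the single 32-bit number it really
is, and compares the size of the old address space with IPv6’s 128-bit
one. The variable names are mine, chosen for illustration.

```python
import ipaddress

# The four-part form is just a readable way to print one 32-bit number.
addr = ipaddress.ip_address("209.204.175.65")
as_number = int(addr)

# Total addresses available under each scheme.
ipv4_total = 2 ** 32     # IPv4 addresses are 32 bits long
ipv6_total = 2 ** 128    # IPv6 addresses are 128 bits long

print(f"{addr} as a plain integer: {as_number}")
print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses for every IPv4 address: {ipv6_total // ipv4_total:,}")
```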

Few sites have adopted IPv6, which requires enormous administrative
changes. One of the tragedies of ICANN’s existence is that it has
played no role promoting this conversion to new addressing, even
though as a central controlling organization for IP addresses, it
would have a natural role as evangelist for the change. Instead,
ICANN is mired in the other controversies described in this article,
along with its own bureaucratic bumbling.

A second addressing category involves a collection of potpourri called
“assigned numbers.”

But the towering controversy in ICANN is the third area of addressing,
the assignment of domain names. These are what let you request
www.american-reporter.com in your browser instead of 209.204.175.65,
and their presence is much appreciated. A lot of emotional baggage (as
well as commercial power, as we shall see) is loaded onto domain
names, but at base, they are simply another form of addressing.

In the mid-1990s a true governance issue struck the Internet: the U.S.
government opened it up for commercial use. Before then, it was
supposed to be used only for government, academic, and research
purposes. But now you could actually sell something over the Internet!
Driven by this new freedom and the recent invention of the graphical
web browser, thousands of companies poured their wares out before a
virtual marketplace.

Domain names were not designed for the new distribution of Internet
users. Just seven names existed at the top of the tree, such as .mil
for military use and .gov for government agencies. (A typical name is
whitehouse.gov.) These names–called top-level domains–were managed
by the U.S. government. Of course, other countries had come online by
then, but they had their own names, such as .fr for France. Only one
top-level name was allocated for commercial use–and for that reason,
the history of computing will always remember the mid-to-late 1990s
as the “dot-com” era.

Yes, having a name ending in .com became one of the most pressing
business requirements overnight. And businesses wanted a name that
reflected their trademark, such as ford.com or porsche.com.

Internet-savvy speculators bought up hundreds of thousands of famous
names with no goal in mind but to wait until the company with the
trademark came looking for it–and then they charged thousands or even
millions of dollars to make the transfer. This was simply the free
market at work, but the capitalists who got in line too late for their
favorite domain names steamed up and called it cyber-squatting.

There are many possible solutions to cyber-squatting. Companies could
choose a name that wasn’t taken yet, or just pay the squatter. But
quite large corporations started making ugly noises about possessing
trademarks and launched lawsuits to gain control over their own names.

At the same time, the company called Network Solutions that happened
to have gained control over handing out .com names–through a process
somewhat less orderly than a game of musical chairs–realized they
were sitting on the Fort Knox of Internet gold mines and started
charging money for a service that used to be free. The fee was nominal
by commercial standards, but high enough to make individuals think
twice about reserving names for themselves.

The clamor got worse and worse until a set of Internet public
activists suggested a conference pulling together everyone interested
in the domain name problem. This conference took place in July 1998 in
Virginia under the name Global Incorporation Alliance Workshop. The
name reflected a recent white paper from the U.S. Commerce Department
that indicated the government’s resolve to incorporate a new
corporation that would handle names and numbers. Computer
Professionals for Social Responsibility (CPSR) submitted a paper (to
which I contributed) called “Domain Name Resolutions.”

Amazingly enough, progress was made at the workshop. The many
contending stakeholders came to consensus on some key points that
would protect free expression and diversity among domain names. Had
negotiations proceeded in that direction, the whole issue of
allocating domain names might have been resolved harmoniously and the
world could have gone on to more weighty topics. Whether or not
everyone abided by the decisions, the results of the workshop would
have represented a powerful moral direction post pointing toward an
open Internet governed by consensus–had the results been the basis
for further developments.

The first power grab

It should be understood, before we go further, that pressure on the
.com space could be easily and immediately relieved by creating new
names, such as .biz. The allure of the “dot-com” name held
corporations back from endorsing this simple solution. But a more
sinister drive lies behind the domain name controversy.

Having established commercial beachheads on the Internet, corporations
wanted to own the whole terrain. Through the World Intellectual
Property Organization–an organization that makes international
policies regarding trademarks, copyright, and so on–they were
designing a new regime for handling domain names. It was nicely suited
to large corporations with the money to take out trademarks, litigate
disputes, and so forth, but was unfriendly to individuals or
organizations of limited means. For a variety of reasons, an
artificial scarcity served the purposes of some powerful institutions.

Within weeks of the successful conclusion of the Global Incorporation
Alliance Workshop, a lash-up of Internet leaders, Network Solutions,
and other back room forces popped a proposal of their own on a
surprised and unprepared Internet community. The proposal (which was
the second try for most of these actors, the first having collapsed as
a half-baked exercise) ultimately led to ICANN. Most stakeholders were
left out of the decision–even many large corporations were angry–but
the Commerce Department approved the proposal, happy to wash its hands
of the issue.

Or so they thought. ICANN was to come back and pain them year after
year–through lapses in following through on their requirements,
through financial problems, and through sheer all-around failure. At
least once, the renewal of ICANN’s contract was seriously imperiled.
Each time, perhaps grudgingly, the Commerce Department would give
ICANN a new lease on life. Yet now the U.S. government staunchly
defends the organization they castigated and threatened, when it comes
up against U.N. criticism.

And do you know what was the most absurd aspect of the whole domain
name mess? Within a couple years, search engines had progressed enough
that content could be easily located regardless of domain name.
Whether you are ford.com or porsche.com or american-reporter.com isn’t
very important anymore. The problem that caused ICANN to be created,
after so much Machiavellian manipulation, simply evaporated. And yet ICANN
existed, and exists to this day. Thus it provides a spur for the current
debate over “Internet governance.”

An international brouhaha

Two aspects of ICANN–IP addressing and assigned numbers–roll along
with hardly any discussion; the third aspect–domain names–could have
done the same if the trademark holders and World Intellectual Property
Organization and ICANN hadn’t raised such a stink over them. But now
that a locus of control has been established, everybody wants a piece
of it.

The debate currently goes on within the World Summit on the
Information Society (WSIS), a body set up by the United Nations in
December 2001 to make international policy regarding Internet access.
The summit approved a declaration of principles that is almost
completely laudable, stocked with such standard fare as bringing poor
people online and protecting freedom of speech.

But they have a bee in their bonnet concerning
ICANN. Certainly, it is controlled by the United States
government–which reneges on its duties by letting ICANN blunder about
so much–but the solution is not to bring it under U.N. control. The
solution is to hand all its powers over to leaner, more technically
focused groups that operate with less fuss and more consensus.

It is not clear whether we can go back to a golden age of a
technically run, frictionless Internet. There was a time when control
over the servers for domain names, along with the top-level allocation
of IP addresses and assigned numbers, rested in a single computer
scientist who had done early work on the Internet and was respected by
all, named Jon Postel. Clearly, this was not a sustainable solution
(ironically, Postel died tragically a few weeks after helping to
create ICANN), but the length of its successful operation shows that
basic solutions can be quite light-weight.

Domain names do raise policy concerns of a technical nature,
and various actors in the space can take action to improve them. The
two main issues are making sure the servers don’t go down or get
overloaded (technical robustness) and making sure nobody can spoof a
name in order to direct you to a fake site (technical security).
Providing names in every language, using character sets recognized by
each culture and country, is another technical issue.

But ICANN has done virtually nothing on the first two issues,
and acted very slowly on the third. Instead, it concerns itself with
policy issues that fall into two broad categories:

  • Scarcity: Users can’t get a name they want–one they
    consider a “good” name.

  • Squatting: Possession by one user of a name that another
    person or organization thinks is rightfully theirs.

The first problem can be solved by loosening rules for
top-level domains, so that more are created. To ensure that all users
can find all valid domain names, some central body is needed to create
and hand out control over new top-level domains. This could be done by
a small service center in a manner similar to how registrars hand out
names within each top-level domain.

The second problem can be solved by a policy that treats
domain names simply as references–like book titles, which are never
treated as trademark violations–and therefore things to which no one
has more right than anyone else.

What really needs governance?

Thanks to ICANN, WSIS is now worried about “governance.” In
the book Internet Governance: A Grand Collaboration (available
in PDF format)
contributor Wolfgang Kleinwächter says the term “Internet governance”
first appeared in a WSIS document in January 2002, and became a “hot
item” at a February 2003 conference. WSIS now devotes a clause to that
term in its
plan of action.

And now that governance is on the table in the form of the ICANN
debate, a number of other cards have been played by various
governments. Many are indeed pressing issues that can use the help of
governments and international agencies:

  • Spreading access to the disadvantaged, and providing content of value
    to them in their native languages.

  • Inequitable costs paid by underdeveloped countries to connect to the
    major countries that offer Internet service.

  • The need to guarantee business transactions online. Such issues
    include legal recognition of digital signatures, and adapting consumer
    protection laws so a person in Germany can buy goods from Mali.

  • Revising laws regarding speech for the Internet. Should bloggers, for
    instance, meet the same standards for accuracy as professional
    journalists?

  • Preparing police and courts internationally–both technically and
    legally–to prosecute someone from another continent who messes up
    your hard drive or tricks you into revealing sensitive information.

These issues were thrusting to the surface anyway, and had been
brought before many governments as well as international bodies such
as the Organization for Economic Co-operation and Development. But the
existence of the obtrusive, non-consensus-based, quasi-governmental
ICANN furnished an example of governance that countries unfortunately
treated with envy rather than repugnance.

Resources do require management. Our oceans are becoming polluted and
devoid of edible fish; our energy sources are running out; armaments
ranging from rifles to nuclear materials are traversing borders far
too freely.

But the Internet is not a resource as these are. It is a medium,
infinitely expandable. The U.N. can certainly help governments adapt
to its bounty, and its challenges. Well-placed funding for access,
content, and law enforcement is valuable.

Luckily, most of the actors (including the authors of the book
mentioned earlier in this article) give at least lip-service to
multilateralism and transparency. They recognize that Internet issues
require cooperation among many actors–rather like that Global
Incorporation Alliance Workshop back in 1998. In his contribution to
that book, William Drake (a colleague of mine in CPSR) calls for an
“integrative analysis” that would probably be a loose coalition of
stakeholders, rather than a centralized governing body.

By taking names and numbers off the bargaining table, we can free up
space for policy issues we really need to deal with.

Jonathan Wellons

Related link: http://www.nena9-1-1.org/Events/annualconference/longbeachphotos/Monday/National…

The top Google hit for “NORAD phone number”.

This document is two years old and may not be real, but it does contain lists of phone numbers (which I won’t reproduce here for ethical reasons) followed by the phrase “The above phone numbers are privileged phone numbers and should not be shared with the media or private citizens.” I will not share the phone numbers and if you choose to download the document, your actions are your own responsibility.

I heard about this from Jay Laney, a software developer at O’Reilly Network.

Jono Bacon

While sat here watching Ben Goodger doing a talk about Firefox at EuroOSCON, it got me thinking about this concept of taking a huge and bloated project (such as Netscape) and cutting it down to the core and releasing a spin-off project such as Firefox. With all of the recent discussion and email I have been receiving triggered from Opening the potential of OpenOffice.org, it makes sense if this process was drilled into OpenOffice.org.

Now, I understand that OpenOffice.org is a huge chunk of code, and the hackers behind it are working flat out to cut out the bloat and make it run faster with the current feature-set, but I get the impression that a lot of people will only use a subset of what OpenOffice.org provides and this could benefit from being Mozillarized (hey, it’s not a word, but we need a word to describe this process). As an example, I tend to use OpenOffice.org Writer for most of my word processing, but I rarely use certain portions of it, and much of the older functionality that looks a bit crusty around the edges, such as the 3D objects that look awful, could be happily junked in favour of better usability, more focused functionality and better performance.

This approach could be implemented in different ways. One argument is to take the pure Mozilla approach and single out a specific application and cut it down. The most notable application is probably OpenOffice.org Writer. I suspect that if you speak to most OpenOffice.org users, they will use Writer more than the other components. Another possibility is to single out each application and remove the ability to embed components inside other components. From my limited straw poll foo, it seems few people actually embed components at all. The reality seems to be that people use each component in a singular fashion, but appreciate the fact that the applications all use the same user interface and are considered part of a suite. This could be an interesting area to research.

Admittedly, the argument against this approach is that applications such as Abiword and Gnumeric already present cut down applications, but the problem is that there are subtle interface differences that make these applications feel less integrated in terms of the user experience. It is important to remember that integration is not just embedding components; the most fundamental integration is in the way in which similar options in different components are available in the same place. This is achieved in OpenOffice.org as the applications are part of a suite.

I think the first step in identifying if this process is possible is to determine how people use OpenOffice.org. What kind of features do you use? Which things are never used? Which things are confusing? Would you consider fewer features and better performance as preferable to the current OpenOffice.org? If we can answer these questions and get some definitive data about OpenOffice.org use from both techies and non-techies, I am convinced it can help the hackers behind OpenOffice.org create a better office suite. By all means, use the comments box on this article to share your experiences and research.

Do you think mozillarization is possible? What are your typical uses of OpenOffice.org, scribe it here…

Jeremy Jones

Python is not an FP language. It may have a quasi-FP construct here or there, but it is not FP.

So what? Python’s focus is not to be a pure anything language. Some would say that it isn’t pure object oriented, and they’re right, depending on whose definition of OO they use. It isn’t a pure imperative language, either, although it supports that model of programming pretty well. And to the lack of a pure OO or imperative style, I add another, “so what?”

Python has been heading away from a traditional FP model for some time now. The beginning of the shift probably occurred in Python 2.0 with the introduction of list comprehensions. In addition, PEP 3000 plans to cut some specific FP constructs such as the lambda expression and the map(), filter(), and reduce() functions.

Rather than adhering to a particular model of programming for its own sake, Python’s focus is to be a language that works. It doesn’t attempt to raise up any one programming paradigm as a shibboleth. It strives to be usable.

Removing the constructs mentioned above is a good example of this. Python programmers will use list comprehensions in place of some of the former, explicit FP constructs after they are removed from the language. When list comprehensions first came out, I didn’t begin using them right away. I still found myself doing map(lambda x: x * 2, some_list) rather than [x * 2 for x in some_list]. Granted, this is an extremely simple example, but the map(lambda) construct feels terribly odd now. The list comprehension feels more natural. I’m sure others will feel the opposite about this, but the list comprehension is more linearly readable.
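
For anyone who wants to see the two styles side by side, here is a minimal sketch. The list contents and variable names are made up for illustration, and the list() call around map() is only there because current Python 3 returns an iterator from map() rather than a list.

```python
some_list = [1, 2, 3, 4]

# The explicit FP style: map() plus a lambda.
doubled_with_map = list(map(lambda x: x * 2, some_list))

# The list comprehension that does the same job.
doubled_with_comprehension = [x * 2 for x in some_list]

# filter() plus a lambda, and the comprehension that replaces it.
odds_with_filter = list(filter(lambda x: x % 2 == 1, some_list))
odds_with_comprehension = [x for x in some_list if x % 2 == 1]

print(doubled_with_map == doubled_with_comprehension)
print(odds_with_filter == odds_with_comprehension)
```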

Guido, and I’m sure others, feel that the language will be more usable when some of these constructs are gone from the language. Maybe it isn’t totally about these constructs themselves being less usable than other constructs. Maybe it has something to do with a consistency within the language and some of these constructs just don’t fit. Whether the motivation is about the constructs themselves or an overall language cohesion, I believe the language will benefit from their removal.

In a way, even though I enjoy the FP model, I’m glad that Python is making a shift away from FP in favor of a more pragmatic Python. When religious issues become paramount to the Python community in general, and to the maintainers specifically, I will continue my programming pilgrimage with another language.

Related link: http://marc.theaimsgroup.com/?l=openbsd-misc&m=112962649127146

10 years ago today, Theo de Raadt started the OpenBSD project. 11 years ago today, Larry Wall released Perl 5.000. It would be difficult to overstate their contributions to software development, security, and the Internet in the intervening years. Happy anniversaries!

What’s the most important contribution of either project?

Jonathan Bruce

Related link: http://www.datadirect.com/company/news/press/pressitem/pressrelease_681083/index…

I am very pleased to announce a significant advancement for anyone who works with JBoss and is rightly considering how to connect to their data sources using best-of-breed JDBC drivers. I think John Goodson, VP for Product Operations and Marketing at DataDirect, sums it up best:

“Based on previous successes DataDirect Technologies and JBoss have enjoyed in several production environments, we are happy to formalize this relationship by certifying our drivers against JBoss middleware,” added Goodson. “This is only good news for JBoss customers who need world class access to commercial databases.”

Open source continues to gain traction, but interestingly a parallel demand grows ahead of this curve, driven by the need of developers and enterprise applications to access their data in the most optimized way possible. This announcement underscores how mixed-source will become and remain a compelling solution for a broad set of applications.

You can find more details about this and what it means for your development organization here.

John E. Simpson

With one eye dedicated (as is common here in Florida) to the summer’s march through the alphabet of named storms, and alternately shifting focus to one site or another tracking the current storm’s progress, I managed to overlook the latest trend in corrupt technology: splogging.

A splog, or spam blog, is a blog created and “maintained” (if that’s the word) automatically, by software. The purpose of splogs remains mired in uncertainty; two reasonable theories posit (a) that they’re good places to plant malware, the hook baited and waiting for some nincompoop of a site visitor to click on a link, and (b) that they’re a good way to get around search engine traps for link farms, or screw with the engines’ page-ranking mechanism.

Whatever the reason for their existence, splogs are yet another blight on the Internet landscape. In an entry posted yesterday, Mark Cuban reports:

…the Shit hit the fan today.

The blogosphere was hit by a blogspot.com splogbomb. Someone did the inevitable and wrote a script that created blog after blog and post after post.

Im not talking 100 blogs with a 100 posts each. Im talking what could easily turn into 10s of THOUSANDS of blogs pinging out millions of posts !

It’s not that hard to imagine a simple script to automate a blog’s setup and posting, especially (at least on Blogger) given the “click here, click there, enter a phrase in the other place” simplicity of the blogging interface. And it’s not that hard to imagine simple remedies, such as a required word-verification step in order to set up a blog in the first place, and to post an entry.

Now, I’m an optimist in most respects. The Internet has always seemed to me to be, by and large, a happy exception to Sturgeon’s Law — that 90% of everything is crap — and there are daily new signs of new wonderfulness (not excepting new blogs). (It’s true that, say, 50% of the Internet is crap, even by an optimist’s lights. But nowhere near the magical 90% threshold.)

The problem is that because of its popularity and ease (and low costs) of access, the Internet has also become a primo source of automated garbage.

What’s to be done about it? Three possible solutions:

  1. Ban commerce from the Internet altogether (commerce being a prime motivator for all this stuff); and/or
  2. Raise the bar to entry, to a point where it becomes economically infeasible to mass-market on the Internet without considering the millions of people who are not (or are only intermittently) interested in seeing the latest pitch by sleazoid marketers; and/or
  3. Always assume when you’re introducing a new technology — especially one with the promise of attracting lots and lots of visitors — that it will become a target for this crap. And having assumed that, plan for it.

Item 1 makes no practical sense at all (though it is a hypothetical solution).

Item 2 is unpalatable for a number of reasons, not the least of which is that the bar to entry is set so low: raising it would exclude not only scammers and vandals, but also millions of others.

As for item 3, of course, it provides no guarantees. It’s “just a guideline,” with no reasonable enforcement possible. On the other hand, developers — good ones — follow best practices in a thousand ways before introducing a new product. They test user and system interfaces. They test with good data and bad. They test performance under optimal and worse conditions. They plan for failure, in short. An expectation that an Internet-based product will be hijacked needs to be built into every Internet system’s QA process up-front, just as surely as general security, privacy, ease of use, and reliability.

But for God’s sake, do something to stem the tide. All the after-the-fact band-aids applied are just that — crappy temporary solutions which may (or may not) solve the main problem but introduce others: degraded performance, instability, and unusability.

If you, too, missed the splogging news, here are some links to recent sources:

Been burnt by a splog? As a developer, would you sign an “I plan for hijacking” pledge?

Sid Steward


Here’s an idea for rewarding web publishers and bloggers who link to your site. They include their AdSense google_ad_client ID in their link via GET. When a user clicks this link, you insert this ID into your own AdSense code block at serve time. If the user then clicks on one of your AdSense ads, the referring site gets credit.

You probably wouldn’t use their ID every time a web user follows their link to your site; maybe 50% of the time.
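The "maybe 50% of the time" rule can be sketched in a few lines. This is only an illustration of the selection step: the function name `pick_ad_client`, its parameters, and the `pub-...` IDs are made up for the example, and nothing here is part of any AdSense API.

```python
import random


def pick_ad_client(own_id, referrer_id=None, referrer_pct=50, rng=random):
    """Return the ad client ID to place in the ad block for one page view.

    own_id       -- the serving site's own ID (hypothetical value)
    referrer_id  -- ID passed in by the linking site via GET, if any
    referrer_pct -- share of page views (0-100) credited to the referrer
    rng          -- source of randomness; injectable for repeatable tests
    """
    if not referrer_id:
        # No referral ID in the URL: always use our own ID.
        return own_id
    # rng.random() is uniform in [0, 1), so this branch is taken
    # referrer_pct percent of the time.
    if rng.random() * 100 < referrer_pct:
        return referrer_id
    return own_id
```

Passing a seeded `random.Random` as `rng` makes the choice repeatable, which helps when checking that the split really is close to the intended percentage.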

At first blush, it seems like a natural idea. It would increase links to your site, and these linking publishers would get a tangible reward.

The Google AdSense Program Policies do object: “Any AdSense ad code or search box code must be pasted directly into Web pages without modification.” Yet I think this ‘referral rewards’ idea would benefit Google by encouraging folks who use AdSense to link to other AdSense users.

Maybe there are technical issues, too. The AdSense servers might get confused with rotating google_ad_client IDs popping up all over the web. Click fraud notices might start filling inboxes like spam.

Here’s one way this idea might be abused, depending on how smart the AdSense click fraud monitors are. A malicious robot visits a site that rewards referrals as described above. The robot uses a URL packed with its own AdSense ID, so this ID appears in the site’s AdSense block. The robot then ‘clicks’ an AdSense ad, which credits its own AdSense account. The robot does this repeatedly on several hundred sites. Would the AdSense fraud monitors fail to detect the click fraud because it is spread across so many sites? Would the sites themselves get blamed for the fraud?

It goes without saying that the best solution would be for Google to implement such a thing on their side. AdSense users could add a couple optional fields, such as:

google_ad_referrer = top.document.referrer;
google_ad_referrer_pct = 50;

and Google would give the referrer a 50% chance of getting credited for a click.

Maybe Google could pull the ID from the referrer URL, if it were packed via GET. Since the referrer probably had AdSense ads on it, Google might already have a handy cache that maps the referrer URL to its AdSense ID.

Think it would work?

Jeremy Jones


Related link: http://www.microsoft.com/downloads/details.aspx?FamilyID=c6a7fee3-6495-427f-8b1f…

Three weeks after releasing 0.9.2, the IronPython developers released 0.9.3. The most significant changes were the implementation of closures and refactoring of the name-tables.

It’s good to see this project make the consistent progress that it has.

Geoff Broadwell


In one of my OSCON blogs,
I objected to Damian Conway’s suggestion to convert
unless is_interesting to if not_interesting,
partially because making a not_interesting wrapper that does nothing
but negate the result of is_interesting is a performance drain
(an extra function call) that doesn’t produce any real benefit.

A couple weeks later, Josef Fortier called me on this, asking “Is perl’s
compiler pass not smart enough to optimize this away?” I began to answer
his question directly — “No, it’s not, and the performance difference can be
significant” — when it occurred to me that I should really be teaching
Josef to fish rather than making him dinner, so to speak.
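As a rough illustration of how one might measure the cost of such a wrapper rather than guess at it, here is a small sketch. It is a Python analog, not the Perl code under discussion; the function bodies are invented for the example, and the numbers it produces say nothing about how Perl itself behaves.

```python
import timeit


def is_interesting(n):
    # Stand-in predicate, invented for this example.
    return n % 3 == 0


def not_interesting(n):
    # Wrapper that only negates is_interesting: one extra call per use.
    return not is_interesting(n)


def count_boring_direct(values):
    # Negate at the call site.
    return sum(1 for v in values if not is_interesting(v))


def count_boring_wrapped(values):
    # Go through the negating wrapper instead.
    return sum(1 for v in values if not_interesting(v))


def time_both(values, number=50):
    """Return total seconds for `number` calls of each variant."""
    return {
        "direct": timeit.timeit(lambda: count_boring_direct(values), number=number),
        "wrapped": timeit.timeit(lambda: count_boring_wrapped(values), number=number),
    }
```

Both variants return the same count; only the timings from `time_both` differ, and by how much depends on the interpreter and the machine.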

An hour later I realized that teaching someone to fish in these particular
waters is not nearly as trivial as it might at first appear. Knowing when
and how to optimize code is actually something one has to build up with
time and experience. It helps, however, to have a few guidelines to save
you from sinking the first few hooks into nearby brush, underwater logs,
and the back of your neck. (Okay, okay, I’ll drop the fishing metaphor
. . . .) There’s a lot of ground to cover, so I’ve decided to break this up
into multiple blog entries.

First some overall goals:

  1. Make decisions based on knowledge, not guesswork.
  2. Delay optimization and pessimization decisions, and minimize the
    (user) code impact of changing your mind later.
  3. Don’t make it hard for others to make different decisions than
    you.

Let’s start with that first one. There are two types of knowledge I’m
referring to — knowledge that applies to optimization in general, and
knowledge that applies to a particular problem. I’ll talk about general
knowledge later. For now, think about a particular optimization problem
that you’ve faced, and imagine that we’re going to tackle it again.
The first step is to determine the optimization goal. In other
words, what are you trying to improve, and how much better does it need
to be? “Make it faster” is a common answer, but much too fuzzy — a better
goal might be “Reduce total run time to 5 minutes or less.” The following
are a few common things to optimize for:

  • Total run time
  • Response time
  • Latency
  • Concurrency
  • Throughput
  • CPU utilization
  • Memory usage
  • Limited passes over data
  • Real resource limitation
  • Artificial API limitation

And some common goals:

  • Keep within a hard limit
  • Usually stay within a soft limit
  • Make as small / large as possible
  • Be scalable (across data sets, across hardware, etc.)

For example, the control software for a
robotic car would
be optimized for response time, staying within hard limits; failure
to do so would result in disaster when the software spent too much time
thinking about how to negotiate a curve and the car ended up in a ditch.
A less obvious but no less crippling hard limit might be the maximum
number of concurrent open file handles your operating environment allows.
Latency jitter in a VoIP application is an example of a soft limit –
people will generally accept occasional breakups in a phone conversation,
but do it too often and your users will seek a different product.

What if the goal isn’t obvious? For a pretty large problem space, a
good starting goal is “Be scalable, but still try not to suck on the
low end.” Of course, you have to determine what scalable means in the
context of your application. If you are writing a network game server,
scalable may mean handling many concurrent users; or it may mean being
able to process very large or complex game worlds. In many cases,
optimizing both of these may be necessary.

That part about not sucking on the low end is actually important. Price
and openness notwithstanding, I doubt MySQL would have been able to
compete in its early days against vastly more scalable database engines
if not for the fact that it was very fast for simple tasks, and
those more scalable engines frankly sucked at doing the simple stuff.

A complement to knowing the optimization goal is knowing when the current
state is good enough. If the goal is to stay within a hard limit, the
answer is clear. For any of the fuzzier goals, you need to know when to
stop optimizing and go back to working on new features, writing more
documentation, or basking in your well-earned riches at a beach or ski
resort.

At this point, you’ve determined what you need to optimize for, what the
related goal is, and what constitutes “good enough”. The next step is
to find out where you currently stand, and that means profiling and
benchmarking. Profiling means instrumenting your application to
determine where and how various resources are spent, the most obvious of
course being time. Benchmarking, on the other hand, refers to
running a series of tests to determine how your application, or a
relevant but simpler piece of code, performs in various situations.

Profiling and benchmarking go hand in hand, and it’s pretty standard to
cycle back and forth several times while optimizing a system. For example,
imagine you’re working on (yet another) log analyzer app. You would start
by benchmarking the application against a number of logs of different
sizes, different content mixes, and so on. You might discover one
or more of the following:

  • The analyzer takes a long time to start up, but then it processes
    the logs pretty fast. This makes it appear much slower for
    small logs.
  • The analyzer is fast for ASCII logs, but is very slow for UTF-8
    logs.
  • The analyzer takes a lot of memory for big logs, and some logs are
    so large that the application crashes for lack of memory.

With this benchmark data, you can turn to profiling to determine what parts
of the application are responsible for the bad behavior. This generally
involves running another application called a profiler that
instruments your program with some additional code that reports on its
resource usage, and then runs your application as usual. When the main
application completes, the profiler (or perhaps a separate program
called a profile analyzer) produces a report indicating what
resources were used and by what parts of your code.
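To make the profiler workflow concrete, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The `read_log_file` and `analyze` functions are toy stand-ins invented for the example, and `cProfile` reports time and call counts only, so it covers just one of the resources mentioned above.

```python
import cProfile
import io
import pstats


def read_log_file(lines):
    """Toy stand-in: gathers every record in memory before any processing."""
    return [line.rstrip("\n") for line in lines]


def analyze(lines):
    """Toy analysis step: count records that mention ERROR."""
    records = read_log_file(lines)
    return sum(1 for record in records if "ERROR" in record)


def profile_report(func, *args, top=10):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


sample = ["ok\n", "ERROR disk full\n"] * 1000
errors, report = profile_report(analyze, sample)
```

Printing `report` shows one row per function with call counts and per-call and cumulative times, which is the kind of "what was used, and by what part of the code" summary described above.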

In the above example, the profiler might show that almost all
of the memory allocation in the program occurs in the read_log_file
routine. Looking at the code, you realize that read_log_file reads
the entire log file into memory all at once, and only after that happens
does any processing occur. What’s the best way to improve that? You
could change the code to read one line at a time, processing each one
individually. You might instead create a large buffer, and read the
log file in chunks sized to fit the buffer, alternating between reading
a few megabytes into the buffer and processing the buffer contents.
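A sketch of how the memory difference could be observed in Python, using the standard `tracemalloc` module. The two counting functions, the sample log format, and the helper names are all assumptions made for the example; `tracemalloc` only sees allocations made through Python's allocator.

```python
import os
import tempfile
import tracemalloc


def count_errors_whole_file(path):
    """Read the entire log into memory, then process it."""
    with open(path, encoding="utf-8") as fh:
        lines = fh.readlines()
    return sum(1 for line in lines if "ERROR" in line)


def count_errors_streaming(path):
    """Process one line at a time; no list of all lines is built."""
    with open(path, encoding="utf-8") as fh:
        return sum(1 for line in fh if "ERROR" in line)


def peak_memory(func, *args):
    """Return (result, peak bytes traced by tracemalloc while func ran)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def make_sample_log(n_lines=20000):
    """Write a synthetic log (every tenth line is an ERROR); return its path."""
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for i in range(n_lines):
            level = "ERROR" if i % 10 == 0 else "INFO"
            fh.write(f"2005-10-01 {level} request {i} handled\n")
    return path
```

Calling `peak_memory` on each function with the same sample file gives matching counts and two peak figures to compare; the whole-file version has to hold every line at once, so its peak grows with the size of the log.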

You know that both of these techniques will reduce the memory usage for
huge log files, and you may even have a hunch about which method would be
faster. Don’t follow that hunch; go back to benchmarking. Code up both
new techniques and try them. You may find that the performance difference
is insignificant, or that the one you thought would be faster is actually
slower — because your I/O libraries have well-optimized buffering even for
line-at-a-time usage, or because picking a certain buffer size allows you
to use a special, very fast memory-mapped file primitive. The balance could
even vary wildly depending on which operating system the log analyzer is
tested on. You won’t know until you benchmark.
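Here is one way the "code up both and try them" step might look in Python, using the standard `timeit` module. The function names and the 4 MB default chunk size are arbitrary choices for the sketch, and counting lines stands in for whatever per-record work a real analyzer does.

```python
import timeit


def count_lines_by_line(path):
    """Line-at-a-time: rely on the I/O library's own buffering."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)


def count_lines_by_chunk(path, chunk_size=4 * 1024 * 1024):
    """Chunked: read a fixed-size buffer, process it, repeat.

    Counts newline bytes, so it matches the line-based count only
    when the file ends with a newline.
    """
    count = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def compare(path, repeat=5):
    """Return the best-of-`repeat` wall-clock seconds for each strategy."""
    timings = {}
    for func in (count_lines_by_line, count_lines_by_chunk):
        runs = timeit.repeat(lambda: func(path), number=1, repeat=repeat)
        timings[func.__name__] = min(runs)
    return timings
```

Running `compare` against logs of several sizes, and on each operating system you care about, is what settles the question; which strategy wins is deliberately not asserted here.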

Along with the advice to profile and benchmark comes a caveat: it is very
easy to get the wrong answers through bad technique. Here are a few
guidelines that may help:

  • Use real (or at least statistically reasonable) data. Far too
    often benchmarking is done with “sample data” that bears little
    resemblance to real-world usage and in fact turns out to have
    a completely different performance profile.
  • Create many different test scenarios, with different data sets,
    concurrency levels, task sizes, data compositions, etc. Most
    applications perform better in certain situations than others,
    and it is all too easy to miss a bad case, such as the UTF-8
    decoding issue with our log analyzer.
  • Test on as many different platforms, and with as many different
    configurations, as you can. Every platform has weaknesses and
    strengths, based on the priorities of its designers. Some of
    these differences are large enough that it may be necessary to
    optimize your design differently on each platform; Apache 2
    provides knobs to accommodate different process management
    and IPC performance in different operating systems, for example.
  • Design test runs so that the overhead of profiling or benchmarking
    does not seriously skew the results. This usually means medium to
    large test scenarios for benchmarking, and not turning on all
    profiling options at once (which can bring your application to
    a crawl and chew up massive amounts of memory to store the
    profiling data). It is best to start with a high-level profile
    and then get successively deeper on just the areas and statistics
    you are interested in.
  • Perform many runs, and statistically merge the results. In the
    early days of single-task computing, it may have been possible to
    guarantee that no outside influences altered the results, allowing
    perfectly replicated performance results every time. These days,
    that’s just not going to happen. Too much is constantly happening
    in the background on every modern computer, and the best you can
    do is get a relatively good approximation of ideal behavior by
    collecting the results of many runs, and computing statistical
    measures such as mean and standard deviation of run time.
  • That said, clear out all of the background noise that you can.
    Close browser windows, turn off periodic mail checkers, close
    any music players, and so on. You may even want to isolate the
    test computers on their own network so that background network
    chatter does not skew your results.
  • Sometimes the only way to get repeatable results is to disable
    sources of randomness inside the application, using hardcoded
    values instead. For example, you might turn off AI code for a
    computer game and use previously recorded choices instead, or
    you might replace access to a remote database server with an embedded
    database such as SQLite. Do
    not be surprised, however, if this greatly changes your
    application’s performance profile. That’s acceptable if, for
    instance, you want to concentrate on the performance of your
    game’s graphics engine — but it’s a bad choice if you want to
    find the overall biggest CPU hog in the app.
  • It may be worthwhile to determine the minimum possible work
    that your application must do to solve a problem, and treat
    that as a baseline for further profiling work. For example,
    the log analyzer must at least read the entire log file once,
    and split it into records for analysis. How fast are those
    operations alone? You may find that your application’s
    performance is dominated by these minimum operations (because
    of slow disk performance, say), or you may find that the
    “extras” are turning a fast application into a slow one.
  • If your problem space has a known process to follow to find
    performance bottlenecks, use it. For example, when programming
    a GPU, there is a standard sequence of tests to determine whether
    the performance bottleneck is frame buffer bandwidth, CPU to GPU
    bandwidth, geometry processing, pixel processing, and so on.
  • If there’s no standard sequence to follow, try to create a set
    of test scenarios that will allow you to separate interrelated
    performance issues. Often the best way to do this is to vary
    one or two performance-affecting knobs at a time, keeping
    everything else constant. Once that piece of the performance
    profile is well understood, choose another knob to turn, and so
    on.
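The advice to perform many runs and merge the results statistically can be sketched as a small harness. This is a minimal example built on the standard `time` and `statistics` modules; the function name `benchmark`, the warm-up step, and the default run counts are assumptions, not a recommended methodology.

```python
import statistics
import time


def benchmark(func, runs=20, warmup=2):
    """Time func over several runs and summarize the samples.

    warmup runs are executed first and discarded, so one-time costs
    (caches, lazy imports) do not skew the samples. runs must be at
    least 2, because the sample standard deviation needs two points.
    """
    for _ in range(warmup):
        func()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
        "min": min(samples),
        "max": max(samples),
    }
```

A large standard deviation relative to the mean is a hint that background noise is leaking into the measurements and that the test machine needs to be quieted down before the numbers are trusted.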

Reams of benchmark and profile data in hand, you now have a good idea
how the application performs in a variety of scenarios,
and generally where the code has bottlenecks that determine its
overall performance. The next thing you need to determine is why
the code isn’t performing well. And that, my friends, is the subject of
next week’s entry.

What’s your favorite benchmarking/profiling tool?

Jeremy Jones


Related link: http://pyro.sourceforge.net/

I ran into a situation recently where I needed to pass “None” as a value to some code which runs under the SimpleXMLRPCServer in the Python standard library. I immediately got a Fault something like this:

<Fault 1: 'exceptions.TypeError:cannot marshal None unless allow_none is enabled'>

I tried turning on allow_none by putting something like this in my XMLRPC client:

xmlrpclib.ServerProxy("http://localhost:8000", allow_none=1)

I never quite got this working, but from googling around, it seems that the XMLRPC spec doesn’t have support for a None datatype. I anticipate that I won’t need “None” much, but when you need it, you need it. I figured I could kludge something, but why?
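For anyone who wants to see what `allow_none` actually changes, here is a small sketch of the marshalling step alone. It uses `xmlrpc.client`, the Python 3 name for `xmlrpclib`, and needs no server. The `<nil/>` element it produces is an extension rather than part of the XML-RPC spec, so both ends have to agree to it; in current Python versions `SimpleXMLRPCServer` accepts an `allow_none` argument as well. The helper name `marshal_params` is made up for the example.

```python
import xmlrpc.client


def marshal_params(params, allow_none):
    """Serialize a tuple of call parameters to XML-RPC wire format."""
    return xmlrpc.client.dumps(params, allow_none=allow_none)


# Without allow_none, None cannot be marshalled and a TypeError is raised.
try:
    marshal_params((None,), allow_none=False)
    refused = False
except TypeError:
    refused = True

# With allow_none, None is encoded as the non-standard <nil/> element...
payload = marshal_params((None,), allow_none=True)

# ...and decodes back to None on the receiving side.
decoded, method_name = xmlrpc.client.loads(payload)
```

This only shows the encoding; whether a given server and client pair round-trips `None` still depends on both sides having the option enabled.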

I had heard people speak highly about Pyro for a while on comp.lang.python, but I had been hesitant to try it. My hesitation came from satisfaction with XMLRPC and not wanting to have to include an additional package in my code when I went to deploy it. One point of hesitation had just been removed: XMLRPC just wasn’t working for me. And having an extra package turned out not to be as big a problem for me as I initially thought it would be.

I decided to give it a spin. It was really easy to just drop in a Pyro server where my SimpleXMLRPCServer had been. And it was another easy matter to replace the XMLRPC client code with Pyro client code. And “None” worked without a problem. I’ve been running it for a few months now and have had no problems with it.

I can enthusiastically recommend Pyro to anyone needing Python->Python interprocess communication.

Sid Steward


Related link: http://plasq.com/

Kid’s Programming Language (KPL)? How about a kid’s word processor! It writes comic books, of course. Great way to spice up the photo album. Yes, it exists (for the Mac).

After reading about the Kid’s Programming Language I visited its site. I wanted documentation, but couldn’t quickly find it (WTFM? Where’s the fine manual?).

I did get screenshots of games, though. Games created using KPL. Tying games (fun!) to education (institutional!) seems like a good idea. I’m not sure how necessary it is, but then I’m the kind of guy who uses emacs every day.

Anyhow, it got me thinking: how about a comic book word processor to encourage writing? With a quick search I found Comic Life for the Mac. Check out their gallery. According to Chris Pirillo they are porting it to XP.


Related link: http://rebelstar.namco.com/

I’ve replayed some of my favorite retro games recently, whether on emulators, new virtual machines, or those n-in-1 devices that hook up to the TV. Some are still classics and others don’t fare as well.

Whether it’s experience, higher level languages, or the steady march of technological progress, games such as Pac-Man don’t seem so complex anymore. A good programmer could probably reimplement most of it in a day (although there are tricks, such as making each ghost move at a slightly different speed). Part of that is also having much better libraries and tools — having to reimplement Inform every time someone wants to write a text adventure or writing low-level double buffering code in assembly is tedious and prone to bugs and gets in the way of the actual goal.

In theory, 3D makes games easier — if you have sufficient hardware or software support to handle the increased requirements to lay out and render a scene efficiently, you don’t have to worry about drawing each sprite from the appropriate angle. Sure, there are ways to cheat on this, but coming up with workable, viewable animations is much easier when you can describe them algorithmically and modify an existing model without having to prerender each possible change beforehand (let alone store, manage, and load all of the appropriate assets).

In practice, the closer 3D games get to the uncanny valley, the more detail the models and textures need, the more power the hardware needs, and the more complex the software gets as it tries to keep up with all of the nice effects and the increased memory and disk requirements.

Maybe there’s a sweet spot where going full 3D really makes development easier. I would hate to program an isometric game that needs a rotatable camera without a 3D engine, for example. Could I write my own tile-based game by myself in a couple of weeks and make it actually fun and playable? In a month or six weeks could I have a release-quality game?

If so, maybe that’s an indictment of the “make things easier by making them more complex” idea that seems to afflict so many software projects — not just games. If a hundred hours of my work give people ten hours of fun apiece, is that better than a hundred thousand hours of a professional game studio’s work giving people twenty hours of fun? Is Half-Life 2 really several millions of dollars more fun than Rebelstar: TC?

(I could argue contrarily that HL 2 is foremost an engine demo, secondarily the test of a new distribution system, and finally a game, but that’s a different discussion altogether.)

So many large, complex projects aren’t successful. Maybe smaller, simpler projects will really capture the essence of the solution in timely, cost-effective ways.

Everything I know about software development I learned from video games. Well, several things anyway.

Ming Chow


I am volunteering as a representative at the Computer Science networking table at the Tufts University Career Fair tomorrow. In addition to meeting the assorted employers and graduate school representatives, students will have the opportunity to speak informally with Tufts alumni, and alumni enjoy the opportunity to share their expertise with students as well. Note that the goal for alumni at the fair is to provide advice and information only, not jobs (sorry, I don’t have any of those to give out anyway). I will talk about my experiences on the job and how I got to where I am today, discuss the skills required to succeed, and offer advice. Here are some insights that I will certainly give to students tomorrow:

  • It is critical to keep developing professionally and technically. You should invest in every development opportunity, whether through work or at your own expense (in both money and time). In an evolving and erratic IT climate, it is important for the business, and for you, to stay ahead of the game.
  • You must understand your workplace’s business and business logic. You can hack up the nicest looking GUI in applications, but it is completely useless if it doesn’t fit your business goals and needs.
  • There are still many great CS-related jobs available in the US, and the notion of all IT jobs going to India is not true. In fact, this country needs you. If you are a good programmer, developer, specialist, etc., there will always be room for you.
  • Some areas of IT can be thankless. As Coach John Wooden once said, never let criticism, or praise, get to you…
  • …which leads to the importance of soft skills. Chances are, most of the people you work with will be non-technical. You need to learn how to communicate with users effectively (which may take many iterations; it is not easy, believe me). The more effectively you communicate with users, the better it is for everyone else, and the more visible your group will be.
  • Start out small, and work your way up. This may be a hard thing to accept, but it makes your job that much easier. For example, are you going to be a manager without any past experience working with customers? For future developers, learning QA first will help you understand the problems that can occur, and how not to make the same mistakes in your own work.

I would welcome any comments and insights, including your own experiences. Thanks for your help.

So what other advice would you give new Computer Science job seekers?

Andy Lester


Related link: http://www.perlcast.com/audio/Perlcast_Interview_011_Lester.mp3

I was recently interviewed by Josh McAdams of perlcast.com. In the interview we discuss The Perl Foundation, WWW::Mechanize, the nature of open source communities, and Google’s Summer Of Code project. Give it a listen!

What did you think?

Andy Oram


Related link: http://live.gnome.org/Boston2005

If you’re near Boston, Mass. and want to find out the development
plans and design issues for the
GNOME desktop,
or are just curious to see an energetic collection of software
developers from around the world interacting, head on down from now
through Monday to the
GNOME summit
at
MIT’s famous Stata Center.
Over one hundred people showed up for today’s morning presentation,
and nearly every one was a developer for GNOME or a related
technology: X, Linux, or a desktop application.

The conference has an exceedingly open format, with a few rooms and
times dedicated to broadly defined topics, other rooms set aside for
communal hacking, and no official speakers.

There are a few major issues creating a draw this year (it’s the
fourth year for the summit): performance, printing, and use of the
HAL/D-BUS stack for plug-and-play notifications. But people come
basically to find other people working on the areas that interact with
their own.

Corporations have sent lots of programmers, but they come representing
themselves, with goals related to their projects, more than as
representatives of their management. As you might expect, Novell and
Red Hat staff are ubiquitous. Other vendors such as IBM have come too,
and Intel sponsored trips by several KDE developers who will discuss
how to get GNOME and KDE applications to work well with each other’s
desktops. The big surprise of the day, for me, was the heavy presence
of people from Nokia, who have used Linux in some of their equipment
and are altering GNOME to make it work better within a cell phone’s
memory, CPU, and display requirements.

I talked to a Nokia developer about why they were using GNOME. He said
they had developed their own GUI with their hardware in mind, but
wanted something that could evolve faster and had a large developer
community; they calculated that in the long run they could get a
better product by compromising their requirements and getting GNOME to
fit. They chose GNOME over KDE because they figured they’d have an
easier time getting their extensive changes approved by the relatively
vendor-independent GNOME developers than they would dealing with
Trolltech, although they have nothing against Qt and
Qtopia.

I also talked to a developer from Sun who ports GNOME to their
products, a reminder that GNOME and KDE work on other operating
systems besides Linux.

Watching this motley crew make connections and bring their concerns
to each other’s attention is an interesting anthropological
experience. Half the attendees are from outside North America, and
half have never been to a GNOME summit before. Working on software
also requires work to come together as a community. It makes me feel
justified to put in a plug for a book we just released, written
by software
management veteran Karl Fogel:
Producing Open Source Software.

Jeremy Jones


Related link: http://www.cherrypy.org/

The release email stated that if all goes well, we could expect a final release next week.

Sid Steward


Related link: http://www.onlamp.com/pub/a/onlamp/2005/09/22/gpl3.html

So many web services use free software, yet the services themselves, their organization and management, are closed. When will we see a web service that is literally owned by its users, like a cooperative? Or is one already flourishing somewhere?

As a capitalist, I can go buy shares of Yahoo! or Google and feel confident their management is working for me. But as a user, I can’t be as certain. Management will serve me, the user, to the extent that it also serves their shareholders. Yes, they’re doing a good job serving users today. But will they do a good job for the next 20 years? How many times must I switch services during my lifetime?

It just seems to me that as these services integrate more closely into our lives, the need for a stronger social contract increases. This isn’t just about privacy, but also reliability and community. It would feel good to know that the service’s management is working harder for the user than the shareholder (crazy!). In a cooperative, the user is the shareholder, so no worries.


Related link: http://www.jamesshore.com/Blog/XP-Designs-All-the-Time.html

The big problem with up-front design is accuracy, especially as the time between gathering the requirements for and actually implementing a feature increases.

The main thesis of my book (Extreme Programming Pocket Guide) is that learning from a very short feedback loop is the key to effective software development. In terms of design, test-driven development (or test-driven design, as Jim might prefer) helps design in the small by enforcing simplicity and efficacy. Refactoring helps design in the small and the large by further simplifying the code just written and the entire project based on the code just written. Frequent delivery to the customer helps design in the large by soliciting feedback on the features of the software.

If you’re reviewing the design of the immediate code every few minutes and the relationship of that code to the rest of the project every couple of hours and the suitability of the entire codebase to the customer every couple of weeks, how are you not designing all the time?

Now, saying that you use TDD, program in pairs, refactor mercilessly, release software to the customer every couple of weeks, and listen to his or her feedback is easy. Acting on it and learning from it isn’t, but that’s a subject for another weblog.

Still, if you’re supremely confident that your initial requirements are clear, complete, correct, and changeless and that you won’t run into any surprises that make you wish you could do things slightly differently, you may not need the kind of agility that you get from really asking “Are we still on track? Should we make a change right now to improve our work?” That’s fine.

How often do things change in your plans, designs, or specs?

Andy Oram


The mainstream media and the blogosphere alike have been buzzing for a
couple weeks over the question of why Google would want to invest in a
municipal network. Some get the point: you need infrastructure to
offer services. The offer of Wi-Fi access to the San Francisco Bay
Area is a generous gesture prodding our country to invest more in
communications infrastructure. For similar reasons, any company that
provides conferences and trade shows–as O’Reilly Media does–should
be interested in high-speed Internet service.

Brick-and-mortar conferences depend on low gas prices. They can’t
prosper unless hundreds of high-profile speakers can afford to fly in
from around the world, and often the attendees and staff need to come
from long distances too.

But gas prices are inching up, and are likely to leap as time goes
on. The movement will be intermittent, of course–as I write this, the
news is that crude oil has hit a two-month low (big deal)–but the
trend is obvious and irreversible. Even if we build a hundred new
refineries and drain Nigeria dry, we can’t keep up with demand. A
practical hydrogen-based car may be developed, but a comparable
technology for an airplane is totally unimaginable.

So what can conference organizers do about the future? It’s time to
start offering high-quality video conferences. The online medium must
offer an experience so vivid and natural-seeming that people can enjoy
hanging out and feel as comfortable before the screen as they would in
an easy chair at a conference center. It may take some training and
practice, but with high-speed connections and appropriate software we
can get there. Many organizations already have webcasts, but an online
conference would be a much richer affair, involving really
multidirectional sharing of ideas in a relaxed setting among a couple
dozen participants.

The scale of online conferencing would probably be very different from
conferences that are physically hosted. Perhaps they’d be
shorter–after all, there’s a limit to how long someone can sit in
front of a screen. And they might be smaller too. We could end up with
enormous numbers of small, time-critical teleconferences, all offered
to the public.

Conferences could be called spontaneously when new developments hit a
field; the big draw to an online conference might be its immediate
response to some pressing matter. A week later, and the issue is
stale. Ten thousand commentators have had a chance to comment in
public; who needs a conference? By these criteria, a conference right
now on the Google Wi-Fi offer would be absurdly late. But if I heard
of such a major development today, along with an invitation to an
online conference, I might sign up.

How would conference organizers make money? The business model is
completely open to experiment. Because a conference works best in an
intimate setting, organizers could bank on scarcity: that is, they
could sign up big names to interact with online conference attendees
and charge a fee for such privileged access. People might also pay
simply for advance notice of a teleconference. And sponsors could be
tapped, as they are today for conventional conferences. After all,
costs would be quite low. Perhaps the speakers could make some real
money!

But my delightful dream is evaporating now, as I wake up to the
realization that hardly anybody has better than 256 kilobits per
second of bandwidth upstream at home, and typical T1 lines at work are
also stressed when providing high-quality interactive video. Some
universities have invested in enormous data pipes, and they are a
natural starting point for the online conferencing movement. A few
years ago, I participated in an online video conference over the
Internet2 experimental network, some of whose nodes benefit from the
Abilene backbone, which offers bandwidth of a gigabit per second or
higher. Unfortunately, all I did was talk and exit; the
videoconferencing wasn’t
multidirectional, and the only interactive element was a simultaneous
chat session that was hard to monitor during my talk.

So perhaps conference organizers should start imitating Google and
handing out high-bandwidth Internet connections before the price of
gas turns air travel into a rare luxury.

Jeremy Jones


Related link: http://pyro.sourceforge.net/

I gave Pyro a spin a month or two back and was really impressed with how easy it was to use. Also, though, I had the sense that there was so much more to it that I could have used but chose not to. I have been extremely happy with it thus far. I use it nearly every day and never have to think about it. That’s the kind of library I like. Excellent work, Irmen!


Related link: http://www.advogato.org/person/bwh/diary.html?start=38

Bryce Harrington, Inkscape and Worldforge hacker (as well as super-friendly OSDL guy), recently opined that gaining lots and lots of users isn’t the only — or even the best — gauge of success for an open source project. Just as not every successful business has to become a multinational corporate behemoth, a project that effectively solves a problem only a dozen people have can still be a success. (“Unrestrained, explosive growth is great!” is one of the Myths Open Source Developers Tell Ourselves.)

What’s your project’s measure of success?

Sid Steward


Related link: http://online.wsj.com/article_email/SB112812225353857287-lMyQjAxMDE1MjA4MzEwMjMy…

Here are some bits lifted from recent WSJs (emphasis mine):

Digital Music’s Surging Growth Offsets CD Slump

The digital music market has more than tripled in a year as sales of CDs and other physical formats decline. Digital music sales totaled $790 million in the first half of this year, which is 6% of industry sales.

Apple iTunes accounts for 82% of legal downloads in the U.S.

In Shakeup, Disney Rethinks How It Reaches Audiences

Robert Iger, the new Disney CEO, wants Disney to move into new delivery media. “If we sit back and rely on old technology, the consumer is going to pass us by,” he says, noting that the music industry made that mistake.

"Among the ideas being discussed at Disney’s headquarters: Selling hit television shows online, perhaps for use on portable devices. That may include providing shows for devices like an iPod that would play video–something that Apple Computer Inc. could announce by the end of the year, people familiar with the situation say."

Jeremy Jones


There are a number of concurrency models available for Python, both in the standard library and in some home-grown solutions. Standard threading, which uses the OS’s threading library, is perhaps the most common. Select-based concurrency, such as is used by Twisted, is also quite popular in the Python community. Generator-based “threads”, such as those described by David Mertz, are another mechanism for supporting concurrent tasks in Python. A problem is that none of these methods currently scales across multiple CPUs to take full advantage of them. This isn’t as much of a problem for IO-bound processes (such as network applications) as it is for CPU-bound applications, so a select-based concurrency model does have an advantage there. The only options currently available (that I know of) which can take full advantage of multiple CPUs involve multiple processes, either forking or using shared memory…or both. I’m sure this approach works very well in a number of situations, but it just feels like a mess. I have a hard time attaching the words “elegant” and “Pythonic” to “forked processes” and “shared memory”.
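To make the generator-based “threads” idea concrete, here is a minimal sketch of a round-robin scheduler. It is an illustration only, not David Mertz’s code: the names `worker` and `run_round_robin` are made up for this example. Each task is a generator that yields whenever it is willing to give up control.

```python
from collections import deque


def worker(name, steps, log):
    """A cooperative task: do one unit of work, then yield control."""
    for i in range(steps):
        log.append((name, i))
        yield  # hand control back to the scheduler


def run_round_robin(tasks):
    """Run generator-based tasks in turn until all of them finish."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # resume the task until its next yield
        except StopIteration:
            continue            # task finished; do not reschedule it
        ready.append(task)      # task still has work; back of the line


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
print(log)
```

Everything here runs in a single OS thread, so tasks interleave but never run in parallel, which is exactly the multi-CPU limitation described above.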

CPython comes equipped with the global interpreter lock (GIL), which allows only one thread to execute Python bytecode at a time, regardless of how many threads may be running in a given Python process. This is by design and, from what I gather, is a protection mechanism which keeps the internals of the Python interpreter (and I assume running code) from being mangled by threads accessing the same spots of memory. The end result is that a single CPython process of threaded code will not fully utilize more than a single processor in a single system. This means that, all things being equal, a single process, even threaded, will run no faster on a 128-CPU machine than it will on a single-CPU machine.
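A rough way to see this for yourself is to time the same pure-Python, CPU-bound loop run back to back and then run in threads. The sketch below is an assumption-laden illustration (the function names and the loop size are arbitrary); on a standard GIL-enabled CPython build the threaded run is typically no faster than the sequential one, but the exact numbers depend on the machine and interpreter, so none are shown.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; no I/O, so the GIL is the bottleneck."""
    while n > 0:
        n -= 1


def run_sequential(n, tasks):
    """Run the loop `tasks` times, one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(tasks):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, tasks):
    """Run the loop in `tasks` threads at once; return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(tasks)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


print("sequential: %.3fs" % run_sequential(200_000, 2))
print("threaded:   %.3fs" % run_threaded(200_000, 2))
```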

There has been some talk recently about a more scalable (and more Pythonic) concurrency model. Bruce Eckel started a discussion around this topic the other day on the python-dev mailing list. There doesn’t appear to be a consensus just yet on the exact approach, but some good ideas floated around for a bit. Unfortunately, the discussion appears to be done - prematurely by my estimation. I’d love to see a PEP come out of this, though. There are so many sticking points, both technically and ideologically, that it will take some time to formulate a PEP that will gain general acceptance. This sounds like a case where some really bright person (of whom there are plenty in the Python community and on the python-dev list specifically) needs to just write a PEP that isn’t too strongly hated by any one side and let it get BDFLed into existence.

I’m no language writing expert and the only experience I’ve had with concurrent programming has been with Python, so I’m sure there are nuances of lower level concurrency that I’m missing, but I am formulating in my mind what kind of concurrency model I’d like to see. I liked the idea of each task creating another Python interpreter instance in the Python process. Why not just spawn a new process? It seems like that just makes it a bit harder to share information between the starting task and the started task. Of course, you want the ability to share information, but you don’t want too much shared. Another idea that I liked was a queue-like interface between the starting task and the started task. The starting task should have to explicitly pass in the specific pieces of information it wants the started task to work on or have available to it. The starting task should have the ability to query the started task and find out if it’s working on the task or if it’s done. Now, if the started task needs to return something to the starting task, how does it do it? I don’t know. I really don’t like the thought of the starting task polling a queue to see if there is anything in there. What if the started task isn’t intended to return anything? I know, you can set flags when starting it……. You can’t really make its “run” method return anything or you would block until it finished, which is self-defeating. I’m sure one or more of the pythonian intelligentsia will come up with something brilliant. It will probably look nothing like what I’ve described and I’m sure I’ll love it and think that it is better than I could have imagined. I would just like to see it happen.
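For what it’s worth, the queue-like interface can at least be mocked up with today’s standard library. The sketch below uses `threading` and `queue`, so it does nothing for multiple CPUs; it only illustrates the shape of the interface. The names `started_task` and `run_starting_task` and the `None` sentinel are assumptions made for this example, not part of any proposal on python-dev.

```python
import queue
import threading


def started_task(inbox, outbox):
    """Work only on what the starting task explicitly passed in."""
    while True:
        item = inbox.get()      # blocks until the starter sends something
        if item is None:        # sentinel: the starter has no more work
            break
        outbox.put(item * item)


def run_starting_task(numbers):
    """Start a task, hand it its inputs explicitly, and collect the results."""
    inbox = queue.Queue()
    outbox = queue.Queue()
    task = threading.Thread(target=started_task, args=(inbox, outbox))
    task.start()
    for n in numbers:
        inbox.put(n)
    inbox.put(None)
    # A blocking get() waits for each result, so the starter never polls.
    results = [outbox.get() for _ in numbers]
    task.join()                 # task.is_alive() would answer "is it done?"
    return results


print(run_starting_task([1, 2, 3]))
```

The two tasks share nothing except the two queues, which matches the wish to share some information but not too much.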

You may think that I presupposed that Python needs a new concurrency scheme. Well, maybe that’ll be a discussion for another day.

John E. Simpson


Call me a Luddite, but this strikes me as a ghastly addition to the Web’s arsenal. (Link via Slashdot, which is probably why — right now — trying to see anything beyond that ridiculous home page times out.)

Because I can’t see anything else about it, I can’t comment specifically on any Flock “features.” (If you likewise are locked out, you might get a good sense of what to expect from the Slashdot comments.) In a way, though, I don’t need to see anything specific: the whole idea stinks.

Yes, competition is important. Yes, the more players in any given arena, the more “interesting” the subsequent face-offs. No, I don’t believe Firefox or (God knows) IE or Opera have solved all the problems of Web browsing. But I can’t help thinking, “What are the Flock developers thinking?” A clue is offered by the Business Week article referenced in the Slashdot post:

Flock’s browser is built specifically for a new, emerging generation of Web users, one that isn’t satisfied passively browsing media online.

Flock hopes to turn the browser into a dashboard for collaborating, blogging, sharing photos, reveling in a raft of other group activities that have recently caught fire online.

Among the Flock design goals, evidently, are these:

  • Simplified blogging
  • Simplified del.icio.us bookmarking
  • “…serve less as a window into static Web content than as a customizable conduit for participatory Web services, from Flickr to del.icio.us to the collaborative online encyclopedia Wikipedia.”

Business Week also reports:

Even in raw test mode, Flock and its blogging tools in particular are drawing rave reviews from tech-savvy users. “Pure magic,” says J. Michael Arrington, general partner at Archimedes Ventures, who co-writes the blog TechCrunch. “It’s a beautiful application, and they’re a bunch of smart guys.” Even Robert Scoble, Microsoft’s most famous blogger, has called the Flock browser “awesome.”

None of which, per se, renders the whole flocking thing uninteresting. No, my objection is more along the lines of: does the world really need another browser?

Especially objectionable, in my eyes, is that key members of the Flock team evidently participated in Firefox’s development. So, then, uh, why not extend Firefox? Why make it likely that users are going to get even further confused by the browser choices available to them?

And why — please, please, why — would these people build a browser (and a Web site) that apparently does not feature standards compliance as a critical driving force?

So, am I nuts?

Geoff Broadwell


Related link: http://www.pugscode.org/

Every project has a set of goals that guide it through the meandering
path of development. For some projects, these goals are unspoken, seen
only in the primary style of the code, or in the size and shape of its
APIs. When Autrijus Tang
started the Pugs project to create
a Perl 6 compiler, he had an explicit goal: optimize for fun.
Fondly referred to as -Ofun — a typical compiler writer’s joke,
referring to the standard -O flag used to tell a compiler what
its primary optimization goal should be — optimizing for fun is probably
the most important decision Autrijus made.

Optimizing for fun has had tremendous benefits. In just 8 months, the
Pugs project has gained well over 100 committers, averaging about 30
commits a day for the life of the project. Unlike many projects, these
commits do not all come from a handful of people. In fact, the 3 busiest
developers can only claim about half the commits; the rest are well spread,
with 50% of the developers able to claim 9 or more, 25% having 24 or more,
and 10% having over 150 commits each!

The team is not just productive, it’s also creative. Starting
with just a single interpreted backend written in Haskell, Pugs has added
compiled backends for JavaScript, Parrot, and Perl 5. Dozens of modules
have been written or ported, ranging from encryption algorithms to IRC bots.
Various developers have experimented with concepts ranging from continuations
and coroutines to self-referential preludes and efficient type inference,
with working code often leading the official specs.

Of course, this should come as no surprise. As any cognitive science expert
will tell you, fun
is a great way to focus the mind. Developers that aren’t enjoying
themselves will slow down, write buggy code, make poor decisions, and
eventually leave the project (even one that pays). Conversely, rampant
fun will bring coders in droves, and give them a passion for their work
that shows in quality, quantity, and goodwill. It’s a pretty good bet
that optimizing for fun will produce a better product than almost any
other method.

So what’s Autrijus’s secret for -Ofun? As he puts it, “the essence
of fun boils down to instant gratification and a sense of wonder and
discovery.” Or as chromatic calls it, imagineering. It turns out there’s
quite a bit that goes into that:

  • Make -Ofun your primary goal (there can be only one).
    Next time you’re forced to come up with a vision or mission
    statement, try that one on for size. (If management agrees,
    you’ve chosen a good place to work.) Every other goal chosen
    for the project should either flow from that one, or be secondary
    to it.
  • Use modern, decentralized version control. If you’re not already
    using a version control system, shame on you. If you are still
    using an old system such as CVS, RCS, or SourceSafe, you’re really
    missing out. Modern systems offer atomic changesets (so all
    edits relating to a single conceptual change can be captured
    together), full versioning of directories and symbolic links
    (so that files can be moved, copied, or renamed and still
    maintain full history), fast tags and branches, and more. Most
    important, modern version control systems offer decentralized,
    offline operation. Every developer can keep a local copy of the
    repository on their laptop, editing and committing locally
    to their heart’s content, even when network access is unavailable.
    When ready for a merge, the developer can push changes to other
    developers or to a central “master” repository. Some systems, such as
    darcs and the git family, are decentralized at their core; the
    excellent SVK client layers decentralization on top of a modern
    centralized system, Subversion.
  • Embrace anarchy. One of the key realizations of modern Internet
    projects (the oft-quoted
    Web 2.0) is that on the whole, your users can be
    trusted. The key is that the users also need to have the tools
    needed to repair any damage the tiny minority may cause. For a
    development project, modern version control systems can give you
    “anarchy with an audit trail”. If something does go wrong
    (intentionally or more likely accidentally), it’s easy for any
    other developer to identify and fix or revert the problem.
    Having this safety net allows the project to run full-bore without
    time-wasting process getting in the way, and without undue worry
    that code quality will suffer.
  • Avoid deadlocks. There should be nothing blocking a programmer
    from committing his code. Mandatory reviews (or even
    acknowledgement) before commit are often used to work around
    failures in tools or project structure. For example, before
    atomic changesets and quality merge tools, it was extremely
    difficult to roll back a single change made at some point in the
    past; now it is much easier to do so. And without a proper test
    suite, it’s hard to tell if a change broke the code in the first
    place. Any review process gets in the way of instant gratification
    (a key part of fun), and turns reviewers into bottlenecks — if
    they are too busy, the project may slow to a halt until they are
    free. Worse yet, a “bus error” (a key person being hit by a bus)
    may stall the project for good.
  • Cast committer rights far and wide. A central core committer
    group is necessarily slower than allowing every developer to commit
    as desired. The more committers a project can gain, the faster
    the project can go, and the more ideas and group wisdom it can
    incorporate. Autrijus scans a number of technical groups
    2-3 times a day trying to hand out the committer bit, responding to
    people’s musings, and generally spreading awareness. It’s important
    also to make committer sign-up fast and easy. Autrijus hacked a
    quick invitation interface into rt.openfoundry.org so that new
    committers could be invited en masse and sign up on their
    own without having to wait for an admin to fiddle with configs.
    This helps to make sure that people don’t fall out of their
    interest window — even the most casual contributors who just won’t
    wait for any manual process. If invitation isn’t completely
    automated (as for example with a wiki), make sure many people
    in different timezones have admin rights to invite a new committer,
    and pay attention enough to do so.

  • Working code is more fun than mere ideas. Continuously push the
    team to sketch out ideas in code, committing quick and dirty
    prototypes that can be refactored as they grow. Have something to
    work with and show off to others from the first week of the
    project. Get things out in public, no matter what the state, as
    soon as possible. The easiest way to do this is to have publicly
    accessible version control. It should be trivial for someone to
    download the repository and play with it (and then easy later to
    edit and join in the fun). When his friends are afraid to release
    “not good enough” code, Autrijus often asks whether he can release
    it on their behalf, or says that he’d like to talk about their work,
    and can they please put it somewhere linkable? He does this so
    often that he’s contracted it to a favorite one-word utterance:
    “url?”
  • Build a rich, supportive community. There are numerous ways to do
    this, but most importantly, lead by example. Mentoring and even
    answering basic questions should happen continuously. Support
    groups should have quick turnaround (IRC is a good choice) and
    an open attitude. Encourage random fun tangents, such as the
    Perl community’s JAPH, obfu, golf, and pervasive Tolkien poetry.
    Get every committer to add themselves to the AUTHORS file (this
    is a good choice to be a new developer’s first commit, if they are
    unfamiliar with your version control system). Turn your project
    into a culture, one that you would like to live in.
  • Excitement and learning are infectious. It’s clear that everyone
    working on Pugs is having a blast, and the team is poring over
    technical papers, attending conferences, and trading information
    with other projects at a massive rate. There’s a pervasive sense of
    high potential and great possibilities, and that sense decays slowly,
    even during inevitable lulls. All of this research and
    experimentation inevitably creates a ladder of skill, from wizened
    experts to fresh newbies. But that’s actually a very good thing;
    skill ladders are part of the very
    definition of fun. True
    passion and community-building rarely develop around a project that
    doesn’t have such a ladder. The more you know, the more you want
    to know — and that’s a heck of a lot of fun, cements the team,
    and produces some amazing code.

Many projects have achieved some of these by accident. Few have achieved
all of them, and in such abundance. Autrijus gave us all a wonderful
present when he made his decision to -Ofun — now it’s your turn.

How do you -Ofun?


Related link: http://www.gamedev.net/reference/articles/article2259.asp

Jay Barnson wrote a simple hack-and-slash RPG in 40 ideal hours using Python and PyGame. He also kept detailed logs of the project. Not only is this possible, it’s within the reach of a decent programmer. If you do this two or three times, you’ll probably be able to refactor out a decent framework for building games like this even faster. (Now if Crandall and I can get some free time, we can try to write that little turn-based strategy game in Ruby.)

How good is Ruby SDL?


Related link: http://wgz.org/chromatic/writeyourlife/

I’ve long wanted to give something back to the world that would have lasting effects. Writing free software helps, but there are plenty of other ways to make the world better.

Last year I decided to write a series of writing exercises, one per day, to help people who wanted to write but didn’t know where to start and to help people tell their life stories. I wrote along with the assignments and found it very valuable to write something new every day for a month.

National Novel Writing Month is a few weeks away. 1,700 words per day (on average) may not seem like a lot, but it’s a huge amount when you’re staring at a blank screen.

The exercises in my Write Your Life project aren’t necessarily the best or most convenient for everyone, but deciding to write something every day for a month, even just half an hour, even a few hundred words, is great practice for writing something longer.

The important part of writing is always writing, even a little bit. It doesn’t have to be perfect. You can edit later. Just write.

(I’ve revised the Atom feed on my Write Your Life page to produce one new entry every day for the rest of October. The main page won’t change from showing all of the assignments in reverse chronological order, but if you subscribe to the feed you’ll get one a day in the order of posting.)

This took five minutes with Perl, shell, and cron. Long live the Unix toolkit!
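The core of the trick is small enough to sketch. I used Perl, shell, and cron; the Python below is only an illustration of the same idea, with a hypothetical function name and made-up entry titles. A daily cron job would call something like this and regenerate the feed from the entries it returns.

```python
from datetime import date


def entries_due(entries, start, today):
    """Return the entries released so far: one per day, beginning on `start`.

    `entries` is assumed to be in posting order, oldest first.
    """
    days_elapsed = (today - start).days + 1
    if days_elapsed <= 0:
        return []               # the schedule has not started yet
    return entries[:days_elapsed]


# Hypothetical assignment titles, for illustration only.
assignments = ["Day 1", "Day 2", "Day 3"]
print(entries_due(assignments, date(2005, 10, 10), date(2005, 10, 11)))
```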

Jeremy Jones


Related link: http://cheeseshop.python.org/pypi/yaxl/0.0.3

I downloaded and played for a few minutes with yaxl, which stands for Yet Another (Pythonic) XML Library. It’s interesting and has some potential, but I’m not sure how well it will “compete” (I hate using that word in relation to open source projects, but I can’t come up with a more appropriate one just now) with a project such as ElementTree. (By the way, ElementTree is pretty much “the gold standard” in my book.)

One thing I found interesting in yaxl is being able to access an element’s attributes via dictionary syntax. If I have <foo a="1" b="2"/>, I can get the values for “a” and “b” by doing foo_element['a'] and foo_element['b']. I like it, sort of, but I’m not too sure how much better that is than ElementTree’s foo_element.attrib['a'] and foo_element.attrib['b']. So it saves 7 keystrokes. I’m wavering between +0 and -0 on that. It’s cool, kind of Pythonic, clearly expresses the intent of getting attribute data, but I don’t think it has an advantage over ElementTree’s syntax, which shares some of the same benefits.
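For comparison, here is the ElementTree side of that as a short runnable sketch. It uses the `xml.etree.ElementTree` module from the current standard library (an assumption about your setup; ElementTree was a separate download when this was written). The yaxl side is left out because I don’t want to guess at its API beyond what is described above.

```python
import xml.etree.ElementTree as ET

foo_element = ET.fromstring('<foo a="1" b="2"/>')

# Attributes live in a plain dictionary on the element.
a_value = foo_element.attrib['a']

# get() is a shortcut that also accepts a default for missing attributes.
b_value = foo_element.get('b')
c_value = foo_element.get('c', 'no such attribute')

print(a_value, b_value, c_value)
```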

One thing that I didn’t like is that “anyelement.children” is a list of all of its descendants rather than just its immediate children. I like the isolation of accessing just the immediate children rather than the children plus their children plus their children…. Maybe there’s another way of getting just the immediate children without having to resort to a list comprehension to do it, which doesn’t appear to work since “anyelement.parent” seems to only return None.
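The separation I’m describing looks like this in ElementTree, again sketched with the standard library’s `xml.etree.ElementTree` and a made-up document: iterating over an element gives only its immediate children, while `iter()` walks the whole subtree.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<root><child1><grandchild/></child1><child2/></root>'
)

# Iterating an element yields its immediate children only.
immediate = [child.tag for child in root]

# iter() walks the element itself and every descendant in document order,
# so the root is filtered out here to leave just the descendants.
descendants = [el.tag for el in root.iter() if el is not root]

print(immediate)
print(descendants)
```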

This is an interesting library and, based on the dates on the Cheese Shop, is very new, so I’m certain it has some inevitable changes it will go through. I wish it well. I’m interested to see how it is received by the Python community and what niche it may fill.

Sid Steward


Related link: http://lookleap.com/site/mixup

Here’s a working example and PHP script for randomizing web page text in-place. The result scans well and indexes well, but it doesn’t give the story away.

I think this would be a good technique for interfacing paid content with the free web. It would be friendlier to users than an access denied page. And scanning the randomized page gives the reader an idea of what the page is talking about, tempting her into buying the content.
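To give a feel for the idea without reproducing the PHP script, here is a minimal Python sketch. It is an assumption about one way to do it, not a port of my code: it works on plain text rather than HTML, and it simply shuffles the words inside each paragraph, so every word is still present for indexing but the sentences no longer read in order. The function name and sample text are made up.

```python
import random


def mix_up(text, seed=None):
    """Shuffle the words within each paragraph of plain text.

    Paragraphs are separated by blank lines. Passing a seed makes the
    shuffle repeatable, which is handy for testing.
    """
    rng = random.Random(seed)
    mixed = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        rng.shuffle(words)      # same words, scrambled order
        mixed.append(" ".join(words))
    return "\n\n".join(mixed)


sample = "The butler did it in the library.\n\nNobody suspected him at first."
print(mix_up(sample, seed=1))
```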

You can visit my mixup page to try it out.

Download the PHP Code Here

I worry, however, that a search engine might detect the random pattern (so to speak) and consider the page spam. I would appreciate your insight, here.

Would search engines flag randomized web pages as spam?


Related link: http://thread.gmane.org/gmane.comp.version-control.mercurial.devel/3481

You probably remember a recent explosion on the Linux kernel mailing list where Bitmover objected so strongly to an OSDL contractor examining the plain-text protocol of a public Bitkeeper server that Linus Torvalds wrote his own source code management system to avoid further fighting.

That was also the end of the “free” Bitkeeper license, as free as a license can be for software that forces you to agree to revised licenses as they come out.

Some people defended Bitmover’s decision to deny “free” licenses to developers who work on competing systems. Does Bitmover have the right (in a legal or moral sense) to deny the employees of paid customers the right to work on competing systems on their own time?

Maybe so.

However, competing by threatening to use the legal system, not solely by providing meritorious software or service, is distasteful. Doing this to customers is unconscionable. I wouldn’t do business with a company that behaved in this manner.

Not all proprietary software is like this, right?

Sid Steward


Related link: http://online.wsj.com/article_email/0,,SB112787016757454136-IVjfYNhlaR4n52qbICIb…

I like how the WSJ laces its articles with statistics. In Media Firms Dig Into War Chests For Latest Assault on the Internet (Sep. 28) we get some good data points:

  • Broadband will reach 42 million US households this year (eMarketer Inc.)
  • 73 million homes have cable (National Cable Television Assn.)
  • Internet ad sales rose 33% in 2004 to $9.6 billion (PricewaterhouseCoopers).
  • US advertising market: $141 billion
  • The top 10 Internet properties raked in about 71% of the ad revenues in the industry (PricewaterhouseCoopers).
  • Yahoo’s revenue last year: $3.6 billion
  • Google’s revenue last year: $3.2 billion
  • Disney took a roughly $1 billion charge related to the closure of its Go.com portal.
  • MySpace.com attracted 17.7 million visitors in June.
  • MySpace.com net income for the year ended March 31: $4.5 million
  • News Corp. buying MySpace.com parent company for $580 million.

And a graph:

[Chart of the stats listed below]

Text version of these stats:

Internet Traffic for August 2005,
millions of unique visitors:

Yahoo! Sites 122.0
Time Warner Network 118.9
MSN-Microsoft Sites 114.6
Google Sites 85.7
Viacom Online 39.8
Walt Disney Internet Group 30.8
News Corp. Online 12.0

Source: comScore Media Metrix

Andy Oram


Related link: http://www.continuent.org/

Because databases are important repositories and the lynchpin of any
application that uses their data, clustering is a critical technology.
Most database vendors provide their own clustering solutions, but they
might not be suitable to average users who simply want to throw
together some systems and say, “Do what you were doing before, but
just replicate everything and execute an automatic failover if
necessary.”

Now there’s an open-source solution for this simple clustering
configuration.
Continuent
is a database-independent project that handles clustering and provides
simple management interfaces.

I talked last week to Continuent spokesperson Emmanuel Cecchet. The
project has employed eight engineers since January of this year and is
funded by
Emic Networks,
a long-time provider of clusters for MySQL. The code is released under
the Apache Public License.

The idea behind Continuent is that you can simply run its basic
software, known as Sequoia, on two or more systems that host databases
and have it handle your clustering. Any query directed to one system
is automatically broadcast to the others.

Sequoia handles transaction scheduling, and allows all systems to be
updated asynchronously at the speed of the fastest node. Sequoia also
ensures that the user always gets data from a fresh copy where all
updates have been applied. Failover is accomplished automatically.
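
The broadcast-and-failover idea can be illustrated with a toy. The sketch below is not Sequoia and does not use its API (Sequoia is Java and sits behind JDBC); it is a hypothetical `ToyCluster` class built on Python's `sqlite3` with in-memory databases. It applies every write to every node synchronously, which is simpler than the asynchronous scheduling described above, and reads from the first node that still answers.

```python
import sqlite3


class ToyCluster:
    """Toy write-broadcast cluster over in-memory SQLite databases."""

    def __init__(self, node_count=2):
        self.nodes = [sqlite3.connect(":memory:") for _ in range(node_count)]

    def write(self, sql, params=()):
        """Apply the same statement to every live node."""
        for conn in list(self.nodes):
            try:
                conn.execute(sql, params)
                conn.commit()
            except sqlite3.ProgrammingError:
                # A closed connection raises ProgrammingError; this toy
                # treats that as a failed node and drops it.
                self.nodes.remove(conn)

    def read(self, sql, params=()):
        """Answer from the first live node, failing over past dead ones."""
        for conn in list(self.nodes):
            try:
                return conn.execute(sql, params).fetchall()
            except sqlite3.ProgrammingError:
                self.nodes.remove(conn)
        raise RuntimeError("no live nodes left in the cluster")


cluster = ToyCluster(node_count=2)
cluster.write("CREATE TABLE t (x INTEGER)")
cluster.write("INSERT INTO t VALUES (?)", (42,))
cluster.nodes[0].close()        # simulate the first node crashing
rows = cluster.read("SELECT x FROM t")
print(rows)
```

Because every node saw every write, the surviving node can answer the query on its own, which is the property the real product provides at much larger scale.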

The group communication behind Sequoia replication is based on a
component called Hedera that allows developers to plug in various
implementations. Hedera currently comes with the popular JGroups group
communication library.

The broadcasting is more coarse-grained than the scheduling that
databases do on their own, but it ensures that the software is
database-independent and requires no special hooks into the
databases. It has proven efficient enough for moderately heavy
database use, particularly in read-heavy applications (about 80%
reads) that are the norm. But it also scales well with heavier write
workloads.

Continuent grew out of a project called c-jdbc, which was hosted at
the ObjectWeb Consortium and proved quite popular with 50,000
downloads. As the name suggests, the project is written in Java and
started with a Java interface. It is now expanding to offer a C++
interface (called Carob) and to replace its cumbersome ODBC-to-JDBC
bridge with a native ODBC implementation.

Management is through an Eclipse plug-in named Oak. The team hopes to
work with the Eclipse database tools project to do further
integration.

While Sequoia is usually employed with homogeneous database instances,
some sites find it useful to help them migrate to new versions of a
database. New versions can be dynamically and transparently added to
the cluster while the administrators work out kinks.

A few intrepid sites have also mixed databases from different vendors.
For instance, if they consider it necessary to do sensitive and
mission-critical work on Oracle, they may create a cluster with the
critical data on Oracle and less critical data (such as static
content) on MySQL. Different tables can be stored on different cluster
nodes, and Sequoia directs queries to the appropriate node.

Although c-jdbc was originally released under the LGPL, its team found
that the APL was more suited to this project. This is mainly because
the main interface and library are Java, and it’s unclear how to apply
the LGPL to Java code. Cecchet said the team sensed that many
potential contributors were keeping all their code proprietary because
they could not be sure how to split it between free and proprietary
components.
