October 2005 Archives

Andy Oram

Douglas Engelbart, pioneer of the GUI and of computer-supported
cooperative work, has received a couple of awards of late. About 35 years
late, in fact. But he hasn’t let neglect (and perhaps worse, empty lip
service to his accomplishments) curb his spontaneous love of
exploration. Spending a few minutes with him–at a ceremony
celebrating an award given last Saturday by Computer Professionals
for Social Responsibility–convinced
me that he remains an American original with a vast scope of
interests, a bit like Edison or Feynman. I wonder what the modern net
of sensors and cameras and GPS devices and wireless networks would be
like if we had integrated his 1960s-era insights from the start.

I had a chance to tell Engelbart how I had first come across his
reputation and ground-breaking work; it was at a conference about
human communication that I attended in the early 1980s. A keynoter
managed to get his hands on a film of Engelbart’s famous 1968
demonstration of computers as augmentations of human intellect, and
showed us five or ten minutes I will never forget.

The part of the demonstration we viewed started with some voice
conversation. Then a piece of paper came up on the screen and some
marks appeared as someone drew on it. More marks appeared as the other
person drew comments on the first one’s marks. Then a small video
opened up in one corner and the head of an assistant, looking very
much like a Californian grad student of the 1960s, popped up and
started talking. The merger of voice, video, and whiteboarding was a
creative implementation of the kind of teleconferencing I suggested
recently in my article “A googol of teleconferences.”

This demonstration, as I saw it in the early 1980s, blew my mind. That
the demonstration actually took place in 1968 was almost beyond ken.
But that is how far ahead of us Engelbart was, and remains.

The 1968 demo reportedly cost $100,000 (which in 1968 dollars is also
nearly beyond ken). The computing world of 2005 has much more of the
infrastructure that could make Engelbart’s vision a common experience.

But suppose we had listened to Engelbart back when he began? I can
imagine what the modern network would be like if we had integrated his
humanistic approach to augmenting intellect into technology each step
of the way:

The dignity and capacity of humans would remain central.

Modern sensor systems, such as smart dust and MIT’s Project Oxygen,
scoop up data somewhat indiscriminately, while the projected uses of
this new “Internet of things” suggest computers switching each other
to new tasks without human intervention. It’s all kind of scary,
suggesting a world out of our control, a kind of science-fiction
Terminator reign of the machine. (To be fair, the designers of Project
Oxygen claim their mission is to be human-centered.) If we already
had a highly interactive network centered on human interaction, with
people tied closely by wire, we could build the new capabilities
asking at every turn, “How are we enabling people to do more of what
they are best at doing?”

Protocols might be in place for the integration of new instruments.

Click around your computer system–or the web sites of many
organizations, including the internal ones they establish purportedly
to increase the productivity of their employees–and you’ll come
across a lot of information with no apparent use. Even the experts
will admit that some of it is pointless. Sensors and other
participants in the Internet of things are likely to suffer from the
same problem. If we had established a human-centered network over the
years, we might know more about what knowledge we need and provide
frameworks for incorporating valuable new devices.

User interfaces would be richer and more subtle.

Right now we’re stuck with the legacy of the Bell telephone system and
that of the typewriter, a nineteenth-century mechanical device with a
bias toward the characters of the English language. Had we stressed
communication and the contributions of individuals to each other’s
endeavors, we might have a plethora of different ways to interact with
the computer by now.

Perhaps we might even have adaptive interfaces, which watch what users
do and change over time to present each user with the functions he or
she is more likely to want. I’d be very reluctant to use an adaptive
interface in our current state of computing, because our knowledge of
human-computer interaction hasn’t achieved the sophistication an
adaptive system needs to be productive rather than annoying.

We might have new solutions to the storage and retrieval of massive
amounts of data.

For a long time, the focus of the computer field was on providing
applications. Now we’ve shifted toward a focus on
services, which are more fine-grained and can be combined in
innovative ways by the users. I tracked this evolution in an article
titled “Applications, User Interfaces, and Servers in the Soup.”

Each shift in the use of data–as well as the amount collected and
searched–has brought with it sophisticated research into databases
and storage. We’re seeing another leap in size and search requirements
as people become used to storing images and videos. The Internet of
sensors will lead to perhaps the biggest scaling problem we’ve ever had.
But a network based on communication might have given us a head start
in understanding and adapting to the onslaught of data.

I think Engelbart’s vision involves a move beyond applications and
services to a new focus: support for the most distinctive features of
human intellect, including communication with other people and
devices. Engelbart’s vision has remained a beacon for us over the
decades, and I am hopeful that a future decade will instantiate it.

Geoff Broadwell

Related link: http://www.oreillynet.com/pub/wlg/8097

Over the last two weeks,
I’ve been talking about optimizing code, starting with the information
gathering phase (choosing an optimization target, profiling, benchmarking,
and so on), and continuing with theoretical reasons code is slow (such as
trying to solve a bigger problem than the one at hand, or choosing a poor
algorithm). This week we’ve finally gotten down to the wires and chips:
why real hardware makes seemingly good code run dog slow.

As I mentioned last week, big-O notation ignores the constants that
determine how long each primitive operation takes, and even how many
primitive operations go into each “high level” operation. Even worse,
these constants aren’t really constant. (Osborne’s law: “Variables won’t,
constants aren’t.”) Why? To answer that, let’s take a little tour
of a modern computer.

The heart of each computer is of course its processors. I use the
plural there because the average desktop or video game console contains
at least two powerful processors — the CPU doing most of the system
management and application logic, and the GPU slinging pixels with
abandon. Workstations (and even high end desktops) often contain two
of each, and servers may have dozens of CPUs. There are also various
smaller processors scattered through the system, including DSPs, I/O
processors, network processors, and so on.

All of these processors need to be able to talk to each other, so
the computer is also filled with communication links, either in the
form of shared channels such as PCI, or point to point links such as
HyperTransport. The processors also need external I/O, so another
set of communication channels exists for that; once again, some are
shared channels (SCSI), and some are point to point (SATA).

These communication channels bring us to the first big bottleneck:
communication is slow. Every channel has a maximum bandwidth capacity
as well as a latency involved in transferring data from one end of the
channel to the other. There is also a certain amount of overhead to
initiate a transfer, and often additional overhead to keep a large
transfer going. These three limitations add up to big problems:

  • Transferring a large chunk of data will be slowed significantly
    by the bandwidth limits of the channel. Transferring 2 GB over
    a 1 GB/s link will take (at least) 2 seconds.
  • On shared channels, when more than one device wants to transfer
    data, they not only have to share bandwidth, but often pay additional
    overhead to switch channel users, so each only gets a fraction
    of the bandwidth of the channel, adding up to well under 100%
    of the total available.
  • Transferring small chunks of data will be limited by both the
    latency of the channel and the overhead to set up each transfer,
    especially if that overhead requires a handshake (a protocol in
    which multiple endpoints have to agree to set up the transfer).
    In the handshake case, the latency cost will be paid several
    times over while the various devices reach agreement.
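
To make these three costs concrete, here is a minimal Python sketch of a transfer-time model. The function name, its parameters, and the microsecond figures are illustrative assumptions, not measurements of any real channel; only the 2 GB over 1 GB/s case comes from the list above.

```python
GB = 10**9  # bytes, using decimal gigabytes for simplicity


def transfer_time(size_bytes, bandwidth_bytes_per_s, latency_s=0.0,
                  setup_overhead_s=0.0, handshake_round_trips=0):
    """Estimate the seconds needed to move size_bytes across one channel.

    Hypothetical model: fixed setup overhead, plus one latency charge for
    the data itself, plus two latency charges (there and back) for every
    handshake round trip, plus the time the payload spends on the wire.
    """
    handshake_s = handshake_round_trips * 2 * latency_s
    wire_s = size_bytes / bandwidth_bytes_per_s
    return setup_overhead_s + handshake_s + latency_s + wire_s


# Large transfer: bandwidth dominates (the 2 GB over 1 GB/s case).
big = transfer_time(2 * GB, 1 * GB)

# Small transfer: made-up 1 microsecond latency and setup cost, one handshake.
small = transfer_time(64, 1 * GB, latency_s=1e-6,
                      setup_overhead_s=1e-6, handshake_round_trips=1)

print(f"big transfer:   {big:.3f} s")
print(f"small transfer: {small * 1e6:.3f} microseconds")
print(f"share of small transfer spent moving data: {(64 / GB) / small:.1%}")
```

With these assumed numbers the 64-byte transfer spends under 2 percent of its time actually moving data; the rest is latency and setup, which is why many small transfers hurt so much more than one large one.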

The data flowing across all these channels needs to be stored somewhere,
so each processor has a chunk of memory to call its own. Slow devices
produce (and consume) little data, so this may only need to be a few KB
or even just a few bytes, and it can be stored right on the processor chip.
More powerful processors, including the CPUs and GPUs, can produce and
consume truly massive piles of data. These processors have sizeable
local memories, from a few dozen MB to a few GB.

That’s too much memory
to put on the processor itself, so it gets separated onto other
devices which — you guessed it — have to communicate with the processor
over a slow channel. Worse yet, big memory is slow to access anyway;
channel latency adds to memory chip latency and memory transaction overhead
to make off-chip memory quite slow.

Because big off-chip memory is slow, fast processors have on-chip cache
memory as well, which tries to keep copies of data that is likely to be
used again soon, so that the processor will not have to go across the
channel to get it. Cache is not a panacea, however. It must be
smaller than the main memory, of course, and the heuristics used to pick
which pieces of main memory to keep copies of, and which data can be tossed
to make room because it won’t be used again soon, are often wrong.

Processor designers can compensate for this weakness by making the
cache bigger, but that makes it slower (big memory is slow, remember).
Pretty soon the silicon spent on cache takes as much room as the
processor itself, so the processor and cache are once again forced to
communicate over a channel that will slow things down. (Sure, the
channel happens to sit right on the processor chip or perhaps right
next to it in the same plastic package, but even this is a long
enough channel to notice the slowdown.) On very fast processors,
the big caches themselves have smaller caches to try to hide the
performance problems inherent in a large cache. Lather, rinse, repeat.

Even if the heuristics were perfect, cache can’t help at all in some
situations, such as when streaming large chunks of data. In this case,
data is only used once, and then either sent to main memory or to another
processor; keeping a copy of the data won’t speed that up. It’s worse
than that, of course; the cache itself adds latency to every memory
access, so keeping a copy of streamed data actually slows everything
down. Also, in order to simplify and speed up the cache design, caches
typically only copy data in chunks of a certain minimum size. Algorithms
that jump around memory grabbing small bits of data tend to be slowed
significantly by caches, as the cache unnecessarily copies more data
from main memory than it needs to (sometimes a lot more), which chews
up much of the bandwidth of the main memory channel.
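
A small sketch can show how much extra data a cache drags in when an algorithm jumps around memory. It assumes a 64-byte cache line and a cache large enough that each distinct line is fetched from main memory exactly once; both are simplifying assumptions, and the function name is hypothetical.

```python
def bytes_fetched(addresses, line_size=64):
    """Bytes pulled from main memory if every distinct cache line touched
    by the given byte addresses is fetched exactly once."""
    lines = {addr // line_size for addr in addresses}
    return len(lines) * line_size


ITEM_SIZE = 4   # bytes per item actually wanted (assumed)
COUNT = 1024    # number of items read

# Neighbouring items share cache lines.
sequential = [i * ITEM_SIZE for i in range(COUNT)]
# Items 4 KB apart: every item sits on its own cache line.
scattered = [i * 4096 for i in range(COUNT)]

useful = COUNT * ITEM_SIZE
seq_fetched = bytes_fetched(sequential)
scat_fetched = bytes_fetched(scattered)

print(f"useful bytes:       {useful}")
print(f"sequential fetched: {seq_fetched}")
print(f"scattered fetched:  {scat_fetched}")
```

Under these assumptions the scattered pattern moves sixteen times as many bytes across the memory channel as it actually uses, while the sequential pattern moves none to spare.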

Still, there are enough cases in which cache is a win (and sometimes a
significant win) that virtually every processor has one or more
of them. These strata of caches of different sizes and speeds create
what is known as the memory hierarchy. At one end are the
ultra-fast registers, tiny chunks of memory that add up to perhaps 1KB
or less. A little slower and an order of magnitude or two larger is the
L1 (level one) cache; slower and larger still is the L2 cache, and some
processors even have an L3. Beyond that comes the processor’s own main
memory, and then memory belonging to other processors, which must be
reached over yet more channels inside the computer.

Relatively often, all the memory belonging to all the processors in the
computer still isn’t enough to handle all the work the user demands at
once, so every modern operating system supports virtual memory,
in which some of the data is stored on a local disk or even across a
network on another computer. Unfortunately, disks and networks are
even slower than the slowest communications channel inside the computer.
Both have moderate to severe bandwidth limitations (especially networks),
and both have severe to insane latency issues. Processors count time
in fractions of a billionth of a second; disks and cross-country networks
take several milliseconds to set up a data transfer. This is a difference
of 7 or 8 orders of magnitude, and amounts to an eternity from the point
of view of the processor. Sadly, these limitations come partly from
physics and mechanical engineering issues, such as the maximum spin speed
of a disk drive platter using current materials and fabrication techniques,
or the speed of light in a long-haul fiber optic cable. There’s room
for improvement, but not orders of magnitude.
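
The "7 or 8 orders of magnitude" figure is easy to check with a couple of assumed values. The 0.3 nanosecond cycle time and 5 millisecond setup time below are illustrative picks within the ranges described above, not measurements.

```python
import math

cycle_s = 0.3e-9   # assumed: a fraction of a billionth of a second
setup_s = 5e-3     # assumed: "several milliseconds" for a disk or long network hop

ratio = setup_s / cycle_s
orders_of_magnitude = math.log10(ratio)

print(f"cycles lost per transfer setup: {ratio:.2e}")
print(f"orders of magnitude: {orders_of_magnitude:.1f}")
```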

Of course, disks and network devices have caches of their own to hide
some of these issues. In reality, these are mostly just buffers; in
other words, their primary duty is to impedance match a fast, high
bandwidth channel to the processor with a slow channel to the disk or
network, so that the processor can spend as little time as possible
waiting for data to trickle in or out. Disk buffers are actually fairly
sophisticated, reordering incoming and outgoing data requests to
minimize the time spent physically moving disk heads back and forth.

Caching and buffering are great things, but essentially they
are kludges, attempts to hide some of the problems some of the time.
As I mentioned before, there are data use patterns that cannot
possibly be sped up by caching, no matter how sophisticated. And there
will always be data use patterns that will be slow in reality, even
if they could be cached in theory. These tend to trip over tradeoffs
in cache design in which the chip designer chose a different tradeoff
than the use pattern would prefer.

Knowing that processors will be forced to wait, the next
technique for keeping them busy is to do more than one thing at once.
That way, anytime the processor has to wait on a data channel (either
sending or receiving), it can switch to another task and try to get
work done on it. This of course only works until that task
has to wait, at which time the processor tries to go to a third task,
or perhaps just back to the first task to check if its data is ready.

As you might imagine, this works much better if there are lots of
tasks, so that there is likely to always be at least one ready to go.
The downside is that all of those tasks have to share the
fixed resources of the system. Each will want to use some of the
bandwidth of the various communications channels, each will take some
room in various caches, and so on. Eventually there will be enough
tasks that the system slows down significantly because of contention
on these shared resources.

When the contention is over space in some
layer of the memory hierarchy, it’s known as thrashing, and
it is one of the biggest performance walls out there. There’s a
noticeable drop in performance as each layer of the memory hierarchy
is overfilled; the system essentially slows to the speed of the
next layer in the hierarchy. When this gets all the way to thrashing
virtual memory, it not only slows the machine to a crawl, but is
actually audible — even the quietest disk begins to make a good
deal of noise when it is forced to read and write data in many
different places on the disk as fast as possible.
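
The performance cliff is easy to reproduce in a toy model. The sketch below simulates a least-recently-used (LRU) cache while a program loops over its working set; LRU is one common replacement policy chosen here for illustration, and the sizes are arbitrary assumptions.

```python
from collections import OrderedDict


def hit_rate(working_set, cache_slots, passes=10):
    """Fraction of accesses served from an LRU cache while looping
    repeatedly over `working_set` distinct items."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for item in range(working_set):
            accesses += 1
            if item in cache:
                hits += 1
                cache.move_to_end(item)        # mark as most recently used
            else:
                if len(cache) >= cache_slots:
                    cache.popitem(last=False)  # evict least recently used
                cache[item] = True
    return hits / accesses


fits = hit_rate(working_set=100, cache_slots=100)
overflows = hit_rate(working_set=101, cache_slots=100)

print(f"working set fits:        {fits:.0%} hits")
print(f"one item too many:       {overflows:.0%} hits")
```

In this model a working set that fits is served from cache on every pass after the first, while a working set just one item larger never hits at all, because each item is evicted shortly before it is needed again. Real caches are less brittle than this worst case, but the shape of the drop is the same.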

Of course, thrashing can be caused by a single application, too; as
its data set grows, the cacheable portion will often grow too large
and overwhelm the smaller caches all by itself. Still, a programmer
may be able to control the cache friendliness of his own application;
trying to make many applications play nicely together is a much bigger
problem.

Processor designers have more tricks up their sleeve, however. In
particular, they take advantage of more kinds of parallelism than just
the multitasking described above. Current processors have
at least three more ways to get more work done. The simplest (and most recent)
is to add additional processor cores to each processor package.
Each core is essentially a complete processor of its own, able to handle
its own set of tasks. Adding cores has an advantage over adding more
processor packages, because the communications channel between the cores
can be very short and wide, and hence very fast. Multiple cores also
usually have their own copies of the fastest caches, but share the largest
caches. This sharing may raise contention issues, but often is a win
because some tasks will tend to be more cache-intensive than others, and
the same total amount of cache can be apportioned in a better way.

Before there was room for multiple cores, processor designers added
more functional units to each processor core. This allowed the
processor to perform several different calculations within the same
task at the same time. Unfortunately, it’s usually
not helpful to simply add dozens of additional functional units. In
most applications, later calculations depend on the results of earlier
ones, and pretty soon most of the extra functional units are idle
waiting for each other’s results. (It happens that computer graphics
is one very large exception to this, so GPUs have literally hundreds
of functional units in them.)

Before processors were made “wider” by adding more functional units,
they were made “longer”. By splitting every instruction handled by the
processor into a sequence of suboperations, and using dedicated (and
highly optimized) hardware for each suboperation, it is possible to form a
pipeline within the processor through which instructions flow.
At any given time, each piece of dedicated hardware, or pipeline
stage, has a different instruction in it. The processor may be
decoding one instruction, fetching input data for another, performing
various stages of calculation for others, and finally writing the
output data for yet more.

Pipelining has two huge benefits — first, older processors still had
to perform these suboperations, but they could only do one at a time,
so every instruction took several clock cycles. A full pipeline can
overlap these suboperations, and so can complete a new instruction every
clock cycle. Second, breaking instructions down even more, into
smaller and smaller suboperations, allows the processor’s clock speed
to be increased significantly. Of course, this only works up to a
certain point; as Intel discovered, pretty soon physics and various
types of overhead get in the way of achieving faster processors this way.
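
The first benefit can be put into numbers with a back-of-the-envelope sketch. It assumes an idealized pipeline that never stalls, and the 5-stage, 1,000-instruction figures are hypothetical.

```python
def cycles_unpipelined(instructions, stages):
    """Each instruction occupies the whole processor for `stages` cycles."""
    return instructions * stages


def cycles_pipelined(instructions, stages):
    """The first instruction takes `stages` cycles to flow through; after
    that, one instruction completes every cycle (no stalls assumed)."""
    return stages + (instructions - 1)


# Illustrative figures: 1,000 instructions on a hypothetical 5-stage design.
slow = cycles_unpipelined(1000, 5)
fast = cycles_pipelined(1000, 5)

print(f"unpipelined: {slow} cycles")
print(f"pipelined:   {fast} cycles")
print(f"speedup:     {slow / fast:.2f}x")
```

For long instruction streams the speedup approaches the number of stages, which is the best case; stalls only pull it down from there.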

Pipelines also have a major weakness, in the form of stalls,
also known as bubbles. These occur when one stage of the
pipeline has to wait on another stage, and so must sit idle for one or
more clock cycles. The problem is that since a pipeline is always
flowing, idleness moves downstream too; each stage will be idle at
least as long as its predecessor. If the first stage of the pipeline
stalls at cycle number 1, for example (waiting on data perhaps), then
every stage of the pipeline will waste a cycle later on.
The second stage will be idle at cycle 2, the third stage at cycle 3,
and so on.
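
Here is a tiny simulation of that bubble moving downstream. It models the pipeline as a list of stages that either hold useful work or a bubble; the function name and the five-stage depth are assumptions made for the example.

```python
def trace_bubble(num_stages, stall_cycle=1):
    """Return {stage: cycle} showing when a bubble that starts in stage 1
    at `stall_cycle` reaches each stage (stages are numbered from 1)."""
    pipeline = [True] * num_stages   # True means the stage holds useful work
    pipeline[0] = False              # stage 1 stalls and holds a bubble
    idle_at = {}
    cycle = stall_cycle
    while False in pipeline:
        for index, busy in enumerate(pipeline):
            if not busy:
                idle_at[index + 1] = cycle
        # One clock tick: everything moves one stage downstream and
        # fresh work enters stage 1.
        pipeline = [True] + pipeline[:-1]
        cycle += 1
    return idle_at


for stage, cycle in trace_bubble(5).items():
    print(f"stage {stage} idle at cycle {cycle}")
```

One stalled cycle at the front costs one wasted cycle in every stage, so the pipeline as a whole finishes one instruction later than it otherwise would.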

There are a great many ways that pipelines can stall. Waiting for
data is an obvious one, but two more are extremely common as well. The
first is when a young instruction in an early stage of the pipeline has
to wait for a calculation being performed by a later stage for an older
instruction. The stages in between will idle out until the calculation
completes and the earlier stage can continue performing useful work.

The second common stall is when a branch occurs; by the time the
processor realizes that it must take a branch, the branch instruction
will be near the end of the pipeline. The instructions that were in
the pipeline behind the branch were from the wrong code path, so they
all have to be thrown away. Then there’s usually some extra time spent
figuring out where the branch destination is in virtual memory, trying
to find the proper next instruction in one or more caches, and so on.
During all of this, the whole pipeline sits idle. Ouch.

It should come as no surprise that processor designers try to hide these
problems as well. The first case is partially hidden by reordering
program instructions on the fly in a way that will calculate the same
results, but prevent as many of these cross-stage waits as possible.
The branching problem is hidden with a host of tricks, including trying
to guess the correct destination from previous history and executing both
paths after a binary branch and then keeping the results from the right
path once the final branch decision is made (this at least keeps the
pipeline half busy, instead of completely idle). Once again,
these tricks make the average case somewhat better, but there are many
reasons they can fail, and even make things worse rather than better.
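
As one example of guessing from previous history, the sketch below implements a two-bit saturating counter, a textbook scheme for predicting whether a branch will be taken. It is an illustration of the idea only; it does not describe what any particular processor does, and the branch patterns are invented.

```python
def prediction_accuracy(outcomes):
    """Fraction of branches guessed correctly by a single two-bit
    saturating counter. Counter values 0-1 predict "not taken",
    values 2-3 predict "taken"."""
    counter = 2  # start out weakly predicting "taken" (arbitrary choice)
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        # Nudge the counter toward what actually happened.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# A loop branch: taken nine times, then falls through, repeated ten times.
loop_branch = ([True] * 9 + [False]) * 10
# A branch that flips direction every time it runs.
alternating = [True, False] * 50

print(f"loop branch: {prediction_accuracy(loop_branch):.0%} correct")
print(f"alternating: {prediction_accuracy(alternating):.0%} correct")
```

The loop branch is predicted well because its history is a good guide to its future; the alternating branch defeats this simple predictor half the time, which is the kind of failure case mentioned above.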

I’ve really only scratched the surface of all the ways that real computers
can run much slower than algorithmic analysis might indicate. For
example, I’ve almost completely ignored the huge class of performance
gotchas that arise from processor designers trying to save a little space
here and there; these often appear as special cases that can drive you mad.
(“You can do two multiplies and an add, or two adds and a multiply, at
full speed; but it takes an extra cycle to do three in a row of either one.”)

So now what? If there are so many ways that code can run slowly on real
hardware, what can you do to minimize these problems? It turns out that
there are some broad techniques that should help you reduce the effects
of most of what I’ve written; those techniques will be the subject
of my next post.

What’s the most infuriating performance gotcha you’ve come across?

Christopher Diggins

Related link: http://www.artima.com/weblogs/viewpost.jsp?thread=134186

Here is a reprint of my blog entry at Artima.com:

I frequently encounter open-source code which reimplements code which exists elsewhere (and usually does so badly). When everyone is busy reinventing the wheel, no one has the time to build a cart.

Even though some developers are guilty of simply not doing research, part of the problem is that finding open-source code for a particular purpose is hard. Search engines are well suited for finding text, but not source code. This is because:

  • Source code documents are not often distributed directly on the web, but rather as part of compressed packages
  • Documentation and source-code are often separated. Robots have trouble creating hard-links between documentation and the source code.
  • Comments in source code are treated with the same level of priority as function names and variables. This means that they aren’t indexed with the proper level of priority.

So how does this get solved? Well I can see two ways:

  1. Search engines start applying specialized techniques for parsing and indexing source code.
  2. Open-source developers come up with a new standardized, language-independent format for distributing source code (perhaps Open-Source-XML?).

I think either (or both) of these technologies could have a significant impact on moving software technology forward.
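
As a rough illustration of what source-aware indexing might look like, the sketch below uses Python's standard `ast` module to pull out the names a file defines separately from its documentation strings, so that an indexer could weight the two differently. The function name `index_source` and the sample snippet are hypothetical, and a real search engine would need a parser for each language it covers.

```python
import ast


def index_source(source):
    """Return (names, docs): the functions and classes defined in a piece
    of Python source, and the docstrings attached to them."""
    tree = ast.parse(source)
    names, docs = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
            doc = ast.get_docstring(node)
            if doc:
                docs.append(doc)
    return names, docs


# Made-up sample file for the example.
sample = '''
def quicksort(items):
    """Return a sorted copy of items."""
    return sorted(items)
'''

names, docs = index_source(sample)
print("names:", names)
print("docs: ", docs)
```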

How can we improve searching for source-code?

State Controller Steve Westly kicked off a campaign Wednesday to find the owners of $4.8 billion worth of items in the vault including checks, jewelry and antique gold coins. (From CNN)

In my modest searching of California’s Unclaimed Property Search, I’ve already found some property the state owes to various family members.

CNN’s article on the State of California’s effort to return property was a nice introduction, but oddly enough was missing the actual link to the Unclaimed Property Search. Given it was an online article, it seemed rather odd to be missing the link.

So far, the State of California’s online initiatives have really impressed me. The California DMV has done a great job of putting services up. They had their Hybrid car-pool sticker applications online the day after Federal regulation allowing it became law (Though I still haven’t found the time to go get mine).

I’d suggest searching the Unclaimed Property database for family that’s not too Internet-savvy; it’s amazing how many people have property being held by the state for various reasons. If banks, trusts, etc. can’t locate the person, they typically turn the property over to the state, which happens for a lot of reasons and more frequently than many might imagine.

Have your state’s online services been up to par? Find any surprises in the Unclaimed Property database?

Last week I had to reinstall Linux on my work laptop. I’m not entirely certain what happened, but I had some filesystem corruption on my JFS formatted root partition and though I was able to repair it and didn’t lose any data (thank you for saving my bacon yet again Knoppix) my disk started thrashing more often afterwards and I just didn’t feel comfortable with that. So, I backed up my data, wiped everything clean, and decided to move from the Arch distro which I had been using since Dec 2004 to the flavor of the past year, Ubuntu.

But, this article isn’t really about that. It is about my first attempt at printing from the Firefox web browser today. After sending several pages to the printer (a somewhat old HP 5000GN) I walked over to it and found that I had not printed the most recent sales for Linux Desktop Hacks, but rather, I had three pages of the same error message.

This message was about 11 lines long, and told me in a straightforward manner that the problem was that the Postscript interpreter on my printer was version 2014.18 and that the printout requires version 2015 or greater. Now, I sorta had an idea what that meant, but I really didn’t know how to go about fixing it…my first thought was that I would have to upgrade the firmware of the printer, which is something that I had to do in the past when some of our Macs were having a hard time printing PDFs from InDesign. Anyway…

…I didn’t need to figure out what to do. The rest of the error message told me exactly how to fix the problem. It involved changing the Firefox print command from the lpr gibberish that is there by default, to include some preprocessing with Ghostscript. I’m in the amateur ranks when it comes to lpr statements and Ghostscript is even more of an unknown, but I could certainly type in what I was told. So I did…

…and it worked.

If you encounter this problem and error message you could say that it’s irksome to have to make any adjustments from the default. Or you could just be happy that some programmer had the forethought to put in a useful error message for something that would possibly be a common problem. Having this information right there in the printout is much more useful than having to search for a FAQ somewhere, particularly if all you know is your document didn’t print.

If only there were more good error messages like this.

chromatic

Related link: http://www.onlamp.com/pub/a/onlamp/2005/10/13/what_is_rails.html?page=last#threa…

In comments on Curt Hibbs’s What is Ruby on Rails?, he and Aaron Trevena, maintainer of Perl’s similar Maypole project, have debated whether Ruby or Rails are doing anything particularly new.

For people who’ve only ever seen complex “enterprise-class” frameworks, libraries, and designs as usable, watching any of the Rails movies might certainly give some evidence of a simple point. Some 95% of all possible web programming problems don’t need huge application servers or complex transactional and messaging systems, and being able to solve them with a fraction of the effort, perhaps in fewer lines of code than the complex system requires lines of XML in its configuration files, is a good thing.

Of course, anyone using a decent set of libraries in Perl, Python, Ruby, or PHP probably already knew this.

Ruby does bring certain advantages; I much prefer the ActiveRecord syntax and introspection over that of Perl’s Class::DBI, but they’re both fantastically useful. They’re equivalent enough that neither offers an order-of-magnitude improvement over the other.

Where something like Python’s Django might invent and polish a new idea, the amount of time and work necessary to do something similar in Perl or Ruby isn’t large either. I don’t have enough practical experience with PHP 5 to judge there, but I’m sure it’s also flexible and dynamic enough to work.

In my mind, the issue isn’t “Ruby on Rails is more flexible and capable than standard J2EE or .NET for any project under a (very high) threshold of complexity”. The real point is that the simplicity, flexibility, and abstraction possibilities offered by dynamic languages and well-designed libraries — as well as a talent for exploiting radical simplicity, extracting commonalities from actual working code, and knowing when too much flexibility makes you less agile — offer a huge advantage over languages and libraries and frameworks and platforms that assume you need a lot of hand-holding to solve a really hard problem.

Yes, Ruby on Rails does what it does very well. It’s not the only thing that does, though. I wonder perhaps if some of the buzz and glow is that it’s new and shiny (in comparison), so that people haven’t already formed their own opinions about it, as they may have with Perl (oh, you can’t write readable and maintainable code), Python (all the fun of the Lisp community without half the things that make Lisp special), and PHP (a language that needs to grow up).

Fortunately, a lot of smart people already understand this. It would be nice to have the right debate, though.

Am I wrong? Is it really Ruby and Rails, or is it the dynamism, flexibility, and better opportunities for abstraction of dynamic languages that provide so much of the benefit?

Nitesh Dhanjani

I’ve been spending a considerable amount of time auditing web applications, and I’ve come to realize that a large number of developers do not understand the root cause of Cross Site Scripting (XSS) vulnerabilities. The most common mistake committed by developers (and many security experts, I might add) is to treat XSS as an input validation problem. Therefore, I frequently come across situations where developers fix XSS problems by attempting to filter out meta-characters (<, >, /, ", ', etc.). At times, if an exhaustive list of meta-characters is used, it does solve the problem, but it makes the application less friendly to the end user: a large set of characters is deemed forbidden.

The correct approach to solving XSS problems is to ensure that every user-supplied parameter is HTML output encoded (for example, < is replaced with &lt;). Most frameworks (.NET for example) provide APIs that help with HTML encoding, but I have come across instances where such APIs don’t encode certain characters that can lead to XSS when more complicated variants of input are attempted. Therefore, I frequently and highly recommend RSnake’s XSS cheat-sheet for testing web-based applications and services for XSS vulnerabilities. If you are a web developer or tester, I recommend that you try the inputs suggested by RSnake against your own application.
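
A minimal sketch of output encoding, using Python's standard `html.escape` function: the `render_comment` helper and the sample payload are hypothetical, and the example only covers text placed inside an HTML element body. Other contexts, such as JavaScript or URLs, need their own encoding rules.

```python
import html


def render_comment(user_text):
    """Build an HTML fragment, encoding the user-supplied text at the
    point of output rather than rejecting characters on input."""
    return "<p>" + html.escape(user_text, quote=True) + "</p>"


# Made-up hostile input for the example.
payload = '<script>alert("xss")</script>'
print(render_comment(payload))
```

The user may still type any character they like; the browser simply receives it as text to display instead of markup to execute.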

Andy Oram

In recent weeks the state of Massachusetts announced, to cheers on one
side and alarm on the other, that it would start writing all new
memos, spreadsheets, and other documents in the OpenDocument format
standardized by
OASIS.

Now there’s a spiffy new web site by the
OpenReader
activists, promoting this format for ebooks.

These are two sides of the same coin, one that buys us freedom in
document formats. Getting your document’s content accurate and
readable is enough of a hassle without worrying about whether a change
in computer platform or tools will render the document ugly–or worse
yet, gibberish.

OpenDocument is an input format, OpenReader an output format.
OpenDocument provides freedom for writers, ensuring that they can
switch production tools as better ones become available. It also
promotes compatibility over time (less chance that upgrades will render
documents unreadable) and protection against bugs.
OpenOffice.org
and
KOffice
are among the projects adopting the format. If a number of states and
countries follow Massachusetts’s lead (which seems likely), Microsoft
may give up its current carping and jump on board.

Within Massachusetts, opponents of the move to OpenDocument are
reduced to about the weakest argument they can find–saying that
conversion would cost a lot of money. The whole impetus behind the
OpenDocument movement is to free us from such short-term thinking.

As for OpenReader, it promotes freedom for readers. It means that for
the first time there’s a feature-rich, multimedia format that allows
publishers to offer ebooks in confidence and that multiple device
manufacturers can support.

The Web offers much room for innovation, but it tends to be weak in
certain areas, particularly for large documents. It doesn’t let you
bookmark arbitrary points in documents, for instance. (XPath would
support that, but Web users don’t have access to tools using XPath.)
OpenReader addresses such needs.

As a proof of concept,
OSoft
is converting its free-software ThoutReader browser to OpenReader. So
books will hopefully start appearing in that format in 2006.

Probably there will always be elements of communication that are
non-standard. Standards bodies can’t keep up with innovation; they
usually must follow it. Free software implementations will promote
innovation without limiting access. So OpenDocument and OpenReader,
along with their free implementations, are foundations for future
document freedom.

Sid Steward

Related link: http://www.alwayson-network.com/comments.php?id=12541_0_11_0_C

Vint Cerf says the revolution is on the edges of the internet and cites VoIP as a good example. The edges? Client applications? That’s Microsoft’s turf.

While many eyes watch for new, life-changing web services such as Google search and Google Maps, they might be missing the next revolution.

VoIP, P2P, RSS, tagging and blogging are all decentralizing forces. VoIP and P2P have largely dispensed with centralized infrastructure. As bandwidth and computing power grow, I expect we’ll see more action on the desktop, not less.

Jonathan Bruce

Related link: http://blogs.datadirect.com/bin/mt-tb.cgi/40

Back in June, I talked about Microsoft’s public disclosures that detail some of their plans for Office 12. Finally we can look forward to XML file formats for the stable Office applications: Word, Excel and PowerPoint.

Since then, I’ve watched with interest as Open Document discussions have increased in volume. Foremost among the voices are bloggers such as Jonathan Schwartz, COO of Sun; David Berlind from the ZDNET Tech Blog portal; and let’s not forget the ever-growing community behind OpenOffice.org.

To add spice to this interesting recipe, mix in the very public moves made by the Commonwealth of Massachusetts to adopt Open Document as its standard file format. Add some furious speculation as to what the recent Sun-Google alignment really means, and then quickly serve up a full-scale debate on whether or not we are on the cusp of a new revolution.

Let me try and cut through what I think is likely to emerge as the difference between reality and (to some degree) spin. First some key facts:

  • Google has successfully executed around the AJAX model giving a compelling web experience for email, maps and more recently blogs.
  • Open Office 2.0 recently shipped for general release. If you’ve not tried it, it is worth a look as the 2.0 version is a vast improvement on earlier versions.
  • Sun and Google have announced a broad technology partnership.
  • Open Document is seeing traction and serious consideration from the Commonwealth of Massachusetts.

Now let’s look at the evidence before us. First let me reiterate this quote from Jonathan Schwartz’s blog:

    “Could these apps I mention, above, be enhanced with better network connectivity, more collaboration, and better integration into your daily life? Absolutely…..So if you want to know what the future portends for OpenOffice.org, that’s a fine place to start (and AJAX will likely play a role).”

In this case I agree with David Berlind of ZDNET. The true meaning here is in what Schwartz omits. Large companies like Sun and Google always have an eye out for technologies that trigger something magical (the halo-like effect). The trick is to do it in such a way that developers will happily invest hours of their time with this technology and innovate freely around it.

If Google and Sun come up with something really compelling, I am very excited by the prospects, but I think it is important to add one note of caution: Microsoft will work feverishly to protect its golden Microsoft Office franchise. While we can look forward to XML formatted office documents, history has shown us that Microsoft is unlikely to rely entirely on spurring developer activity. They have the advantage of an estimated 95% market share.

From my perspective, opportunities abound for technologies like XQuery to become integral parts of engines that integrate distributed XML and relational data sources beneath an AJAXified Open Office front end. The introduction of Microsoft’s Office 12 formats presents similar opportunities.

From a developer’s standpoint it will be important to understand how to participate in the different strategies as they emerge. Should you opt for the halo-effect followed by the underdogs (Google/Sun) who seek to establish mindshare for Office applications? Or should you work with the incumbent (Microsoft), who will more than likely follow the upgrade route, incorporating less inclusive developer approaches that ensure its continued dominance?

Ladies and gentlemen, please place your bets.

Nitesh Dhanjani

I just came across twill, a Python based tool for web application testing. It can be used interactively (command-line) or via a Python script. Below is a quick example of how to use twill to submit a form (HTTP POST). I’ve used Google for demonstration purposes. Note that user input is the text following the $ and >> prompts.

Startup twill:
$ ./twill-sh
-= Welcome to twill! =-

current page: *empty page*

Goto http://google.com/ and show form details:
>> go http://google.com/
==> at http://www.google.com/
current page: http://www.google.com/
>> showforms
Form name=f
## __Name______ __Type___ __ID________ __Value__________________
hl hidden (None) en
ie hidden (None) ISO-8859-1
q text (None)
1 btnG submit (None) Google Search
2 btnI submit (None) I'm Feeling Lucky
current page: http://www.google.com/

Use “oreilly” for the query (q) parameter, and submit using “I’m Feeling Lucky”:
>> fv 1 q oreilly
current page: http://www.google.com/
>> submit btnI
Note: submit is using submit button: name="btnI", value="I'm Feeling Lucky"
current page: http://www.oreilly.com/

Our search succeeded, and we are now at http://www.oreilly.com/ (redirected by Google because we submitted using the “I’m Feeling Lucky” option). Next, let’s list forms on http://www.oreilly.com/:
>> showforms
Form #1
## __Name______ __Type___ __ID________ __Value__________________
sp-a hidden (None) sp1000a5a9
sp-f hidden (None) ISO-8859-1
sp-t hidden (None) search
sp-x-1 hidden (None) cat
sp-x-2 hidden (None) cat2
sp-q-1 hidden (None)
sp-q-2 hidden (None)
sp-c hidden (None) 25
sp-k hidden (None) Articles|Books|Conferences|Other|Weblogs
sp-q text (None)
1 search submit (None) Go
current page: http://www.oreilly.com/

Show cookies acquired so far:
>> show_cookies

There are 1 cookie(s) in the cookiejar.

<Cookie PREF=ID=cf692c05eddeb4e8:TM=1130266168:LM=1130266168:S=5XixcWgCmokEZC0m for .google.com/>

current page: http://www.oreilly.com/

I see how twill can be very useful in performing security assessments against web applications. twill makes it easy to submit forms for input validation testing (XSS, SQL Injection, etc), look at hidden HTML tags, cookie details, etc. The twill website has more details on how to use twill in a Python script. This can be useful when you need to automate twill actions. Also, see “Web app testing with Python 3: twill” for more examples.

brian d foy

Chris Albritton of Back To Iraq does a bit of investigative journalism using the revision tracking features of a Word document. He can see the changes to the Mehlis report on the assassination of Rafik Hariri. Additionally, he can match up the time of the revisions to the time Special Representative Mehlis met with UN Secretary-General Kofi Annan. Several names were redacted, but it’s too late for that because the Word document Chris got still has the revision history, so it still has the names.

He’s posted the relevant section in his entry Names Deleted from Mehlis Report.

chromatic

Related link: http://www.zoomerang.com/survey.zgi?p=WEB224KLPXJUHE

We run a short survey every year to understand our readers. We use this information to change the topics we cover and to present our content more effectively. Last year, who could have predicted that Ruby on Rails and Ruby in general would grow tremendously in popularity? (Okay, a few people knew it was good, but this popular this quickly?) This year, what are you reading on our sites?

So far, 97% of all respondents read articles. I’ve always thought that this is the primary draw and the statistics so far back it up. Also, slightly under half read weblogs — perhaps we should find a way to present them more prominently.

Half of the respondents find articles by browsing the home page, while over a third use a feed reader. It could be interesting to correlate the reading patterns of the groups (but I don’t have any statistical analysis of this at the moment).

Some questions allow multiple answers. Nearly 38% of respondents use BSD of some sort, with 82% using Linux, 33% using Mac OS X, and 60% using Windows. More interestingly, 73% of respondents develop on Linux, 53% on Windows, 26% on a BSD, 23% on Mac OS X, and 13% on Solaris. Deployment is a bit different, with 80% deploying on Linux, 50% on Windows, 32% on a BSD, 17% on Mac OS X, and 19% on Solaris.

These numbers obviously differ from the desktop market as a whole and probably reflect the bias of the site and the nature of our audience.

So far, the largest job category of respondents is software developer (17%), with applications developer (14%) and system administrator (12%) not far behind. I don’t know what the difference is between the first two.

Nearly half of the respondents work for small companies of 50 people or fewer, though the rest of the responses fall pretty evenly between 50 and over 2500 people.

Finally, there’s a heavy industry bias. 20% of all respondents describe their business or industry as computer software or Internet and e-commerce.

The survey closes this Friday, 28 October, so please take it before then. We’ll enter you in a drawing for some nifty swag. More importantly, we’ll use what you tell us to plan for the next year of the site. (Note that the survey uses cookies only for its duration.)

I’ll be back after the survey ends to report on the results as a whole.

brian d foy

The Office of the Inspector General reports on the Transportation Security Administration’s computer network security, and it isn’t pretty.

Remember, the TSA are the same people who violated the Privacy Act by collecting airline passengers’ personal information without notifying them.

People worry about identity theft from shopping online. I worry about my government virtually giving it away.

Nitesh Dhanjani

A few days ago, I noted Tenable’s announcement stating that Nessus3 will not be released under the GPL. As expected, this announcement has caused 3 new Nessus forks to be announced: GNessUS, Sussen, and Porz-Wahn.

GNessUS seems to be the most active of the three (as of now). According to the announcements section of the project website, GNessUS will soon change its name:


Date: Sat, 15 Oct 2005 11:11:16 +0100 (BST)
From: Tim Brown
To: gnessus-announce@gnessus.org, gnessus-news@gnessus.org
Subject: News from the Tenable talks

All,

Yesterday evening I spoke with Jack Huffard from Tenable regarding the choise of GNessUs for the new project name. Whilst I had carried out a trademark search prior to registration of the gnessus.org domain this failed to show that Tenable have an outstanding trademark registration in progress for the Nessus name and as a result Tenable are unhappy with the choice of GNessUs for the new project.

Whilst I argued whether GNessUs would conflict with this registration (particularly since they won’t be registering Nessus world wide) and inquired as to whether they felt OpenNessus would also be in conflict (knowing full well they already owned opennessus.org), which they did, I eventually decided that this was a fight I wasn’t willing to have.

Jack and I have agreed that by the end of the year, I will have sold gnessus.org to Tenable for the price originally paid (12 euros from Gandi.net), subject to Tenable making the trademark application paperwork available for me to review. The conversation was however constructive and Jack wished the project all the success in the future, reconfirming that any changes made to Tenable’s Nessus 2.x branch would remain GPLd and that they had no intention to break compatibility.

What does this mean for the project? Well, I have no intention of shelving it and to this end, I have set up gnessus-discuss@gnessus.org where we can debate a new name for the project. Subscription to this can be achieved by sending a mail to majordomo@nth-dimension.org.uk with a body of “subscribe gnessus-discuss” and I would welcome you joining.
[more]

This is great news. I will continue to watch all three of these projects, and contribute where I can.

Sid Steward

Google Library could pipe print publishers’ works into millions of homes; it’s a natural marriage of new and old media. So why are publishers fighting Google’s embrace? More importantly, how could they patch things up?

I think the main issue is money. Publishers want a piece of the action, but Google doesn’t want to pay.

Google could coax publishers into Google Library by allowing readers to buy pages of content online. Just like the old library photocopier, but on your desktop. To that end, Google would need to fashion a PayPal-like service.

So, I search and Google gives me an excerpt. I pay, and I see the page scan. Google and the publisher split my money, and everybody is happy. I dubbed this idea PageSense.

Fair Use Isn’t Enough — We Want Content

If Google wins the fair use battle, it will have license to index anything. (* Crowds Cheer *) Problem is, what good is a search result if it points me to an out-of-print book hidden in some monastery? Sound unlikely? Google suggests that 60% of the Google Library would be inaccessible content. The solution is to let me pay a quarter to see the page scan Google made. This quarter gets split between Google and the publisher. I’m happy, Google’s happy, and the publisher is happy.

Encourage Valuable Online Content

PageSense would also be helpful to web publishers. Today, web publishers can make a few bucks from ad revenue. So they publish content that maximizes this indirect reward — even web spam.

By creating a system that directly rewards web publishers for good, relevant content, you could expect to see an increase in valuable content online. It wouldn’t all be free. But you could see excerpts using Google and then pay for what you want. Win-win-win.

Jonathan Wellons

Abraham Lincoln knew the principles of Web 2.0 when he said, “… [virtual communities] of the [users], by the [users], for the [users], shall not perish from the [Web].”

Jeremy Jones

Related link: http://geekmuse.net/blog/comments.php?y=05&m=10&entry=entry051017-133741

I’m running a little behind on my podcast listening lately. I’m trying different podcasts to see which ones I want to fill my 2 hours of daily commute with. So far, my regulars are O’Reilly’s Distributing the Future, This Week in Tech, GeekMuse, and some of IT Conversations.

The second topic on Episode 11 of GeekMuse was about TurboGears, apparently a discussion of the recent Slashdot article which compared TG with Ruby on Rails. The GeekMuse discussion quickly turned away from TurboGears and toward Python itself. Some of the GeekMusers made comments of disgust about the semantic value of white space in Python. There was the obligatory comparison to Cobol. There was a disdainful mention of forcing users to adopt a consistent coding style. And there was also a question of flexibility, noting that sometimes it is more convenient or readable to just have an if statement on one line (which, by the way, Python supports).
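
For readers who have not seen it, here is a small sketch of the two one-line forms Python allows. The describe function is invented purely for illustration.

```python
def describe(n):
    # A simple statement may follow the colon on the same line.
    if n < 0: return "negative"
    # A conditional expression is the other one-line form.
    return "zero" if n == 0 else "positive"


if __name__ == "__main__":
    for value in (-3, 0, 8):
        print(value, describe(value))
```

Most style guides discourage the first form for anything longer than a trivial statement, but the language itself accepts it.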

All of these arguments are pretty entertaining to me. The discussion never moved past the whitespace issue. And from their discussion, this issue alone would keep the GeekMusers from ever adopting Python as their respective language of choice. No problem. It’s a big world. There are plenty of good languages to choose from. Diversity makes the world go ’round.

It’s just funny to me because the whitespace issue nearly kept me away from Python. And it’s typically the first issue non-Python types bring up to point to why they wouldn’t adopt Python. When I first looked into Python, I had been working with Perl and had tinkered a bit with C, so I was accustomed to curly brackets identifying beginnings and ends of code blocks. The whitespace in Python turned me off. I could never give the rest of the language enough of a glance to properly appreciate it. I don’t remember what happened, but whitespace became less of an issue for me. I dug into Python and quickly became attached to the simplicity of the syntax and how natural thinking about problems in this new language became.

Now, I enjoy the whitespace. It reminds me that I should be indenting code blocks anyway for readability. Yes, there are problems if one developer uses tabs (bad programmer!) and another uses four spaces like he’s supposed to. But it makes the code so much easier to read, understand, and troubleshoot.

So, how much of an issue should significant whitespace be? I personally think that every development organization and every development project (open source or not) should establish coding conventions which must be adhered to. Some of those conventions should entail the use of whitespace, even for Python where whitespace is meaningful. And it is my preference for whitespace to follow code blocks for readability. Maybe I’m wrong, but I think most developers would agree with me on this point. And if they do, what’s the big deal? If everyone should be indenting properly, anyway, what’s the problem with including whitespace as part of the syntax of the language? But, like I said earlier, it’s a big world. There are plenty of good languages to choose from, most of which don’t regulate where or how whitespace is scattered through source files. If mandatory whitespace gives you the coding heebie-jeebies, you can find a language other than Python which will suit you well.

Jeremy Jones

Related link: http://www.cherrypy.org/

From cherrypy.org:

After 6 months of intense development since the last stable release, CherryPy-2.1.0 is finally released. Grab the release from the download page and make sure you read “What’s new in CherryPy 2.1” for instructions on how to upgrade from 2.0. You can also have a look at the ChangeLog.

On the what’s new in CP 2.1 wiki page, there is a reference to a Session authenticate filter, but the page it links to is nonexistent.

Authentication was something that drew me to CherryPy back before 1.0, but has since been removed. I’ve been looking for some authentication mechanism with the blessing of the CP team or, preferably, something of the sort in the CP library. I hope this is it.

Regardless of whether that little desire of mine has come to fruition, great work, folks! CherryPy is fantastic! I wish you many more years of releases!

Derek Sivers

Related link: http://ferret.davebalmain.com/trac

If you are planning to use Ruby in any project that will need to search anything, pay close attention to Ferret - a Ruby port of Lucene.

Back in January, I started my rewrite of CD Baby in Rails - but one of the biggest unsolved problems was search :

  • over 100,000 albums (and adding 200 new albums a day!)
  • over 1,000,000 songs
  • need to be searchable not just by exact-match, but partial-match and mis-spelling
  • results need to be weighted so that exact-match result comes before partial-match
  • every search must search these six fields: artist, album, style, description, mis-spellings, similar-artists
  • result matches need to be weighted in this order of fields: artist, mis-spellings, similar-artists, album, style, description
  • all this has to happen in under 1 second
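
The weighting requirements above can be sketched in a few lines of Python. This is only an illustration of the ranking idea, not how Lucene or Ferret actually score documents: the numeric weights, the exact-match bonus, and the sample records are all made up, and misspelling tolerance is left out.

```python
# Hypothetical weights following the field order in the list above.
FIELD_WEIGHTS = {
    "artist": 6,
    "misspellings": 5,
    "similar_artists": 4,
    "album": 3,
    "style": 2,
    "description": 1,
}
EXACT_BONUS = 10  # assumed multiplier so exact matches outrank partial ones

# Invented sample data, not real catalog entries.
ALBUMS = [
    {"artist": "Bob Dylan Tribute Band", "album": "Covers"},
    {"artist": "Bob Dylan", "album": "Demo"},
    {"artist": "Someone Else", "album": "Songs",
     "description": "sounds like bob dylan"},
    {"artist": "Nobody", "album": "Nothing"},
]


def score(record, query):
    """Return a relevance score for one record; 0 means no match."""
    q = query.lower()
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        value = record.get(field, "").lower()
        if value == q:
            total += weight * EXACT_BONUS  # exact match on this field
        elif q in value:
            total += weight  # partial (substring) match
    return total


def search(records, query):
    """Return matching records, best score first."""
    scored = [(score(r, query), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in hits]


if __name__ == "__main__":
    for album in search(ALBUMS, "bob dylan"):
        print(album["artist"])
```

A linear scan like this would never meet the one-second target on a million songs; a real engine gets its speed from an inverted index, which is the part a library such as Lucene provides.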

I used to have great search results, but it took TEN queries to do it (4 exact-match queries followed by 6 LIKE ‘%string%’ queries). This was fine before CD Baby got popular, but once we started growing, my old reliable search was taking 30 SECONDS to return results! Live! On the website! Intolerable!

I switched to MySQL’s fulltext search. Fast, yes. But disappointing results. Too many results. Search for “Bob Dylan” and you’ll get EVERY artist with any mention of “Bob” OR “Dylan” in their name or album name.

I asked on my blog, here and got some good advice, including a recommendation for Lucene. My good friend Robert Kaye also told me about Lucene. No - he RAVED about Lucene - about how it could wildcard-search a million strings and return properly-weighted results in a few milliseconds. We talked about his Lucene experience for an hour, and I was convinced that this was the way to go. If you’re interested in learning more about Lucene, download the Lucene book : Lucene in Action. It’s great.

Only one problem : it’s in Java. Fucking Java. I’ve never tried Java. I was hoping to not have to. I don’t hear nice things about it. It’s on my coffee list. But I was considering learning it a bit, just to get Lucene going.

RUBY BINDINGS TO LUCENE?
I asked around the Ruby list, and found out that Brian McCallister had been given a small grant to write Ruby bindings to Lucene. This looked very promising, at first, but it eventually became apparent that it just wasn’t going to happen. At all. Sigh….

LUCENE WEB SERVICE:
Robert Kaye wrote the Lucene Web Service for me. Tomcat. Java. A good start. Open source. Even has some other contributors. But still would mean I’d need to install Java on my servers and maintain a Tomcat server, and do all this Java stuff I was really really hoping not to have to do, just to search my catalog! But it seemed like the only way, so I was going to dedicate next week to setting it all up and getting to know it.

ANNOUNCED THIS WEEK : LUCENE FOR RUBY! HOLY SHIT!
Then just a few days ago, David Balmain announced a full port of Lucene to Ruby - called Ferret. A full port! No Java needed! Oh man what perfect timing.

Marcus Whitney

Related link: http://www.zend.com/collaboration/

I took some time to reflect on the idea of a Zend PHP Framework.  I’ve come to understand that Zend is interested in furthering PHP in the enterprise, and has first hand experience with the questions that Fortune 100 and even Fortune 500 companies have about adopting PHP.  The Zend PHP Framework is not intended to threaten Solar, Prado, Yawp, Cake, Mojavi etc.  Fortune 100 companies would never adopt any of these frameworks on a large scale (no offense).  There just isn’t enough support behind them.  Too much risk.  Zend also can not be expected to fully support a framework that they haven’t been in on from day one.

Zend is for-profit, and I understand that mindset.  Their motives are often questioned because they have such a close relationship with a community that is anything but for-profit.  But I think they actually do a pretty good job of focusing so high in their company goals, that the community and the aspiring PHP greats have little to worry about.  You will still be able to get people to use your framework, .vimrc quickies, eclipse plug-ins etc.  I have no intention of switching to the Zend Framework in my existing projects, but I very well may use Zend’s framework in future projects.  It’s just less for me to support and worry about when I’m trying to keep a close eye on ROI.

As for anyone expecting Zend to collaborate on their existing project or framework, good luck.  It’s clear that they have an agenda, and they are moving on their roadmap with great expediency.  To their credit, this is really their vision, so the framework’s name is appropriate.  Taking PHP to the level of adoption that Java has achieved is a serious undertaking.  It’s like there are almost two tracks here, which is the beauty of PHP.  As an individual, you can choose to go a completely non-Zend route and still be very effective.  But now, the organization with dev teams of 100 or more (besides Yahoo, which has PHP pioneers in its fold) can pick up some Zend Studio and Platform licenses, get the framework going and have some standards as well as Java integration going fairly quickly.  It’s a good thing.

Derek Sivers

Related link: http://www.postgresql.org/docs/7.2/static/queries.html

I spent so long in MySQL without the option of subselects that I got used to JOINing tables as the only way of doing things. Tonight (in PostgreSQL) I replaced a JOIN query with the sublime power of subselects.

PREVIOUS:
SELECT DISTINCT items.id, items.cache_sold
FROM item_subgenre_links isl
INNER JOIN catalogs_items ci ON isl.item_id=ci.item_id
INNER JOIN items ON isl.item_id=items.id
INNER JOIN subgenres ON isl.subgenre_id=subgenres.id
WHERE ci.catalog_id=1
AND subgenres.genre_id = 9
ORDER BY cache_sold DESC LIMIT 10;

SUBSELECT:
SELECT items.id, items.cache_sold
FROM items
WHERE id IN (
    SELECT item_id
    FROM catalogs_items
    WHERE catalog_id = 1
    AND item_id IN (
        SELECT item_id
        FROM item_subgenre_links isl
        WHERE isl.subgenre_id IN (
            SELECT id
            FROM subgenres
            WHERE genre_id = 9)))
ORDER BY cache_sold DESC LIMIT 10;

As far as my non-developer-brain understands it, the reason that the subselect approach is more efficient is that you’re limiting the available choices first, instead of joining all four tables and finding the intersection.
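
Whether the subselect form is actually faster depends on the database’s query planner, so it is worth checking both the results and the plan before relying on it. As a minimal sketch, the script below uses Python’s built-in sqlite3 module (not PostgreSQL) and a tiny invented dataset, with the table and column names taken from the queries above, to confirm that the two forms return the same rows. On PostgreSQL itself, running each query under EXPLAIN ANALYZE would show what the planner does with it.

```python
import sqlite3

JOIN_QUERY = """
SELECT DISTINCT items.id, items.cache_sold
FROM item_subgenre_links isl
INNER JOIN catalogs_items ci ON isl.item_id = ci.item_id
INNER JOIN items ON isl.item_id = items.id
INNER JOIN subgenres ON isl.subgenre_id = subgenres.id
WHERE ci.catalog_id = 1
AND subgenres.genre_id = 9
ORDER BY cache_sold DESC LIMIT 10;
"""

SUBSELECT_QUERY = """
SELECT items.id, items.cache_sold
FROM items
WHERE id IN (
    SELECT item_id FROM catalogs_items
    WHERE catalog_id = 1
    AND item_id IN (
        SELECT item_id FROM item_subgenre_links isl
        WHERE isl.subgenre_id IN (
            SELECT id FROM subgenres WHERE genre_id = 9)))
ORDER BY cache_sold DESC LIMIT 10;
"""


def build_sample_db():
    """Create an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items (id INTEGER PRIMARY KEY, cache_sold INTEGER);
        CREATE TABLE catalogs_items (catalog_id INTEGER, item_id INTEGER);
        CREATE TABLE item_subgenre_links (item_id INTEGER, subgenre_id INTEGER);
        CREATE TABLE subgenres (id INTEGER PRIMARY KEY, genre_id INTEGER);

        INSERT INTO subgenres VALUES (1, 9), (2, 9), (3, 4);
        INSERT INTO items VALUES (1, 50), (2, 30), (3, 99), (4, 10);
        INSERT INTO catalogs_items VALUES (1, 1), (1, 2), (2, 3), (1, 4);
        -- Item 1 sits in two subgenres of genre 9, so the JOIN needs DISTINCT.
        INSERT INTO item_subgenre_links VALUES
            (1, 1), (1, 2), (2, 2), (3, 1), (4, 3);
    """)
    return conn


def run(conn, sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    db = build_sample_db()
    print(run(db, JOIN_QUERY))
    print(run(db, SUBSELECT_QUERY))
```

Note that the JOIN version needs DISTINCT because an item linked to several matching subgenres appears once per link, while the IN form returns each item at most once by construction.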

Geoff Broadwell

Related link: http://www.oreillynet.com/pub/wlg/8097

Last week I talked
about how to determine how an application performs in different
scenarios, and generally where the code has bottlenecks. It’s
time to talk about why the code might be slow.

There are a great many particular reasons why code could be performing
slowly, so I’ll start off by painting some broad strokes and go for more
detail later on. Let’s break these issues into some big categories:

  1. The application is attempting to solve a more difficult problem
    than the one the user actually has.
  2. The code does more work than is necessary to solve the problem
    it was designed for.
  3. The code is great in theory, but runs slowly on real hardware.

That first one may seem silly, but I’d warrant it’s one of the biggest
causes of performance complaints. The mistake could be as simple as
running a complex statistical analysis program on a huge dataset, when
the user only needs to know how many data points lie within a certain
range. Or perhaps the application computes all possible answers to a
problem, when knowing just one answer is enough.
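
Both mistakes can be sketched in a few lines of Python. The data and the is_answer predicate are invented stand-ins; the point is only that answering the question the user actually asked can mean doing far less work.

```python
def count_in_range(data, low, high):
    """Count points in [low, high] with one pass and no other statistics."""
    return sum(1 for x in data if low <= x <= high)


def is_answer(n):
    """Made-up test standing in for an expensive check."""
    return n % 7 == 0 and n % 11 == 0


def all_answers(candidates):
    # Examines every candidate, even if the caller needs only one.
    return [n for n in candidates if is_answer(n)]


def first_answer(candidates):
    # The generator stops as soon as one candidate passes the test.
    return next((n for n in candidates if is_answer(n)), None)


if __name__ == "__main__":
    print(count_in_range([1, 5, 10, 15], 5, 10))
    print(first_answer(range(1, 1000000)))
```

Here first_answer stops after examining 77 candidates, whereas all_answers over the same range would test all of them.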

Or it could be a more subtle issue, like requiring an exact answer when an
approximation will easily do. There are many problem spaces for which
the only known ways to calculate an exact answer are vastly slower than
making a very good guess. Modern computer graphics would not even exist
if not for the fact that, most of the time, gross approximations are
just fine. There are even problems for which it is infeasible or
impossible to calculate an exact answer at all; the