October 2003 Archives

Jacek Artymiak


Related link: http://marc.theaimsgroup.com/?l=openbsd-misc&m=106759614014493&w=2

The author of the original message claims that the local French laws forbid the sale of non-localized software to non-business customers.

If true, this is another case of “customer protection” gone too far. If enforced with the same diligence as described in the original message, it could hurt the sales of all Open Source software to the general public in France.

BTW, if you are in France and want to purchase OpenBSD for your own private use, you should still be able to do it online. Of course, you should check your local laws first.

Does your country implement similar customer protection laws? Do they apply to software?

Jacek Artymiak


Related link: http://www.openbsd.org/34.html

OpenBSD 3.4 has been released to the wide wild world. Install and enjoy! And don’t forget to support the project.

David Sklar


If this were just a language battle of Perl 6 vs. C#, it wouldn’t merit any more notice than any other matchup in the perpetual linguistic conflict among programmers. What sets this face-off apart is that each of these VMs can host other languages as well. Running on top of Parrot, Python code can use Perl libraries and vice versa. Running on top of the CLR, C#, J#, and Visual Basic .NET code can perform similar feats of reuse.

Rarely are functions in the standard C library reimplemented in systems programming. This kind of wheel-reinvention happens all the time in web apps, however. The multi-language capabilities of Parrot and the CLR provide the opportunity for each system to have a “libc for the web” that handles standard functionality: handling form and URL data, serving and consuming web services, processing XML, and all of the other tasks that make up the plumbing of web applications. No matter what language you write your application in, you can use the same core library functions. If knowing a language’s library APIs is as crucial as knowing a language’s syntax, a common core library makes you productive in new languages much more quickly.

Microsoft’s ASP.NET libraries, also implemented for Mono, are a shot at the prize. There is plenty of code in CPAN that, converted to Perl 6, could also be a contender.

So why was PHP-Con a skirmish? The talk by Sterling Hughes and Thies Arntzen about their plans to build a PHP-to-Parrot compiler tugs the future of PHP towards Parrot. But there’s also an existing project to build PHP#, a version of PHP that runs on the CLR. (Memo to Microsoft: hire Alan Knowles and/or anyone else you can get your hands on to make PHP# a reality if you want to win this battle.) Brian Goldfarb, a Product Manager from Microsoft, came to PHP-Con to learn from folks there.

Microsoft was blindsided by Linux and is now attempting to avoid repeating its mistakes in the world of scripting languages and PHP. With little to no commercial backing, PHP has become an incredibly popular on-ramp for small-time web developers. That’s supposed to be Visual Basic’s niche. So maybe by understanding PHP, Microsoft can compete better in that world.

There is a lot going on that makes the outcome of this titan clash still unclear. ActiveState is working on Perl for .NET. The .NET development model is tightly integrated with Visual Studio, which is designed with languages like Visual Basic and C# in mind, so the day-to-day comfort of application development doesn’t yet quite sync with the technical possibility of running other languages on the CLR. The CLR and Mono are available now, while Perl 6 is not. PHP 5 is on the way (“before Perl 6” is, I think, the target release date) and plenty of people will stick with PHP 4, let alone move from PHP 5 to a future Parrot- or CLR-based version of PHP in a few years. IIS and Windows are tuned to provide the best operating environment for .NET. What kind of tight integration will be developed between Apache, Linux, and Mono or Parrot?

Oh yeah, Java is still around, right? Visual J# seems focused on migrating developers away from Java and I don’t expect a version of Java for Parrot (but open-source developers have built all sorts of crazy projects). Both Parrot and the CLR have learned from the Java VM’s host-only-one-language mistake. Given the $zillions invested in Java deployments to date, I suppose Java’s not disappearing any time soon, but Sun needs to work on how to stay relevant in the multi-language VM world of the near future.

What do you think will happen in the battle between .NET and Parrot?

Chris DiBona


As an experiment, I added a Google AdSense bar to my website at DiBona.com. And on that front page, I have a number of links to my recent entries here on the O’Reilly Network.


The last three entries looked something like this: I talk about SQL Caching on the O’Reilly Network.


As you may know, Google figures out which ads to feed by looking at the content of your website and choosing what it thinks are the best-fit ads for your site. It does this quite well, usually. For instance, if you look at my FAQ page for viewers of my segments on TV, you’ll notice a lot of spot-on Linux ads. I mean, they’re really appropriate choices. But for the front page, the term “O’Reilly” appeared too often and tricked Google into thinking I was talking about this O’Reilly (Bill) and not this one (Tim).


Now, I’m not going to get into a political debate here, but this is one of the problems with keyword-based systems: sometimes the context isn’t there.


How to fix it? Google provides a way to block specific URLs from appearing on your site, but there really isn’t a way to exclude every Bill O’Reilly related ad link. The only answer is for Google to provide people with the ability to “ignore” a keyword. Even this is non-optimal, as cutting out the O’Reilly keyword would potentially stall some fairly on-topic ads keyed to O’Reilly (as in Tim) while blocking the ads keyed to “O’Reilly” (as in Bill). Other than that, however, AdSense is a pretty neat service.

Not -that- O’Reilly, the other one.

Andy Lester


Related link: http://use.perl.org/~brian_d_foy/journal/15405

“When you think about the sacrifices that soldiers make to be here, remember that I cannot ssh. The horror.”


Read on for more tales of military computing from Iraq, brought to you by the founder of the Perl Mongers, brian d foy.

Who will play brian in the made-for-TV movie, “Mac In Iraq”, assuming that John Malkovich is busy?

David Sklar


Related link: http://www.sfgate.com/cgi-bin/article.cgi?file=/c/a/2003/10/22/MNGCO2FN8G1.DTL

If a US-based employee disclosed confidential patient data in a back-pay dispute, they’d be in big trouble, whether or not they had a valid pay gripe. But the long arm of US law doesn’t extend so easily to Pakistan, where this incident happened, or plenty of other overseas destinations for medical transcription outsourcing.

As a chain is only as strong as its weakest link, a privacy or confidentiality regime protecting data is only as strong as the flimsiest, most disclosure-prone access to the data. In this case, an underpaid and mistreated (or wily and greedy, depending on who you believe) person not accountable to US law provides an extremely weak link in the medical privacy chain.

I’ve idly wondered if working as a custodian at a software company can get you a lucrative sideline as a pirate software distributor — bring a FireWire DVD burner to work with you and take home some goodies. There are a lot of people in the “chain of data” that are sometimes just looked at as furniture by the “professionals” who are “really” working with the data. In the UCSF case, doctors record gobs of dictation and then, a little while later, it shows up all typed out. Do they care if it took a trip around the world in the process?* Software developers go to work every day and find the floor vacuumed.** Are they concerned about who cleaned up and restocked the kitchen with Jolt? Who changes the lightbulbs in Experian’s data center? Lots of juicy data there.

Once we realize all of the people that really do have access to very sensitive data, we can treat them appropriately (and scrutinize them properly before such access is granted).


* Many doctors may in fact care, I don’t mean to categorically malign them. Multiply subcontracting administrators seem to have been the problem in the UCSF case.

** I realize it is likely that some developers are familiar with the custodial folks since the developers are cranking away when the custodians show up at midnight.

What are the weak spots in the data chain that you worry about?

chromatic


Related link: http://www.thepetitionsite.com/takeaction/190498263

Serious gamers are an interesting bunch. Some of my college buddies were a
representative bunch — leaving the cases off their computer because they
swapped hardware so often, reinstalling Windows every couple of months to
improve the stability threatened by installing and uninstalling games and
DirectX components so often. Releasing dedicated servers really taps in to
that do-it-yourself mindset.

There’s a similar spirit to be found in many Linux users. Witness the
popularity of install-or-compile-it-yourself distributions. There’s something
very satisfying about knowing how everything fits together, having put it all
together yourself. (Of course, the same rugged individualism can lead to
myriad identical projects with big goals, no code, and no future.)

It’s no surprise that game companies are aware of this mentality in their
most dedicated customers. Running your own game server is appealing. You get
to set the rules. You choose the maps and scenarios. You (finally) have a
good network connection to the server.

Of course, this also means that you’re helping keep the game alive.
Instead of the company paying for massive server hardware, upkeep, and
bandwidth, you’re responsible. The joy of the hack only goes so far —
system administration is either boring or exciting, and exciting system
administration is bad. It’s not hard to see why server administrators are
starting to say, "Wait, I’m doing this for free. Is it really worth
it?"

That’s why the Valve Software Linux Server Boycott Petition
(http://www.thepetitionsite.com/takeaction/190498263) is interesting.

While it’s nice that proprietary game developers acknowledge that Linux is
a viable platform for their game servers, it’s unfortunate that it’s not seen
as a viable platform for game clients. If the base game logic compiles and
runs successfully on Linux already, how much more work is it to port the
graphics and sound?

I’m somewhat sympathetic to the response “Quite a bit, actually.”
As Chris DiBona
unequivocally puts it, “The games industry is all [fouled] up.” When nearly
every retrospective admits that the project spent months in deathmarch mode
trying to get a game out on time, you have to wonder when someone’ll figure
out that something’s deadly wrong. Still, there are several reasons of
varying legitimacy why more game publishers don’t want to publish Linux
clients.

  • Low market size (though if you don’t publish, no one can possibly
    buy).
  • Portability concerns (though a well-designed game will minimize
    platform-specific code and good libraries such as SDL
    (http://www.libsdl.org/) can take care of much of the rest).
  • Lack of time (though this is more a feature of a broken development
    process).
  • Lack of knowledge (though people such as Icculus
    (http://icculus.org/) have been known to work very quick and
    very impressive magic).
  • Lack of standards (though the state of video and audio on Linux is better
    than ever and continues to improve).

I don’t necessarily agree or disagree with any of those points. It’s the
game publisher’s decision whether to port a game to Linux. (It’d be really
nice to have source releases, as not everyone uses Linux on x86 chips or Linux
at all — but you don’t really get anyone’s attention by talking about
FreeBSD gaming.) Why should a publisher put out a Linux port if people will
just buy the Windows version and play it under Wine, WineX, or the token
Windows box?

With that in mind, though, it’s also up to potential customers whether
they want to buy the game. This petition will likely have no effect on Valve
(in an industry perhaps most charitably described as “arrogantly adolescent”,
self-absorption is all too common), but it brings up a question Linux gamers
should consider.

Why is Linux good enough to host your game but not good enough to play it?

Are there other questions Linux gamers should be asking?

Andy Oram


Related link: http://206.191.55.162/MLISTS/news2003/0108.html

Back in 2001, if not earlier, companies were marketing filesystems
that offered far greater robustness, scalability, and (perhaps)
performance than traditional distributed filesystems such as NFS.
OceanStore is
probably the best known of these systems, although it is a research
project rather than a product.

Now Internet2 is offering such a filesystem across all the dozens of
universities that are part of its network. There’s even a Windows
client! That means Logistical Networking could become the first mass
phenomenon in the next generation of distributed filesystems.

The idea of these filesystems is that each file is broken into pieces
when stored, and each piece is sent to multiple systems. Encryption
protects the data from snooping, and digests protect it from
corruption. Never again will you have to stop work because your file
server is down; there will always be another server with your
data. Downloads experience extra overhead because the locations of the
parts have to be resolved, the data could be far away, and the
encryption introduces extra work–but breaking large files into pieces
can help compensate.
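
To make the piece-and-digest idea concrete, here is a minimal sketch in Python. It is not the Logistical Networking or OceanStore API; the function names, the piece size, and the choice of SHA-256 are all assumptions made for illustration, and encryption and replication to multiple servers are left out.

```python
import hashlib


def split_with_digests(data, piece_size):
    """Break data into fixed-size pieces, pairing each piece with its digest."""
    pieces = []
    for offset in range(0, len(data), piece_size):
        piece = data[offset:offset + piece_size]
        # SHA-256 is an assumed choice; any strong digest would serve here.
        pieces.append((hashlib.sha256(piece).hexdigest(), piece))
    return pieces


def reassemble(pieces):
    """Rebuild the file, refusing any piece that does not match its digest."""
    out = bytearray()
    for digest, piece in pieces:
        if hashlib.sha256(piece).hexdigest() != digest:
            raise ValueError("piece failed its digest check")
        out += piece
    return bytes(out)
```

In a real system a piece that fails its check would simply be fetched again from another server holding a copy, which is where the redundancy pays off.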

There are so many theoretical advantages for these systems over NFS
and CIFS that I am convinced they’ll dominate filesystem storage in a
few years. Internet2 has taken an excellent step in that direction. It
may be the most important illustration of the peer-to-peer concept,
far more useful than anonymous distributed systems such as KaZaa.

Andy Oram


Before the Internet was a medium for exchanging Eminem MP3 files, it
was an information medium. Not that the change is all bad; we can
still enjoy a global medium that serves as an infinitely customizable
radio station, and a shopping forum, and a junk yard for all sorts of
trivia like the IQ test my daughter showed me recently. But I’ve been
thinking back to when the Internet was heralded, around the 1980s, as
the greatest potential medium for information ever invented.

I know I’m not supposed to post a weblog containing any date that
begins with a “19” because people view history as irrelevant. They’re
more interested in somebody’s latest program crash or job gripe than
anything I did in the 1980s. But information ruled back then, and
users of the Internet were very excited about it.

We didn’t have MP3, but we did have GIF and even MIME. Mostly we had
information in the crudest ASCII form, and a lot of it was fake, but
there was enough good stuff to propel the use of the Internet forward
to what we have today. So don’t turn up your nose at this information,
especially considering most Internet users have practically forgotten
what real information is. I’m using information not in the sense of
“digital bits with meaning”–which is so broad as to almost neuter the
term–but in the sense of “input that allows a person to make a
conscious change of direction in his life.”

You can still get this sort of information on the Internet today.
Foremost is information about sex. This is critical because
information about sex was hard for the average person to get before
the Internet. The medium for such information tended to be articles
along the lines of “Testosterone functioning as a function of
corticotropin releasing hormone diffusion.”

A few highly progressive researchers wrote easy-to-read books about
sex. But the problem with books is that, to get them, you had to go to
a bookstore. And everybody was too embarrassed for that. So nobody
purchased the books by highly progressive researchers except highly
progressive parents trying to inform their children. And while this is
nice for the children, if you’re a parent when you buy a book on sex
it’s obviously too late for you.

TV shows and movies contain lots of sex, but in the form of
titillation rather than information. Both forms are all over the
Internet, though, with the key result that information can be obtained
by anybody, anywhere, any time, anonymously. John McCain passed a bill
requiring software filters that block the information, but they don’t
work, so people can still get their information.

(OK, I admit that here I’ve exaggerated a little bit. The McCain bill
requires filters only in schools and libraries that accept federal
funds. And the filters have to block only “obscenity,” but since
“obscenity” is defined in U.S. law as “anything interesting your child
would like to see, or her teacher would like her to research,” the
effect is basically what I said.)

Next after sex, the Internet provides information on political
campaigns. I find this valuable because a lot of people want
information on these campaigns, and once again the traditional media
provide titillation instead.

TV and newspaper editors are aware that the public is interested in
politics. But not knowing how to cover politics, the media cover it as
competitive sports. (I am by no means the first person to point this
out.)

You see, competitive sports make a lot of money for the media, so
they’ve learned how to do sports well. And since that’s where they’ve
developed their expertise, everything else “looks like a nail” and
they try to apply the same sports-tested hammer to it.

War works reasonably well as a competitive sport. The main difference
for the media is that in sports, the bystanders provide the violence,
whereas in war they are the victims of the violence.

Literature and academic publishing, when viewed from the right angle,
can also be covered as a competitive sport. The New York Review of
Books has been doing this for decades.

But political campaigns do not work well as competitive sports. Yet
the media have developed a tradition of covering them this way. For
instance, I’ll find several columns one day in the paper about a
problem John Kerry is having with his campaign staff. This maps in the
newspaper editor’s mind to the problems a sports team manager has with
his players, so it takes up most of a page. But it does not offer
anything about who John Kerry is or what the article has to do with
anything else in the world.

I happen to know who John Kerry is because I live in Massachusetts
(for all the rest of you, John Kerry is a senator from Massachusetts)
and because I am interested in where we get illegal drugs, and Kerry
has helped to explain that. (Try, for instance, entering a combination
of “Kerry” and “contra” into a search engine.) Drugs, like sex,
represent an example of a taboo subject where the Internet makes
information more available than ever.

Meanwhile, politics in the traditional media consist more of campaign
logistics than of issues, which is as if sports stories reported more
about personnel problems than about the games. So we’re lucky we
now have the Internet and all the sites that offer information as well
as citizen participation. (The League of Women Voters’ DemocracyNet
and Rock the Vote are just two of the many sites that show the
diversity available.)

The Internet didn’t seem to have much effect on the recent California
recall election, but I suspect that’s because the voters didn’t treat
it as an issues campaign, but as a personality campaign. A California
resident told me that, in this case, the TV stations came through and
aired some good debates.

The reason Arnold Schwarzenegger won, in my opinion, is that the media
decided early on that his candidacy was the only dramatic story they
could make out of the campaign. And so the news covered mostly
Schwarzenegger from then on. The other hundred-plus candidates found
that they’d be covered only insofar as they interacted with
Schwarzenegger’s campaign. It got so extreme that many candidates
piled into whatever vehicles they could get and followed the
Schwarzenegger campaign around the state in the hope of getting
noticed. Since this chase turned into a competitive sport, it got
reported in the newspaper.

So the Internet doesn’t seem to be doing much for John Kerry or Gray
Davis. Nor did it do much for John McCain, but that’s to be expected
because, as I said, he’s not an Internet-friendly guy. But the
Internet is doing a lot for Howard Dean. It’s not the best way to
reach a large part of the country’s ethnically and economically
diverse population, and the Dean campaign is aware of that. But the
Internet is a good way to rally early and highly committed people who
can provide both money and volunteers. O’Reilly & Associates published
a book about using the Internet for political organizing back in 1996,
but it’s getting really interesting only now.

I mentioned trivia and tests earlier. The Internet successfully
combines serious politics with the human urge to play. Try calculating
your ecological footprint and see whether that doesn’t induce
political action.

Another topic where the Internet has useful information is cars. The
ability to check many prices from one’s desk gives the consumer much
negotiating power. This also goes for air travel and other big ticket
items where pricing is elastic.

Of course, the Internet is a great source of information concerning
other people, but I won’t comment further on that because it’s
distasteful to me and it’s already gotten plenty of publicity.

The final kind of information I find particularly interesting on
the Internet is information about the Internet itself. You can get
everything from RFCs and fascinating Internet-Drafts to the actual
software that lets you use the Internet and run servers on it. (I
can’t resist a plug here for my company’s own resource, the Safari
Bookshelf, because it also functions as a source of information about
the Internet.)

It’s a novel phenomenon, researching the Internet on the Internet. You
couldn’t build the Panama Canal using other canals, or build the
transcontinental railroads using other railroads. But previous work on
the Internet is directly supporting people building its next
phase. While most of us are seeking Eminem or porn or the Howard Dean
site, there’s always some outlying individual asking, “How can I
develop new J2EE software in a way that’s easy to subclass?” And that
geek will draw everything else along into the future.

Does the Internet change our relationship to information?

Chris DiBona


Reading Jack Herrington’s article reminded me of an article I have been wanting to write for some time about improving scalability for data-driven web sites. When I worked for OSDN, I noticed that if you look at a page that a site like NewsForge or Slashdot assembles for a viewer, a lot of things simply do not change from viewer to viewer. From the RSS feed boxes to the polls to the stories, the common elements far outnumber the customized components. But taken as a full page, so much changes from user to user, between the dynamic nature of the site and the number of user customizations, that whole-page caching doesn’t make a lot of sense.


But when you look at how any one page is put together for a great number of viewers of any website, you see that most of the queries presented to the database are largely the same. Thanks to innovative server-side caching by databases like MySQL and the rest, these repetitive, non-unique queries are not as bad as they would be if the DB had to re-execute them every time rather than pull the results from memory, so you might imagine that caching SQL on the client side would be a bad idea.


That is, unless of course you want your site to scale better horizontally. The problem with doing your caching on the SQL server is that, if you want your site to scale, you want to leave the DB alone as much as possible and only access that machine when a table or row changes. This becomes all the more important if you are using a commercial DB, where scaling brings increased licensing costs.


How it works could almost be left as an exercise for the O’Reilly reader. Basically, you add a “smart” memoization function to your SQL access library in whichever language you are using. Doing this on the client also means you’ll need to create a tracking table in the database to record which tables have become dirty, and when.
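
As a rough illustration of those two pieces, here is a minimal Python sketch using the standard sqlite3 module. The table and column names (sitetracking, trackid, tablename) follow the sample tracking query given later in this post; everything else, including the class name, the cache_key helper, and the in-memory dict standing in for a real cache store, is an assumption for illustration. There is no invalidation here, so this shows only the memoization half of the idea.

```python
import hashlib
import sqlite3

# Assumed schema: one row is appended each time a table is dirtied.
TRACKING_SCHEMA = """
CREATE TABLE IF NOT EXISTS sitetracking (
    trackid   INTEGER PRIMARY KEY AUTOINCREMENT,
    tablename TEXT NOT NULL
)
"""


def cache_key(sql, params=()):
    """Derive a stable key from the query text and its bound parameters."""
    payload = sql.strip() + "\x00" + repr(tuple(params))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class MemoizingConnection:
    """Hypothetical wrapper: each distinct query hits the DB only once."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}  # stand-in for an on-disk or memcached store
        self.conn.execute(TRACKING_SCHEMA)

    def query(self, sql, params=()):
        key = cache_key(sql, params)
        if key not in self.cache:
            self.cache[key] = self.conn.execute(sql, params).fetchall()
        return self.cache[key]
```

Keying on the parameters as well as the SQL text is what lets customized pages share cache entries whenever two users happen to ask for the same thing.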


When constructing such a system, you can create pages that take at most 1 query per page (plus any logging queries, which you can choose whether or not to let dirty the caches) using the following method:


First, before any real queries happen, the page construction logic should query the tracking table to get a list of tables that have become dirty since the last time a page was constructed. This can commonly become the only query that a page sees.
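
A sketch of that first step in Python with sqlite3 might look like the following. The sitetracking, trackid, and tablename names come from the sample query later in this post; the function name and the idea of returning the new high-water mark alongside the set of tables are assumptions for illustration.

```python
import sqlite3


def dirty_tables_since(conn, last_id):
    """Return (new_last_id, set of table names dirtied after last_id)."""
    rows = conn.execute(
        "SELECT trackid, tablename FROM sitetracking "
        "WHERE trackid > ? ORDER BY trackid",
        (last_id,),
    ).fetchall()
    if not rows:
        # Nothing changed: the caller can serve everything from cache.
        return last_id, set()
    return rows[-1][0], {name for _, name in rows}
```

Each web head would remember the returned id between page loads and pass it back in next time, so the tracking query stays small no matter how long the site has been running.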


In the event that a list of “dirtied” tables has been returned, the routine should remove the cached results for those tables from whatever storage you’re using. I chose disk-based SQL query caching, as modern operating systems are better than I am at setting up disk caches for mostly read-only data. A programmer could use a memcached (http://www.danga.com/memcached/) kind of solution, but I don’t know how much benefit it would give you. A programmer should look at how the page logic uses memory before choosing the exact method of storing data.
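
For the disk-based variant, the removal step could be sketched like this in Python. The directory layout is entirely an assumption: cached results live in data/<key>.pickle, and an empty marker file at by_table/<table>/<key> records that a cached result depends on a given table.

```python
import os


def invalidate(cache_dir, tables):
    """Delete every cached result tagged with one of the dirtied tables."""
    removed = 0
    for table in tables:
        tag_dir = os.path.join(cache_dir, "by_table", table)
        if not os.path.isdir(tag_dir):
            continue  # nothing was ever cached against this table
        for key in os.listdir(tag_dir):
            data_path = os.path.join(cache_dir, "data", key + ".pickle")
            # A result tagged with two dirty tables may already be gone.
            if os.path.exists(data_path):
                os.remove(data_path)
                removed += 1
            os.remove(os.path.join(tag_dir, key))
    return removed
```

Marker files keep the lookup by table cheap without a separate index, at the cost of one extra tiny file per table a query touches.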


It wouldn’t be that much harder to implement a row-level solution as well, but the less granular solution is much simpler, and since we’re talking about scaling and not performance (although you do see some performance gains in this kind of thing), I don’t think it is worth spending the time implementing row-level caching. If you did, that would make a good case for doing it via a memory-only mechanism, as I can see running out of file descriptors very quickly.


Anyhow, in the event that the list is empty, you are in good shape. The page logic continues on with its business until a query occurs. At this point, the SQL access layer checks whether the query has been cached before (and not erased since); if so, it loads the pickled data from the file and returns that as the query’s result, with no actual SQL query occurring.


In the event that the cached results do not exist, the query is run normally through the DB, the result is pickled onto the drive via whatever mechanism, and then the result is returned to the page assembly logic.
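
The hit and miss paths together might look something like this Python sketch, using sqlite3 and pickle from the standard library. The function name, the data/ subdirectory, and the key scheme are assumptions for illustration; the write-to-temp-then-rename step is my own addition so that a reader never sees a half-written file, and it is not a substitute for whatever locking a real deployment needs.

```python
import hashlib
import os
import pickle
import sqlite3


def cached_query(conn, cache_dir, sql, params=()):
    """Return rows from the on-disk cache, falling back to the DB on a miss."""
    raw = sql.strip() + "\x00" + repr(tuple(params))
    key = hashlib.sha1(raw.encode("utf-8")).hexdigest()
    data_dir = os.path.join(cache_dir, "data")
    path = os.path.join(data_dir, key + ".pickle")

    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)  # hit: no SQL query occurs
    except FileNotFoundError:
        pass

    # Miss: run the query normally, pickle the result, then return it.
    rows = conn.execute(sql, params).fetchall()
    os.makedirs(data_dir, exist_ok=True)
    tmp = "%s.%d.tmp" % (path, os.getpid())
    with open(tmp, "wb") as fh:
        pickle.dump(rows, fh)
    os.replace(tmp, path)  # atomic rename into place
    return rows
```

Note that a hit never touches the connection at all, which is what allows cached pages to keep being served while the database is unreachable.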


The DB is then given an EXPLAIN query (alternatively, a programmer can inspect the query via traditional text processing methods), and the pickled data is tagged as having to do with those tables, so that when those tables are written to, the cached results can be invalidated properly. Part of this analysis stage examines whether the query is the kind that would dirty the cache, like an INSERT or DELETE; if so, those tables are marked as dirty in the tracking table. The nice thing about this is that on the next page load this very same head will go about the invalidation step, so don’t worry about calling the cache invalidation routines here.
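
Here is a sketch of the text-processing route in Python. The regular expressions are deliberately naive assumptions: they miss comma-separated table lists and can be fooled by keywords inside string literals, so treat them as a starting point rather than a SQL parser. The marker-file layout (by_table/<table>/<key>) and the function names are likewise hypothetical; sitetracking and tablename follow the sample query later in this post.

```python
import os
import re
import sqlite3

# Naive patterns: an identifier following FROM, JOIN, INTO, or UPDATE.
_TABLE_RE = re.compile(
    r"\b(?:from|join|into|update)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)
_WRITE_RE = re.compile(r"^\s*(insert|update|delete|replace)\b", re.IGNORECASE)


def tables_in(sql):
    """Best-effort set of table names mentioned in a query."""
    return {name.lower() for name in _TABLE_RE.findall(sql)}


def is_write(sql):
    """True if the statement is the kind that would dirty the cache."""
    return _WRITE_RE.match(sql) is not None


def tag_entry(cache_dir, key, tables):
    """Record which tables a cached result depends on, one marker per table."""
    for table in tables:
        tag_dir = os.path.join(cache_dir, "by_table", table)
        os.makedirs(tag_dir, exist_ok=True)
        with open(os.path.join(tag_dir, key), "w"):
            pass  # empty marker file


def note_if_write(conn, sql):
    """Mark the tables touched by a write as dirty in the tracking table."""
    if not is_write(sql):
        return set()
    dirty = tables_in(sql)
    conn.executemany(
        "INSERT INTO sitetracking (tablename) VALUES (?)",
        [(t,) for t in sorted(dirty)],
    )
    conn.commit()
    return dirty
```

Only the tracking table is touched on a write; the cache files themselves are left for the invalidation step on the next page load, as described above.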


Important things to keep in mind when implementing:


Allow for queries to assert that they cannot invalidate the cache. Some queries matter less to the overall appearance of the site; logging information, for instance, is something the average end user is unlikely to care about.


Allow for some queries to bypass the caching mechanism. When I wrote my first implementation of this, I felt this was very important. In time I found that I used this option less and less, for good reason: the dirtying step worked fine.


Allow for some queries to not drop cache files in the filesystem, for instance initial user logins or cookie queries. In the event that the DB machine goes down or the DB otherwise becomes unavailable, the web machine will continue to serve anonymous pages with no problem, but will be unable to serve user-specific data.


You might think this would only be good for generic, non-customized results. This is not the case, as most queries are based on parameters that users commonly share. More importantly, there is little penalty for using the system. For instance, suppose you do a query that has no cached result: you are then looking at 1 extra query, which is bad, but when taken against the overall savings that such a system brings, it is well worth it, as are the extra file descriptor searches.


The very cool thing about this is that you can derive some level of safe degradation of a web site with this mechanism. In the event that a DB goes down, the caching system continues to serve pages from the cache, which can be desirable and even outrageously useful. It is worth pointing out, before you even think about it, that it is not wise to NFS-share these cached data files. Nor should you assume that I’ve gone into every detail here; I haven’t mentioned locking the files to ensure you are getting a good copy of the cached data, etc.


When I implemented such a system, I made some pretty ridiculous pages and only noticed real delays when I had over 300 simulated queries per page (with 2 real queries, 1 tracking and 1 log). An additional benefit of the system is that the tracking queries are really simple (something like: select tablename from sitetracking where trackid > lastidchecked) and are readily cached at the server.


I wrote some implementation code for a CMS I was writing some time ago but abandoned; you are welcome to email me for a copy.

I should point out that I wrote a PHP version when I was not very familiar with that language, so no promises as to its overall PHPness in the event you look at it. The only real problem is that implementing this in your code can be very specific to your schema. The real answer is for me (or you) to reimplement this system as a shim ODBC driver, so I’ve been looking up that spec, but in the meantime I thought I’d post it here for people to consider when constructing web sites and services with a mind towards robust scalability.

In which Chris talks about caching methods designed to scale large sites without loading your database servers.

David Sklar


Related link: http://www.nytimes.com/2003/10/16/technology/circuits/16mine.html

This article about text analysis tools is informative and offers a reasonable conclusion — they’re (right now) best used to assist human editors or decision makers, not to make decisions on their own.

What are your experiences with text-mining? Helpful? Confusing?


When Open Government Information Awareness first launched, I found it an interesting idea: enable citizens to keep track of their government. It seemed the perfect retaliatory reaction to TIA, once called Total Information Awareness and now Terrorist Information Awareness, the controversial DARPA program to help the government keep track of people/terrorists. Quoting GIA’s FAQ:

The system presents itself to users as a Web site, but is actually a suite of information technologies that actively peruse data, accept contributions, and post alerts about government. The system will accommodate information of almost any type, allowing users to sort through volumes of information which would otherwise be unusable. More importantly, the system allows for people to submit any information, while retaining anonymity, but while also being identified as a consistent source.

This system allows citizens to post information to help others understand what is going on with their government and their elected officials. It’s like a bulletin board system for people interested in what their government is up to, and some of the tools on the site also allow it to be a specialised Google for aggregated government-related content. But the hard problem is: how do you know that what people are contributing is true? What if the contributions are total rumours or, worse yet, outright lies? The people running GIA could be held legally responsible for that misinformation. And, worse still I think, the public is getting bad information on which to base its judgements.

A recent article in New Scientist mentions that the creators believe the solution to this “trust” problem is to make the system distributed and peer-to-peer:

They hope that following the Napster approach will get them round this problem. Instead of storing the data on a single server, so-called peer-to-peer networks hold data in a number of locations around the internet, from where it can be downloaded directly.

While this type of P2P solution would alleviate the legal problems the creators of the system may face (similar to the Napster vs. Gnutella architectural decisions), it also has the possibility of tainting the quality of the information that gets out to the public. Instead of building a reputation engine into GIA which would allow people to rate, in a P2P fashion, whether the contributed information is correct or not and whether the contributor is trustworthy or not, GIA is using P2P to spawn off more copies of itself. This seems to provide more opportunities for the public to get bad knowledge of what their government is up to. A trust system on one centralized system is hard enough to get right — but a trust system spread out amongst lots of autonomous servers is even harder, if not impossible.

Think of it this way. From the iTunes Music Store, you can download music that you know will be right. It’s going to be of good quality and it will (with very good probability) be the piece of music that you think you are downloading, because you are getting it from a reputable source (one that takes the time to verify). If instead you go onto Kazaa and try to download the same piece of music, you have no guarantee the song you are downloading is the song you wanted. It could be mislabeled. It could be a horrible recording of it. The creators of GIA strive very hard to be neutral in this “information battlefield”, but if they want to provide a service to the public, they have to be neutral in an “NPR” sense and at least do some work in verification of information, deliver some means of proving credible sources, or at least mark which information they have no way of proving.

Spawning off many versions of GIA to be controlled by different people seems to be a way of taking the legal burden of getting any of the information right off the site maintainers. But this seems to serve the creators well and the public not at all. The real problem is that, unlike in the iTunes and Kazaa example, downloading unreliable information about the government is not simply annoying; it is very dangerous to the public as a whole.

David Sklar


Related link: http://www.plosbiology.org/plosonline/?request=index-html

The first issue of PLoS Biology contains an article detailing a brain-machine interface (where the brains belong to macaque monkeys and the machine is a robot arm with a gripper) that enabled monkeys to cause robot arm movements by *thinking about* how they would perform the same movements with their own actual monkey arms.

There are obvious potential benefits to humans with spinal cord injuries, but this can also lead to all sorts of (cool, scary, exciting, bizarre, pick your adjective) possibilities: remote controlled golem-avatars, mentally-controlled bionic body additions (extra arms, legs, fingers, an elephant-like trunk, etc.), new “user interfaces” (if such an archaic term would even still apply) for computer systems and VR, recording and playback of neuronal sequences that trigger various actions, etc.

What would you do with a brain-machine interface?

Andy Oram


Two AMD employees came by the Friends Of O’Reilly (FOO) camp yesterday
to build two systems based on their 64-bit Opteron processor, running
SuSE Linux. Our lead database programmer, Tim Allwine, is trying out
one of these in the hope it will vastly speed up the generation of
statistics regarding book sales. With the 64-bit addressing and eight
gigabytes RAM, he can instruct MySQL to slurp all the data into memory
and go from there. I also caught some information on GNOME, security,
and the Segway–all in one day.

Opteron on

Rich Brunner, AMD Fellow, introduced their chip in a brief
presentation. Some of the interesting parameters are:

  • The default address size is 64 bits, but the default data size is 32
    bits, which helps keep code small.

  • Doubling the number as well as the size of registers offers a major
    speed-up for 64-bit applications.

  • An integrated memory controller halves the latency of memory accesses.

Memory hangs directly off of each processor, whereas in earlier chips
it had to go through the same bridge as I/O. Furthermore, each
processor can support three of AMD’s HyperTransport interconnects,
which makes it easy to connect multiple processors. Accessing another
processor’s memory takes only a bit longer than accessing attached
memory, making NUMA less of a burden.

Other events

All day yesterday I talked to people I’ve been hoping to meet and
others I was glad to discover. Two authors who had been told not to
come unless they finished their book showed up at a quarter to
midnight. I rode a Segway, finding that I could do some fairly
sophisticated maneuvering right away.

Nat Friedman of Ximian presented his nifty search tool Dashboard,
which he had shown at the O’Reilly Open Source conference last July,
but which now sports a couple new features like an index for
everything on the desktop. He is leaving tomorrow for India, where he
will meet with a large number of programmers employed by Novell, the
company that bought Ximian recently. He will recruit 30 to 60 of these
programmers to work on GNOME and help them learn the social
conventions of working in a free software environment. Meanwhile,
Ximian will make GNOME more robust and get several important
subsystems in shape, such as plug-n-play and printing through CUPS.
One of their upcoming projects is a blogging tool. They hope soon to
make Mono compliant with the .NET 1.0 spec, or close to compliant.

The hall was full for Bernie Krause, who is billed as a “bioacoustics
author.” He has spent several decades making sound recordings of
natural habitats, and has thus inevitably become the chronicler of
natural habitats’ extinction. It really brings home the impact of
heedless policies to hear recordings of an environment’s richness
before selective logging, and the poverty of sound a few years after
that logging. Even noise from jet planes can disrupt the ability of
species to carry on the business of living.

Author and telecom expert Brian McConnell described a phone system
he’s marketing that allows cellular phone users to dictate email
messages. The operator-assisted service delivers them to people’s
email accounts as attached MP3 files. He expounded on the value of
mixing humans and machines, using the strengths of each in the system.
He also mentioned that phone companies should have saved the billions
they invested in 3G and developed slow but useful technologies such as
GPRS instead.

I learned from a security researcher why fewer security problems than
one might hope would go away if all common servers were coded in Java
instead of C. It seems that we’re always inventing new, cool ways to
intertwine systems in more and more complex ways, each of which opens
up new attacks. Typically, security experts can find five exploitable
errors in each thousand lines of C code, while for Java code the error
rate drops to one in each thousand lines. A thousand lines of code is
not a lot nowadays.

Wrap-up

Leaving aside the beautiful setting and the fascinating attendees, a
couple people here put their fingers on what they love about FOO camp:
they say it has the informal get-togethers and impromptu demos they
always loved about conferences without the formal presentations and
keynotes they always hated. FOO camp takes the frosting from
conferences and leaves out the angelfood cake.

Many people took off last night, but those who stayed have continued
at almost the same intensive pace; many sessions are attracting rooms
full of active participants. I have to admit that my session on doing
books under open licenses (offered by popular demand, strangely
enough) attracted only one person, but we had a fascinating
hour-and-a-half long discussion spanning scads of topics.

Hardly anything has been posted on the Internet about the camp. I
think most people don’t know what to make of it yet. But in a few
months one may start hearing people say, “This new project came out of
something I was chatting to folks about at FOO camp.”

I also covered FOO camp in yesterday’s weblog.

Andy Oram


Someone is mixing some kind of complicated baking project by hand in
the kitchen at O’Reilly & Associates, while I sit drinking weak
coffee and listening to someone who works on a collaborative
story-telling system compare notes with a bioinformatician of Inca
extraction who is describing the oral preservation of Inca culture.
This is the very start of the first full day of FOO (Friends Of
O’Reilly) camp.

I do not want to play up the curiosities of this camp too much,
but use it as an example of the fertility of environments that bring
together people of different interests and expert backgrounds. Some of
the same things go on at the O’Reilly Emerging Technology conferences,
or any college.

Last night I joined a group of over 20 people (about one-tenth of the
attendees) who were interested in what Stuart Cheshire had to say
about his work to create Apple’s Rendezvous service discovery
technology. It was so interesting he gave two talks and I forced
myself to stay up past my bedtime, with results that were personally
valuable.

I’ve been writing an article where I point out that the email address
is the only persistent address (or at least something approaching
persistence–ask Comcast customers how persistent it is) that can be
used to reach the average person on the Internet. I have been saying
that it’s a shame this persistent address can be used only for email,
and was expressing the wish that DNS could be extended to provide a
general server record that’s like the MX record but that could be used
for arbitrary servers and protocols. It might make it easier to
develop peer-to-peer applications.

Well, I found out during Cheshire’s talk at 11:30 last night that DNS
actually does boast such a record, the SRV record added some five
years ago by DNS maintainer Paul Vixie.

Thanks to the wireless hub that several O’Reilly employees worked hard
to set up at the camp, I quickly found Vixie’s RFC 2782 and other
documents describing the SRV record. In the morning, I found Vixie
himself while he was setting up a complex contraption that made him
some much better coffee than I myself was drinking. I thanked Vixie
for his work and queried him about other uses for the SRV record, and
found out some other interesting background in his still-drowsy state.
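
For readers who have not met the SRV record, here is a minimal sketch of its shape as laid out in RFC 2782: the query name is built as _service._proto.domain, and each answer carries a priority, a weight, a port, and a target host. The class SrvRecord and its methods are hypothetical helpers written for this illustration, and the host name and numbers in the usage note are invented; nothing here performs an actual DNS lookup.

```java
import java.util.Locale;

// Hypothetical helper, not part of any DNS library: it only builds the
// query name and splits the record data in the layout RFC 2782 describes.
final class SrvRecord {
    final int priority;
    final int weight;
    final int port;
    final String target;

    SrvRecord(int priority, int weight, int port, String target) {
        this.priority = priority;
        this.weight = weight;
        this.port = port;
        this.target = target;
    }

    // SRV owner names have the form _service._proto.domain
    static String queryName(String service, String proto, String domain) {
        return "_" + service.toLowerCase(Locale.ROOT)
             + "._" + proto.toLowerCase(Locale.ROOT)
             + "." + domain;
    }

    // Record data is "priority weight port target", separated by whitespace.
    static SrvRecord parse(String rdata) {
        String[] fields = rdata.trim().split("\\s+");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + rdata);
        }
        return new SrvRecord(Integer.parseInt(fields[0]),
                             Integer.parseInt(fields[1]),
                             Integer.parseInt(fields[2]),
                             fields[3]);
    }
}
```

A client looking for a made-up service would ask for SrvRecord.queryName("http", "tcp", "example.com") and could feed an answer such as "10 60 8080 bigbox.example.com." to SrvRecord.parse. Unlike MX, the answer names a port as well as a host, which is what makes the record usable for arbitrary servers and protocols.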

Well, the attendees at my table have established that the Inca empire
was probably not the kind of place where this kind of open exchange
could take place. But we could do a lot more of it in our society.

Rod Chavez


i’ve spent the last week trying to get my brain wrapped around the issues
involved with performing XML comparisons and how they might be solved. i also
decided early on in my reading and discussions that i’d better attack this
problem in stages, otherwise i’d have nothing to show for it for a very long
time. in other words, this is a non-trivial problem. at least for me <g>

so what makes this a hard problem? after all, there’s a ton of “difference”
programs out there, like diff, windiff, tkdiff, etc. why not take the two
XML-documents to be compared and run one of these programs over them and see
what it says? it will certainly tell us something, why can’t we stop there?

the short answer is that while XML is a textual format, it is more than “just
text”. it has structure too, and in order to provide a useful diff this
structure must be taken into account


the rest of this post assumes some familiarity with XML, at least at the
high-level. if you’ve never seen XML before (ie, you’ve spent the last 5 years
under a rock <g>), i’d recommend taking a quick look at Norman Walsh’s
A Technical Introduction to XML

let’s take a look at some real XML and see what some differences are, and
aren’t. consider the following two documents:

<foo>
<bar/>
<baz/>
</foo>

<foo>
<bar></bar>
<baz></baz>
</foo>

from a “pure text” point-of-view, these documents are different. in particular,
lines 2 and 3 are different. but from an XML point-of-view, these documents can
be considered identical. they both have a “root” element foo, which in
turn has two child elements, bar and baz. the fact that in the
first document the elements bar and baz are atomic, and in the second
they aren’t, doesn’t change the fact that they are both childless, entityless
tags. hence, the two documents are considered equal. simple, eh? maybe, but
things get complicated quick
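
as a hedged illustration of the point above, here is a minimal sketch using the JDK's built-in DOM parser. the class name XmlSameness and its methods are made up for this example; isEqualNode is the standard DOM Level 3 call. one assumption to keep in mind: isEqualNode also compares whitespace text nodes, so this only reports "same" because the two sample documents are laid out line-for-line identically

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// hypothetical helper for illustration only
final class XmlSameness {

    // parse a string and return its root element
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // true when both trees have the same elements, attributes and text,
    // regardless of whether empty elements are written <bar/> or <bar></bar>
    static boolean sameXml(String rig, String rox) {
        return root(rig).isEqualNode(root(rox));
    }

    public static void main(String[] args) {
        String rig = "<foo>\n<bar/>\n<baz/>\n</foo>";
        String rox = "<foo>\n<bar></bar>\n<baz></baz>\n</foo>";
        System.out.println("same text: " + rig.equals(rox));
        System.out.println("same XML:  " + sameXml(rig, rox));
    }
}
```

the text comparison says the documents differ, while the DOM comparison says they match, which is exactly the gap between a text-diff and an XML-aware diff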

next let’s look at the same two documents, but this time the second will now be
different (both in a text and XML way):

<foo>
<bar/>
<baz/>
</foo>

<foo>
<bar a="x"></bar>
<baz></baz>
</foo>

now things have gotten interesting. in the second document, the element
bar now has an attribute a whose value is “x”. so from an XML
point-of-view, these documents are different. “ok, so they’re different, what’s
so hard about that?”. well, the hard part is what do you display? diff
utilities not only tell you if the documents are different, they tell you
how they are different. they use different mechanisms to indicate where
and what type of differences there are. let’s take a look at some choices,
easiest to hardest:

  1. physical text-diff
    this is the easiest to understand (and implement), but it’s the least useful
    for XML. in the latter case above, both the bar and baz elements
    would be shown as deleted and then inserted. this is due to the line-oriented
    viewpoint of a text-diff
  2. logical XML-diff
    now we’re doing better. the baz element would not be shown as
    having changed, while the bar element would. but not the entire element
    since the only change is that an attribute has been added. so rather than
    having the line with the bar element seeming to be deleted and then
    added, you’d want to see just the added attribute appearing

    you can see an example of this on Microsoft’s GotDotNet site where they have
    an online XML Diff and Patch tool. if you feed the two documents we’re
    discussing to their service, they will perform a logical XML-diff, and
    display just the added attribute high-lighted in yellow

    but there’s still a problem with this, which is that while you’re now viewing
    the difference between the two documents, you’re not seeing either
    document. you’re just seeing the change relative to the logical structure of
    the documents. why would this be a problem? well, suppose you had changed an
    XML document that was being tracked under a source-code control system like
    cvs. after seeing the differences, you might want to merge just some of the
    changes that have occurred between the two documents. in the logical view
    you don’t even know what line the change occurred on. in a small, simple XML
    document this wouldn’t be that big a deal, but what about something more
    involved, like a WSDL or a Schema file? trying to take the changes you’re
    interested in and “back” them into one of the files might prove to be not
    only tedious, but error-prone as well

    btw, please note that i’m not trying to pick on Microsoft’s “XML Diff and
    Patch” tool. it’s actually very good. i’m just trying to point out the merits of
    the various types of diff strategies, and point to samples where they
    exist on the web. as tools go, XML Diff and Patch is quite complete

  3. physical XML-diff
    this one has the potential to be very useful. it would understand not just what
    had changed between the two documents from the perspective of XML, but it would
    also understand how to relate any discovered differences back to the physical
    layout of the original document. so you’d see the original document in the
    display, and any changes to be displayed (red things being deleted, yellow
    things being added) would appear in the right place. the tricky part is how
    to perform a logical XML-diff while still somehow preserving the information
    needed to relate differences back correctly
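
to make the "logical XML-diff" idea a bit more concrete, here is a rough sketch that reports only attribute-level changes. it is not how Microsoft's tool or any real XML differ works. the class AttributeDiff and its report format are invented for this example, and it assumes both documents contain the same elements in the same order, which a real tool cannot assume

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// hypothetical sketch: attribute differences only, elements matched by position
final class AttributeDiff {

    static List<String> diff(String rigXml, String roxXml) {
        List<String> report = new ArrayList<String>();
        walk(root(rigXml), root(roxXml), "", report);
        return report;
    }

    private static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    private static void walk(Element rig, Element rox, String parent, List<String> report) {
        String path = parent + "/" + rig.getTagName();

        // attributes present in rox are either new or changed relative to rig
        NamedNodeMap roxAttrs = rox.getAttributes();
        for (int i = 0; i < roxAttrs.getLength(); i++) {
            Node attr = roxAttrs.item(i);
            String name = attr.getNodeName();
            if (!rig.hasAttribute(name)) {
                report.add("added " + path + "/@" + name + "=" + attr.getNodeValue());
            } else if (!rig.getAttribute(name).equals(attr.getNodeValue())) {
                report.add("changed " + path + "/@" + name + "=" + attr.getNodeValue());
            }
        }

        // attributes present only in rig were removed
        NamedNodeMap rigAttrs = rig.getAttributes();
        for (int i = 0; i < rigAttrs.getLength(); i++) {
            String name = rigAttrs.item(i).getNodeName();
            if (!rox.hasAttribute(name)) {
                report.add("removed " + path + "/@" + name);
            }
        }

        // recurse into child elements pairwise, skipping whitespace text nodes
        List<Element> rigKids = childElements(rig);
        List<Element> roxKids = childElements(rox);
        for (int i = 0; i < Math.min(rigKids.size(), roxKids.size()); i++) {
            walk(rigKids.get(i), roxKids.get(i), path, report);
        }
    }

    private static List<Element> childElements(Element parent) {
        List<Element> kids = new ArrayList<Element>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                kids.add((Element) n);
            }
        }
        return kids;
    }
}
```

fed the second pair of sample documents, this reports a single added attribute on bar and says nothing about baz. notice that the report is a path into the tree, not a line number, which is the weakness of the logical view described in item 2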

having laid out the problem space somewhat, i’d like nothing better than
to tell you i’ve got the whole thing wrapped up, here’s an implementation, now
go play with it… NOT! unfortunately this problem is kinda tricky, and i’m an
incrementalist by nature. i like to start off with the small, easy parts of a
problem, solve those to get familiar with the issues, and then build momentum
towards the hard parts (or, at least, the parts that seem hard to me)

so i decided to attack this whole area in the order i laid out above. i wrote a
physical text-diff online service that you can play with right now, or you can
download and install it yourself on any compatible servlet container. i’ve only
run it on WLS, but it should run elsewhere

if you know how to install a WAR (since that’s how it’s packaged) on your
servlet container, you’re all set. check your server’s documentation. for WLS,
the following command (all on one line) will do the trick, assuming you’ve got
the install i documented in my last blog post. otherwise, make the appropriate
changes for your installation

/bea/jdk141_03/bin/java -cp /bea/weblogic81/server/lib/weblogic.jar
    weblogic.Deployer -adminurl http://localhost:80 -username weblogic
    -password weblogic -name diff -deploy diff.war

btw, this WAR includes all the source, so you’re welcome to check it out,
modify it, etc. please let me know if you find any bugs, especially if you fix
them <g>

now let’s walk through what it does and how it works. first, you file upload
two documents. in order to perform file upload, i’m using some Java classes
written by Jason Hunter. it’s a very useful set of utilities that help the
developer with an aspect of servlet programming that for some reason has never
been addressed by the J2EE standard itself

let’s take a look at the file upload process in more detail. on the client, you
want a form tag whose encoding is “multipart/form-data” and method is
“post”. here’s the relevant part of the HTML

<form ENCTYPE="multipart/form-data" ACTION="phys_text_diff/" METHOD="POST">
    <input name="rig" type="file" size="50"/>
    <input name="rox" type="file" size="50"/>
    <input value="compare" type="submit"/>
</form>

i’ve stripped out any of the formatting, so you can see just the form and the
file upload controls. i use a table for formatting, and that’s pretty gruesome
to look at. you can see the two file upload controls, the submit button and the
form tag wrapping them all

now let’s look at the server side. as you can see from the client part, these
files are arriving as part of an HTML-POST so we’ve got to have our code inside
the doPost(…) servlet method. i could also have done this inside a
JSP, but i got started writing a straight servlet and never looked back. since
that’s not really the point of this exercise, it doesn’t seem too big a deal,
but i might change that later. what do you think?

protected void doPost(HttpServletRequest req, HttpServletResponse resp)
	throws java.io.IOException
{
	MultipartParser mp = new MultipartParser(req, 1024 * 400);
	Part part;

	...

	while ((part = mp.readNextPart()) != null)
	{
		if (part.isFile())
		{
			// save file contents while they're available
			FilePart filePart = (FilePart) part;

			...
		}
	}

	...
}

again for clarity, i’ve stripped away some of the boilerplate in order to focus
on the details. in this case, you can see a
com.oreilly.servlet.multipart.MultipartParser being created (you have to
hand in the HttpServletRequest and the max file size you want to handle), and
then the code loops over each com.oreilly.servlet.multipart.Part, testing each
in turn to see if it is a “file”. if it is, the code casts the Part to a
com.oreilly.servlet.multipart.FilePart and starts pulling out the
contents of the file. note: make sure you drain the contents of each
Part as it’s returned. once you get the next Part, all the contents of the
previous Part are lost. so if you didn’t copy them off, they’re gone. Jason’s
doc is pretty clear on this point, but i still screwed this up in my initial
implementation
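
the draining step is elided in the listing above, so here is one hedged guess at what it could look like, using only the JDK. PartDrainer and readLines are hypothetical names, and UTF-8 is an assumption about the uploaded files. in the servlet, the stream would presumably come from the FilePart (the library documents a getInputStream() method for that), and the result would end up in lists like the rigList and roxList used further down

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// hypothetical helper: copies a stream into memory, one list entry per line
final class PartDrainer {

    // reads the stream to the end before returning, so nothing is lost
    // when the parser moves on to the next part
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<String>();
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

the important design point is that the whole part is copied before readNextPart() is called again. holding on to the stream and reading it later would hit exactly the lost-contents problem described above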

if you’re interested in other code samples and information on file-uploads,
here’s something you might want to take a look at

now that we’ve got the files up on the server, we need to compare them and
report the differences. not surprisingly, this turns out to be a “hard”
problem. for example, let’s imagine a file 10,000 lines long, where each line
is unique with respect to all other lines. then let’s suppose you copy the file
and in the copy, delete the second line. when you compare these two files, you
want to be told that there’s one line missing in the second file. but a valid
(and useless) answer would be that 9,999 lines had been deleted, and a
different 9,998 lines had been added. in a sense, both of these answers are
correct, but only one is useful. we want that one <g>

to understand this better, i spent some time googling around. it turns out that
this problem, comparing long sequences of data to find the useful
differences, comes up over and over again, in very diverse settings. for
example, research is being done to compare very long biosequences, in an area
known as Computational Biology. here’s a paper i found on the subject

the interesting thing about many of the papers i found is that when you
back-track the references from almost any of them, you either directly or
indirectly arrive at a paper published in 1976 by J. W. Hunt and M. D. McIlroy,
An Algorithm for Differential File Comparison. it’s pretty readable and
lays out the problem quite nicely. in a nutshell, it describes the problem as
taking two files as input and then computing the minimum number of changes
required to transform one file into the other. if this sounds familiar, this
paper was written to document the research that went into the original
implementation of diff
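
to show what "minimum number of changes" means in practice, here is a small sketch built on the textbook dynamic-programming solution to the longest common subsequence problem. this is deliberately not the Hunt and McIlroy algorithm, and not what GNU diff does. it uses memory proportional to the product of the two file lengths, so it is only suitable for small inputs. the class name LcsDiff and the output markers are made up for the example

```java
import java.util.ArrayList;
import java.util.List;

// textbook LCS-based diff, for illustration only (quadratic time and memory)
final class LcsDiff {

    // returns an edit script: "  x" kept, "- x" only in rig, "+ x" only in rox
    static List<String> diff(String[] rig, String[] rox) {
        int n = rig.length;
        int m = rox.length;

        // lcs[i][j] = length of the longest common subsequence
        // of rig[i..] and rox[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = rig[i].equals(rox[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // walk the table from the top, keeping as many lines as possible
        List<String> script = new ArrayList<String>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (rig[i].equals(rox[j])) {
                script.add("  " + rig[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                script.add("- " + rig[i]);
                i++;
            } else {
                script.add("+ " + rox[j]);
                j++;
            }
        }
        while (i < n) {
            script.add("- " + rig[i++]);
        }
        while (j < m) {
            script.add("+ " + rox[j++]);
        }
        return script;
    }
}
```

run on a four-line file and a copy with its second line removed, the script keeps three lines and marks exactly one deletion, which is the useful answer rather than the "everything changed" one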

while i found this information very interesting, i wasn’t looking forward to
turning this (or any of the other) papers directly into working code. i googled
around for implementations i could use, and found GNU Diff for Java. it’s a
port of GNU Diff into Java. it’s all represented by the class bmsi.util.Diff.
it’s provided as is, and its license is GPL. it works, and pretty much does
exactly what i wanted. here’s how i used it

String[] rigLines = (String[]) rigList.toArray(new String[0]);
String[] roxLines = (String[]) roxList.toArray(new String[0]);
Diff diff = new Diff(rigLines, roxLines);
Diff.change change = diff.diff(Diff.forwardScript);

here you see a bmsi.util.Diff being created while taking as input two
String arrays, each array representing a file, and each element of the array
being a line from that file. next, the diff(…) method is called with an
argument indicating that the results should be returned in “forward” order,
from the top of the file to the bottom. the results themselves arrive in the
form of a linked list of bmsi.util.Diff.change objects

btw, looking at the HTML and Java, you’ll notice that i
use the words rig and rox as identifiers or prefixes. some of you
will recognize this as a somewhat subtle science-fiction reference. in any case, rig is
short for “original”, and rox is short for “copy”

finally, it’s time to format the results.

int rigLine = 0;
int roxLine = 0;

while (change != null)
{
	if (change.deleted != 0)	// DELETED
	{
		// deleted - number of lines in rig
		// line0 - first line deleted in rig
		// line1 - line before delete in rox

		if (rigLine < change.line0)
		{
			int clearLines = change.line0 - rigLine;

			printDiffLines(pw, rigLine, roxLine, rigLines, clearLines, 0);
			rigLine += clearLines;
			roxLine += clearLines;
		}

		printDiffLines(pw, rigLine, roxLine, rigLines, change.deleted, -1);
		rigLine += change.deleted;
	}

	if (change.inserted != 0)	// INSERTED
	{
		// inserted - number of lines in rox
		// line0 - line before insert in rig
		// line1 - first line inserted in rox

		if (roxLine < change.line1)
		{
			int clearLines = change.line1 - roxLine;

			printDiffLines(pw, rigLine, roxLine, rigLines, clearLines, 0);
			rigLine += clearLines;
			roxLine += clearLines;
		}

		printDiffLines(pw, rigLine, roxLine, roxLines, change.inserted, 1);
		roxLine += change.inserted;
	}

	change = change.link;
}

if (rigLine < rigLines.length || roxLine < roxLines.length)
	printDiffLines(pw, rigLine, roxLine, rigLines, rigLines.length - rigLine, 0);

so what’s going on here? it’s not too complicated, but it took me a while to
get this just right (famous last words <g>). first, the code loops over each
item in the change object until there are no more changes to process. the first
tricky point i had to deal with is that while we’re dealing with a list of
changes, i still need to display all the lines in both files. in other words,
if the first change occurs on line 5, i’ve still gotta print lines 1 through 4

the two variables rigLine and roxLine are there to deal with
this. you can see these used inside both the “DELETED” and “INSERTED” sections
of the while loop. the if in both those sections checks to see if
there are unprinted lines before the current change. if there are, they are
printed. finally, after the loop terminates, if there are any lines left they
are printed as well

at the bottom of the two sections, the lines of the change itself are printed.
and of course, anytime lines are printed, rigLine and roxLine are
incremented
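
the printDiffLines helper is called throughout the loop but never shown, so the following is only a guess at its shape, inferred from the six arguments at the call sites. DiffPrinter is a hypothetical class name, and the plain-text markers are an assumption. the real helper may well produce HTML instead

```java
import java.io.PrintWriter;

// hypothetical stand-in for the printDiffLines helper used in the loop above
final class DiffPrinter {

    // kind: 0 = unchanged, -1 = deleted from rig, 1 = inserted in rox,
    // matching the last argument at each call site
    static void printDiffLines(PrintWriter pw, int rigLine, int roxLine,
                               String[] lines, int count, int kind) {
        // inserted lines come from rox, so they are indexed by roxLine;
        // unchanged and deleted lines come from rig
        int start = (kind > 0) ? roxLine : rigLine;
        char mark = (kind < 0) ? '-' : (kind > 0) ? '+' : ' ';
        for (int k = 0; k < count; k++) {
            pw.println(mark + " " + lines[start + k]);
        }
    }
}
```

the one decision worth noticing is which counter indexes the array: every call that passes rigLines wants rigLine as the starting point, and the single call that passes roxLines wants roxLine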

well, that’s pretty much it. as you can tell, there’s still lots to do. i’ve
got lots more info about other products and libraries that take a crack at this
problem, and others besides. for example, how about supporting merge?
there seem to be several approaches to this. and how about schema? so far i
haven’t really seen anyone take on that problem. still looking. finally, what
about having this available as a web-service? and if you have any ideas,
comments or suggestions, please don’t keep them to yourself. let me know

any thoughts on the diff problem in general? or XML-diff in particular?

chromatic


Related link: http://news.com.com/2010-1071-5086769.html

Computers had just started to reach schools as I was growing up. The first
time I ever used a computer was second or third grade; there was a new
Commodore 64 in the corner one day. From then, there was the usual assortment
of Apple IIs, Ataris, and, by high school, the venerable PS/2.

Since computers hadn’t yet become pervasive, the libraries used the classic
card catalog: several rows of deep drawers, with thousands of cards apiece,
listing available books by subject, title, and author. This seems like an
obvious use of a computer now, but the expenses of hardware, software, and data
conversion were prohibitive until my undergraduate days.

The card catalog answered the two most interesting questions:

  • Does this book exist?
  • Where can I find it?

Granted, several research projects in those days meant flipping through
subject cards, trying to find the right terms to describe books. Then, you’d
write down the appropriate Dewey Decimal System numbers (in most libraries
anyway) and track down the appropriate shelves. If you were lucky, the book
would be available. If you weren’t lucky, you’d be in the right area for the
subject matter.

Of course, that all depended on being able to reach the card catalog.
Sometimes there were long lines. Other times, someone would have pulled out
the Ar - Az drawer to peruse on a table. I don’t recall any instances of
vandalized or missing cards, but it must have happened.

While the old system had worked pretty well for decades, moving all of this
information to computers was very valuable. Since a computerized catalog is
just a client-server system, libraries could buy several dumb terminals and one
decent server. Not only could multiple people search in one “drawer”
simultaneously (as far as there were drawers in the new system), but searching
was much faster.

There was, and is, room for many other improvements. You
could tie in the catalog to the book registry and get a real-time view of whether
a book had been checked out. (It’s still a little difficult to know if
someone’s waiting in line to check it out downstairs or if it’s sitting in the
Reshelve cart waiting for a library assistant, but that’s progress anyway.) If
your library is part of a group of libraries sharing catalog information, you
can see if a branch location or another library has the book. You can even
request it from another location. It’s much easier to see related subjects and
works, as well.

With all of this innovation, the computer catalog never stopped answering
the fundamental questions:

  • Does this book exist?
  • Where can I find it?

Woe to any card catalog that answered a query for a non-existent book with a
Dewey Decimal Number of 1024, pointing to an empty shelf labelled “Your Book
Does Not Exist. And now for a word from our sponsor….”

Andy Lester


In the arms race of spam prevention, content-based filters, including any Bayesian ones you care to throw at it, have been beaten. Until we get truly intelligent recognition, where a computer is smart enough to know that a subject of “She will love you for it” is Viagra spam, and that “I was at the end of my rope until I found this” is some money scam, the spammers will be able to get any content past the filters.


In addition to the tricks discussed in the ActiveState Field Guide To Spam, spammers have already started foiling the filters by throwing in random real words. I regularly get spam through two levels of filtering (SpamAssassin and Eudora) that looks like this:


      Our rates are the lowest!  You can get 3.45% fixed for
rough pencil final happy
      30-years!  Follow this link to get the best rates
napkins canine amazed
      in the country, but only for a limited time!

The extra random non-spam text foils it. And, since the words are random, tactics to get a checksum or signature on it are, or will be, useless. I suspect it won’t be long before spam comes through with three lines of spam content, and a couple K of random words. If we get to where words that are clearly random are somehow caught, then the spammers will turn to pulling random pages off the net for their obscuring text. Maybe they’ll throw in, say, a few pages of Macbeth to foil things.
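
To see why a checksum cannot keep up, consider this minimal sketch. It is not how any particular filter computes its signatures; SpamChecksum is a hypothetical class, and SHA-256 simply stands in for whatever exact-match digest a filter might use. The two sample messages reuse the pitch and the filler words from the spam quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration: an exact-match digest of a message body.
final class SpamChecksum {

    static String sha256Hex(String message) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(message.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String pitch = "Our rates are the lowest! You can get 3.45% fixed for 30-years!";
        // Same pitch, different random filler words
        System.out.println(sha256Hex(pitch + " rough pencil final happy"));
        System.out.println(sha256Hex(pitch + " napkins canine amazed"));
    }
}
```

The two digests share nothing even though the sales pitch is identical, so a blocklist of known-spam checksums never matches the next copy.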

The answer is to stop the spammers before they get their message in. All content-based filtering depends on the spammer getting their payload to us first, instead of checking them at the gate. This will mean a replacement of SMTP. Until then, SPF seems to have potential, but it has its drawbacks.

Mind you, I’m not throwing away my SpamAssassin install. It helps stop a significant amount of the spam. Unfortunately, content-based filtering is a Band-Aid on the real problem.

Do you see any solution outside of replacing SMTP?