Related link: http://w.eimg.net/partners/sk_minisite/video/SKEarthLink_high.mov
Promotional video on the East vs. West cell phone technology gap and Earthlink’s new partnership with SK.

You can also Google Earthlink + SK
I needed to find out, roughly, how to answer the question:
What is the maximum size text file I can read into a String in Java on a desktop Windows system? Here are my back-of-the-envelope calculations.
Improvements are welcome, and I certainly don’t want to claim that this
is definitive. We’ll assume Sun’s JRE and a modern Windows on Intel.
For a start, there are some absolute limits to the amount
of addressable memory for Java in Windows.
Sun’s JRE for Windows is 32-bit: so 4 Gig must be a ceiling.
Then desktop Windows systems only have a 31-bit User Address Space (see http://teamapproach.ca/trouble/Memory.htm): so we are down to 2 Gig.
Windows pragmatics intervene again here:
the addresses that DLLs load at may fragment the
contiguous space available for the Java heap: apparently
this can reduce the amount available to as little as
1.2 Gig in Java 1.4.2, with 1.5 being a
little better. I had never heard of re-basing before
I looked at this: scary! So we now drop down to, say, 1.3 Gig.
Now what about inside Java?
With the default setting for Java on Windows,
the memory is divided into permanent space, new space
and tenured space. Let’s take out 100 Meg for permanent
space, which may be a little much. That leaves us with
1.2 Gig, which is what we supply to the -Xmx heap argument.
The default ratio of new to tenured space
is 1:12 (100 Meg here), and the tenured space apparently reserves about the same to allow copying at a pinch.
(More precisely, it reserves enough space to allow all of the “Eden” and the current “survivor” space to be copied into it: by default that is 9/10ths of the New space, not a significant difference.) At this stage we have 1 Gig at most available.
Next, we realize that for most Western text data, a Java character doubles the amount of required memory: two bytes per external byte. That brings us down to a 500 Meg file size. (A large object like this would be created directly in the Tenured space.)
Of course, there are many things that could reduce this
limit further: in particular, if we tried to copy the String, or if we chose the wrong kinds of IO operations.
And some people talk about Tenured Space Fragmentation,
which is where there is not enough contiguous space to allocate an object of the right size, even though there may be enough available memory scattered about.
So if 500M is our maximum file size, what kind of configuration of
PC do we need to support that? We want our whole Java process to fit into physical RAM: otherwise we might get thrashing. So we need 1.3 Gig of RAM, plus some extra for the operating system: let’s say 128 Meg for Windows XP.
Because desktop systems frequently have other applications running and open, and their performance really benefits when the working set is rich, let’s say 2 Gig of RAM.
We need to increase the page file so that there is plenty of
memory available for commitment: let’s set it at 3-4 Gig.
So that is my current estimation:
the maximum possible text file to process as a String
on ordinary Java on an ordinary desktop PC is 500 Meg,
and 2 Gig of RAM is as much as you need.
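For anyone who wants to plug in their own numbers, here is the same arithmetic as a small Java sketch. Every constant is just one of the rough figures from above, and the class and method names are made up for illustration. It treats new space as one twelfth of the heap, which matches the round 100 Meg figure used in the text.

```java
// Back-of-the-envelope estimate, in megabytes. Every constant below is a
// rough figure taken from the text above, not a measured value.
class MaxFileEstimate {
    static final int USABLE_ADDRESS_SPACE_MB = 1300; // contiguous space left after DLL fragmentation
    static final int PERMANENT_SPACE_MB = 100;       // generous allowance for permanent space
    static final int NEW_SPACE_DIVISOR = 12;         // new space taken as 1/12 of the heap (assumption)
    static final int BYTES_PER_CHAR = 2;             // a Java char is 16 bits

    /** Heap size to pass to -Xmx. */
    static int heapMb() {
        return USABLE_ADDRESS_SPACE_MB - PERMANENT_SPACE_MB;
    }

    /** Largest single-byte-encoded file that fits in the heap once held as chars. */
    static int maxFileMb() {
        int newSpace = heapMb() / NEW_SPACE_DIVISOR;    // new space
        int copyReserve = newSpace;                     // tenured space keeps about the same free
        int usable = heapMb() - newSpace - copyReserve; // what is left for our one big object
        return usable / BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        System.out.println("-Xmx" + heapMb() + "m");
        System.out.println("max file: " + maxFileMb() + " MB");
    }
}
```

Changing any one constant shows how sensitive the final figure is to the guesses that feed it.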
When you start to add other things, such as actually doing
work with your program and if there is buffered IO,
then that maximum gets even smaller.
Let’s say that we want to read a text file
into a ByteBuffer, transcode it into a CharBuffer,
then read that into some kind of String.
That halves the maximum file size again: 250 Meg.
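A minimal sketch of that pipeline, to show where the copies come from. It assumes the text is ISO-8859-1 (one byte per character) and starts from a byte array rather than a real file; the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    /** Decodes single-byte text; the decode and toString steps each make a new copy. */
    static String bytesToString(byte[] raw) {
        // Wraps the existing array: one byte per character, no copy yet.
        ByteBuffer bytes = ByteBuffer.wrap(raw);
        // Decoding allocates a fresh char buffer: two bytes per character.
        CharBuffer chars = StandardCharsets.ISO_8859_1.decode(bytes);
        // toString() copies the characters again into the String.
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] raw = "plain Western text".getBytes(StandardCharsets.ISO_8859_1);
        String text = bytesToString(raw);
        System.out.println(raw.length + " bytes in, " + text.length() + " chars out");
    }
}
```

Until the garbage collector can reclaim the intermediate buffers, the decoded characters and the String copy are both live, which is the doubling counted above.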
What if we want to do the same, but have it in
a StringBuffer? Java will still try to reserve up to
twice the size for the backing array, and suck up all the available space. So a 167 Meg file may take up as
much RAM as our 250 Meg file!
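The over-allocation is easy to see with StringBuffer’s own capacity() and length() methods. This is only a sketch: the exact growth policy is an implementation detail of the JRE, so the printed capacities will vary, and the helper name slack is made up here.

```java
class BufferSlack {
    /** Chars reserved in the backing array that are not holding text. */
    static int slack(StringBuffer sb) {
        return sb.capacity() - sb.length();
    }

    public static void main(String[] args) {
        // Default constructor: small initial capacity, regrown as we append.
        StringBuffer grown = new StringBuffer();
        for (int i = 0; i < 1000; i++) {
            grown.append('x');
        }
        System.out.println("grown: length=" + grown.length() + " slack=" + slack(grown));

        // Pre-sized: the backing array is allocated once at the size we asked for.
        StringBuffer sized = new StringBuffer(1000);
        for (int i = 0; i < 1000; i++) {
            sized.append('x');
        }
        System.out.println("sized: length=" + sized.length() + " slack=" + slack(sized));
    }
}
```

If you know the file size up front, passing it to the constructor avoids the regrowth, and with it the moment where the old and new backing arrays both exist.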
The bottom line: if you have to read files into your
application that are greater than 167 Meg, you need to
be very careful in tracking memory management:
you probably want to make your own string library
or extend the existing classes where possible. And, on
the face of it, you can forget files greater than 500 Meg.
And parsing the text into smaller objects, such as
a linked list of lines held as Strings, may not save any space
either. This obviously has implications for XML DOMs.
Of course, the answer for large files is either to
implement your own chunking mechanism to allow files larger than the virtual memory, or to move to streaming processing entirely. NIO direct buffers may have something to offer here as well.
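As a rough illustration of the streaming option, here is a sketch that walks through text one line at a time, so memory use depends on the longest line rather than the file size. Counting lines stands in for whatever real per-line work you need; the class name is invented, and a real program would hand it a FileReader or similar instead of a StringReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class StreamingCount {
    /** Counts lines while holding only one line in memory at a time. */
    static long countLines(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            long lines = 0;
            while (in.readLine() != null) {
                lines++; // per-line processing would go here
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countLines(new StringReader("one\ntwo\nthree\n")));
    }
}
```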
I don’t have a PC with 2 Gig, so I cannot experiment
to find out the answer. If anyone has a large memory system
and wants to experiment, I would certainly print the results!
Thinking about Java memory management is
very complex, not least because much material
on the Web is out-of-date or concerns platforms other
than Windows. And much of it concerns Java for server
applications, not desktop applications.
For people interested in this topic, I have two previous
blog entries, and there is some material at
“Editing the Million-Line XML Document”.
Corrections Welcome!
Related link: http://news.com.com/2100-1028_3-5540937.html
Okay, I’ll admit it. I just don’t get it. Los Angeles Sen. Kevin Murray has introduced a bill that would make anyone who sells, advertises, or distributes P2P software responsible for the end users’ actions…to the tune of up to $2500 and a year in jail.
WHAT?! Anyone who advertises P2P software?
What does advertising consist of? Does it have to be paid advertising? Or can I simply mention BitTorrent? (Is that going to cost me $2500 and a year in jail?)
If this bill passes, I want to know if we can start attaching the same types of laws to gun and alcohol manufacturers. P2P may hurt some companies financially, but guns and alcohol bring real pain and suffering to real people. I have yet to hear about someone being killed by a P2P user who lost control.
Heck, let’s go for companies that sell, advertise, or distribute knives, baseball bats, piano wire, bleach, pencils…
Should companies be responsible for their consumers’ use of their products?
Related link: http://rkoman.blogspot.com/2005/01/how-osj-could-have-prevented.html
A few days ago I posted a blog called “Are Blogs the New Journalism?” which garnered some lengthy rebuttals both here and on my blog. I learned something from that conversation and some other reading and started thinking about it in terms of open source.
Back in 1999 my friend Andrew Leonard wrote this on Salon.com:
On Monday, Jane’s Intelligence Review, the “international journal of threat analysis” (a must-read on your average CIA spook’s list), solicited feedback on an article about “cyberterrorism” from the geeks who hang out at the Slashdot “news for nerds” Web site. On Thursday, after the Slashdot members sliced and diced Jane’s story into tiny little pieces, an editor at the magazine announced that the story would not be published as planned. Instead, the editor, Johan J Ingles-le Nobel, declared that he would write a new article incorporating the Slashdot comments, and would compensate Slashdot participants whose words made it into the final copy.
“When you ask for feedback you get feedback,” wrote Nobel, “and since roughly 99% of the posters slammed the article, even saying things like ‘we’d expect better from Jane’s’, I’ve informed the author that we’re not going to run with it. Instead I’m going to cull your comments together and make a better, sharper feature out of it — I’ll be getting in touch with several of you for more specific details or for more clarification.”
This week, CBS released a report on the National Guard documents debacle, firing four producers, but not the head of CBS News or Dan Rather. The story was a scoop of the highest degree. Except that it was wrong. Opinions on how they got it wrong range from “staff is all leftists who wanted to get Bush,” in the words of PowerLine Blog, to the systemic problems of a monolithic monopoly. As has been well reported, the story of the debunking of the documents is the story of blogs, as Time trumpeted a few weeks back.
Conservative bloggers — with an axe to grind — were suspicious. The night the story ran someone named buckethead wrote this on www.freerepublic.com:
“Every single one of these memos to file is in a proportionally spaced font, probably Palatino or Times New Roman. In 1972 people used typewriters for this sort of thing, and typewriters used monospaced fonts. The use of proportionally spaced fonts did not come into common use for office memos until the introduction of laser printers, word processing software, and personal computers. They were not widespread until the mid to late 90’s. Before then, you needed typesetting equipment, and that wasn’t used for personal memos to file. Even the Wang systems that were dominant in the mid 80’s used monospaced fonts. I am saying these documents are forgeries, run through a copier for 15 generations to make them look old. This should be pursued aggressively.”
Pursued it was. Picked up by PowerLine’s Scott Johnson, the idea that the docs were fakes just gained more and more energy, evidence, conspiracy theories, investigation. Johnson’s post, called the 61st Minute, was updated continuously the day after the 60 Minutes piece, including comments like this from Larry Nichols:
As a PSM I had to know every job in Personnel, including the proper filing of documents in individual military records. Memos were NOT used for orders, as the one ordering 1LT Bush to take a physical. This would have done as a letter, of which a copy should have been sent to the CBPO (Consolidated Base Personnel Office) to be filed in 1LT Bush’s military record. Memos DID NOT get filed in personnel records.
Then, over at www.littlegreenfootballs.com, Charles Johnson simply retyped the suspect document in Word and came up with a document that looked identical, he claimed, to the 60 Minutes document. A blog called INDC Journal (www.indcjournal.com) ran a report of an analysis by a forensic scientist. Legitimate signatures of Col. Killian were dug up. Someone pointed out that the memos feature kerning, which typewriters are physically incapable of. Amar Sarwal found that the Gen. Staudt that a 1973 memo refers to actually retired the previous year. Theresa McAteer pointed out that a legit memo written on Sept. 5, 1973, a month after the suspect memo is dated, is typed on a monospace 70s-style typewriter.
By the end of the day, PowerLine’s John Hinderaker put it succinctly: “60 Minutes is toast.”
In a piece celebrating bloggers as People of the Year, Time described the process like this:
“The more comments Johnson posted, the more e-mail he got, which he then posted, generating even more e-mail, and so on. The process turbocharged itself. In all, he updated the post 15 or 20 times over the course of that day. … By 10:30 a.m., Power Line had an arsenal of arguments attacking the memos: typographical, logical, procedural, historical. … The Drudge Report, the Mondo Cane grandfather of all right-leaning news blogs, linked to their site about midafternoon, sending a torrent of traffic their way and promptly crashing their Web server. By the end of the day, about 500 sites had linked to Power Line. ‘I think it’s fair to say that that post that Scott began is probably the most famous post in the young history of the blogosphere,’ [Power Line blogger John] Hinderaker says proudly.”
What’s interesting about this story is not so much that there are “citizen journalists” out there, doing the job that “real” journalists are not doing. In fact, there was no reporting or investigation to the original post, just a bit of reasoning and reasonable suspicion. It was the flood of posts from readers that created a virtuous circle of other people’s ideas, documentary evidence, and widespread dissemination. It is this ecology of facts, opinions and linking that is best described by the term “blogosphere.”
But of course real journalists (producers, editors, reporters) are supposed to be doing this job. It’s a massive failure of journalism for this big a story to be based on a hoax, and for CBS to have backed up the story for so long. But it’s hardly the only fiasco facing Big Journalism these days. The bitter taste of Jayson Blair must still be fresh in press executives’ mouths.
Now think back to 1999 and that Jane’s piece on cyberterrorism. Is it so crazy to imagine that, rather than keeping these documents top secret and going public only in front of millions of viewers and every press outlet in the world, 60 Minutes could have released a few online, to the blogosphere, and received the benefit of its suspicion, research, and nitpicking? Apparently, a few hundred conservative bloggers have to be tougher than any editor inside of CBS.
When you talk about this, you’re talking about Open Source Journalism, which is the opposite of the Art of the Scoop. Open Source Journalism is about getting it right, rather than getting it first. But getting it all is still a job for news organizations; journalism is not dead but it is going to be changing rapidly from here on out.
Will organizations like CBS change with it? Dan Gillmor isn’t optimistic: “I don’t think CBS is, today, institutionally capable of truly understanding the value of listening to its audience — of grasping how much help the audience can be in the journalistic process. The network’s offhanded dismissal of the grassroots continues even now. (I know there are individual people at CBS who do get it. But they are not running things.) That said, it would have been at least tactically smart for CBS to have acknowledged the grassroots component of this debacle. Common-sense PR should have made this obvious. Is this a cynical comment on my part? I guess so, but I hate to see the network compounding the damage so unnecessarily, in part because (unlike some in the blog world) I still value the good stuff CBS does.”
Gillmor, the author of the influential book “We the Media,” has long practiced his own brand of Open Source Journalism. Back in 2001, he talked to Online Journalism Review about his own blog at the San Jose Mercury News, a job he recently left to pursue a grassroots journalism project. “There have been occasions where I put up a note saying, ‘I’m working on the following and here’s what I think I know,’ and the invitation is for the reader to either tell me I’m on the right track, I’m wrong, or at the very least help me find the missing pieces.”
Despite the buzz about blogs as the new journalism, Jay Rosen, author of the PressThink blog, doesn’t think that blogs by themselves represent the end of Big Journalism. “Blogging is only one part of a larger development–citizen’s media,” he writes, “that forces smart people in the press to confront the paradox of the self-informing public, previously thought to exist only at the level of the primordial village.”
A self-informing public is in fact a movement, but it’s not necessarily an antagonist to mainstream media, if journalists will embrace the amazing power of the many, take advantage of their willingness to inform themselves, and meet their expectations for accuracy. To be fair, it won’t be easy because bloggers on both extremes of the political spectrum will be out for blood. But a little blood now could prevent major hemorrhaging in the future.
Are these powerliners truly citizen journalists, or just media-bashing conservatives who smelled a royal f-up?
Related link: http://www.dmsolutions.ca/solutions/tsunami.html
A great application built using Open Source GIS and mapping tools.

From http://www.dmsolutions.ca/solutions/tsunami.html
A few quotes from the intro page:
“Aid workers at ground zero and around the globe can have free access to this spatial data and the resulting mapping functionality via the Internet. In addition, these same workers can contribute their specialized data for use by other agencies.”
“We encourage Data Providers to help us respond to Data User requirements by providing no-cost, high quality (geo-referenced) data that can be quickly and effectively integrated within this Website and used to generate meaningful and helpful maps.”
“This application was made possible through the use of open-source mapping technologies developed and supported by DM Solutions Group. Many of the data services leveraged by this application were published to Open Geospatial Consortium (OGC) Standards through MapServer technology. OGC support within these technologies was made possible in part through funding from the GeoConnections program of Natural Resources Canada in support of the Canadian Geospatial Data Infrastructure (CGDI).”
Related link: https://fi.dev.java.net/
I’m quite bullish about W3C’s “Binary XML Infoset”
project, after looking at the java.net Open Source library Fast Infoset,
which has been mentioned in a few blogs recently.
The thing I like is that it reminds me of
XML’s development: I think respect-based standards
are a win all around.
Here are some of the similarities with XML:
But the biggest way that the W3C Binary project and
Fast Infoset remind me of XML’s development is
respect. Under Sun’s Jon Bosak, XML’s development was based on respect: respect that people have different legitimate requirements, respect for standards, developers, open source and commercial imperatives, respect for industry experience, respect for internationalization, and so on.
XML is a rebadged standard technology that
emerged out of this discipline of basic respect.
I was shocked, in some other standards groups I have seen,
by how little respect there was: acting with no respect for the requirements of anyone other than your own company’s
current customers seems to be a sure way to guarantee
a crappy standard. In developing ISO Schematron, I really
tried to accept and correct any “big picture” criticism.
I read recently a criticism of the “Binary XML Infoset”
project as polluting the stream. I believe the lesson
to be learned from XML is not that “Everyone should use one format, it should be simple, it should be Unicode, it should use angle brackets” but the far more challenging “Respect-driven standards development produces
really good and generally applicable results.”
The other really nice thing in Fast Infoset is that,
apparently, you can define your own datatypes more readily, especially for lists of numbers and so on. XML Schemas datatypes went severely wrong by rejecting the old SGML idea of notations: that there are an infinite number of data formats you might want to embed in XML elements or external documents. Extensible embedded data formats have been resurrected in a better form by Jeni Tennison’s extensible XML Datatypes library.
Can’t a guy get any respect around here?
Related link: http://www.nytimes.com/2005/01/11/technology/11soft.html?oref=login&oref=regi
Today IBM announced that they will allow open source developers to share 500 of their patents to establish a patent commons:
I.B.M. will continue to hold the 500 patents. But it has pledged to seek no royalties from and to place no restrictions on companies, groups or individuals who use them in open-source projects, as defined by the Open Source Initiative, a nonprofit education and advocacy group. The group’s definition involves a series of policies allowing for free redistribution, publication of the underlying source code and no restrictions on who uses the software or how it is used.
Just how far I.B.M. intends to go in granting open access to its patents is uncertain. The 500 patents are a small slice of its corporate patent trove of more than 40,000 worldwide and 25,000 in the United States. In recent years, software patents have accounted for about half of the patents granted to I.B.M.
This is the most serious (and cool!) development in Open Source in quite some time. Software patents are one of the few genuine threats to the open source model, and IBM is taking a step to ensure that the OSS model isn’t hindered.
While I am far from an expert on patents and IANAL, I see two distinct facets to this issue. First, this means that open source developers can sleep better at night with respect to these 500 patents. IBM is not going to sue you tomorrow because you’ve inadvertently stepped on one of their patents. Phew!
The second facet is the concept of a patent pool. I don’t know how this will shape up with respect to the patent commons that IBM is creating, but consider company A which has a pool of patents that it has licensed to company B. If company C comes along and claims that company B infringes on one of company C’s patents it could go and file a lawsuit against company B. However, since company B has company A’s patent pool at its disposal, it can root around in the pool and see if there are any patents that company C might be infringing. Company B’s defense could then respond to company C by saying: “Ok, we may be infringing your patent, but we strongly believe that you infringe these 20 patents that we have access to. Drop your suit or we will countersue.”
Company C is all of a sudden faced with a tidal wave of infringements that hadn’t been an issue up until they started picking on company B. As you can see, this is a great defense strategy for Company B, and surely lucrative for company A. My question is whether or not IBM will allow the patent commons to be used like a patent pool.
What if some anti-OSS company were to attack an OSS developer on the grounds that said developer infringed on their patent? Could this developer see if the attacking company infringes on one of the patents from the patent commons and if so, use the patent pool to form a defense strategy? If this is the case, then the power of this patent pool is vastly greater than it appears at first glance.
Regardless, I applaud IBM in this move and I am curious to see what will come of this in the future. However, we need to keep in mind that IBM is a large corporation, and while they have drastically changed course from their actions in the 80s, they are still self-interested and have to look out for shareholder value. This patent pool is not an altruistic act: IBM has some strategy in mind with this move. Perhaps it’s as simple as fostering open source and ensuring that the model doesn’t evaporate due to patents. But life is never that simple, is it?
If you know more about patents and if I borked something up, please let me know!
Related link: http://www.eyetop.net/
At the CES show I tried out these eyeglasses with a small QVGA monitor overlay in the right eye. Still a bit big, but getting there - they even had another unit with onboard camera on the left eye, so you can do mobile recordings.
For telepresence (letting others use you as a physical proxy) the technology is getting closer and closer. Just a few more years…
As TIME magazine saw it in December, 2004 was the start of a “golden age” of blogging, the rapid-fire web publishing scheme where anyone can publish their rants, photos or detailed reporting on the web in a matter of seconds. While blogging has been around for four or five years, the combination of the hotly contested election and the growth in popularity of blogging tools meant that blogging had hit critical mass.
Before this year, says writer Lev Grossman, “blogs kept a relatively modest profile, and the mainstream media could comfortably treat them like amateur productions that could never compete with real news organizations.” But their power has been growing. In 2002 a liberal blog called Talking Points Memo pushed for Trent Lott’s resignation as Senate majority leader. In 2004, Russ Kick obtained photos of US soldiers’ coffins coming back from Iraq. “The next day they were on the front pages of newspapers around the world,” says Grossman.
But the event that pushed blogs into the bigtime, if not the mainstream, was the 60 Minutes debacle, in which a blog called PowerLineBlog suggested that documents presented on “60 Minutes,” which seemed to show that President Bush reneged on his National Guard service, were in fact frauds. The post was famously called the “61st Minute.”
Half an hour after posting, “there were 50 e-mails in [PowerLine contributor Scott Johnson’s] In box from readers offering further arguments and evidence disputing the CBS documents’ authenticity. Johnson sifted through the comments and added some of them to his original post. This created a feedback loop. The more comments he posted, the more e-mail he got, which he then posted, generating even more e-mail, and so on. The process turbocharged itself. In all, he updated the post 15 or 20 times over the course of that day. …
“By 10:30 a.m., Power Line had an arsenal of arguments attacking the memos: typographical, logical, procedural, historical. The three bloggers put up genuine National Guard documents from 1973 so that readers could compare them with the 60 Minutes memos. The Drudge Report, the Mondo Cane grandfather of all right-leaning news blogs, linked to their site about midafternoon, sending a torrent of traffic their way and promptly crashing their Web server. By the end of the day, about 500 sites had linked to Power Line. ‘I think it’s fair to say that that post that Scott began is probably the most famous post in the young history of the blogosphere,’ Hinderaker says proudly.”
What’s interesting about this story is not so much that there are “citizen journalists” out there, doing the job that “real” journalists are not doing. In fact, there was no reporting or investigation to the original post, just a bit of reasoning and reasonable suspicion. It was the flood of posts from readers that created a virtuous circle of other people’s ideas, documentary evidence, and widespread dissemination. It is this ecology of facts, opinions and linking that is best described by the term “blogosphere.”
Dan Gillmor, who recently left his beat as technology columnist for the San Jose Mercury News, notes in his book “We the Media”: “If my readers know more than I do (which I know they do), I can include them in the process of making my journalism better.” Journalism, Gillmor suggests, is moving from a broadcast to a conversation. “The first article may be only the beginning of the conversation in which we can all enlighten each other.”
Since he wrote those words almost a year ago, Gillmor’s thinking has evolved, so much that he has left his job at the Mercury to start a nascent company to take citizen journalism in new directions. Currently, Gillmor is thinking a lot about what he calls distributed journalism. On his Grassroots Journalism blog, Gillmor credits two sites, Talking Points Memo and Daily DeLay, with putting the pressure on the Republicans to drop the rule change that would have allowed House Majority Leader Tom DeLay to keep his position even if he were to be indicted.
“Something especially important occurred with these two blogs. They asked readers to call their Republican members of Congress and ask how they voted on the original secret vote to give DeLay a break. Readers responded in droves.” They reported the responses back to the bloggers. The results were posted. Did you learn how Republicans voted from NPR or Fox or the New York Times? No. But the blogosphere has ways of finding out.
This is just the start of the new journalism, Gillmor thinks. “Suppose, for example, that we assemble a nationwide group of volunteers — lawyers who are familiar with statutes — and ask each of them to take a small section of one of those immense congressional bills that the members of Congress don’t even read themselves. Suppose, further, that we could get this analysis posted before the House and Senate did their final votes. We might catch a lot of sleazy stuff before it became law. Today we’re lucky if we know about any of it before it actually passes.”
This blogging thing is starting to look interesting.
Adapt, improve …