What is the largest text file that Java can read into a String?
I needed to find out, roughly, how to answer the question:
What is the maximum size text file I can read into a String in Java on a desktop Windows system? Here are my back-of-the-envelope calculations:
improvements are welcome, and I certainly don't want to claim that this
is definitive. We'll assume Sun's JRE and a modern Windows on Intel.
For a start, there are some absolute limits to the amount
of addressable memory for Java in Windows.
Sun's JRE for Windows is 32-bit: so 4 Gig must be a ceiling.
Then desktop Windows systems only have a 31-bit User Address Space (http://teamapproach.ca/trouble/Memory.htm): so we are down to 2 Gig.
Windows pragmatics intervene again here:
the addresses that DLLs load at may fragment the
contiguous space available for the Java heap: apparently
this can reduce the amount available to as little as
1.2 Gig in Java 1.4.2, with 1.5 being a
little better. I had never heard of re-basing before
I looked at this: scary! So we now drop down to, say, 1.3 Gig.
Now what about inside Java?
With the default setting for Java on Windows,
the memory is divided into permanent space, new space
and tenured space. Let's take out 100 Meg for permanent
space, which may be a little much. That leaves us with
1.2 Gig, which is what we supply to the -Xmx heap argument.
The default ratio of new to tenured space
is 1:12 (100 Meg here), and the tenured space apparently reserves about the same to allow copying at a pinch.
(More precisely, it reserves enough space to allow all of the "Eden" and the current "survivor" space to be copied into it: by default that is 9/10 of the new space, not a significant difference.) At this stage we have 1 Gig at most available.
Next, we realize that for most Western text data, a Java character doubles the amount of required memory to two bytes per external byte. Which brings us down to a 500 Meg file size. (A large object like this would get created directly in the Tenured space.)
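The chain of subtractions above can be written out as a small calculation. This is only a sketch of the arithmetic: every figure is one of the rough estimates from the text (in Meg), and the class and method names are made up for illustration.

```java
// Back-of-the-envelope heap budget, in Meg.
// All figures are the rough estimates from the text, not measured values.
class HeapBudget {
    static long maxWesternFileMeg() {
        long addressSpace = 4096;                    // 32-bit JRE: 4 Gig ceiling
        long userSpace = addressSpace / 2;           // 31-bit user address space: 2 Gig
        long contiguous = Math.min(userSpace, 1300); // after DLL fragmentation: "say, 1.3 Gig"
        long heap = contiguous - 100;                // minus permanent space: the -Xmx value
        long newSpace = heap / 12;                   // the text's figure for new space: 100 Meg
        long copyReserve = newSpace;                 // tenured reserve for copying: about the same
        long usable = heap - newSpace - copyReserve; // roughly 1 Gig
        return usable / 2;                           // two bytes per char for 8-bit text
    }

    public static void main(String[] args) {
        System.out.println(maxWesternFileMeg() + " Meg");
    }
}
```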
Of course, there are many things that could reduce this limit further: in particular, if we tried to copy the String, or if we chose the wrong kinds of IO operations. And some people talk about Tenured Space Fragmentation, which is where there is not enough contiguous space to allocate an object of the right size, even though there may be enough available memory scattered about.
So if 500 Meg is our maximum file size, what kind of configuration of PC do we need to support that? We want all of our Java process to fit into physical RAM: otherwise we might get thrashing. So we need 1.3 Gig of RAM, plus some extra for our operating system: let's say 128 Meg for Windows XP. Because desktop systems frequently have other applications running and open, and their performance really benefits when the working set is rich, let's say 2 Gig of RAM. We need to increase the page file so that there is plenty of memory available for commitment: let's set it at 3-4 Gig.
So that is my current estimate: the maximum possible text file to process as a String on ordinary Java on an ordinary desktop PC is 500 Meg, and 2 Gig of RAM is as much as you need. When you start to add other things, such as actually doing work in your program or using buffered IO, that maximum gets even smaller.
Let's say that we want to read a text file into a ByteBuffer, transcode it into a CharBuffer, then read that into some kind of String. That halves the maximum file size again: 250 Meg.
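A minimal sketch of that ByteBuffer to CharBuffer to String pipeline, to show where the copies come from. It uses `java.nio.file`, which is newer than the JREs discussed here, and the class name and the choice of ISO 8859-1 are assumptions for illustration only.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class DecodeWholeFile {
    // While this method runs, three things are alive at once:
    // the raw bytes, the decoded CharBuffer, and the final String.
    static String decode(byte[] raw, Charset charset) {
        ByteBuffer bytes = ByteBuffer.wrap(raw);  // wraps the array, no copy
        CharBuffer chars = charset.decode(bytes); // new buffer of 16-bit chars
        return chars.toString();                  // another copy of the chars
    }

    public static void main(String[] args) throws IOException {
        Charset latin1 = StandardCharsets.ISO_8859_1; // assumed file encoding
        if (args.length == 0) {
            System.out.println(decode("plain text".getBytes(latin1), latin1));
            return;
        }
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        String text = decode(raw, latin1);
        System.out.println(text.length() + " chars from " + raw.length + " bytes");
    }
}
```

The CharBuffer and the String each hold a full set of characters until the buffer becomes garbage, which is the halving described above.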
What if we want to do the same, but have it in a StringBuffer? Java will still try to reserve up to twice the size for the backing array, and suck up all the available space. So a 167 Meg file may take up as much RAM as our 250 Meg file!
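The over-reservation can be seen by comparing `capacity()` with `length()`. The sketch below is illustrative only: the class and method names are invented, and the exact growth policy is an implementation detail of the JRE rather than something the API promises. Pre-sizing the buffer with the `StringBuffer(int)` constructor avoids the regrowth.

```java
class BufferGrowth {
    // Appends `chars` characters to a StringBuffer created with the given
    // initial capacity and reports how large the backing store ended up.
    static int capacityAfterAppending(int initialCapacity, int chars) {
        StringBuffer sb = new StringBuffer(initialCapacity);
        for (int i = 0; i < chars; i++) {
            sb.append('x');
        }
        return sb.capacity();
    }

    public static void main(String[] args) {
        int chars = 1000000;
        // Default-sized buffer: grows repeatedly, usually overshooting.
        System.out.println("grown:    " + capacityAfterAppending(16, chars));
        // Pre-sized buffer: no regrowth, so no slack beyond what was asked for.
        System.out.println("presized: " + capacityAfterAppending(chars, chars));
    }
}
```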
The bottom line: if you have to read files into your application that are greater than 167 Meg, you need to be very careful in tracking memory management: you probably want to make your own string library or extend the existing classes where possible. And, on the face of it, you can forget files greater than 500 Meg. And parsing the text into smaller objects, such as a linked list of lines as Strings, may not save any space either. This obviously has implications for XML DOMs.
Of course, the answer for large files is either to implement your own chunking mechanism to allow files larger than the virtual memory, or to move to streaming processing entirely. NIO direct buffers may have something to offer here as well.
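As a rough illustration of the streaming approach, the sketch below counts newlines while holding only one fixed-size chunk in memory at a time. The class name, the chunk size and the ISO 8859-1 encoding are assumptions; counting newlines simply stands in for whatever per-chunk work a real program would do.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ChunkedCount {
    // Reads the input chunk by chunk; memory use is bounded by chunkChars,
    // not by the size of the file.
    static long countNewlines(Reader in, int chunkChars) {
        char[] buf = new char[chunkChars];
        long count = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        count++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println(countNewlines(new StringReader("a\nb\n"), 8192));
            return;
        }
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]),
                                                 StandardCharsets.ISO_8859_1)) {
            System.out.println(countNewlines(in, 8192));
        }
    }
}
```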
I don't have a PC with 2 Gig, so I cannot experiment to find out the answer. If anyone has a large memory system and wants to experiment, I would certainly print the results!
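For anyone who does want to experiment, here is one possible probe: a sketch that binary-searches for the largest `char[]` the JVM will allocate under the current `-Xmx` setting. The class and method names are invented, and the search assumes that if a given size fails then every larger size fails too, which fragmentation may not strictly guarantee.

```java
class LargestArrayProbe {
    // Returns true if a char[] of the given length can be allocated right now.
    static boolean canAllocate(int chars) {
        try {
            char[] probe = new char[chars];
            return probe.length == chars;
        } catch (OutOfMemoryError e) {
            return false;
        }
    }

    // Binary search for the largest allocatable char[] length up to upperBound.
    static int largestCharArray(int upperBound) {
        int lo = 0;
        int hi = upperBound;
        while (lo < hi) {
            int mid = (int) (((long) lo + hi + 1) / 2); // long math avoids overflow
            if (canAllocate(mid)) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        long maxHeapMeg = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        int chars = largestCharArray(Integer.MAX_VALUE);
        System.out.println("max heap: " + maxHeapMeg + " Meg");
        System.out.println("largest char[]: " + chars + " chars, about "
                + (2L * chars) / (1024 * 1024) + " Meg");
    }
}
```

Running it with different `-Xmx` values would show how much of the heap a single contiguous array can really claim.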
Thinking about Java memory management is very complex, not least because much material on the Web is out-of-date or concerns platforms other than Windows. And much of it concerns Java for server applications, not desktop applications. For people interested in this topic, I have two previous blog entries, and there is some material at Editing the Million-Line XML Document.
Corrections Welcome!
Comments (3)
Read More Entries by Rick Jelliffe.

CharSequence
Yes, CharSequence is a good possibility, and I do mention NIO. Unfortunately many APIs require Strings rather than CharSequence, so they are not directly substitutable.
As far as I am aware, NIO is not clever enough
to store 8-bit data and then give an interface to it as 16-bit characters in the way you suggest: ByteBuffer.asCharBuffer() merely concatenates two adjacent bytes, it doesn't take a byte and pad it into 16 bits. (I would be happy to be wrong here!) Furthermore, even if it did, unless the transcoding libraries worked that way also, the technique would only work for ASCII and ISO 8859-1.
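A small check of that behaviour, as a sketch (the class and method names are invented for illustration). With the default big-endian byte order, the two bytes of "AB" become a single char rather than two padded ones, whereas a charset decoder gives one char per byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class AsCharBufferDemo {
    // Views the bytes as chars: each pair of bytes becomes one 16-bit char.
    static String viewAsChars(byte[] raw) {
        return ByteBuffer.wrap(raw).asCharBuffer().toString();
    }

    // Decodes the bytes as ISO 8859-1: each byte becomes one char.
    static String decodeLatin1(byte[] raw) {
        return StandardCharsets.ISO_8859_1.decode(ByteBuffer.wrap(raw)).toString();
    }

    public static void main(String[] args) {
        byte[] raw = { 0x41, 0x42 }; // "AB" in ASCII
        String viewed = viewAsChars(raw);
        System.out.println(viewed.length());                       // prints 1
        System.out.println(Integer.toHexString(viewed.charAt(0))); // prints 4142
        System.out.println(decodeLatin1(raw));                     // prints AB
    }
}
```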
CharSequence
You'll be much better off with a giant CharSequence than a giant String. Since CharSequence is merely an interface, you can do lots of clever things:
-Jesse
http://publicobject.com/glazedlists/
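One reading of this suggestion, sketched under assumptions: a CharSequence that keeps the text as 8-bit data and widens each byte to a char only on access. The class below is hypothetical, not something from Glazed Lists or the JDK, and it is only correct for ASCII and ISO 8859-1 data, where each byte maps directly to the same code point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical CharSequence backed by ISO 8859-1 bytes: one byte per character
// in memory instead of two.
class Latin1CharSequence implements CharSequence {
    private final byte[] data;
    private final int start;
    private final int end;

    Latin1CharSequence(byte[] data) {
        this(data, 0, data.length);
    }

    private Latin1CharSequence(byte[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    public int length() {
        return end - start;
    }

    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (char) (data[start + index] & 0xFF); // widen unsigned byte to char
    }

    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new Latin1CharSequence(data, start + from, start + to); // shares the array
    }

    @Override
    public String toString() {
        // Materialises a real String, so avoid calling this on the giant sequence.
        return new String(data, start, end - start, StandardCharsets.ISO_8859_1);
    }
}
```

Sub-sequences share the backing array, so slicing costs no extra character storage; the catch, as noted in the reply above, is that APIs which insist on String force a full copy.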
Western numbers
A famous Java/XML writer thought I should emphasize that the kinds of numbers I am talking about with "filesize" are for Western and 8-bit character sets: ASCII, ISO 8859-1 etc. For 16-bit character sets and East Asian (CJK) data, one stage of halving will be omitted: the maximum file size (that could be read by magic into a String) will be 1 Gig, not 500 Meg. Good catch.
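The difference can be illustrated by comparing bytes on disk with bytes in memory, assuming two bytes per Java char as the estimates above do. This is a sketch with invented names; UTF-16BE stands in here for "a 16-bit character set".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsPerByte {
    // Ratio of in-memory size (assumed two bytes per char) to encoded file size.
    static double heapBytesPerFileByte(String text, Charset fileEncoding) {
        int fileBytes = text.getBytes(fileEncoding).length;
        int heapBytes = text.length() * 2; // assumption: 16 bits per Java char
        return (double) heapBytes / fileBytes;
    }

    public static void main(String[] args) {
        // Western text in an 8-bit encoding doubles in size when loaded.
        System.out.println(heapBytesPerFileByte("abcd", StandardCharsets.ISO_8859_1));
        // CJK text in a 16-bit encoding stays the same size.
        System.out.println(heapBytesPerFileByte("\u65e5\u672c\u8a9e", StandardCharsets.UTF_16BE));
    }
}
```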