This excerpt is Chapter 10 from Java RMI, published in October 2001 by O'Reilly.
Serialization is the process of converting a set of object instances that contain references to each other into a linear stream of bytes, which can then be sent through a socket, stored to a file, or simply manipulated as a stream of data. Serialization is the mechanism used by RMI to pass objects between JVMs, either as arguments in a method invocation from a client to a server or as return values from a method invocation. In the first section of this book, I referred to this process several times but delayed a detailed discussion until now. In this chapter, we drill down on the serialization mechanism; by the end of it, you will understand exactly how serialization works and how to use it efficiently within your applications.
The "write" methods
The stream manipulation methods
Methods that customize the serialization mechanism
The "read" methods
The stream manipulation methods
Methods that customize the serialization mechanism
How to Make a Class Serializable
Implement the Serializable Interface
Declaring transient fields
Implementing writeObject() and readObject( )
Declaring serialPersistentFields
Make Sure That Superclass State Is Handled Correctly
Override equals( ) and hashCode( ) if Necessary
Making DocumentDescription Serializable
Implement the Serializable interface
Make sure that superclass state is handled correctly
Override equals() and hashCode( ) if necessary
A Simplified Version of the Serialization Algorithm
RMI Customizes the Serialization Algorithm
The Two Types of Versioning Problems
Serialization Depends on Reflection
Envision the banking application while a client is executing a withdrawal. The part of the application we're looking at has the runtime structure shown in Figure 10-1.
What does it mean for the client to pass an instance of Money to the server? At a minimum, it means that the server is able to call public methods on the instance of Money.
One way to do this would be to implicitly turn the instance of Money into a server as well. (Just to be clear: doing things this way would be a bad idea, and this is not the way RMI passes instances over the wire.)
For example, imagine that the client sends the following two pieces of
information whenever it passes an instance as an argument:
Money.
The RMI runtime layer in the server can use this information to construct a stub for the instance of Money, so that whenever the Account server calls a method on what it thinks of as the instance of Money, the method call is relayed over the wire, as shown in Figure 10-2.
Attempting to do things this way has three significant drawbacks:
You can't access fields on the objects that have been passed as arguments.
Stubs work by implementing an interface. They implement the methods in the interface by simply relaying the method invocation across the network. That is, the stub methods take all their arguments and simply marshall them for transport across the wire. Accessing a public field is really just dereferencing a pointer--there is no method invocation and hence, there isn't a method call to forward over the wire.
It can result in unacceptable performance due to network latency.
Even in our simple case, the instance of Account is going to need to call getCents( ) on the instance of Money. This means that a simple call to makeDeposit( ) really involves at least two distinct networked method calls: makeDeposit( ) from the client and getCents( ) from the server.
It makes the application much more vulnerable to partial failure.
Let's say that the server is busy and doesn't get around to handling the request for 30 seconds. If the client crashes in the interim, or if the network goes down, the server cannot process the request at all. Until all data has been requested and sent, the application is particularly vulnerable to partial failures.
This last point is an interesting one. Any time you have an application that requires a long-lasting and durable connection between client and server, you build in a point of failure. The longer the connection needs to last, or the higher the communication bandwidth the connection requires, the more likely the application is to occasionally break down.
TIP: The original design of the Web, with its stateless connections, serves as a good example of a distributed application that can tolerate almost any transient network failure.
These three reasons imply that what is really needed is a way to copy objects and send them over the wire. That is, instead of turning arguments into implicit servers, arguments need to be completely copied so that no further network calls are needed to complete the remote method invocation. Put another way, we want the result of makeWithdrawal( ) to involve creating a copy of the instance of Money on the server side. The runtime structure should resemble Figure 10-3.
The desire to avoid unnecessary network dependencies has two significant consequences:
Once an object is duplicated, the two objects are completely independent of each other.
Any attempt to keep the copy and the original in sync would involve propagating changes over the network, entirely defeating the reason for making the copy in the first place.
The copying mechanism must create deep copies.
If the instance of Money references another instance, then copies must be made of both instances. Otherwise, when a method is called on the second object, the call must be relayed across the wire. Moreover, all the copies must be made immediately--we can't wait until the second object is accessed to make the copy because the original might change in the meantime.
These two consequences have a very important third consequence:
If an object is sent twice, in separate method calls, two copies of the object will be created.
In addition to arguments to method calls, this holds for objects that are referenced by the arguments. If you pass object A, which has a reference to object C, and in another call you pass object B, which also has a reference to C, you will end up with two distinct copies of C on the receiving side.
To see why this last point holds, consider a client that executes a withdrawal and then tries to cancel the transaction by making a deposit for the same amount of money. That is, the following lines of code are executed:
server.makeWithdrawal(amount);
....
server.makeDeposit(amount);
The client has no way of knowing whether the server still has a copy of amount. After all, the server may have used it and then thrown the copy away once it was done. This means that the client has to marshall amount and send it over the wire to the server.
The RMI runtime can demarshall amount, which is the instance of Money the client sent. However, even if it has the previous object, it has no way (unless equals( ) has been overridden) to tell whether the instance it just demarshalled is equal to the previous object.
More generally, if the object being copied isn't immutable, then the server might change it. In this case, even if the two objects are currently equal, the RMI runtime has no way to tell if the two copies will always be equal and can potentially be replaced by a single copy. To see why, consider our Printer example again. At the end of Chapter 3, we considered a list of possible feature requests that could be made. One of them was the following:
Managers will want to track resource consumption. This will involve logging print requests and, quite possibly, building a set of queries that can be run against the printer's log.
This can be implemented by adding a few more fields to DocumentDescription and having the server store an indexed log of all the DocumentDescription objects it has received. For example, we may add the following fields to DocumentDescription:
public Time whenPrinted;
public Person sender;
public boolean printSucceeded;
Now consider what happens when the user actually wants to print two copies of the same document. The client application could call:
server.printDocument(document);
twice with the "same" instance of DocumentDescription. And it would be an error for the RMI runtime to create only one instance of DocumentDescription on the server side. Even though the "same" object is passed into the server twice, it is passed as parts of distinct requests and therefore as different objects.
TIP: This is true even if the runtime can tell that the two instances of DocumentDescription are equal when it finishes demarshalling. An implementation of a printer may well have a notion of a job queue that holds instances of DocumentDescription. So our client makes the first call, and the copy of document is placed in the queue (say, at number 5), but not edited because the document hasn't been printed yet. Then our client makes the second call. At this point, the two copies of document are equal. However, we don't want to place the same object in the printer queue twice. We want to place distinct copies in the printer queue.
Thus, we come to the following conclusion: network latency, and the desire to avoid vulnerability to partial failures, force us to have a deep copy mechanism for most arguments to a remote method invocation. This copying mechanism has to make deep copies, and it cannot perform any validation to eliminate "extra" copies across methods.
TIP: While this discussion provides examples of implementation decisions that force two copies to occur, it's important to note that, even without such examples, clients should be written as if the servers make independent copies. That is, clients are written to use interfaces. They should not, and cannot, make assumptions about server-side implementations of the interfaces.
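The copying behavior described above can be observed without RMI at all. The following is a minimal sketch that uses the standard java.io object streams (covered later in this chapter) to stand in for the wire; the classes SharedPart, Holder, and SeparateCallsDemo are hypothetical names invented for this illustration, and sendInOneCall( ) only simulates a remote call. It suggests that two holders sent through one stream still share their referenced object, while the same holders sent through separate streams end up with two distinct copies of it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for "object C", the instance that A and B both reference.
final class SharedPart implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    SharedPart(String name) { this.name = name; }
}

// Hypothetical stand-in for "object A" and "object B".
final class Holder implements Serializable {
    private static final long serialVersionUID = 1L;
    final SharedPart part;
    Holder(SharedPart part) { this.part = part; }
}

final class SeparateCallsDemo {
    // Simulates one remote call: marshal every argument into a fresh stream,
    // then demarshal them all on the "receiving side".
    static Object[] sendInOneCall(Object... arguments)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(wire)) {
            for (Object argument : arguments) {
                out.writeObject(argument);
            }
        }
        Object[] received = new Object[arguments.length];
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(wire.toByteArray()))) {
            for (int i = 0; i < received.length; i++) {
                received[i] = in.readObject();
            }
        }
        return received;
    }

    public static void main(String[] args) throws Exception {
        SharedPart c = new SharedPart("C");
        Holder a = new Holder(c);
        Holder b = new Holder(c);

        // One stream: the copies of A and B refer to a single copy of C.
        Object[] together = sendInOneCall(a, b);
        System.out.println("same call shares C: "
                + (((Holder) together[0]).part == ((Holder) together[1]).part));

        // Two streams: each call produces its own copy of C.
        Holder copyOfA = (Holder) sendInOneCall(a)[0];
        Holder copyOfB = (Holder) sendInOneCall(b)[0];
        System.out.println("separate calls share C: " + (copyOfA.part == copyOfB.part));
    }
}
```

Nothing in the second case lets the receiver discover that the two copies of C came from the same original, which is the situation the server is in after two separate method calls.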
Serialization is a mechanism built into the core Java libraries for writing a graph of objects into a stream of data. This stream of data can then be programmatically manipulated, and a deep copy of the objects can be made by reversing the process. This reversal is often called deserialization.
In particular, there are three main uses of serialization:
- If the stream being used is a FileOutputStream, then the data will automatically be written to a file.
- If the stream being used is a ByteArrayOutputStream, then the data will be written to a byte array in memory. This byte array can then be used to create duplicates of the original objects.
- If the stream being used comes from a socket, then the data will be sent over the wire to whatever is reading from the other end of the connection.
The important thing to note is that the use of serialization is independent of the serialization algorithm itself. If we have a serializable class, we can save it to a file or make a copy of it simply by changing the way we use the output of the serialization mechanism.
As you might expect, serialization is implemented using a pair of streams. Even though the code that underlies serialization is quite complex, the way you invoke it is designed to make serialization as transparent as possible to Java developers. To serialize an object, create an instance of ObjectOutputStream and call the writeObject( ) method; to read in a serialized object, create an instance of ObjectInputStream and call the readObject( ) method.
ObjectOutputStream, defined in the java.io package, is a stream that implements the "writing-out" part of the serialization algorithm. (RMI actually uses a subclass of ObjectOutputStream to customize its behavior.) The methods implemented by ObjectOutputStream can be grouped into three categories: methods that write information to the stream, methods used to control the stream's behavior, and methods used to customize the serialization algorithm.
The first, and most intuitive, category consists of the "write" methods:
public void write(byte[] b);
public void write(byte[] b, int off, int len);
public void write(int data);
public void writeBoolean(boolean data);
public void writeByte(int data);
public void writeBytes(String data);
public void writeChar(int data);
public void writeChars(String data);
public void writeDouble(double data);
public void writeFields( );
public void writeFloat(float data);
public void writeInt(int data);
public void writeLong(long data);
public void writeObject(Object obj);
public void writeShort(int data);
public void writeUTF(String s);
public void defaultWriteObject( );
For the most part, these methods should seem familiar. writeFloat( ), for example, works exactly as you would expect after reading Chapter 1 -- it takes a floating-point number and encodes the number as four bytes. There are, however, two new methods here: writeObject( ) and defaultWriteObject( ).
writeObject( ) serializes an object. In fact, writeObject( ) is often the instrument of the serialization mechanism itself. In the simplest and most common case, serializing an object involves doing two things: creating an ObjectOutputStream and calling writeObject( ) with a single "top-level" instance. The following code snippet shows the entire process, storing an object--and all the objects to which it refers--into a file:
FileOutputStream underlyingStream = new FileOutputStream("C:\\temp\\test");
ObjectOutputStream serializer = new ObjectOutputStream(underlyingStream);
serializer.writeObject(serializableObject);
Of course, this works seamlessly with the other methods for writing data. That is, if you wanted to write two floats, a String, and an object to a file, you could do so with the following code snippet:
FileOutputStream underlyingStream = new FileOutputStream("C:\\temp\\test");
ObjectOutputStream serializer = new ObjectOutputStream(underlyingStream);
serializer.writeFloat(firstFloat);
serializer.writeFloat(secondFloat);
serializer.writeUTF(aString);
serializer.writeObject(serializableObject);
TIP: ObjectOutputStream's constructor takes an OutputStream as an argument. This is analogous to many of the streams we looked at in Chapter 1. ObjectOutputStream and ObjectInputStream are simply encoding and transformation layers. This enables RMI to send objects over the wire by opening a socket connection, associating the OutputStream with the socket connection, creating an ObjectOutputStream on top of the socket's OutputStream, and then calling writeObject( ).
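When primitives and objects are mixed in one stream, the reader has to consume them in exactly the order in which they were written. The sketch below writes two floats, a String, and an object, and then reads them back with ObjectInputStream. The class MixedStreamDemo and the in-memory buffer are assumptions for the example; a file or socket stream would be layered in the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class MixedStreamDemo {
    // Writes four values and reads them back in the same order.
    static Object[] writeThenRead(float firstFloat, float secondFloat,
                                  String aString, Serializable serializableObject)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream underlyingStream = new ByteArrayOutputStream();
        try (ObjectOutputStream serializer = new ObjectOutputStream(underlyingStream)) {
            serializer.writeFloat(firstFloat);
            serializer.writeFloat(secondFloat);
            serializer.writeUTF(aString);
            serializer.writeObject(serializableObject);
        }

        try (ObjectInputStream deserializer = new ObjectInputStream(
                new ByteArrayInputStream(underlyingStream.toByteArray()))) {
            // Each read mirrors the corresponding write above.
            float first = deserializer.readFloat();
            float second = deserializer.readFloat();
            String text = deserializer.readUTF();
            Object object = deserializer.readObject();
            return new Object[] { first, second, text, object };
        }
    }
}
```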
The other new "write" method is defaultWriteObject( ). defaultWriteObject( ) makes it much easier to customize how instances of a single class are serialized. However, defaultWriteObject( ) has some strange restrictions placed on when it can be called. Here's what the documentation says about defaultWriteObject( ):
Write the nonstatic and nontransient fields of the current class to this stream. This may only be called from the writeObject method of the class being serialized. It will throw the NotActiveException if it is called otherwise.
That is, defaultWriteObject( ) is a method that works only when it is called from another specific method at a particular time. Since defaultWriteObject( ) is useful only when you are customizing the information stored for a particular class, this turns out to be a reasonable restriction. We'll talk more about defaultWriteObject( ) later in the chapter, when we discuss how to make a class serializable.
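The restriction is easy to see in a few lines. In this sketch, Note and DefaultWriteObjectDemo are hypothetical classes written for the illustration: calling defaultWriteObject( ) directly from client code is rejected with a NotActiveException, while the same call made from inside a class's own writeObject( ) method succeeds.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotActiveException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical serializable class that calls defaultWriteObject() legally.
final class Note implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String text;

    Note(String text) { this.text = text; }

    String text() { return text; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject(); // allowed: the stream is in the middle of writing this Note
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
    }
}

final class DefaultWriteObjectDemo {
    // Returns true if a direct call from client code is rejected.
    static boolean rejectedWhenCalledDirectly() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.defaultWriteObject(); // no object is currently being written
            return false;
        } catch (NotActiveException expected) {
            return true;
        }
    }

    // Returns true if serializing a Note, whose writeObject() makes the call, succeeds.
    static boolean acceptedInsideWriteObject() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(new Note("hello"));
            return true;
        }
    }
}
```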
ObjectOutputStream also implements four methods that deal with the basic mechanics of manipulating the stream:
public void reset( );
public void close( );
public void flush( );
public void useProtocolVersion(int version);
With the exception of useProtocolVersion( ), these methods should be familiar. In fact, reset( ), close( ), and flush( ) are standard stream methods. useProtocolVersion( ), on the other hand, changes the version of the serialization mechanism that is used. This is necessary because the serialization format and algorithm may need to change in a way that's not backwards-compatible. If another application needs to read in your serialized data, and the applications will be versioning independently (or running in different versions of the JVM), you may want to standardize on a protocol version.
TIP: There are two versions of the serialization protocol currently defined: PROTOCOL_VERSION_1 and PROTOCOL_VERSION_2. If you send serialized data to a 1.1 (or earlier) JVM, you should probably use PROTOCOL_VERSION_1. The most common case of this involves applets. Most applets run in browsers over which the developer has no control. This means, in particular, that the JVM running the applet could be anything, from Java 1.0.2 through the latest JVM. Most servers, on the other hand, are written using JDK1.2.2 or later. (The main exception is EJB containers that require earlier versions of Java. At this writing, for example, Oracle 8i's EJB container uses JDK 1.1.6.) If you pass serialized objects between an applet and a server, you should specify the serialization protocol.
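As a rough sketch of how a protocol version might be selected, the example below calls useProtocolVersion( ) with the PROTOCOL_VERSION_1 constant (defined in java.io.ObjectStreamConstants) before anything is written, and then reads the value back. The class ProtocolVersionDemo and its method are hypothetical names; the assumption illustrated here is that the version is chosen on a fresh stream, before any objects have been serialized to it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.Serializable;

final class ProtocolVersionDemo {
    // Serializes a value using the older protocol version, then reads it back.
    static Object roundTripWithVersion1(Serializable value)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            // Select the protocol first, while the stream is still empty.
            out.useProtocolVersion(ObjectStreamConstants.PROTOCOL_VERSION_1);
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readObject();
        }
    }
}
```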
The last group of methods consists mostly of protected methods that provide hooks that allow the serialization mechanism itself, rather than the data associated to a particular class, to be customized. These methods are:
public ObjectOutputStream.PutField putFields( );
protected void annotateClass(Class cl);
protected void annotateProxyClass(Class cl);
protected boolean enableReplaceObject(boolean enable);
protected Object replaceObject(Object obj);
protected void drain( );
protected void writeObjectOverride(Object obj);
protected void writeClassDescriptor(ObjectStreamClass classdesc);
protected void writeStreamHeader( );
These methods are more important to people who tailor the serialization algorithm to a particular use or develop their own implementation of serialization. As such, they require a deeper understanding of the serialization algorithm. We'll discuss these methods in more detail later, after we've gone over the actual algorithm used by the serialization mechanism.
ObjectInputStream, defined in the java.io package, implements the "reading-in" part of the serialization algorithm. It is the companion to ObjectOutputStream--objects serialized using ObjectOutputStream can be deserialized using ObjectInputStream. Like ObjectOutputStream, the methods implemented by ObjectInputStream can be grouped into three categories: methods that read information from the stream, methods that are used to control the stream's behavior, and methods that are used to customize the serialization algorithm.
The first, and most intuitive, category consists of the "read" methods:
public int read( );
public int read(byte[] b, int off, int len);
public boolean readBoolean( );
public byte readByte( );
public char readChar( );
public double readDouble( );
public float readFloat( );
public int readInt( );
public long readLong( );
public Object readObject( );
public short readShort( );
public int readUnsignedByte( );
public int readUnsignedShort( );
public String readUTF( );
public void defaultReadObject( );
Just as with ObjectOutputStream's write( ) methods, these methods should be familiar. readFloat( ), for example, works exactly as you would expect after reading Chapter 1: it reads four bytes from the stream and converts them into a single floating-point number, which is returned by the method call. And, again as with ObjectOutputStream, there are two new methods here: readObject( ) and defaultReadObject( ).
Just as writeObject( ) serializes an object, readObject( ) deserializes it. Deserializing an object involves doing two things: creating an ObjectInputStream and then calling readObject( ). The following code snippet shows the entire process, creating a copy of an object (and all the objects to which it refers) from a file:
FileInputStream underlyingStream = new FileInputStream("C:\\temp\\test");
ObjectInputStream deserializer = new ObjectInputStream(underlyingStream);
Object deserializedObject = deserializer.readObject( );
This code is exactly inverse to the code we used for serializing the object in the first place. If we wanted to make a deep copy of a serializable object, we could first serialize the object and then deserialize it, as in the following code example:
ByteArrayOutputStream memoryOutputStream = new ByteArrayOutputStream( );
ObjectOutputStream serializer = new ObjectOutputStream(memoryOutputStream);
serializer.writeObject(serializableObject);
serializer.flush( );
ByteArrayInputStream memoryInputStream = new ByteArrayInputStream(memoryOutputStream.toByteArray( ));
ObjectInputStream deserializer = new ObjectInputStream(memoryInputStream);
Object deepCopyOfOriginalObject = deserializer.readObject( );
This code simply places an output stream into memory, serializes the object to the memory stream, creates an input stream based on the same piece of memory, and runs the deserializer on the input stream. The end result is a deep copy of the object with which we started.
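The same steps can be wrapped in a reusable helper. The following is one possible sketch; DeepCopier and deepCopy( ) are names invented here, and the generic signature is an assumption about how such a helper might be typed rather than anything defined by the serialization libraries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class DeepCopier {
    // Serializes the object graph into memory and immediately deserializes it,
    // producing a copy that shares no mutable state with the original.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream memoryOutputStream = new ByteArrayOutputStream();
        try (ObjectOutputStream serializer = new ObjectOutputStream(memoryOutputStream)) {
            serializer.writeObject(original);
        }
        try (ObjectInputStream deserializer = new ObjectInputStream(
                new ByteArrayInputStream(memoryOutputStream.toByteArray()))) {
            // The stream holds exactly the graph written above, so the cast is safe.
            return (T) deserializer.readObject();
        }
    }
}
```

Because the copy is deep, changing a nested object reachable from the copy leaves the original graph untouched.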
There are five basic stream manipulation methods defined for ObjectInputStream:
public int available( );
public void close( );
public void readFully(byte[] data);
public void readFully(byte[] data, int offset, int size);
public int skipBytes(int len);
Of these, available( ) and close( ) are methods first defined on InputStream. available( ) returns the number of bytes that can be read without blocking, and close( ) closes the stream.
The three new methods are also straightforward. skipBytes( ) skips the indicated number of bytes in the stream, blocking until all the information has been read. And the two readFully( ) methods perform a batch read into a byte array, also blocking until all the data has been read in.
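A short sketch may help show how skipBytes( ) and readFully( ) behave on raw bytes. The class StreamManipulationDemo and its method are hypothetical; the example writes a byte array through an ObjectOutputStream, skips a prefix on the way back in, and then fills a buffer completely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

final class StreamManipulationDemo {
    // Writes payload as raw bytes, then skips toSkip bytes and reads the next toRead bytes.
    static byte[] skipThenReadFully(byte[] payload, int toSkip, int toRead)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.write(payload);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            in.skipBytes(toSkip);            // discard the prefix
            byte[] result = new byte[toRead];
            in.readFully(result);            // returns only once the whole array is filled
            return result;
        }
    }
}
```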
The last group of methods consists mostly of protected methods that provide hooks, which allow the serialization mechanism itself, rather than the data associated to a particular class, to be customized. These methods are:
protected boolean enableResolveObject(boolean enable);
protected Class resolveClass(ObjectStreamClass v);
protected Object resolveObject(Object obj);
protected Class resolveProxyClass(String[] interfaces);
protected ObjectStreamClass readClassDescriptor( );
protected Object readObjectOverride( );
protected void readStreamHeader( );
public void registerValidation(ObjectInputValidation obj, int priority);
public ObjectInputStream.GetField readFields( );
These methods are more important to people who tailor the serialization algorithm to a particular use or develop their own implementation of serialization. As before, they also require a deeper understanding of the serialization algorithm, so I'll hold off on discussing them right now.
So far, we've focused on the mechanics of serializing an object. We've assumed we have a serializable object and discussed, from the point of view of client code, how to serialize it. The next step is discussing how to make a class serializable.
There are four basic things you must do when you are making a class serializable. They are:
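- Implement the Serializable interface.
- Make sure that instance-level, locally defined state is serialized properly.
- Make sure that superclass state is handled correctly.
- Override equals( ) and hashCode( ) if necessary.
Let's look at each of these steps in more detail.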
This is by far the easiest of the steps. The Serializable interface is an empty interface; it declares no methods at all. So implementing it amounts to adding "implements Serializable" to your class declaration.
Reasonable people may wonder about the utility of an empty interface. Rather than define an empty interface, and require class definitions to implement it, why not just simply make every object serializable? The main reason not to do this is that there are some classes that don't have an obvious serialization. Consider, for example, an instance of File. An instance of File represents a file. Suppose, for example, it was created using the following line of code:
File file = new File("c:\\temp\\foo");
It's not at all clear what should be written out when this is serialized. The problem is that the file itself has a different lifecycle than the serialized data. The file might be edited, or deleted entirely, while the serialized information remains unchanged. Or the serialized information might be used to restart the application on another machine, where "C:\\temp\\foo" is the name of an entirely different file.
Another example is provided by the
Thread class. (If you don't know much about threads, just wait a few chapters and then revisit this example. It will make more sense then.)
A thread represents a flow of execution within a particular JVM. You
would not only have to store the stack, and all the local variables, but also
all the related locks and threads, and restart all the threads properly when
the instance is deserialized.
TIP: Things get worse when you consider platform dependencies. In general, any class that involves native code is not really a good candidate for serialization.
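The effect of the marker interface can be checked directly: the serialization mechanism refuses to write an instance whose class does not implement Serializable. In the sketch below, MarkerInterfaceDemo, PlainPoint, and SerializablePoint are hypothetical names used only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class MarkerInterfaceDemo {
    // Identical state, but no marker interface.
    static final class PlainPoint {
        int x = 1;
    }

    // Adding "implements Serializable" is the only difference.
    static final class SerializablePoint implements Serializable {
        private static final long serialVersionUID = 1L;
        int x = 1;
    }

    // Returns true if the candidate can be written, false if it is rejected
    // because its class (or a class it refers to) is not serializable.
    static boolean canSerialize(Object candidate) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException rejected) {
            return false;
        }
    }
}
```

With this helper, a SerializablePoint is accepted, while a PlainPoint or an instance of Thread is rejected with a NotSerializableException.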
Class definitions contain variable declarations. The instance-level, locally defined variables (e.g., the nonstatic variables) are the ones that contain the state of a particular instance. For example, in our Money class, we declared one such field:
public class Money extends ValueObject {
private int _cents;
....
}
The serialization mechanism has a nice default behavior -- if all the instance-level, locally defined variables have values that are either serializable objects or primitive datatypes, then the serialization mechanism will work without any further effort on our part. For example, our implementations of Account, such as Account_Impl, would present no problems for the default serialization mechanism:
public class Account_Impl extends UnicastRemoteObject implements Account {
private Money _balance;
...
}
While _balance doesn't have a primitive type, it does refer to an instance of Money, which is a serializable class.
If, however, some of the fields don't have primitive types, and don't refer to serializable classes, more work may be necessary. Consider, for example, the implementation of ArrayList from the java.util package. An ArrayList really has only two pieces of state:
public class ArrayList extends AbstractList implements List, Cloneable, java.io.Serializable {
private Object elementData[];
private int size;
...
}
But hidden in here is a huge problem: ArrayList is a generic container class whose state is stored as an array of objects. While arrays are first-class objects in Java, they aren't serializable objects. This means that ArrayList can't just implement the Serializable interface. It has to provide extra information to help the serialization mechanism handle its nonserializable fields. There are three basic solutions to this problem:
- Fields can be declared to be transient.
- The writeObject( )/readObject( ) methods can be implemented.
- serialPersistentFields can be declared.
The first, and easiest, thing you can do is simply mark some fields using the transient keyword. In ArrayList, for example, elementData is really declared to be a transient field:
public class ArrayList extends AbstractList implements List, Cloneable, java.io.Serializable {
private transient Object elementData[];
private int size;
...
}
This tells the default serialization mechanism to ignore the variable. In other words, the serialization mechanism simply skips over the transient variables. In the case of ArrayList, the default serialization mechanism would attempt to write out size, but ignore elementData entirely.
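The skipping behavior is easy to observe. In the following sketch, Session and TransientDemo are hypothetical classes: Session keeps a greeting in a transient field, and after a round trip through serialization the copy still has its user name, but the transient field has come back empty and must be rebuilt.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class Session implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String user;               // written by the default mechanism
    private transient String cachedGreeting; // skipped by the default mechanism

    Session(String user) { this.user = user; }

    // Lazily rebuilds the transient value whenever it is missing.
    String greeting() {
        if (cachedGreeting == null) {
            cachedGreeting = "Hello, " + user;
        }
        return cachedGreeting;
    }

    boolean hasCachedGreeting() { return cachedGreeting != null; }
}

final class TransientDemo {
    // Serializes the session into memory and reads back a copy.
    static Session copy(Session original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (Session) in.readObject();
        }
    }
}
```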
This can be useful in two, usually distinct, situations:
Suppose that the first case applies. A field takes values that aren't serializable. If the field is still an important part of the state of our instance, such as elementData in the case of an ArrayList, simply declaring the variable to be transient isn't good enough. We need to save and restore the state stored in the variable. This is done by implementing a pair of methods with the following signatures:
private void writeObject(java.io.ObjectOutputStream out) throws IOException;
private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException;
When the serialization mechanism starts to write out an object, it will check to see whether the class implements writeObject( ). If so, the serialization mechanism will not use the default mechanism and will not write out any of the instance variables. Instead, it will call writeObject( ) and depend on the method to store out all the important state. Here is ArrayList's implementation of writeObject( ):
private synchronized void writeObject(java.io.ObjectOutputStream stream) throws java.io.IOException {
    stream.defaultWriteObject( );
    stream.writeInt(elementData.length);
    for (int i=0; i<size; i++)
        stream.writeObject(elementData[i]);
}
The first thing this does is call defaultWriteObject( ). defaultWriteObject( ) invokes the default serialization mechanism, which serializes all the nontransient, nonstatic instance variables. Next, the method writes out elementData.length and then calls the stream's writeObject( ) for each of the size elements actually stored in elementData.
There's an important point here that is sometimes missed: readObject( ) and writeObject( ) are a pair of methods that need to be implemented together. If you do any customization of serialization inside one of these methods, you need to implement the other method. If you don't, the serialization algorithm will fail.
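To show the two methods working as a pair, here is a minimal sketch of a container written in the same style. SimpleBag and SimpleBagDemo are hypothetical classes, not the java.util implementation: the backing array is transient, writeObject( ) stores only the slots that are in use, and readObject( ) reads them back in exactly the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

final class SimpleBag implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient Object[] elements = new Object[4]; // may contain unused slots
    private int size;

    void add(Object element) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, size * 2);
        }
        elements[size++] = element;
    }

    Object get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return elements[index];
    }

    int size() { return size; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();               // writes the nontransient field: size
        for (int i = 0; i < size; i++) {
            out.writeObject(elements[i]);       // then each element that is in use
        }
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();                 // restores size
        // Field initializers do not run during deserialization, so allocate here.
        elements = new Object[Math.max(4, size)];
        for (int i = 0; i < size; i++) {
            elements[i] = in.readObject();      // same order as writeObject()
        }
    }
}

final class SimpleBagDemo {
    static SimpleBag copy(SimpleBag original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (SimpleBag) in.readObject();
        }
    }
}
```

If readObject( ) read the elements in a different order, or skipped the call to defaultReadObject( ), the stream would no longer line up with what writeObject( ) produced.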
Unit Tests and Serialization
Unit tests are used to test a specific piece of functionality in a class. They are explicitly not end-to-end or application-level tests. It's often a good idea to adopt a unit-testing harness such as JUnit when developing an application. JUnit gives you an automated way to run unit tests on individual classes and is available from http://www.junit.org/.
If you adopt a unit-testing methodology, then any serializable class should pass the following three tests:
- If it implements readObject( ), it should implement writeObject( ), and vice-versa.
- It is equal (using the equals( ) method) to a serialized copy of itself.
- It has the same hashcode as a serialized copy of itself.
Similar constraints hold for classes that implement the Externalizable interface.
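The three checks can be written as small helper methods that a unit test would call. The sketch below deliberately avoids any particular test harness: SerializationChecks, Cents, and NoEquals are hypothetical names, and the methods simply return booleans that a JUnit test (or any other harness) could assert on.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical value class that overrides equals() and hashCode().
final class Cents implements Serializable {
    private static final long serialVersionUID = 1L;
    private final int amount;
    Cents(int amount) { this.amount = amount; }
    @Override public boolean equals(Object other) {
        return other instanceof Cents && ((Cents) other).amount == amount;
    }
    @Override public int hashCode() { return amount; }
}

// Hypothetical class that relies on the identity-based defaults from Object.
final class NoEquals implements Serializable {
    private static final long serialVersionUID = 1L;
}

final class SerializationChecks {
    static Object serializedCopyOf(Serializable original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readObject();
        }
    }

    // Test 1: readObject() and writeObject() are declared together or not at all.
    static boolean hasMatchingHooks(Class<?> type) {
        return declares(type, "writeObject", ObjectOutputStream.class)
                == declares(type, "readObject", ObjectInputStream.class);
    }

    // Test 2: the object equals a serialized copy of itself.
    static boolean equalsItsCopy(Serializable original)
            throws IOException, ClassNotFoundException {
        return original.equals(serializedCopyOf(original));
    }

    // Test 3: the object has the same hashcode as a serialized copy of itself.
    static boolean sameHashCodeAsCopy(Serializable original)
            throws IOException, ClassNotFoundException {
        return original.hashCode() == serializedCopyOf(original).hashCode();
    }

    private static boolean declares(Class<?> type, String name, Class<?> parameter) {
        try {
            type.getDeclaredMethod(name, parameter);
            return true;
        } catch (NoSuchMethodException absent) {
            return false;
        }
    }
}
```

A class such as NoEquals fails the second check, because the copy is a different instance and the inherited equals( ) compares identities.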
The final option that can be used is to explicitly declare which fields should be stored by the serialization mechanism. This is done using a special static final variable called serialPersistentFields, as shown in the following code snippet:
private static final ObjectStreamField[] serialPersistentFields = { new ObjectStreamField("size", Integer.TYPE), .... };
This line of code declares that the field named size, which is of type int, is a serial persistent field and will be written to the output stream by the serialization mechanism. Declaring serialPersistentFields is almost the opposite of declaring some fields transient. The meaning of transient is, "This field shouldn't be stored by serialization," and the meaning of serialPersistentFields is, "These fields should be stored by serialization."
But there is one important difference between declaring some variables to be transient and others to be serialPersistentFields. In order to declare variables to be transient, they must be locally declared. In other words, you must have access to the code that declares the variable. There is no such requirement for serialPersistentFields. You simply provide the name of the field and the type.
TIP: What if you try to do both? That is, suppose you declare some variables to be transient, and then also provide a definition for serialPersistentFields? The answer is that the transient keyword is ignored; the definition of serialPersistentFields is definitive.
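A small sketch shows serialPersistentFields acting as the definitive list. Gauge and GaugeDemo are hypothetical classes: Gauge has two ordinary instance fields, but only size is named in serialPersistentFields, so only size should survive a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;

final class Gauge implements Serializable {
    private static final long serialVersionUID = 1L;

    // Only the fields named here are stored by the default mechanism.
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("size", Integer.TYPE)
    };

    private int size;
    private String scratch; // not transient, but not listed above either

    Gauge(int size, String scratch) {
        this.size = size;
        this.scratch = scratch;
    }

    int size() { return size; }
    String scratch() { return scratch; }
}

final class GaugeDemo {
    static Gauge copy(Gauge original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (Gauge) in.readObject();
        }
    }
}
```

In the copy, size keeps its value while scratch comes back as null, even though scratch was never marked transient.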
So far, we've talked only about instance-level state. What about class-level state? Suppose you have important information stored in a static variable. Static variables won't get saved by serialization unless you add special code to do so. In our context (shipping objects over the wire between clients and servers), statics are usually a bad idea anyway.
After you've handled the locally declared state, you may still need to worry about variables declared in a superclass. If the superclass implements the Serializable interface, then you don't need to do anything. The serialization mechanism will handle everything for you, either by using default serialization or by invoking writeObject( )/readObject( ) if they are declared in the superclass.
If the superclass doesn't implement Serializable, you will need to store its state. There are two different ways to approach this. You can use serialPersistentFields to tell the serialization mechanism about some of the superclass instance variables, or you can use writeObject( )/readObject( ) to handle the superclass state explicitly. Both of these, unfortunately, require you to know a fair amount about the superclass. If you're getting the .class files from another source, you should be aware that versioning issues can cause some really nasty problems. If you subclass a class, and that class's internal representation of instance-level state changes, you may not be able to load in your serialized data. While you can sometimes work around this by using a sufficiently convoluted readObject( ) method, this may not be a solvable problem. We'll return to this later. However, be aware that the ultimate solution may be to just implement the Externalizable interface instead, which we'll talk about later.
Another aspect of handling the state of a nonserializable superclass is that nonserializable superclasses must have a zero-argument constructor. This isn't important for serializing out an object, but it's incredibly important when deserializing an object. Deserialization works by creating an instance of a class and filling out its fields correctly. During this process, the deserialization algorithm doesn't actually call any of the serialized class's constructors, but does call the zero-argument constructor of the first nonserializable superclass. If there isn't a zero-argument constructor, then the deserialization algorithm can't create instances of the class, and the whole process fails.
WARNING: If you can't create a zero-argument constructor in the first nonserializable superclass, you'll have to implement the Externalizable interface instead.
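The constructor behavior described above can be observed with a small experiment. In this sketch, Shape and Circle are hypothetical classes: Shape is the nonserializable superclass, and Circle stores the superclass state by hand in writeObject( )/readObject( ). The constructorCalls counter exists only to make the zero-argument constructor call visible.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical nonserializable superclass.
class Shape {
    static int constructorCalls = 0;  // counts zero-argument constructor calls
    protected String name;

    Shape() {                         // invoked by deserialization
        constructorCalls++;
        name = "unnamed";
    }

    Shape(String name) {
        this.name = name;
    }
}

// Hypothetical serializable subclass that saves the superclass state itself.
class Circle extends Shape implements Serializable {
    private static final long serialVersionUID = 1L;
    private double radius;

    Circle(String name, double radius) {
        super(name);
        this.radius = radius;
    }

    double getRadius() { return radius; }
    String getName() { return name; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();     // writes radius
        out.writeObject(name);        // superclass state, stored explicitly
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();       // restores radius
        name = (String) in.readObject();
    }
}

class SuperclassStateDemo {
    static Circle roundTrip(Circle original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Circle) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Circle copy = roundTrip(new Circle("wheel", 2.0));
        System.out.println(copy.getName() + " " + copy.getRadius());
        System.out.println("Shape() calls: " + Shape.constructorCalls);
    }
}
```

Circle's own constructor is not run for the copy; only Shape( ) is, after which readObject( ) overwrites the placeholder name.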
Simply adding a zero-argument constructor might seem a little problematic. Suppose the object already has several constructors, all of which take arguments. If you simply add a zero-argument constructor, then the serialization mechanism might leave the object in a half-initialized, and therefore unusable, state.
However, since serialization will supply the instance variables with correct values from an active instance immediately after instantiating the object, the only way this problem could arise is if the constructors actually do something with their arguments--besides setting variable values.
If all the constructors take arguments and actually execute initialization code as part of the constructor, then you may need to refactor a bit. The usual solution is to move the local initialization code into a new method (usually named something like initialize( )), which is then called from the original constructor. That is, you change code that looks like:
public MyObject(arglist) {
// set local variables from arglist
// perform local initialization
}
to something that looks like:
private MyObject( ) {
// zero argument constructor, invoked by serialization
// and never by any other
// piece of code.
// note that it doesn't call initialize( )
}
public MyObject(arglist) {
// set local variables from arglist
initialize( );
}
private void initialize( ) {
// perform local initialization
}
After this is done, writeObject( )/readObject( ) should be implemented, and readObject( ) should end with a call to initialize( ). Sometimes this will result in code that simply invokes the default serialization mechanism, as in the following snippet:
private void writeObject(java.io.ObjectOutputStream stream) throws
java.io.IOException {
stream.defaultWriteObject( );
}
private void readObject(java.io.ObjectInputStream stream) throws
java.io.IOException, ClassNotFoundException {
stream.defaultReadObject( );
initialize( );
}
TIP: If creating a zero-argument constructor is difficult (for example, you don't have the source code for the superclass), your class will need to implement the Externalizable interface instead of Serializable.
The default implementations of equals( ) and hashCode( ), which are inherited from java.lang.Object, simply use an instance's location in memory. This can be problematic. Consider our previous deep copy code example:
ByteArrayOutputStream memoryOutputStream = new ByteArrayOutputStream( );
ObjectOutputStream serializer = new ObjectOutputStream(memoryOutputStream);
serializer.writeObject(serializableObject);
serializer.flush( );
ByteArrayInputStream memoryInputStream = new ByteArrayInputStream(memoryOutputStream.
toByteArray( ));
ObjectInputStream deserializer = new ObjectInputStream(memoryInputStream);
Object deepCopyOfOriginalObject = deserializer.readObject( );
The potential problem here involves the following boolean test:
serializableObject.equals(deepCopyOfOriginalObject)
Sometimes, as in the case of Money and DocumentDescription, the answer should be true. If two instances of Money have the same values for _cents, then they are equal. However, the implementation of equals( ) inherited from Object will return false.
The same problem occurs with hashCode( ). Note that Object implements hashCode( ) by returning an identity-based value, typically derived from the memory address of the instance. Hence, two distinct instances will almost never have the same hashCode( ) using Object's implementation. If two objects are equal, however, then they should have the same hashcode. So if you need to override equals( ), you probably need to override hashCode( ) as well.
TIP: With the exception of declaring variables to be transient, all our changes involve adding functionality. Making a class serializable rarely involves significant changes to its functionality and shouldn't result in any changes to method implementations. This means that it's fairly easy to retrofit serialization onto an existing object hierarchy. The hardest part is usually implementing equals( ) and hashCode( ).
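To make the equals( )/hashCode( ) requirement concrete, here is a minimal sketch of a value class that stays equal to its own deep copy. SimpleMoney is a hypothetical stand-in for the book's Money class (only the _cents field is modeled), and deepCopy( ) is a helper written for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for Money; only _cents is modeled.
class SimpleMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    private final int _cents;

    SimpleMoney(int cents) {
        _cents = cents;
    }

    // Two instances are equal when they hold the same number of cents.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof SimpleMoney)) {
            return false;
        }
        return _cents == ((SimpleMoney) other)._cents;
    }

    // Equal instances must report equal hashcodes.
    @Override
    public int hashCode() {
        return Integer.hashCode(_cents);
    }
}

class DeepCopyDemo {
    // Deep copy via serialization, as in the snippet shown earlier.
    static Object deepCopy(Serializable original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SimpleMoney original = new SimpleMoney(1250);
        Object copy = deepCopy(original);
        System.out.println(copy != original);        // distinct instances
        System.out.println(original.equals(copy));   // equal by value
    }
}
```

Without the two overrides, the same comparison would fall back to Object's identity-based behavior and report the copy as unequal.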
To make this more concrete, we now turn to the DocumentDescription class from the RMI version of our printer server, which we implemented in Chapter 4. The code for the first nonserializable version of DocumentDescription was the following:
public class DocumentDescription implements PrinterConstants {
private InputStream _actualDocument;
private int _length;
private int _documentType;
private boolean _printTwoSided;
private int _printQuality;
public DocumentDescription(InputStream actualDocument) throws IOException {
this(actualDocument, DEFAULT_DOCUMENT_TYPE, DEFAULT_PRINT_TWO_SIDED,
DEFAULT_PRINT_QUALITY);
}
public DocumentDescription(InputStream actualDocument, int documentType, boolean
printTwoSided, int printQuality)
throws IOException {
_documentType = documentType;
_printTwoSided = printTwoSided;
_printQuality = printQuality;
BufferedInputStream buffer = new BufferedInputStream(actualDocument);
DataInputStream dataInputStream = new DataInputStream(buffer);
ByteArrayOutputStream temporaryBuffer = new ByteArrayOutputStream( );
_length = copy(dataInputStream, new DataOutputStream(temporaryBuffer));
_actualDocument = new DataInputStream(new
ByteArrayInputStream(temporaryBuffer.toByteArray( )));
}
public int getDocumentType( ) {
return _documentType;
}
public boolean isPrintTwoSided( ) {
return _printTwoSided;
}
public int getPrintQuality( ) {
return _printQuality;
}
private int copy(InputStream source, OutputStream destination) throws
IOException {
int nextByte;
int numberOfBytesCopied = 0;
while(-1!= (nextByte = source.read( ))) {
destination.write(nextByte);
numberOfBytesCopied++;
}
destination.flush( );
return numberOfBytesCopied;
}
}
We will make this into a serializable class by following the steps outlined in the previous section.
Implementing the Serializable interface is easy. All we need to do is change the class declaration:
public class DocumentDescription implements Serializable, PrinterConstants
We have five fields to take care of:
private InputStream _actualDocument;
private int _length;
private int _documentType;
private boolean _printTwoSided;
private int _printQuality;
Of these, four are primitive types that serialization can handle without any problem. However, _actualDocument is a problem. InputStream is not a serializable class. And the contents of _actualDocument are very important; _actualDocument contains the document we want to print. There is no point in serializing an instance of DocumentDescription unless we somehow serialize _actualDocument as well.
If we have fields that serialization cannot handle, and they must be serialized, then our only option is to implement readObject( ) and writeObject( ). For DocumentDescription, we declare _actualDocument to be transient and then implement readObject( ) and writeObject( ) as follows:
private transient InputStream _actualDocument;
private void writeObject(java.io.ObjectOutputStream out) throws IOException {
out.defaultWriteObject( );
copy(_actualDocument, out);
}
private void readObject(java.io.ObjectInputStream in) throws IOException,
ClassNotFoundException {
in.defaultReadObject( );
ByteArrayOutputStream temporaryBuffer = new ByteArrayOutputStream( );
copy(in, temporaryBuffer, _length);
_actualDocument = new DataInputStream(new
ByteArrayInputStream(temporaryBuffer.toByteArray( )));
}
private void copy(InputStream source, OutputStream destination, int length)
throws IOException {
int counter;
int nextByte;
for (counter = 0; counter <length; counter++) {
nextByte = source.read( );
destination.write(nextByte);
}
destination.flush( );
}
Note that we declare _actualDocument to be transient and call defaultWriteObject( ) in the first line of our writeObject( ) method. Doing these two things allows the standard serialization mechanism to serialize the other four instance variables without any extra effort on our part. We then simply copy _actualDocument to the stream.
Our implementation of readObject( ) simply calls defaultReadObject( ) and then reads _actualDocument from the stream. In order to read _actualDocument from the stream, we used the length of the document, which had previously been written to the stream. In essence, we needed to encode some metadata into the stream, in order to correctly pull our data out of the stream.
This code is a little ugly. We're using serialization, but we're still forced to think about how to encode some of our state when we're sending it out of the stream. In fact, the code for writeObject( ) and readObject( ) is remarkably similar to the marshalling code we implemented directly for the socket-based version of the printer server. This is, unfortunately, often the case. Serialization's default implementation handles simple objects very well. But, every now and then, you will want to send a nonserializable object over the wire, or improve the serialization algorithm for efficiency. Doing so amounts to writing the same code you write if you implement all the socket handling yourself, as in our socket-based version of the printer server.
TIP: There is also an order dependency here. The first value written must be the first value read. Since we start writing by calling defaultWriteObject( ), we have to start reading by calling defaultReadObject( ). On the bright side, this means we'll have an accurate value for _length before we try to read _actualDocument from the stream.
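The same length-then-bytes pattern can be tried out in isolation. The sketch below is not the book's DocumentDescription; Payload is a hypothetical, stripped-down class with one ordinary field (length) and one transient stream, and readAll( ) and roundTrip( ) are helpers added for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical, simplified analogue of DocumentDescription.
class Payload implements Serializable {
    private static final long serialVersionUID = 1L;

    private int length;                     // written by defaultWriteObject()
    private transient InputStream content;  // not serializable; handled by hand

    Payload(byte[] data) {
        length = data.length;
        content = new ByteArrayInputStream(data.clone());
    }

    // Reads the whole document, then rewinds by rebuilding the stream.
    byte[] readAll() throws IOException {
        byte[] result = new byte[length];
        new DataInputStream(content).readFully(result);
        content = new ByteArrayInputStream(result);
        return result;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();   // metadata first: length goes out here
        out.write(readAll());       // then the raw bytes of the document
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();     // length is valid from this point on
        byte[] data = new byte[length];
        in.readFully(data);         // read exactly as many bytes as were written
        content = new ByteArrayInputStream(data);
    }
}

class PayloadDemo {
    static Payload roundTrip(Payload original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Payload) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Payload copy = roundTrip(new Payload(new byte[] {1, 2, 3, 4, 5}));
        System.out.println(copy.readAll().length);
    }
}
```

Swapping the two statements in readObject( ) would break the sketch, since length would still be zero when the bytes are read.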
Handling superclass state isn't a problem here. The superclass, java.lang.Object, doesn't actually have any important state that we need to worry about. Since it also already has a zero-argument constructor, we don't need to do anything.
In our current implementation of the printer server, we don't need to override equals( ) and hashCode( ). The server never checks for equality between instances of DocumentDescription. Nor does it store them in a container object that relies on their hashcodes.
Did We Cheat When Implementing Serializable for DocumentDescription?
It may seem like we cheated a bit in implementing DocumentDescription. Three of the five steps in making a class serializable didn't actually result in changes to the code. Indeed, the only work we really did was implementing readObject( ) and writeObject( ). But it's not really cheating. Serialization is just designed to be easy to use. It has a good set of defaults, and, at least in the case of value objects intended to be passed over the wire, the default behavior is often good enough.
By now, you should have a pretty good feel for how the serialization mechanism works for individual classes. The next step in explaining serialization is to discuss the actual serialization algorithm in a little more detail. This discussion won't handle all the details of serialization (though we'll come close). Instead, the idea is to cover the algorithm and protocol, so you can understand how the various hooks for customizing serialization work and how they fit into the context of an RMI application.
The first step is to discuss what gets written to the stream when an instance is serialized. Be warned: it's a lot more information than you might guess from the previous discussion.
An important part of serialization involves writing out class-related metadata associated with an instance. Most instances are more than one class. For example, an instance of String is also an instance of Object. Any given instance, however, is an instance of only a few classes. These classes can be written as a sequence: C1, C2... CN, in which C1 is a superclass of C2, C2 is a superclass of C3, and so on. This is actually a linear sequence because Java is a single inheritance language for classes. We call C1 the least superclass and CN the most-derived class. See Figure 10-4.
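After writing out the associated class information, the serialization mechanism stores out the following information for each instance:
- A description of the most-derived class
- Data associated with the instance, interpreted as an instance of the least superclass
- Data associated with the instance, interpreted as an instance of the second least superclass
And so on until:
- Data associated with the instance, interpreted as an instance of the most-derived class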
So what really happens is that the type of the instance is stored out, and then all the serializable state is stored in discrete chunks that correspond to the class structure. But there's a question still remaining: what do we mean by "a description of the most-derived class?" This is either a reference to a class description that has already been recorded (e.g., an earlier location in the stream) or the following information:
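- The name of the class
- The version ID of the class, which is used to validate the .class files
- Flags recording, among other things, whether writeObject( )/readObject( ) are implemented
- The number of serializable fields and a description (name and type) of each
- Extra data produced by ObjectOutputStream's annotateClass( ) method
- A description of the superclass, if the superclass is serializable
This should, of course, immediately seem familiar. The class descriptions consist entirely of metadata that allows the instance to be read back in. In fact, this is one of the most beautiful aspects of serialization; the serialization mechanism automatically, at runtime, converts class objects into metadata so instances can be serialized with the least amount of programmer work.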
In this section, I describe a slightly simplified version of the serialization algorithm. I then proceed to a more complete description of the serialization process in the next section.
Because the class descriptions actually contain the metadata, the basic idea behind the serialization algorithm is pretty easy to describe. The only tricky part is handling circular references.
The problem is this: suppose instance A refers to instance B. And instance B refers back to instance A. Completely writing out A requires you to write out B. But writing out B requires you to write out A. Because you don't want to get into an infinite loop, or even write out an instance or a class description more than once (serialization is a slow process that uses the reflection API quite heavily, in addition to consuming bandwidth), you need to keep track of what's already been written to the stream. ObjectOutputStream does this by maintaining a mapping from instances and classes to handles. When writeObject( ) is called with an argument that has already been written to the stream, the handle is written to the stream, and no further operations are necessary.
If, however, writeObject( ) is passed an instance that has not yet been written to the stream, two things happen. First, the instance is assigned a reference handle, and the mapping from instance to reference handle is stored by ObjectOutputStream. The handle that is assigned is the next integer in a sequence.
TIP: Remember the reset( ) method on ObjectOutputStream? It clears the mapping and resets the handle counter to 0x7E0000. RMI also automatically resets its serialization mechanism after every remote method call.
Second, the instance data is written out as per the data format described earlier. This can involve some complications if the instance has a field whose value is also a serializable instance. In this case, the serialization of the first instance is suspended, and the second instance is serialized in its place (or, if the second instance has already been serialized, the reference handle for the second instance is written out). After the second instance is fully serialized, serialization of the first instance resumes. The contents of the stream look a little bit like Figure 10-5.
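The effect of the handle table can be seen from the outside with a short experiment. In this sketch, Node and writeTwice( ) are hypothetical names: two nodes refer to each other, and the same node is written to one stream twice, with and without a call to reset( ) in between.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class used to build a circular reference.
class Node implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    Node other;

    Node(String name) {
        this.name = name;
    }
}

class HandleDemo {
    // Writes the same node twice to one stream and reads both results back.
    static Object[] writeTwice(Node node, boolean resetBetween)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(node);
            if (resetBetween) {
                out.reset();          // forget every handle assigned so far
            }
            out.writeObject(node);    // a handle, unless reset() was called
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return new Object[] { in.readObject(), in.readObject() };
        }
    }

    public static void main(String[] args) throws Exception {
        Node a = new Node("A");
        Node b = new Node("B");
        a.other = b;
        b.other = a;                  // A -> B -> A

        Object[] shared = writeTwice(a, false);
        Node copyOfA = (Node) shared[0];
        System.out.println(copyOfA.other.other == copyOfA); // cycle preserved
        System.out.println(shared[0] == shared[1]);         // same instance via handle

        Object[] separate = writeTwice(a, true);
        System.out.println(separate[0] == separate[1]);     // two independent copies
    }
}
```

The cycle survives because the second visit to A is written as a handle rather than as a fresh copy; after reset( ), that bookkeeping is gone and a second full copy goes out.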
From the description of writing, it's pretty easy to guess most of what happens when readObject( ) is called. Unfortunately, because of versioning issues, the implementation of readObject( ) is actually a little bit more complex than you might guess.
When it reads in an instance description, ObjectInputStream gets the following information:
The problem is that the class descriptions that the instance of ObjectInputStream reads from the stream may not be equivalent to the class descriptions of the same classes in the local JVM. For example, if an instance is serialized to a file and then read back in three years later, there's a pretty good chance that the class definitions used to serialize the instance have changed.
This means that ObjectInputStream uses the class descriptions in two ways:
RMI doesn't actually use ObjectOutputStream and ObjectInputStream. Instead, it uses custom subclasses so it can modify the serialization process by overriding some protected methods. In this section, we'll discuss the most important modifications that RMI makes when serializing instances. RMI makes similar changes when deserializing instances, but they follow from, and can easily be deduced from, the description of the serialization changes.
Recall that ObjectOutputStream contained the following protected methods:
protected void annotateClass(Class cl)
protected void annotateProxyClass(Class cl)
protected boolean enableReplaceObject(boolean enable)
protected Object replaceObject(Object obj)
protected void drain( )
protected void writeObjectOverride(Object obj)
protected void writeClassDescriptor(ObjectStreamClass classdesc)
protected void writeStreamHeader( )
These all have default implementations in ObjectOutputStream. That is, annotateClass( ) and annotateProxyClass( ) do nothing. enableReplaceObject( ) returns false, and so on. However, these methods are still called during serialization. And RMI, by overriding these methods, customizes the serialization process.
The three most important methods from the point of view of RMI are:
protected void annotateClass(Class cl)
protected boolean enableReplaceObject(boolean enable)
protected Object replaceObject(Object obj)
Let's describe how RMI overrides each of these.
ObjectOutputStream calls annotateClass( ) when it writes out class descriptions. Annotations are used to provide extra information about a class that comes from the serialization mechanism and not from the class itself. The basic serialization mechanism has no real need for annotations; most of the information about a given class is already stored in the stream.
TIP: RMI's dynamic classloading system uses annotateClass( ) to record where .class files are stored. We'll discuss this more in Chapter 19.
RMI, on the other hand, uses annotations to record codebase information. That is,
RMI, in addition to recording the class descriptions, also records information
about the location from which it loaded the class's bytecode. Codebases are
often simply locations in a filesystem. Incidentally, locations in a
filesystem are often useless information, since the JVM that deserializes the
instances may have a very different filesystem than the one from where the
instances were serialized. However, a codebase isn't restricted to being a
location in a filesystem. The only restriction on codebases is that they have
to be valid URLs. That is, a codebase is a URL that specifies a location on
the network from which the bytecode for a class can be obtained. This enables
RMI to dynamically load new classes based on the serialized information in the
stream. We'll return to this in Chapter 19.
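The annotation hook can be exercised without RMI. The following sketch is not RMI's actual implementation; it only imitates the idea with two hypothetical stream subclasses. AnnotatingOutputStream writes a codebase string after each class description, and AnnotationReadingInputStream reads it back in resolveClass( ). The Ticket class and the example URL are likewise invented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class to send through the streams.
class Ticket implements Serializable {
    private static final long serialVersionUID = 1L;
    int number;

    Ticket(int number) {
        this.number = number;
    }
}

// Writes a codebase string after every class description.
class AnnotatingOutputStream extends ObjectOutputStream {
    private final String codebase;

    AnnotatingOutputStream(OutputStream out, String codebase) throws IOException {
        super(out);
        this.codebase = codebase;
    }

    @Override
    protected void annotateClass(Class<?> cl) throws IOException {
        writeObject(codebase);
    }
}

// Reads the annotation back before resolving the class locally.
class AnnotationReadingInputStream extends ObjectInputStream {
    final List<String> annotations = new ArrayList<>();

    AnnotationReadingInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        annotations.add((String) readObject());
        return super.resolveClass(desc);   // a real system might load from the codebase
    }
}

class AnnotationDemo {
    // Returns the codebase annotations seen while reading the value back.
    static List<String> roundTrip(Serializable value, String codebase)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (AnnotatingOutputStream out = new AnnotatingOutputStream(bytes, codebase)) {
            out.writeObject(value);
        }
        try (AnnotationReadingInputStream in = new AnnotationReadingInputStream(
                 new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readObject();
            return in.annotations;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(new Ticket(7), "http://example.com/classes/"));
    }
}
```

The writer and reader must agree on what the annotation contains; a plain ObjectInputStream would simply skip the extra data.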
The idea of replacement is simple; sometimes the instance that is passed to the serialization mechanism isn't the instance that ought to be written out to the data stream. To make this more concrete, recall what happened when we called rebind( ) to register a server with the RMI registry. The following code was used in the bank example:
Account_Impl newAccount = new Account_Impl(serverDescription.balance);
Naming.rebind(serverDescription.name, newAccount);
System.out.println("Account " + serverDescription.name + " successfully launched.");
This creates an instance of Account_Impl and then calls rebind( ) with that instance. Account_Impl is a server that implements the Remote interface, but not the Serializable interface. And yet, somehow, the registry, which is running in a different JVM, is sent something.
What the registry actually gets is a stub. The stub for Account_Impl, which was automatically generated by rmic, begins with:
public final class Account_Impl_Stub extends java.rmi.server.RemoteStub
java.rmi.server.RemoteStub is a class that implements the Serializable interface. The RMI serialization mechanism knows that whenever a remote server is "sent" over the wire, the server object should be replaced by a stub that knows how to communicate with the server (e.g., a stub that knows on which machine and port the server is listening).
Calling Naming.rebind( ) actually winds up passing a stub to the RMI registry. When clients make calls to Naming.lookup( ), as in the following code snippet, they also receive copies of the stub. Since the stub is serializable, there's no problem in making a copy of it:
_account = (Account)Naming.lookup(_accountNameField.getText( ));
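The replacement of a server by its stub can be imitated with plain java.io classes. The sketch below is only an analogy, not RMI's code: AccountServer, AccountStub, and StubReplacingOutputStream are hypothetical names, and the stub simply remembers a host and port.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// Hypothetical server object; note that it is NOT Serializable.
class AccountServer {
    final String host;
    final int port;

    AccountServer(String host, int port) {
        this.host = host;
        this.port = port;
    }
}

// Hypothetical serializable stub that knows where the server listens.
class AccountStub implements Serializable {
    private static final long serialVersionUID = 1L;
    final String host;
    final int port;

    AccountStub(String host, int port) {
        this.host = host;
        this.port = port;
    }
}

// Swaps servers for stubs as they pass through the stream.
class StubReplacingOutputStream extends ObjectOutputStream {
    StubReplacingOutputStream(OutputStream out) throws IOException {
        super(out);
        enableReplaceObject(true);    // replacement is off by default
    }

    @Override
    protected Object replaceObject(Object obj) {
        if (obj instanceof AccountServer) {
            AccountServer server = (AccountServer) obj;
            return new AccountStub(server.host, server.port);
        }
        return obj;                   // everything else is written unchanged
    }
}

class ReplacementDemo {
    // "Sends" a value through the replacing stream and returns what arrives.
    static Object send(Object value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (StubReplacingOutputStream out = new StubReplacingOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Object received = send(new AccountServer("bankhost", 1099));
        System.out.println(received.getClass().getName());
    }
}
```

Writing the same AccountServer through an ordinary ObjectOutputStream would fail with a NotSerializableException, which is why the replacement step matters.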
In order to enable this behavior, ObjectOutputStream calls enableReplaceObject( ) and replaceObject( ) during the serialization process. In other words, when an instance is about to be serialized, ObjectOutputStream does the following:
- It calls enableReplaceObject( ) to see whether instance replacement is enabled.
- If instance replacement is enabled, it calls replaceObject( ), passing in the instance it was about to serialize, to find out which instance it should really write to the stream.
A question that frequently arises as distributed applications get more complicated involves message forwarding. For example, suppose that we have three communicating programs: A, B, and C. At the start, A has a stub for B, B has a stub for C, and C has a stub for A. See Figure 10-6.
Now, what happens if A calls a method, for example, getOtherServer( ), on B that "returns" C? The answer is that A gets a deep copy of the stub B uses to communicate with C. That is, A now has a direct connection to C; whenever A tries to send a message to C, B is not involved at all. This is illustrated in Figure 10-7.
This is very good from a bandwidth and network latency point of view. But it can also be somewhat problematic. Suppose, for example, B implements load balancing. Since B isn't involved in the A to C communication, it has no direct way of knowing whether A is still using C, or how heavily. We'll revisit this in later chapters, when we discuss the distributed garbage collector and the Unreferenced interface.
A few pages back, I described the serialization mechanism:
The serialization mechanism automatically, at runtime, converts class objects into metadata so instances can be serialized with the least amount of programmer work.
This is great as long as the classes don't change. When classes change, the metadata, which was created from obsolete class objects, accurately describes the serialized information. But it might not correspond to the current class implementations.
There are two basic types of versioning problems that can occur. The first occurs when a change is made to the class hierarchy (e.g., a superclass is added or removed). Suppose, for example, a personnel application made use of two serializable classes: Employee and Manager (a subclass of Employee). For the next version of the application, two more classes need to be added: Contractor and Consultant. After careful thought, the new hierarchy is based on the abstract superclass Person, which has two direct subclasses: Employee and Contractor. Consultant is defined as a subclass of Contractor, and Manager is a subclass of Employee. See Figure 10-8.
While introducing Person is probably good object-oriented design, it breaks serialization. Recall that serialization relied on the class hierarchy to define the data format.
The second type of version problem arises from local changes to a serializable class. Suppose, for example, that in our bank example, we want to add the possibility of handling different currencies. To do so, we define a new class, Currency, and change the definition of Money:
public class Money extends ValueObject {
public float amount;
public Currency typeOfMoney;
}
This completely changes the definition of Money but doesn't change the object hierarchy at all.
The important distinction between the two types of versioning problems is that the first type can't really be repaired. If you have old data lying around that was serialized using an older class hierarchy, and you need to use that data, your best option is probably something along the lines of the following:
The second type of versioning problem, on the other hand, can be handled locally, within the class definition.
In order for serialization to gracefully detect when a versioning problem has occurred, it needs to be able to detect when a class has changed. As with all the other aspects of serialization, there is a default way that serialization does this. And there is a way for you to override the default.
The default involves a hashcode. Serialization creates a single hashcode, of type long, from the following information:
- The class name and modifiers
- The names of any interfaces the class implements
- Descriptions of all methods and constructors except private methods and constructors
- Descriptions of all fields except private static and private transient fields
This single long, called the class's stream unique identifier (often abbreviated suid), is used to detect when a class changes. It is an extraordinarily sensitive index. For example, suppose we add the following method to Money:
public boolean isBigBucks( ) {
return _cents > 5000;
}
We haven't changed, added, or removed any fields; we've simply added a method with no side effects at all. But adding this method changes the suid. Prior to adding it, the suid was 6625436957363978372L; afterwards, it was -3144267589449789474L. Moreover, if we had made isBigBucks( ) a protected method, the suid would have been 4747443272709729176L.
TIP: These numbers can be computed using the serialver program that ships with the JDK. For example, these were all computed by typing serialver com.ora.rmibook.chapter10.Money at the command line for slightly different versions of the Money class.
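The same value can also be inspected from inside a program through java.io.ObjectStreamClass. The sketch below uses two hypothetical classes, Unversioned and Versioned, which differ only in whether serialVersionUID is declared; suidOf( ) is a small helper written for this example.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Hypothetical class without an explicit identifier; serialization computes one.
class Unversioned implements Serializable {
    int cents;
}

// Hypothetical class that declares its identifier explicitly.
class Versioned implements Serializable {
    private static final long serialVersionUID = 1L;
    int cents;
}

class SuidDemo {
    // Looks up the identifier serialization would use for the given class.
    static long suidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(suidOf(Unversioned.class)); // computed hash of the class shape
        System.out.println(suidOf(Versioned.class));   // the declared value
    }
}
```

Adding a method to Unversioned and rerunning the program is a quick way to watch the computed value change, while the value for Versioned stays fixed.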
The default behavior for the serialization mechanism is a classic "better safe than sorry" strategy. The serialization mechanism uses the suid, which defaults to an extremely sensitive index, to tell when a class has changed. If so, the serialization mechanism refuses to create instances of the new class using data that was serialized with the old classes.
While this is reasonable as a default strategy, it would be painful if serialization didn't provide a way to override the default behavior. Fortunately, it does. Serialization uses only the default suid if a class definition doesn't provide one. That is, if a class definition includes a static final long named serialVersionUID, then serialization will use that static final long value as the suid. In the case of our Money example, if we included the line:
private static final long serialVersionUID = 1;
in our source code, then the suid would be 1, no matter how many changes we made to the rest of the class. Explicitly declaring serialVersionUID allows us to change the class, and add convenience methods such as isBigBucks( ), without losing backwards compatibility.
TIP: serialVersionUID doesn't have to be private. However, it must be static, final, and long.
The downside to using serialVersionUID is that, if a significant change is made (for example, if a field is added to the class definition), the suid will not reflect this difference. This means that the deserialization code might not detect an incompatible version of a class. Again, using Money as an example, suppose we had:
public class Money extends ValueObject {
private static final long serialVersionUID = 1;
protected int _cents;
}
and we migrated to:
public class Money extends ValueObject {
private static final long serialVersionUID = 1;
public float amount;
public Currency typeOfMoney;
}
The serialization mechanism won't detect that these are completely incompatible classes. Instead, when it tries to create the new instance, it will throw away all the data it reads in. Recall that, as part of the metadata, the serialization algorithm records the name and type of each field. Since it can't find the fields during deserialization, it simply discards the information.
The solution to this problem is to implement your own versioning inside of readObject( ) and writeObject( ). Your writeObject( ) method should begin by writing a version number:
private void writeObject(java.io.ObjectOutputStream out) throws IOException {
out.writeInt(VERSION_NUMBER);
....
}
In addition, your readObject( ) code should start with a switch statement based on the version number:
private void readObject(java.io.ObjectInputStream in) throws IOException,
ClassNotFoundException {
int version = in.readInt( );
switch(version) {
// version specific demarshalling code.
....}
}
Doing this will enable you to explicitly control the versioning of your class. In addition to the added control you gain over the serialization process, there is an important consequence you ought to consider before doing this. As soon as you start to explicitly version your classes, defaultWriteObject( ) and defaultReadObject( ) lose a lot of their usefulness.
Trying to control versioning puts you in the position of explicitly writing all the marshalling and demarshalling code. This is a trade-off you might not want to make.
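A minimal sketch of this style of versioning follows. VersionedMoney is a hypothetical class with two stream formats (version 1 stores cents as an int, version 2 as a long). The static formatToWrite field is a demonstration-only switch, added so that a single program can produce both formats; a real class would always write its newest format. All state is transient, so the class does its marshalling entirely by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class with hand-rolled stream versions.
class VersionedMoney implements Serializable {
    private static final long serialVersionUID = 1L;

    static final int VERSION_1 = 1;   // cents stored as an int
    static final int VERSION_2 = 2;   // cents stored as a long

    // Demonstration only: lets this sketch emit the older format too.
    static int formatToWrite = VERSION_2;

    private transient long cents;

    VersionedMoney(long cents) {
        this.cents = cents;
    }

    long getCents() {
        return cents;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeInt(formatToWrite);          // version number goes first
        if (formatToWrite == VERSION_1) {
            out.writeInt((int) cents);
        } else {
            out.writeLong(cents);
        }
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        int version = in.readInt();
        switch (version) {                    // version-specific demarshalling
            case VERSION_1:
                cents = in.readInt();
                break;
            case VERSION_2:
                cents = in.readLong();
                break;
            default:
                throw new InvalidObjectException("Unknown version: " + version);
        }
    }
}

class VersioningDemo {
    static VersionedMoney roundTrip(VersionedMoney original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (VersionedMoney) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        VersionedMoney.formatToWrite = VersionedMoney.VERSION_1;
        System.out.println(roundTrip(new VersionedMoney(250)).getCents());
        VersionedMoney.formatToWrite = VersionedMoney.VERSION_2;
        System.out.println(roundTrip(new VersionedMoney(5_000_000_000L)).getCents());
    }
}
```

The cost mentioned in the text is visible here: every field is written and read by hand, and each new format adds another case to maintain.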
Serialization is a generic marshalling and demarshalling algorithm, with many hooks for customization. As an experienced programmer, you should be skeptical--generic algorithms with many hooks for customization tend to be slow. Serialization is not an exception to this rule. It is, at times, both slow and bandwidth-intensive. There are three main performance problems with serialization: it depends on reflection, it has an incredibly verbose data format, and it is very easy to send more data than is required.
The dependence on reflection is the hardest of these to eliminate. Both serializing and deserializing require the serialization mechanism to discover information about the instance it is serializing. At a minimum, the serialization algorithm needs to find out things such as the value of serialVersionUID, whether writeObject( ) is implemented, and what the superclass structure is. What's more, using the default serialization mechanism (or calling defaultWriteObject( ) from within writeObject( )) will use reflection to discover all the field values. This can be quite slow.
TIP: Setting serialVersionUID is a simple, and often surprisingly noticeable, performance improvement. If you don't set serialVersionUID, the serialization mechanism has to compute it. This involves going through all the fields and methods and computing a hash. If you set serialVersionUID, on the other hand, the serialization mechanism simply looks up a single value.
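To see the metadata that the serialization mechanism discovers, you can ask java.io.ObjectStreamClass for a class's descriptor. The sketch below is illustrative only: the two nested classes are hypothetical, and the value 42 is arbitrary. It prints the declared serialVersionUID, a computed one, and the fields that default serialization would write.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;

final class SerialVersionUidDemo {

    // Hypothetical class that declares its own serialVersionUID.
    static class WithDeclaredUid implements Serializable {
        private static final long serialVersionUID = 42L;
        int cents;
        String label;
    }

    // Hypothetical class that leaves the value to be computed.
    static class WithComputedUid implements Serializable {
        int cents;
        String label;
    }

    // Returns the serialVersionUID the serialization mechanism will use.
    static long uidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    // Counts the fields that default serialization would write.
    static int serializableFieldCount(Class<?> type) {
        return ObjectStreamClass.lookup(type).getFields().length;
    }

    public static void main(String[] args) {
        System.out.println("declared UID: " + uidOf(WithDeclaredUid.class));
        System.out.println("computed UID: " + uidOf(WithComputedUid.class));
        ObjectStreamClass descriptor = ObjectStreamClass.lookup(WithDeclaredUid.class);
        for (ObjectStreamField field : descriptor.getFields()) {
            System.out.println(field.getType().getSimpleName() + " " + field.getName());
        }
    }
}
```

The computed value depends on the class's shape, so it is printed here rather than predicted; the declared value is simply read from the constant.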
Serialization's data format has two problems. The first is all
the class description information included in the stream. To send a single
instance of Money, we need to send all of the following:
- The description of the ValueObject class
- The description of the Money class
- The instance data associated with the specific instance of Money
This isn't a lot of information, but it's information that RMI computes and sends with every method invocation. (Recall that RMI resets the serialization mechanism with every method call.) Even if the first two bullets comprise only 100 extra bytes of information, the cumulative impact is probably significant.
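The cost of re-sending class descriptions can be observed without RMI. The sketch below is an approximation under stated assumptions: it uses ObjectOutputStream.reset( ) to imitate a stream that forgets what it has already sent, and the nested Money class is a hypothetical stand-in, not the book's class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class ResetCostDemo {

    // Hypothetical stand-in for a small value class.
    static class Money implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int cents;

        Money(int cents) {
            this.cents = cents;
        }
    }

    // Writes two instances to one stream and returns the total byte count.
    // With a reset between the writes, the class description is sent twice.
    static int bytesWritten(boolean resetBetweenWrites) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new Money(100));
                if (resetBetweenWrites) {
                    out.reset();
                }
                out.writeObject(new Money(200));
            }
            return bytes.size();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + bytesWritten(false) + " bytes");
        System.out.println("with reset:    " + bytesWritten(true) + " bytes");
    }
}
```

Without the reset, the second instance refers back to the class description already in the stream; with it, the description is written out again in full.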
The second problem is that each serialized instance is treated as an individual unit. If we are sending large numbers of instances within a single method invocation, then there is a fairly good chance that we could compress the data by noticing commonalities across the instances being sent.
Serialization is a recursive algorithm. You pass in a single
object, and all the objects that can be reached from that object by following
instance variables are also serialized. To see why this can cause problems,
suppose we have a simple application that uses the Employee class:
import java.io.Serializable;

public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
}
In a later version of the application, someone adds a new piece
of functionality. As part of doing so, they add a single additional field to
Employee:
import java.io.Serializable;

public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
    public Employee manager;
}
What happens as a result of this? On the bright side, the application still works. After everything is recompiled, the entire application, including the remote method invocations, will still work. That's the nice aspect of serialization--we added new fields, and the data format used to send arguments over the wire automatically adapted to handle our changes. We didn't have to do any work at all.
On the other hand, adding a new field redefined the data format
associated with Employee. Because serialVersionUID wasn't defined in the first
version of the class, none of the old data can be read back in anymore. And
there's an even more serious problem: we've just dramatically increased the
bandwidth required by remote method calls.
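Suppose Bob works in the mailroom, and we serialize the object associated with Bob. In the old version of our application, the data for serialization consisted of:
- The class information for Employee
- The instance data for Bob
In the new version, we send:
- The class information for Employee
- The instance data for Bob
- The instance data for Bob's manager
- The instance data for Bob's manager's manager
And so on...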
The new version of the application isn't backwards-compatible because our old data can't be read by the new version of the application. In addition, it's slower and is much more likely to cause network congestion.
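The growth is easy to measure in memory. The following sketch reuses the shape of the second Employee class as a nested class; the constructor, the employeeWithManagers( ) helper, and the generated names are hypothetical additions for the demonstration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class ManagerChainDemo {

    // Same fields as the second version of Employee in the text.
    static class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        public String firstName;
        public String lastName;
        public String socialSecurityNumber;
        public Employee manager;

        Employee(String firstName, Employee manager) {
            this.firstName = firstName;
            this.manager = manager;
        }
    }

    // Builds an employee with (levels - 1) managers stacked above them.
    static Employee employeeWithManagers(int levels) {
        Employee current = null;
        for (int i = levels; i >= 1; i--) {
            current = new Employee("Employee" + i, current);
        }
        return current;
    }

    // Serializes the object to memory and returns the number of bytes used.
    static int serializedSize(Object object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("no manager:      " + serializedSize(employeeWithManagers(1)));
        System.out.println("four levels up:  " + serializedSize(employeeWithManagers(5)));
    }
}
```

Only one object is handed to writeObject( ) in each case; the extra bytes come entirely from following the manager reference.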
To solve the performance problems associated with making a class
Serializable, the serialization mechanism allows you to declare that a class
is Externalizable instead. When ObjectOutputStream's writeObject( ) method is
called, it performs the following sequence of actions:
1. It tests to see whether the object is an instance of Externalizable. If so, it uses externalization to marshall the object.
2. If the object isn't an instance of Externalizable, it tests to see whether the object is an instance of Serializable. If so, it uses serialization to marshall the object.
Externalizable is an interface that consists of two methods:
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
public void writeExternal(ObjectOutput out) throws IOException;
These have roughly the same role that readObject( ) and writeObject( ) have
for serialization. There are, however, some very important differences. The
first, and most obvious, is that readExternal( ) and writeExternal( ) are part
of the Externalizable interface. An object cannot be declared to be
Externalizable without implementing these methods.
However, the major difference lies in how these methods are used. The serialization mechanism always writes out class descriptions of all the serializable superclasses. And it always writes out the information associated with the instance when viewed as an instance of each individual superclass.
Externalization gets rid of some of this. It writes out the
identity of the class (which boils down to the name of the class and the
appropriate serialVersionUID). It also stores the superclass structure and all
the information about the class hierarchy. But instead of visiting each
superclass and using that superclass to store some of the state information,
it simply calls writeExternal( ) on the local class definition. In a nutshell:
it stores all the metadata, but writes out only the local instance
information.
TIP: This is true even if the superclass implements Serializable. The metadata about the class structure will be written to the stream, but the serialization mechanism will not be invoked. This can be useful if, for some reason, you want to avoid using serialization with the superclass. For example, some of the Swing classes, while they claim to implement Serializable, do so incorrectly (and will throw exceptions during the serialization process). (JTextArea is one of the most egregious offenders.) If you really need to use these classes, and you think serialization would be useful, you may want to think about creating a subclass and declaring it to be Externalizable. Instances of your class will be written out and read in using externalization. Because the superclass is never serialized or deserialized, the incorrect code is never invoked, and the exceptions are never thrown.
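The behavior described in the tip can be checked with a small round trip. This is a sketch with hypothetical classes (Labeled and ExternalCounter are not from the book): the superclass is Serializable and carries a label, while the subclass is Externalizable and writes only its own count.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SkipSuperclassDemo {

    // Hypothetical Serializable superclass with state of its own.
    static class Labeled implements Serializable {
        private static final long serialVersionUID = 1L;
        protected String label = "unset";
    }

    // Hypothetical Externalizable subclass: only count is marshalled.
    static class ExternalCounter extends Labeled implements Externalizable {
        private static final long serialVersionUID = 1L;
        private int count;

        // Externalization requires a public no-argument constructor.
        public ExternalCounter() { }

        ExternalCounter(int count, String label) {
            this.count = count;
            this.label = label;
        }

        int getCount() { return count; }

        String getLabel() { return label; }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(count);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            count = in.readInt();
        }
    }

    // Serializes to memory and reads the copy back.
    static ExternalCounter roundTrip(ExternalCounter original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ExternalCounter) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ExternalCounter copy = roundTrip(new ExternalCounter(7, "mailroom"));
        System.out.println(copy.getCount() + " " + copy.getLabel());
    }
}
```

The copy keeps its count of 7, but its label comes back as "unset": the superclass's field was never written, so the value is whatever the no-argument constructor left behind. If superclass state matters, writeExternal( ) and readExternal( ) have to handle it themselves.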
Of course, this efficiency comes at a price. Serializable can frequently be
implemented by doing two things: declaring that a class implements the
Serializable interface and adding a zero-argument constructor to the class.
Furthermore, as an application evolves, the serialization mechanism
automatically adapts. Because the metadata is automatically extracted from the
class definitions, application programmers often don't have to do anything
except recompile the program.
On the other hand, Externalizable isn't particularly easy to implement, isn't
very flexible, and requires you to rewrite your marshalling and demarshalling
code whenever you change your class definitions. However, because it
eliminates almost all the reflective calls used by the serialization mechanism
and gives you complete control over the marshalling and demarshalling
algorithms, it can result in dramatic performance improvements.
To demonstrate this, I have defined the EfficientMoney class. It has the same
fields and functionality as Money but implements Externalizable instead of
Serializable:
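import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class EfficientMoney extends ValueObject implements Externalizable {
    public static final long serialVersionUID = 1;
    protected int _cents;

    // Externalization requires a public no-argument constructor;
    // deserialization calls it before invoking readExternal( ).
    public EfficientMoney( ) {
        this(0);
    }

    public EfficientMoney(Integer cents) {
        this(cents.intValue( ));
    }

    public EfficientMoney(int cents) {
        super(cents + " cents.");
        _cents = cents;
    }

    // Restores the local state and rebuilds the derived string.
    public void readExternal(ObjectInput in) throws IOException,
        ClassNotFoundException {
        _cents = in.readInt( );
        _stringifiedRepresentation = _cents + " cents.";
    }

    // Writes only the local state; superclass state is not marshalled.
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(_cents);
    }
}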
We now want to compare Money with EfficientMoney. We'll do so using the
following application:
import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;

public class MoneyWriter {
    public static void main(String[] args) {
        writeOne( );
        writeMany( );
    }

    private static void writeOne( ) {
        try {
            System.out.println("Writing one instance");
            Money money = new Money(1000);
            writeObject("C:\\temp\\foo", money);
        }
        catch(Exception e) {
            e.printStackTrace( ); // report failures instead of hiding them
        }
    }

    private static void writeMany( ) {
        try {
            System.out.println("Writing many instances");
            ArrayList<Money> listOfMoney = new ArrayList<Money>( );
            for (int i=0; i<10000; i++) {
                Money money = new Money(i*100);
                listOfMoney.add(money);
            }
            writeObject("C:\\temp\\foo2", listOfMoney);
        }
        catch(Exception e) {
            e.printStackTrace( ); // report failures instead of hiding them
        }
    }

    // Times a single writeObject( ) call, including the flush and close.
    private static void writeObject(String filename, Object object) throws
        Exception {
        FileOutputStream fileOutputStream = new FileOutputStream(filename);
        ObjectOutputStream objectOutputStream = new
            ObjectOutputStream(fileOutputStream);
        long startTime = System.currentTimeMillis( );
        objectOutputStream.writeObject(object);
        objectOutputStream.flush( );
        objectOutputStream.close( );
        System.out.println("Time: " + (System.currentTimeMillis( ) - startTime));
    }
}
On my home machine, averaging over 10 trial runs for both Money and
EfficientMoney, I get the results shown in Table 10-1. (We need to average
because the elapsed time can vary, depending on what else the computer is
doing. The size of the file is, of course, constant.)
Table 10-1: Testing Money and EfficientMoney
| Class | Number of instances | File size | Elapsed time |
|---|---|---|---|
| Money | 1 | 266 bytes | 60 milliseconds |
| Money | 10,000 | 309 KB | 995 milliseconds |
| EfficientMoney | 1 | 199 bytes | 50 milliseconds |
| EfficientMoney | 10,000 | 130 KB | 907 milliseconds |
These results are fairly impressive. By simply converting a leaf class in our hierarchy to use externalization, I save 67 bytes and 10 milliseconds when serializing a single instance. In addition, as I pass larger data sets over the wire, I save more and more bandwidth--on average, 18 bytes per instance.
TIP: Which numbers should we pay attention to? The single-instance costs or the 10,000-instance costs? For most applications, the single-instance cost is the most important one. A typical remote method call involves sending three or four arguments (usually of different types) and getting back a single return value. Since RMI clears the serialization mechanism between calls, a typical remote method call looks a lot more like serializing 3 or 4 single instances than serializing 10,000 instances of the same class.
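File sizes like those in Table 10-1 can also be compared in memory, which avoids the temporary files. The sketch below does not use the book's classes: ValueBase, SerialMoney, and ExternMoney are hypothetical stand-ins shaped like ValueObject, Money, and EfficientMoney, so the byte counts it prints will not match the table.

```java
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class SizeComparisonDemo {

    // Stand-in for ValueObject: holds a derived string.
    static class ValueBase implements Serializable {
        private static final long serialVersionUID = 1L;
        protected String text;

        protected ValueBase(String text) {
            this.text = text;
        }
    }

    // Stand-in for Money: default serialization writes text and cents.
    static class SerialMoney extends ValueBase {
        private static final long serialVersionUID = 1L;
        protected int cents;

        SerialMoney(int cents) {
            super(cents + " cents.");
            this.cents = cents;
        }
    }

    // Stand-in for EfficientMoney: only cents is written.
    static class ExternMoney extends ValueBase implements Externalizable {
        private static final long serialVersionUID = 1L;
        protected int cents;

        public ExternMoney() {
            this(0);
        }

        ExternMoney(int cents) {
            super(cents + " cents.");
            this.cents = cents;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(cents);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            cents = in.readInt();
            text = cents + " cents.";
        }
    }

    // Serializes the object to memory and returns the number of bytes used.
    static int serializedSize(Object object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialized size of a list holding count instances of one kind.
    static int sizeOfMany(boolean externalizable, int count) {
        List<Object> list = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            if (externalizable) {
                list.add(new ExternMoney(i * 100));
            } else {
                list.add(new SerialMoney(i * 100));
            }
        }
        return serializedSize(list);
    }

    public static void main(String[] args) {
        System.out.println("one Serializable:     " + serializedSize(new SerialMoney(1000)));
        System.out.println("one Externalizable:   " + serializedSize(new ExternMoney(1000)));
        System.out.println("1000 Serializable:    " + sizeOfMany(false, 1000));
        System.out.println("1000 Externalizable:  " + sizeOfMany(true, 1000));
    }
}
```

In this sketch the saving comes from the superclass string that the Externalizable version never writes; a class whose state is only a few primitives and has no superclass state would see a much smaller difference.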
If I need more efficiency, I can go further and remove ValueObject from the
hierarchy entirely. The ReallyEfficientMoney class directly extends Object and
implements Externalizable:
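import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class ReallyEfficientMoney implements Externalizable {
    public static final long serialVersionUID = 1;
    protected int _cents;
    protected String _stringifiedRepresentation;

    // Externalization requires a public no-argument constructor;
    // deserialization calls it before invoking readExternal( ).
    public ReallyEfficientMoney( ) {
        this(0);
    }

    public ReallyEfficientMoney(Integer cents) {
        this(cents.intValue( ));
    }

    public ReallyEfficientMoney(int cents) {
        _cents = cents;
        _stringifiedRepresentation = _cents + " cents.";
    }

    // Restores the local state and rebuilds the derived string.
    public void readExternal(ObjectInput in) throws IOException,
        ClassNotFoundException {
        _cents = in.readInt( );
        _stringifiedRepresentation = _cents + " cents.";
    }

    // Writes only the int; the string is recomputed on the reading side.
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(_cents);
    }
}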
ReallyEfficientMoney has much better performance than either Money or
EfficientMoney when a single instance is serialized but is almost identical to
EfficientMoney for large data sets. Again, averaging over 10 iterations, I
record the numbers in Table 10-2.
Table 10-2: Testing ReallyEfficientMoney
| Class | Number of instances | File size | Elapsed time |
|---|---|---|---|
| ReallyEfficientMoney | 1 | 74 bytes | 20 milliseconds |
| ReallyEfficientMoney | 10,000 | 127 KB | 927 milliseconds |
Compared to Money, this is quite impressive; I've shaved almost 200 bytes of
bandwidth and saved 40 milliseconds for the typical remote method call. The
downside is that I've had to abandon my object hierarchy completely to do so;
a significant percentage of the savings resulted from not including
ValueObject in the inheritance chain. Removing superclasses makes code harder
to maintain and forces programmers to implement the same method many times
(ReallyEfficientMoney can't use ValueObject's implementation of equals( ) and
hashCode( ) anymore). But it does lead to significant performance improvements.
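As an illustration of that maintenance cost, a class that no longer inherits equals( ) and hashCode( ) has to supply them itself. The sketch below uses a hypothetical StandaloneMoney class and compares on the cents value, which is an assumption; ValueObject's actual comparison logic is not shown in this section.

```java
final class MoneyEqualityDemo {

    // Hypothetical value class with no helpful superclass.
    static final class StandaloneMoney {
        private final int cents;

        StandaloneMoney(int cents) {
            this.cents = cents;
        }

        // Two instances are equal when they hold the same number of cents.
        @Override
        public boolean equals(Object other) {
            if (this == other) {
                return true;
            }
            if (!(other instanceof StandaloneMoney)) {
                return false;
            }
            return cents == ((StandaloneMoney) other).cents;
        }

        // Must agree with equals( ): equal instances share a hash code.
        @Override
        public int hashCode() {
            return Integer.hashCode(cents);
        }
    }

    public static void main(String[] args) {
        StandaloneMoney first = new StandaloneMoney(1000);
        StandaloneMoney second = new StandaloneMoney(1000);
        System.out.println(first.equals(second));
        System.out.println(first.hashCode() == second.hashCode());
    }
}
```

Every class that drops the shared superclass needs its own copy of this kind of code, which is the duplication the text warns about.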
An important point is that you can decide whether to implement Externalizable
or Serializable on a class-by-class basis. Within the same application, some
of your classes can be Serializable, and some can be Externalizable. This
makes it easy to evolve your application in response to actual performance
data and shifting requirements. The following two-part strategy is often quite
nice:
- Make all your classes implement Serializable.
- After that, make the classes that turn out to be performance problems implement Externalizable instead.
This gets you most of the convenience of serialization and lets you use
Externalizable to optimize when appropriate.
Experience has shown that, over time, more and more objects will gradually
come to directly extend Object and implement Externalizable. But that's fine.
It simply means that the code was incrementally improved in response to
performance problems when the application was deployed.
Copyright © 2007 O'Reilly Media, Inc.