I don’t agree with everything Sean McGrath writes in his latest post as I think there are a lot of really smart people who have developed some really smart ways to handle the variable width nature of XML w/o turning to malloc() every time the length of an element or attribute name reaches past any given preset constraints. That said, I can’t help but agree with,
Memory-based caches of “cooked” data structures are your friend.
Absolutely!
For you .NET developers here’s a pre-written recipe that handles all of the dirty work of determining whether to create a new XmlReader or return the in-memory cached version based on the generated ETag for the source file (see Extended Overview below for a deeper understanding of how this works.) To use this recipe you need to do nothing more than create a new XmlServiceOperationManager when your application starts up like so,
XmlServiceOperationManager myXmlServiceOperationManager = new XmlServiceOperationManager(new Dictionary<int, XmlReader>());
and then use the GetXmlReader method of the XmlServiceOperationManager, passing in the Uri (an actual System.Uri object, not the string value of the URI, though I guess it would be easy enough to create an overload that takes the string value of the URI. Another task for another day. ;-)) of the desired XML file to get an XmlReader in return like so,
XmlReader reader = myXmlServiceOperationManager.GetXmlReader(requestUri);
That’s it! Now you can use your “new” XmlReader however you might need, and the next time that file is requested for processing, if it hasn’t changed, you save all of the time it would normally take to read the source file and convert it into an XmlReader, which is fairly significant.
Source code and extended explanation inline below. Enjoy!
Oh, and stay tuned for the next installment of this recipe where we learn how adding,
1 Part memcached
1 Part ETags
and
1 Part GZip encoding
… can turn your lame a$$ performance sucking web application into a lean, mean, kick a$$ performing machine. For a precursor, see Joe Gregorio’s AtomPub presentation slides from this past OSCON. I assure you, it’s worth every second you spend studying this gem of a resource.
Extended Overview
Here’s how it works: If any given XML file on the file system has changed since it was last requested, or if it hasn’t been requested since the application first started, the XmlServiceOperationManager loads the source into an XmlReader and stores that XmlReader in a Hashtable (well, a generic Dictionary, which helps avoid the cost of casting the object to an XmlReader the next time it’s requested, but it’s still a hash table underneath). It then monitors changes to the source file by generating an ETag for each new request for that file. (I use the term ETag because I use this code base in a web application, so it lends itself well to the terminology of a REST-influenced HTTP architecture, but it’s just the combined hash code of various properties of the requested file plus some additional information related to the request itself.)
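To make the “it’s just a combined hash code” point concrete, here is a minimal sketch of the idea rather than the production code. The FileETag class, its Compute method, and the choice of HMAC-SHA256 are assumptions made for illustration; the recipe itself goes through the HashcodeGenerator listed further down.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not part of the original recipe): derives an
// ETag-like token from cheap file metadata, without reading the file body.
public static class FileETag {
    public static string Compute(string key, string path) {
        FileInfo info = new FileInfo(path);
        // The last write time and the length change whenever the file is
        // rewritten or resized; the full path ties the token to one file.
        string material = info.LastWriteTimeUtc.Ticks + "|" + info.Length + "|" + info.FullName;
        using (HMACSHA256 hmac = new HMACSHA256(Encoding.UTF8.GetBytes(key))) {
            byte[] digest = hmac.ComputeHash(Encoding.UTF8.GetBytes(material));
            return Convert.ToBase64String(digest);
        }
    }
}
```

Calling FileETag.Compute twice on an untouched file gives the same token, and a rewrite that changes the length or timestamp gives a different one, which is all the cache needs to know.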
There are two generic Dictionaries: one that uses the hash of the requested URI as the key and stores the generated ETag for that requested file as the value, and one that uses the hash of the requested URI as the key and the pre-cooked XmlReader as the value. With each new request, the ETag for the requested XML source file (which is generated with each new request; a *VERY* cheap operation compared to the creation of a new XmlReader) is cross-checked against the value currently stored under the request URI’s hash, and the following rules determine what to do next:
Rule #1: If there is no key that matches the hashed URI of the current request, a new XmlReader is created, an ETag for the source file is generated, and that ETag is stored as the value in the URI Dictionary. That same key is then used as the key for the XmlReader Dictionary where, of course, the pre-cooked XmlReader is stored as the value. The new XmlReader is then returned to the requesting process.
Rule #2: If there is a key that matches the hashed URI of the current request, the ETag of the current request is compared against that entry’s value. If they are the same, the process returns the related entry from the XmlReader Dictionary. If they are different, a new XmlReader is created and the ETag of the current request is stored as the new value in the URI lookup table (okay, Dictionary, but for this exercise they are one and the same (and for that matter they really are one and the same ;-))). The old XmlReader (which is no longer of interest to us) is then replaced in the XmlReader Dictionary with the new XmlReader, which is returned to the requesting process.
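The two rules boil down to very little code. What follows is a stripped-down sketch under a few assumptions: ETagCache, its Get method, and the BuildCount counter are hypothetical names, and the cached value is a generic T instead of an XmlReader so the sketch runs without any files on disk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the lookup described above.
public class ETagCache<T> {
    // Both tables are keyed by the hash of the requested URI.
    private readonly Dictionary<int, int> _eTags = new Dictionary<int, int>();
    private readonly Dictionary<int, T> _values = new Dictionary<int, T>();

    // Counts how often the expensive build step ran (illustration only).
    public int BuildCount { get; private set; }

    public T Get(Uri uri, int currentETag, Func<Uri, T> build) {
        int key = uri.GetHashCode();
        int storedETag;
        // Rule #2, unchanged case: the key exists and the ETag matches,
        // so the cached value is returned as-is.
        if (_eTags.TryGetValue(key, out storedETag)
            && storedETag == currentETag
            && _values.ContainsKey(key)) {
            return _values[key];
        }
        // Rule #1 (no entry yet) or Rule #2, changed case: rebuild the value
        // and overwrite both tables in place via the indexer.
        T value = build(uri);
        BuildCount++;
        _eTags[key] = currentETag;
        _values[key] = value;
        return value;
    }
}
```

Because both tables share the URI hash as their key, a changed file simply overwrites the existing entries; nothing ever has to be removed first.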
PLEASE NOTE: In addition to the XmlServiceOperationManager below, you will also want the HashcodeGenerator (also below) which is referenced inside of the XmlServiceOperationManager for generating the ETag with each new request.
The Code
// Copyright (c) 2007 by M. David Peterson
// The code contained in this file is licensed under The MIT License
// Please see http://www.opensource.org/licenses/mit-license.php for specific detail.
using System;
using System.Collections;
using System.Collections.Generic;
using Saxon.Api; // Not required by anything in this file; safe to remove if you don't have Saxon installed.
using System.Xml;
using System.IO;
using System.Web;
using Nuxleus.Cryptography;
namespace Nuxleus.Web {
public struct XmlServiceOperationManager {
Dictionary<int, XmlReader> m_xmlReaderDictionary;
Dictionary<int, int> m_xmlSourceETagDictionary;
static String m_hashkey = (String)HttpContext.Current.Application["hashkey"];
// NOTE: For your own purposes you will probably want to change the above m_hashkey value
// to some random value. In fact, in looking at this I should really update this code to look
// for an environment variable value (or generate a random key at app startup) instead
// of using the HttpContext.Current hack, something I only did because this
// operation only occurs once at startup time.
// None-the-less: NOTE-TO-SELF: Kick this lame a$$ stack walking hack to the curb!
static HashAlgorithm m_hashAlgorithm = HashAlgorithm.MD5;
public XmlServiceOperationManager (Dictionary<int, XmlReader> xmlReaderDictionary)
: this(xmlReaderDictionary, new Dictionary<int, int>()) {
}
public XmlServiceOperationManager (Dictionary<int, XmlReader> xmlReaderDictionary, Dictionary<int, int> xmlSourceETagDictionary) {
m_xmlReaderDictionary = xmlReaderDictionary;
m_xmlSourceETagDictionary = xmlSourceETagDictionary;
}
public bool HasXmlSourceChanged (int eTag, Uri uri) {
int uriHashcode = uri.GetHashCode();
if (m_xmlSourceETagDictionary.ContainsKey(uriHashcode)) {
if (m_xmlSourceETagDictionary[uriHashcode] == eTag) {
//Console.WriteLine("Source has not changed. {0}. Count: {1}", eTag, m_xmlSourceETagDictionary.Count);
return false;
} else {
//Console.WriteLine("Source has changed. {0}. Count: {1}", eTag, m_xmlSourceETagDictionary.Count);
return true;
}
} else {
//Console.WriteLine("Source has changed. {0}. Count: {1}", eTag, m_xmlSourceETagDictionary.Count);
return true;
}
}
public void AddXmlReader (Uri uri) {
addXmlReader(GenerateETagKey(uri), uri);
}
private void addXmlReader (int key, Uri uri) {
XmlReader reader = createNewXmlReader(uri.OriginalString);
int uriHashcode = uri.GetHashCode();
m_xmlReaderDictionary[uriHashcode] = reader;
m_xmlSourceETagDictionary[uriHashcode] = key;
}
public XmlReader GetXmlReader (Uri uri) {
return getXmlReader(GenerateETagKey(uri), uri);
}
public XmlReader GetXmlReader (int eTagKey, Uri uri) {
return getXmlReader(eTagKey, uri);
}
private XmlReader getXmlReader (int key, Uri xmlUri) {
int uriHashcode = xmlUri.GetHashCode();
if (m_xmlSourceETagDictionary.ContainsKey(uriHashcode)) {
//Console.WriteLine("Dictionary contains key: {0}", uriHashcode);
if (m_xmlSourceETagDictionary[uriHashcode] == key) {
//Console.WriteLine("{0} matches {1}", m_xmlSourceETagDictionary[uriHashcode], key);
return getXmlReader(uriHashcode, xmlUri, false);
} else {
//Console.WriteLine("{0} does not match {1}", m_xmlSourceETagDictionary[uriHashcode], key);
m_xmlSourceETagDictionary[uriHashcode] = key;
return getXmlReader(uriHashcode, xmlUri, true);
}
} else {
//Console.WriteLine("Dictionary does not contain key: {0}", uriHashcode);
m_xmlSourceETagDictionary[uriHashcode] = key;
return getXmlReader(uriHashcode, xmlUri, true);
}
}
private XmlReader getXmlReader (int key, Uri xmlUri, bool replaceExistingXmlReader) {
if (m_xmlReaderDictionary.ContainsKey(key) && !replaceExistingXmlReader) {
//Console.WriteLine("Dictionary contains key: {0} and is not being replaced.", key);
return m_xmlReaderDictionary[key];
} else {
//if (m_xmlReaderDictionary.ContainsKey(key)) {
// Console.WriteLine("Dictionary contains key: {0} but is being replaced because the source file has changed.", key);
//} else {
// Console.WriteLine("Dictionary does not contain key: {0}, so a new XmlReader is being created.", key);
//}
XmlReader reader = createNewXmlReader(xmlUri.OriginalString);
m_xmlReaderDictionary[key] = reader;
//Console.WriteLine("XmlReaderDictionary currently contains: {0} entries.", m_xmlReaderDictionary.Count);
return reader;
}
}
private static XmlReader createNewXmlReader (string xmlSourceUri) {
return XmlReader.Create(xmlSourceUri);
}
public static int GenerateETagKey (Uri sourceUri, params object[] objectParams) {
FileInfo fileInfo = new FileInfo(sourceUri.LocalPath);
return HashcodeGenerator.GetHMACBase64Hashcode(m_hashkey, m_hashAlgorithm, fileInfo.LastWriteTimeUtc, fileInfo.Length, sourceUri, objectParams);
}
public Dictionary<int, XmlReader> XmlReaderDictionary {
get { return m_xmlReaderDictionary; }
set { m_xmlReaderDictionary = value; }
}
public Dictionary<int, int> XmlSourceETagDictionary {
get { return m_xmlSourceETagDictionary; }
set { m_xmlSourceETagDictionary = value; }
}
}
}
NOTE: You really don’t need all of the methods in this file, but for the purpose of this recipe it’s easier to leave them in, as I am pointing to production code where there’s good reason to have all of these methods, even though we are using only one of the public methods, one of the private methods (called from the public method), and an enum used to specify which cryptographic hash algorithm to use (MD5, SHA1, and SHA256 are supported, though it’s certainly easy enough to add support for a broader range if you so desire). If you have no idea what I just said, just use the whole file. ;-)
// Copyright (c) 2007 by M. David Peterson
// The code contained in this file is licensed under The MIT License
// Please see http://www.opensource.org/licenses/mit-license.php for specific detail.
using System;
using System.Security.Cryptography;
using System.Text;
namespace Nuxleus.Cryptography {
public enum HashAlgorithm { MD5, SHA1, SHA256 };
public struct HashcodeGenerator {
static Encoding encoder = new UTF8Encoding();
static HashAlgorithm _defaultAlgorithm = HashAlgorithm.MD5;
static string _defaultFormat = "x2";
static bool _defaultReturnBase64 = true;
static string _defaultKey = Guid.NewGuid().ToString();
string _key;
HashAlgorithm _algorithm;
bool _returnBase64;
string _format;
object[] _hashArray;
public HashcodeGenerator (params object[] hashArray)
: this(_defaultKey, hashArray) {
}
public HashcodeGenerator (String key, params object[] hashArray)
: this(key, _defaultAlgorithm, hashArray) {
}
public HashcodeGenerator (String key, HashAlgorithm algorithm, params object[] hashArray)
: this(key, algorithm, _defaultFormat, hashArray) {
}
public HashcodeGenerator (String key, HashAlgorithm algorithm, String format, params object[] hashArray)
: this(key, algorithm, format, Guid.NewGuid(), true, hashArray) {
}
public HashcodeGenerator (String key, HashAlgorithm algorithm, String format, Guid guid, params object[] hashArray)
: this(key, algorithm, format, guid, _defaultReturnBase64, hashArray) {
}
public HashcodeGenerator (String key, HashAlgorithm algorithm, String format, Guid guid, bool returnBase64, params object[] hashArray) {
_returnBase64 = returnBase64;
_hashArray = hashArray;
_algorithm = algorithm;
_format = format;
_key = FormatKey(key, _format);
}
public bool ReturnBase64 { get { return _returnBase64; } set { this._returnBase64 = value; } }
public object[] HashArray { get { return _hashArray; } set { this._hashArray = value; } }
public HashAlgorithm HashAlgorithm { get { return _algorithm; } set { this._algorithm = value; } }
public String Format { get { return _format; } set { this._format = value; } }
public String Key { get { return _key; } set { this._key = FormatKey(value, _format); } }
public static String FormatKey (string key, string format) {
StringBuilder builder = new StringBuilder();
byte[] bytes = encoder.GetBytes(key);
for (int i = 0; i < bytes.Length; i++) {
builder.Append(bytes[i].ToString(format));
}
return builder.ToString();
}
public String GetHMACHashString () {
return getHMACHashcode(_key, _algorithm, _returnBase64, _hashArray);
}
public String GetHMACHashString (params object[] hashArray) {
return getHMACHashcode(_key, _algorithm, _returnBase64, _hashArray, hashArray);
}
public String GetHMACHashString (string key, params object[] hashArray) {
return getHMACHashcode(key, _algorithm, _returnBase64, _hashArray, hashArray);
}
public String GetHMACHashString (HashAlgorithm algorithm, params object[] hashArray) {
return getHMACHashcode(_key, algorithm, _returnBase64, _hashArray, hashArray);
}
public static int GetHMACBase64Hashcode(string key, HashAlgorithm algorithm, params object[] hashArray) {
return getHMACHashcode(key, algorithm, true, hashArray).GetHashCode();
}
public static String GetHMACHashBase64String (string key, HashAlgorithm algorithm, params object[] hashArray) {
return getHMACHashcode(key, algorithm, true, hashArray);
}
public static String GetHMACHashString (string key, HashAlgorithm algorithm, bool useBase64, params object[] hashArray) {
return getHMACHashcode(key, algorithm, useBase64, hashArray);
}
public static int GetHMACHashcode (string key, HashAlgorithm algorithm, bool useBase64, params object[] hashArray) {
return getHMACHashcode(key, algorithm, useBase64, hashArray).GetHashCode();
}
private static String getHMACHashcode (string key, HashAlgorithm algorithm, bool useBase64, params object[] hashArray) {
StringBuilder builder = new StringBuilder();
HMAC hmacProvider;
switch (algorithm) {
case HashAlgorithm.SHA1:
hmacProvider = new HMACSHA1(encoder.GetBytes(key));
break;
case HashAlgorithm.SHA256:
hmacProvider = new HMACSHA256(encoder.GetBytes(key));
break;
case HashAlgorithm.MD5:
default:
hmacProvider = new HMACMD5(encoder.GetBytes(key));
break;
}
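foreach (object obj in hashArray) {
// Flatten nested arrays (e.g. a forwarded params array) so their elements are hashed
// rather than the literal type name "System.Object[]".
object[] nested = obj as object[];
if (nested != null) {
foreach (object inner in nested) {
builder.Append(inner);
}
} else {
builder.Append(obj);
}
}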
byte[] computeHash = hmacProvider.ComputeHash(encoder.GetBytes(builder.ToString()));
if (useBase64) {
return Convert.ToBase64String(computeHash);
} else {
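// byte[].ToString() only yields the type name "System.Byte[]", so format the bytes as lowercase hex instead.
return BitConverter.ToString(computeHash).Replace("-", "").ToLowerInvariant();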
}
}
}
}
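One detail worth keeping in mind if you want the hash as hex rather than Base64: calling ToString() on a byte[] yields the type name, not the bytes, so each byte has to be formatted individually, much as FormatKey does with the "x2" format. A small sketch, where HmacText and its Compute method are hypothetical names and HMAC-SHA256 is picked only as an example:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not part of the original file): renders an
// HMAC-SHA256 digest either as Base64 or as lowercase hex.
public static class HmacText {
    public static string Compute(string key, string message, bool useBase64) {
        using (HMACSHA256 hmac = new HMACSHA256(Encoding.UTF8.GetBytes(key))) {
            byte[] digest = hmac.ComputeHash(Encoding.UTF8.GetBytes(message));
            if (useBase64) {
                return Convert.ToBase64String(digest);
            }
            // Two hex characters per byte.
            StringBuilder hex = new StringBuilder(digest.Length * 2);
            foreach (byte b in digest) {
                hex.Append(b.ToString("x2"));
            }
            return hex.ToString();
        }
    }
}
```

Both renderings carry the same 32 bytes; the hex form is 64 characters long and the Base64 form is shorter, which is one reason to prefer Base64 for header values.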


This code looks very interesting. I have used XML functions in .NET for years, and I know the value of the ETag header for caching, but I have never seen anything like this.
Something seems wrong though. I would love to be able to test the code, but I need a little help getting the complete code from SVN as well as setting up the Saxon and other external libraries.
Who do I have to knock off?
@Ric,
Actually, you don't need Saxon or any other external libraries. That was in there by force of habit. I'll take it out on the SVN copy.
What problems are you having checking out the repository? There's some external dependencies, but none of them should require authentication. Let me know what you're running up against and I'll look into it deeper.
@Ric,
>> Something seems wrong though.
I think you might be right. I just read through the code above and it seems that I described the process incorrectly. I'll update the post to represent what's actually taking place which is that the hash of the URI is used as the key for both tables. Using the hash of the URI as the key ensures I can save the cost of removing the key which really isn't all that much, but when the entire purpose of the code is to *save* resources then it makes sense to save as much as you possibly can, everywhere you can.
With this in mind, if you see somewhere I can make the code more efficient, please don't hesitate to let me know. Every tick counts! :D