In recent articles and presentations I have been postulating that a concept called “next generation Grid Enabled SOA”, a.k.a. “SOA Grid” and “Not your MOM’s Bus”, combines conventional SOA infrastructure technologies such as BPEL and ESB with middle tier data grid technology to provide a new level of predictable scalability and high availability for SOA based applications.

I often get asked - “How much better is it? What’s the ROI?”

In our case, a key component of the implementation of the SOA grid is the Tangosol Coherence product (acquired early this year, now Oracle Coherence). A common use for this technology is as a horizontally scalable fault-tolerant shared memory cache, which can act as a staging area for application state.

Recently at Oracle Open World I had the opportunity to meet one of the Coherence customers, a VP of Architecture at a major investment banking firm, who talked about several use cases during a panel discussion in front of a live audience. The most fascinating use case had to do with a dramatic improvement in an overnight credit risk analysis calculation, which went from 17 hours down to 20 minutes.

The situation is one of regulatory compliance. The bank must keep enough cash on hand to cover its current credit risk, and every night they need to run a risk calculation to “prove” what this credit risk is. The challenge with the previous risk calculation application was that even though it utilized compute grid technology, it still took 17 hours to run a risk calculation across a volume of 40,000 trades per day. They needed to allow for growth of up to 150,000 trades per day. And by the way, the calculation needs to run at 2:00am every night, so they don’t have 17 hours anyhow.

They determined that the problem was that even though they were using compute grid technology, they were disk I/O bound. They rewrote the app to use Coherence, and now the same calculation takes 20 minutes to run!

Why the dramatic improvement? They weren’t beating the crap out of the disk resources. They were using the Coherence data grid to stage their data, which provides near in-memory access speeds without sacrificing reliability.

Doing some more digging I discovered some more interesting factoids about the new solution -

- Not only did they decrease the amount of time, but they also decreased the number of machines from a 200 node grid down to a 60 node grid.
- They are able to handle a 3X increase in exotic instruments trading volume, generating up to $1M profit per deal.
- They increased their ability to run simulations from 4000 to 100,000, which is a 25X improvement.

BTW, to be fair: in order for the risk calculation to run in 20 minutes there is a prefetch of 60 gigs of data from a variety of sources (databases, flat files, etc.) into the Coherence grid, which is an operation that takes about 6 hours. In this case it wasn’t an Oracle db and I’m not making any judgements on whether 6 hours is good or bad. Presumably the prefetch into Coherence can happen once and the 20 minute calculations can occur as many times as needed.
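To make the "prefetch once, calculate many times" idea concrete, here is a minimal sketch in Python. It is not the bank's code and it does not use the Coherence API: a plain dict stands in for the data grid, and the loader functions, the trade fields, and the toy `total_exposure` formula are all hypothetical names I made up for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

Trade = Dict[str, float]
Loader = Callable[[], Iterable[Tuple[str, Trade]]]


def prefetch(sources: Iterable[Loader]) -> Dict[str, Trade]:
    """Pull every trade from every source into one in-memory map.

    This is the slow, one-time staging step. The dict is only a stand-in
    for a distributed data grid (assumption, not the Coherence API).
    """
    grid: Dict[str, Trade] = {}
    for load in sources:
        for trade_id, trade in load():
            grid[trade_id] = trade
    return grid


def total_exposure(grid: Dict[str, Trade], shock: float) -> float:
    """Toy risk number: sum of notional * risk_weight * shock.

    Reads only from memory, so it can be re-run cheaply per scenario.
    """
    return sum(t["notional"] * t["risk_weight"] * shock for t in grid.values())


# Hypothetical sources standing in for "databases, flat files, etc."
def load_from_database() -> Iterable[Tuple[str, Trade]]:
    yield "T1", {"notional": 1_000_000.0, "risk_weight": 0.5}
    yield "T2", {"notional": 500_000.0, "risk_weight": 0.25}


def load_from_flat_file() -> Iterable[Tuple[str, Trade]]:
    yield "T3", {"notional": 250_000.0, "risk_weight": 0.75}


# Stage the data once...
grid = prefetch([load_from_database, load_from_flat_file])

# ...then run as many scenarios as needed without touching the sources again.
results = {shock: total_exposure(grid, shock) for shock in (1.0, 1.5, 2.0)}

for shock, exposure in results.items():
    print(f"shock={shock}: exposure={exposure:,.0f}")
```

The point of the split is that the expensive I/O lives entirely in `prefetch`, so the cost of each additional scenario is just the in-memory pass.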

Even so…whether you look at it as 20 minutes or 6 hours 20 minutes, that’s a dramatic improvement over 17 hours. And regardless of how you count the overall time, it is still using 60 machines instead of 200.
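For anyone who wants to check the math, here is the back-of-the-envelope arithmetic as a small Python sketch. The inputs are just the figures quoted above; the variable names are mine, and the "end to end" number assumes the 6 hour prefetch is paid in full on every run, which is the pessimistic reading.

```python
# Figures quoted in the post, converted to minutes where needed.
OLD_RUNTIME_MIN = 17 * 60      # previous overnight run
NEW_CALC_MIN = 20              # calculation against the data grid
PREFETCH_MIN = 6 * 60          # one-time staging into the grid

OLD_NODES, NEW_NODES = 200, 60
OLD_SIMULATIONS, NEW_SIMULATIONS = 4_000, 100_000
CURRENT_TRADES, TARGET_TRADES = 40_000, 150_000

# Speedup of the calculation alone.
calc_speedup = OLD_RUNTIME_MIN / NEW_CALC_MIN

# Speedup if every run also pays for the prefetch (pessimistic assumption).
end_to_end_speedup = OLD_RUNTIME_MIN / (PREFETCH_MIN + NEW_CALC_MIN)

# Fraction of machines no longer needed.
node_reduction = (OLD_NODES - NEW_NODES) / OLD_NODES

# Growth in simulations and the trade-volume headroom they asked for.
simulation_gain = NEW_SIMULATIONS / OLD_SIMULATIONS
trade_growth = TARGET_TRADES / CURRENT_TRADES

print(f"calculation only:   {calc_speedup:.0f}x faster")
print(f"including prefetch: {end_to_end_speedup:.1f}x faster")
print(f"machines removed:   {node_reduction:.0%}")
print(f"simulations:        {simulation_gain:.0f}x")
print(f"trade growth goal:  {trade_growth:.2f}x")
```

So the calculation by itself is roughly 51 times faster, and even charging the full prefetch to every run it still comes out at better than 2.5 times faster on 70% fewer machines.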

- Dave