A question that is raised quite often in the context of “SOA” is how to deal with data. Specifically, people are increasingly interested in (and concerned about) appropriate caching strategies. What I see described in that context is often motivated by a fundamental misunderstanding: the SO tenet that speaks about “autonomy” is perceived to mean “autonomous computing” while it really means “avoid coupling”. The former is an architecture prescription, the latter is just a statement about the quality of a network edge.

I will admit that the use of “autonomy” confused me for a while as well. Specifically, in my 5/2004 “Data Services” post, I showed principles of autonomous computing and how there is a benefit to loose coupling at the network edge when combined with autonomous computing principles, but at the time I did not yet fully understand how orthogonal those two things really are. I guess that one of the aspects of blogging is that you’ve got to be ready to learn and evolve your knowledge in front of everyone. Mind that I stand by the architectural patterns and the notion of data services that I explained in that post, except for the notion that the “Autonomy” SO tenet speaks about autonomous computing.

The picture here illustrates the difference. By autonomous computing principles, the left shape of the service is “correct”. The service is fully autonomous and protects its state. That’s a model that strictly follows the Fiefdoms/Emissaries idea that Pat Helland formulated a few years back. A great many applications look like the shape on the right. There are a number of services sticking up that share a common backend store. That’s not following autonomous computing principles. However, if you look across the top, you’ll see that the endpoints (different colors, different contracts) look precisely alike from the outside for both pillars. That’s the split: Autonomous computing talks very much about how things are supposed to look behind your service boundary (which is not and should not be anyone’s business but yours) and service orientation really talks about you being able to hide any such architectural decision behind a loosely coupled network edge. The two ideas compose well, but they are not the same, at all.
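To make that split a bit more tangible, here is a minimal Python sketch. All names in it (`OrderContract`, `AutonomousOrderService`, `SharedStoreOrderService`, `round_trip`) are hypothetical and made up for illustration; plain dictionaries stand in for the data stores. The point is only that a consumer coded against the contract cannot tell the two pillars apart.

```python
from typing import Dict, Protocol


class OrderContract(Protocol):
    """Hypothetical contract exposed at the service edge."""

    def place_order(self, order_id: str, item: str) -> None: ...

    def get_order(self, order_id: str) -> str: ...


class AutonomousOrderService:
    """Left pillar: the service owns a private store nobody else can reach."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def place_order(self, order_id: str, item: str) -> None:
        self._store[order_id] = item

    def get_order(self, order_id: str) -> str:
        return self._store[order_id]


class SharedStoreOrderService:
    """Right pillar: the service writes into a backend shared with other services."""

    def __init__(self, shared_backend: Dict[str, str]) -> None:
        self._backend = shared_backend

    def place_order(self, order_id: str, item: str) -> None:
        # Prefix the key so several services can coexist in the one store.
        self._backend[f"order:{order_id}"] = item

    def get_order(self, order_id: str) -> str:
        return self._backend[f"order:{order_id}"]


def round_trip(service: OrderContract) -> str:
    """A consumer that only knows the contract, not what sits behind it."""
    service.place_order("42", "book")
    return service.get_order("42")
```

Calling `round_trip` with either implementation gives the same answer; whether the state lives in a private or a shared store is invisible at the edge, which is all the SO tenet asks for.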

Which leads me to the greater story: In terms of software architecture, “SOA” introduces very little that is new. All distributed systems patterns that have evolved since the 1960s stay true. I haven’t really seen any revolutionary new architecture pattern come out since we started speaking about Web Services. Brokers, Intermediaries, Federators, Pub/Sub, Queuing, STP, Conversations: all of that has been known for a long time. We’ve just commonly discovered that loose coupling is a quality that’s worth something.

In all reality, the “SOA” hype is about the notion of aligning business functions with software in order to streamline integration. SOA doesn’t talk about software architecture; in other words: SOA does not talk about how to shape the mechanics of a system. From a software architecture perspective, any notion of an “SOA revolution” is complete hogwash. From a Business/IT convergence perspective – to drive analysis and high-level design – there’s meat in the whole story, but I see the SOA term being used mostly for describing technology pieces. “We are building a SOA” really means “we are building a distributed system and we’re trying to make all parts loosely coupled to the best of our abilities”. Whether that distributed system is indeed aligned with the business functions is a wholly different story.

However, I digress. Coming back to the data management issue, it’s clear that a stringent autonomous computing design introduces quite a few challenges in terms of data management. Data consolidation across separate stores for the purposes of reporting requires quite a bit of special consideration and so does caching of data. When the data for a system is dispersed across a variety of stores and comes together only through service channels, without the ability to freely query across the data stores, and those services are potentially “far” away in terms of bandwidth and latency, data management becomes considerably more difficult than in a monolithic app with a single store. However, this added complexity is a function of choosing to make the service architecture follow autonomous computing principles, not of how you shape the service edge or whether you use service orientation principles to implement it.

To be clear: I continue to believe that aligning data storage with services is a good thing. It is an additional strategy for looser coupling between services and allows the sort of data patterns and flexibility that I have explained in the post I linked to above. However, “your mileage may vary” is as true here as anywhere. For some scenarios, tightly coupling services in the backyard might be the right thing to do. That’s especially true for “service-enabling” existing applications. All these architectural considerations are, however, strictly orthogonal to the tenets of SO.

Generally, my advice with respect to data management in distributed systems is to handle all data explicitly as part of the application code and not hide data management in some obscure interception layer. There are a lot of approaches that attempt to hide complex caching scenarios away from application programmers by introducing caching magic on the call/message path. That is a reasonable thing to do if the goal is to optimize message traffic and the granularity that this gives you is acceptable. I had a scenario where that was just the right fit in one of my last newtelligence projects. Be that as it may, proper data management, caching included, is somewhat like the holy grail of distributed computing and unless people know what they’re doing, it’s dangerous to try to hide it away.

That said, I believe that it is worth a thought to make caching a first-class consideration in any distributed system where data flows across boundaries. If it’s known at the data source that a particular record or set of records won’t be updated until 1200h tomorrow (many banks, for instance, still do accounting batch runs just once or twice daily) then it is helpful to flow that information alongside the data to allow any receiver to determine the caching strategy for the particular data item(s). Likewise, if it’s known that a record or record set is unlikely to change, or even guaranteed not to change, within an hour/day/week/month, or if some staleness of that record is typically acceptable, the caching metadata can indicate an absolute or relative time instant at which the data has to be considered stale and possibly a time instant at which it absolutely expires and must be cleaned from any cache. Adding caching hints to each record or set of records allows clients to make much better informed decisions about how to deal with that data. This is ultimately about loose coupling and giving every participant of a distributed system enough information to make their own decisions about how to deal with things.
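As a rough illustration of such caching hints, here is a minimal Python sketch. The names `CacheHints`, `Envelope`, `stale_after`, `expires_at` and `classify` are assumptions of mine, not part of any standard or product; the sketch only shows how a receiver could decide for itself once the two time instants travel with the data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class CacheHints:
    """Hypothetical caching metadata that flows alongside a record."""

    stale_after: Optional[datetime] = None  # data should be refreshed from here on
    expires_at: Optional[datetime] = None   # data must be removed from any cache


@dataclass(frozen=True)
class Envelope:
    """A record (or record set) together with its caching hints."""

    payload: Any
    hints: CacheHints


def classify(envelope: Envelope, now: datetime) -> str:
    """Return 'fresh', 'stale' or 'expired' for the given instant.

    Assumption: a missing hint imposes no constraint, so an envelope
    without any hints is reported as 'fresh' and the receiver falls
    back on its own policy.
    """
    hints = envelope.hints
    if hints.expires_at is not None and now >= hints.expires_at:
        return "expired"
    if hints.stale_after is not None and now >= hints.stale_after:
        return "stale"
    return "fresh"


# Example: the source knows the next batch run is at 1200h tomorrow.
fetched = datetime(2006, 6, 1, 9, 0, tzinfo=timezone.utc)
next_batch = datetime(2006, 6, 2, 12, 0, tzinfo=timezone.utc)
balance = Envelope(
    payload={"account": "A-1", "balance": 100},
    hints=CacheHints(stale_after=next_batch,
                     expires_at=next_batch + timedelta(days=1)),
)
```

A receiver can serve a "fresh" record straight from its cache, serve a "stale" one while it refreshes in the background (if staleness is acceptable for that use), and must drop an "expired" one. Which of these it does remains its own decision.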

Which leaves the question about where to cache stuff. The instant “obvious best idea” is to hold stuff in memory. However, if the queries into the cached data become more complex than “select all” or reasonably simple hashtable lookups, it’s not too unlikely that, if you run on Windows, a local SQL Server (Express) instance holding the cache data will do as well as or better than a custom query “engine” in terms of performance (increasingly so with data volume), even if that engine serves data out of memory. That’s especially true for caching frameworks that can be written within the time/budget of a typical enterprise project. Besides, long-lived cached data whose expiration window exceeds the lifetime of the application instance needs a home, too. One of the bad caching scenarios is that the network gets saturated at 8 in the morning when everybody starts up their smart client apps and tries to suck the central database dry at once. That’s what in-memory database approaches cause.
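Here is a minimal sketch of such a local, queryable cache. To keep it self-contained it uses Python's built-in `sqlite3` module as a stand-in for a local SQL Server Express instance, which is a substitution on my part; the table `cached_customer` and the helper functions are likewise made up for illustration.

```python
import sqlite3
import time


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the local cache; pass a file path so it survives app restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cached_customer ("
        " id INTEGER PRIMARY KEY, name TEXT, city TEXT, expires_at REAL)"
    )
    return conn


def put(conn, customer_id, name, city, ttl_seconds, now=None):
    """Store or refresh a record together with its expiration instant."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO cached_customer VALUES (?, ?, ?, ?)",
        (customer_id, name, city, now + ttl_seconds),
    )
    conn.commit()


def find_by_city(conn, city, now=None):
    """A query beyond a plain key lookup, left to the SQL engine."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT id, name FROM cached_customer"
        " WHERE city = ? AND expires_at > ? ORDER BY name",
        (city, now),
    )
    return rows.fetchall()


def purge_expired(conn, now=None):
    """Remove records past their expiration; returns the number removed."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM cached_customer WHERE expires_at <= ?", (now,)
    )
    conn.commit()
    return cur.rowcount
```

With a file path instead of `":memory:"`, records that have not yet expired are still there when the client starts up the next morning, so only the expired part has to be fetched again from the central system.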

Thursday, June 01, 2006 11:01:52 PM UTC
I'm not sure I agree. Obviously, the services' implementation is of no interest to the service consumer, and thus you are correct in pointing out that as the service interfaces stay the same, there is no difference between the two scenarios in your diagram.

The problem is that the fact that two service implementations share a single data store is extremely likely to leak into your service interface layer. For example, you may - without becoming really aware of it - rely on having some data you created with service "Blue" ready for retrieval through service "Green". This assumption will only hold true as long as you have the hidden link between them through their shared DB.

For this reason, I'm a strong advocate of data ownership, at least for high-level services. The service interface should provide a logical, coherent set of functions to provide the service with all the information it needs to fulfil its business purpose - no hidden link should be encouraged or tolerated.
Friday, June 02, 2006 5:11:09 PM UTC
Hi Clemmens, I have been working full time for a few years on several new internal applications based on SOA/web services and agree with all the comments above.

Two applications sharing the same data is a problem, but replication/import/export procedures are also very problematic. It's very much horses for courses. And I can see extending a service with a different role as a strong candidate for this.
It also depends on how fine-grained your services are: previously I found the cost of web service calls to be so high that I made my services large. The performance gains of WCF/TCP on internal networks mean I'm now experimenting with smaller services, where a DB per service gets a bit expensive.

With regard to caching, I have neglected caching architectures until late in the project before and paid the price. Caching architecture is really part of the business requirements: the client will tell you "we need this information daily, this information once a minute and this information immediately". Leaving it till the end can seem more efficient, as you may otherwise have unnecessarily optimised some code, but trying to fix problems at the end can be difficult and far more costly.

It is also worth noting that a lot of data can be stored whole in memory, especially if the required caching period is long, which is memory efficient and CPU efficient and leaves the DB to deal with just transaction data. This is a whole topic in itself; I find it also means a lot fewer SQL queries (where a lot of bugs still come from), with easy to maintain/modify code to handle it. Obviously in this case you create a requirement that all requests must go through the service managing this list.

Also, with caching architectures I find it useful to have a "bypass cache" property on some messages for a lot of services. This is especially useful for maintenance programs, testing and debugging.


Regards,

Ben

Monday, June 05, 2006 2:37:49 AM UTC
FYI, please allow me to point your extensive readership to some relevant content:

"From a Business/IT convergence perspective – to drive analysis and high-level design – there’s meat in the whole story..." - Yes! Take a big bite of that story here: http://www.architecturejournal.net/2006/issue7/F7_Modeling1/default.aspx

And for a great discussion on caching, Roger Wolter's blog post "Whither IMDB?" makes similar comments: http://blogs.msdn.com/rogerwolterblog/

Thanks as always for your insights.

- Arvindra