Sunday, December 30, 2007

Data Partitioning

Those who have been reading my newsletters and white papers know I believe that complexity in IT systems and business processes is best thought of as the amount of disorder in the system in question.

The best way to control this disorder is to partition the system into a small number of subsets. Mathematically, partitioning can be shown to have a huge impact on the amount of disorder in a system, and therefore, its complexity. For those interested in more information on this topic, see my briefing papers on SIP (Simple Iterative Partitions) at the ObjectWatch web site.

Recently, one of my readers sent me an email asking about my philosophy on data sharing. He asked:
I understand the general philosophy [of partitioning] but am not sure how that would work at the implementation level for data(say database). Does your solution mean that HR database tables can never participate in queries with say planning module database tables?
In general, data sharing looks like this: two applications, A and B, both reading and writing the same underlying database.

It should be pretty obvious that this arrangement severely breaks down the boundaries that logically separate applications. If application A decides to change how it stores data, application B will be broken.
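A small sketch makes the coupling concrete. All of the names here (the table, the applications, the fields) are hypothetical, invented for illustration; the point is only that B's code silently depends on A's storage format.

```python
# Hypothetical shared database: application A owns and writes this table,
# but application B queries it directly.
shared_table = {"emp_1": {"name": "Ada Lovelace"}}

def app_b_report(table):
    # Application B assumes A stores a single "name" column.
    return [row["name"] for row in table.values()]

print(app_b_report(shared_table))  # works, for now

# Application A changes its private storage decision: the name is
# now split into two fields. A's own code is updated accordingly...
shared_table["emp_1"] = {"first": "Ada", "last": "Lovelace"}

# ...and application B, which knew nothing of the change, is broken.
try:
    app_b_report(shared_table)
except KeyError as missing_column:
    print(f"Application B broken: missing column {missing_column}")
```

Nothing in A's code advertises that B depends on the "name" column; the dependency exists only in B, which is exactly why this kind of change is so hard to manage.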

Anytime we destabilize the boundaries between two applications, we add tremendously to the overall disorder of the system. Just a little bit of destabilization adds a huge amount to the disorder.

Some organizations attempt to deal with this by creating a common data access layer, but this doesn't significantly change the problem. If the implementation of A changes so that it no longer uses the common data access layer, application B is broken.

In order to ensure strong application boundaries (and therefore minimal system complexity), we take all possible steps to minimize dependencies between our applications. The best way to accomplish this is to insist that all interactions between applications go through intermediaries.

If application A wants to ask application B for some information, it makes the request through an intermediary that wraps A. That intermediary makes the request to an intermediary that wraps B. The intermediary on the requesting side is called an "envoy" and the intermediary on the receiving side is called a "guard".

How the envoy and guard send and receive requests is not the problem of either application, but the most common way of doing so is through the use of messages that are defined by the general family of standards known as web services.
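Here is a minimal sketch of the envoy/guard idea. The class names and the JSON message format are assumptions made for illustration; in practice the messages would travel over a network via web services, but the structural point is the same: A and B agree only on the message format, never on each other's internals.

```python
import json

class ApplicationB:
    """The receiving application. Its internals are private to it."""
    def lookup_salary(self, employee_id):
        return {"emp_1": 50000}.get(employee_id)

class Guard:
    """Wraps B: receives messages and translates them into B's internal calls."""
    def __init__(self, app):
        self.app = app
    def receive(self, message):
        request = json.loads(message)  # the shared contract is the message, not B's API
        salary = self.app.lookup_salary(request["employee_id"])
        return json.dumps({"salary": salary})

class Envoy:
    """Wraps A's outgoing requests: builds messages, never touches B directly."""
    def __init__(self, guard):
        self.guard = guard  # in a real system, a network endpoint, not an object
    def request_salary(self, employee_id):
        reply = self.guard.receive(json.dumps({"employee_id": employee_id}))
        return json.loads(reply)["salary"]

envoy = Envoy(Guard(ApplicationB()))
print(envoy.request_salary("emp_1"))  # 50000
```

If B renames `lookup_salary` or restructures its data, only the Guard changes; the Envoy, and application A behind it, are untouched.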

Notice how this protects the two applications from implementation changes on the other side. If the implementation of A changes, it is likely that A's envoy might also need to change. But nothing on B's side needs to change.

In this architecture, we have strong boundaries between the two applications, and these boundaries serve to control the overall disorder of the system.

But still, sometimes application A needs access to application B's data. What do we do then?

There are two possibilities. The first is that application A can ask application B for the data it needs (using, of course, intermediaries). The second is that a third application can be created that manages the shared knowledge of the organization. Let's call this application the KnowledgeManager. When any application makes changes to data that is considered part of this shared knowledge, that application is responsible for letting the KnowledgeManager know about those changes (using intermediaries, of course).
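The KnowledgeManager can be sketched as follows. The name comes from the text above; the publish/query API is an assumption for illustration, and in a real system every call shown here would of course go through an envoy and a guard.

```python
class KnowledgeManager:
    """Hypothetical third application that owns the organization's shared knowledge."""
    def __init__(self):
        self._shared = {}

    def publish(self, key, value):
        # Called (through intermediaries) by whichever application changed the data.
        self._shared[key] = value

    def query(self, key):
        # Called (through intermediaries) by any application that needs the data.
        return self._shared.get(key)

km = KnowledgeManager()
km.publish("headcount", 120)   # the HR application reports a change
print(km.query("headcount"))   # the planning application reads it: 120
```

Neither the HR application nor the planning application ever sees the other's database; each depends only on the KnowledgeManager's message contract.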

Some may argue that this architecture is inefficient. Why not just go directly to the database for what you want? The answer has to do with our ultimate goals. I believe that our most important goal is controlling IT complexity. Controlling IT complexity is far more important than a few milliseconds of performance.

Complexity is expensive. Disk access is cheap.
