Friday, July 5, 2013

The 3000 Year Old IT Problem


It was first described 3000 years ago by Sun Tzu in his timeless book, The Art of War.

We can form a single united body, while the enemy must split up into fractions. Hence there will be a whole pitted against separate parts of a whole, which means that we shall be many to the enemy's few. (1)

This is the first description of a problem solving technique that would be named 600 years later by Julius Caesar as divide et impera. Or, as we know it, divide and conquer.

The Art of War is frequently named as one of the top ten must reads in business (2). I have long been fascinated by The Art of War and especially the divide and conquer strategy. Back in the days when I was a researcher at the National Cancer Institute, divide and conquer was a common research strategy for testing drugs. Divide and conquer is used extensively in business, economics, law, politics, and the social sciences, to name a few.

Oddly enough, the only field I can think of in which divide and conquer has not been used successfully is IT. Whereas virtually every other field has been able to solve large complex problems using a divide and conquer strategy, IT is alone in its failure. Try to divide and conquer a large IT system, and you get an interconnected web of small systems that have so many interdependencies that it is practically impossible to coordinate their implementation.

This is the reason so many large IT systems are swamped by IT complexity (3). The one problem solving strategy that universally works against complexity seems to be inexplicably inept when it comes to IT.

How can this be? What is so special about IT that makes it unable to apply this strategy that has found such widespread applicability?

There are only two possible answers. The first is that IT is completely different from every other field of human endeavor. The second is that IT doesn't understand how divide and conquer works. I think the second explanation is more likely.

To be fair, it is not all IT's fault. Some of the blame must be shared by Julius Caesar. He is the one, remember, who came up with the misleading name divide and conquer (or, as he said it, divide et impera.) Unfortunately, IT has taken this name all too literally.

What is the problem with the name divide and conquer? The name seems to imply that we conquer big problems by breaking them down into smaller problems. In fact, this is not really what we do. What we actually do is learn to recognize the natural divisions that exist within the larger problem. So we aren't really dividing as much as we are observing.

Let's take a simple example of divide and conquer: delivering mail.

The U.S. Postal Service delivers around 600 million letters each day. Any address can send a letter to any other address. The number of possible paths for any given letter is astronomical. So how has the Postal Service simplified this problem?

They have observed the natural boundaries and population densities in different areas of the country and divided the entire country up into about 43,000 geographic chunks of roughly equal postal load. Then they have assigned a unique zip code to each chunk.

In the following picture, you can see a zip code map for part of New York City.
Partial Zip Code Map for New York City

The areas that are assigned to zip codes are not equal sized. They vary depending on the population density. The area that includes Rockefeller Center (10020) is very dense, so the area is small. The area where I grew up (10011) has medium density and the area assigned to the zip code is consequently average sized.

If we blow up my home zip code we can see some other features of the system. Here is 10011 enlarged:
Zip Code 10011
You can see that the zip code boundary makes some interesting zigs and zags. For example, the leftmost boundary winds around the piers of the Hudson River. Along the bottom, the boundary takes a sharp turn to the south about halfway across. This is to follow Greenwich Avenue. On the right side, the boundary follows Fifth Avenue for quite a while until we hit the relatively chaotic Flatiron district at 20th Street.

So zip codes are not randomly assigned. They take into account population densities, street layouts, and natural boundaries. The main point here is that a lot of observation takes place before the first zip code is assigned.

Suppose we created the zip code map by simply overlaying a regular grid on top of New York City. We would end up with zip codes in the middle of the Hudson River, zip codes darting back and forth across Fifth Avenue, and some zip codes with huge population densities while others would consist of only a few bored pigeons.

You can see why the so-called divide and conquer algorithm is probably better named observe and conquer.

The general rule of thumb with observe and conquer is that your ability to solve a large complex problem is highly dependent on your ability to observe the natural boundaries that define the sub problems.

Let's consider one other example, this time from warfare.

In about 50 BCE, Julius Caesar and 60,000 troops completed the defeat of Gaul, a region defended by at least 300,000 troops. How did Caesar conquer 300,000 troops with one fifth that number? He did it through divide and conquer. Although the Gallic strength was 300,000 in total, this number was divided into a number of tribes that had a history of belligerence among themselves. Caesar was able to form alliances with some tribes, pit other tribes against each other, and pick off remaining tribes one by one. But he was only able to do this through carefully observing and understanding the "natural" political tribal boundaries. Had Caesar laid an arbitrary grid over Gaul and attempted to conquer one square at a time, he could never have been successful.

This brings us to the reason that divide and conquer has been so dismally unsuccessful in IT: IT hasn't understood that observation must occur before division.

If you don't believe me, try this simple experiment. The next time an IT person, say John, suggests breaking up a large IT project into smaller pieces, ask John this question: what observations do we need to make to determine the natural boundaries of the smaller pieces? I can pretty much guarantee how John will respond: with a blank look. Not only will John be unable to answer the question, chances are he will have no idea what you are talking about.

With such a result, is it any wonder that divide and conquer almost never works in IT? It is as if you assigned zip codes by throwing blobs of paint at a map of New York City and assigning all of the red blobs to one zip code and all of the green blobs to another.
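To make the contrast concrete in IT terms, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the six business functions, the dependencies between them, and the use of connected components as a crude stand-in for "observing natural boundaries". It is not a description of any particular partitioning methodology. It simply counts how many dependencies end up crossing a boundary when the pieces are chosen arbitrarily versus when they follow the observed dependencies.

```python
from collections import defaultdict

# Hypothetical data: each pair means "these two functions must coordinate".
DEPENDENCIES = [
    ("orders", "invoicing"), ("orders", "payments"), ("invoicing", "payments"),
    ("payroll", "benefits"), ("payroll", "timesheets"), ("benefits", "timesheets"),
]


def cross_boundary_count(partition, dependencies):
    """Count dependencies whose two ends land in different pieces."""
    owner = {name: i for i, piece in enumerate(partition) for name in piece}
    return sum(1 for a, b in dependencies if owner[a] != owner[b])


def arbitrary_partition(dependencies, pieces=2):
    """Split the names alphabetically into equal chunks, ignoring dependencies.

    This plays the role of the regular grid laid over the map.
    """
    names = sorted({name for pair in dependencies for name in pair})
    size = -(-len(names) // pieces)  # ceiling division
    return [set(names[i:i + size]) for i in range(0, len(names), size)]


def observed_partition(dependencies):
    """Group names that are linked by dependencies (connected components).

    This is a deliberately simple stand-in for observing natural boundaries.
    """
    neighbours = defaultdict(set)
    for a, b in dependencies:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, pieces = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        piece, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in piece:
                continue
            piece.add(node)
            stack.extend(neighbours[node] - piece)
        seen |= piece
        pieces.append(piece)
    return pieces


grid_cost = cross_boundary_count(arbitrary_partition(DEPENDENCIES), DEPENDENCIES)
observed_cost = cross_boundary_count(observed_partition(DEPENDENCIES), DEPENDENCIES)
print(f"arbitrary split: {grid_cost} cross-boundary dependencies")
print(f"observed split:  {observed_cost} cross-boundary dependencies")
```

With this toy data, the alphabetical split cuts four of the six dependencies, while the observed split cuts none. Real systems are rarely this tidy, since almost everything is connected to something, so a real analysis would need a richer notion of boundary than connected components.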

For the first time in the history of IT, we now have a solution to this problem; a scientific, reproducible, and verifiable approach to gathering the observations necessary to make divide and conquer work. I call this approach synergistic partitioning. It is the basis for the IT architectural approach called The Snowman Practice. And if you are ever going to try to build a multi-million dollar IT system, Snowmen is where you need to start. I can assure you, it will work a lot better than throwing blobs of paint at the wall.

You can start reading about The Snowman Practice [here]. Or contact me [here]. What do Snowmen have to do with defining divide and conquer boundaries? Ask me. I'll be glad to tell you.

Subscription Information

Your one-stop signup for information about any new white papers, blogs, webcasts, or speaking engagements by Roger Sessions and The Snowman Methodology is [here].

References

(1) 6:14, translation from http://suntzusaid.com
(2) See, for example, Inc.'s list [here].
(3) I have written about the relationship between complexity and failure rates in a number of places, for example, my Web Short The Relationship Between IT Project Size and Failure available [here].

Legal Notices and Acknowledgements

Photograph from brainrotting on Flickr via Creative Commons.

This blog and all of these blogs are Copyright 2013 by Roger Sessions. This blog may be copied and reproduced as long as it is not altered in any way and that full attribution is given. 

10 comments:

Alex said...

Example of "observe and conquer" in IT http://improving-bpm-systems.blogspot.com/2011/10/enterprise-pattern-structuring-it.html

Thanks,
AS

Roger Sessions said...

I think you are on the right track, but I don't think a matrix approach can scale up. What is your perspective on scaling a matrix?

Alex said...

Sure, the matrix is a primitive tool. For more complex cases, I would use clustering techniques.

Thanks,
AS

Charlie said...

The natural boundaries in IT would have to include organizational (people) boundaries, too. And that's where it could get similar to Caesar's challenge: forming alliances so that resistance is futile. While it may make perfect sense for another system to decouple itself from ours and instead form a more discrete service API, that system's VP may have little incentive to change.

Unknown said...

Out of curiosity I've been asking the "how would you partition" question to folks in big consulting organizations and the most common response I get is "aligned to business process", which I think is a problem carried out of EA definitions being business processes and the technological tools to support them. I support your partitioning strategy, but I think we have a long way before it's really understood.

Roger Sessions said...

Charlie: I agree that the organizational boundaries must include people as well. That is why I don't believe you can define a single set of boundaries and expect it to work in any organization in any industry. Every organization is unique.

Roger Sessions said...

Rodrigo: I think you are right. We do have a long way to go before this issue is understood. Our approach needs to be iterative:
1. Find an organization battling complexity.
2. Show them the uniqueness and the value of the Snowman Practice.
3. Apply the methodology on selected projects.
4. Measure the benefits in lower costs, higher reliability, greater agility, etc.
5. Publicize the results.
6. Go to 1.

Alex said...

Rodrigo,

Many years ago, when RAM was expensive, we optimised the execution of programs in limited RAM by rearranging the source code. We tried to reduce the number of page faults, because only part of a program could be loaded into RAM at a time. This was a kind of partitioning. We used two main techniques: static (looking at the code) and dynamic (measuring the programs' behaviour). I think that your "process perspective" is similar to the static methods. The dynamic methods ultimately showed better results. So, those techniques are available.

Thanks,
AS


Johan Theunissen said...

Information and systems in IT are always networks, with streams and hierarchies contained in them.

People focus on the streams and the hierarchies, but most of the time forget the overall network.

And graphs (networks) consist of objects and links.

Divide and conquer focusses on the objects, but mostly forgets about the links (the interfaces), the natural boundaries.

So interface based systems should be the way to go. Just as interface-based programming is getting popular by the day.

Roger Sessions said...

Johan,
The problem with interface-based systems is that they add complexity through dependencies. Dependencies add huge complexity. So there is a trade-off. Most people blindly add interfaces without doing the trade-off calculation to see if they are adding enough business value to compensate for the added complexity.