Wednesday, October 21, 2015

Why Big IT Systems Fail


Small IT systems usually deliver successfully. They are delivered on time and on budget. When they are delivered, they usually meet the needs of the business, they are secure, they are reliable, and they are easy to change.

Large IT systems usually do not deliver successfully. They are delivered late and over budget, if they deliver at all. If delivered, they usually fail to meet the needs of the business, they are rife with security problems, they fail frequently, and they are hard to change.

Why are we so good at delivering small IT systems and so bad at delivering large ones? The obvious answer is that large systems are more complex than small systems. But that is not the problem. 

The problem is not the fact that IT systems get more complex as they get larger. The problem is how they get more complex. Or more specifically, the rate at which they get more complex.

The complexity of an IT system increases at a rate we describe as exponential. For most modern IT systems, such as service-oriented architectures, the exponential increase is driven by the number of dependencies in the system. As the system gets bigger, the number of dependencies increases. As the number of dependencies increases, the complexity increases. But the increase in complexity is not linear; it is exponential.

The difference between a linear increase and an exponential increase is critical. 

An example of a problem that increases linearly is a leaky faucet. Say a faucet leaks at a rate of one ounce per hour into a clogged sink that can hold 20 ounces. After three hours, a three-ounce container is enough to empty the sink. If you don't get to the sink for ten hours, you know you need to bring a ten-ounce container. The water leaks out at a steady rate. It doesn't start leaking faster just because more water has already leaked.

But think of a forest fire. Forest fires increase at an exponential rate. Say that in the first minute the fire engulfs one square foot. You cannot assume that in twenty minutes it will have engulfed twenty square feet. That is because forest fires spread exponentially; the bigger they get, the faster they spread.

The mathematics of IT complexity follow the mathematics of forest fires. Say we are building an IT system at the rate of one function per week. It will take almost one year to reach the first 100,000 standard complexity units (SCUs). But it takes only 10 more weeks to reach the second 100,000 SCUs, and then only 7 more weeks to reach the third. By the end of the second year we are adding more than 30,000 SCUs per week!
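To see how figures like these can arise, here is a minimal sketch in Python. It assumes that complexity grows as roughly the 3.11th power of the number of functions; that exponent is an assumption on my part (it corresponds to the rule of thumb that a 25% increase in functionality roughly doubles complexity), since the post does not spell out the underlying SCU formula.

EXPONENT = 3.11  # assumed: a 25% increase in functions roughly doubles complexity

def scu(functions: int) -> float:
    """Standard complexity units for a system with this many functions (assumed model)."""
    return functions ** EXPONENT

# One function is added per week, so the week number equals the function count.
prev_week = 0
for milestone in range(100_000, 400_001, 100_000):
    week = 1
    while scu(week) < milestone:
        week += 1
    print(f"{milestone:>7} SCUs reached in week {week} "
          f"(+{week - prev_week} weeks since the previous milestone)")
    prev_week = week

print(f"SCUs added during week 104: {scu(104) - scu(103):,.0f}")

Under that assumed curve the first 100,000 SCUs take about 41 weeks, the next 100,000 about 10 weeks, the next about 7, and by week 104 the system is adding tens of thousands of SCUs in a single week. The exact numbers depend on the assumed exponent, but the shape is the point: each slice of complexity arrives faster than the one before it.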

Except that we won't, because this rate of complexity increase is unsustainable. Just as a forest fire eventually burns itself out once it has consumed all available fuel, so will an IT system. It will grow until the resources are no longer available to support the massive increase in complexity. At that point, it will do what all complex systems do when they reach a level that is no longer sustainable: it will collapse.

Does this mean that it is impossible to build large IT systems? No, it doesn't. It does mean that we need to figure out how to attack the growth of complexity. We can't prevent the IT system from getting more complex (that is impossible), but we do need to figure out how to make the complexity increase linearly rather than exponentially.

In other words, we need to figure out how to make IT systems behave more like leaky faucets and less like forest fires.




Photo of the forest fire is by the U.S. Forest Service, Region 5, made available through Creative Commons and Flickr. The photo of the faucet is by John X, also made available through Creative Commons and Flickr.

2 comments:

XTRAN guru said...

Roger -- actually, we have known for decades how to prevent exponential increase in the complexity of large IT systems. It's called by various names, but the essence is modularity and decoupling.

"Small IT systems usually deliver successfully" -- I wish that were true, but it isn't. The state of our "discipline" is so sad that even small projects are mostly badly done. And many do fail over time, although it may take longer than for larger projects. Of course, small projects turn into large (and messy) projects through accretion, and since decoupling isn't usually practiced in that process, they fall victim to the same ills as projects that started out larger.

"Large IT systems...are hard to change" -- that's because they are usually tightly coupled and rife with cloned code, sloppy coding practices, and badly designed and undocumented interfaces (as with small projects).

"The complexity of an IT system increases at a rate we describe as exponential", "the exponential increase is driven by the number of dependencies in the system" -- that can only happen due to bad design and implementation practices. I recommend what I call the "cocktail shaker" approach -- a combination of concurrent top-down functional decomposition and bottom-up primitives identification, meeting in the middle with the functional decomp being implemented using the primitives. Some primitives will be domain-specific, while others are aspect-specific (cross-cutting). Some may have existing implementations, in the form of in-house or commercially available / freeware run libraries and/or classes / methods, while others will need to be implemented (and added to the shop's stock).

The result of the modularity and decoupling that approach provides is that complexity (including dependencies) increases much less than exponentially. In fact, what sometimes happens is that, as the problem being solved becomes better understood, the solution's complexity actually decreases, as it undergoes what I call the "collapse into elegance" -- the underlying structure of the problem becomes more apparent and a more elegant solution can be achieved through refactoring.

An added advantage of good coding practice, and of the decoupling it provides, is that small teams can be given pieces of the work to do, secure in the knowledge that what they do won't break anyone else's code. So the complexity of managing the development process also increases less than exponentially, and may actually decrease.

"We can't prevent the IT system from getting more complex (that is impossible)" -- in terms of overall inherent complexity, yes; in terms of difficulty of maintenance and enhancement, no. Good architecture and the resulting decoupling will prevent that.

I agree that the problem you have described exists. But the solution has been known for many years, and always works when practiced properly. So the problem is caused not by any inherent property of IT systems, but by an appallingly low standard of craftsmanship and professionalism in the software development industry.

An example -- I created (originally in 1984), and currently babysit, a system comprising about 1/2 million net LOC, in thousands of modules, with thousands of included files. I spend about 1% of my development time doing maintenance, compared to 60%-80% in the industry generally. Why? Because a) I maintain extremely high standards of code and documentation quality, b) I never "cut corners", and c) I refactor as soon as it's needed. And the system, while experiencing major growth in functionality, actually gets stronger and more robust over time. So of course its overall inherent complexity increases, but from a software development point of view, that complexity is bounded by decoupling so it can be treated piece-wise, providing essentially no increase in difficulty of maintenance and enhancement.

Roger Sessions said...

XTRAN guru,
I assume that by decoupling, you mean moving from a call-based system to a message-based system. Is that right? If so, I would disagree that this will reduce complexity. A message between systems counts as a dependency, which increases the dependency-related complexity. The problem is that when we break a large tightly coupled system into a number of small loosely coupled systems, we go from a system that has high functional complexity (lots of functionality) to a system that has high dependency complexity (lots of dependencies). This is why so many SOA projects fail.
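A minimal sketch of that trade-off, assuming (purely for illustration) that functional complexity and dependency complexity follow the same power-law curve; the exponent, the function counts, and the message counts below are all hypothetical, not measurements from any real system.

EXPONENT = 3.11  # assumed exponent, applied to both functions and dependencies

def complexity(count: int) -> float:
    """Assumed complexity contribution of this many functions or dependencies."""
    return count ** EXPONENT

# One tightly coupled system: 100 functions, no inter-system messages.
print(f"monolith, 100 functions:      {complexity(100):>12,.0f}")

# The same functionality split into ten services of 10 functions each.
# The result depends on how many message dependencies the split introduces.
for messages in (30, 150, 300):
    partitioned = 10 * complexity(10) + complexity(messages)
    print(f"10 services, {messages:>3} messages: {partitioned:>12,.0f}")

With these assumed numbers, the partitioned system is simpler only while the number of inter-service dependencies stays small; once the services become chatty, the dependency complexity overtakes the functional complexity that the split removed.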

I don't think you are right that most small projects fail. At least, that is not what the research I have seen shows. I do agree that once small projects grow into large projects, they fail, but that is because they are now large projects, not because they were once small projects.

As far as decoupling being an approach that reduces the exponential growth, if that were true we would expect to see all SOAs delivered successfully, but that doesn't happen. They seem to follow the same failure rates as any other architecture. Now of course most SOA projects succeed, but that is because most are small.

I completely agree with you on the need for good coding practice. But I don't think that is the answer to the problem I am discussing. Large systems fail not because of bad code, but because of bad architecture. Good architecture can mitigate the problems of bad code. Good code can't mitigate the problems of bad architecture.

As far as your 500K LOC system being stable, I think that is great and certainly is to your credit as an excellent coder and capable architect. I'm not sure that qualifies as a big system. I usually consider a big system a $10M+ system. A 500K LOC system is probably more like a $5M system. So it may be that we agree, but we are just looking at different scales.

I agree with you that the exponential rise in complexity is due to "bad architecture." I would phrase this as architecture that has not incorporated a complexity management strategy. I think the problem is that we don't differentiate between small systems and large systems. We try to take practices that work on small systems and apply them to big systems, and that is when we find the limitations of those practices.