Simple Architectures for Complex Enterprises: October 2009

Thursday, October 29, 2009

The Problem With Standish

In my recent white paper, The IT Complexity Crisis, I discussed how much IT failures are costing the world economy. I calculated the worldwide cost to be over $6 trillion per year. You can read the white paper here.

In this white paper I discuss the Standish Chaos numbers, but many readers have continued to question whether my conclusions are in agreement with Standish. I think my conclusions are in agreement, but I also think the Standish numbers are flawed. So I have mixed feeling about them. Let me explain.

The Standish Group has been publishing their annual study of IT failure, their "CHAOS Report" since 1994 and it is widely cited throughout the industry. According to the 2009 report, 24% of all IT projects failed outright, 44% were "challenged", and only 32% were delivered on time, on budget, and with required features and functions.

To be honest, I have never read the Standish Report. Given the $1000 price tag, not many people have. So, like most people, I am basing my analysis of it on the limited information that Standish has made public.

The problem with the Standish Report is not that it is analyzing the numbers wrong. The problem is that Standish is looking at the wrong numbers. It analyzes the percentage of IT projects that are successful, challenged (late, overbudget, etc.), or outright failures. This sounds like useful information. It isn't.

The information we really need is not what percentage of projects are successful, but what percentage of IT budgets are successful.

What is the difference between percentage of projects and percentage of budget? A lot. Let me give you an example.

Suppose you are an IT department with a $1M budget. Say you have six IT projects completed this year, four that cost $50K, one that cost $100K, and one that cost $700K.

Which of these projects is most likely to fail? All other things equal, the $700K project is most likely to fail. It is the largest and most complex. The less the project costs, the simpler the project is. The simpler the project is, the more likely it is to succeed.

So let's assume that three of the four $50K projects succeed, the $100K project succeeds, and the $700K project fails.

Standish would report this as 4/6 success rate, or a 67% success, 23% failure rate. I look at these same numbers and see something quite different.

I look at the percentage of IT budget that was successfully invested. I see $250 K of $1M budget invested in successful projects and $750 K in failed projects. I report this as a 25% success rate, a 75% failure rate.

So both Standish and I are looking at the same numbers, yet we have almost exactly opposite conclusions. Whose interpretation is better?

I argue that, from the organizational perspective, my interpretation is much more reasonable. The CEO wants to know how much money is being spent and what return that money is delivering. The CEO doesn't care how well the IT department does one-off minor projects, which are the projects that dominate the Standish numbers.

So the bottom line is that I have major issues with the Standish Report. It isn't that the Standish analysis is wrong. It is just that it is irrelevant.

Thursday, October 15, 2009

Notes from ITARC NYC Open Meeting on IT Failures

At the recent ITARC conference in NYC, I facilitated a open meeting on IT Failures. We only had one hour, but some interesting ideas were discussed. Thanks to Eric Weinstein for taking these notes.

Reasons people gave for IT failures they experienced:

-Lack of change management

- Scope of the requirements too high level/incomplete or fleshed out leading to bad outcomes

- Analysis of cost estimation was wrong b/c the requirements were not fleshed out

- Cost estimation is an art, man power, resource time, are hard to estimate.

- Lack of accurate communication and feedback AND whether the project is understood

- Final delivery had no bearing on value for the customer - all communication from the developers back to the business stakeholders that came back was totally ignored

- Functional requirements get a lot of attention BUT the non-functional requirements is invisible/doesn't get credit, how to quantify the cost avoidance, non-functional requirements

-Trade off of quick vs. correctly - executive irresponsibility

-Business has unrealistic expectations of delivery dates OR tech people in general estimating time - may skimp out on upfront analysis or testing...

-Implementation side - developers failing - tools to control SDLC process - source control system (full integration of code
check in - what requirements that code is fulfilling - must be reviewed sign/off)

- Main causes of failure is managing complexity of large systems - failure vs complexity has high relationship - more complex a system, the harder it is to scope...must learn how to take big monolithic systems and break down to smaller systems

Solutions

- "The Wrench in the System" Book recomendation

- Ask the business to delineate the success criteria, prioritize in numbers

- Understand timeframe, scope - rescope

- White paper - US Gov't - 66% of IT budget is high risk projects and half of those will fail

Sunday, October 4, 2009

Attacking Architectural Complexity

When I advocate for reducing the complexity in a large IT system, I am recommending partitioning the system into subsystems such that the overall complexity of the union of sub-systems is as low as possible while still solving the business problem.

To give an example, say we want to build a system with 10 business functions, F1, F2, F3, ... F10. Before we start building the system we want to subdivide the system into subsystems. And we want to do it in the least complex collection of subsystems.

There are a number of ways we could partition F1, F2, ... F10. We could, for example, put F1, F2, F3, F4, and F5 in S1 (for subsystem 1) and F6, F7, F8, F9, and F10 in S2 (for subsystem 2). Let's call this A1, for Architecture 1. So A1 has two subsystems, S1 with F1-F5 and S2 with F6-10.

Or we could have five subsystems with F1, F2 in S1, F3, F4 in S2, etc.. Let's call this A2, for Architecture 2. So A2 has five subsystems, each with two business functions.

Which is simpler, A1 or A2? Or, to be more accurate, which is less complex, A1 or A2. Or, to be as accurate as possible, which has the least complexity, A1 or A2?

We can't answer this question without measuring the complexity of both A1 and A2. But once we have done so we know which of the two architectures has less complexity. Let's say, for example, that A1 weighs in at 1000 SCUs (Standard Complexity Units, a measure that I use for complexity) and A2 weighs in at 500 SCUs. Now we know which is least complex and by how much. We know that A2 has half the complexity of A1. All other things being equal, we can predict that A2 will cost half as much to build as A1, give twice the agility, and cost half as much to maintain.

But is A2 the best possible architecture? Perhaps there is another architecture, say A3, that is even less complex, say, 250 SCUs. Then A3 is better than either A1 or A2.

One way we can attack this problem is to generate a set of all possible architectures that solve the business problem. Let's call this set AR. Then AR = {A1, A2, A3, ... An}. Then measure the complexity of each element of AR. Now we can choose the element with the least complexity. This method is guaranteed to yield the least complex architectural solution.

But there is a problem with this. The number of possible architectures for a non-trivial problem is very large. Exactly how large is given by Bell's Number. I won't go through the equation for Bell's number, but I will give you the bottom line. For an architecture of 10 business functions, there are 21,147 possible solution architectures. By the time we increase the system to 20 business functions, the number of architectures in the set AR is more than 5 trillion.

So it isn't practical to exhaustively look at each possible architecture.

Another possibility is to hire the best possible architects we can find on the assumption that their experience will guide them to the least complex architecture. But this is largely wishful thinking. Given a 20 business function system, the chances that even experienced architects will just happen to stumble on the least complex architecture our of more than 5 trillion possibilities is slim at best. You have a much better chance of winning the Texas lottery.

So how can we find the simplest possible architecture? We need to follow a process that leads us to the architecture of least complexity. This process is called SIP, for Simple Iterative Partitions. SIP promises to lead us directly to the least complex architecture that still solves the business problem. SIP is not a process for architecting a solution. It is a process for partitioning a system into smaller subsystems that collectively represent the least complex collection of subsystems that solve the business problem.

In a nutshell, SIP focuses exclusively on the problem of architectural complexity. More on SIP later. Stay tuned.

Thursday, October 1, 2009

Why I Focus On Complexity

When it comes to IT failure, there is no lack of "the usual suspects". We can look to failures in project definition, project management, and needs assessment. We can point to flawed architectures, implementations, and testing procedures. We can focus on communications failures between the business and IT, between IT and the user community, and between different business units.

Yet given this extensive collection of failure factors any of which can doom an IT project, why do I focus almost exclusively on the issue of complexity?

I see all of the failure factors as falling into one or more of three categories:

1. The factor is caused by complexity.
2. The factor is greatly exacerbated by complexity.
3. The factor is solved as a side effect of solving the complexity problem.

Examples of failure factors that are directly caused by complexity are the various technical failure factors, such as poor security or scalability. It is very difficult to make a complex system secure or scalable. Solve the problem of complexity, and these problems become much easier to solve.

Examples of failure factors that are greatly exacerbated by complexity include those related to communications. As a project increases in complexity, people tend to fall more and more into specialized jargon which makes communications more difficult and adds yet more complexity to the already complex project. Different groups tend to see each other as the enemy. As a side effect of learning to solve complexity, the groups learn to see not each other as the enemy, but complexity as their common enemy. Their common language becomes the language of simplification.

Examples of factors that are solved as a side effect of solving the complexity problem include those related to organization. For example, a well known failure factor is the lack of executive sponsorship. But it is difficult to find sponsors for large complex expensive projects. Once a project is broken down into small, simpler, less expensive projects, finding executive sponsors for those projects is much easier.

The other reason I focus so much on complexity is that of all these failure factors, complexity is the only one that is universally present in all failed projects. The fact is that we are quite good at doing simple projects. Our skills just don't scale up to complex projects. So we can either tackle the failure factors piecemeal and try to figure out how to scale each up to higher levels of complexity, or we can try to figure out how to scale down the complex projects into simple projects that we already know how to solve.

So while I pay close attention to all of these failure factors, I continue to believe that the one that deserves our undivided attention is the problem of complexity.

Simple Architectures for Complex Enterprises

Pages