Tuesday, December 4, 2012

The Snowman Architecture Part Three: The Technical Benefits


This is the third part of a four-part blog about the Snowman Architecture. The first part was The Snowman Architecture: An Overview. The second part was The Snowman Architecture: The Economic Benefits. In this part, I will be discussing the technical benefits of the Snowman Architecture. But don't read this until you have read the overview! And if you care about things like ROI, check out part two.

But whatever you do, don't miss this part! Over the next twelve pages or so (yes, I know it's a bit long) I'm going to take you through fourteen of the most compelling technical reasons why the Snowman Architecture is a huge improvement over today's approaches to large IT. You are now reading nothing less than my declaration of war on traditional IT methodologies. The first snowball has now been fired!

Review

I gave an overview of the Snowman Architecture in part one, but let's review briefly.

The Snowman Architecture breaks down a large IT system into small vertically partitioned subsystems called Snowmen. These snowmen interact with each other through asynchronous messages. Snowmen are designed to be as autonomous as possible using a design methodology known as Simple Iterative Partitions (SIP) (1).

Snowmen come in three layers. The head of the snowman consists of the business functions that make up a capability. The torso of the snowman consists of the technical systems that support those business functions. The bottom of the snowman consists of the data that is used by those technical systems. Each of these layers is strongly partitioned based on the business functions that make up the head.

Snowmen reach out to each other through their arms, the asynchronous messaging system. Often this is implemented as an SOA.



Contrast to Traditional Architectures

A traditional IT architecture is also implemented in three layers. These layers are the same as those of the Snowman Architecture: business architecture, technical architecture, and data architecture. What differentiates the Snowman Architecture from a traditional IT architecture is the strong vertical partitioning, as shown in Figure 1.

Figure 1. Traditional IT Architecture vs. Snowman Architecture

It turns out that this strong vertical partitioning has a major impact on the effectiveness of the architecture. Let's take a look at fourteen key non-functional attributes of a large IT system. As you will see, every single one of them is improved by the strong vertical partitioning that characterizes the Snowman Architecture.

In this analysis, I assume that the system we are evaluating is a large (greater than ten million dollar) system. This is the point at which traditional IT architectural methodologies are no longer able to keep up with the exponential increases in system complexity (2). I also assume that SIP was used to assign business functions to the head of the snowman, an essential step to minimizing the overall complexity of the Snowman Architecture.

Okay, given these two assumptions, let's see why the Snowman Architecture outperforms all traditional approaches to large IT system design. I'll start by listing the fourteen attributes and then go through them one by one. The attributes I will look at are these:
  • Business Alignment 
  • Regulatory Compliance 
  • Auditing 
  • Security
  • Agile Friendliness 
  • Maintainability 
  • Testability 
  • Reliability 
  • Recovery 
  • Throughput 
  • Scalability 
  • Flexibility 
  • Cloud Effectiveness 
  • Vendor Lock-in
You might as well do a quick check-point. Are any of these attributes important to your IT systems? If not, you might as well stop reading now. If one or more of these are of interest then keep reading.

Okay, now let's go through them one by one. Feel free to skip those you don't care about.

Business Alignment

A system is well aligned when it meets the needs of the business. You can think of business alignment as the Wow factor. When the system is delivered, does the business say, "Wow!" Or does it shake its collective head and reach for the nearest bottle of Tequila?

In any system design life cycle, there is a phase in which the business requirements are gathered. In the traditional approach (the left hand side of Figure 1) the requirements are gathered more or less immediately after the project has been approved and before the technical architecture is designed.

The size of the requirements document(s) is always proportional to the size of the project. Massive projects require lots of requirements documentation, often tens of thousands of pages. The larger the stack of requirements, the lower the chances are that those requirements accurately reflect what the business actually needs.



In SIP (the guiding design methodology for the Snowman Architecture) an additional project phase is introduced: the Partitioning Phase. This is when the basic shape of the Snowman is identified.  

What is important from an alignment perspective is that this partitioning of the larger system into smaller autonomous snowmen takes place before the requirements have been gathered. Since a typical snowman rarely exceeds one million dollars in cost, its requirements are modest. Since the requirements are relatively modest, it is more likely that those requirements accurately reflect the actual business need.

Since the Snowman Architecture has more accurate requirements than the traditional IT architecture, it is likely to actually meet the business need.

Regulatory Compliance 

An IT system is considered compliant when it can be shown to operate within the constraints of the relevant laws and regulations. Some industries, such as video gaming, have few if any constraints. Others, such as financial organizations, have a complex web of laws and regulations.

Our ability to show that a given IT system operates within its regulatory constraints is dependent on the complexity of the system. The more complex the system, the more difficult it is to prove compliance.

Large traditional IT systems (the left side of Figure 1) have a highly complex web of relationships between business functions, technical processes, and data. Thus it is very difficult to prove compliance.

The Snowman Architecture (the right side of Figure 1) has a number of simple relationships between business functions, technical processes, and data. The architectural simplicity is guaranteed by the SIP-directed partitioning of the business functions into snowmen heads and the strong vertical partitioning that occurs once the technical and data architectures are created.

Because each snowman is relatively simple, it is correspondingly easy to prove that it operates in compliance with any relevant regulatory constraints.

Auditing 

An IT system is considered auditable when we can accurately trace data changes back to technical processes, from there back to business functions, and from there back to human beings. The more paths there are to the data, the more difficult it is to trace these paths. 

The traditional IT architecture has a large number of complex paths to the data. It is nearly impossible for an examiner to determine which of these paths resulted in a particular item of data being updated.

The Snowman Architecture has a small number of simple paths to the data. An examiner can easily get the whole picture and then determine which of these few candidate paths resulted in a particular item of data being updated. 


Security

When we talk about the security of a system, we are generally talking about our ability to protect data. Data, of course, resides in a database. So security comes down to our ability to configure the database so that unauthorized updates are not possible. In a traditional architecture, there are so many processes that need to update so many parts of the database under so many different circumstances that it is difficult to figure out a secure configuration. And even if one does manage a secure configuration, the next process that is added will change everything.

In the Snowman Architecture, database configuration is much easier. Because the partitioning of the head of the snowman (the business processes) dictates the partitioning of the technical layer and then the partitioning of the data layer, the only processes that will ever need to access the data in the snowman are the processes in the torso of the snowman. This makes configuration easy: allow the processes in the snowman torso to access the data in the snowman bottom and don't allow any process outside the snowman to access any data in the snowman. Voilà. Done.
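
To make that rule concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the SnowmanData class, the "payments" snowman, the process names); in a real system the rule would be enforced by the database's own permissions, not by application code. The point is only how short the policy is: a process is either in the torso or it is refused.

```python
class SnowmanData:
    """Bottom of one snowman: data that only its own torso may touch.

    Hypothetical illustration; real enforcement would live in the
    database's own access controls.
    """

    def __init__(self, snowman, torso_processes):
        self.snowman = snowman
        self._torso = frozenset(torso_processes)
        self._rows = {}

    def _check(self, process):
        # The entire policy: inside the torso, or refused.
        if process not in self._torso:
            raise PermissionError(
                f"{process} is not part of snowman {self.snowman!r}"
            )

    def write(self, process, key, value):
        self._check(process)
        self._rows[key] = value

    def read(self, process, key):
        self._check(process)
        return self._rows[key]


# Two torso processes share the data; nothing else gets in.
payments = SnowmanData("payments", {"authorize", "settle"})
payments.write("authorize", "order-17", "approved")
```

A process from some other snowman (say, a hypothetical "reporting" process) that calls payments.read gets a PermissionError. If it needs the information, it has to ask for it with a message.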

Agile Friendliness 

Many organizations are attracted to Agile Development Methodologies. I agree that Agile development has a lot of promise. However, I also think it doesn't scale.

A recent paper by Vikash Lalsing et al. (3) indicates that Agile projects of 0.5 person years or less are excellent candidates for Agile development. They predict such projects will be less than 10% over budget. By the time the project size reaches 3.6 person years, the budget overrun increases to 18%. And by the time the project size reaches 8.2 person years, the budget overrun increases to 66%.

The Snowman Architecture is ideally suited to projects of greater than $10M. This equates to an effort of close to 100 person years. This is more than ten times the project size that yielded the 66% budget overrun. 

In the Snowman Architecture, the larger project is broken down into relatively simple, autonomous chunks of project work. Each of these chunks becomes an individual snowman. The use of the SIP methodology ensures that not only is each snowman as simple as possible, but the relationships between snowmen are as simple as possible. 

The project size of any one snowman is unlikely to exceed $1M and in many cases will be much less. A $1M project is around 7 person years. This is still large by Agile standards, but far closer to a workable Agile number than a project that does not have the benefit of the Snowman Architecture.

Maintainability 

A system is maintainable when it is easy to locate the source of bugs. The more complex the system, the more difficult it is to find the source of bugs.

The complexity of the traditional architecture (the left side of Figure 1) is much higher than the complexity of the Snowman Architecture. The maintainability of a traditional architecture is therefore much lower. The use of the SIP methodology guarantees that not only is the overall complexity of the Snowman Architecture low, it is as low as it can possibly be (4). Simplicity is important when it comes to snowmen. A simple snowman is simple to maintain.


Testability 

System bugs can manifest themselves at any point in the system life cycle. The later in the life cycle the bug is manifest, the more problems it causes. Our goal in system testing is to find bugs in the system as early as possible and definitely before the system is delivered to customers. The most common strategy for system testing is to write code or scripts that exercise the system and ensure it is working correctly. 

To be sure a system is working correctly you must write test code that exercises every possible logical path through the system. The more logical paths there are through the system, the more difficult it will be to create the test code and the more likely it will be that you will have missed an important path. This translates to a greater likelihood that you will ship buggy code.

There are two reasons the Snowman Architecture is more testable than the traditional architecture.  

The first reason the Snowman Architecture is more testable has to do with the number of paths. 

A traditional IT architecture has many possible paths. By the time the system reaches a few million dollars in size, it effectively has an infinite number of paths and there is no way they can all be tested.

In the Snowman Architecture, each snowman can be tested independently. Since each snowman is relatively small and simple, there are relatively few paths through the snowman. Once you have tested all of the snowmen and the connections between them, you have effectively tested the system as a whole. Thus your chances of shipping buggy code are greatly reduced if you are using the Snowman Architecture.

The second reason the Snowman Architecture is more testable has to do with how pieces of the system are connected together. In a traditional IT architecture, segments of code are often connected by shared data in a database. In the Snowman Architecture, snowmen are almost always connected through asynchronous messages. 

These two approaches to connections are very different from a testability perspective. Shared data connections are almost impossible to test. There are just too many ways the data can be accessed. Asynchronous messages, in contrast, are very easy to test. One need only write a messaging harness, a common practice among service-oriented architectures, and the connection points become easily tested. 
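
A messaging harness can be very small. The sketch below is my own illustration, not part of SIP: the payments_snowman function, the message fields, and the run_harness helper are all hypothetical, and Python's in-process queue.Queue stands in for whatever messaging system you actually use. The harness drops messages in the snowman's inbox, lets the snowman run, and collects whatever comes out the other side.

```python
import queue


def payments_snowman(inbox, outbox):
    """Hypothetical snowman: drains its inbox, emits one reply per message."""
    while True:
        try:
            msg = inbox.get_nowait()
        except queue.Empty:
            return
        if msg["type"] == "charge":
            outbox.put(
                {"type": "charged", "order": msg["order"], "amount": msg["amount"]}
            )
        else:
            # Anything the snowman does not understand is rejected, not ignored.
            outbox.put({"type": "rejected", "order": msg.get("order")})


def run_harness(messages):
    """Feed messages in through the arms and return the replies, in order."""
    inbox, outbox = queue.Queue(), queue.Queue()
    for m in messages:
        inbox.put(m)
    payments_snowman(inbox, outbox)
    replies = []
    while not outbox.empty():
        replies.append(outbox.get_nowait())
    return replies
```

Because the only way in or out is a message, a test is just a list of messages to send and a list of replies to expect. The harness never needs to know what the snowman's database looks like.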

So we see two reasons the Snowman Architecture is so much easier to test than the traditional IT architecture. First, it has fewer code paths. Second, it uses asynchronous messages for its connection points. It is hard to test a traditional IT architecture. It is easy to test a Snowman Architecture.


Reliability 

Reliability is a measure of the typical amount of time a system will remain running before it unexpectedly drops dead. Reliability is often described as mean time between failures. 

Reliability is related to testability, the last attribute I discussed. The more testable a system is, the less likely it is to have post-delivery bugs. It is these post-delivery bugs that cause systems to fail. The fewer bugs, the less likely the system is to fail. Since the Snowman Architecture is easier to test than the traditional IT architecture, it is likely to have fewer bugs and thus will be less likely to fail.

But there is another factor that favors reliability of the Snowman Architecture. This has to do with how easy it is to quarantine a bug. Consider Figure 2, which is a blowup of the left hand side of Figure 1 with some labels added for reference.

Figure 2. Blow-up of Traditional IT Architecture.

Assume that database D crashes. We have three processes dependent on D, namely, n, o, and p. So these three processes crash. Processes g and i are both dependent on n, so they both go down. Process i is also dependent on o, but since it has already crashed, we need not worry about it further. Processes i, d, and f are all dependent on p. Process i is already down, but now d and f join the fun. So now we have D, n, o, p, g, i, d, and f all down. This can corrupt any databases they are involved with, which includes C and E. This brings down their dependent processes b and h. Which in turn... you get the picture. There is no quarantine, so when one part of a system catches a bug, that bug can rapidly propagate to the entire system.

Contrast this to Figure 3, which shows a closeup of the Snowman Architecture.

Figure 3. Closeup of Snowman Architecture

Assume in Figure 3 that database B crashes. It can bring down processes d, e, f, and g. But that's it. The boundaries of the snowman have effectively quarantined the bug from spreading further. The only connections between d, e, f, g and other processes are through asynchronous messages, and these channels can easily be protected. So while the bug in B may crash the entire snowman, there is no pathway for the bug to spread further.
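
The two walkthroughs amount to a graph traversal: start at the failed component and follow every "is depended on by" edge. Here is a minimal sketch, assuming only the dependencies named in the text for Figures 2 and 3. The blast_radius function is my own name for it. The traversal does not model the database corruption that drags in C and E, so the traditional result is a floor, not a ceiling.

```python
from collections import deque


def blast_radius(dependents, failed):
    """Return everything that goes down when `failed` goes down.

    `dependents` maps a component to the components that depend on it.
    """
    down = {failed}
    pending = deque([failed])
    while pending:
        node = pending.popleft()
        for dep in dependents.get(node, ()):
            if dep not in down:  # already crashed, nothing more to do
                down.add(dep)
                pending.append(dep)
    return down


# Figure 2: only the edges named in the walkthrough above.
traditional = {
    "D": ["n", "o", "p"],
    "n": ["g", "i"],
    "o": ["i"],
    "p": ["i", "d", "f"],
}

# Figure 3: database B and the processes of its snowman. Asynchronous
# messages to other snowmen are deliberately not dependency edges.
snowman = {"B": ["d", "e", "f", "g"]}
```

Calling blast_radius(traditional, "D") reaches n, o, p, g, i, d, and f before corruption is even considered. Calling blast_radius(snowman, "B") stops at d, e, f, and g, because there is no edge leading out of the snowman.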

The bottom line is that bugs occur less frequently with the Snowman Architecture (because it is easier to test) and when they do occur they tend to have only a local impact. In the traditional architecture, bugs occur more frequently (because it is harder to test) and when they do occur, they tend to have a global impact.

Recovery 

Recovery is related to reliability (the last section). Whereas reliability measures how often the system fails, recovery measures how long the failure lasts. In an ideal system, we have high reliability and fast recovery, meaning that the system rarely crashes and when it does, the crash doesn't last long.

It is difficult to develop an effective recovery strategy for a traditional IT architecture. There are too many databases, too many processes, and too many ways everything can be related to each other. When this web of relationships goes down, what do you do? You try to protect the entire system but this is difficult because the system is large, it is complex, and it is a moving target.

In contrast, it is easy to develop an effective recovery strategy for a Snowman Architecture. All you need to do is shadow any requests to the snowman to a backup snowman. Then if a failure occurs, reroute all new requests to the backup. This rerouting can occur as quickly as one can notice that the primary snowman has failed. This is shown in Figure 4.
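
Here is a minimal sketch of shadow-and-reroute in Python. The Snowman and Router classes are hypothetical stand-ins, and the sketch assumes a failure is noticed the moment a request to the primary raises an error; a real deployment would more likely rely on health checks in the messaging layer.

```python
class Snowman:
    """Hypothetical snowman instance that records the requests it has seen."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.log = []

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.log.append(request)
        return f"{self.name} handled {request}"


class Router:
    """Shadows every request to the backup; reroutes when the primary fails."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def send(self, request):
        if self.active is self.primary:
            # Shadow copy: the backup sees the request too, so it stays current.
            shadow_reply = self.backup.handle(request)
            try:
                return self.primary.handle(request)
            except RuntimeError:
                # Failure noticed: all new requests now go to the backup.
                self.active = self.backup
                return shadow_reply
        return self.backup.handle(request)
```

Because the backup has been shadowing all along, it has already seen every request the primary saw. Failover is a change of address, not a restore from tape.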

Figure 4. Recovery Mechanism for Snowman
Taking this and the last two sections together, I can make the following claims about the Snowman Architecture relative to the traditional IT architecture:
  • The Snowman Architecture will have fewer bugs.
  • The bugs will have less impact.
  • Recovery from that impact will be faster.

Throughput 

Throughput refers to the amount of work a system can process in a unit of time. Often we measure throughput in transactions per minute. Throughput should not be confused with response time, which measures how long a single user waits for work to be completed.

Throughput is important because it directly influences cost. If a system has low throughput, then a lot of resources are needed to process a given workload. If a system has high throughput, the number of resources needed to process the same workload is much less. 

There are two architectural factors that strongly influence throughput: the number of synchronous connections and the amount of shared data. Synchronous connections slow down throughput by blocking processes until connected processes have completed their work. Shared data slows down throughput by blocking databases.

Both of these factors come together in the traditional IT architecture. These systems heavily favor synchronous connections and make extensive use of shared data. Between the two, throughput is substantially degraded.

In the Snowman Architecture, synchronous connections are only used within a snowman. All (or almost all) connections between snowmen occur through asynchronous messaging. Which means no blocked processes.

In the Snowman Architecture, shared data is the equivalent of a multi-headed snowman. This is anathema to the Snowman Architecture. The only processes that are allowed to share data are those that live within a single snowman. Since only a few processes ever share data, database blocking is kept to a minimum.

Between the judicious use of asynchronous messaging and non-shared data, the Snowman Architecture performs at a much higher throughput than does the traditional IT architecture. This means lower costs per unit of work, which means lower IT costs.


Scalability 

Scalability refers to our ability to support larger and larger workloads. Say we have designed a system to support 100 concurrent customers and then our system becomes so popular that we must support 500 concurrent customers. Our ability to adapt to the higher customer load is dependent on our scalability.

In the past, scalability was seen as a hardware power problem. It was assumed that to allow a system to process larger and larger workloads, it had to run on more and more powerful hardware. When the current system could no longer support the workload, the hardware would be upgraded. This could involve faster processors, more memory, or larger disk drives. In the worst case, this involved replacing smaller cheaper machines by larger expensive machines. This is the model that served the power computer companies like IBM and Sun so well.

Today, scalability is seen as a hardware numbers problem rather than a hardware power problem. We now assume that to process a larger workload we don't replace cheap machines with expensive machines; instead, we get more cheap machines. This is the model that powers the most scalable systems in the world today such as Google. Google runs its entire system on inexpensive throw-away hardware and has for more than a decade (5).

Given this modern view of scalability, there are three factors that determine how scalable a system is.

The first scalability factor is the compactness of the system. Smaller, more compact systems are easier to scale. Larger, more disperse systems are harder to scale.

The second scalability factor is the usage of asynchronous messages. The judicious use of asynchronous messages goes a long way toward making a system scalable. Think of an asynchronous message system as a mailbox. As mail comes in faster, one adds more receivers. As long as any of the receivers can process the mail, scalability becomes limited only by the number of receivers you can support.

The third scalability factor is the size of the database on which the system depends. Because databases have such specialized hardware requirements, they are the most difficult part of a system to scale up.
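
The mailbox analogy from the second factor can be sketched in a few lines of Python. This is an illustration only: the drain_mailbox function is a hypothetical name, threads stand in for separate machines, and queue.Queue stands in for a real messaging system. What matters is that the number of receivers is just a parameter; nothing else in the code changes when you add more.

```python
import queue
import threading


def drain_mailbox(messages, receivers):
    """Let `receivers` identical workers empty one shared mailbox."""
    mailbox = queue.Queue()
    for m in messages:
        mailbox.put(m)

    handled = []
    lock = threading.Lock()

    def receiver(name):
        # Any receiver can take any message; none of them is special.
        while True:
            try:
                msg = mailbox.get_nowait()
            except queue.Empty:
                return
            with lock:
                handled.append((name, msg))

    threads = [
        threading.Thread(target=receiver, args=(f"receiver-{i}",))
        for i in range(receivers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Whether you call drain_mailbox with one receiver or with ten, every message is handled exactly once. Which receiver handles which message is not predictable, and that is the point: the senders never need to know.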

To compare the scalability of the Snowman Architecture versus a traditional IT architecture, we must start by defining the unit of scalability. In the Snowman Architecture, the unit of scalability is an individual snowman. In a traditional IT architecture, it is the entire system.

In comparing the two architectures, we see the Snowman Architecture outperforming the traditional large IT architecture in all three scalability factors. First, it is much more compact. Second, it uses asynchronous messaging in all the right places, at the boundaries to the snowmen. Third, it minimizes the size of the data pool that must be scaled by enforcing the concept of strict vertical partitioning.

As a result, the Snowman Architecture is much more amenable to scaling using the modern efficient approach to scalability, scaling by numbers. The traditional IT architecture is largely consigned to the much more expensive and inefficient approach to scalability, scaling by power. Today, scaling by power seems as quaint as vinyl records.

Flexibility 

Flexibility refers to our ability to modify the system as our business needs evolve. Say we have built our payment system to take credit cards and we now want to take debit cards. How easy is it to update our system to take debit cards as well as credit cards?

Our ability to modify our system depends on how complex the system is. The more complex the system, the more difficult the modifications will be to implement. Traditional large IT systems are very complex. They are therefore very difficult to modify. Frequently, changes in one part of the system cause unexpected problems in other parts of the system.

The Snowman Architecture is composed of a series of autonomous, self-contained, relatively simple snowmen. Because of the synergy algorithms used by SIP to partition business functionality across snowmen, it is highly likely that any modifications necessary for a specific business change will all be located within a single snowman. Since any given snowman is simple (certainly relative to a traditional IT system) we can expect that the modifications will be much more straightforward than they would be with a traditional architecture.

Cloud Effectiveness

The cloud is an attractive platform because of its "pay for what you eat when you eat it" pricing model. But to leverage this platform, it is important to structure your systems so that you eat the least amount possible to accomplish your work.

Traditional large IT systems are poorly organized to leverage this model. Because of their sprawling nature, all or most of the system must be running on the cloud to accomplish even the most trivial of tasks. This means that you are paying for all or most of the system even when you are using only a small part of it. Even worse, when you need to add new instances to handle a larger workload, you are adding sprawling new instances that quickly drive the cost out of sight.

The Snowman Architecture is a collection of smaller snowmen, each dedicated to a group of closely related ("synergistic") tasks. In most scenarios, a given workload will require only a single snowman. This means that you are paying only for the resources that that snowman requires. And when you add new instances, you add them in small, inexpensive, snowman sized amounts.
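
A bit of arithmetic shows how quickly the difference adds up. Every number below is made up for illustration: I assume a sprawling system that needs 40 resource units per instance, a snowman that needs 4, and a price of 10 cents per unit per hour. The hourly_cost_cents function is a hypothetical helper, not anyone's actual pricing formula.

```python
def hourly_cost_cents(instances, units_per_instance, cents_per_unit_hour):
    """Pay-for-what-you-eat: cost grows with every unit you keep running."""
    return instances * units_per_instance * cents_per_unit_hour


PRICE = 10            # cents per resource unit per hour (assumed)
TRADITIONAL_UNITS = 40  # whole system must run for any task (assumed)
SNOWMAN_UNITS = 4       # only the one snowman the workload needs (assumed)

# Workload triples, so we go from one instance to three.
traditional_before = hourly_cost_cents(1, TRADITIONAL_UNITS, PRICE)
traditional_after = hourly_cost_cents(3, TRADITIONAL_UNITS, PRICE)
snowman_before = hourly_cost_cents(1, SNOWMAN_UNITS, PRICE)
snowman_after = hourly_cost_cents(3, SNOWMAN_UNITS, PRICE)
```

Under these assumptions, tripling the workload takes the traditional system from $4.00 to $12.00 an hour, while the snowman goes from $0.40 to $1.20. The ratio is whatever your own system's sprawl makes it; the shape of the result is the same.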

Figure 5 contrasts the traditional IT architecture and the Snowman Architecture running on the cloud.
Figure 5. The Cloud: Traditional IT Architecture versus The Snowman Architecture.

Vendor Lock-in

A system exhibits vendor lock-in when it is dependent on a single vendor for some aspect of its life support. Usually this vendor is the one providing the software platform.

Vendor lock-in is either good or bad, depending on your perspective. If you are the client, vendor lock-in is bad. It puts you in a weak bargaining position with your vendor. If you are the software platform provider, vendor lock-in is good. It puts you in a strong bargaining position with your customer.

The standard customer approach to avoiding vendor lock-in is through the use of standards. If the customer builds a system on a standard API, then the customer can easily port the system to another software platform that supports that same API. Or at least, that is the logic.

How do vendors achieve lock-in in the face of a plethora of standards covering everything from data storage to virtual systems? Vendors achieve lock-in through the tried and true process called embrace and extend. Embrace and extend is a two-part process. First, the vendor embraces a particular standard. Then the vendor extends the standard in vendor-specific ways. These extensions are the bait that draws in the customer. The goal is to make the extensions so powerful that they are irresistible. Once the customer has taken the bait, they are trapped. Lock-in is complete.

I have seen many customers try to resist the bait with corporate edicts forbidding the use of any vendor extension. In the end, resistance is futile. You will be assimilated.

The larger and the more complex the system, the more difficult it is to locate and remove the vendor extensions. This means it is more difficult to port the system to another vendor. If you can't take your code to another vendor, you are locked in. And your future is now in the hands of a company whose main goal is wringing as much money as possible out of you in the next contract negotiation.

As I said, resisting vendor extensions is pointless. The best strategy for avoiding vendor lock-in is to make it as easy as possible to locate and rewrite those sections of a system that have used the vendor extensions. Your ability to locate and rewrite those sections is dependent on how small and simple the system is. We are dealing with the same issues I discussed in the section on Flexibility. Small, simple systems are easy to modify. Large, complex systems are not.

Thus small and simple is your best defense against vendor lock-in. And if you want small and simple, don't look to standards. Look to snowmen.

Summary

In part one of this blog, I introduced the Snowman Architecture. In part two, I discussed the non-technical advantages of this architecture. In this part, I have discussed the many technical advantages of this architectural approach.

If you are building a large IT system (say, over $10M) the Snowman Architecture offers a huge number of compelling advantages over traditional approaches. These advantages range from better security to improved reliability to lower cost to greater flexibility. In fact, there is not a single non-functional requirement that will not benefit from the Snowman Architecture.

If, at this point, you are preparing to build a large IT system and you aren't seriously considering the Snowman Architecture, then I don't know what else I can say. One of us is crazy.

Stay tuned for part four of this blog, in which I will discuss the arguments against the Snowman Architecture and why they are all flawed.

- Roger Sessions
Houston, Texas

Did you find any errors (even spelling) in this blog? Let me know. I'd love to correct them.


References

(1) See, for example, the Web Short SIP Methodology for Project Optimization by Roger Sessions. Available here.

(2) See, for example, the Web Short The Relationship Between IT Project Size and Failure Rates by Roger Sessions. Available here.

(3) PEOPLE FACTORS IN AGILE SOFTWARE DEVELOPMENT AND PROJECT MANAGEMENT by Vikash Lalsing, Somveer Kishnah and Sameerchand Pudaruth in International Journal of Software Engineering & Applications (IJSEA), Vol.3, No.1, January 2012. Available here.

(4) The Mathematics of IT Optimization by Roger Sessions. (White Paper). Available here.

(5) WEB SEARCH FOR A PLANET: THE GOOGLE CLUSTER ARCHITECTURE by Luiz André Barroso, Jeffrey Dean, and Urs Hölzle in IEEE Micro, March/April 2003. Available here.

Acknowledgements

The snowman photos are all from Flickr under Creative Commons license. The photographers are, in order of appearance: 


Legal Notices

This blog is copyright (c) 2012 by Roger Sessions. It may be copied, reposted, and printed as long as it is not modified in any way. Other than that, unauthorized usage prohibited. Ask, though. I'll probably agree.

SIP is a trademark (t) of ObjectWatch, Inc. ObjectWatch is a registered trademark of ObjectWatch, Inc. All other trademarks are owned by their respective companies.