You can't escape failures. So, to the extent that it is practical, your designs should keep this in mind. However, network design is always faced with trade-offs: efficiency versus cost versus performance versus maintainability, and so on. While it would seem networks should always be designed to minimize problems, this minimization may run counter to other constraints. There are no simple solutions to these dilemmas. However, there are some general guidelines that can help with many networks.
Designing networks with failures in mind requires a thorough understanding of how systems fail. So before discussing design principles, it is helpful to review the ways systems fail. I'll begin by describing several different types of failure. While this classification is somewhat simplistic, it will serve our needs.
Simple failures: Perhaps the easiest type of failure to deal with is when there is a single point of failure or a simple failure. With this type of failure only one component of your network isn't working. Ideally, it will be clear which device isn't working, but often this isn't the case. In many instances it will appear that a number of devices are not working. For example, if there is a bad cable connector on a network's only DNS server, the server will be unavailable, DNS will be unavailable, and email (among other services) will be unavailable. Actually, only one failure has occurred: the connector. Once it is repaired, everything else will be miraculously cured.
Independent multiple failures: Sometimes you will be faced with multiple simultaneous or near simultaneous failures. Often you will have only the illusion of multiple failures, as in the previous example where a single connector failed. But sometimes more than one device will fail. With independent multiple failures, the timing of the failures is nothing more than coincidence. There is no causal link between the failures. Unfortunately, independent multiple failures can be difficult to deal with for three reasons. First, you must realize that you really do have more than one failure. Second, it may be difficult to separate the symptoms of one failure from another. Finally, it is human nature to try to find connections even where none exist. This tendency can severely mislead you.
Cascade failures: A sequence of failures where one failure causes the next is known as a cascade failure. With true cascade failures, each failure is a separate problem that must be addressed. For example, a failing power supply can damage an interface. To bring the device back online, you will need to repair or replace both the power supply and the interface. (Hopefully, you'll replace the power supply first.) It can be difficult to distinguish a true cascade failure from a simple failure that affects a number of other services. But while the distinction is irrelevant to your users, it can be important when troubleshooting.
System failures: Perhaps the most pernicious type of failure is the system failure, an overused and often misapplied term. System failures result from the unexpected, nonobvious interactions of system components. The interaction may be a consequence of independent failures interacting with one another, or simply of unexpected incompatibilities between devices. A system failure requires an unfamiliar, unplanned, or unexpected interaction that is not visible or not immediately comprehensible. Multiple simple failures do not constitute a system failure if there is no interaction between them. Nor do cascade failures, since their interactions can be clearly understood. Probably the best way to explain system failures is to give an example.
A number of years ago I encountered a system failure when expanding a college network I had set up. Logically, the network was composed of four subnetworks: a network for the university administrative branch, a network for the faculty, a network for student laboratories, and an access network with the campus Internet connection, dial-in services, etc. Initially, these subnets were all connected to a multihomed host that functioned as an email server, a DNS server, and a router. Each of the individual subnetworks was a collection of hosts interconnected by hubs. The logical structure of the original network is shown in Figure 1a.
While this worked well enough initially, it didn't take long to outgrow the original network hardware, particularly with the emergence of the Web. The next phase in the evolution of this network was to install a separate router and to replace a number of the hubs with switches. It was decided to simply turn off routing at the email server but to leave it connected to all four networks. It was argued that this would provide greater efficiency since local email would not need to cross the router, and that it would provide some redundancy since email, for example, would still be available if the router was down. Because many of the users for the different networks were located near each other, switches supporting virtual LANs (VLANs) were chosen. The new logical structure is shown in Figure 1b.
To understand how this led to a system failure, it is necessary to understand some details about the host and the switches that were used. The email/DNS server had a quad-Ethernet adapter that used the same Ethernet or MAC address for each of the four network interfaces, or ports, rather than a different MAC address for each port. While this is unusual, documentation from the vendor notes that the IEEE leaves it up to the vendor whether to use one address for all ports, the station address approach, or a separate address for each port, the port address approach. (With the original configuration, since each MAC address was on a different network, everything worked fine.)
A characteristic of switches is that they learn MAC addresses from the traffic passing through them and then use this information to control how traffic is handled. When a packet arrives at a port, its MAC source address is entered into that port's address table. The switch also searches each port's address table for the destination address. If it finds a match, it sends the packet out on the port whose address table contained the MAC address, and, perhaps just as importantly, it does not send the packet out on any of the other ports. (If the destination is not in any of the switch's address tables, the switch acts like a hub and sends the packet out on every port.) Since devices may be moved from one port to another, when a switch adds a MAC address to the address table for one port, it will typically remove that address from any other address table it might be in. With "traditional" switches, this would not create any problems with the station address approach used by the email server. A switch on one of the individual networks simply wouldn't know what was going on in any of the other networks.
Now consider the implications of using VLANs with this scheme. The idea behind VLAN equipment is that one physical device can be partitioned into several logical devices. Suppose you have three faculty members, two staff members, and a four-computer student cluster all going back to the same wiring closet. Using traditional switches, you would need three different switches, since there are three different types of users that should each be on a different subnetwork. Moreover, you would need three different cable runs back to the network backbone in order to connect each of these switches. With a VLAN-capable switch, the switch could be partitioned into three logical switches, each dealing with traffic only on its specific subnetwork. So, instead of needing three physical switches, you would only buy one. And, if you are using VLAN technology at the network backbone, you will only need one cable run (or trunk line) between the backbone and the wiring closet. The switches at either end should be able to sort the traffic and send it to the appropriate logical switch. Clearly, VLANs can provide significant savings in many cases.
This raises an interesting question: How should the address tables be managed on a switch supporting VLANs? Specifically, when adding a device to one address table, should it be removed from the address tables for all ports on the physical switch or just from ports on the same logical switch? Since devices may be moved from one network to another, you might assume that the address should be removed from all the tables on a physical device. And this is just what many (if not all) VLAN switches do. But this decision causes problems when used with the email server's station address approach.
The problem is that the email server needed to reside on all four networks simultaneously. Whenever a packet from the server arrived on one port of the switch, the address table entry for the server on any other port was dropped. The switch would still deliver packets to the appropriate logical switch, but that logical switch would then need to send the packets out on every port, since the server's address was no longer in any of its address tables. If only one person was talking to the server, the switch would stabilize quickly. But with simultaneous conversations on different networks, there are problems.
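The interaction can be sketched as a toy model. Everything here (the port names, the VLAN membership, and the server's single station address) is hypothetical; the point is only how per-physical-switch address learning makes a MAC shared across ports flap from one address table to another and force flooding:

```python
# Toy model of a VLAN-capable switch's address learning, sketched to
# illustrate the failure described above. Port names, VLAN membership,
# and the server's "station address" MAC are all hypothetical.

class VlanSwitch:
    """Learns source MACs per port; a MAC learned on one port is removed
    from every other port on the *physical* switch, even across VLANs."""

    def __init__(self):
        self.table = {}  # port -> set of learned MAC addresses

    def receive(self, port, src_mac):
        # Learn the source address on the ingress port...
        self.table.setdefault(port, set()).add(src_mac)
        # ...and flush it from every other port's table
        # (physical-switch scope, as many VLAN switches did).
        for p, macs in self.table.items():
            if p != port:
                macs.discard(src_mac)

    def ports_flooded(self, dst_mac, vlan_ports):
        """Ports a frame for dst_mac goes out on within one VLAN
        (ingress-port suppression omitted for brevity)."""
        known = [p for p in vlan_ports if dst_mac in self.table.get(p, set())]
        return known if known else list(vlan_ports)  # unknown -> flood

SERVER_MAC = "08:00:20:aa:bb:cc"  # the one MAC used on all four server ports
sw = VlanSwitch()

# The server talks on VLAN 1 (port "p1"), then on VLAN 2 (port "p2").
sw.receive("p1", SERVER_MAC)
sw.receive("p2", SERVER_MAC)  # this flushes the entry learned on p1

# Now VLAN 1 traffic to the server must be flooded out every VLAN 1 port.
print(sw.ports_flooded(SERVER_MAC, ["p1", "p3", "p4"]))  # -> ['p1', 'p3', 'p4']
```

With two conversations active on different VLANs, each packet from the server undoes the learning the previous one did, so the switch never settles.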
While incompatibility between the server's addressing scheme and the use of VLANs was undoubtedly the source of our problems, this explanation is simplistic. The problem was highly intermittent, showing up primarily as a very poorly performing network that was dropping lots of packets and connections. Paradoxically, the switch address tables were overflowing. There appeared to be problems separating traffic coming over trunk lines, etc. As is usually the case with such problems, a partial explanation was pieced together, largely after the fact. As is often the case with production systems, it was not possible to go back and study the problem in detail. Once the nature of the problem was grasped, changes were made that corrected the problem. Specifically, ifconfig was used to assign different MAC addresses to each port on the server.
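The fix can be illustrated with a short sketch that derives one locally administered MAC address per port and prints the corresponding commands. The interface names and the Solaris-style `ifconfig ... ether` syntax are assumptions (the exact flag varies by operating system), and the script only prints the commands rather than running them:

```python
# Sketch of the fix: give each server port its own MAC address.
# The "hme" interface names and the "ifconfig <if> ether <mac>" syntax
# are illustrative assumptions; consult your OS documentation.

def per_port_macs(base_mac, count):
    """Derive one locally administered MAC per port from a base address."""
    octets = [int(o, 16) for o in base_mac.split(":")]
    octets[0] |= 0x02                      # set the locally administered bit
    macs = []
    for i in range(count):
        o = octets[:]
        o[-1] = (o[-1] + i) % 256          # vary the last octet per port
        macs.append(":".join(f"{x:02x}" for x in o))
    return macs

for ifname, mac in zip(["hme0", "hme1", "hme2", "hme3"],
                       per_port_macs("08:00:20:aa:bb:cc", 4)):
    print(f"ifconfig {ifname} ether {mac}")
```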
As previously noted, the characteristics of a system failure are the undesirable interactions of two or more parts of a system in a non-obvious or unexpected way. Notice how the previous example matches this definition of a system failure. Each piece of the network would work perfectly in isolation or in some contexts. It was the interaction of the pieces that caused the problem. The multiple connections to the server created multiple interactions, and because these interactions were not obvious, it was a very difficult problem to diagnose and correct.
What makes a system prone to system failures? In his classic book, Normal Accidents, Charles Perrow analyzes a number of different systems, from nuclear power plants to air traffic control systems. At the risk of oversimplifying his work, Perrow identified several factors that predispose systems to system failures. First, the simpler the system, the less prone it is to a system failure. With complex systems there are more things that can go wrong, a greater likelihood of interactions, and, because there is more to understand when something does go wrong, a greater likelihood of non-obvious or hidden interactions. In a complex system, information must often be collected indirectly or inferred.
Second, linear systems are less prone to system failures than systems with lots of interconnections. Systems with a high number of cross connections have many more ways for components to interact. Of the actual interconnections, it is the non-obvious ones that are most likely to create problems difficult to diagnose.
Closely related to linearity is the degree of coupling between the parts in the system. Loosely coupled, linear systems are less likely to have system failures than tightly coupled systems. With tightly coupled systems, interactions are more immediate and unforgiving.
None of these characteristics are particularly surprising when you stop to think about them. But unless you do stop and think about them, they are very easy to overlook. Roughly speaking, I would classify most networks as complex but relatively linear systems. (Actually most networks are tree structured but this implies a single, obvious linear path between pairs of devices.) Most hardware tends to be fairly tightly coupled, but the protocols using the hardware may provide loose coupling. But if you don't agree with this, that's OK. Assigning a system to the categories of linear versus nonlinear, tightly coupled versus loosely coupled, and simple versus complex is largely a judgment call. These are really relative classifications. It is more important to be able to compare different network designs from this perspective than to give an absolute classification to one design.
Unfortunately, it is rarely obvious what type of failure you have when you start diagnosing a problem. Although it can be helpful to keep the possibilities in mind when faced with a problem, it is often the case that you will not be able to classify the type of problem you are facing until after you have solved the problem.
Invariably, you'll begin by treating all failures as simple failures until you know more. Begin by looking at the obvious point of failure and then go from there. With the previous example of a bad connector on a DNS server, you would probably start with the error message from the email software that indicated a DNS problem. Next, you would try checking DNS information. This would lead to the discovery that the DNS service was unavailable, which in turn would lead to the discovery that the DNS server was unreachable. You would continue this process of zeroing in on the source of the failure. In mechanical systems, proximity is often an important clue. In networks, this is replaced by logical proximity: which other systems are used by, or make use of, the system in question? (If this all sounds a bit vague, Network Troubleshooting Tools describes specific tools that can be used for each of these steps.)
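This zeroing-in process can be sketched as code. The checks below are injected stand-ins (a real session would use a dig query, a ping, or a cable tester), ordered from the user-visible symptom down toward the physical layer; the deepest failing check is the best candidate for the real failure:

```python
# Sketch of divide-and-conquer diagnosis. The check functions are
# hypothetical stand-ins for real tests (dig, ping, cable tester).

def diagnose(checks):
    """checks: (description, test) pairs ordered from the user-visible
    symptom down toward the physical layer. Returns the deepest failing
    check, the best candidate for the root cause."""
    suspect = "no failure found"
    for description, test in checks:
        if not test():
            suspect = description  # keep digging; a deeper cause may exist
        else:
            break                  # this layer works; stop descending
    return suspect

# The bad-connector scenario: every layer fails until the cable is checked.
scenario = [
    ("email delivery",        lambda: False),  # symptom: mail errors out
    ("DNS lookups",           lambda: False),  # queries time out
    ("DNS server reachable",  lambda: False),  # pings go unanswered
    ("server cabling",        lambda: False),  # the bad connector
]
print(diagnose(scenario))  # -> server cabling
```

Note how the email and DNS symptoms in the scenario are exactly the "miraculous cures" of the earlier example: fixing the cabling clears them all.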
Being aware of the different types of failures can be helpful. When diagnosing simple failures, the usual course of action is a divide-and-conquer approach. In the example just given, it was important to be willing to shift your attention rather than dwell on symptoms several steps removed from the real failure. Time spent examining email configuration files or doing DNS queries is largely wasted in this example.
When two (or more) devices fail at roughly the same time, diagnosing the problem may be difficult since correcting one of the problems will not solve the overall problem. It is important that you be open to the possibility of multiple failures and their implications. The usual divide-and-conquer approach to troubleshooting, e.g. swapping out devices, is designed to zero in on a single problem. With multiple failures, as you move closer to one problem, you may be moving further away from the other problem. Typically, however, when you correct one of the problems, the symptoms will change. This is often the essential clue that you are dealing with multiple failures, particularly system failures.
While understanding the different types of failure may only be marginally helpful in diagnosing problems, it can be very helpful when designing networks, particularly if you think about the network characteristics that contribute to failures. But this approach will not be a panacea. In network design, your goals will naturally include avoiding failures in the first place, as well as minimizing their impact and simplifying troubleshooting. Unfortunately, these can be contradictory goals. For example, a design that contains the impact of a failure may make it more difficult to remotely collect information about that failure.
Simple Failures: The usual advice applies when trying to avoid simple failures. Use reliable equipment. Maintain and test the equipment on a regular basis. Keep your documentation up to date. Keep things as simple as possible by standardizing on a few protocols and a few vendors for equipment. Unfortunately, using only a few protocols limits what you can do with your network. And standardizing equipment means you'll probably end up paying a lot more for that equipment. No one vendor will be the most economical in every case, and special needs may dictate going with costly equipment when it isn't needed. There should be no surprises among these suggestions.
Multiple Failures and Cascade Failures: Of course, minimizing simple failures is the first step in preventing multiple failures, whether they are independent failures, cascade failures, or system failures. The next step is partitioning to minimize the interaction of network components. Partitioning can be done either at the data link level using segmentation or at the network level using subnetting. Although sometimes overlooked, firewalls need not be restricted to security concerns. They can be used as a general tool for controlling traffic between subnets. But while partitioning limits interactions, it makes data collection harder.
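As a small illustration of network-level partitioning, Python's standard ipaddress module can carve one address block into per-group subnets. The address block and the group names simply mirror the four-network campus example and are hypothetical:

```python
# Partitioning at the network level: carving one block into per-group
# subnets. The 10.0.0.0/22 block and group names are hypothetical,
# echoing the four-network campus example above.
import ipaddress

campus = ipaddress.ip_network("10.0.0.0/22")
groups = ["administration", "faculty", "student labs", "access"]

for name, subnet in zip(groups, campus.subnets(new_prefix=24)):
    print(f"{name:15s} {subnet}")
```

Routers or firewalls between these subnets then become the natural places to control, and observe, the traffic crossing partition boundaries.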
If you do partition your network, (and you should certainly consider doing so with any but the smallest of networks), it's important to build in checkpoints. At their simplest, these are computers on each partition that you can remotely log onto and that have tools for data collection. I discuss a number of tools useful for this purpose in Network Troubleshooting Tools. You can also configure network devices, like routers, to collect this kind of information. If your budget allows, then you can consider RMON probes and the like. Of course, realize that doing so may add to the complexity of your network. For example, you may need to reconfigure routers or firewalls to pass SNMP traffic, at least to selected hosts.
System Failures: Since system failures are the most difficult problems to diagnose (and, as a consequence, potentially one of the most expensive problems you'll face), you'll probably want to take particular care in structuring your network to avoid or limit this type of problem. Again, the previously listed precautions will all help. The next step is to think about the characteristics that predispose a network to system failures.
First, try to keep your network as linear as possible. Tree-structured networks are preferred over arbitrary meshes. In general, apart from an occasional redundant connection, most networks are fairly linear (or tree structured). Ironically, adding redundant paths for robustness may have the opposite effect. Additional paths have the potential to cause problems, so this is one aspect of a network that needs to be tested carefully. Try for the cleanest structure that supports your mission.
Individual devices in most networks are relatively tightly coupled although many protocols and services that use them can be loosely coupled. In a tightly coupled system, buffers and redundancies need to be explicitly designed in. This has been done with some protocols but not others. Where possible, choose accordingly.
Complexity and efficiency are often closely linked in systems. Ironically, the simplest solutions are rarely the most efficient, so some degree of complexity will be required if your network is to perform adequately. Complexity is also added, in many cases, to improve reliability. In general, simple is better if it does the job, but only if it does the job.
Finally, as previously noted, for system failures a key element is that the interactions are non-obvious or unexpected. To avoid or deal with this type of problem, you'll want to know as much as possible about what is happening on your network. Unfortunately, this is in direct conflict with a basic design principle for networks: transparency. From the user's perspective, the details of how the network works are irrelevant. Consequently, networks are designed to hide the details from the user. If a network drops a packet from a TCP session, the protocol will arrange to have the packet resent without the user being aware of what is happening. Perhaps the network will seem a little slow, but that is the only thing the user will notice. From a troubleshooter's perspective, information hiding is the last thing you want. The solution is a thorough understanding of how your network works and a good set of tools for collecting information.
Engineering seems to always come down to trade-offs: linearity versus redundancy, simplicity versus efficiency, and transparency versus information availability. This requires careful balancing when building a network. There is no magic set of guidelines that will work in every instance. Improvement in reliability or ease in diagnosing problems will always come at a cost. But the first step in diagnosing any problem is understanding what is going on and being aware of the issues. Compare the complexity of your alternatives and try to predict where problems may arise.
Joseph D. Sloan has been working with computers since the mid-1970s. He began using Unix as a graduate student in 1981, first as an applications programmer and later as a system programmer and system administrator. Since 1988 he has taught mathematics and computer science at Lander University. He also manages the networking computer laboratory at Lander, where he can usually be found testing and using the software tools described in Network Troubleshooting Tools.
O'Reilly & Associates recently released (August 2001) Network Troubleshooting Tools.
Copyright © 2009 O'Reilly Media, Inc.