
Network Design and Troubleshooting
by Joseph D. Sloan
08/21/2001
You can't escape failures. So, to the extent that it is practical, your
designs should keep this in mind. However, network design is always
faced with trade-offs: efficiency versus cost versus performance versus
maintainability, and so on. While it would seem networks should always be
designed to minimize problems, this minimization may run counter to
other constraints. There are no simple solutions to these
dilemmas. However, there are some general guidelines that can help
with many networks.
Types of Failure
Designing networks with failures in mind requires a thorough understanding
of how systems fail. So before discussing design principles, it is helpful to
review the ways systems fail. I'll begin by describing several different types
of failure. While this classification is somewhat simplistic, it will serve
our needs.
Simple failures: Perhaps the easiest type of failure to deal with is
the simple failure, in which only one component of your network isn't
working. Ideally, it will be
clear which device isn't working, but often this isn't the case. In many
instances it will appear that a number of devices are not working. For
example, if there is a bad cable connector on a network's only DNS server, the
server will be unavailable, DNS will be unavailable, and email (among other
services) will be unavailable. Actually, only one failure has occurred: the
connector. Once it is repaired, everything else will be miraculously
cured.
Independent multiple failures: Sometimes you will be faced with
multiple simultaneous or near simultaneous failures. Often you will have only
the illusion of multiple failures, as in the previous example where a single
connector failed. But sometimes more than one device will fail. With
independent multiple failures, the timing of the failures is nothing more than
coincidence. There is no causal link between the failures. Unfortunately,
independent multiple failures can be difficult to deal with for three reasons.
First, you must realize that you really do have more than one failure. Second,
it may be difficult to separate the symptoms of one failure from another.
Finally, it is human nature to try to find connections even where none exist.
This tendency can severely mislead you.
Cascade failures: A sequence of failures where one failure causes the
next is known as a cascade failure. With true cascade failures, each failure
is a separate problem that must be addressed. For example, a failing power supply
can damage an interface. To bring the device back online, you will need to
repair or replace both the power supply and the interface. (Hopefully, you'll
replace the power supply first.) It can be difficult to distinguish a true
cascade failure from a simple failure that affects a number of other services.
But while the distinction is irrelevant to your users, it can be important when
troubleshooting.
System failures: Perhaps the most pernicious type of failure is the
system failure, an overused and often misapplied term. System failures result
from the unexpected, nonobvious interactions of system components.
The interaction may be a consequence of independent failures that are
interacting, or simply from unexpected incompatibilities between devices. A
system failure requires an unfamiliar, unplanned, or unexpected interaction
that is not visible or that is not immediately comprehensible. Multiple simple
failures are not a system failure if there is no interaction. Nor are cascade
failures since the interactions can be clearly understood. Probably the best
way to explain system failures is to give an example.
Case Study: An Example of a System Failure
A number of years ago I encountered a system failure when expanding a
college network I had set up. Logically, the network was composed of four
subnetworks: a network for the university administrative branch, a network for
the faculty, a network for student laboratories, and an access network with the
campus Internet connection, dial-in services, etc. Initially, these subnets
were all connected to a multihomed host that functioned as an email server, a
DNS server, and a router. Each of the individual subnetworks was a collection
of hosts interconnected by hubs. The logical structure of the original network
is shown in Figure 1a.
While this worked well enough initially, it didn't take long to outgrow the
original network hardware, particularly with the emergence of the Web. The
next phase in the evolution of this network was to install a separate router
and to replace a number of the hubs with switches. It was decided to simply
turn off routing at the email server but to leave it connected to all four
networks. It was argued that this would provide greater efficiency since local
email would not need to cross the router, and that it would provide some
redundancy since email, for example, would still be available if the router was
down. Because many of the users for the different networks were located near
each other, switches supporting virtual LANs (VLANs) were chosen. The new
logical structure is shown in Figure 1b.
To understand how this led to a system failure, it is necessary to
understand some details about the host and the switches that were used. The
email/DNS server had a quad-Ethernet adapter that used the same Ethernet or MAC
address for each of the four network interfaces, or ports, rather than a
different MAC address for each port. While this is unusual, documentation from
the vendor notes that the IEEE leaves it up to the vendor whether to use one
address for all ports, the station address approach, or a separate address for
each port, the port address approach. (With the original configuration, since
each MAC address was on a different network, everything worked fine.)
A characteristic of switches is that they learn MAC addresses from the
traffic that passes through them and then use this information to control how
the traffic is handled. When a packet arrives at a port, its MAC source
address is entered into that port's address table. The switch also searches
each port's address table for the packet's destination address.
If it finds a match, it sends the packet out on the port whose
address table contained the MAC address, and, perhaps just as importantly, it
does not send it out on any of the other ports. (If the destination is not in
any of the address tables for the switch, the switch acts like a hub and sends
the packet out on every port.) Since devices may be moved from one port to
another, typically when a switch adds a MAC address to the address table for
one port, it will remove that address from any other address table it might be
in. With "traditional" switches, this would not create any problems with the
station address approach used by the email server. A switch on one of the
individual networks simply wouldn't know what was happening on any of the other
networks.
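The learning behavior just described can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not any vendor's implementation: the class name LearningSwitch and the method receive are made up for illustration, flooding skips the port the packet arrived on, and a packet whose destination sits on its arrival port is simply not forwarded.

```python
class LearningSwitch:
    """Toy model of a traditional (non-VLAN) learning switch."""

    def __init__(self, ports):
        self.ports = list(ports)
        # One address table per port, as described in the text.
        self.tables = {port: set() for port in self.ports}

    def receive(self, in_port, src_mac, dst_mac):
        """Process one packet and return the list of ports it is sent out on."""
        # Learn: an address lives in only one port's table, so remove it
        # from every other table before adding it to the arrival port.
        for port, table in self.tables.items():
            if port != in_port:
                table.discard(src_mac)
        self.tables[in_port].add(src_mac)

        # Forward: send only to the port whose table holds the destination.
        for port, table in self.tables.items():
            if dst_mac in table:
                # Assumption: nothing is forwarded if the destination is
                # on the same port the packet arrived on.
                return [] if port == in_port else [port]

        # Unknown destination: act like a hub and flood the other ports.
        return [p for p in self.ports if p != in_port]


switch = LearningSwitch([1, 2, 3])
print(switch.receive(1, "aa", "bb"))  # "bb" is unknown, so the packet floods
print(switch.receive(2, "bb", "aa"))  # "aa" was learned on port 1
print(switch.receive(3, "aa", "bb"))  # "aa" has moved from port 1 to port 3
```

The third call shows the "moved device" rule: once "aa" is seen on port 3, it disappears from port 1's table.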
Now consider the implication of using VLANs with this scheme. The idea
behind VLAN equipment is that one physical device can be partitioned into
several logical devices. Suppose you have three faculty members, two staff
members, and a four-computer student cluster all going back to the same wiring
closet. Using traditional switches, you would need three different switches
since there are three different types of users that should each be on a
different subnetwork. Moreover, you would need three different cable-runs back
to the network backbone in order to connect each of these switches. With a
VLAN-capable switch, the switch could be partitioned into three logical
switches. Each of the three logical switches would deal with traffic only on
its specific subnetwork. So, instead of needing three physical switches, you
would need to buy only one. And, if you are using VLAN technology at the network
backbone, you will need only one cable-run (or trunk line) between the backbone
and the wiring closet. The switches at either end should be able to sort the
traffic and send it to the appropriate logical switch. Clearly, VLANs can
provide significant savings in many cases.
This raises an interesting question: How should the address tables be
managed on a switch supporting VLANs? Specifically, when adding a device to
one address table, should it be removed from the address tables for all ports
on the physical switch or just from ports on the same logical switch? Since
devices may be moved from one network to another, you might assume that the
address should be removed from all the tables on a physical device. And this
is just what many (if not all) VLAN switches do. But this decision causes
problems when used with the email server's station address approach.
The problem is that the email server needed to reside on all four networks
simultaneously. Whenever a packet from the server arrived on one port on the
switch, the connection to the server on any other port was dropped. The
network would still send packets to the appropriate logical switch, but the
logical switch would then need to send the packets out on every port since the
address was no longer in any of the address tables. If only one person was
talking to the server, the switch would stabilize quickly. But with
simultaneous conversations, there are problems.
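The effect of the two purge policies can be illustrated with a small simulation. It is a sketch under simplifying assumptions, not a reconstruction of the actual switches: the names VlanSwitch, purge_scope, and simulate are hypothetical, there are only two VLANs, and the model counts floods only. It makes no attempt to reproduce the table overflows or trunk problems that were observed.

```python
class VlanSwitch:
    """Toy VLAN switch; purge_scope selects how far a learned MAC is purged."""

    def __init__(self, port_vlans, purge_scope):
        self.port_vlans = dict(port_vlans)   # port -> VLAN id
        self.purge_scope = purge_scope       # "physical" or "vlan"
        self.tables = {port: set() for port in self.port_vlans}
        self.floods = 0

    def receive(self, in_port, src_mac, dst_mac):
        vlan = self.port_vlans[in_port]
        # Learn the source, purging it from other tables within the scope.
        for port, table in self.tables.items():
            in_scope = (self.purge_scope == "physical"
                        or self.port_vlans[port] == vlan)
            if port != in_port and in_scope:
                table.discard(src_mac)
        self.tables[in_port].add(src_mac)

        if dst_mac in self.tables[in_port]:
            return []  # destination is on the arrival port; nothing to forward
        # Forwarding never leaves the logical switch (the VLAN).
        members = [p for p, v in self.port_vlans.items()
                   if v == vlan and p != in_port]
        for port in members:
            if dst_mac in self.tables[port]:
                return [port]
        self.floods += 1
        return members


def simulate(purge_scope, rounds=3):
    # Ports 1-3 are VLAN 10, ports 4-6 are VLAN 20. The server uses the
    # same MAC ("srv") on port 1 and port 4; clients sit on ports 2 and 5.
    switch = VlanSwitch({1: 10, 2: 10, 3: 10, 4: 20, 5: 20, 6: 20}, purge_scope)
    for _ in range(rounds):
        switch.receive(2, "client-a", "srv")   # request on VLAN 10
        switch.receive(1, "srv", "client-a")   # reply on VLAN 10
        switch.receive(5, "client-b", "srv")   # request on VLAN 20
        switch.receive(4, "srv", "client-b")   # reply on VLAN 20
    return switch.floods


print(simulate("physical"))  # 6: every request to the server is flooded
print(simulate("vlan"))      # 2: only the first request on each VLAN floods
```

With the purge applied across the whole physical switch, each reply from the server on one VLAN erases the entry that the other VLAN depends on, so interleaved conversations never settle. Restricting the purge to the logical switch lets both entries coexist.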
While incompatibility between the server's addressing scheme and the use of
VLANs was undoubtedly the source of our problems, this explanation is
simplistic. The problem was highly intermittent, showing up primarily as a
very poorly performing network that was dropping lots of packets and
connections. Paradoxically, the switch address tables were overflowing. There
appeared to be problems separating traffic coming over trunk lines, etc. As is
usually the case with such problems, a partial explanation was pieced together,
largely after the fact. As is often the case with production systems, it was
not possible to go back and study the problem in detail. Once the nature of
the problem was grasped, changes were made that corrected the problem.
Specifically, ifconfig was used to assign different MAC addresses to
each port on the server.
System Characteristics and System Failures
As previously noted, the characteristics of a system failure are the
undesirable interactions of two or more parts of a system in a non-obvious or
unexpected way. Notice how the previous example matches this definition of a
system failure. Each piece of the network would work perfectly in isolation or
in some contexts. It was the interaction of the pieces that caused the
problem. The multiple connections to the server created multiple interactions,
and because these interactions were not obvious, it was a very difficult
problem to diagnose and correct.
What makes a system prone to system failures? In his classic book, Normal
Accidents, Charles Perrow analyzes a number of different systems from
nuclear power plants to air traffic control systems. At the risk of
oversimplifying Perrow's work, there are several factors he identified that
predispose systems to system failures. First, the simpler the system, the less
prone it is to a system failure. With complex systems there are more things
that can go wrong, a greater likelihood of interactions, and, because there is
more to understand when something does go wrong, there is a greater likelihood
of non-obvious or hidden interactions. Information must be collected
indirectly or inferred in a complex system.
Second, linear systems are less prone to system failures than systems with
lots of interconnections. Systems with a high number of cross connections have
many more ways for components to interact. Of the actual interconnections, it
is the non-obvious ones that are most likely to create problems difficult to
diagnose.
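A rough way to see how quickly interconnections multiply is to count direct links. The sketch below compares a tree, where n devices need n - 1 links, with a full mesh, where every pair is connected. The function names are made up for illustration, and link counts are only a crude stand-in for the number of possible interactions.

```python
def tree_links(n):
    """Links needed to join n devices in a tree (no redundant paths)."""
    return max(n - 1, 0)


def mesh_links(n):
    """Links in a full mesh, where every pair of devices is connected."""
    return n * (n - 1) // 2


for n in (4, 16, 64):
    print(n, tree_links(n), mesh_links(n))
# 4 3 6
# 16 15 120
# 64 63 2016
```

Real networks fall somewhere between these extremes, but the comparison shows why each added cross connection deserves scrutiny: the mesh grows roughly with the square of the number of devices.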
Closely related to linearity is the degree of coupling between the parts in
the system. Loosely coupled, linear systems are less likely to have system
failures than tightly coupled systems. With tightly coupled systems,
interactions are more immediate and unforgiving.
None of these characteristics are particularly surprising when you stop to
think about them. But unless you do stop and think about them, they are very
easy to overlook. Roughly speaking, I would classify most networks as complex
but relatively linear systems. (Actually most networks are tree structured but
this implies a single, obvious linear path between pairs of devices.) Most
hardware tends to be fairly tightly coupled, but the protocols using the
hardware may provide loose coupling. But if you don't agree with this, that's
OK. Assigning a system to the categories of linear versus nonlinear, tightly
coupled versus loosely coupled, and simple versus complex is largely a judgment
call. These are really relative classifications. It is more important to be
able to compare different network designs from this perspective than to give an
absolute classification to one design.
Diagnosing Failures
Unfortunately, it is rarely obvious what type of failure you have when you
start diagnosing a problem. Although it can be helpful to keep the
possibilities in mind when faced with a problem, it is often the case that you
will not be able to classify the type of problem you are facing until after you
have solved the problem.
Invariably, you'll begin by treating all failures as simple failures until
you know more. Begin by looking at the obvious point of failure and then go
from there. With the previous example of a bad connector on a DNS server, you
would probably start with the error message from the email software that
indicated a DNS problem. Next, you would try checking DNS information. This
would lead to the discovery that the DNS service was unavailable which in turn
would lead to the discovery that the DNS server was unreachable. You would
continue the process of zeroing in on the source of the failure. In mechanical
systems, proximity is often an important clue. In networks, this is replaced
by logical proximity: which other systems are being used by or are using the
system in question? (If this all sounds a bit vague, Network Troubleshooting
Tools describes specific tools that can be used for each of these
steps.)
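The idea of following logical proximity can be sketched as a walk down a dependency map. This is a minimal illustration, assuming you already know which components depend on which. The function find_root_causes, the depends_on map, and the is_working check are all hypothetical; in practice each check is a manual test with whatever tool fits.

```python
def find_root_causes(start, depends_on, is_working):
    """Follow failing dependencies from `start` down to the deepest failures."""
    roots = []
    seen = set()

    def visit(component):
        if component in seen:
            return
        seen.add(component)
        failing = [d for d in depends_on.get(component, [])
                   if not is_working(d)]
        if not failing:
            # Broken, yet everything it depends on is fine: a likely culprit.
            roots.append(component)
        for dep in failing:
            visit(dep)

    if not is_working(start):
        visit(start)
    return roots


# The bad-connector example from the text, as a dependency map.
depends_on = {
    "email": ["dns service"],
    "dns service": ["dns server"],
    "dns server": ["cable connector"],
}
broken = {"email", "dns service", "dns server", "cable connector"}

roots = find_root_causes("email", depends_on, lambda c: c not in broken)
print(roots)  # ['cable connector']
```

Everything above the connector appears broken, but the walk reports only the one component whose own dependencies are all healthy.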
Being aware of the different types of failures can be helpful. When
diagnosing simple failures, the usual course of action is a divide-and-conquer
approach. In the example just given, it was important to be willing to shift
your attention rather than dwell on symptoms several steps removed from the
real failure. Time spent examining email configuration files or doing DNS
queries is largely wasted in this example.
When two (or more) devices fail at roughly the same time, diagnosing the
problem may be difficult since correcting one of the problems will not solve
the overall problem. It is important that you be open to the possibility of
multiple failures and their implications. The usual divide-and-conquer
approach to troubleshooting, e.g., swapping out devices, is designed to zero in
on a single problem. With multiple failures, as you move closer to one
problem, you may be moving further away from the other problem. Typically,
however, when you correct one of the problems, the symptoms will change. This
is often the essential clue that you are dealing with multiple failures,
particularly system failures.
Network Design
While understanding the different types of failure may only be marginally
helpful in diagnosing problems, it can be very helpful when designing networks,
particularly if you think about the network characteristics that contribute to
failures. But this approach will not be a panacea. In network design, your
goals will naturally include avoiding failures in the first place, as well as
minimizing their impact and simplifying troubleshooting. Unfortunately, these
can be contradictory goals. For example, a design that contains the impact of
a failure may make it more difficult to remotely collect information about that
failure.
Simple Failures: The usual advice applies when trying to avoid simple
failures. Use reliable equipment. Maintain and test the equipment on a regular
basis. Keep your documentation up to date. Keep things as simple as possible
by standardizing on a few protocols and a few vendors for equipment.
Unfortunately, using only a few protocols limits what you can do with your
network. And standardizing equipment means you'll probably end up paying a lot
more for that equipment. No one vendor will be the most economical in every
case, and special needs in one part of the network may dictate buying costly
equipment for places where it isn't needed. There should be no surprises among
these suggestions.
Multiple Failures and Cascade Failures: Of course, minimizing simple
failures is the first step in preventing multiple failures, whether they are
independent failures, cascade failures, or system failures. The next step is
partitioning to minimize the interaction of network components. Partitioning
can be done either at the data link level using segmentation or at the network
level using subnetting. Although sometimes overlooked, firewalls need not be
restricted to security concerns. They can be used as a general tool for
controlling traffic between subnets. But while partitioning limits
interactions, it makes data collection harder.
If you do partition your network (and you should certainly consider doing
so with any but the smallest of networks), it's important to build in
checkpoints. At their simplest, these are computers on each partition that you
can remotely log onto and that have tools for data collection. I discuss a
number of tools useful for this purpose in Network Troubleshooting
Tools. You can also configure network devices, like routers, to collect
this kind of information. If your budget allows, then you can consider RMON
probes and the like. Of course, realize that doing so may add to the
complexity of your network. For example, you may need to reconfigure routers
or firewalls to pass SNMP traffic, at least to selected hosts.
System Failures: Since system failures are the most difficult problems to
diagnose (and, as a consequence, potentially one of the most expensive problems
you'll face), you'll probably want to take particular care in structuring your
network to avoid or limit this type of problem. Again, the previously listed
precautions will all help. The next step is to think about the characteristics
that predispose a network to system failures.
First, try to keep your network as linear as possible. Tree structured
networks are preferred over arbitrary meshes. In general, apart from an
occasional redundant connection, most networks are fairly linear (or tree
structured). Ironically, adding redundant paths for robustness may have the
opposite effect. Because of potential problems, this is one aspect of a
network that needs to be tested carefully. When you add additional paths, you
need to be aware of their potential to cause problems. Try for the cleanest
structure that supports your mission.
Individual devices in most networks are relatively tightly coupled although
many protocols and services that use them can be loosely coupled. In a tightly
coupled system, buffers and redundancies need to be explicitly designed in.
This has been done with some protocols but not others. Where possible, choose
accordingly.
Complexity and efficiency are often closely linked in systems. Ironically,
the simplest solutions are rarely the most efficient. Consequently, some
degree of complexity will be required if your network is to function.
Complexity is also added, in many cases, to improve reliability. In general,
simple is better if it does the job, but only if it does the job.
Finally, as previously noted, for system failures a key element is that the
interactions are non-obvious or unexpected. To avoid or deal with this type of
problem, you'll want to know as much as possible about what is happening on
your network. Unfortunately, this is in direct conflict with a basic design
principle for networks: transparency. From the user's perspective, the details
of how the network works are irrelevant. Consequently, networks are designed
to hide the details from the user. If a network drops a packet from a TCP
session, the protocol will arrange to have the packet resent without the user
being aware of what is happening. Perhaps the network will seem a little slow,
but that is the only thing the user will notice. From a troubleshooter's
perspective, information hiding is the last thing you want. The solution is a
thorough understanding of how your network works and a good set of tools for
collecting information.
Conclusions
Engineering seems to always come down to trade-offs: linearity versus
redundancy, simplicity versus efficiency, and transparency versus
information availability. This requires careful balancing when
building a network. There is no magic set of guidelines that will
work in every instance. Improvement in reliability or ease in
diagnosing problems will always come at a cost. But the first step in
diagnosing any problem is understanding what is going on and being
aware of the issues. Compare the complexity of your alternatives and
try to predict where problems may arise.
Joseph D. Sloan has been working with computers since the
mid-1970s. He began using Unix as a graduate student in 1981, first as
an applications programmer and later as a system programmer and system
administrator. Since 1988 he has taught mathematics and computer
science at Lander University. He also manages the networking computer
laboratory at Lander, where he can usually be found testing and using
the software tools described in Network Troubleshooting Tools.
O'Reilly & Associates recently released (August 2001) Network Troubleshooting Tools.
