
It's Always the Load Balancer
by Tony Bourke
03/05/2001
I've worked with a lot of server load balancing (SLB) systems, and in doing
so I've been asked to diagnose and resolve a variety of issues. I've done
this for dozens of clients, ranging from small shops to high-profile Web
sites. And let me tell you, if there is one common theme in these
situations, it's that "It's always the load balancer." From complete
site outages to packet loss--even male-pattern baldness--load balancers get
blamed for almost everything. Anyone who has ever administered a load
balancer will probably back me up on this point.
Load balancers are an integral part of today's Web infrastructure. They're
also complex and underdocumented pieces of hardware. In this article I
will explain the reasons why load balancers get the blame and what you can
do about it.
The Beast
Today's Web sites are complex beasts. Every component must work together to
create a site that is greater than the sum of its parts. Figure 1
represents a fairly typical Web site installation and in it you can see how
complex the average Web site has become.

Figure 1: Traffic flow for a load balancer
The Internet is connected to the routers, which pass traffic through a
firewall to the load balancers, which distribute the traffic to the Web
servers, which pass information to the application server bone, and the
application server bone is connected to the database server bone. You get the
picture. If one component or piece of the process fails, it can take down
the entire site.
So, if the load balancer is only a small part of a bigger whole, why does it
get blamed so disproportionately? What is it about load balancers that
attracts critics, finger pointers, and naysayers alike? Let's take a look at
some of the reasons.
The Blame Game
One reason why people blame server load balancers is that they often do
experience more problems with them than with other network devices. There
are several reasons for this, but the primary one is that of all the network
devices a site might employ, load balancers are typically the newest on the
scene. Moreover, manufacturers are competing with each other at a feverish pace.
They quickly release new feature-rich versions that aren't thoroughly
tested.
Another reason for the blame is that load balancers are not very well
understood. Documentation quality varies greatly from vendor to vendor,
and there are few third-party resources (O'Reilly and I hope to change
that). A load balancer may not be malfunctioning, but if the people
configuring the unit don't understand all of the features or
troubleshooting techniques, they may unduly lay blame on the load balancer.
The old maxim that people fear (and blame) what they don't understand
certainly applies here.
Load balancers are also in the direct path of all traffic to a
particular Web site. By looking at Figure 2 below, you can see that if the
load balancer stops working, the entire site stops working. This critical
position in the infrastructure can make it appear as though the load
balancer is the problem, even in cases where it is not (such as a firewall
issue, a back-end database problem, someone tripping over a cable, etc.).
Unlike a broken or malfunctioning Web server, a misconfigured or
malfunctioning load balancer will result in a dead-to-the-world site. This is
why a firewall is often a suspect, too, but to a lesser degree since it is
generally a simpler device than a load balancer.

Figure 2: Load Balancer implementation
Considering these points, it's easy to understand why load balancers too
often take the rap for a Web site's misfortunes. The problem, however, is
that blame can lead to misdiagnosis of the real culprit and delay a remedy.
So, what's to be done? The rest of this article focuses on three simple and
effective steps you can take to identify the real culprit, and, as we say
in the trade, CYA (cover your ass).
MRTG
I've written articles and have spoken at great length extolling the virtues
of MRTG (Multi Router Traffic Grapher), a free software
package that allows you to graph bandwidth utilization as well as several
other metrics. MRTG is invaluable because it provides historical and
graphical trending for network devices, whether they be a server, a router,
or any other device with an Ethernet interface. You can take a look at
what all the network devices you're responsible for were doing during a
problem. Figure 3 shows an example of an MRTG graph that covers 36 hours'
worth of traffic in 5-minute intervals.

Figure 3: MRTG example graph
MRTG not only records bandwidth in and out of an interface on a
networked device, it can also graph other SNMP-based (Simple Network
Management Protocol) metrics on a load balancer. For instance, one of the
metrics you can measure with MRTG is connections per second, which produces
a graph such as the one shown in Figure 4. The graph shows the number of
connections per second over a 36-hour period, peaking
at around 5,200 connections per second at a little after 2:00 P.M.

Figure 4: MRTG example for connections per second
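A per-second figure like the one in Figure 4 is derived from two successive readings of a running counter. The Python sketch below shows that arithmetic under stated assumptions: a 32-bit counter that wraps at 2^32 and a 300-second (5-minute) polling interval. The function name and the sample readings are made up for illustration and are not taken from MRTG or any particular load balancer.

```python
def rate_per_second(previous, current, interval_seconds, counter_bits=32):
    """Return the average per-second rate between two counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction still gives the right delta if the counter
    # wrapped around once between the two readings.
    delta = (current - previous) % modulus
    return delta / interval_seconds


# Hypothetical readings of a total-connections counter, 300 seconds apart.
print(rate_per_second(1_000_000, 2_560_000, 300))  # 5200.0

# The same calculation across a counter wrap.
print(rate_per_second(2**32 - 1000, 500, 300))     # 5.0
```

Because the result is an average over the whole polling interval, short spikes inside a 5-minute window are flattened out on the graph.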
Depending on the load balancer, you can also graph other metrics, such as
bandwidth in and out of a VIP (virtual IP, also known as virtual server) or
real server, connections per second per port, total number of active TCP
(Transmission Control Protocol) sessions, and dozens of others. For a closer
look at load balancers and MRTG, check out the MRTG site I maintain.
Syslog
Virtually every load-balancing product has some way to write to a syslog
server, or, in some cases, to store syslog locally. Logs are an invaluable
tool in helping to diagnose problems when they arise. You can use them to
show who made configuration changes and when, any fail-overs to
redundant units, DoS (denial of service) attack warnings, and other
operational issues that might point to a problem (or lack thereof). Check
the syslog documentation for your load balancer to learn more.
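When a problem is reported, the first question for the logs is usually "what did the load balancer record around that time?" The Python sketch below pulls out the lines stamped within a few minutes of an incident. It assumes the classic BSD syslog layout, where each line begins with a "Mon dd HH:MM:SS" timestamp that carries no year, and an English locale for the month names. The function name, host name, and sample messages are invented for illustration; real messages vary by vendor.

```python
from datetime import datetime, timedelta


def events_near(lines, incident, window_minutes=10, year=2001):
    """Return the syslog lines stamped within window_minutes of incident."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in lines:
        stamp = line[:15]  # "Mon dd HH:MM:SS" is 15 characters wide
        try:
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(when - incident) <= window:
            matches.append(line)
    return matches


# Hypothetical log lines; not the output of any real product.
sample = [
    "Mar  5 13:10:42 lb1 admin: configuration saved",
    "Mar  5 14:01:07 lb1 failover: standby unit is now active",
    "Mar  5 14:03:30 lb1 health: real server 10.0.0.12 marked down",
    "Mar  5 18:45:00 lb1 admin: user logged out",
]

for line in events_near(sample, datetime(2001, 3, 5, 14, 0)):
    print(line)
```

With the sample data, only the two entries from shortly after 14:00 are printed. If nothing shows up in the window at all, that is useful evidence too.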
Sniffing
It's critical that you have a way to quickly and easily analyze traffic
as it passes through the network for troubleshooting purposes. Common
problems that arise include NAT (Network Address Translation) problems,
packet filtering, routing problems, DoS attacks (and their sources), and
much more.
To capture traffic, just about any Unix machine will do. Different Unix
implementations have various sniffing programs available, such as snoop for
Solaris (it comes installed with Solaris) or tcpdump, an open source traffic
analyzer released under the BSD license that runs on most Unix flavors.
There are also several
commercial sniffing programs available for Windows 2000/NT, as well as
several black box packet analyzers.
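With tcpdump, a capture is usually narrowed to the traffic of interest with a filter expression. The Python sketch below only assembles such a command line and does not run it. The interface name `eth0`, the address `192.0.2.10` (a documentation address standing in for a VIP), and the helper name are assumptions for illustration. The options used (`-n`, `-i`, `-c`) and the `host ... and port ...` filter are standard tcpdump usage.

```python
import shlex


def tcpdump_command(interface, vip, port=80, packet_count=200):
    """Build a tcpdump argument list that captures traffic for one VIP."""
    return [
        "tcpdump",
        "-n",                     # print addresses instead of resolving names
        "-i", interface,          # interface to listen on
        "-c", str(packet_count),  # stop after this many packets
        "host", vip, "and", "port", str(port),  # capture filter expression
    ]


cmd = tcpdump_command("eth0", "192.0.2.10")
print(shlex.join(cmd))
# tcpdump -n -i eth0 -c 200 host 192.0.2.10 and port 80
```

Capturing normally requires root privileges, and the `-c` limit keeps a busy link from flooding the terminal.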
To get traffic to the sniffing machine, most switch vendors offer a
feature called port mirroring. Port mirroring is when one
port duplicates the traffic of another port for the explicit purpose of
monitoring traffic. Figure 5 shows a typical implementation of traffic
sniffing.

Figure 5: Traffic sniffing scenario
Any packet that goes out of a selected port is sent to the mirrored port
as well. This allows for nonintrusive and nondisruptive traffic
monitoring. Cisco calls this the SPAN (Switched Port Analyzer) port, while most other vendors
simply call it the mirrored port.
Given that load balancing is a relative newcomer to the site infrastructure
scene, and because it is generally misunderstood, it's easy to see why the
deck is stacked against SLBs. But laying blame prematurely on a load
balancer will not only cause headaches to the vendor and to the people
responsible for its function, but it can also divert attention from the
real problem. I've seen several cases where mob mentality called
for the head of the load balancer vendor, only to have the same
problem appear with an entirely new vendor. Assumptions and premature
diagnosis can only lead to further problems, frustrated customers, and upset
vendors. So make sure your assumptions are based on hard evidence, and not
just the fear of the unknown and misunderstood.
