I recently attended an interesting and thought-provoking short course on IP router architecture led by Gísli Hjálmtýsson. Gísli is engaged in research in the field of active networks and has developed a Linux-based prototype active router called "Pronto." In describing this and other aspects of his work, Gísli offered some insight into the issues impacting router performance, especially in a Linux environment. In this article I thought I'd take a couple of Gísli's key observations and translate them into some practical guidelines to assist in the construction of Linux-based routers with a focus on performance rather than functionality.
The basic process of routing the IP protocol is deliberately simple. For high performance routing, you want the datagram to be passed as quickly as possible from one interface to another. The process that does this forwarding for the vast majority of datagrams is sometimes called the "fast path."
The hardware receives the data from the transmission medium, stores it in a buffer, and signals the device driver that it is available to be read. The signalling is invariably performed using hardware interrupts. In the case of Ethernet hardware, there is often a single interrupt generated for each received packet; in the case of a PPP link with serial hardware, there can be as many as one interrupt per received character. This is important, as we'll see later.
The hardware device driver is responsible for reading the data from the hardware. Often the hardware or its device driver will do some checks to ensure that the data is not corrupt. Ethernet cards, for example, implement a checksum in hardware, discarding any packets that have been corrupted in transmission. The SLIP device driver, on the other hand, has no means of knowing, as the SLIP protocol does not provide any error detection capability, relying on that provided by the IP protocol.
The device driver will then call the netif_rx() function in the core Linux networking code, which will check the protocol identifier of the received data and forward it to the appropriate kernel protocol stack. This article focuses only on version 4 of the IP protocol, but much the same process applies to other supported protocols such as IPv6, IPX, AX.25, and DECnet.
The IP protocol stack will first do some rudimentary checks to ensure that the datagram is ok.
The most basic tests that are performed are sanity checks, including ensuring that the length of the IP datagram is at least as long as an IP header and satisfies the IP header length field, and that the version number in the IP header is version 4. Finally, the IP header is checked for corruption by testing the IP header checksum against the IP header data. It's worth noting here that the IP header checksum protects only the header of the IP datagram, not the payload, which allows this check to be performed very quickly. If any of these tests fail, the datagram is simply thrown away.
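The checks just described are easy to express in code. Here is a minimal sketch (mine, not the kernel's actual code) of the IPv4 sanity checks and the RFC 1071 header checksum; a header whose stored checksum is intact sums, with end-around carry, to 0xFFFF, so the complement of that sum is zero for a valid header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071) over len bytes in wire order.
 * Returns the one's complement of the folded 16-bit sum;
 * a header with a correct stored checksum yields 0. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)hdr[i] << 8 | hdr[i + 1];
    if (len & 1)                       /* odd trailing byte */
        sum += (uint32_t)hdr[len - 1] << 8;
    while (sum >> 16)                  /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* The sanity checks from the text: minimum length, version 4,
 * a consistent header-length field, and an intact checksum. */
static int ip_header_ok(const uint8_t *hdr, size_t len)
{
    if (len < 20)                return 0;  /* shorter than a minimal header */
    if ((hdr[0] >> 4) != 4)      return 0;  /* not IPv4 */
    size_t ihl = (size_t)(hdr[0] & 0x0F) * 4;  /* header length in bytes */
    if (ihl < 20 || ihl > len)   return 0;
    return ip_checksum(hdr, ihl) == 0;
}
```

Note that the loop touches only the 20-to-60-byte header, never the payload, which is why this test is cheap even for large datagrams.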
If the IP header contains any options fields, these are processed next. Some options that might be used are the "router alert" option and the "source route" option.
The destination address of the IP header tells the router where the datagram is to go. The first test that must be performed is one that determines whether the datagram is destined for us (i.e., this host), or whether it is destined for some other host and needs to be forwarded. It is simple enough to determine if the datagram is for us; the destination address will match an address of one of our active network interfaces.
If the datagram is for us, it is processed, and the data is ultimately passed to a local socket for an application to use. If the datagram is not for us, it is passed to the IP forwarding engine.
If the datagram is to be forwarded to another host, our router must do two things. Firstly it must decrement the IP time-to-live field in the datagram and discard the datagram if the result is zero. This mechanism helps limit the damage caused by routing loops. Naturally, we need to recalculate the IP header checksum for any datagram we keep, because the header has been modified.
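The recalculation need not be done from scratch: because only the TTL byte changed, the checksum can be patched incrementally (RFC 1141), which is essentially the trick the kernel's ip_decrease_ttl() uses. A sketch operating on raw header bytes (the function name here is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Decrement the TTL of an IPv4 header in wire order and patch the
 * header checksum incrementally rather than recomputing it.
 * The TTL is the high byte of the 16-bit word at offset 8, so
 * decrementing it lowers the header sum by 0x0100, which raises the
 * stored checksum (the complement) by 0x0100, carry wrapped around.
 * Returns the new TTL; 0 means the datagram must be discarded. */
static uint8_t decrement_ttl(uint8_t *hdr)
{
    uint32_t sum = (uint32_t)(hdr[10] << 8 | hdr[11]) + 0x0100;
    uint16_t check = (uint16_t)(sum + (sum >= 0xFFFF)); /* end-around carry */
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)(check & 0xFF);
    return --hdr[8];
}
```

This touches three bytes of the header instead of all of them, which matters when you do it once per forwarded datagram.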
Secondly, and much less trivially, our router must determine where to transmit the datagram next. It does this using the IP routing table. The IP routing table is usually built automatically using a routing daemon like Zebra, supporting routing protocols like OSPF, BGP, or RIP. Sometimes the routing table will have routes that have been entered manually, called static routes. You can display the routing table using the route command or the ip route list command.
Routes are found by searching the routing table, looking for the best match. A match occurs when the destination field of the route matches the destination address of the IP datagram to be routed in the number of bits described by the network mask of the route. The network mask is the genmask field of the route command output, or the /nn suffix in the ip command output. A match can be anything from no bits (the default route) to the full 32 bits (a host route). The best match is the one that has the greatest number of matching bits. This search requires a certain amount of CPU power to perform and many, many bit test operations. Various tricks, such as caching and hash tables, are used to reduce the time and effort taken to perform the search. Suffice it to say for this column that the task of identifying the best matching route is computationally intensive, with lots of bit tests and comparisons, especially as the number of routes increases.
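The matching rule itself is simple to state in code. This is a deliberately naive linear search (the kernel uses caches and hash tables instead, and the types here are my own), but the longest-prefix-match logic is exactly the one described above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One routing table entry: destination network, prefix length
 * (the /nn or genmask), and the gateway to forward through. */
struct route {
    uint32_t dest;     /* network address, host byte order */
    int      prefix;   /* number of leading mask bits, 0..32 */
    uint32_t gateway;
};

/* Longest-prefix match: among all entries whose masked destination
 * matches the address, return the one with the most mask bits. */
static const struct route *lookup(const struct route *tbl, size_t n,
                                  uint32_t addr)
{
    const struct route *best = NULL;
    for (size_t i = 0; i < n; i++) {
        /* prefix 0 is the default route; shifting by 32 is undefined */
        uint32_t mask = tbl[i].prefix
                      ? 0xFFFFFFFFu << (32 - tbl[i].prefix) : 0;
        if ((addr & mask) == (tbl[i].dest & mask) &&
            (!best || tbl[i].prefix > best->prefix))
            best = &tbl[i];
    }
    return best;
}
```

Even in this toy form you can see where the cycles go: one mask-and-compare per candidate route, repeated for every datagram forwarded.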
Finally, the IP datagram can be sent to the device driver of the hardware that will carry it to its next hop. The data will be placed in a buffer where the hardware may read it for transmission, but not before one last operation occurs.
Each network interface on an IP router has a value associated with it called the MTU, the maximum transmission unit, which represents the largest sized chunk of data that the interface can transmit in a single transmission. For Ethernet interfaces this value is 1500 bytes, but some network technologies support larger or smaller MTUs. If the datagram we're forwarding to an interface is larger in size than the MTU of that interface, we are obligated to cut it up into pieces that are at most MTU-sized. This process is called fragmentation. If, for example, our interface is a high-speed serial interface supporting PPP with an MTU of 576 bytes and we have a 1500-byte datagram to send to it, we break it up into two full-sized fragments and place the remaining data in a third, smaller datagram. (Each fragment carries its own IP header, so a full fragment actually carries slightly less than 576 bytes of the original payload.) We then send all three of these datagrams to the interface for transmission.
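The arithmetic behind "three fragments" is worth making concrete. A sketch (my own helper, ignoring IP options): each fragment carries its own 20-byte header, and every fragment except the last must carry a payload that is a multiple of 8 bytes, so a 576-byte MTU actually moves 552 bytes of payload per full fragment:

```c
#include <assert.h>

/* How many fragments does a datagram need on a given MTU?
 * Assumes a 20-byte header (no options) on the datagram and on
 * every fragment, and 8-byte alignment of non-final payloads. */
static int fragments_needed(int datagram_len, int mtu)
{
    const int hdr = 20;
    int payload  = datagram_len - hdr;       /* data to be carried */
    int per_frag = ((mtu - hdr) / 8) * 8;    /* payload per full fragment */
    return (payload + per_frag - 1) / per_frag;  /* ceiling division */
}
```

For the example in the text, a 1500-byte datagram onto a 576-byte MTU gives 1480 bytes of payload at 552 bytes per fragment, hence three fragments.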
This process occurs for each and every datagram forwarded by an IP router. The time taken for this process to complete is critical to the overall performance of the router. This process is called the fast path because there are slower processes possible. In practice, in modern routers there are a number of other tests that may be performed within the fast path. Features like firewall and network address translation each have tests associated with them.
The fast path looks fairly straightforward, but already there are hints that there are speed bumps, at least in some lanes. I've already mentioned that performance is about time: time-in to time-out, how long a datagram takes to traverse a router. There are a number of factors that can influence the performance of an IP router, some of them obvious, some not.
A number of tests are performed in the fast path: tests for whether the datagram is for the local host or to be routed, tests for IP options, tests for firewall, tests for network address translation. Each test incurs a performance cost, but any actions that might have to be taken as a result incur an additional cost. If you want high performance, functionality costs. The less you mess with the datagram on the way through, the faster it will travel.
A number of factors relate to the hardware. The four most important of these relate to the use of interrupts, the bus bandwidth, the speed of the CPU, and the bandwidth of the memory.
The time taken to process interrupts can be quite significant for high performance routing. Any amount of time taken between when the hardware generates an interrupt and when the relevant data is read is a direct contributor to latency within the router. Additionally, especially in the case of serial devices, the rate that interrupts are serviced plays a part in determining the upper limit of the speed of the network connection. If it takes 1 ms to service an interrupt and your serial hardware is generating one interrupt per received character, you'll not be able to handle more than 1000 characters per second, roughly 10 kbps. In the same fashion, even ignoring all other factors, if that same 1 ms interrupt latency were applied a datagram at a time, you'd be limited to 1000 datagrams per second for that interface. To a large extent, interrupt latency is in the hands of the kernel hackers, but the type of hardware plays some part too.
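To see where the "roughly 10 kbps" figure comes from: an asynchronous serial character costs about 10 bits on the wire (a start bit, 8 data bits, and a stop bit), so the interrupt service time alone caps the link rate. A sketch of that back-of-the-envelope calculation (the function name is mine, not any kernel API):

```c
#include <assert.h>

/* Ceiling on an async serial link imposed purely by interrupt cost:
 * one interrupt per character, each costing service_us microseconds,
 * at roughly 10 wire bits per 8-bit character. */
static long serial_bps_ceiling(long service_us)
{
    long chars_per_sec = 1000000L / service_us;
    return chars_per_sec * 10;
}
```

A 1 ms service time yields 1000 characters per second, or about 10,000 bps, matching the figure above.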
The bandwidth of the bus in the router host is very important. Just about everything that happens in the hardware happens across the bus, whether it be data being read from an Ethernet card, written to a FDDI card, written to a serial device, or read from an IrDA device. The machine bus is the common communications channel that nearly all hardware uses to communicate. While the PCI bus now dominates the industry, there are still a lot of alternatives out there, ISA, EISA, and MCA being three. Non-Intel architectures had their own bus standards. A bus is controlled by a clock, and the overall aggregate bus bandwidth is a product of the bus clock and the data width, the number of bits that may be read or written in a single cycle. When you're attempting to route between a number of separate devices, it's possible for the bus to become a bottleneck in performance.
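As a sanity check on the clock-times-width rule, a sketch (my own helper, not any real API): classic 32-bit, 33 MHz PCI works out to 132 MB/s of aggregate bandwidth, and remember that data crossing from one card to another crosses the bus twice.

```c
#include <assert.h>

/* Aggregate bus bandwidth in bytes per second, as defined in the
 * text: bus clock multiplied by data width (bits per cycle). */
static long bus_bandwidth_bytes(long clock_hz, int width_bits)
{
    return clock_hz * (width_bits / 8);
}
```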
Something in the router has to do the work of shuffling all the bits around. The CPU doesn't really do all that much in the fast path: a bit of bit twiddling, some reads and writes, and a couple of calculations of the IP header checksum. The moment you start deviating from it, though, the CPU begins to work harder. For example, if you have lots of firewall or NAT rules, or you do lots of IP option processing, the CPU will be used more for each datagram and will play a larger role in the router performance. In the bigger picture, CPU plays a more significant role when your average datagram size is small. So if you have lots of data in small datagrams, your CPU will work harder per Mbps than if the datagrams are large. This is because the majority of CPU work is done on a per packet basis, rather than a per byte basis.
The one hardware related factor that is most heavily influenced by the total volume of data is the memory bandwidth. Every read from memory, every write to memory takes time. Any operation on a datagram that requires data to be copied in memory takes considerably more time. Care has been taken in the design of the Linux kernel to ensure that data copies are kept to a minimum, but some operations, such as IP datagram fragmentation or reading from or writing to some device drivers, require the datagram, or portions of it, to be copied in memory. While this may seem trivial, in practice it becomes an issue when processing high volumes of large datagrams.
If you're building a high performance Linux-based router, there are some choices you can make that will help ensure you're not disappointed. Inevitably, you'll make compromises somewhere, but you'll at least be doing it with some knowledge of the potential impact.
So what then are the rules of thumb that I promised? Here they are, split into two categories:
Interrupts: Keep their number low.
If you're wanting to support serial interfaces, be they asynchronous RS-232 or synchronous V.35 or HSSI, there is no substitute for an "intelligent" serial I/O controller. By taking the hard work away to their own processor, they ease the burden, both on the main CPU and on the bus.
If you're supporting Ethernet, then choose a modern design like the "tulip" chipset. Don't skimp by using those cards you have lying around on the shelf; a lot of evolution has occurred in the past couple of years in Ethernet card design that provides great wins in performance. If you're intending to support large numbers of Ethernet interfaces, consider purchasing multiport Ethernet cards. These cards support a number (4, 8) of Ethernet interfaces on a single card and share interrupts, memory, etc., among them.
Bus: Choose PCI.
It's now ubiquitous, it performs well, and its volume makes it reasonably priced. If you're intending to support large numbers of I/O cards, choose a dual PCI bus machine. This spreads the I/O load across two buses and reduces contention.
CPU: Don't go overboard.
IP routers become CPU bound when handling high volumes of small datagrams. Anecdotal evidence suggests that a P-III/450 is capable of routing over 50,000 packets per second with a Linux kernel so long as most of those stick to the fast path. If you're wanting firewalling or NAT capability on your router, then CPU performance will be important.
Memory: If you want high performance, you need fast memory.
Memory bandwidth becomes an issue when routing large volumes of large datagrams. Use fast RAM like SDRAM or better, especially if you're routing at fast Ethernet speeds. This is an area where you might have to compromise a little. Fast RAM can be expensive and can influence your motherboard options, so you'll need to be realistic about what you're actually trying to achieve at what price.
Choose your interface MTUs carefully. If your router will support Ethernet interfaces only, then this isn't a big issue, but if you're using a combination of Ethernet interfaces with serial interfaces, then consider using an MTU of 1500 on your serial interfaces to avoid the need to fragment when forwarding from an Ethernet interface.
Consider kernel compile options.
Be sure to compile your kernel to take advantage of whatever optimizations are available for your CPU. It's a small thing, but it can't hurt. The 2.4 series kernels provide some network optimizations that will be useful, too. The first of these is the "IP: advanced router/IP: large routing tables" (CONFIG_IP_ROUTE_LARGE_TABLES) option, which assists in performance when you have large routing tables. Another of these is the "fast switching" (CONFIG_NET_FASTROUTE) option, which allows direct Ethernet NIC to NIC data transfers for certain Ethernet drivers.
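For reference, the relevant fragment of a 2.4-era kernel .config looks something like the following sketch (CONFIG_IP_ROUTE_LARGE_TABLES sits under the advanced router option, and exact option names can vary between kernel versions):

```
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_ROUTE_LARGE_TABLES=y
CONFIG_NET_FASTROUTE=y
```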
Firewall and NAT: Use them sparingly.
This might sound silly, but firewall and NAT functions impact router performance. If you don't have a good need for them, don't use them. One option to consider is to avoid using them on core routers in your network; instead, push these functions out to edge routers where the data volumes are lower. If you must use them on core routers, use them sparingly and choose your rulesets to offer the shortest path for the majority of traffic.
There is lots of reading to be done on this subject. I've provided a few references to related information. The Linux Router Project mailing list occasionally has discussions relating to Linux router performance and might be a good place to ask specific questions if you have any. If you have the opportunity to attend Gísli's talk at any time, I recommend it. In the meantime, happy routing.
A purpose-built Linux distribution aimed at router construction.
The "PROgrammable Network of TOmorrow," a Linux-based active router.
DARPA sponsors a number of active network projects; you might be interested in reading about some of them.
Provides practical advice on selecting Ethernet cards.
Provides pointers and advice on selecting Linux-compatible hardware.
A well-known vendor of high performance, Linux-supported serial hardware.
Terry Dawson is the author of a number of network-related HOWTO documents for the Linux Documentation Project, a co-author of the 2nd edition of O'Reilly's Linux Network Administrator's Guide, and an active participant in a number of other Linux projects.
Copyright © 2009 O'Reilly Media, Inc.