An Introduction to the InfiniBand Architecture
by Odysseas Pentakalos, coauthor of Windows 2000 Performance Guide
In 1965, Dr. Gordon Moore observed that the industry was able to double the transistor density on a manufactured die every year (Gordon E. Moore, "The Continuing Silicon Technology Evolution Inside the PC Platform"). This observation became popular as Moore's Law and almost 40 years later it still holds as a fairly accurate estimate of the growth of transistor density (a more accurate estimate of the growth rate, which encompasses growth data from the past 45 years, is doubling of the density every 18 months).
This fast-paced growth in transistor density translates into CPU-performance increases of a similar magnitude, making applications such as data mining, data warehousing, and e-business commonplace. Reaping the benefits of this growth in computational power, however, requires that the I/O subsystem be able to deliver the data needed by the processor subsystem at the rate at which it is needed. In the past couple of years, it has become clear that the current shared bus-based architecture will become the bottleneck of the servers that host these powerful but demanding applications.
Performance is just one dimension of the growing demands imposed on the I/O subsystem. As Wendy Vittori, general manager of Intel's I/O products division, said, "The growth of e-Commerce and e-Business means more data delivered to more and more users. This data needs to move faster, with higher reliability and quality of service than ever before." (Intel, I/O Building Blocks, Ultimate I/O Performance.) E-commerce and e-business applications need to be available 24/7 to process transactions at very high rates. This implies that the desired solution to the architecture of the I/O system must be able to provide enhanced reliability and availability in addition to raw performance.
In this article, I will review each of the main architectural features of InfiniBand as a solution to the corresponding limitation of the current I/O subsystem. The InfiniBand specification defines the architecture of the interconnect that will pull together the I/O subsystems of the next generation of servers and will ultimately even move to the powerful desktops of the future. The architecture is based on a serial, switched fabric that, in addition to defining link bandwidths between 2.5 and 30 Gbits/sec, resolves the scalability, expandability, and fault tolerance limitations of the shared bus architecture through the use of switches and routers in the construction of its fabric. But before we delve into that discussion, let's see who is behind this new I/O architecture and how it came about.
Who is working on it?
A few years ago, when the impending bottleneck at the I/O subsystem became clear, a number of industry leaders decided to take action and come up with the design for the I/O subsystem of the future. As is common in the computer industry, two competing efforts started: one called the Future I/O initiative, which included Compaq, IBM, and Hewlett-Packard, and the other called the Next-Generation I/O initiative, which included Dell, Hitachi, Intel, NEC, Siemens, and Sun Microsystems. As the first versions of those specifications started to become available, the two groups decided to unify their efforts by bringing together the best ideas of each of the two separate initiatives. So, in August of 1999, seven industry leaders, Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun Microsystems, formed the InfiniBand Trade Association (ITA) (InfiniBand Trade Association, "What is the InfiniBand Trade Association.").
In addition to those seven steering committee members, the ITA consists of 11 sponsoring members, including among others 3Com, Cisco, and EMC, and more than 200 member companies. By agreeing to join their efforts, the two groups were able both to eliminate the confusion in the market that would have resulted from the coexistence of two competing standards and to release a first version of the specification very quickly. Version 1.0 of the InfiniBand Architecture Specification was released in October of 2000, and the 1.0a version, mainly consisting of minor changes to the 1.0 version, was released in June of 2001. The specification is available for download from the ITA Web site for free, even for non-members.
What are its main architectural features?
The Peripheral Component Interconnect (PCI) bus, which was first introduced in the early 1990s, is the dominant bus used in both desktop and server machines for attaching I/O peripherals to the CPU/memory complex. The most common configuration of the PCI bus is a 32-bit 33MHz version that provides a bandwidth of 133MB per second, although the 2.2 version of the specification allows for a 64-bit version at 33MHz for a bandwidth of 266MB per second and even a 64-bit 66MHz version for a bandwidth of 533MB per second. Even today's powerful desktop machines have plenty of spare capacity on the PCI bus in the typical configuration, but server machines are starting to hit the upper limits of the shared bus architecture. A multiport Gigabit Ethernet NIC, along with one or more Fibre Channel I/O controllers, can easily consume the bandwidth of even the fastest 64-bit, 66MHz version of the PCI bus.
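The bandwidth figures above follow from multiplying the bus width by the clock rate. The following is a minimal sketch of that arithmetic, assuming one full-width transfer per clock cycle and nominal clocks of 33.33 and 66.66 MHz. The function and variable names are illustrative and do not come from the PCI specification.

```python
def peak_bandwidth_mb_per_s(width_bits, clock_mhz):
    """Peak rate in MB/s, assuming one full-width transfer per clock cycle."""
    return (width_bits / 8) * clock_mhz


# Nominal PCI clocks, written as 100/3 and 200/3 MHz (an assumption of this sketch).
PCI_VARIANTS = {
    "32-bit, 33MHz": (32, 100 / 3),
    "64-bit, 33MHz": (64, 100 / 3),
    "64-bit, 66MHz": (64, 200 / 3),
}

for name, (width, clock) in PCI_VARIANTS.items():
    # Truncate to whole megabytes, as the figures quoted in the text do.
    print(f"{name}: {int(peak_bandwidth_mb_per_s(width, clock))} MB/s")
```

These are peak numbers. On a shared bus, every attached device competes for the same total, so the rate seen by any one device is lower.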
To resolve this limitation on the bandwidth of the PCI bus, a number of interim solutions, such as PCI-X and PCI DDR, are becoming available in the market (Mellanox Technologies, "Understanding PCI Bus, 3GIO, and InfiniBand Architecture"). Both of them are backwards-compatible upgrade paths from the current PCI bus. The PCI-X specification allows for a 64-bit version of the bus operating at a clock rate of 133 MHz, but this is achieved by easing some of the timing constraints. The shared bus nature of PCI-X forces it to lower its fanout in order to achieve the high clock rate of 133 MHz.
A PCI-X system that is running at 133 MHz can have only one slot on the bus. Two PCI-X slots would allow a maximum clock rate of 100 MHz, whereas a four-slot configuration would drop down to a clock rate of 66 MHz (Compaq Computer Corporation, "PCI-X: An Evolution of the PCI Bus," September 1999, TC990903TB). So, despite the temporary resolution of the PCI bandwidth limitation through these upgrade technologies, a long-term solution is needed that does not rely on a shared bus architecture.
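To make the fanout trade-off concrete, here is a rough sketch that combines the 64-bit PCI-X bus width with the maximum clock for each slot count (133, 100, and 66 MHz for one, two, and four slots). The even split of bandwidth among busy slots is a simplifying assumption for illustration, not something the PCI-X specification defines.

```python
BUS_WIDTH_BYTES = 8  # 64-bit PCI-X bus

# Maximum clock rate in MHz for a given number of slots on one bus segment.
MAX_CLOCK_MHZ = {1: 133, 2: 100, 4: 66}


def aggregate_mb_per_s(slots):
    """Peak bandwidth of the whole bus segment for this slot count."""
    return BUS_WIDTH_BYTES * MAX_CLOCK_MHZ[slots]


def per_slot_mb_per_s(slots):
    """Share per slot if all slots are busy (even split is an assumption)."""
    return aggregate_mb_per_s(slots) / slots


for slots in sorted(MAX_CLOCK_MHZ):
    print(slots, aggregate_mb_per_s(slots), per_slot_mb_per_s(slots))
```

Adding slots lowers the total bandwidth and also divides it among more devices. A point-to-point link has neither problem.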
InfiniBand breaks through the bandwidth and fanout limitations of the PCI bus by migrating from the traditional shared bus architecture into a switched fabric architecture. Figure 1, below, shows the simplest configuration of an InfiniBand installation, where two or more nodes are connected to one another through the fabric. A node represents either a host device such as a server or an I/O device such as a RAID subsystem. The fabric itself may consist of a single switch in the simplest case or a collection of interconnected switches and routers. I will describe the difference between a switch and a router a little later but those of you with a networking background have probably guessed already what the difference is.
Each connection between nodes, switches, and routers is a point-to-point, serial connection. This basic difference brings about a number of benefits:
- Because it is a serial connection, it only requires four wires as opposed to the wide parallel connection of the PCI bus.
- The point-to-point nature of the connection provides the full capacity of the connection to the two endpoints because the link is dedicated to the two endpoints. This eliminates the contention for the bus as well as the resulting delays that emerge under heavy loading conditions in the shared bus architecture.
- The InfiniBand channel is designed for connections between hosts and I/O devices within a Data Center. Due to the well defined, relatively short length of the connections, much higher bandwidth can be achieved than in cases where much longer lengths may be needed.
The InfiniBand specification defines the raw bandwidth of the base 1x connection at 2.5Gb per second. It then specifies two additional bandwidths, referred to as 4x and 12x, as multipliers of the base link rate. At the time that I am writing this, there are already 1x and 4x adapters available in the market. So, InfiniBand will be able to achieve much higher data transfer rates than is physically possible with the shared bus architecture, without the fan-out limitations of the latter.
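The link rates can be derived from the base rate as a short calculation. The sketch below multiplies the 2.5 Gb/s base signaling rate by each link width. The usable-data column assumes 8b/10b line encoding (8 data bits carried in every 10 signaled bits), a detail this article does not cover and which is included here only as a labeled assumption.

```python
BASE_RATE_GBPS = 2.5                       # raw signaling rate of a 1x link
LINK_WIDTHS = {"1x": 1, "4x": 4, "12x": 12}


def raw_rate_gbps(width):
    """Raw signaling rate for a link width such as '4x'."""
    return BASE_RATE_GBPS * LINK_WIDTHS[width]


def data_rate_gbps(width):
    """Usable data rate, assuming 8b/10b encoding (not stated in the article)."""
    return raw_rate_gbps(width) * 8 / 10


for width in LINK_WIDTHS:
    print(f"{width}: raw {raw_rate_gbps(width)} Gb/s, data {data_rate_gbps(width)} Gb/s")
```

The raw figures of 2.5, 10, and 30 Gb/s match the range of link bandwidths quoted in the introduction.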
Let's now dig a little deeper into the architecture of InfiniBand to explore some of its additional benefits. Figure 2, below, illustrates a system area network that utilizes the InfiniBand architecture. In the example shown in the figure, the fabric consists of three switches that connect the six end nodes together. Each node connects to the fabric through a channel adapter. The InfiniBand specification classifies the channel adapters into two categories: Host Channel Adapters (HCA) and Target Channel Adapters (TCA).
HCAs are present in servers or even desktop machines and provide an interface that is used to integrate InfiniBand with the operating system. TCAs are present on I/O devices such as a RAID subsystem or a JBOD subsystem. Each channel adapter may have one or more ports. As you can also see in the figure, a channel adapter with more than one port may be connected to multiple switch ports. This allows for multiple paths between a source and a destination, resulting in performance and reliability benefits.
By having multiple paths available in getting the data from one node to another, the fabric is able to achieve transfer rates at the full capacity of the channel, avoiding congestion issues that arise in the shared bus architecture. Furthermore, having alternative paths results in increased reliability and availability since another path is available for routing of the data in the case of failure of one of the links.
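The availability benefit of alternative paths can be illustrated with a small graph search. The sketch below uses a hypothetical topology (two dual-ported end nodes and three switches, not the one in Figure 2) and a plain breadth-first search. It is not the route computation InfiniBand itself performs, only a demonstration that a second path survives the loss of one link.

```python
from collections import deque

# Hypothetical fabric: each pair is one point-to-point link.
LINKS = [
    ("hostA", "sw1"), ("hostA", "sw2"),
    ("sw1", "sw3"), ("sw2", "sw3"),
    ("sw3", "raidB"), ("sw2", "raidB"),
]


def find_path(links, src, dst, failed=()):
    """Breadth-first search for a shortest path that avoids failed links."""
    down = {frozenset(link) for link in failed}
    neighbours = {}
    for a, b in links:
        if frozenset((a, b)) in down:
            continue  # skip links that are marked as failed
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back from dst to src
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # no surviving path


print(find_path(LINKS, "hostA", "raidB"))
print(find_path(LINKS, "hostA", "raidB", failed=[("sw2", "raidB")]))
```

With all links up, the search returns the two-hop route through sw2. With the sw2-to-raidB link marked as failed, it falls back to the route through sw1 and sw3.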
Two more unique features of the InfiniBand Architecture that become evident in Figure 2 are the ability to share storage devices across multiple servers and the ability to perform third-party I/O. Third-party I/O is a term used to refer to the ability of two storage devices to complete an I/O transaction without the direct involvement of a host other than in setting up the operation. This feature is extremely important from the performance perspective since many such I/O operations between two storage devices can be totally off-loaded from the server, thereby eliminating the unnecessary CPU utilization that would otherwise be consumed.
Host and Target Channel Adapters present an interface to the layers above them that allows those layers to generate and consume packets. In the case of a server writing a file to a storage device, the host generates the packets that are then consumed by the storage device. In contrast to the channel adapters, switches simply forward packets between two of their ports based on the established routing table and the addressing information stored in the packets. A collection of end nodes connected to one another through one or more switches forms a subnet. Each subnet must have at least one Subnet Manager that is responsible for the configuration and management of the subnet.
Routers are like switches in that they simply forward packets between their ports. The difference, however, is that a router is used to interconnect two or more subnets to form a larger multi-domain system area network. Within a subnet, each port is assigned a unique identifier by the subnet manager called the Local ID or LID. In addition to the LID, each port is assigned a globally unique identifier called the GID. Switches make use of the LIDs for routing packets from the source to the destination, whereas routers make use of the GIDs for routing packets across domains. More detailed information on the LIDs, GIDs, and their assignment is available either in the specification or in William Futral's book (William T. Futral, InfiniBand Architecture: Development and Deployment. A Strategic Guide to Server I/O Solutions, Intel Press, 2001).
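The division of labor between switches and routers can be sketched as two lookup tables. In this simplified model, a switch picks an output port from the destination LID, while a router picks one from the subnet part of the destination GID. The class names, the tuple used as a stand-in for a GID, and the table layouts are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    dest_lid: int    # local ID, meaningful only inside one subnet
    dest_gid: tuple  # (subnet_name, port_name): simplified stand-in for a GID


class Switch:
    """Forwards within a subnet by looking up the destination LID."""

    def __init__(self, lid_table):
        self.lid_table = lid_table  # destination LID -> output port number

    def output_port(self, packet):
        return self.lid_table[packet.dest_lid]


class Router:
    """Forwards between subnets by looking up the subnet part of the GID."""

    def __init__(self, subnet_table):
        self.subnet_table = subnet_table  # destination subnet -> output port number

    def output_port(self, packet):
        subnet, _port_name = packet.dest_gid
        return self.subnet_table[subnet]


packet = Packet(dest_lid=7, dest_gid=("subnet-b", "raid0"))
print(Switch({7: 2}).output_port(packet))
print(Router({"subnet-b": 1}).output_port(packet))
```

In this model, the LID tables are what a subnet manager would fill in for its own subnet, and the router only needs to know which port leads toward each remote subnet.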
One more feature of the InfiniBand Architecture that is not available in the current shared bus I/O architecture is the ability to partition the ports within the fabric that can communicate with one another. This is useful for partitioning the available storage across one or more servers for management reasons and/or for security reasons.
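Partitioning can be pictured as a membership check. The sketch below assumes a simple model in which each port belongs to one or more named partitions and two ports may communicate only if they share at least one. The port names, partition names, and the `can_communicate` helper are hypothetical. The actual mechanism is defined in the specification.

```python
# Hypothetical membership table: port name -> set of partitions it belongs to.
PARTITIONS = {
    "server1": {"finance"},
    "server2": {"engineering"},
    "raid_finance": {"finance"},
    "raid_shared": {"finance", "engineering"},
}


def can_communicate(port_a, port_b, membership=PARTITIONS):
    """True if the two ports share at least one partition."""
    return bool(membership.get(port_a, set()) & membership.get(port_b, set()))


print(can_communicate("server1", "raid_finance"))
print(can_communicate("server2", "raid_finance"))
print(can_communicate("server2", "raid_shared"))
```

Here server2 is kept away from the finance storage while both servers can still reach the shared array, which is the kind of management and security separation described above.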