An Introduction to the InfiniBand Architecture

Before I discuss a few more important benefits of the InfiniBand Architecture, we need to dig a little deeper. Figure 3 illustrates the communications stack of the InfiniBand Architecture, but before I go through each of the terms that appear in the figure, we need to understand the drivers that shaped the InfiniBand network stack.

Figure 3. InfiniBand Communication Stack

To achieve better performance and scalability at a lower cost, system architects came up with the concept of clustering, where two or more servers are connected together to form a single logical server. To get the most benefit from clustering multiple servers, the protocol used for communication between the physical servers must provide high bandwidth and low latency. Unfortunately, full-fledged network protocols such as TCP/IP, which must perform well across both LANs and WANs, have become so complex that they both incur considerable latency and require many thousands of lines of code to implement.

To overcome these issues, Compaq, Intel, and Microsoft joined forces and came up with the Virtual Interface (VI) Architecture (VIA) specification, which was released in December of 1997. The VI Architecture is a server messaging protocol whose focus is to provide a very low latency link between the communicating servers. The specification defines four basic components: virtual interfaces, completion queues, VI Providers, and VI Consumers. The VIA specification describes each of these components in detail. I won't repeat that detail here; I only provide enough high-level information for you to understand how the low latency is achieved.

In transferring a block of data from one server to another, latency arises in the form of overhead and delays that are added to the time needed to transfer the actual data. If we were to break down the latency into its components, the major contributors would be: a) the overhead of executing network protocol code within the operating system, b) context switches to move in and out of kernel mode to receive and send out the data, and c) excessive copying of data between the user-level buffers and the NIC memory.

Since VIA was only intended to be used for communication across the physical servers of a cluster (in other words, across high-bandwidth links with very high reliability), the specification can eliminate much of the standard network protocol code that deals with special cases. Also, because of the well-defined environment of operation, the message exchange protocol was defined to avoid kernel mode interaction and allow access to the NIC from user mode. Finally, because of the direct access to the NIC, unnecessary copying of the data into kernel buffers was also eliminated, since the user is able to transfer data directly from user space to the NIC. In addition to the standard send/receive operations that are typically available in a networking library, the VIA provides Remote Direct Memory Access (RDMA) operations, where the initiator of the operation specifies both the source and destination of a data transfer, resulting in zero-copy data transfers with minimum involvement of the CPUs.
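To make the comparison concrete, here is a minimal sketch that tallies the three overhead sources listed earlier for a conventional kernel path and for a VIA-style user-level path. The per-message counts and the unit costs are assumptions chosen purely for illustration, not measurements, and the names `PathProfile` and `overhead` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathProfile:
    """Counts of the three overhead sources named in the text, per message."""
    protocol_stack_runs: int   # (a) kernel network protocol code executed
    context_switches: int      # (b) transitions into and out of kernel mode
    buffer_copies: int         # (c) copies between user buffers and the NIC


def overhead(profile, stack_cost, switch_cost, copy_cost):
    """Total per-message overhead, in whatever time unit the costs use."""
    return (profile.protocol_stack_runs * stack_cost
            + profile.context_switches * switch_cost
            + profile.buffer_copies * copy_cost)


# Assumed counts, for illustration only.
KERNEL_PATH = PathProfile(protocol_stack_runs=1, context_switches=2, buffer_copies=2)
USER_LEVEL_PATH = PathProfile(protocol_stack_runs=0, context_switches=0, buffer_copies=1)

if __name__ == "__main__":
    # Placeholder unit costs, not measurements.
    costs = dict(stack_cost=10, switch_cost=5, copy_cost=3)
    print("kernel path overhead:", overhead(KERNEL_PATH, **costs))
    print("user-level path overhead:", overhead(USER_LEVEL_PATH, **costs))
```

Whatever positive costs you plug in, the user-level path comes out ahead in this model because each of its counts is no larger than the kernel path's; the real gain depends on the hardware and the workload.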

The reason I spent so much time on the VIA in an article about the InfiniBand Architecture is that InfiniBand essentially uses the VIA primitives for its operation at the transport layer. Now we can return to Figure 3 and describe the terms that are shown. For an application to communicate with another application over InfiniBand, it must first create a queue pair (QP), which consists of a send work queue and a receive work queue. To execute an operation, the application places a work queue element (WQE) in the appropriate work queue. From there the operation is picked up for execution by the channel adapter. The work queue therefore forms the communications medium between applications and the channel adapter, relieving the operating system of this responsibility.

Each process may create one or more QPs for communicating with another application. Instead of having to arbitrate for the use of a single NIC queue, as in a typical operating system, each queue pair has its own associated context. Since both the protocol and the structures are clearly defined, queue pairs can be implemented in hardware, thereby off-loading most of the work from the CPU. Once a WQE has been processed, a completion queue element (CQE) is created and placed in the completion queue. The advantage of using a completion queue to notify the caller of completed WQEs is that it reduces the number of interrupts that would otherwise be generated.
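The flow from WQE to CQE can be sketched as a toy model. This is not the InfiniBand verbs API; every class and field name below is hypothetical, and the "channel adapter" is ordinary Python code standing in for hardware.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkQueueElement:
    wr_id: int            # hypothetical identifier chosen by the application
    opcode: str           # illustrative name, e.g. "SEND"
    payload: bytes = b""


@dataclass
class CompletionQueueElement:
    wr_id: int
    status: str


@dataclass
class QueuePair:
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)
    completion_queue: deque = field(default_factory=deque)


class ToyChannelAdapter:
    """Stands in for the hardware that drains work queues."""

    def process(self, qp):
        # Execute posted send WQEs in order and report each one with a CQE.
        while qp.send_queue:
            wqe = qp.send_queue.popleft()
            qp.completion_queue.append(CompletionQueueElement(wqe.wr_id, "success"))


def poll_completion(qp):
    """Return the next CQE, or None if nothing has completed yet."""
    return qp.completion_queue.popleft() if qp.completion_queue else None


if __name__ == "__main__":
    qp = QueuePair()
    qp.send_queue.append(WorkQueueElement(wr_id=1, opcode="SEND", payload=b"hello"))
    qp.send_queue.append(WorkQueueElement(wr_id=2, opcode="SEND", payload=b"world"))
    ToyChannelAdapter().process(qp)
    print(poll_completion(qp))
    print(poll_completion(qp))
```

The point of the design is visible even in the toy: the application only appends to a queue and later polls another queue, so no per-operation call into the operating system or per-operation interrupt is needed.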

The operations supported by the InfiniBand architecture at the transport level for Send Queues are as follows:

  1. Send/Receive: supports the typical send/receive operation where one node submits a message and another node receives that message. One difference between the implementation of the send/receive operation under the InfiniBand architecture and traditional networking protocols is that the InfiniBand defines the send/receive operations as operating against queue pairs.

  2. RDMA-Write: this operation permits one node to write data directly into a memory buffer on a remote node. The remote node must of course have given appropriate access privileges to the node ahead of time and must have memory buffers already registered for remote access.

  3. RDMA-Read: this operation permits one node to read data directly from the memory buffer of a remote node. The remote node must of course have given appropriate access privileges to the node ahead of time.

  4. RDMA Atomics: this name actually refers to two different operations, each of which reads and updates a remote memory location as a single atomic step, but which operate differently from one another. The Compare & Swap operation reads a memory location and, if its value is equal to a specified value, writes a new value to that location. The Fetch Add operation reads a value, returns it to the caller, then adds a specified number to that value and saves the result back at the same address.
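The RDMA and atomic operations in the list above can be illustrated with a small in-process model of a registered memory region. This is a sketch only: `RegisteredRegion`, its permission flags, and the little-endian byte order are assumptions made for the example, and running in a single thread sidesteps the atomicity that real hardware has to guarantee. The atomics act on 64-bit values, as they do in InfiniBand.

```python
class RemoteAccessError(Exception):
    """Raised when the target region does not grant the requested access."""


class RegisteredRegion:
    """Toy memory region that a node has registered for remote access."""

    def __init__(self, size, allow_read=False, allow_write=False, allow_atomic=False):
        self.buf = bytearray(size)
        self.allow_read = allow_read
        self.allow_write = allow_write
        self.allow_atomic = allow_atomic

    def _check(self, allowed, offset, length):
        if not allowed:
            raise RemoteAccessError("access not granted ahead of time")
        if offset < 0 or offset + length > len(self.buf):
            raise RemoteAccessError("access outside the registered region")

    def rdma_write(self, offset, data):
        self._check(self.allow_write, offset, len(data))
        self.buf[offset:offset + len(data)] = data

    def rdma_read(self, offset, length):
        self._check(self.allow_read, offset, length)
        return bytes(self.buf[offset:offset + length])

    def _load64(self, offset):
        return int.from_bytes(self.buf[offset:offset + 8], "little")

    def _store64(self, offset, value):
        # Wrap to 64 bits so the stored value always fits in 8 bytes.
        self.buf[offset:offset + 8] = (value % 2**64).to_bytes(8, "little")

    def compare_and_swap(self, offset, expected, new):
        """Write `new` only if the current value equals `expected`; return the old value."""
        self._check(self.allow_atomic, offset, 8)
        old = self._load64(offset)
        if old == expected:
            self._store64(offset, new)
        return old

    def fetch_and_add(self, offset, addend):
        """Add `addend` to the stored value; return the value from before the add."""
        self._check(self.allow_atomic, offset, 8)
        old = self._load64(offset)
        self._store64(offset, old + addend)
        return old
```

A region created without `allow_write=True` rejects `rdma_write`, mirroring the requirement that the remote node grant access privileges ahead of time.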

For Receive Queues the only type of operation is:

  1. Post Receive Buffer: identifies a buffer into which a client may send data, or from which it may receive data, through a Send, RDMA-Write, or RDMA-Read operation.
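As a rough illustration of posting receive buffers, the sketch below models only the Send case: each incoming message consumes the oldest posted buffer. The class `ToyReceiver` and its method names are hypothetical, and the first-in, first-out matching is a simplifying assumption of this example.

```python
from collections import deque


class ToyReceiver:
    """Toy receive queue: buffers must be posted before a message can land."""

    def __init__(self):
        self.recv_queue = deque()   # posted receive buffers, oldest first
        self.completed = []         # (buffer, bytes_received) pairs

    def post_receive(self, size):
        """Post an empty buffer of `size` bytes and return it."""
        buf = bytearray(size)
        self.recv_queue.append(buf)
        return buf

    def deliver_send(self, message):
        """Place an incoming Send into the oldest posted buffer."""
        if not self.recv_queue:
            raise RuntimeError("no receive buffer posted")
        if len(message) > len(self.recv_queue[0]):
            raise ValueError("message larger than posted buffer")
        buf = self.recv_queue.popleft()
        buf[:len(message)] = message
        self.completed.append((buf, len(message)))
        return buf


if __name__ == "__main__":
    rx = ToyReceiver()
    rx.post_receive(8)
    filled = rx.deliver_send(b"hi")
    print(bytes(filled[:2]))
```

The sketch shows why the receiver has to act first: a Send that arrives when no buffer has been posted has nowhere to go.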

When a QP is created, the caller may associate with the QP one of five different transport service types. A process may create and use more than one QP, each of a different transport service type. The InfiniBand transport service types are:

  • Reliable Connection (RC): reliable transfer of data between two entities.
  • Unreliable Connection (UC): unreliable transfer of data between two entities. As with RC, only two entities are involved in the data transfer, but messages may be lost.
  • Reliable Datagram (RD): the QP can send and receive messages from one or more QPs using a reliable datagram channel (RDC) between each pair of reliable datagram domains (RDDs).
  • Unreliable Datagram (UD): the QP can send and receive messages from one or more QPs; however, the messages may get lost.
  • Raw Datagram: the raw datagram is a data link layer service which provides the QP with the ability to send and receive raw datagram messages that are not interpreted.
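One way to keep the five service types straight is to record the two properties the list distinguishes: whether delivery is reliable and whether the QP is tied to a single peer. The enum below is only a summary of that list; the names are hypothetical, and `None` marks a property the list does not state.

```python
from enum import Enum


class TransportService(Enum):
    # value = (reliable, connected); None means the list above does not say
    RELIABLE_CONNECTION = (True, True)
    UNRELIABLE_CONNECTION = (False, True)
    RELIABLE_DATAGRAM = (True, False)
    UNRELIABLE_DATAGRAM = (False, False)
    RAW_DATAGRAM = (None, False)

    @property
    def reliable(self):
        return self.value[0]

    @property
    def connected(self):
        return self.value[1]


def connection_oriented():
    """Service types in which exactly two entities take part in a transfer."""
    return [s for s in TransportService if s.connected]


if __name__ == "__main__":
    for service in TransportService:
        print(service.name, "reliable:", service.reliable, "connected:", service.connected)
```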

This is not meant to be complete coverage of the InfiniBand Architecture specification, so I have left out many implementation details. I do cite references, though, that should provide additional reading for those interested in learning more about it.

What state is the InfiniBand Architecture in and what is Microsoft doing about it?

Due to the overwhelming support in the industry for this new I/O standard architecture, InfiniBand development has been able to move very quickly from specification to actual products appearing in the market. Hardware for putting together InfiniBand fabrics started to appear in the 2nd quarter of 2001, and beta OS support was expected to be available in most popular operating systems by the 4th quarter of 2001. Customer trials are expected after the 2nd quarter of 2002, with production systems starting to become available by the 3rd quarter of 2002.

At Microsoft WinHEC 2001, Rob Haydt, Program Manager of the Windows Base OS Group, gave a presentation titled "Windows InfiniBand Support Roadmap" (Rob Haydt, "Windows InfiniBand Support Roadmap," WinHEC 2001). In that presentation he indicated that InfiniBand will be supported in the first release of the Windows Whistler operating system. Second-generation IB development will take place between late 2002 and 2003, and significant commercial deployments will not begin until the 2003-2004 timeframe.

If you are eager to try out some of the concepts described in this article but you don't have access to InfiniBand hardware, the easiest way to do that at this point is through Winsock Direct. Winsock Direct is an alternative implementation of the Winsock library that transparently takes advantage of support in the hardware to implement RDMA and kernel-bypass optimizations. It initially became available in Windows 2000 Datacenter Server but is now also available in the Service Pack 2 release of Windows 2000 Advanced Server. The white paper by Jim Pinkerton describes in detail how Winsock Direct works and includes lots of numerical results from experiments the author conducted, so it makes for some interesting reading (Jim Pinkerton, "Winsock Direct: The Value of System Area Networks," Microsoft Corporation, May 2001). Other interesting related developments include the definition of the SCSI RDMA Protocol (SRP) over InfiniBand, which is a work in progress, and the definition of the Sockets Direct Protocol (SDP), whose goal is to define a sockets-type API over InfiniBand.

O'Reilly & Associates recently released (January 2002) Windows 2000 Performance Guide.

Odysseas Pentakalos has been an independent consultant for 10 years in performance modeling and tuning of computer systems and in object-oriented design and development.
