Next: Parallel Processing
Up: Background
Previous: MPI
The MPI Pro libraries allow easy selection of which network drivers to
use by setting an environment variable to TCP (for TCP/IP over
Ethernet) or to VIA (for VIA over ServerNet). Messages between
processes on the same node are automatically sent using shared memory.
With the TCP option, data is passed over the 10Mb/s Ethernet network
using the operating system's TCP/IP network stack. With the VIA
option, data is passed using the much thinner and faster VIA protocol,
over gigabit ServerNet NICs and routers with hardware support for the
VIA protocol. The faster network makes worthwhile both a higher degree
of parallel processing and larger systems than would be practical with
TCP/IP, since TCP/IP more quickly reaches the point where the added
network traffic from more parallel processing and larger systems
actually slows down processing.
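The trade-off described above can be illustrated with a toy model. The cost function and constants below are illustrative assumptions, not measurements from this thesis; the point is only that a transport with lower per-message overhead pushes the break-even point to more processes:

```python
def speedup(n, work=1.0, per_msg_overhead=0.01):
    """Toy fixed-work speedup model: compute time shrinks as work/n,
    but communication cost grows with the number of processes n.
    per_msg_overhead stands in for per-message network latency."""
    compute = work / n
    comm = per_msg_overhead * (n - 1)  # each process exchanges with the others
    return work / (compute + comm)

# A slower network (larger per-message overhead) peaks at fewer processes.
best_fast = max(range(1, 65), key=lambda n: speedup(n, per_msg_overhead=0.001))
best_slow = max(range(1, 65), key=lambda n: speedup(n, per_msg_overhead=0.01))
```

In this model the faster transport keeps gaining up to roughly three times as many processes before communication cost overtakes the compute savings, which is the behavior the paragraph above describes qualitatively.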
Figure 2: Work Queues Model
VIA is well suited to high speed communications in a distributed
system. The protocol was developed as a standard sponsored by Intel,
Microsoft, and Compaq specifically for use in system area networks
(SANs). The hope was to create an open standard that would promote the
use of distributed message passing in computer clusters. Its
strengths are speed and simplicity. TCP/IP, on the other hand, was
designed to be able to work in very large networks. It can be used in
networks with many more computers than clusters typically have. It
has the flexibility to be used in a wide variety of systems and
scenarios. Its thick protocol stack and various layers with error
checking work well across a large network with complex, many-hop
routing and plenty of opportunities to drop packets. Most of this
functionality is implemented in software as the operating system
protocol stack, which uses the CPU to compute data encapsulation and
error checking. In systems using TCP/IP, hardware interfaces are
usually implemented as interrupt handlers and I/O port communications,
which must be explicitly handled by the OS. This consumes CPU cycles:
protocol processing and interrupt handling take 10 to 30 percent of
CPU time, compared to three to five percent for VI over ServerNet
[via97].
In contrast, VIA can be thinner because it runs in a smaller network,
yet it retains as much flexibility and scalability as cluster
computing needs. As implemented on ServerNet, error checking, message
encapsulation, and routing can be done efficiently in hardware. The
use of DMA transfers to move blocks of data from one server to
another offloads the CPU, which both lets the CPU do useful work
while messages are being passed and decreases the data-passing
latency.
Figure 3: Descriptor processing
Like TCP/IP, VIA as a standard is specified as a set of hardware and
operating system independent APIs. There are currently
implementations for Windows NT and a variety of UNIXes, and for
several varieties of networking hardware. Although some
implementations work with some Ethernet cards, the systems described
in this thesis used VIA only with ServerNet and TCP/IP only over
Ethernet. The biggest benefits of the VIA protocol are seen with
hardware specifically designed to support the interface.
From the bottom up, VIA message passing transactions involve the ServerNet
NIC, hardware interface functions known as the VIA Primitive Library
(VIPL) or OS level drivers, and the application level function using
VIA communications. At the start of a VIA transaction, a process
using MPI calls an MPI function, which talks to the User Agent.
Together, these components are called a VI Consumer. The User Agent
is a software interface used to
provide a standard interface to the more proprietary OS and hardware
components which are collectively known as a VI Provider. This
consists of the kernel mode VI Kernel Agent, the VIs themselves, a
Completion Queue (CQ) for keeping track of which operations have been
completed, and the VI NIC. Continuing with the communication setup
sequence, the Kernel Agent sets up a virtual interface (VI) which the
application using MPI can then use fairly directly through the User
Agent. From that point, the kernel is no longer involved in data
movement, only in connection signaling and in tearing down virtual
interfaces.
Figure 4: Throughput vs. data block size using MPI on a two-node
cluster using TCP/IP over 10BaseT Ethernet.
The TCP/IP equivalent is opening a socket between computers. The
transmission control protocol (TCP) layer is responsible for
providing a reliable, connection-oriented transport layer. It uses a state
machine for coordinating handshaking for establishing or taking down
connections, and for packet passing and acknowledgment to control data
flow. For reliability, TCP checks for timeouts, dropped or duplicated
packets, packet ordering, and data error detection. All data sent
using TCP goes through each aspect of error checking as a software
operation. IP is a network layer protocol, used for routing data from
one machine to another. Its addressing uses dotted-decimal notation
(xxx.xxx.xxx.xxx); the port number that identifies the endpoint
process belongs to TCP rather than IP. Routing tables tell IP
which network interface to send packets on based on their IP
addresses. TCP packets may be broken into smaller pieces and
reassembled when they reach their destination. It is IP that allows
packets to leave one computer, traverse the open Internet, and find
their way to a particular other computer no matter where it is. The
routing, buffering, and fragmentation/defragmentation are all
software operations. The systems discussed in this thesis sent TCP/IP
traffic over an Ethernet LAN, a common scenario for using TCP/IP.
Ethernet has its own transmission control, collision recovery and
avoidance protocols, and error checking. It also has its own hardware
addresses, which must be translated back and forth from IP addresses
using software calls and passing address resolution protocol (ARP) and
reverse address resolution protocol (RARP) packets across the network.
All of these steps are overhead that is repeated, mostly in software,
for every packet sent over TCP/IP. Even after a socket is opened, the
OS's TCP/IP protocol stack performs processing, error checking, and
buffering on all data passed [Tan96].
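As a concrete baseline, the socket sequence a TCP/IP application goes through can be sketched in a few lines. This is a loopback Python sketch, not code from the thesis; every byte of the payload below still traverses the full OS protocol stack described above:

```python
import socket
import threading

def echo_server(srv):
    # Accept one connection, read until the peer closes its send side,
    # then echo everything back.
    conn, _ = srv.accept()
    data = b""
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            break
        data += chunk
    conn.sendall(data)
    conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
srv.listen(1)
t = threading.Thread(target=echo_server, args=(srv,))
t.start()

cli = socket.create_connection(srv.getsockname())
payload = b"x" * 8192         # one of the block sizes measured below
cli.sendall(payload)
cli.shutdown(socket.SHUT_WR)  # signal end of data to the server
echoed = b""
while True:
    chunk = cli.recv(65536)
    if not chunk:
        break
    echoed += chunk
cli.close()
t.join()
srv.close()
```

Even on loopback, each `sendall`/`recv` pair crosses into kernel mode and back, which is exactly the per-operation cost VIA is designed to avoid.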
A VI Kernel Agent, on the other hand, is only responsible for
connection management, such as setting up or taking down VIs, and
providing memory protection for Doorbells. A VI on one computer is
configured to communicate with a VI on another computer, and from that
point on the application software simply passes Descriptors through
the User Agent, which can be done with minimal software overhead and
without a context switch into kernel mode.
Figure 5: Throughput vs. data block size using MPI on a two-node
cluster using VIA over ServerNet.
When the application wants to send or receive data, it calls a User
Agent function which puts the appropriate descriptor on the VI's send
or receive queue, then rings the corresponding doorbell. The doorbell
is a signaling mechanism that tells the VI it has a new descriptor.
Each VI has two lists of descriptors, the Send Queue and the Receive
Queue, and each queue has a corresponding doorbell. Only one process
can use a given VI, but a process can set up as many VIs as it likes,
barring hardware or memory constraints. Typically, a VI is created,
attached to another VI to do all communication with one process on
another computer, then removed when all communication with that
process is done.
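The queue-and-doorbell arrangement can be modeled with a small sketch. The class and method names below are illustrative stand-ins, not the VIPL API:

```python
from collections import deque

class ToyVI:
    """Minimal model of a Virtual Interface: a Send Queue and a
    Receive Queue of descriptors, each with its own doorbell."""
    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.send_doorbell = 0   # rings not yet serviced by the NIC
        self.recv_doorbell = 0

    def post_send(self, descriptor):
        self.send_queue.append(descriptor)
        self.send_doorbell += 1  # "ring": tell the NIC there is work

    def post_recv(self, descriptor):
        self.recv_queue.append(descriptor)
        self.recv_doorbell += 1

    def nic_service_send(self):
        # The NIC processes the next descriptor, then dequeues it.
        self.send_doorbell -= 1
        return self.send_queue.popleft()

vi = ToyVI()
vi.post_send({"addr": 0x1000, "length": 8192})
done = vi.nic_service_send()
```

The important property the sketch captures is that posting work is just an enqueue plus a doorbell ring in user space; the NIC drains the queues without the OS mediating each transfer.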
A VI Descriptor is a data structure which describes data movement. It
can describe point to point communications, scatters, and gathers.
Memory blocks which have been allocated for sending or receiving have
their addresses passed to the VI NIC in the descriptor. The
descriptor also records the current status of the data movement it
describes.
The VI NIC processes VI Descriptors in a VI's send and receive queues,
then dequeues them. In addition to VIs, a VI Consumer may set up
Completion Queues so it can poll or wait on the completion of a
descriptor or group of descriptors. A Completion Queue can handle
completion information for multiple VIs. A process waiting on a
descriptor completion can do so without using interrupts or switching
to kernel mode.
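A Completion Queue shared by several VIs can be sketched the same way. Again, this is a toy model of the behavior described above, not the VIPL interface:

```python
from collections import deque

class ToyCompletionQueue:
    """Collects completion notices from multiple VIs so one consumer
    can poll a single place instead of every VI's queues."""
    def __init__(self):
        self._done = deque()

    def notify(self, vi_id, descriptor):
        # Called by the (simulated) NIC when a descriptor completes.
        self._done.append((vi_id, descriptor))

    def poll(self):
        # Non-blocking check; in the real design this involves no
        # interrupt and no transition to kernel mode.
        return self._done.popleft() if self._done else None

cq = ToyCompletionQueue()
cq.notify("vi-0", {"op": "send", "length": 8192})
cq.notify("vi-1", {"op": "recv", "length": 16384})
first = cq.poll()
```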
To get a sense of expected message passing latencies,
measurements of all system interconnections under consideration were
made by running programs which used MPI to pass data blocks of various
sizes from one process to another. The speed of data passing in these
tests should be similar to the speeds encountered in actual use by
other programs. Figures 4 and 5 show the performance of TCP/IP and
ServerNet, respectively, as throughput vs. the size of data blocks
transferred.
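The measurement pattern is a standard ping-pong test: time many round trips of a fixed-size block and divide. A transport-agnostic sketch follows; the in-memory loopback queue stands in for the MPI send and receive calls actually used in the tests:

```python
import queue
import time

def ping_pong_latency(send, recv, size, iters=200):
    """Estimate one-way latency by timing iters round trips of a
    size-byte block over whatever transport send/recv wrap."""
    block = b"\x00" * size
    start = time.perf_counter()
    for _ in range(iters):
        send(block)   # "ping"
        recv()        # "pong"
    elapsed = time.perf_counter() - start
    return elapsed / (2 * iters)  # halve round-trip time per iteration

# In-memory loopback as a stand-in transport for demonstration.
q = queue.Queue()
latency = ping_pong_latency(q.put, q.get, size=8192)
```

Averaging over many iterations amortizes timer resolution and startup effects, which matters when the per-message times being compared differ by an order of magnitude or more.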
The following table summarizes
the results in terms of total message passing latency:
Latency by data block size:

                   8192 Bytes     16384 Bytes    32768 Bytes
TCP/IP             1.07x10^-7 s   1.95x10^-7 s   3.68x10^-7 s
ServerNet          7.64x10^-9 s   1.18x10^-8 s   1.74x10^-8 s
Shared Memory      2.07x10^-9 s   2.65x10^-9 s   3.69x10^-9 s
Garth Brown
2001-01-26