Next: Parallel Processing
Up: Background
Previous: MPI
The MPI Pro libraries allow easy selection of which network drivers to
use by setting an environment variable to TCP (for TCP/IP over
Ethernet) or to VIA (for VIA over ServerNet). Messages between
processes on the same node are automatically sent using shared memory.
With the TCP option, data is passed over the 10Mb/s Ethernet network
using the operating system's TCP/IP network stack. With the VIA
option, data is passed using the much thinner and faster VIA protocol,
over gigabit ServerNet NICs and routers with hardware support for the
VIA protocol. The faster network makes worthwhile both a higher degree
of parallel processing and larger systems than would be practical with
TCP/IP, since TCP/IP more quickly reaches the point where the added
network traffic from more parallel processing and larger systems
actually slows down processing.
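The trade-off described above can be illustrated with a toy model. The cost function and constants below are illustrative assumptions, not measurements from this thesis; the point is only that a transport with lower per-message overhead pushes the break-even point to more processes:

```python
def speedup(n, work=1.0, per_msg_overhead=0.01):
    """Toy fixed-work speedup model: compute time shrinks as work/n,
    but communication cost grows with the number of processes n.
    per_msg_overhead stands in for per-message network latency."""
    compute = work / n
    comm = per_msg_overhead * (n - 1)  # each process exchanges with the others
    return work / (compute + comm)

# A slower network (larger per-message overhead) peaks at fewer processes.
best_fast = max(range(1, 65), key=lambda n: speedup(n, per_msg_overhead=0.001))
best_slow = max(range(1, 65), key=lambda n: speedup(n, per_msg_overhead=0.01))
```

In this model the faster transport keeps gaining up to roughly three times as many processes before communication cost overtakes the compute savings, which is the behavior the paragraph above describes qualitatively.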
Figure 2: Work Queues Model
VIA is well suited to high speed communications in a distributed
system. The protocol was developed as a standard sponsored by Intel,
Microsoft, and Compaq specifically for use in system area networks
(SANs). The hope was to create an open standard that would promote the
use of distributed message passing in computer clusters. Its
strengths are speed and simplicity. TCP/IP, on the other hand, was
designed to be able to work in very large networks. It can be used in
networks with many more computers than clusters typically have. It
has the flexibility to be used in a wide variety of systems and
scenarios. Its thick protocol stack and various layers with error
checking work well across a large network with complex, many-hop
routing and plenty of opportunities to drop packets. Most of this
functionality is implemented in software as the operating system
protocol stack, which uses the CPU to compute data encapsulation and
error checking. In systems using TCP/IP, hardware interfaces are
usually implemented as interrupt handlers and I/O port communications,
which must be explicitly handled by the OS. This consumes CPU cycles:
protocol processing and interrupt handling take 10 to 30 percent of
CPU time, compared to three to five percent for VI over ServerNet
[via97].
In contrast, VIA can be thinner because it runs in a smaller network,
yet it retains as much flexibility and scalability as cluster
computing needs. As implemented on ServerNet, error checking, message
encapsulation, and routing can be done efficiently in hardware. The
use of DMA transfers to move blocks of data from one server to
another offloads the CPU, which both lets the CPU do useful work
while messages are being passed and decreases the data-passing
latency.
Figure 3: Descriptor processing
Like TCP/IP, VIA as a standard is specified as a set of hardware and
operating system independent APIs. There are currently
implementations for Windows NT and a variety of UNIXes, and for
several varieties of networking hardware. Although some
implementations work with some Ethernet cards, the systems described
in this thesis used VIA only with ServerNet and TCP/IP only over
Ethernet. The biggest benefits of the VIA protocol are seen with
hardware specifically designed to support the interface.
From the bottom up, VIA message passing transactions involve the ServerNet
NIC, hardware interface functions known as the VIA Primitive Library
(VIPL) or OS level drivers, and the application level function using
VIA communications. At the start of a VIA transaction, a process
using MPI calls an MPI function, which talks to the User Agent.
Together, these components are called a VI Consumer. The User Agent
is a software interface used to
provide a standard interface to the more proprietary OS and hardware
components which are collectively known as a VI Provider. This
consists of the kernel mode VI Kernel Agent, the VIs themselves, a
Completion Queue (CQ) for keeping track of which operations have been
completed, and the VI NIC. Continuing with the communication setup
sequence, the Kernel Agent sets up a virtual interface (VI) which the
application using MPI can then use fairly directly through the User
Agent. From that point, the kernel is no longer involved in data
movement, only in connection signaling and in tearing down virtual
interfaces.
Figure 4: Throughput vs. data block size using MPI on a two-node
cluster using TCP/IP over 10BaseT Ethernet.
The TCP/IP equivalent is opening a socket between computers. The
transmission control protocol (TCP) layer is responsible for
providing a reliable, connection-oriented transport layer. It uses a state
machine for coordinating handshaking for establishing or taking down
connections, and for packet passing and acknowledgment to control data
flow. For reliability, TCP checks for timeouts, dropped or duplicated
packets, packet ordering, and data error detection. All data sent
using TCP goes through each aspect of error checking as a software
operation. IP is a network layer protocol, used for routing data from
one machine to another. Its addressing uses dotted-decimal notation
(xxx.xxx.xxx.xxx); the port number that identifies the endpoint
process belongs to TCP rather than IP. Routing tables tell IP
which network interface to send packets on based on their IP
addresses. TCP packets may be broken into smaller pieces and
reassembled when they reach their destination. It is IP that allows
packets to leave one computer, traverse the open Internet, and find
their way to a particular other computer no matter where it is. The
routing, buffering, and fragmentation/defragmentation are all
software operations. The systems discussed in this thesis sent TCP/IP
traffic over an Ethernet LAN, a common scenario for using TCP/IP.
Ethernet has its own transmission control, collision recovery and
avoidance protocols, and error checking. It also has its own hardware
addresses, which must be translated back and forth from IP addresses
using software calls and passing address resolution protocol (ARP) and
reverse address resolution protocol (RARP) packets across the network.
All of these steps are overhead that is repeated, mostly in software,
for every packet sent over TCP/IP. Even after a socket is opened, the
OS's TCP/IP protocol stack performs processing, error checking, and
buffering on all data passed [Tan96].
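As a concrete baseline, the socket sequence a TCP/IP application goes through can be sketched in a few lines. This is a loopback Python sketch, not code from the thesis; every byte of the payload below still traverses the full OS protocol stack described above:

```python
import socket
import threading

def echo_server(srv):
    # Accept one connection, read until the peer closes its send side,
    # then echo everything back.
    conn, _ = srv.accept()
    data = b""
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            break
        data += chunk
    conn.sendall(data)
    conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
srv.listen(1)
t = threading.Thread(target=echo_server, args=(srv,))
t.start()

cli = socket.create_connection(srv.getsockname())
payload = b"x" * 8192         # one of the block sizes measured below
cli.sendall(payload)
cli.shutdown(socket.SHUT_WR)  # signal end of data to the server
echoed = b""
while True:
    chunk = cli.recv(65536)
    if not chunk:
        break
    echoed += chunk
cli.close()
t.join()
srv.close()
```

Even on loopback, each `sendall`/`recv` pair crosses into kernel mode and back, which is exactly the per-operation cost VIA is designed to avoid.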
A VI Kernel Agent, on the other hand, is only responsible for
connection management, such as setting up or taking down VIs, and
providing memory protection for Doorbells. A VI on one computer is
configured to communicate with a VI on another computer, and from that
point on the application software simply passes Descriptors through
the User Agent, which can be done with minimal software overhead and
without a context switch into kernel mode.
Figure 5: Throughput vs. data block size using MPI on a two-node
cluster using VIA over ServerNet.
When the application wants to send or receive data, it calls a User
Agent function which puts the appropriate descriptor on the VI's send
or receive queue, then rings the corresponding doorbell. The doorbell
is a signaling mechanism that tells the VI it has a new descriptor.
Each VI has two lists of descriptors, the Send Queue and the Receive
Queue, and each queue has a corresponding doorbell. Only one process
can use a given VI, but a process can set up as many VIs as it likes,
barring hardware or memory constraints. Typically, a VI is created,
attached to another VI to do all communication with one process on
another computer, then removed when all communication with that
process is done.
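The queue-and-doorbell arrangement can be modeled with a small sketch. The class and method names below are illustrative stand-ins, not the VIPL API:

```python
from collections import deque

class ToyVI:
    """Minimal model of a Virtual Interface: a Send Queue and a
    Receive Queue of descriptors, each with its own doorbell."""
    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.send_doorbell = 0   # rings not yet serviced by the NIC
        self.recv_doorbell = 0

    def post_send(self, descriptor):
        self.send_queue.append(descriptor)
        self.send_doorbell += 1  # "ring": tell the NIC there is work

    def post_recv(self, descriptor):
        self.recv_queue.append(descriptor)
        self.recv_doorbell += 1

    def nic_service_send(self):
        # The NIC processes the next descriptor, then dequeues it.
        self.send_doorbell -= 1
        return self.send_queue.popleft()

vi = ToyVI()
vi.post_send({"addr": 0x1000, "length": 8192})
done = vi.nic_service_send()
```

The important property the sketch captures is that posting work is just an enqueue plus a doorbell ring in user space; the NIC drains the queues without the OS mediating each transfer.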
A VI Descriptor is a data structure which describes data movement. It
can describe point to point communications, scatters, and gathers.
Memory blocks which have been allocated for sending or receiving have
their addresses passed to the VI NIC in the descriptor. The
descriptor also records the current status of the data movement it
describes.
The VI NIC processes VI Descriptors in a VI's send and receive queues,
then dequeues them. In addition to VIs, a VI Consumer may set up
Completion Queues so it can poll or wait on the completion of a
descriptor or group of descriptors. A Completion Queue can handle
completion information for multiple VIs. A process waiting on a
descriptor completion can do so without using interrupts or switching
to kernel mode.
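A Completion Queue shared by several VIs can be sketched the same way. Again, this is a toy model of the behavior described above, not the VIPL interface:

```python
from collections import deque

class ToyCompletionQueue:
    """Collects completion notices from multiple VIs so one consumer
    can poll a single place instead of every VI's queues."""
    def __init__(self):
        self._done = deque()

    def notify(self, vi_id, descriptor):
        # Called by the (simulated) NIC when a descriptor completes.
        self._done.append((vi_id, descriptor))

    def poll(self):
        # Non-blocking check; in the real design this involves no
        # interrupt and no transition to kernel mode.
        return self._done.popleft() if self._done else None

cq = ToyCompletionQueue()
cq.notify("vi-0", {"op": "send", "length": 8192})
cq.notify("vi-1", {"op": "recv", "length": 16384})
first = cq.poll()
```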
To get a sense of expected message passing latencies,
measurements of all system interconnections under consideration were
made by running programs which used MPI to pass data blocks of various
sizes from one process to another. The speed of data passing in these
tests should be similar to the speeds encountered in actual use by
other programs. Figures 4 and 5 show the performance of TCP/IP and
ServerNet, respectively, as throughput vs. the size of data blocks
transferred.
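The measurement pattern is a standard ping-pong test: time many round trips of a fixed-size block and divide. A transport-agnostic sketch follows; the in-memory loopback queue stands in for the MPI send and receive calls actually used in the tests:

```python
import queue
import time

def ping_pong_latency(send, recv, size, iters=200):
    """Estimate one-way latency by timing iters round trips of a
    size-byte block over whatever transport send/recv wrap."""
    block = b"\x00" * size
    start = time.perf_counter()
    for _ in range(iters):
        send(block)   # "ping"
        recv()        # "pong"
    elapsed = time.perf_counter() - start
    return elapsed / (2 * iters)  # halve round-trip time per iteration

# In-memory loopback as a stand-in transport for demonstration.
q = queue.Queue()
latency = ping_pong_latency(q.put, q.get, size=8192)
```

Averaging over many iterations amortizes timer resolution and startup effects, which matters when the per-message times being compared differ by an order of magnitude or more.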
The following table summarizes
the results in terms of total message passing latency:
Latency by data block size:

                   8192 Bytes     16384 Bytes    32768 Bytes
TCP/IP             1.07x10^-7 s   1.95x10^-7 s   3.68x10^-7 s
ServerNet          7.64x10^-9 s   1.18x10^-8 s   1.74x10^-8 s
Shared Memory      2.07x10^-9 s   2.65x10^-9 s   3.69x10^-9 s
Garth Brown
2001-01-26