This thesis examines performance issues in two implementations of parallel algorithms. The aspect examined in most depth is the optimal mapping of a virtual topology to physical resources at the level of processes using MPI. Theoretical predictions are compared to performance tests of these algorithms for several configurations of a two-workstation system that can use either ServerNet or Ethernet (10BaseT) as the system interconnect. Both algorithms take a divide-and-conquer approach to processing large amounts of data, breaking up an initial large array of data and distributing it across physical processing nodes.
The algorithms use a tree of logical tasks, which are the virtual subdivisions of the overall task. Processing one task at a given level of the tree can take place independently of the other tasks at the same level, which makes divide-and-conquer algorithms particularly appropriate for distributed processing on a cluster. Logical tasks are mapped to processes, or to functions within processes, and determining the best way to assign those processes to processors is the key to making the best use of a system. The results show the importance of using a high-speed interconnection network such as a ServerNet gigabit network.
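The tree mapping can be illustrated with a minimal sketch, an assumption-laden outline rather than the thesis code: MPI ranks are arranged implicitly as a binary tree, with rank 0 at the root and rank r having children 2r+1 and 2r+2, while the assignment of ranks to workstations is left to the MPI launcher (for example, the order of hosts given to mpirun), so the same source runs under every placement being compared.

    /* Sketch: a binary tree of logical tasks laid over MPI ranks.
     * Rank 0 is the root; rank r has children 2r+1 and 2r+2.
     * Physical placement is decided outside the program by the
     * launcher, so one executable serves all tested mappings. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int parent = (rank == 0) ? MPI_PROC_NULL : (rank - 1) / 2;
        int left   = (2 * rank + 1 < size) ? 2 * rank + 1 : MPI_PROC_NULL;
        int right  = (2 * rank + 2 < size) ? 2 * rank + 2 : MPI_PROC_NULL;

        printf("rank %d: parent %d, children %d and %d\n",
               rank, parent, left, right);
        MPI_Finalize();
        return 0;
    }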
Given the processing and data-passing patterns of a program and a particular hardware configuration, it is possible to predict which mappings of logical processes to physical processors will perform best [DB99]. Both processor utilization and data-passing overhead must be taken into consideration.
While assigning one of the four identical processes at the bottom of a tree to each processor of a four-processor cluster may constitute a balanced load and maximum parallelism, it does not necessarily result in the best performance. At the same time, running two processes on one processor may take much longer than the sum of their individual execution times, so the benefit of assigning a process to a processor across even a slow network may be greater than expected.
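To make that tradeoff concrete, this kind of prediction can be framed as a first-order cost model. The sketch below is a hypothetical construction for illustration, with every parameter name and value assumed rather than measured: per-element compute time, per-message latency, per-byte transfer time, and a crowding factor for processes sharing a processor.

    /* Hypothetical first-order cost model (illustration only):
     * total time = crowding * compute + messages * latency
     *            + bytes * per-byte cost.  The crowding factor
     * is greater than 1 when processes share a processor. */
    #include <stdio.h>

    double predict(double n_elems, double t_elem,   /* compute cost      */
                   double n_msgs,  double latency,  /* messages sent     */
                   double n_bytes, double t_byte,   /* bytes transferred */
                   double crowding)                 /* >1 if co-located  */
    {
        return crowding * n_elems * t_elem
             + n_msgs * latency + n_bytes * t_byte;
    }

    int main(void)
    {
        /* Invented numbers: a 10BaseT-class link moves roughly one
         * byte per 0.8 microseconds, so shipping half the work away
         * can cost more than computing everything locally. */
        double remote = predict(25000, 1e-6, 2, 1e-3, 200000, 8e-7, 1.0);
        double local  = predict(50000, 1e-6, 0, 0.0,  0.0,    0.0, 2.0);
        printf("remote: %g s, local: %g s\n", remote, local);
        return 0;
    }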
The first algorithm examined is merge-sort, which involves dividing an array in two and distributing the sub-arrays to the nodes of a binary tree topology. The issues considered for the performance of this algorithm include coding techniques, compiler options, process distribution, and ServerNet versus Ethernet as the system interconnect. Compiling a release build targeted at the right processor is shown to reduce execution time by 66%. Load-balancing predictions from theory accurately identified the optimal configuration and sort times. The best performance for sorting a 50,000-element array, 0.032 s, was achieved by using ServerNet between the workstations, assigning half of the processes below the root of the virtual process topology to each workstation, and placing each child process on the same workstation as its parent. The same allocation of processes with only Ethernet between the workstations was much slower, taking 0.21 s for the sort. Rather than distributing work over Ethernet, it was much faster to keep all processes on one computer, which performed the sort in 0.049 s.
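As a rough illustration of the communication pattern, and not the thesis implementation, the sketch below has the root ship half of the array to a second rank, sort its own half locally, and merge the two sorted halves when the remote half returns; the measured program extends this scheme down a deeper process tree. Run it with at least two ranks, e.g. mpirun -np 2.

    /* Sketch of distributed merge-sort between two MPI ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    static void merge(const int *a, int na, const int *b, int nb, int *out)
    {
        int i = 0, j = 0, k = 0;
        while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na) out[k++] = a[i++];
        while (j < nb) out[k++] = b[j++];
    }

    int main(int argc, char **argv)
    {
        int rank;
        const int N = 50000, half = N / 2;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int *data   = malloc(N * sizeof *data);
            int *sorted = malloc(N * sizeof *sorted);
            for (int i = 0; i < N; i++) data[i] = rand();

            MPI_Send(data + half, half, MPI_INT, 1, 0, MPI_COMM_WORLD);
            qsort(data, half, sizeof *data, cmp);     /* sort local half */
            MPI_Recv(data + half, half, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);              /* remote half back */
            merge(data, half, data + half, half, sorted);
            printf("min %d, max %d\n", sorted[0], sorted[N - 1]);
            free(data); free(sorted);
        } else if (rank == 1) {
            int *buf = malloc(half * sizeof *buf);
            MPI_Recv(buf, half, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            qsort(buf, half, sizeof *buf, cmp);       /* sort remote half */
            MPI_Send(buf, half, MPI_INT, 0, 0, MPI_COMM_WORLD);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }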
The second algorithm examined is quadtree compression. Predictions from theory were less accurate here: assigning two processes to a processor turned out to make calculations take four times as long, rather than the expected twice as long. Again, the fastest average time, 0.22 s, was achieved by using ServerNet between the workstations and assigning two leaf processes of the virtual process topology to each workstation. The same allocation of processes took 0.50 s over Ethernet. Contrary to calculations, which predicted processing on a single workstation to be slightly faster than the Ethernet arrangement, the single-workstation run took 0.96 s.
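The quadtree subdivision itself can be sketched briefly; the toy version below uses assumed details rather than the thesis code. It recursively splits a square greyscale image, stores any uniform quadrant as a single value, and stops at single pixels; the four top-level quadrants correspond to the leaf tasks that were distributed across processes.

    /* Toy quadtree compression of a small greyscale image. */
    #include <stdio.h>

    #define DIM 8
    static unsigned char img[DIM][DIM];

    /* Emits one node per uniform block; returns node count. */
    static int compress(int x, int y, int size)
    {
        unsigned char first = img[y][x];
        int uniform = 1;
        for (int j = y; j < y + size && uniform; j++)
            for (int i = x; i < x + size; i++)
                if (img[j][i] != first) { uniform = 0; break; }

        if (uniform || size == 1) {
            printf("leaf (%d,%d) size %d value %d\n", x, y, size, first);
            return 1;
        }
        int h = size / 2;                 /* recurse into quadrants */
        return 1 + compress(x, y, h)      + compress(x + h, y, h)
                 + compress(x, y + h, h)  + compress(x + h, y + h, h);
    }

    int main(void)
    {
        for (int j = 0; j < DIM; j++)     /* left half dark, right light */
            for (int i = 0; i < DIM; i++)
                img[j][i] = (i < DIM / 2) ? 0 : 255;
        printf("nodes emitted: %d\n", compress(0, 0, DIM));
        return 0;
    }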
Optimizing performance on a distributed system is an enormously complex problem. A good theoretical basis is needed to determine the best mapping of the virtual process topology to hardware accurately, while enough simplifying assumptions must be made to keep the problem tractable. At the same time, it is important to test the theory, providing either verification or situations that may lead to improved models in the future. MPI makes it easy to compare different mappings of a virtual process topology to the nodes of a physical network topology. The most important results of the research discussed in this thesis show the value of a fast network with low processing overhead, such as ServerNet, in taking advantage of distributed processing.