BOSTON UNIVERSITY
COLLEGE OF ENGINEERING
Thesis
OPTIMIZATION OF DISTRIBUTED APPLICATIONS IN A CLUSTER COMPUTING ENVIRONMENT
by
GARTH WEAVER BROWN


B.A., Haverford College, 1993
Submitted in partial fulfillment of the
requirements for the degree of
Master of Science
2001

Approved by

First Reader
Dimiter Avresky, Ph.D.
Advisor
Professor of Electrical and Computer Engineering
Northeastern University


Second Reader
Jeff Carruthers, Ph.D.
Professor of Electrical and Computer Engineering
Boston University
ACKNOWLEDGMENTS

Many thanks to Prof. Dimiter Avresky for providing support and advice in completing this research and for helping me through the Master's program at Boston University. I would especially like to thank my wonderful wife, Joni, for her much-needed support through my degree despite the delays, long hours, and late nights.
OPTIMIZATION OF DISTRIBUTED APPLICATIONS IN A CLUSTER COMPUTING ENVIRONMENT
GARTH WEAVER BROWN

Abstract:

This thesis examines performance issues in two implementations of parallel programming algorithms. The aspect examined in most depth is the optimal mapping of a virtual topology to physical resources at the level of processes using MPI. Theoretical predictions are compared with performance tests of these algorithms for several configurations of a two-workstation system that can use either ServerNet or Ethernet (10BaseT) as the system interconnect. Both algorithms take a divide-and-conquer approach to processing large amounts of data, breaking up a large initial array of data and distributing it across physical processing nodes. The algorithms use a tree of logical tasks, which are the virtual subdivisions of the overall task. Processing one task at a given level of the tree can take place independently of the other tasks at the same level, which makes divide-and-conquer algorithms particularly appropriate for distributed processing on a cluster. Logical tasks are mapped to processes, or to functions within processes. Determining the best way to assign those processes to processors is the key to making the best use of a system. The results show the importance of using a high-speed interconnection network such as a ServerNet gigabit network.
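To make the tree-of-tasks structure concrete, a minimal sketch in C with MPI follows. It is illustrative only, not code from the thesis: the rank arithmetic (children of rank r at 2r+1 and 2r+2) and the trivial per-node "work" are assumptions chosen to show how a binary tree of logical tasks can be mapped onto MPI processes, with partial results combined up the tree.

/* Minimal sketch (not from the thesis): a binary tree of logical
 * tasks mapped onto MPI process ranks.  Rank r's children are
 * 2r+1 and 2r+2; its parent is (r-1)/2.  Each process holds one
 * value standing in for local work, and partial sums flow up the
 * tree to rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = 2 * rank + 1;          /* child ranks in the tree */
    int right = 2 * rank + 2;
    long sum  = rank;                  /* stand-in for local work */

    long child_sum;
    if (left < size) {                 /* combine results from below */
        MPI_Recv(&child_sum, 1, MPI_LONG, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += child_sum;
    }
    if (right < size) {
        MPI_Recv(&child_sum, 1, MPI_LONG, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += child_sum;
    }
    if (rank != 0)                     /* pass the partial result up */
        MPI_Send(&sum, 1, MPI_LONG, (rank - 1) / 2, 0, MPI_COMM_WORLD);
    else
        printf("tree sum over %d ranks: %ld\n", size, sum);

    MPI_Finalize();
    return 0;
}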
Knowing the processing and data-passing patterns of a program, it is possible to predict, for a given hardware configuration, which mappings of logical processes to physical processors will perform best [DB99]. It is important to take both processor utilization and data-passing overhead into consideration. While assigning one of the four identical processes at the bottom of a tree to each processor of a four-processor cluster may constitute a balanced load and maximum parallelism, it does not necessarily result in the best performance. At the same time, it is important to realize that running two processes on one processor may take much longer than the sum of their individual execution times, so the benefit of assigning a process to a processor across even a slow network may be greater than expected.
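The trade-off can be sketched with a toy cost model. The numbers, the slowdown factor for co-located processes, and the two-transfer communication cost below are illustrative assumptions, not measurements or formulas from the thesis; they only show the shape of the comparison.

/* Toy model of the mapping trade-off described above (all values
 * are illustrative assumptions).  Co-locating two processes on
 * one processor is modeled as a multiplicative slowdown on their
 * combined compute time; placing one remotely lets the two run in
 * parallel but adds the cost of crossing the network twice. */
#include <stdio.h>

int main(void)
{
    double t_proc   = 0.10;  /* compute time of one leaf process (s) */
    double slowdown = 1.5;   /* penalty factor for sharing a CPU     */
    double t_comm   = 0.05;  /* one-way transfer time over the link  */

    /* Option A: both processes share one processor. */
    double t_shared = slowdown * (t_proc + t_proc);

    /* Option B: one process runs across the network. */
    double t_remote = t_proc + 2.0 * t_comm;

    printf("co-located: %.3f s, distributed: %.3f s -> %s wins\n",
           t_shared, t_remote,
           t_remote < t_shared ? "distributed" : "co-located");
    return 0;
}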
The first algorithm examined is merge-sort, which involves dividing an array in two and distributing the sub-arrays to the nodes of a binary tree topology. The issues considered for the performance of this algorithm include coding techniques, compiler options, process distribution, and ServerNet vs. Ethernet as the system interconnect. Setting the compiler for a release build targeting the right processor is shown to reduce execution time by 66%. Theoretical load-balancing predictions accurately identified the optimal configuration and sort times. The best performance for sorting a 50,000-element array, 0.032s, was achieved by using ServerNet between the workstations, assigning half of the processes below the root process in the virtual process topology to each workstation, and assigning their children to the same workstation as their parents. Using the same allocation of processes with only Ethernet between the workstations was much slower, taking 0.21s for the sort. Rather than distributing work over Ethernet, it was much faster to keep all processes on one computer, which performed the sort in 0.049s.
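The combining step at each internal node of the tree is an ordinary two-way merge. The sketch below is illustrative, not the thesis implementation: in the distributed version the two sorted halves would arrive from child processes via MPI_Recv, whereas here they are plain local arrays.

/* Minimal sketch (illustrative only): the merge step a parent
 * node performs after its two children return sorted sub-arrays. */
#include <stdio.h>

/* Merge sorted a[0..na) and b[0..nb) into out[0..na+nb). */
static void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

int main(void)
{
    int left[]  = {1, 4, 7};           /* sorted result from child 1 */
    int right[] = {2, 3, 9};           /* sorted result from child 2 */
    int merged[6];

    merge(left, 3, right, 3, merged);
    for (int k = 0; k < 6; k++)
        printf("%d ", merged[k]);
    printf("\n");
    return 0;
}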
The second algorithm examined is quadtree compression. Predictions from theory were less accurate here: assigning two processes to a processor turned out to make calculations take four times as long, rather than the expected twice as long. Again the fastest average time, 0.22s, was achieved using ServerNet between the workstations and assigning two leaf processes of the virtual process topology to each workstation. The same allocation of processes took 0.50s over Ethernet. Contrary to calculations that predicted processing time on a single workstation would be slightly less than the aforementioned arrangement over Ethernet, processing took 0.96s.
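For readers unfamiliar with the technique, a minimal sketch of quadtree subdivision follows. It shows the general idea (recursively split a square image into quadrants, stopping where a block is uniform), not the thesis implementation; the 8x8 toy image and the uniformity test are assumptions for illustration. Each recursion level corresponds to one level of the logical task tree, so the four quadrants at a given level can be farmed out to different processes.

/* Minimal sketch of quadtree compression (illustrative only):
 * count the blocks needed to represent a square image, splitting
 * a block into four quadrants unless all its pixels are equal. */
#include <stdio.h>

#define N 8
static unsigned char img[N][N];   /* toy 8x8 image */

/* Blocks needed for the square of side `size` at corner (y, x). */
static int quadtree(int y, int x, int size)
{
    int uniform = 1;
    for (int i = y; i < y + size && uniform; i++)
        for (int j = x; j < x + size; j++)
            if (img[i][j] != img[y][x]) { uniform = 0; break; }

    if (uniform || size == 1)
        return 1;                       /* stored as a single block */

    int h = size / 2;                   /* recurse into 4 quadrants */
    return quadtree(y,     x,     h) + quadtree(y,     x + h, h)
         + quadtree(y + h, x,     h) + quadtree(y + h, x + h, h);
}

int main(void)
{
    for (int i = 0; i < N; i++)         /* fill: left half 0, right 1 */
        for (int j = 0; j < N; j++)
            img[i][j] = (j >= N / 2);

    printf("blocks needed: %d\n", quadtree(0, 0, N));
    return 0;
}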
Optimizing performance on a distributed system is an enormously complex problem. A sound theoretical basis is needed to accurately determine the best mapping of the virtual process topology to hardware, while making enough simplifying assumptions that the problem remains solvable. At the same time, it is important to test the theory, providing either verification or situations that may lead to improved models in the future. MPI makes it easy to compare different mappings of a virtual process topology to the nodes of a physical network topology. The most important results of the research discussed in this thesis show the value of a fast network with low processing overhead, such as ServerNet, for taking advantage of distributed processing.



 