
Experimental Results

The virtual process topology used in the implementation of this algorithm has one process as the root and four processes on the next level down, each communicating only with the root. Lower levels of the tree were formed by simple function calls. For our two-workstation, four-processor system, the fastest configuration was to run the root and two other processes on one workstation and the remaining two processes on the other. Contrary to the theoretical predictions, allocating more than one process to a processor increased execution time to far more than the sum of the individual execution times. Allocating all processes to one workstation was expected to take about half as long as the ServerNet allocation, but instead took more than four times as long. A possible explanation is that data caching and read-ahead are strongly affected by the quadtree algorithm's data access patterns. Whereas merge-sort accesses data linearly in an array, quadtree compression accesses data that is more widely scattered. That could lead to more frequent trips to the memory bus, and in a dual-processor system contention for the memory bus could greatly slow execution.
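A minimal sketch of this topology, assuming a five-rank MPI job, is given below; the buffer sizes and the function compress_quadrant are illustrative stand-ins, not the thesis's actual code. Rank 0 is the root, and ranks 1 through 4 each compress one quadrant and return the result.

\begin{verbatim}
/* Hypothetical sketch of the five-process tree: rank 0 is the
 * root, ranks 1-4 each compress one quadrant and send it back.
 * Lower levels of the tree would be plain function calls. */
#include <mpi.h>
#include <stdlib.h>

#define N    256                  /* side of the square input array */
#define QUAD ((N / 2) * (N / 2))  /* elements in one quadrant       */

/* Stand-in for the real recursive compression; it copies the data
 * unchanged so the sketch runs end to end, and returns the number
 * of elements produced. */
static int compress_quadrant(const int *in, int *out)
{
    for (int i = 0; i < QUAD; i++)
        out[i] = in[i];
    return QUAD;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* root of the tree       */
        int *buf = calloc(QUAD, sizeof(int));
        for (int w = 1; w <= 4; w++)      /* one quadrant per child */
            MPI_Send(buf, QUAD, MPI_INT, w, 0, MPI_COMM_WORLD);
        for (int w = 1; w <= 4; w++) {    /* gather compressed data */
            MPI_Status st;
            MPI_Recv(buf, QUAD, MPI_INT, w, 0, MPI_COMM_WORLD, &st);
        }
        free(buf);
    } else if (rank <= 4) {               /* second-level processes */
        int *in  = calloc(QUAD, sizeof(int));
        int *out = calloc(QUAD, sizeof(int));
        MPI_Status st;
        MPI_Recv(in, QUAD, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
        int len = compress_quadrant(in, out);
        MPI_Send(out, len, MPI_INT, 0, 0, MPI_COMM_WORLD);
        free(in); free(out);
    }
    MPI_Finalize();
    return 0;
}
\end{verbatim}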

  
Figure 19: Performance comparison of different system interconnects.
[figure: 256byInterconnectQuad.eps]

This was the best configuration when using either Ethernet or ServerNet, and it represents the maximum degree of parallelism for this cluster: while the root process is blocked waiting for the processes below it, there is one MPI process per processor. This physical distribution of processes also passes the minimum necessary amount of data over the network. Tests were also done with all processes on one workstation. Message passing is very fast in that case, since messages are passed through shared memory, but with only two processors per workstation, only two processes can run concurrently. Because the overhead of distributing processing across the network, even over Ethernet, is lower than the time needed to run the two extra processes, passing data to two processes on the other workstation results in faster overall times.
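When experimenting with placements like this, it helps to confirm where each rank actually landed. A short sketch using only standard MPI calls:

\begin{verbatim}
/* Print which host each rank is running on, to verify a
 * placement such as the 3-and-2 split described above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("process %d running on %s\n", rank, host);
    MPI_Finalize();
    return 0;
}
\end{verbatim}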

  
Figure 20: Processes in white are running on workstation 0, those in green on workstation 1.
[figure: QuadTopo.eps]

A multi-threaded approach was also tried, in the hope that making each node on the lower levels of the tree a thread would allow more efficient processor scheduling. However, the overhead of spawning even a few threads outweighed any benefits: the multi-threaded version took more than three seconds to compress a 256x256 element array over ServerNet, compared to about a fifth of a second when simple function calls were used instead.
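The contrast between the two strategies can be sketched as follows; the quadtree work itself is elided, and the payload function subtree is hypothetical:

\begin{verbatim}
/* Two ways to visit the four children of a quadtree node. */
#include <pthread.h>

static void *subtree(void *arg)
{
    (void)arg;
    /* ... compress one subtree ... */
    return NULL;
}

/* Strategy 1: plain function calls (what performed best). */
static void by_calls(void)
{
    for (int i = 0; i < 4; i++)
        subtree(NULL);
}

/* Strategy 2: one thread per child; the create/join overhead
 * is paid again at every level of the tree. */
static void by_threads(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, subtree, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
}

int main(void)
{
    by_calls();
    by_threads();
    return 0;
}
\end{verbatim}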

  
Figure 21: Performance comparison of topologies in diagram above.
[figure: 256OneNodeMangled.eps]

One important result for implementing this algorithm with distributed processing is that total execution time is shorter for more compressible data. To assess the dependence of processing time on data compressibility, the quadtree compression program was tested with several combinations of process distribution and system interconnect; around 1000 arrays of varying compressibility were generated for each configuration.
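The text does not describe the generator used, but one simple way to produce arrays of controllable compressibility is to hold values constant over square tiles, as in this hypothetical sketch: larger tiles give more compressible input for the quadtree.

\begin{verbatim}
/* Hypothetical test-data generator: values are constant over
 * block x block tiles, so block = 1 is essentially
 * incompressible and larger powers of two (dividing N) are
 * increasingly compressible. */
#include <stdlib.h>

#define N 256

static int a[N][N];

void make_array(int block)
{
    for (int i = 0; i < N; i += block)
        for (int j = 0; j < N; j += block) {
            int v = rand() % 256;           /* one value per tile */
            for (int di = 0; di < block; di++)
                for (int dj = 0; dj < block; dj++)
                    a[i + di][j + dj] = v;
        }
}

int main(void)
{
    make_array(1);    /* essentially incompressible */
    make_array(32);   /* highly compressible        */
    return 0;
}
\end{verbatim}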

  
Figure 22: Performance comparison of topologies in diagram above.
[figure: 256byTopoMangled00111.eps]

Figure 19 shows the results of individual runs compressing 256x256 element arrays; similar results were obtained with 128x128 and 64x64 arrays. The y-axis shows total compression time, and the x-axis shows the compressibility of the array, expressed as the ratio of compressed size to original size (e.g. the array shown above has a compression ratio of 0.24, and compression from 65536 elements to 32768 elements would give a compression ratio of 0.5). The cluster of points near the top of the graph, labeled One Node, comes from the slower runs with all processes on the same workstation. The cluster near the middle comes from runs with processes split equally between the two nodes using Ethernet as the system interconnect, and the cluster of fastest runs near the bottom comes from the same process distribution using ServerNet.
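As an illustration of how such a ratio arises (a sketch of the idea, not the thesis's implementation): a uniform n x n block collapses to a single element, so counting the elements that survive and dividing by the original count gives the ratio directly.

\begin{verbatim}
/* Count the elements a quadtree representation keeps. */
#include <stdio.h>

#define N 256

static int a[N][N];   /* all zeros: maximally compressible */

/* Elements needed for the n x n block at (r,c): 1 if the block
 * is uniform, otherwise the sum over its four quadrants. */
static int quad_count(int r, int c, int n)
{
    if (n == 1)
        return 1;
    int h = n / 2;
    int sub = quad_count(r, c, h) + quad_count(r, c + h, h)
            + quad_count(r + h, c, h) + quad_count(r + h, c + h, h);
    /* If each quadrant collapsed to one element it is uniform,
     * so equal corner values mean the whole block is uniform. */
    if (sub == 4 && a[r][c] == a[r][c + h]
                 && a[r][c] == a[r + h][c]
                 && a[r][c] == a[r + h][c + h])
        return 1;
    return sub;
}

int main(void)
{
    int kept = quad_count(0, 0, N);
    printf("compression ratio: %.6f\n", (double)kept / (N * N));
    return 0;
}
\end{verbatim}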

  
Figure 23: Performance comparison of topologies in diagram above.
[figure: 256byTopoMangled01000.eps]

All three setups show a tendency to perform the compression faster when the data is more highly compressible. This is likely because when a sub-array is compressed there is less data to send back to higher levels of the tree, and thus less data to pass over the network to the root node. This would explain why, when the program is run over a slower network such as Ethernet, the decrease in processing time for more compressible data is especially noticeable. Another factor may be that packing compressible and incompressible data into a cohesive array is a much slower operation than simply taking four elements and creating one. Unlike the sorting algorithm, quadtree compression's processing time dominates its data passing time even over Ethernet, so distributing the processing does improve performance compared to execution on a single node.

Figures 20 through 24 show several process topologies that were tried and the performance achieved by each. Processes are labeled 0 through 4, and the notation below each figure indicates which workstation each process runs on; for example, 01000 means process 1 is on workstation 1 and processes 0, 2, 3, and 4 are on workstation 0.
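Decoding the notation is straightforward, as this small sketch shows (the string "01000" is one of the topologies from the figures):

\begin{verbatim}
/* Digit i of the topology string names the workstation
 * hosting process i. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *topo = "01000";
    for (size_t i = 0; i < strlen(topo); i++)
        printf("process %zu -> workstation %c\n", i, topo[i]);
    return 0;
}
\end{verbatim}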
As expected, execution times over ServerNet are the fastest. Because processing time plays a much more dominant role than data passing time over ServerNet, there is no noticeable difference in total execution time between compressible and incompressible data. Comparing the very similar processing times for the most compressible and the incompressible data, computation on a single node takes about four times as long as computation distributed over ServerNet. Compared to Ethernet, on the other hand, computation on a single node takes about 2.5 times as long for very compressible data and about 1.8 times as long for incompressible data. The performance improvement from distributing parallel processing thus depends strongly on the ratio of the overhead of distributing the processing to the time required to perform it.
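A hedged way to summarize this dependence (the notation here is ours, not the thesis's): if $T_1$ is the single-node computation time, $p$ the number of processors, and $T_c$ the overhead of distributing the work, then

\begin{displaymath}
\mbox{speedup} \approx \frac{T_1}{T_1/p + T_c},
\end{displaymath}

so the gain shrinks as $T_c$ grows relative to $T_1/p$, which is consistent with the larger single-node-to-distributed gap observed over ServerNet than over Ethernet.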

  
Figure 24: Performance comparison of topologies in diagram above.
[figure: 256byTopoMangled00011.eps]


Garth Brown
2001-01-26