The virtual process topology used in the implementation of this
algorithm has one process as a root, and four processes on the next
level down each communicating only with the root. Lower parts of the
tree were formed by simple function calls. For our two-workstation,
four-processor system, the fastest configuration was to run the root
and two other processes on one workstation, and two processes on the
other. Contrary to the theoretical predictions, allocating more than
one process to a processor increased execution time to far more than
the sum of the individual execution times. Allocating all processes
to one workstation was expected to take about half as long as the
ServerNet allocation, but instead took more than four times as long. A
possible explanation is that data caching and read-ahead are strongly
affected by the quadtree algorithm's data access patterns. While
merge-sort accesses data linearly in an array, the quadtree algorithm
accesses data that is more widely scattered. That could lead to more
frequent accesses to the memory bus, and in a dual-processor system
contention for the memory bus could greatly slow execution.
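A minimal sketch of this topology is given below, written in C with MPI (the paper states that MPI processes are used, but the code here is an illustration, not the original implementation). Rank 0 acts as the root, ranks 1 through 4 each receive one quadrant and communicate only with the root, and deeper levels of the tree are handled by ordinary function calls inside each process. The integer element type, buffer sizes, and the compress_quadrant() routine are assumptions made for the sketch.

/* Sketch of the virtual process topology described above (illustrative only).
 * Rank 0 is the root; ranks 1-4 each compress one quadrant and talk only to
 * the root. compress_quadrant() stands in for the real recursive routine,
 * which descends the lower levels of the tree by plain function calls. */
#include <mpi.h>
#include <stdlib.h>

#define N 256                         /* side length of the square input array */

/* stub for the real recursive compressor; here it just copies its input */
static int compress_quadrant(const int *quad, int side, int *out)
{
    for (int i = 0; i < side * side; i++)
        out[i] = quad[i];
    return side * side;               /* number of elements produced */
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* five processes expected: 0..4 */

    int quad_elems = (N / 2) * (N / 2);
    int *quad = calloc(quad_elems, sizeof(int));
    int *out  = calloc(quad_elems, sizeof(int));

    if (rank == 0) {
        /* root: hand one quadrant to each worker (quadrant extraction from
         * the full array is omitted here) ... */
        for (int w = 1; w <= 4; w++)
            MPI_Send(quad, quad_elems, MPI_INT, w, 0, MPI_COMM_WORLD);
        /* ... then block until each worker returns its compressed quadrant */
        for (int w = 1; w <= 4; w++)
            MPI_Recv(out, quad_elems, MPI_INT, w, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(quad, quad_elems, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        int n = compress_quadrant(quad, N / 2, out);
        MPI_Send(out, n, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    free(quad);
    free(out);
    MPI_Finalize();
    return 0;
}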
Figure 19: Performance comparison of different system interconnects.
Running the root and two other processes on one workstation and the
remaining two on the other was the best configuration when using either
Ethernet or ServerNet, and represents the maximum degree of parallelism
for this cluster. In this configuration, while the root process is
blocked waiting for the processes below it, there is one MPI process per
processor. This physical distribution of processes minimizes the amount
of data that must pass over the network. Tests were also done with all
processes on one workstation. Message passing is very fast in that case,
of course, since messages are passed through shared memory. But since
there are only two processors per workstation, only two processes can
run concurrently. Because the overhead of distributing work across the
network, even over Ethernet, is lower than the time needed to run the
two extra processes locally, passing data to two processes on the other
workstation results in faster overall times.
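A simple way to confirm which workstation each MPI process actually lands on is to have every rank report its host name. The sketch below is an illustration of that check, not part of the original test programs; it uses only the standard MPI_Get_processor_name call.

/* Report the mapping of MPI ranks to workstations (illustrative sketch,
 * not part of the original test programs). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* e.g. "process 1 is running on workstation1" */
    printf("process %d is running on %s\n", rank, host);

    MPI_Finalize();
    return 0;
}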
Figure 20: Processes in white are running on workstation 0, those in green on workstation 1.
A multi-threaded approach was also tried, in the hope that making each
node on the lower levels of the tree a thread would allow more efficient
processor scheduling. However, it appears that the overhead of spawning
even a few threads outweighed any benefits. The multi-threaded version
of the program took more than three seconds to compress a 256x256
element array using ServerNet, compared to about a fifth of a second
when simple function calls were used instead.
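The two alternatives compared here can be sketched as follows. The sketch assumes POSIX threads and a placeholder compress_node() routine, and is not the authors' code; in the threaded variant a thread is created and joined for each child node, and it is that creation and scheduling overhead which appears to swamp any gain.

/* Illustrative comparison of the two approaches for the lower tree levels:
 * spawning a thread per child node versus a plain function call.
 * compress_node() is a placeholder for the real recursive routine. */
#include <pthread.h>

struct node_arg { const int *data; int side; int result; };

static void compress_node(struct node_arg *a)
{
    a->result = a->side * a->side;    /* placeholder for real compression */
}

static void *thread_entry(void *p)
{
    compress_node((struct node_arg *)p);
    return NULL;
}

/* multi-threaded variant: one thread per child node */
static void compress_children_threaded(struct node_arg args[4])
{
    pthread_t tid[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, thread_entry, &args[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);   /* creation/join overhead paid here */
}

/* simple-function-call variant used in the final version */
static void compress_children_serial(struct node_arg args[4])
{
    for (int i = 0; i < 4; i++)
        compress_node(&args[i]);      /* no scheduling overhead */
}

int main(void)
{
    struct node_arg args[4] = { {0, 128, 0}, {0, 128, 0}, {0, 128, 0}, {0, 128, 0} };
    compress_children_threaded(args);
    compress_children_serial(args);
    return 0;
}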
Figure 21: Performance comparison of topologies in diagram above.
One important result to notice for implementing this algorithm with
distributed processing is that total execution time is shorter for
more compressible data. To assess the dependence of processing time on
data compressibility, a quadtree compression program was tested with
several combinations of process distribution and system interconnect.
Around 1000 arrays of varying compressibility
were generated for each configuration.
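The text does not describe how these test arrays were produced. One plausible way to generate data with a controllable compression ratio, sketched below purely as an assumption, is to tile the array with uniform blocks of a chosen size, so that larger blocks yield larger uniform regions and therefore more compressible input for the quadtree.

/* Hypothetical generator for test arrays of varying compressibility
 * (an assumption; the original test data generator is not described).
 * Larger block sizes give larger uniform regions, which the quadtree
 * can collapse, so the array compresses better. */
#include <stdlib.h>

#define N 256

/* fill an N x N array with uniform blocks of side block_side (a power of 2) */
static void make_test_array(int *a, int block_side)
{
    for (int by = 0; by < N; by += block_side)
        for (int bx = 0; bx < N; bx += block_side) {
            int value = rand() % 256;            /* one value per block */
            for (int y = by; y < by + block_side; y++)
                for (int x = bx; x < bx + block_side; x++)
                    a[y * N + x] = value;
        }
}

int main(void)
{
    static int a[N * N];
    /* block_side = 1 gives essentially uncompressible noise;
     * block_side = 32 gives highly compressible data */
    for (int block_side = 1; block_side <= 32; block_side *= 2)
        make_test_array(a, block_side);
    return 0;
}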
Figure 22: Performance comparison of topologies in diagram above.
In terms of software performance, Figure 19 shows the
results for individual runs compressing 256x256 element arrays.
Similar results were obtained with 128x128 and 64x64 arrays. The
y-axis represents total compression time, and the x-axis represents
the compressibility of the array, expressed as the ratio of the
compressed size to the original size (e.g. the above array has a
compression ratio of .24, and compression from 65536 elements to 32768
elements would have a compression ratio of .5). The cluster of points
near the top of the graph, labeled One Node, is from the slower runs
with all processes running on the same workstation. The cluster of
points near the middle of the graph is from runs with processes
equally distributed between the two nodes, and using Ethernet as the
system interconnect. And the cluster of fastest runs near the bottom
is from runs with processes equally distributed
between the two nodes, and using ServerNet as the system interconnect.
Figure 23: Performance comparison of topologies in diagram above.
All three setups show a tendency to perform the compression faster
when the data is more highly compressible. This is likely to be
because when a sub-array is compressed, there is less data to send
back to higher levels of the tree, and thus less data to be passed
over the network to the root node. This would explain the observation
that when the program is run over a slower network, e.g. Ethernet,
there is an especially noticeable decrease in processing time for more
compressible data. Another factor may be that packing compressible and
uncompressible data into a cohesive array is a much slower operation
than simply taking four elements and creating one element. Unlike the
sorting algorithm, for quadtree compression processing time dominates
data passing time even using Ethernet, so distributing processing does
improve performance compared to execution on a single node. Figures
20 through 24 show several process
topologies that were tried, and the performance achieved by each.
Processes are labeled 0 through 4, and the notation below each figure
refers to which workstation each process is running on, for example
01000 means process 1 is on workstation 1, and processes 0, 2, 3, and
4 are on
workstation 0.
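The cost difference noted earlier, between collapsing four identical elements into one and packing mixed results into a cohesive array, can be seen in a sketch of the quadtree merge step. The representation below (a value plus a leaf flag, with structure markers omitted) is an illustrative assumption, not the format used in the original program.

/* Illustrative sketch of the quadtree merge step discussed above.
 * If all four children are uniform leaves with the same value, they collapse
 * into a single element; otherwise their (longer) encodings must be packed
 * into one contiguous buffer, which is the slower case for the algorithm.
 * The encoding used here is an assumption, not the original format. */
#include <string.h>

struct quad {
    int is_leaf;      /* 1 if the region is uniform */
    int value;        /* value of a uniform region */
    int len;          /* number of ints in the packed encoding */
    int data[4096];   /* packed encoding of a non-uniform region */
};

/* merge four child quadrants into one parent quadrant */
static void merge_children(const struct quad child[4], struct quad *parent)
{
    if (child[0].is_leaf && child[1].is_leaf &&
        child[2].is_leaf && child[3].is_leaf &&
        child[0].value == child[1].value &&
        child[0].value == child[2].value &&
        child[0].value == child[3].value) {
        /* fast case: four identical elements become one element */
        parent->is_leaf = 1;
        parent->value = child[0].value;
        parent->len = 0;
    } else {
        /* slow case: pack the four child encodings into one array
         * (a real encoding would also record tree structure; omitted here) */
        parent->is_leaf = 0;
        parent->len = 0;
        for (int i = 0; i < 4; i++) {
            if (child[i].is_leaf) {
                parent->data[parent->len++] = child[i].value;
            } else {
                memcpy(parent->data + parent->len, child[i].data,
                       child[i].len * sizeof(int));
                parent->len += child[i].len;
            }
        }
    }
}

int main(void)
{
    struct quad kids[4] = { {1, 7, 0, {0}}, {1, 7, 0, {0}},
                            {1, 7, 0, {0}}, {1, 7, 0, {0}} };
    struct quad parent;
    merge_children(kids, &parent);    /* collapses to a single leaf of value 7 */
    return 0;
}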
As expected, execution times over ServerNet are the fastest. Since
processing time plays a much more dominant role than data passing time
for ServerNet, there is not a noticeable difference in total execution
time for compressible and uncompressible data. For both the most
compressible and the least compressible data, whose processing times
over ServerNet are very close, computation on a single node takes about
four times as long as computation distributed over ServerNet. Compared
to Ethernet, on the other hand, computation on a single node takes about
2.5 times as long for very compressible data and about 1.8 times as
long for uncompressible data. We can see that the performance
improvement of distributing parallel processing depends strongly on
the ratio of the overhead of distributing processing to the time
required to perform that processing.
Figure 24: Performance comparison of topologies in diagram above.