Because of the differences in data transmission speeds, the merge sort
implementation produced very different results when tested with TCP/IP
over Ethernet than when tested with VIA over ServerNet. Network
latency over Ethernet is so high that performance is drastically
improved by avoiding network communication altogether and running all
processes on the same node: the time required to distribute the data
is longer than the time saved by processing it in parallel. This is
an important example, since one might naively assume that greater
parallelism always leads to better performance. The activities of the
two processors are depicted in Figure 9. Although the theoretical
model used above assumes that a process waits until others are done
before it starts executing, as may happen under VxWorks' fully
preemptive, absolute-priority-based scheduling, these tests were run
under Windows NT, which allocates time slices to each process. In
either case, the important factor when allocating processes to
physical resources is that a process will take longer to complete on a
processor with a heavier load.
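This trade-off can be made explicit with a simple break-even condition
(a back-of-the-envelope model assuming communication cannot be
overlapped with computation, not something measured here): distributing
the sort across k processors only pays off when

    T_comm + T_seq / k < T_seq,    i.e.    T_comm < T_seq * (1 - 1/k).

As the measurements below show, over Ethernet the communication time
alone (about 0.20 s) exceeds the entire single-node sort time (0.049 s),
so the condition fails no matter how large k is.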
Figure 9: Graphical representation of the activity of each processor when using one node.
The highest degree of parallelism on two identical computers with two
processors each is achieved when each node has an equal processing
load, and can perform the maximum amount of that processing
concurrently. For this merge-sort implementation, the optimal
parallelism is achieved by running two of the four leaves on each
node. This is shown in Figures 10 and 11, with
nodes colored green running on one workstation, and those colored red
running on the other.
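As a concrete illustration, the two assignments can be written as
lookup tables (a hypothetical labeling invented for this sketch:
processes are numbered 0 for the root, 1-2 for the internal nodes, and
3-6 for the leaves, and the values name the two workstations):

    /* Hypothetical process-to-node maps for Figures 10 and 11.
     * Index: 0 = root, 1-2 = internal nodes, 3-6 = leaves.
     * Value: which of the two dual-processor workstations runs it. */
    static const int fig10_node[7] = { 0, 0, 1, 0, 0, 1, 1 };  /* cut between levels 1 and 2 */
    static const int fig11_node[7] = { 0, 0, 0, 0, 1, 0, 1 };  /* cut between levels 2 and 3 */

In both maps each workstation runs exactly two leaves, so the sorting
load is balanced; the maps differ only in which tree edges cross the
network.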
Figure 10: Balanced binary tree for maximum parallelism. The largest communications delay is between levels one and two. This topology has the minimum communication latency for a load-balanced tree.
Figure 11: A nearly equivalent balanced binary tree for maximum parallelism. The difference is that the largest communications delay is between levels two and three. Performance is not significantly different from that of the topology in Figure 10.
Although the arrangement in Figure 10 may make slightly more resources
available to the second-level processes than the arrangement in
Figure 11, its performance is not noticeably better. There are several
likely reasons for this. First, since there are two nodes in the
second level and both workstations have dual processors, the
second-level processes will execute in parallel whether they run on
different processors in the same machine or on different machines.
There may be a slight processing advantage to placing the second-level
processes on different machines, since the load is more balanced if
each machine can devote one processor to sorting, and there is less
competition for access to the memory bus. On the other hand, the
overhead is greater for passing four messages than for passing two.
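A simple cost model makes the message overhead concrete (a sketch with
an assumed per-message latency L and bandwidth B, not fitted to these
measurements): sending n messages totaling D bytes over the network
costs roughly

    T = n * L + D / B.

Both topologies move the same total D across the network, so they
differ only in the number of latency terms n * L, which is small
compared to D / B on this network.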
When using Ethernet,
the other reason there is no significant difference in these two
topologies is that execution time is dominated by time spent passing
data over the network, and both arrangements pass the same amount of
data over the network. With a 50,000-element array, execution
requires 25,000 elements to be passed over the network in each direction. The
results of test programs showed that over Ethernet, the time to pass
200,000 bytes (25,000 ints * 4 bytes/int * 2 directions) is about 0.20
seconds. This accounts for most of the 0.212 seconds total sorting
time when distributing processing over Ethernet, and is much longer
than the 0.049 seconds the sort takes when just running everything on
one node. The predicted execution times for these arrangements were
0.166 s and 0.042 s, respectively. To vary the amount of data passed,
experiments were also run with one or three leaf processes on one
node, with two processes on each node in a configuration that requires
twice as much data passing (Figure 12), and with all leaves on a
separate node from the root (Figure 13). These showed that total
processing time had a nearly linear dependence on the total
data passed.
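The nearly linear dependence suggests a simple model (a sketch fitted
by hand to the numbers above; the 1 MB/s effective bandwidth follows
from 200,000 bytes in 0.20 s, while the 0.012 s residual term and all
names are assumptions of the sketch):

    #include <stdio.h>

    /* Hypothetical linear model of Ethernet sort time: a fixed
     * compute/overhead term plus data volume over effective bandwidth. */
    static double predict_seconds(double bytes_passed) {
        const double effective_bw = 1.0e6;   /* bytes/s: 200,000 B in 0.20 s */
        const double compute_time = 0.012;   /* s: residual of the 0.212 s total */
        return compute_time + bytes_passed / effective_bw;
    }

    int main(void) {
        /* Baseline, doubled (Figure 12), and quadrupled (Figure 13) volumes. */
        printf("%.3f s\n", predict_seconds(200000.0));   /* 0.212 s */
        printf("%.3f s\n", predict_seconds(400000.0));   /* 0.412 s */
        printf("%.3f s\n", predict_seconds(800000.0));   /* 0.812 s */
        return 0;
    }

Doubling the data volume, as in Figure 12, roughly doubles the
predicted time, matching the behavior reported below.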
Figure 12: This topology results in the maximum amount of data passing while maintaining a workload which is balanced between the two nodes. While inefficient, it is useful for demonstrating the influence of parallelism vs. the influence of communication latency.
The unusual configuration shown in Figure 12
has the same degree of parallelism as those shown in the previous
diagrams, but it needs to pass twice as much data. Consequently,
execution time is very nearly doubled when using Ethernet as the
system interconnect. Experiments were also run on the configuration
shown in Figure 13.
Figure 13: This topology has the maximum amount of data passing possible for a three-level binary tree distributed across two nodes. It also has practically no parallel computation. It is used as a worst-case scenario.
This inefficient arrangement was chosen simply because it maximizes
data passing, while minimizing parallelism. The volume of data passed
is twice that of the topology shown in Figure 12, as
is execution time. Going from maximum to minimum parallelism had no
noticeable effect when running over Ethernet. It can be concluded that
message-passing time dominates processing time and is the main factor
in total execution time for this setup.
Even on one node, though, performance is clearly improved by the
combination of a multiprocess program and a dual-processor machine.
Run as a single process with a single thread, and thus confined to a
single processor with no concurrency, this program averages 0.085
seconds per sort. Comparing single-process to multiprocess runs, the
improvement factor of 1.75 is close to the ideal of 2 for going from
one processor to two.
Figure 14: Graphical representation of the activity of each processor for the fastest configuration using VIA.
The fastest performance using our cluster is attained by running one
leaf (sorting) process per processor, for maximum parallelism, with VIA over
ServerNet as the system interconnect. This was the same configuration
that resulted from the task allocation algorithm discussed in the
theory section, which predicted a sort time of 0.0262 s. Over 3000
runs, the average sort time for this configuration is 0.032
seconds. The sequence of events during execution is depicted in
Figure 14. At the start, the root on processor 0 passes data to
process 1, also on processor 0, and to process 2 on processor 2. At
point 1, process 1 has received the data and passes part to process 3
on processor 0 and part to process 4 on processor 1. At point 2,
process 2 receives the data and passes half to process 5 on processor
2 and half to process 6 on processor 3. Processes 5 and 6 receive
their data and start sorting at point 3. They are still sorting at
point 4, when processes 3 and 4 finish and send data to process 1,
which finishes merging the data on processor 0 at point 5. Processor 1
has now finished all work assigned to it, and processor 0 cannot
continue until data is returned from process 2. Processes 5 and 6
finish sorting at point 6; by point 7, process 2 has merged their
data, and transmission to the root begins. The data arrives at point
8, and once all data has returned to the root, it merges the data and
finishes at point 9.
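The process tree itself can be sketched compactly. The following is a
minimal, single-machine illustration of a three-level merge-sort tree,
written for this discussion using POSIX fork() and pipes in place of
the TCP/IP and VIA transports used in the actual tests; the function
names and the depth parameter are inventions of the sketch, and unlike
the seven-process tree described above, each internal node here
performs one of its children's sorts itself, so four processes do the
leaf work:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Merge two sorted runs a[0..na) and b[0..nb) into dst. */
    static void merge(int *dst, const int *a, int na, const int *b, int nb) {
        int i = 0, j = 0, k = 0;
        while (i < na && j < nb) dst[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na) dst[k++] = a[i++];
        while (j < nb) dst[k++] = b[j++];
    }

    /* Sort data[0..n) with a binary process tree; depth 0 is a leaf. */
    static void tree_sort(int *data, int n, int depth) {
        if (depth == 0 || n < 2) { qsort(data, n, sizeof *data, cmp_int); return; }
        int half = n / 2, fd[2];
        if (pipe(fd) != 0) { perror("pipe"); exit(1); }
        pid_t pid = fork();
        if (pid == 0) {                        /* child sorts the upper half */
            close(fd[0]);
            tree_sort(data + half, n - half, depth - 1);
            const char *p = (const char *)(data + half);
            size_t left = (size_t)(n - half) * sizeof *data;
            while (left > 0) {                 /* send the sorted half back */
                ssize_t w = write(fd[1], p, left);
                if (w <= 0) _exit(1);
                p += w; left -= (size_t)w;
            }
            _exit(0);
        }
        close(fd[1]);
        tree_sort(data, half, depth - 1);      /* parent sorts the lower half */
        int *upper = malloc((size_t)(n - half) * sizeof *upper);
        char *q = (char *)upper;
        size_t want = (size_t)(n - half) * sizeof *upper;
        while (want > 0) {                     /* receive the child's half */
            ssize_t r = read(fd[0], q, want);
            if (r <= 0) { perror("read"); exit(1); }
            q += r; want -= (size_t)r;
        }
        close(fd[0]);
        waitpid(pid, NULL, 0);
        int *out = malloc((size_t)n * sizeof *out);
        merge(out, data, half, upper, n - half);
        memcpy(data, out, (size_t)n * sizeof *out);
        free(upper); free(out);
    }

    int main(void) {
        enum { N = 50000 };                    /* array size used in the tests */
        int *a = malloc(N * sizeof *a);
        for (int i = 0; i < N; i++) a[i] = rand();
        tree_sort(a, N, 2);                    /* depth 2: four leaf sorts */
        for (int i = 1; i < N; i++)
            if (a[i - 1] > a[i]) { puts("not sorted"); return 1; }
        puts("sorted");
        free(a);
        return 0;
    }

Each fork corresponds to one edge of the tree, and each pipe plays the
role the network connection plays in the distributed version.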
Figure 15: Comparison of total sort times over ServerNet, over Ethernet using the same process allocation used for ServerNet, and on a single workstation.
Compared to the single-process runs, the improvement is a factor of
2.7, against a theoretical ideal scalability of 4 for going from one
processor to four. With ideal scalability, processing on four
processors would take one fourth as long as processing on one
processor. Using the average time from the single-process case, ideal
scaling would give a sort time of (single-process time)/4 = 0.085/4 =
0.02125 s. Most of the remaining processing time can be accounted for
by interprocess communication delays, as shown in the theory section.
There are additional delays which were not included in the theoretical
predictions for execution time, but the estimate was in the right
ballpark, and it predicted the best allocation of processes to
processors.
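As a rough accounting of that gap (simple arithmetic on the numbers
above): the measured 0.032 s exceeds the ideal 0.02125 s by about
0.011 s, and it is this remainder that the communication delays must
explain.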
Figure 16: Effects of physical process topology on total processing time.
In summary, running on one node was 77% faster than distributing over
TCP/IP ((0.212 - 0.049)/0.212), and ServerNet was a further 35% faster
than one node ((0.049 - 0.032)/0.049).