
Experimental Results

The implementation of merge sort produced very different results when tested using TCP/IP over Ethernet than when using VIA over ServerNet, because of the difference in data transmission speeds. Since network latency is so high over Ethernet, performance is drastically improved by avoiding network communication altogether and running all processes on the same node: the time required to distribute the data is longer than the time saved by processing it in parallel. This is an important example, since one might naively assume that greater parallelism always leads to better performance. The activities of the two processors are depicted in Figure 9. Although the theoretical model used above assumes that a process waits until others are done before it starts executing, as may happen under VxWorks' fully preemptive, absolute-priority-based scheduling, these tests were run under Windows NT, which allocates time slices to each process. In either case, the important factor when allocating processes to physical resources is that a process will take longer to complete on a processor that carries a heavier load.
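To state the communication trade-off described above in symbols (ours, not from the original analysis): distributing the sort over p processors only pays off when, roughly,

\[ t_{\mathrm{comm}} + \frac{t_{\mathrm{sort}}}{p} \;<\; t_{\mathrm{sort}} , \]

i.e. when the communication time is smaller than the computation time it saves. Over Ethernet, t_comm by itself already exceeds the entire one-node sort time reported below, so running everything on one node wins.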

  
Figure 9: Graphical representation of the activity of each processor when using one node.
\includegraphics[scale=0.4]{SortGraphTCP.eps}

The highest degree of parallelism on two identical computers with two processors each is achieved when each node has an equal processing load and can perform the maximum amount of that processing concurrently. For this merge-sort implementation, the optimal parallelism is achieved by running two of the four leaves on each node. This is shown in Figures 10 and 11, with nodes colored green running on one workstation and those colored red running on the other.
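As a concrete illustration of this placement (a sketch with our own process numbering and placeholder node names, not the original configuration code), the arrangement of Figure 10 keeps the root, one second-level merge process, and two leaves on one workstation, and puts the other second-level process with its two leaves on the second workstation:

\begin{verbatim}
/* Illustrative process-to-node mapping for the balanced tree of Figure 10.
 * Process 0 is the root, 1-2 are the second-level merges, 3-6 are leaves;
 * "nodeA"/"nodeB" are placeholders for the two dual-processor workstations. */
#include <stdio.h>

struct placement { int process; const char *role; const char *node; };

int main(void)
{
    struct placement map[] = {
        { 0, "root merge", "nodeA" },
        { 1, "merge",      "nodeA" },
        { 2, "merge",      "nodeB" },  /* the only link that crosses the network */
        { 3, "leaf sort",  "nodeA" },
        { 4, "leaf sort",  "nodeA" },
        { 5, "leaf sort",  "nodeB" },
        { 6, "leaf sort",  "nodeB" },
    };

    for (unsigned i = 0; i < sizeof map / sizeof map[0]; i++)
        printf("process %d (%s) -> %s\n", map[i].process, map[i].role, map[i].node);
    return 0;
}
\end{verbatim}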

  
Figure 10: Balanced binary tree for maximum parallelism. Largest communications delay is between levels one and two. This topology has the minimum communication latency for a load balanced tree.
\includegraphics[scale=0.4]{BalancedBinaryTree.eps}


  
Figure 11: A nearly equivalent balanced binary tree for maximum parallelism. The difference is that the largest communications delay is between levels two and three. Performance is not significantly different from that of the topology in Figure 10.
\includegraphics[scale=0.4]{BalancedBinaryTree2.eps}

Although the arrangement in Figure 10 may leave slightly more resources available to the processes in the second level than the arrangement in Figure 11, its performance is not noticeably better. There are several likely reasons for this. First, since there are two nodes in the second level and both workstations have dual processors, the second-level processes execute in parallel whether they run on different processors in the same machine or on different machines. There may be a slight processing advantage to having the second-level processes on different machines, since the load is more balanced if each machine can devote one processor to sorting, and there is less competition for access to the memory bus. On the other hand, the overhead is greater for passing four messages than for passing two.
When using Ethernet, the other reason there is no significant difference between these two topologies is that execution time is dominated by time spent passing data over the network, and both arrangements pass the same amount of data. With a 50,000-element array, execution requires 25,000 elements to be passed over the network in each direction. Test programs showed that over Ethernet, the time to pass 200,000 bytes (25,000 ints * 4 bytes/int * 2 directions) is about 0.20 seconds. This accounts for most of the 0.212 seconds of total sorting time when distributing processing over Ethernet, and is much longer than the 0.049 seconds the sort takes when everything runs on one node. The predicted execution times for these arrangements were 0.166 s and 0.042 s. To vary the amount of data passed, further experiments ran one or three leaf processes on one node, ran two processes on each node in a configuration that requires twice as much data passing (Figure 12), or ran all leaves on a separate node from the root (Figure 13); these showed that total processing time had a nearly linear dependence on the total amount of data passed.
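The arithmetic behind these figures can be checked with a few lines of code (a sketch built from the measured values quoted above, not output from the original test programs):

\begin{verbatim}
/* Back-of-the-envelope check of the Ethernet figures quoted in the text. */
#include <stdio.h>

int main(void)
{
    long elements    = 50000;              /* total array size              */
    long crossing    = elements / 2;       /* 25,000 ints cross the network */
    long bytes_total = crossing * 4 * 2;   /* 4 bytes/int, out and back     */

    double t_transfer  = 0.20;   /* s, measured time to pass 200,000 bytes    */
    double t_total_eth = 0.212;  /* s, measured distributed sort, Ethernet    */
    double t_one_node  = 0.049;  /* s, measured sort run entirely on one node */

    printf("bytes over the network:       %ld\n", bytes_total);   /* 200000 */
    printf("fraction spent communicating: %.0f%%\n",
           100.0 * t_transfer / t_total_eth);                     /* ~94%   */
    printf("slowdown vs. one node:        %.1fx\n",
           t_total_eth / t_one_node);                             /* ~4.3x  */
    return 0;
}
\end{verbatim}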

  
Figure 12: This topology results in the maximum amount of data passing while maintaining a workload which is balanced between the two nodes. While inefficient, it is useful for demonstrating the influence of parallelism vs. the influence of communication latency.
\includegraphics[scale=0.4]{maxComBalancedTree.eps}

The unusual configuration shown in Figure 12 has the same degree of parallelism as those shown in the previous diagrams, but it needs to pass twice as much data. Consequently, execution time is very nearly doubled when using Ethernet as the system interconnect. Experiments were also run on the configuration shown in Figure 13.
  
Figure 13: This topology has the maximum amount of data passing possible for a three level binary tree distributed across two nodes. It also has practically no parallel computation. It is used as a worst case scenario.
\includegraphics[scale=0.4]{maxComTree.eps}

This inefficient arrangement was chosen simply because it maximizes data passing while minimizing parallelism. The volume of data passed is twice that of the topology shown in Figure 12, as is the execution time. Going from maximum to minimum parallelism had no noticeable effect beyond that of the additional data passing when run over Ethernet. It can be concluded that message-passing time dominates processing time and is the main factor in total execution time for this setup.
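A rough model of that linear dependence (ours, using the measured 0.20 s per 200,000 bytes as an effective per-byte cost, with the relative data volumes taken from the text) reproduces the observed doublings:

\begin{verbatim}
/* Sketch: estimated communication time for each Ethernet topology, assuming
 * time scales linearly with bytes passed at the measured effective rate. */
#include <stdio.h>

int main(void)
{
    double per_byte = 0.20 / 200000.0;             /* s per byte, measured */
    long   bytes[]  = { 200000, 400000, 800000 };  /* Fig. 10/11, 12, 13   */
    const char *topo[] = { "balanced (Fig. 10/11)",
                           "double data (Fig. 12)",
                           "worst case (Fig. 13)" };

    for (int i = 0; i < 3; i++)
        printf("%-24s ~%.2f s of communication\n", topo[i], bytes[i] * per_byte);
    return 0;
}
\end{verbatim}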
Even on one node, though, performance is clearly improved by the combination of a multiprocess program and a dual-processor machine. Run as a single process with a single thread, and thus able to use only one processor with no concurrency, the program averages 0.085 seconds per sort. Comparing single-process to multiprocess execution, the improvement factor of 1.75 is close to the ideal of 2 for going from one processor to two.
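Put another way (our arithmetic, from the averages above), ideal scaling from one processor to two would halve the single-process time:

\[ t_{\mathrm{ideal}} = \frac{0.085\,\mathrm{s}}{2} \approx 0.043\,\mathrm{s}, \]

which the measured one-node multiprocess average of 0.049 s comes close to.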


  
Figure 14: Graphical representation of the activity of each processor for the fastest configuration using VIA.
\includegraphics[scale=0.4]{SortGraphVIA.eps}

The fastest performance on our cluster is attained by configuring one process per processor, for maximum parallelism, with VIA over ServerNet as the system interconnect. This is the same configuration produced by the task allocation algorithm discussed in the theory section, which predicted a sort time of 0.0262 s. Over 3000 runs, the average sort time for this configuration is 0.032 seconds. The sequence of events during execution is depicted in Figure 14. At the start, the root process on processor 0 passes data to process 1, also on processor 0, and to process 2 on processor 2. At point 1, process 1 has received the data and passes part to process 3 on processor 0 and part to process 4 on processor 1. At point 2, process 2 receives the data and passes half to process 5 on processor 2 and half to process 6 on processor 3. Processes 5 and 6 receive their data and start sorting at point 3. They are still sorting at point 4, when processes 3 and 4 finish and send data to process 1, which finishes merging the data on processor 0 at point 5. Processor 1 has now finished all the work assigned to it, and processor 0 cannot continue until data is returned from process 2 at point 8. Processes 5 and 6 finish sorting at point 6. By point 7, process 2 has merged the data from processes 5 and 6, and data transmission to process 0 begins. Once all data has returned to the root, it performs the final merge and finishes at point 9.
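The structure of this seven-process tree can be sketched in C (a simplified illustration, not the original code: POSIX threads stand in for the separate processes, and shared arrays stand in for the VIA messages, so the communication and placement costs discussed above do not appear):

\begin{verbatim}
/* Sketch of the three-level merge-sort tree: a root, two second-level
 * merge workers, and four leaf sorters.  Threads replace the original
 * processes and shared memory replaces the cluster interconnect. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 50000                     /* array size used in the experiments */

struct task { int *data; int len; };

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Leaf process (3, 4, 5, 6 in the text): sort its share locally. */
static void *leaf_sort(void *arg)
{
    struct task *t = arg;
    qsort(t->data, t->len, sizeof(int), cmp_int);
    return NULL;
}

/* Merge two sorted runs a[0..la) and b[0..lb) into out. */
static void merge(const int *a, int la, const int *b, int lb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < la && j < lb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < la) out[k++] = a[i++];
    while (j < lb) out[k++] = b[j++];
}

/* Second-level process (1 and 2 in the text): hand half of its share to
 * each of two leaves, then merge the sorted halves when they return. */
static void *internal_sort(void *arg)
{
    struct task *t = arg;
    int half = t->len / 2;
    struct task left  = { t->data,        half          };
    struct task right = { t->data + half, t->len - half };
    pthread_t a, b;

    pthread_create(&a, NULL, leaf_sort, &left);
    pthread_create(&b, NULL, leaf_sort, &right);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    int *tmp = malloc(t->len * sizeof(int));
    merge(left.data, left.len, right.data, right.len, tmp);
    memcpy(t->data, tmp, t->len * sizeof(int));
    free(tmp);
    return NULL;
}

int main(void)
{
    static int data[N], sorted[N];
    for (int i = 0; i < N; i++) data[i] = rand();

    /* Root (process 0): pass half of the array to each second-level worker. */
    struct task left  = { data,         N / 2     };
    struct task right = { data + N / 2, N - N / 2 };
    pthread_t p1, p2;
    pthread_create(&p1, NULL, internal_sort, &left);
    pthread_create(&p2, NULL, internal_sort, &right);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);

    /* Final merge at the root once both halves have been returned. */
    merge(left.data, left.len, right.data, right.len, sorted);
    printf("first=%d last=%d\n", sorted[0], sorted[N - 1]);
    return 0;
}
\end{verbatim}

Compiled with a POSIX threads library (e.g. gcc -pthread), the leaf sorts run concurrently, mirroring the four leaf processes in the measured configuration.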

 
Figure 15: Comparison of total sort times over ServerNet, over Ethernet using the same process allocation used for ServerNet, and on a single workstation.
\rotatebox{270}{ \includegraphics[scale=0.5]{Sorts.eps}}

Compared to the single-process runs, the improvement is a factor of 2.7, against the theoretical ideal scalability of 4 for going from one processor to four. With ideal scalability, processing on four processors would take one fourth as long as processing on one processor. Using the average time from the single-process case, ideal scaling would give a sort time of (single-process time)/4 = 0.085/4 = 0.02125 s. Most of the remaining processing time can be accounted for by interprocess communication delays, as shown in the theory section. There are additional delays which were not included in the theoretical predictions of execution time, but the estimate was in the right ballpark and predicted the best allocation of processes to processors.
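Written out (our arithmetic, using the averages quoted above):

\[ S = \frac{t_1}{t_4} = \frac{0.085\,\mathrm{s}}{0.032\,\mathrm{s}} \approx 2.7, \qquad t_{\mathrm{ideal}} = \frac{t_1}{4} = \frac{0.085\,\mathrm{s}}{4} \approx 0.021\,\mathrm{s}, \]

so roughly 0.011 s of the measured 0.032 s is overhead beyond ideal scaling, most of it attributable to interprocess communication.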
 
Figure 16: Effects of physical process topology on total processing time.
\rotatebox{270}{ \includegraphics[scale=0.5]{TCPSort.eps}}

In summary, the speed improvement is 77% for running on one node versus distributing with TCP/IP over Ethernet, and 35% for VIA over ServerNet versus running on one node.
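These percentages follow from the measured averages, taking each improvement as the reduction in total sort time relative to the slower configuration of the pair:

\[ \frac{0.212 - 0.049}{0.212} \approx 0.77, \qquad \frac{0.049 - 0.032}{0.049} \approx 0.35 . \]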
