There are two general models for parallelization:
SIMD is characterized by topology: how data flows through the processors. Typically, it is a hardware implementation in array processors, where the synchronization is controlled by hardware. It is useful for lattice computations, linear algebra, FFT, convolutions, image processing, pattern recognition, etc.
and
MIMD lends itself to more general hardware approaches, such as symmetric multiprocessors (SMP) and clusters (with network connections). Work scheduling, synchronization and deadlock detection and prevention must be programmed for each application. MIMD is useful for iterative systems, systems of PDEs, finding the zeroes of functions, cryptography, etc.
We will use the MIMD approach. Suppose we begin with a single thread algorithm in the following form:
make list of "atomic" operationswhere "there_is_more_to_be_done ()" points to the next atomic operation. We propose to replace this function with "parallelize ()":while ( there_is_more_to_be_done () )
do one atomic operation
if ( starting computation )Note that the atomic operations are identical whether the single thread or parallel algorithm is used. This approach lends itself to tensor contractions because they are a sum of products, where the components entering into the products are known in advance. We have implemented this approach in a program called PTAH: the Parallel Tensor Algebra Hybrid system. It is a hybrid of C++ programs which do the actual computations, and Mathematica functions which are used to prepare the input and analyze the results.break up computation into parcelselsestart scheduler process
get first parcel to work on
point to next calculation in parcelif ( done with parcel )
if ( done with computation )send resultselsesynchronize with other tasks
get next parcel from scheduler
ie., for R_{a b c d} = g_{a e} R^{e}_{b c d}These index lists must be in the same order on all processors. For two processors, the first would do the first 11 products, and the second would do the last 11. In general, we split the list into m * n parcels for n processors, where m is large enough to smooth out the differences in individual product complexity (m is typically of the order of 100).
free indices summed indices g_{a e} R^{e}_{b c d} a b c d e term 0 1 0 1 0 g_{0 0} R^{0}_{1 0 1} 0 1 0 1 3 g_{0 3} R^{3}_{1 0 1} 0 1 0 2 0 g_{0 0} R^{0}_{1 0 2} 0 1 0 2 3 g_{0 3} R^{3}_{1 0 2} 0 1 1 3 0 g_{0 0} R^{0}_{1 1 3} 0 1 1 3 3 g_{0 3} R^{3}_{1 1 3} 0 1 2 3 0 g_{0 0} R^{0}_{1 2 3} 0 1 2 3 3 g_{0 3} R^{3}_{1 2 3} 0 2 0 2 0 g_{0 0} R^{0}_{2 0 2} 0 2 0 2 3 g_{0 3} R^{3}_{2 0 2} 0 2 1 3 0 g_{0 0} R^{0}_{2 1 3} 0 2 1 3 3 g_{0 3} R^{3}_{2 1 3} 0 2 2 3 0 g_{0 0} R^{0}_{2 2 3} 0 2 2 3 3 g_{0 3} R^{3}_{2 2 3} 0 3 0 3 0 g_{0 0} R^{0}_{3 0 3} 0 3 0 3 3 g_{0 3} R^{3}_{3 0 3} 0 3 1 2 0 g_{0 0} R^{0}_{3 1 2} 0 3 1 2 3 g_{0 3} R^{3}_{3 1 2} 1 2 1 2 1 g_{1 1} R^{1}_{2 1 2} 1 3 1 3 1 g_{1 1} R^{1}_{3 1 3} 1 3 2 3 1 g_{1 1} R^{1}_{3 2 3} 2 3 2 3 2 g_{2 2} R^{2}_{3 2 3}
The resultant components (or terms, in the case of scalar invariants) can be sent to all processors so that intermediate results are known to all, or results can be stored on each processor and collected at the end of processing. As with the Mathematica examples discussed previously, (anti)symmetric pairs of indices can be used to reduce the workload.
Synchronization in PTAH is by barrier: all processors wait for all results in a given computation and perform simplifications themselves.
Another important consideration in parallel processing is fault tolerance: processors may fail, so work must be reassignable. If results are saved after each parcel, scheduling or catastrophic failure can be recovered and continued; results are then collected at the end, but final simplification must be delayed until then.
The next appendix is about algebraic computation without Mathematica.
©2005, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.
Please send comments or suggestions to the author.