Parallel Algebraic Computation

Algebraic computation does not normally lend itself to parallel implementation. We will see, however, that tensor contractions are in fact approachable with parallel algorithms [Koehler2].

There are two general models for parallelization: SIMD (single instruction, multiple data), in which every processor executes the same instruction on its own portion of the data, and MIMD (multiple instruction, multiple data), in which each processor executes its own instruction stream independently.

When dealing with parallel computation, one of the many issues of concern is scaling. Ideally, n processors would produce a factor of n speedup, but in practice we usually achieve factors between log2(n) and n/ln(n), due to internal waits for synchronization.

We will use the MIMD approach. Suppose we begin with a single-thread algorithm in the following form:

    make list of "atomic" operations
    while ( there_is_more_to_be_done () )
        do one atomic operation

where "there_is_more_to_be_done ()" advances an internal pointer to the next atomic operation. We propose to replace this function with "parallelize ()":
    if ( starting computation )
        break up computation into parcels
        start scheduler process
        get first parcel to work on
    else
        point to next calculation in parcel
        if ( done with parcel )
            if ( done with computation )
                send results
                synchronize with other tasks
            else
                get next parcel from scheduler
Note that the atomic operations are identical whether the single-thread or the parallel algorithm is used. This approach lends itself to tensor contractions because they are sums of products, where the components entering into the products are known in advance. We have implemented this approach in a program called PTAH: the Parallel Tensor Algebra Hybrid system. It is a hybrid of C++ programs, which do the actual computations, and Mathematica functions, which are used to prepare the input and analyze the results.


PTAH implementation

In PTAH, contractions are done in parallel. Work scheduling is done by partitioning contractions: we create lists of free indices and sum components for all nonzero product terms. These are generated on all processors:
i.e., for R_abcd = g_ae R^e_bcd:

free indices   summed index   g_ae R^e_bcd
 a b c d            e             term
 0 1 0 1            0         g_00 R^0_101
 0 1 0 1            3         g_03 R^3_101
 0 1 0 2            0         g_00 R^0_102
 0 1 0 2            3         g_03 R^3_102
 0 1 1 3            0         g_00 R^0_113
 0 1 1 3            3         g_03 R^3_113
 0 1 2 3            0         g_00 R^0_123
 0 1 2 3            3         g_03 R^3_123
 0 2 0 2            0         g_00 R^0_202
 0 2 0 2            3         g_03 R^3_202
 0 2 1 3            0         g_00 R^0_213
 0 2 1 3            3         g_03 R^3_213
 0 2 2 3            0         g_00 R^0_223
 0 2 2 3            3         g_03 R^3_223
 0 3 0 3            0         g_00 R^0_303
 0 3 0 3            3         g_03 R^3_303
 0 3 1 2            0         g_00 R^0_312
 0 3 1 2            3         g_03 R^3_312
 1 2 1 2            1         g_11 R^1_212
 1 3 1 3            1         g_11 R^1_313
 1 3 2 3            1         g_11 R^1_323
 2 3 2 3            2         g_22 R^2_323

These index lists must be in the same order on all processors. For two processors, the first would do the first 11 products, and the second would do the last 11. In general, we split the list into m * n parcels for n processors, where m is large enough to smooth out the differences in individual product complexity (m is typically of the order of 100).

The resultant components (or terms, in the case of scalar invariants) can be sent to all processors so that intermediate results are known to all, or results can be stored on each processor and collected at the end of processing. As with the Mathematica examples discussed previously, (anti)symmetric pairs of indices can be used to reduce the workload.

Synchronization in PTAH is by barrier: all processors wait until every result in a given computation is available, and each then performs the simplifications itself.

Another important consideration in parallel processing is fault tolerance: processors may fail, so work must be reassignable. If results are saved after each parcel, the computation can be recovered and continued after a scheduling or catastrophic failure; results are then collected at the end, but final simplification must be delayed until that point.

The next appendix is about algebraic computation without Mathematica.



©2005, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.

Please send comments or suggestions to the author.