Parallel Algebraic Computation

Algebraic computation does not normally lend itself to parallel implementation. We will see, however, that tensor contractions are in fact approachable with parallel algorithms [Koehler2].

There are two general models for parallelization: SIMD (single instruction, multiple data), in which every processor executes the same instruction on its own portion of the data, and MIMD (multiple instruction, multiple data), in which each processor executes its own instruction stream independently.

When dealing with parallel computation, one of the many issues of concern is scaling. Ideally, n processors would produce a factor of n speedup, but in practice we usually achieve factors between log2(n) and n/ln(n), due to internal waits for synchronization.

We will use the MIMD approach. Suppose we begin with a single-thread algorithm in the following form:

    make list of "atomic" operations
    while ( there_is_more_to_be_done () )
        do one atomic operation

where "there_is_more_to_be_done ()" advances an internal pointer to the next atomic operation. We propose to replace this function with "parallelize ()":
    if ( starting computation )
        break up computation into parcels
        start scheduler process
        get first parcel to work on
    else
        point to next calculation in parcel
        if ( done with parcel )
            if ( done with computation )
                send results
                synchronize with other tasks
            else
                get next parcel from scheduler
Note that the atomic operations are identical whether the single-thread or the parallel algorithm is used. This approach lends itself to tensor contractions because they are sums of products, where the components entering into the products are known in advance. We have implemented this approach in a program called PTAH: the Parallel Tensor Algebra Hybrid system. It is a hybrid of C++ programs, which do the actual computations, and Mathematica functions, which are used to prepare the input and analyze the results.


PTAH implementation

In PTAH, contractions are done in parallel. Work scheduling is done by partitioning contractions: we create lists of free indices and sum components for all nonzero product terms. These are generated on all processors:
i.e., for R_abcd = g_ae R^e_bcd:

free indices   summed index   g_ae R^e_bcd
 a b c d            e             term
 0 1 0 1            0         g_00 R^0_101
 0 1 0 1            3         g_03 R^3_101
 0 1 0 2            0         g_00 R^0_102
 0 1 0 2            3         g_03 R^3_102
 0 1 1 3            0         g_00 R^0_113
 0 1 1 3            3         g_03 R^3_113
 0 1 2 3            0         g_00 R^0_123
 0 1 2 3            3         g_03 R^3_123
 0 2 0 2            0         g_00 R^0_202
 0 2 0 2            3         g_03 R^3_202
 0 2 1 3            0         g_00 R^0_213
 0 2 1 3            3         g_03 R^3_213
 0 2 2 3            0         g_00 R^0_223
 0 2 2 3            3         g_03 R^3_223
 0 3 0 3            0         g_00 R^0_303
 0 3 0 3            3         g_03 R^3_303
 0 3 1 2            0         g_00 R^0_312
 0 3 1 2            3         g_03 R^3_312
 1 2 1 2            1         g_11 R^1_212
 1 3 1 3            1         g_11 R^1_313
 1 3 2 3            1         g_11 R^1_323
 2 3 2 3            2         g_22 R^2_323

These index lists must be in the same order on all processors. For two processors, the first would do the first 11 products, and the second would do the last 11. In general, we split the list into m * n parcels for n processors, where m is large enough to smooth out the differences in individual product complexity (m is typically of the order of 100).

The resultant components (or terms, in the case of scalar invariants) can be sent to all processors so that intermediate results are known to all, or results can be stored on each processor and collected at the end of processing. As with the Mathematica examples discussed previously, (anti)symmetric pairs of indices can be used to reduce the workload.

Synchronization in PTAH is by barrier: all processors wait until every result in a given computation is available, and each then performs the simplifications itself.

Another important consideration in parallel processing is fault tolerance: processors may fail, so work must be reassignable. If results are saved after each parcel, the computation can be recovered and continued after a scheduling or catastrophic failure; results are then collected at the end, but final simplification must be delayed until that point.

The next appendix is about algebraic computation without Mathematica.



©2005, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.

Please send comments or suggestions to the author.