# Parallel Algebraic Computation

Algebraic computation does not normally lend itself to parallel implementation. We will see, however, that tensor contractions are in fact approachable with parallel algorithms [Koehler2].
When dealing with parallel computation, one of the many issues of concern is scaling. Ideally, n processors would produce a factor of n speedup, but in practice we usually achieve factors between log_2(n) and n / ln(n) due to internal waits for synchronization.
A serial computation can be organized as:

```
make list of "atomic" operations
while ( there_is_more_to_be_done () )
    do one atomic operation
```

where "there_is_more_to_be_done ()" points to the next atomic operation. We propose to replace this function with "parallelize ()":
```
if ( starting computation )
    break up computation into parcels
    start scheduler process
    get first parcel to work on
else
    point to next calculation in parcel
    if ( done with parcel )
        if ( done with computation )
            send results
            synchronize with other tasks
        else
            get next parcel from scheduler
```
Note that the atomic operations are identical whether the single-threaded or parallel algorithm is used. This approach lends itself to tensor contractions because they are a sum of products, where the components entering into the products are known in advance. We have implemented this approach in a program called PTAH: the Parallel Tensor Algebra Hybrid system. It is a hybrid of C++ programs, which do the actual computations, and Mathematica functions, which prepare the input and analyze the results.

### PTAH implementation

In PTAH, contractions are done in parallel. Work scheduling is done by partitioning contractions: we create lists of the free indices and summed components for all nonzero product terms. These lists are generated on all processors; e.g., for R^a_bcd = g^ae R_ebcd:

| free indices (a b c d) | summed index (e) | term (g^ae R_ebcd) |
|:---:|:---:|:---|
| 0 1 0 1 | 0 | g^00 R_0101 |
| 0 1 0 1 | 3 | g^03 R_3101 |
| 0 1 0 2 | 0 | g^00 R_0102 |
| 0 1 0 2 | 3 | g^03 R_3102 |
| 0 1 1 3 | 0 | g^00 R_0113 |
| 0 1 1 3 | 3 | g^03 R_3113 |
| 0 1 2 3 | 0 | g^00 R_0123 |
| 0 1 2 3 | 3 | g^03 R_3123 |
| 0 2 0 2 | 0 | g^00 R_0202 |
| 0 2 0 2 | 3 | g^03 R_3202 |
| 0 2 1 3 | 0 | g^00 R_0213 |
| 0 2 1 3 | 3 | g^03 R_3213 |
| 0 2 2 3 | 0 | g^00 R_0223 |
| 0 2 2 3 | 3 | g^03 R_3223 |
| 0 3 0 3 | 0 | g^00 R_0303 |
| 0 3 0 3 | 3 | g^03 R_3303 |
| 0 3 1 2 | 0 | g^00 R_0312 |
| 0 3 1 2 | 3 | g^03 R_3312 |
| 1 2 1 2 | 1 | g^11 R_1212 |
| 1 3 1 3 | 1 | g^11 R_1313 |
| 1 3 2 3 | 1 | g^11 R_1323 |
| 2 3 2 3 | 2 | g^22 R_2323 |

These index lists must be in the same order on all processors. For two processors, the first would do the first 11 products, and the second would do the last 11. In general, we split the list into m * n parcels for n processors, where m is large enough to smooth out the differences in individual product complexity (m is typically of the order of 100).

The resultant components (or terms, in the case of scalar invariants) can be sent to all processors so that intermediate results are known to all, or results can be stored on each processor and collected at the end of processing. As with the Mathematica examples discussed previously, (anti)symmetric pairs of indices can be used to reduce the workload.

The next appendix is about algebraic computation without Mathematica.

©2005, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.

Please send comments or suggestions to the author.