Due: Tuesday, September 30th
(1) You are tired of having your programs write output to the screen
in random jumbled order. Write a few lines of pseudocode that are
executed by all processors, produce this style of output, and that
require no more than 2P sends/receives, where P is the number of
processors. It should work for any value of P.
Proc 0: value is 4
Proc 1: value is 2
Proc 2: value is 7
Proc 3: value is 15
...
If you were writing to a file instead of the screen, typically only
processor 0 would open the file and do the ``writes''. Write a
section of pseudocode that satisfies this new constraint. How many
sends/receives are now required?
(2) Consider a finite difference calculation performed on an NxN grid, similar
to the parallel game-of-life simulations we have
been discussing. Assume that updating the value for each grid cell
requires 8 floating-point operations (same 9-point stencil as the
GOL), and communication is needed at the start of each timestep to
update ghost-cell values. Analyze the two parallelization strategies
discussed in class for programming assignment 3: (1) a
one-dimensional decomposition of the grid over P processors, and
(2) a two-dimensional decomposition across Q*Q=P processors. For
each strategy, write down a formula for the total CPU time per
timestep in terms of communication startup (alpha), inverse
bandwidth per byte (beta) and time per flop (tau), as well as
N and P (or Q). You can assume that N is an exact multiple of P, and
that each communicated ghost-cell value requires eight bytes of
memory. For each strategy, derive formulas for the parallel speedup
and efficiency as a function of the same parameters.
For the 2-d strategy, assuming the startup cost is negligible, how
quickly must N grow as a function of P to keep the parallel
efficiency constant?
On Sandia's Intel Tflops machine, the following values are
approximately correct (in nanoseconds): alpha = 20000,
beta = 3, and tau = 10. For each of
the two strategies running on 1024 processors, how large must N be
to keep the cost of communication below 20% of the total run time?
(3) Consider the recursive halving and recursive doubling operations described in class. For P a power-of-two, write some pseudocode using these operations to perform a broadcast of a N-length vector with 2*log(P) message startups and only 2N total communication volume. Note: Although it requires more startups, for large N this is better than the N*log(P) communication volume required by the tree-structured (cascade) broadcast you programmed for the first assignment.