Divide and Conquer

** Due:** Thursday, November 13

** Goals:**

- Introduce you to divide-and-conquer parallelism.
- Learn to write a parallel algorithm from scratch rather than
modifying a serial code.
- Create and free MPI communicators on the fly.

** Background:**

Impressed by your bugling elk simulations, the Lincoln County Ranchers
Association (LCRA) has funneled serious grant money into your
department. Your advisor sees the light and decides to change the
direction of your research. With a government subsidy, the LCRA has
set up infrared monitoring stations around the perimeter of the Gila
which detect whenever an elk leaves the safe haven of the Gila and
laser tattoos it with a unique ID number. Periodically these
monitoring computers must communicate and sort their data so that ID
numbers can be handed out to LCRA members for their weekly
``park-and-poach'' activity. The LCRA wants you to write a portable
MPI code for this purpose which they will own and can license to other
conservation-minded groups around the country. While skeptical that
this really qualifies as cutting-edge environmental research, as a
maturing graduate student, you recognize the need to satisfy paying
customers.

** Assignment:**

You will write a parallel program to sort a set of numerical values. To make life easier you may assume that no two values are identical. Initially, each processor owns an equal fraction of the values. When your program is finished, each processor will own a sorted subset of values: that is, processor 0 will own the lowest few values, processor 1 the next few, and so on. Each processor will also sort its owned subset so that in aggregate, the entire list is in sorted order. To perform the sort, you should implement the parallel divide-and-conquer variant of quicksort presented in class: divide the data into two subsets of low and high values on two subgroups of processors by finding a median, then recurse within each subgroup. To simplify your task, you can assume that the number of processors is a power-of-two.

Input to your program will consist of the number of values to sort (N), a random number seed (SEED) used to generate the list of values, and an index (M) of the value to output. Your code should run correctly for any N on any power-of-two processors P, including the case where P > N. The output of your code is the Mth value in the sorted list and a simple ``check-sum'' calculation on the sorted values which is more easily described in the code itself. For both of these outputs you will find the MPI_Scan function useful. Your parallel code should also include the following features: computation of the CPU time for the sort (not including initialization of the random values), creation and freeing of MPI communicators, use of asynchronous receives and waits for exchanging large sets of values between processors, and safeguards to insure that the exchanged data does not overflow allocated memory.

You should start with a sequential Fortran
or C program which performs this task.

** What you hand in:**

- A program listing of your code.
- Some evidence (you decide on the input/output) that your code
runs correctly even when P > N, including when P > 2N.
- A 2-d table of results for a series of runs, all with SEED =
123456 and M = N/5. The table columns should be for varying
N by powers of 10 and should have at least 4 columns
(100,1000,10000,100000), more if you wish. The table rows
should be for varying P and should have at least 4 rows
including P=1. For example, P = 1,2,4,8 on cochiti or P =
1,8,64,256 on the nCUBE. Each ``entry'' in the table should
contain the CPU time for the sort, the Mth sorted value, the
checksum result, the overall speed-up, and parallel efficiency.
For example, here is the table with just one row, running the
serial code on the Intel Paragon:
P=1 N = 100 N = 1000 N = 10000 N = 100000 CPU (secs) 3.44E-4 5.20E-3 8.25E-2 1.28 Mth 0.19539653 0.19475528 0.19816056 0.20036508 Checksum 32.757375 329.27865 3314.1820 33329.805 Speed-Up 1.0 1.0 1.0 1.0 Efficiency 1.0 1.0 1.0 1.0 Be sure to include at least 8 digits of precision in the Mth value and checksum entries so that we can verify that your code is correct. Comment on the performance trends you see in the table for one-processor, fixed-size, and scaled-size problems. In other words, tell us why you think the numbers show what they do. Note: there are several scaling issues here, so you need to give more thought to this than in past assignments.

- Brief (!) answers to the Open Questions below.

** Open Questions:**

- With this algorithm, at an intermediate stage the number of
values owned by a particular processor can be different than
N/P. Can you bound this number so you can safely pre-allocate
memory? How could this problem be avoided altogether?
- Would your program work if the data to be sorted contained
duplicate values? What about the degenerate case where all the
values are the same? What part(s) of your algorithm would need
to be carefully constructed to avoid problems with this kind of
data?
- How would you need to change your program to make it work when
P is not a power-of-two?
- (Optional) How can the laser-tattoo system distinguish elk from
hikers, Greenpeace activists and members of the LCRA?
- (Optional) What can the LCRA do to enhance its odds of winning the annual Friend-of-Bambi award from the Sierra Club?