This section describes the MapReduce programs provided in the examples directory of the distribution:

Each are provided in 3 formats: as a C++ program, C program, and Python script. Note that the Python scripts use the PyPar package which provides a Python/MPI interface, as discussed above in the Python Interface section, so you must have PyPar installed in your Python to run them.

The C++ and C programs can be built (assuming you have already built the MR-MPI library) by typing

make -f Makefile.foo

from within the examples directory, using one of the provided Makefiles. As with the library itself, you may need to edit one of the Makefiles to create a new version appropriate to your machine.

The examples directory also includes input scripts for the scripting interface to MR-MPI called OINK. There are scripts for word frequency (in.wordfreq), R-MAT generation (in.rmat) and various graph algorithms (in.cc, in.tri, in.luby, in.sssp, in.pagerank), described in the paper by Plimpton and Devine.

OINK has its own manual and doc pages. To run these scripts you will need to build OINK, and then run one of the scripts as follows:

oink_machine < in.rmat

The wordfreq programs implement the word frequency counting algorithm described above in this section. The wordfreq programs are run by giving a list of text files as arguments, e.g.

wordfreq ~/mydir/*.cpp mpirun -np 8 wordfreq ~/mydir/*.cpp cwordfreq ~/mydir/*.cpp mpirun -np 8 cwordfreq ~/mydir/*.cpp python wordfreq.py ~/mydir/*.cpp mpirun -np 8 python wordfreq.py ~/mydir/*.cpp

Total word counts and a list of the top 10 words should be printed to the screen, along with the time to perform the operation.

The 3 different versions of the wordfreq program should give the same answers, although if non-text files are used, the parsing of the contents into words can be done differently by the C library strtok() function and the Python string "split" method.

The rmat programs generate a particular form of randomized sparse matrix known as an R-MAT matrix. Depending on the parameters chosen, the sparsity pattern in the resulting matrix can be highly non-uniform, and a good model for irregular graphs, such as ones representing a network of computers or WWW page links.

The rmat programs are run by specifying a few parameters, e.g.

rmat N Nz a b c d frac outfile mpirun -np 8 rmat N Nz a b c d frac outfile crmat N Nz a b c d frac outfile mpirun -np 8 crmat N Nz a b c d frac outfile python rmat.py N Nz a b c d frac outfile mpirun -np 8 python rmat.py N Nz a b c d frac outfile

The meaning of the parameters is as follows. Note that only matrices with a power-of-2 number of rows can be generated, so specifying N=20 creates a matrix with over a million rows.

- 2^N = # of rows in matrix
- Nz = average # of non-zeroes per row
- a,b,c,d = generation params for matrix entries, must sum to 1
- frac = randomization parameter between 0 and 1
- seed = random # seed, positive integer
- outfile = optional output file

A full description of the R-MAT generation algorithm is beyond the
scope of this doc page, but here's the brief version. The *a,b,c,d*
parameters are effectively weights on the 4 quadrants of the matrix.
To generate a single new matrix element, one quadrant is chosen, with
a probability proportional to its weight. This operation is repeated
recursively within the chosen quadrant, applying the *frac* parameter
to randomize the weights a bit. After N iterations, a single I,J
matrix location has been identified and its value is set (to 1 in this
case).

The total number of matrix entries generated is Nx * 2^N. This procedure can generate duplicates, so those are removed, and new elements generated until the desired number is reached.

When completed, the matrix statistics are printed to the screen, along
with the time to generate the matrix. If the optional *outfile*
parameter is specified, then the matrix entries are written to files
(one per processor). Each line of any file has the form

I J value

where I,J are the matrix row,column and value is the matrix entry (all are 1 in this case). If the files are concatenated together, the full set of matrix entries should result.

The 3 different versions of the rmat programs should give the same answers in a statistical sense. The answers will not be identical because the same random number generation scheme is not used in all 3 programs.

**(RMAT)** D. Chakrabarti, Y. Zhan, C. Faloutsos, R-MAT: A Recursive
Model for Graph Mining", if Proceedings of the SIAM Conference on Data
Mining (2004), available at
http://www.cs.cmu.edu/~deepay/mywww/papers/siam04.pdf.

**(Plimpton)** Plimpton and Devine, "MapReduce in MPI for Large-Scale
Graph Algorithms", to appear in Parallel Computing (2011).