This section describes the MapReduce programs provided in the examples directory of the distribution:
Each are provided in 3 formats: as a C++ program, C program, and Python script. Note that the Python scripts use the PyPar package which provides a Python/MPI interface, as discussed above in the Python Interface section, so you must have PyPar installed in your Python to run them.
The C++ and C programs can be built (assuming you have already built the MR-MPI library) by typing
make -f Makefile.foo
from within the examples directory, using one of the provided Makefiles. As with the library itself, you may need to edit one of the Makefiles to create a new version appropriate to your machine.
The examples directory also includes input scripts for the scripting interface to MR-MPI called OINK. There are scripts for word frequency (in.wordfreq), R-MAT generation (in.rmat) and various graph algorithms (in.cc, in.tri, in.luby, in.sssp, in.pagerank), described in the paper by Plimpton and Devine.
OINK has its own manual and doc pages. To run these scripts you will need to build OINK, and then run one of the scripts as follows:
oink_machine < in.rmat
The wordfreq programs implement the word frequency counting algorithm described above in this section. The wordfreq programs are run by giving a list of text files as arguments, e.g.
wordfreq ~/mydir/*.cpp mpirun -np 8 wordfreq ~/mydir/*.cpp cwordfreq ~/mydir/*.cpp mpirun -np 8 cwordfreq ~/mydir/*.cpp python wordfreq.py ~/mydir/*.cpp mpirun -np 8 python wordfreq.py ~/mydir/*.cpp
Total word counts and a list of the top 10 words should be printed to the screen, along with the time to perform the operation.
The 3 different versions of the wordfreq program should give the same answers, although if non-text files are used, the parsing of the contents into words can be done differently by the C library strtok() function and the Python string "split" method.
The rmat programs generate a particular form of randomized sparse matrix known as an R-MAT matrix. Depending on the parameters chosen, the sparsity pattern in the resulting matrix can be highly non-uniform, and a good model for irregular graphs, such as ones representing a network of computers or WWW page links.
The rmat programs are run by specifying a few parameters, e.g.
rmat N Nz a b c d frac outfile mpirun -np 8 rmat N Nz a b c d frac outfile crmat N Nz a b c d frac outfile mpirun -np 8 crmat N Nz a b c d frac outfile python rmat.py N Nz a b c d frac outfile mpirun -np 8 python rmat.py N Nz a b c d frac outfile
The meaning of the parameters is as follows. Note that only matrices with a power-of-2 number of rows can be generated, so specifying N=20 creates a matrix with over a million rows.
A full description of the R-MAT generation algorithm is beyond the scope of this doc page, but here's the brief version. The a,b,c,d parameters are effectively weights on the 4 quadrants of the matrix. To generate a single new matrix element, one quadrant is chosen, with a probability proportional to its weight. This operation is repeated recursively within the chosen quadrant, applying the frac parameter to randomize the weights a bit. After N iterations, a single I,J matrix location has been identified and its value is set (to 1 in this case).
The total number of matrix entries generated is Nx * 2^N. This procedure can generate duplicates, so those are removed, and new elements generated until the desired number is reached.
When completed, the matrix statistics are printed to the screen, along with the time to generate the matrix. If the optional outfile parameter is specified, then the matrix entries are written to files (one per processor). Each line of any file has the form
I J value
where I,J are the matrix row,column and value is the matrix entry (all are 1 in this case). If the files are concatenated together, the full set of matrix entries should result.
The 3 different versions of the rmat programs should give the same answers in a statistical sense. The answers will not be identical because the same random number generation scheme is not used in all 3 programs.
(RMAT) D. Chakrabarti, Y. Zhan, C. Faloutsos, R-MAT: A Recursive Model for Graph Mining", if Proceedings of the SIAM Conference on Data Mining (2004), available at http://www.cs.cmu.edu/~deepay/mywww/papers/siam04.pdf.
(Plimpton) Plimpton and Devine, "MapReduce in MPI for Large-Scale Graph Algorithms", to appear in Parallel Computing (2011).