This mutiple-page section discusses how to call the MR-MPI library from a C++ program and gives a description of all its methods and variable settings. Use of the library from a C program (or other hi-level language) or from Python is discussed in other sections of the manual.
All the library methods operate on two basic data structures stored within the MapReduce object, a KeyValue object (KV) and a KeyMultiValue object (KMV). When running in parallel, these objects are stored in a distributed fashion across multiple processors.
A KV is a collection of key/value pairs. The same key may appear many times in the collection, associated with values which may or may not be the same.
A KMV is also a collection of key/value pairs. But each key in the KMV is unique, meaning it appears exactly once (see the clone() method for a possible exception). The value associated with a KMV key is a concatenated list (a multi-value) of all the values associated with the same key in the original KV.
More details about how KV and KMV objects are stored are given in the Technical Details section.
Here is an overview of how the various library methods operate on KV and KMV objects. This is useful to understand, since this determines how the various operations can be chained together in your program.
add() | KV -> KV | add pairs from one KV to another | serial | 2 pages |
aggregate() | KV -> KV | pairs are aggregated onto procs | parallel | 7 pages |
broadcast() | KV -> KV | send pairs from one proc to all procs | parallel | 2 pages |
clone() | KV -> KMV | each KV pair becomes a KMV pair | serial | 2 pages |
close() | KV | allows one MapReduce object to add KV pairs to another | serial | 0 pages |
collapse() | KV -> KMV | all KV pairs become one KMV pair | serial | 2 pages |
collate() | KV -> KMV | aggregate + convert | parallel | 4+ pages |
compress() | KV -> KV | calls back to user program to compress duplicate keys | serial | 4+ pages |
convert() | KV -> KMV | duplicate KV keys become one KMV key | serial | 4+ pages |
gather() | KV -> KV | collect pairs on many procs to few procs | parallel | 2 pages |
map() | create or add to a KV | calls back to user program to generate pairs | serial | 1 page |
reduce() | KMV -> KV | calls back to user program to process KMV pairs | serial | 3 pages |
open() | create or add to a KV | allows one MapReduce object to add KV pairs to another | serial | 0 pages |
print() | KV or KMV | print KV or KMV pairs to screen or file(s) | serial | 1 page |
scan() | KV or KMV | calls back to user program to process KV or KMV pairs | serial | 1 page |
scrunch() | KV -> KMV | gather + collapse | parallel | 3 pages |
sort_keys() | KV -> KV | calls back to user program to sort pairs by key | serial | 5 pages |
sort_values() | KV -> KV | calls back to user program to sort pairs by value | serial | 5 pages |
sort_multivalues() | KMV -> KMV | calls back to user program to sort multi-values within each pair | serial | 4 pages |
kv_stats() | KV | print stats about a KV | serial | 0 pages |
kmv_stats() | KMV | print stats about a KMV | serial | 0 pages |
Note that each MapReduce object contains a single KV or KMV object (or neither) when its method is called. (Some methods operate on 2 or more MapReduce objects.) When the method completes, the MapReduce object also contains a single KV or KMV object. Thus if a method creates a new KV or KMV object, the old one is deleted, if it existed. The KV object is also deleted if a KMV object is produced, and vice versa.
The methods flagged as "serial" perform their operation on the portion of a KV or KMV owned by an individual processor. They involve only local computation (performed simultaneously on all processors) and no parallel comuunication. The methods flagged as "parallel" involve communication between processors.
The listed page counts are the number of memory pages that method requires. See the memsize setting for a discussion of what memory pages are and how their size is set. The methods whose page count is listed as 4+ all perform a convert() operation internally. The minimum number of pages this requires is 4. Depending on the page size and the characteristics of the KV pairs being converted to KMV pairs, more pages can be required. See the out-of-core discussion in this section for more details.