MapReduce-MPI WWW Site - MapReduce-MPI Documentation

MapReduce reduce() method

MapReduce multivalue_blocks() method

MapReduce multivalue_block() method

MapReduce multivalue_block_select() method

uint64_t MapReduce::reduce(void (*myreduce)(char *, int, char *, int, int *, KeyValue *, void *), void *ptr)

uint64_t MapReduce::multivalue_blocks(int &)

int MapReduce::multivalue_block(int iblock, char **ptr_multivalue, int **ptr_valuesizes)

void MapReduce::multivalue_block_select(int which)

This calls the reduce() method of a MapReduce object, passing it a function pointer to a myreduce function you write. It operates on a KeyMultiValue object, calling your myreduce function once for each unique key/multi-value (KMV) pair owned by that processor. A new KeyValue object is created which stores all the key/value pairs generated by your myreduce() function. The method returns the total number of new key/value pairs stored by all processors.

You can give this method a pointer (void *ptr) which will be returned to your myreduce() function. See the Technical Details section for why this can be useful. Just specify a NULL if you don't need this.

In this example the user function is called myreduce() and it must have the following interface, which is the same as that used by the compress() method:

void myreduce(char *key, int keybytes, char *multivalue, int nvalues, int *valuebytes, KeyValue *kv, void *ptr)

A single KMV pair is passed to your function from the KeyMultiValue object stored by the MapReduce object. The key is typically unique to this reduce task and the multi-value is a list of the nvalues associated with that key in the KeyMultiValue object.

There are two possibilities for a KMV pair returned to your function. The first is that it fits in one page of memory allocated by the MapReduce object, which is the usual case. See the memsize setting for details on memory allocation.

In this case, the char *multivalue argument is a pointer to the beginning of the multi-value which contains all nvalues, packed one after the other. The int *valuebytes argument is an array which stores the length of each value in bytes. If needed, it can be used by your function to compute an offset into char *values for where each individual value begins. Your function is also passed a kv pointer to a new KeyValue object created and stored internally by the MapReduce object.

If the KMV pair does not fit in one page of memory, then the meaning of the arguments passed to your function is changed. Your function must call two additional library functions in order to retrieve a block of values that does fit in memory, and process them one block at a time.

In this case, the char *multivalue argument will be NULL and the nvalues argument will be 0. Either of these can be tested for within your function. If you know that no KMV pair will overflow one page of memory, then the test is not needed. The meaning of the kv and ptr arguments is the same as discussed above. However, the int *valuebytes argument is changed to be a pointer to the MapReduce object. This is to allow you to make the following two kinds of calls back to the library:

MapReduce *mr = (MapReduce *) valuebytes;
int nblocks;
uint64_t nvalues_total = mr->multivalue_blocks(nblocks);
for (int iblock = 0; iblock < nblocks; iblock++) { 
  int nv = mr->multivalue_block(iblock,&multivalue,&valuebytes);
  for (int i = 0; i < nv; i++) {
    process each value within the block of values
  }
}

The call to multivalue_blocks() returns both the total number of values (as an unsigned 64-bit integer return value), and the number of blocks of values in the multi-value (as first argument). Each call to multivalue_block() retrieves one block of values. The number of values in the block is returned, as nv in this case. The multivalue and valuebytes arguments are pointers to a char * and int * (i.e. a char ** and int **), which will be set to point to the block of values and their lengths respectively, so they can then be used just as the multivalue and valuebytes arguments in the myreduce() callback itself (when the values do not exceed available memory).

Note that in this example we are re-using (and thus overwriting) the original multivalue and valuebytes arguments as local variables.

Also note that your myreduce() function can call multivalue_block() as many times as it wishes and process the blocks of values multiple times or in any order, though looping through blocks in ascending order will typically give the best disk I/O performance.

If you need to load and process two blocks of values simultaneously (e.g. in a double loop), then the multivalue_block_select() function can be called with which = 1 or 2 to specify a page of memory to read a block of values into. This should be set just before the call to multivalue_block(), to insure one block of values is not overwritten by reading a second block.

Your myreduce() function can produce key/value pairs (though this is not required) which it registers with the MapReduce object by calling the add() method of the KeyValue object. The syntax for registration is described on the doc page of the KeyValue add() method. Alternatively, your myreduce() function can write information to an output file.

See the Settings and Technical Details sections for details on the byte-alignment of keys and values that are passed to your myreduce() function and on those you register with the KeyValue add() methods. Note that only the first value of a multi-value (or of each block of values) passed to your myreduce() function will be aligned to the valuealign setting.

This method is an on-processor operation, requiring no communication. When run in parallel, each processor performs a myreduce() on each of the key/value pairs it owns and stores any new key/value pairs it generates.

Related methods: Keyvalue add(), map()