MapReduce-MPI WWW Site - MapReduce-MPI Documentation

MapReduce map() method

Variant 1:
uint64_t MapReduce::map(int nmap, void (*mymap)(int, KeyValue *, void *), void *ptr)
uint64_t MapReduce::map(int nmap, void (*mymap)(int, KeyValue *, void *), void *ptr, int addflag)

Variant 2:
uint64_t MapReduce::map(int nstr, char **strings, int self, int recurse, int readfile, void (*mymap)(int, char *, KeyValue *, void *), void *ptr)
uint64_t MapReduce::map(int nstr, char **strings, int self, int recurse, int readfile, void (*mymap)(int, char *, KeyValue *, void *), void *ptr, int addflag)

Variant 3:
uint64_t MapReduce::map(int nmap, int nstr, char **strings, int recurse, int readfile, char sepchar, int delta, void (*mymap)(int, char *, int, KeyValue *, void *), void *ptr)
uint64_t MapReduce::map(int nmap, int nstr, char **strings, int recurse, int readfile, char sepchar, int delta, void (*mymap)(int, char *, int, KeyValue *, void *), void *ptr, int addflag)

Variant 4:
uint64_t MapReduce::map(int nmap, int nstr, char **strings, int recurse, int readfile, char *sepstr, int delta, void (*mymap)(int, char *, int, KeyValue *, void *), void *ptr)
uint64_t MapReduce::map(int nmap, int nstr, char **strings, int recurse, int readfile, char *sepstr, int delta, void (*mymap)(int, char *, int, KeyValue *, void *), void *ptr, int addflag)

Variant 5:
uint64_t MapReduce::map(MapReduce *mr2, void (*mymap)(uint64_t, char *, int, char *, int, KeyValue *, void *), void *ptr)
uint64_t MapReduce::map(MapReduce *mr2, void (*mymap)(uint64_t, char *, int, char *, int, KeyValue *, void *), void *ptr, int addflag)

This calls the map() method of a MapReduce object. A function pointer to a mapping function you write is specified as an argument. This method either creates a new KeyValue object to store all the key/value pairs generated by your mymap function, or adds them to an existing KeyValue object. The method returns the total number of key/value pairs in the KeyValue object.

There are several variants of the map() method to allow for different ways to process input data. Each variant has a corresponding form of the callback mymap() function.

For the first set of variants (with or without addflag) you simply specify a total number of map tasks nmap to perform across all processors. The index of a map task is passed back to your mymap() function. The MapReduce library assigns map tasks to processors; see more details below.

For the second set of variants, you specify nstr and strings which are file and/or directory names. Using these strings, a list of filenames is generated. Each filename in the list is passed back to your mymap() function which can open the file and process it.

If self is 0, then only processor 0 generates the list of filenames, and the MapReduce library assigns files to processors; see more details below. If self is 1, then each processor generates its own list of filenames and those files are assigned to that processor. Note that in the self = 0 case, it is assumed that every processor can read any file that is assigned to it. Also note that with self = 1 you can assign files to a processor that reside on a disk local to that processor, or with a parallel disk system you can pass different strings to different processors so that each processor reads from a different set of files/directories.

The list of filenames is generated in the following manner. Each of the strings is checked for whether it is a file or directory. If it is a file, it is added to the list of files. If it is a directory, the directory is opened and all the files in it are added to the list of files. If the recurse flag is set to 1, then if sub-directories are found in the directory, they are opened and the files in them are also added to the list of files (and so forth, recursively).

The readfile setting adds one additional wrinkle. If readfile is 1, then instead of adding each filename to the list, each file is opened, and filenames are read from that file and added to the list. In this mode, each file should contain one filename per line. Blank lines are not allowed. Leading and trailing whitespace around each filename is OK.

The number of files that are generated and processed can be accessed after the map() method is invoked, via the variable mapfilecount, e.g.

MapReduce *mr = new MapReduce();
// self = 1, recurse = 0, readfile = 1
mr->map(nstr,strings,1,0,1,mymap,NULL);
int ntotalfiles = mr->mapfilecount;

The third and fourth sets of variants allow large file(s) to be broken into chunks, with each chunk passed back to your mymap() function as a string so it can process it. Nmap is the number of chunks to generate from all the files in aggregate (not nmap chunks per file). As with the previous variant, you also specify nstr, strings, recurse, and readfile. This generates a list of filenames, the same as in the previous variant. The only difference is that no self setting is allowed, because only processor 0 generates the list. The specified nmap should be >= the number of files in the generated list; it is reset to the number of files if that is not the case.

For the third set of variants you specify a separation character sepchar. For the fourth set of variants, you specify a separation string sepstr. The files in the generated list of files are split into nmap chunks with roughly equal numbers of bytes in each chunk. Think of all the files concatenated together and then split into nmap chunks. For each call to your mymap() function, a chunk is read from a particular file, and passed to your function as a string, so your code does not read the file. See details below about the splitting methodology and the delta input parameter.

For the fifth set of variants, you specify an existing MapReduce object mr2 with key/value pairs, which can either be this MapReduce object or another one. The key/value pairs from mr2 are passed back to your mymap() function, one key/value at a time, allowing you to generate new key/value pairs from an existing set.

You can give any of the map() methods a pointer (void *ptr) which will be returned to your mymap() function. See the Technical Details section for why this can be useful. Just specify a NULL if you don't need this.

The meaning of the final addflag argument is as follows.

For all but the last variant, if addflag is omitted or is specified as 0, then map() will create a new KeyValue object, deleting any existing KeyValue object. If addflag is non-zero, then KV pairs generated by your mymap() function are added to an existing KeyValue object, which is created if needed.

For the last variant, if the source of KeyValue pairs (mr2) is different than the MapReduce object mr, then the KV pairs in mr2 are not altered or deleted, regardless of the addflag setting. If addflag is 0, then the KeyValue object in mr is deleted, and newly generated KV pairs are added to a new KeyValue object. If addflag is 1, then newly generated KV pairs are added to the existing KeyValue object in mr.

For the last variant, if the source of KeyValue pairs (mr2) is the same as MapReduce object mr, there are two possibilities. If addflag is 1, then newly generated KV pairs are added to the existing KeyValue object. If addflag is 0, then the existing KeyValue object is effectively replaced by the newly generated KV pairs. Note that the addflag=1 option requires the KeyValue object to first be copied. If your mymap() function will not generate any new KV pairs, then it is more efficient to use the scan() method, which simply allows you to iterate over the existing KV pairs.

In these examples the user function is called mymap() and it has one of four interfaces depending on which variant of the map() method is invoked:

void mymap(int itask, KeyValue *kv, void *ptr)
void mymap(int itask, char *file, KeyValue *kv, void *ptr)
void mymap(int itask, char *str, int size, KeyValue *kv, void *ptr)
void mymap(uint64_t itask, char *key, int keybytes, char *value, int valuebytes, KeyValue *kv, void *ptr)

In all cases, the final 2 arguments passed to your function are a pointer to a KeyValue object (kv) stored internally by the MapReduce object, and the original pointer you specified as an argument to the map() method, as void *ptr.

In the first mymap() variant, itask is passed to your function with a value 0 <= itask < nmap, where nmap was specified in the map() call. For example, you could use itask to select a file from a list stored by your application. Your mymap() function could open and read the file or perform some other operation.
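A callback of this first form might look like the sketch below. So that it compiles on its own, it uses a small stand-in KeyValue struct whose add() takes a key pointer, key length, value pointer, and value length; that stand-in, the FileList struct, and the choice of what to emit are all assumptions for illustration. Real code would use the library's KeyValue class and the syntax given on the KeyValue add() doc page.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the library's KeyValue class (assumption):
// it simply records each key/value pair as two byte strings.
struct KeyValue {
  std::vector<std::pair<std::string, std::string>> pairs;
  void add(char *key, int keybytes, char *value, int valuebytes) {
    pairs.emplace_back(std::string(key, keybytes), std::string(value, valuebytes));
  }
};

// Hypothetical application state, reached through the void *ptr argument.
struct FileList {
  std::vector<std::string> names;
};

// First callback form: itask (0 <= itask < nmap) selects one entry from a
// list stored by the application.
void mymap(int itask, KeyValue *kv, void *ptr) {
  FileList *list = static_cast<FileList *>(ptr);
  std::string &name = list->names[itask];
  // A real callback would open and process the file here. This sketch only
  // emits (filename, task index) as a key/value pair.
  kv->add(&name[0], static_cast<int>(name.size()),
          reinterpret_cast<char *>(&itask), sizeof(int));
}
```

The address of the FileList object would be given as the ptr argument of map(), with nmap set to the number of names in the list.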

In the second mymap() variant, itask will have a value 0 <= itask < nfiles, where nfiles is the number of filenames in the list of files that was generated. Your function is also passed a single filename, which it will presumably open and read.

In the third mymap() variant, itask will have a value 0 <= itask < nmap, where nmap was specified in the map() call and is the number of file segments generated. Your function is also passed a string of bytes (str) of length size read from one of the files. Size includes a trailing '\0' that is appended to the string.
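Because the chunk is '\0'-terminated, a callback of this third form can treat str as an ordinary C string. The sketch below splits the chunk into whitespace-separated words and emits each word as a key with an empty value. The KeyValue struct is a hypothetical stand-in so the example compiles on its own; real code would use the library's KeyValue class and the syntax given on the KeyValue add() doc page.

```cpp
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the library's KeyValue class (assumption).
struct KeyValue {
  std::vector<std::pair<std::string, std::string>> pairs;
  void add(char *key, int keybytes, char *value, int valuebytes) {
    pairs.emplace_back(std::string(key, keybytes),
                       value ? std::string(value, valuebytes) : std::string());
  }
};

// Third callback form: str holds size bytes, including the trailing '\0'.
void mymap(int itask, char *str, int size, KeyValue *kv, void *ptr) {
  (void)itask;
  (void)size;
  (void)ptr;
  const char *whitespace = " \t\n\f\r";
  // strtok() overwrites separators in str in place while tokenizing.
  for (char *word = std::strtok(str, whitespace); word != nullptr;
       word = std::strtok(nullptr, whitespace)) {
    // The key length includes the terminating '\0' so the key remains a
    // valid C string; the value is empty.
    kv->add(word, static_cast<int>(std::strlen(word)) + 1, nullptr, 0);
  }
}
```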

For map() methods that take files and a separation criterion as arguments, you must specify nmap >= nfiles, so that there are one or more map tasks per file. For files that are split into multiple chunks, the split is done at occurrences of the separation character or string. You specify a delta of how many extra bytes to read with each chunk that will guarantee the splitting character or string is found within that many bytes. For example, if the files are lines of text, you could choose a newline character '\n' as the sepchar, and a delta of 80 (if the longest line in your files is 80 characters). If the files are snapshots of simulation data where each snapshot is 1000 lines (no more than 80 characters per line), you could choose the first line of each snapshot (e.g. "Snapshot") as the sepstr, and a delta of 80000. Note that if the separation character or string is not found within delta bytes, an error will be generated. Also note that there is no harm in choosing a large delta so long as it is not larger than the chunk size for a particular file.

If the separation criterion is a character (sepchar), the chunk of bytes passed to your mymap() function will start with the character after a sepchar, and will end with a sepchar (followed by a '\0'). If the separation criterion is a string (sepstr), the chunk of bytes passed to your mymap() function will start with sepstr, and will end with the character immediately preceding a sepstr (followed by a '\0'). Note that this means your mymap() function will be passed different byte strings if you specify sepchar = 'A' vs sepstr = "A".
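The sepchar rule can be modeled on an in-memory buffer as in the sketch below. This is a simplified model under stated assumptions, not the library's algorithm: it treats the data as one string rather than a set of files, places nominal boundaries at equal byte offsets, and moves each interior boundary forward to just past the next sepchar, failing if none is found within delta bytes. The function name split_on_sepchar() is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model of sepchar splitting on a single in-memory buffer.
// Every chunk except possibly the last ends with sepchar, and every chunk
// after the first starts with the character following a sepchar.
std::vector<std::string> split_on_sepchar(const std::string &data, int nmap,
                                          char sepchar, std::size_t delta) {
  if (nmap < 1) throw std::invalid_argument("nmap must be >= 1");
  std::vector<std::size_t> starts(static_cast<std::size_t>(nmap) + 1);
  starts[0] = 0;
  starts[nmap] = data.size();
  for (int i = 1; i < nmap; ++i) {
    // Nominal boundary: equal-sized pieces by byte count.
    std::size_t nominal = data.size() * static_cast<std::size_t>(i) / nmap;
    // Look for the separator among the delta bytes starting at the boundary.
    std::size_t pos = data.find(sepchar, nominal);
    if (pos == std::string::npos || pos - nominal >= delta)
      throw std::runtime_error("separator not found within delta bytes");
    starts[i] = pos + 1;  // the next chunk begins just after the separator
  }
  std::vector<std::string> chunks;
  for (int i = 0; i < nmap; ++i)
    chunks.push_back(data.substr(starts[i], starts[i + 1] - starts[i]));
  return chunks;
}
```

For a sepstr, the same idea would place each boundary at the start of the separator string instead of after it, so that each chunk begins with sepstr as described above.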

In the fourth mymap() variant, itask will have a value 0 <= itask < nkey, where nkey is an unsigned 64-bit int and is the number of key/value pairs in the specified MapReduce object. Key and value are the byte strings for a single key/value pair and are of length keybytes and valuebytes respectively.

The MapReduce library assigns map tasks to processors. Options for how it does this can be controlled by MapReduce settings. Basically, nmap/P tasks are assigned to each processor, where P is the number of processors in the MPI communicator you instantiated the MapReduce object with.
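To make "nmap/P tasks per processor" concrete, the sketch below shows two common static schemes: contiguous blocks and round-robin striding. Both are illustrations only; which scheme the library actually applies depends on its settings and is not specified here. The function names are hypothetical, and in an MPI program rank and nprocs would come from MPI_Comm_rank() and MPI_Comm_size().

```cpp
#include <vector>

// Contiguous blocks: rank p owns tasks [p*nmap/P, (p+1)*nmap/P).
std::vector<int> block_tasks(int nmap, int nprocs, int rank) {
  std::vector<int> tasks;
  long long lo = static_cast<long long>(rank) * nmap / nprocs;
  long long hi = static_cast<long long>(rank + 1) * nmap / nprocs;
  for (long long t = lo; t < hi; ++t) tasks.push_back(static_cast<int>(t));
  return tasks;
}

// Round-robin striding: rank p owns tasks p, p+P, p+2P, ...
std::vector<int> strided_tasks(int nmap, int nprocs, int rank) {
  std::vector<int> tasks;
  for (int t = rank; t < nmap; t += nprocs) tasks.push_back(t);
  return tasks;
}
```

In both schemes every task index from 0 to nmap-1 is owned by exactly one processor, and the per-processor counts differ by at most one.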

Typically, your mymap() function will produce key/value pairs which it registers with the MapReduce object by calling the add() method of the KeyValue object. The syntax for registration is described on the doc page of the KeyValue add() method.

See the Settings and Technical Details sections for details on the byte-alignment of keys and values you register with the KeyValue add() methods or that are passed to your mymap() function.

Aside from the assignment of tasks to processors, this method is really an on-processor operation, requiring no communication. When run in parallel, each processor generates key/value pairs and stores them, independently of other processors.

Related methods: KeyValue add(), reduce()