MapReduce-MPI WWW Site - MapReduce-MPI Documentation

Python interface to the MapReduce-MPI Library

A Python wrapper for the MR-MPI library is included in the distribution. The advantage of using Python is how concise the language is, enabling rapid development and debugging of MapReduce programs. The disadvantage is speed, since Python is slower than a compiled language. Using the MR-MPI library from Python incurs two additional overheads, discussed in the Technical Details section.

Before using MR-MPI from a Python script, you need to do two things. You need to build MR-MPI as a dynamic shared library, so it can be loaded by Python. And you need to tell Python how to find the library and the Python wrapper file python/mrmpi.py. Both these steps are discussed below. If you wish to run MR-MPI in parallel from Python, you also need to extend your Python with MPI. This is also discussed below.

The Python wrapper for MR-MPI uses the amazing and magical (to me) "ctypes" package in Python, which auto-generates the interface code needed between Python and a set of C interface routines for a library. Ctypes is part of standard Python for versions 2.5 and later. You can check which version of Python you have installed, by simply typing "python" at a shell prompt.

The following sub-sections cover the rest of the Python discussion:



Building MR-MPI as a shared library

Instructions on how to build MR-MPI as a shared library are given in the Start section. A shared library is one that is dynamically loadable, which is what Python requires. On Linux this is a library file that ends in ".so", not ".a".

From the src directory, type

make -f Makefile.shlib foo 

where foo is the machine target name, such as linux or g++ or serial. This should create the file libmrmpir_foo.so in the src directory, as well as a soft link libmrmpi.so, which is what the Python wrapper will load by default. Note that if you are building multiple machine versions of the shared library, the soft link is always set to the most recently built version.

If this fails, see the Start section for more details.


Installing the Python wrapper into Python

For Python to invoke MR-MPI, there are 2 files it needs to know about:

Mrmpi.py is the Python wrapper on the MR-MPI library interface. Libmrmpi.so is the shared MR-MPI library that Python loads, as described above.

You can insure Python can find these files in one of two ways:

If you set the paths to these files as environment variables, you only have to do it once. For the csh or tcsh shells, add something like this to your ~/.cshrc file, one line for each of the two files:

setenv PYTHONPATH $PYTHONPATH:/home/sjplimp/mrmpi/python
setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH:/home/sjplimp/mrmpi/src 

If you use the python/install.py script, you need to invoke it every time you rebuild MR-MPI (as a shared library) or make changes to the python/mrmpi.py file.

You can invoke install.py from the python directory as

% python install.py [libdir] [pydir] 

The optional libdir is where to copy the MR-MPI shared library to; the default is /usr/local/lib. The optional pydir is where to copy the mrmpi.py file to; the default is the site-packages directory of the version of Python that is running the install script.

Note that libdir must be a location that is in your default LD_LIBRARY_PATH, like /usr/local/lib or /usr/lib. And pydir must be a location that Python looks in by default for imported modules, like its site-packages dir. If you want to copy these files to non-standard locations, such as within your own user space, you will need to set your PYTHONPATH and LD_LIBRARY_PATH environment variables accordingly, as above.

If the install.py script does not allow you to copy files into system directories, prefix the python command with "sudo". If you do this, make sure that the Python that root runs is the same as the Python you run. E.g. you may need to do something like

% sudo /usr/local/bin/python install.py [libdir] [pydir] 

You can also invoke install.py from the make command in the src directory as

% make install-python 

In this mode you cannot append optional arguments. Again, you may need to prefix this with "sudo". In this mode you cannot control which Python is invoked by root.

Note that if you want Python to be able to load different versions of the MR-MPI shared library (see this section below), you will need to manually copy files like lmpmrmpi_g++.so into the appropriate system directory. This is not needed if you set the LD_LIBRARY_PATH environment variable as described above.


11.3 Extending Python with MPI to run in parallel

If you wish to run MR-MPI in parallel from Python, you need to extend your Python with an interface to MPI. This also allows you to make MPI calls directly from Python in your script, if you desire.

There are several Python packages available that purport to wrap MPI as a library and allow MPI functions to be called from Python.

These include

All of these except pyMPI work by wrapping the MPI library and exposing (some portion of) its interface to your Python script. This means Python cannot be used interactively in parallel, since they do not address the issue of interactive input to multiple instances of Python running on different processors. The one exception is pyMPI, which alters the Python interpreter to address this issue, and (I believe) creates a new alternate executable (in place of "python" itself) as a result.

In principle any of these Python/MPI packages should work to invoke MR-MPI in parallel and MPI calls themselves from a Python script which is itself running in parallel. However, when I downloaded and looked at a few of them, their documentation was incomplete and I had trouble with their installation. It's not clear if some of the packages are still being actively developed and supported.

The one I recommend, since I have successfully used it with MR-MPI, is Pypar. Pypar requires the ubiquitous Numpy package be installed in your Python. After launching python, type

import numpy 

to see if it is installed. If not, here is how to install it (version 1.3.0b1 as of April 2009). Unpack the numpy tarball and from its top-level directory, type

python setup.py build
sudo python setup.py install 

The "sudo" is only needed if required to copy Numpy files into your Python distribution's site-packages directory.

To install Pypar (version pypar-2.1.4_94 as of Aug 2012), unpack it and from its "source" directory, type

python setup.py build
sudo python setup.py install 

Again, the "sudo" is only needed if required to copy Pypar files into your Python distribution's site-packages directory.

If you have successully installed Pypar, you should be able to run Python and type

import pypar 

without error. You should also be able to run python in parallel on a simple test script

% mpirun -np 4 python test.py 

where test.py contains the lines

import pypar
print "Proc %d out of %d procs" % (pypar.rank(),pypar.size()) 

and see one line of output for each processor you run on.

IMPORTANT NOTE: To use Pypar and MR-MPI in parallel from Python, you must insure both are using the same version of MPI. If you only have one MPI installed on your system, this is not an issue, but it can be if you have multiple MPIs. Your MR-MPI build is explicit about which MPI it is using, since you specify the details in your lo-level src/MAKE/Makefile.foo file. Pypar uses the "mpicc" command to find information about the MPI it uses to build against. And it tries to load "libmpi.so" from the LD_LIBRARY_PATH. This may or may not find the MPI library that MR-MPI is using. If you have problems running both Pypar and MR-MPI together, this is an issue you may need to address, e.g. by moving other MPI installations so that Pypar finds the right one.


11.4 Testing the Python-MR-MPI interface

To test if MR-MPI is callable from Python in serial, launch Python interactively and type:

>>> from mrmpi import mrmpi
>>> mr = mrmpi() 

If you get no errors, you're ready to use MR-MPI from Python. If the 2nd command fails, the most common error to see is

OSError: Could not load MR-MPI dynamic library 

which means Python was unable to load the MR-MPI shared library. This typically occurs if the system can't find the MR-MPI shared library, or if something about the library is incompatible with your Python. The error message should give you an indication of what went wrong.

You can also test the load directly in Python as follows, without first importing from the mrmpi.py file:

>>> from ctypes import CDLL
>>> CDLL("libmrmpi.so") 

If an error occurs, carefully go thru the steps in Start and above about building a shared library and about insuring Python can find the necessary two files it needs.

Test MR-MPI and Python in parallel:

To run MR-MPI in parallel, assuming you have installed the Pypar package as discussed above, create a test.py file containing these lines:

import pypar
from mrmpi import mrmpi
mr = mrmpi()
print "Proc %d out of %d procs has" % (pypar.rank(),pypar.size()),mr
pypar.finalize() 

You can then run it in parallel as:

% mpirun -np 4 python test.py 

Note that if you leave out the 3 lines from test.py that specify Pypar commands you will instantiate and run MR-MPI independently on each of the P processors specified in the mpirun command. In this case you should get 4 sets of output, each showing that a MR-MPI was initialized on a single processor, instead of one set of output showing MR-MPI was initialized on 4 processors. If the 1-processor outputs occur, it means that Pypar is not working correctly.

Also note that once you import the PyPar module, Pypar initializes MPI for you, and you can use MPI calls directly in your Python script, as described in the Pypar documentation. The last line of your Python script should be pypar.finalize(), to insure MPI is shut down correctly.

Running Python scripts:

Note that any Python script (not just for MR-MPI) can be invoked in one of several ways:

% python foo.script
% python -i foo.script
% foo.script 

The last command requires that the first line of the script be something like this:

#!/usr/local/bin/python 
#!/usr/local/bin/python -i 

where the path points to where you have Python installed, and that you have made the script file executable:

% chmod +x foo.script 

Without the "-i" flag, Python will exit when the script finishes. With the "-i" flag, you will be left in the Python interpreter when the script finishes, so you can type subsequent commands. As mentioned above, you can only run Python interactively when running Python on a single processor, not in parallel.



Using the MR-MPI library from Python

The Python interface to MR-MPI consists of a Python "mrmpi" module, the source code for which is in python/mrmpi.py, which creates a "mrmpi" object, with a set of methods that can be invoked on that object. The sample Python code below assumes you have first imported the "mrmpi" module in your Python script, as follows:

from mrmpi import mrmpi 

These are the methods defined by the mrmpi module. Some of them take callback functions as arguments, e.g. map() and reduce(). These are Python functions you define elsewhere in your script. When you register "keys" and "values" with the library, they can be simple quantities like strings or ints or floats. Or they can be Python data structures like lists or tuples.

These are the class methods defined by the mrmpi module. Their functionality and arguments are described in the C++ interface section.

mr = mrmpi()                # create a MR-MPI object using the default libmrmpi.so library
mr = mrmpi(mpi_comm)        # ditto, but with a specified MPI communicator
mr = mrmpi(0.0)             # ditto, and the library will finalize MPI
mr = mrmpi(None,"g++")      # create a MR-MPI object using the libmrmpi_g++.so library
mr = mrmpi(mpi_comm,"g++")  # ditto, but with a specified MPI communicator
mr = mrmpi(0.0,"g++")       # ditto, and the library will finalize MPI 
mr2 = mr.copy()             # copy mr to create mr2 
mr.destroy()                # destroy an mrmpi object, freeing its memory
                            # this will also occur if Python garbage collects 
mr.add(mr2)
mr.aggregate()
mr.aggregate(myhash)        # if specified, myhash is a hash function
			    #   called back from the library as myhash(key)
			    # myhash() should return an integer (a proc ID)
mr.broadcast(root)
mr.clone()
mr.close()
mr.collapse(key)
mr.collate()
mr.collate(myhash)          # if specified, myhash is the same function
			    #   as for aggregate() 
mr.compress(mycompress)     # mycompress is a function called back from the
			    #   library as mycompress(key,mvalue,mr,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key, mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your mycompress function should typically
			    #   make calls like mr->add(key,value)
mr.compress(mycompress,ptr) # if specified, ptr is any Python datum
		            #    and is passed back to your mycompress()
			    # if not specified, ptr = None 
mr.convert()
mr.gather(nprocs) 
mr.map(nmap,mymap)          # mymap is a function called back from the
			    #   library as mymap(itask,mr,ptr)
			    #   where mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your mymap function should typically
			    #   make calls like mr->add(key,value)
mr.map(nmap,mymap,ptr)      # if specified, ptr is any Python datum
			    #    and is passed back to your mymap()
			    # if not specified, ptr = None
mr.map(nmap,mymap,ptr,addflag) # if addflag is specfied as a non-zero int,
			       #   new key/value pairs will be added to the
			       #   existing key/value pairs 
mr.map_file(files,self,recurse,readfile,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,filename,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_file_char(nmap,files,recurse,readfile,sepchar,delta,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,str,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_file_str(nmap,files,recurse,readfile,sepstr,delta,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,str,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_mr(mr2,mymap)         # pass key/values in mr2 to mymap
                             # mymap is a function called back from the
			     #   library as mymap(itask,key,value,mr,ptr)
			     # as above, ptr and addflag are optional args 
mr.open()
mr.open(addflag)
mr.print_screen(proc,nstride,kflag,vflag)
mr.print_file(file,fflag,proc,nstride,kflag,vflag) 
mr.reduce(myreduce)         # myreduce is a function called back from the
			    #   library as myreduce(key,mvalue,mr,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key, mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your myreduce function should typically
			    #   make calls like mr->add(key,value)
mr.reduce(myreduce,ptr)     # if specified, ptr is any Python datum
			    #    and is passed back to your myreduce()
			    # if not specified, ptr = None 
mr.scan_kv(myscan)          # myscan is a function called back from the
			    #   library as myscan(key,value,ptr)
			    #   for each key/value pair
			    #   and you (optionally) provide ptr (see below)
mr.scan_kv(myscan,ptr)      # if specified, ptr is any Python datum
			    #    and is passed back to your myreduce()
			    # if not specified, ptr = None 
mr.scan_kmv(myscan)         # myscan is a function called back from the
			    #   library as myreduce(key,mvalue,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key,
			    #   and you (optionally) provide ptr (see below)
mr.scan_kmv(myscan,ptr)     # if specified, ptr is any Python datum
			    #    and is passed back to your myreduce()
			    # if not specified, ptr = None 
mr.scrunch(nprocs,key)
mr.sort_keys(mycompare)
mr.sort_values(mycompare)
mr.sort_multivalues(mycompare) # compare is a function called back from the
			       #   library as mycompare(a,b) where
			       #   a and b are two keys or two values
			       # your mycompare() should compare them
			       #   and return a -1, 0, or 1 
			       #   if a < b, or a == b, or a > b
mr.sort_keys_flag(flag)
mr.sort_values_flag(flag)
mr.sort_multivalues_flag(flag) 
mr.kv_stats(level)
mr.kmv_stats(level) 
mr.mapstyle(value)             # set mapstyle to value
mr.all2all(value)              # set all2all to value
mr.verbosity(value)            # set verbosity to value
mr.timer(value)                # set timer to value
mr.memsize(value)              # set memsize to value
mr.minpage(value)              # set minpage to value
mr.maxpage(value)              # set maxpage to value 
mr.add(key,value)                 # add single key and value
mr.add_multi_static(keys,values)  # add list of keys and values
				  # all keys are assumed to be same length
				  # all values are assumed to be same length
mr.add_multi_dynamic(keys,values) # add list of keys and values
				  # each key may be different length
				  # each value may be different length 

Note that you can create multiple MR-MPI objects in your Python script, and coordinate the data stored in each and moved between them, just as can from a C or C++ program.

The class methods above correspond one-to-one with the C++ methods described here, except that for C++ methods with multiple interfaces (e.g. map()), there are multiple Python methods with slightly different names, similar to the C interface.

There is no set function the the keyalign and valuealign settings. These are hard-wired to 1 for the Python interface, since no other values make sense, due to the pickling/unpickling that is performed in key and value data.

See the Python scripts in the examples directory for examples of how these calls are made from a Python program. They are conceptually identical to the C++ and C programs in the same directory.