c ParaDyn - Parallel DYNAMO - molecular dynamics with EAM potentials
c
c Authored by Steve Plimpton
c   (505) 845-7873, sjplimp@cs.sandia.gov
c   Dept 9221, MS 1111, Sandia National Labs, Albuquerque, NM 87185-1111
c
c Based on the serial DYNAMO code authored by
c   Stephen Foiles (foiles@ca.sandia.gov), Sandia National Labs, Livermore, CA
c   Murray Daw (daw@hubcap.clemson.edu), Clemson University
c
c See the README file for more information

ParaDyn 1998 - Release 1 (April 1998)

This distribution contains the ParaDyn code and related potential and
test files.

ParaDyn is a parallel molecular dynamics code which uses embedded atom
method (EAM) potentials to model metals and metal alloys.  It is
written in F77 and C and performs message-passing via MPI calls.  Thus
it should be portable to virtually any parallel or single-processor
machine.

All of ParaDyn's features are part of a newer parallel molecular
dynamics code called LAMMPS, which has a multitude of additional
features.  LAMMPS is generally faster than ParaDyn for reasonably
large problems.  ParaDyn is no longer under active development.  The
LAMMPS WWW Site at http://www.cs.sandia.gov/~sjplimp/lammps.html has
more details.

ParaDyn is an adaptation of the serial DYNAMO code written by Stephen
Foiles and Murray Daw.  My thanks to both Stephen and Murray for
assistance at various stages of this project.

Similar to DYNAMO, ParaDyn's features include NPT dynamics, a choice
of temperature/pressure controls and boundary conditions, atom and
region constraints, and options for dynamics or energy minimization.
For those who have used DYNAMO, one difference is that for portability
ParaDyn is run with a script file of input commands, rather than a
Fortran namelist deck.

The parallel techniques used in ParaDyn are discussed in the paper:

  S. J. Plimpton and B. A. Hendrickson, "Parallel Molecular Dynamics
  With the Embedded Atom Method", in Materials Theory and Modelling,
  edited by J. Broughton, P. Bristowe, and J. Newsam, (MRS Proceedings
  291, Pittsburgh, PA, 1993), p. 37.

The code and a postscript copy of the paper are available from me via
e-mail or on the Web at http://www.cs.sandia.gov/~sjplimp/main.html

Please send me e-mail if you want to be notified of any upgrades or
significant bug-fixes to ParaDyn.  Feel free to contact me regarding
ParaDyn if you (a) have specific questions about how to use it, (b)
encounter problems when running it or find any bugs, (c) port it to a
new machine, (d) succeed in using or modifying it to solve some
interesting problem.

Steve Plimpton

----------------------------------------------------------------------------

When you uncompress (gunzip paradyn.tar.gz) and untar (tar xvf
paradyn.tar) the distribution, you should have 2 files and 4
directories: src, potentials, tools, examples.

README          this file
OVERVIEW        explanation of all ParaDyn input commands

***** src directory:

*.f,*.c,*.h     ParaDyn source files
Crib            variable documentation for ParaDyn
Makefile        top-level generic Makefile
Makefile.xxx    low-level Makefiles for various machines

***** potentials directory:

cuu3,niu3,...   potential files for various elements in DYNAMO format
***** tools directory:

dyn2para.f      serial utility to convert DYNAMO coordinate files to
                ParaDyn atom files - use for converting DYNAMO output
                files to ParaDyn input files - see the header of the
                file for usage syntax
randomize.f     serial utility to randomize atom order in an atom file
Makefile        Makefile to make the tools

***** examples directory:

Contains input and output files for 4 benchmark calculations you can
use to test ParaDyn.  See the "Testing ParaDyn" section below.

----------------------------------------------------------------------------

Making ParaDyn:

The src directory contains the F77 and C source files for ParaDyn as
well as a top-level Makefile and examples of lower-level
machine-specific Makefile.xxx files.

Makefile is set up to create multiple versions of ParaDyn for various
machines.  Type "make" to see a list of supported targets.  Typing
"make target" will produce an executable named "pd_target".  The
top-level Makefile works on most Unix platforms, but may be
incompatible with some, since the "make" command is not very standard.
If it is available on your system, an alternative is the GNU make
command "gmake", which is standard.

When a particular target is made, an Obj_target sub-directory is
created to store machine-specific *.o files.  This prevents *.o files
generated for different machines from getting mixed up.

If a Makefile.xxx file exists for your machine, you will likely need
to edit the section with system-specific paths, compiler options, MPI
libraries, etc.  To make ParaDyn for a new machine, simply copy one of
the Makefile.xxx files to use as a template, giving it a new suffix.
Then add the new suffix to the top-level Makefile in the appropriate
places.

ParaDyn is a parallel code that will run on any number of processors,
but it also runs fine in serial on a single processor - e.g. Cray
vector machines, any Unix workstation, or (possibly) even a Wintel PC.
To compile for a single processor, you have 2 choices.  If you have
MPI installed, you can compile and link as usual - see the
Makefile.sgi file for an example.  This will allow you to run on
multiple (virtual) processors of your workstation - e.g. using the
mpirun command.  Or you can compile without MPI, using the
Makefile.serial file.  This requires that you first make a library in
the STUBS sub-directory that provides hooks for the MPI routines
called by ParaDyn.  Just type "make" in the STUBS directory to create
this library.

Note that ParaDyn is not optimized for the vector or RISC processing
that single processors often support.  If you plan to run large
calculations on a vector machine such as the Cray Y-MP or C90, you may
be better off getting the original DYNAMO code from Stephen Foiles,
which is optimized for vector processing.

There are 2 F77 compiler flags that enable specific features in the
code.  If you add -DSTRESS to the F77FLAGS line, ParaDyn will compute
the stresses on individual atoms, which can be dumped to a file as
desired.  This slows down the force computations by about 10%, so it
is not turned on by default.  If you have a Fortran compiler that
supports the common "pointer" extensions to F77, you can add
-DDYNAMIC, which allows ParaDyn to set array sizes at run-time.  The
advantage is that you typically do not need to recompile the code when
you run different size problems on different numbers of processors.
This flag also slows down the code by about 10%, so you may not wish
to use it.
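For anyone unfamiliar with the "pointer" extension, here is a minimal
sketch of the Cray-style form that most F77 compilers supporting it
accept.  This is only an illustration of the language feature itself,
not ParaDyn code - the exact spelling of the allocation call differs
between compilers, and memory.f is the authoritative example of how
ParaDyn actually uses it:

c     Sketch of the non-standard Cray-style "pointer" extension: the
c     array x has no storage of its own until px is assigned an
c     address at run-time (here via the malloc intrinsic many
c     compilers provide; assumes 4-byte reals).
      program ptrdemo
      integer n
      real x(1)
      pointer (px, x)
      n = 1000
      px = malloc(4*n)
      x(1) = 0.0
      x(n) = 1.0
      call free(px)
      end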
Ideally you might run a -DDYNAMIC version of the code while setting up
or running a few timesteps of your problem.  Then look at the "Array
bounds" section at the bottom of the log file to see the exact sizes
needed, set the parameters accordingly in paradyn.h, and recompile
without the -DDYNAMIC flag for the production runs of your problem.

If you change the setting of either of these flags, make sure you
recompile the entire ParaDyn source, since the Makefile will not
trigger this recompilation automatically.  Also, if your compiler does
not support "pointer" statements, you should delete memory.f from the
list of *.f files in the top-level Makefile.

There are other compiler flags you may want to experiment with for
optimal performance of the inter-processor communication routines in
ParaDyn.  See the discussion in the "Running ParaDyn" section below.

----------------------------------------------------------------------------

Testing ParaDyn:

In the examples directory, there are 4 *.in benchmark input files.
Two are small enough to run quickly on a single workstation processor
(cu_bulk.in and min.in).  All 4 can be run on a parallel machine.  You
can check for accuracy by comparing the results from running on your
machine with the example output files in the examples directory.  You
should get identical answers on any number of processors.  Files with
an example.xxx.P suffix are sample outputs from running the benchmarks
on P processors of machine xxx.  The test scripts will run all 4
examples on different machines.

Here is how to run the benchmark tests:

(1) pd_sgi < cu_bulk.in

    Runs 100 steps of a bulk Cu lattice of 500 atoms.  Creates
    cu_bulk.log as output.

(2) pd_sgi < gb_diff.in

    Runs 3000 steps of a Cu grain boundary with 1720 atoms.  Needs
    gb_diff.atoms as additional input.  Creates gb_diff.log as output.
    Creates gb_diff.diff as additional output so long as ParaDyn is
    compiled with the correct diagnostic routine.

(3) pd_sgi < gb_random.in

    Same as the previous run, except it uses a randomized atom set
    with parallel method 2.  Runs 3000 steps of a Cu grain boundary
    with 1720 atoms.  Needs gb_diff.atoms_random as additional input
    (created by running gb_diff.atoms through randomize).  Creates
    gb_diff.log_random as output.  Creates gb_diff.diff_random as
    additional output so long as ParaDyn is compiled with the correct
    diagnostic routine.  You should get statistically similar answers
    to (2).

(4) pd_sgi < min.in

    Minimizes the energy of a Cu impurity atom in a 500-atom Ni
    lattice.  Needs min.atoms as additional input.  Creates min.log
    and min.dump as output.

----------------------------------------------------------------------------

Running ParaDyn:

I have attempted to implement essentially all the features of DYNAMO
in ParaDyn.  These features are accessed as commands in the input file
and are described in the OVERVIEW file.

The potential files used by ParaDyn are in the same format used by
DYNAMO for single elements or alloy systems.  A few of these are
provided in the potentials directory.  Stephen Foiles has a large
collection of these.

There are three input file commands that affect how fast a problem
runs on a given number of processors.  These are "parallel method",
"newton flag", and "neighbor method"; they are described in the
OVERVIEW file.  When running a particular problem on a particular
number of processors, you may want to experiment with different values
of these three options to see what gives the fastest run time.
In principle you should get identical answers using any combination of
the three options on any number of processors.  In practice, round-off
errors can cause slight differences and eventual divergence of
dynamical trajectories.

As discussed in the OVERVIEW file, when using parallel method = 2
(force-decomposition), for optimal performance you should run on P
processors where P = M*N and M is roughly equal to N.  If you run with
P a prime (or with widely differing factors), then your communication
costs will be closer to those of parallel method = 1.  For example,
P = 16 factors nicely as M = N = 4, whereas P = 13 is prime and forces
M = 1, N = 13.

Another issue affecting execution speed for parallel method = 2
(force-decomposition) is the ordering of atoms.  If the ordering is
regular, either in the input data file ("read atoms") or as the
lattice is generated ("create atoms"), then force-decomposition is
more likely to suffer from load-imbalance.  This can slow the code
down in parallel.  Thus you should first use the serial code
"randomize" from the tools directory to randomly permute the order of
atoms in a "read atoms" input file.  See the header of the randomize.f
file for more information.  Or you should use the "create atoms"
command with a non-zero 6th parameter - see the OVERVIEW file for
details.

One important note: two runs with different atom orderings should give
statistically similar results (e.g. thermodynamic averages), but will
typically not be identical.  This will happen if you use the "create
velocity" command, because it does not assign the same initial
velocities in both runs.  It will also happen if you use "read
velocities" on an input file of velocities that has not been permuted
with exactly the same re-ordering.

A final issue affecting execution speed is the communication routines
used in ParaDyn.  As discussed at the top of gs.c, the default is to
use a recursive gather/scatter with blocking sends and receives.  This
is often the fastest option, but it may cause some MPI implementations
to hang if you run big simulations.  This will typically happen before
the first thermodynamics print-out to the screen.  Sometimes this can
be controlled by setting appropriate environment variables that alter
MPI rules for when message buffering is done.  If the code hangs, you
can add a -DGS_IRECV switch to your CCFLAGS options in the
Makefile.xxx file and recompile gs.c.  This uses non-blocking
receives, which require no buffer space and thus should not hang, but
it is often a bit slower.  You can also experiment with -DGS_MPI,
which calls the MPI_Allgatherv and MPI_Reduce_scatter routines
directly.  In most MPI implementations these are slower than the
recursive gather/scatter routines I provide, but maybe you will be
lucky!  Finally, if you have an optimized daxpy routine (e.g. a BLAS
routine) on your machine, you can use the -DGS_AXPY switch to boost
performance a bit.  You must check that the syntax in gs.c for calling
daxpy matches your library routine.

There is also a -DSYNC switch you can compile the force.f file with,
which does extra synchronization before calling the communication
routines.  This tends to slow down the code slightly, but it gives you
a more accurate print-out of the time spent communicating.  Without
this switch, load imbalance will typically be included in the
communication time; with this switch it should be included in the
force time.

If you get an error message when running ParaDyn about "boosting"
something, it means your arrays are not allocated large enough.  You
need to modify the appropriate parameter statement(s) at the top of
the param.h file and recompile.
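As a purely illustrative sketch of what such an edit looks like: the
actual parameter names, declarations, and default values are whatever
your copy of param.h contains, and the numbers below are invented for
a hypothetical 1,000,000-atom force-decomposition run on 64
processors, using the rough sizing rules given a little further on.

c     Hypothetical values only - consult your own param.h.  For
c     N = 1,000,000 atoms on P = 64 processors with
c     force-decomposition: maxlocal >= N/P = 15625 and
c     maxrow, maxcol >= N/sqrt(P) = 125000.
      parameter (maxlocal = 16000)
      parameter (maxrow = 130000)
      parameter (maxcol = 130000)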
Some of these boost errors are detected at setup; others (like
neighbor list overflow) may not occur until the middle of a run.  When
the latter happens, the program will either gracefully stop (if all
processors incurred the same error) or hang.  Either way you should
get an error message printed to the screen.  You can also get an error
message about running out of physical memory on a processor.  If you
can't set any parameters smaller and still run your problem, then you
need more processors!

A "boost" message should occur only rarely if you compiled with
-DDYNAMIC.  The most likely culprit will be the neighbor list arrays.
For the -DDYNAMIC version, you increase these sizes by setting the
"extra_neigh" parameter in paradyn.h to a larger value.  This is a
multiplier on the size of the neighbor list arrays that ParaDyn
estimates at run-time.

If you are running in static memory mode (no -DDYNAMIC switch), you
are more likely to get boost errors for maxlocal, maxrow, or maxcol,
all of which are functions of the number of atoms N and the number of
processors P.  Roughly speaking, you need maxlocal >= N/P.  For
atom-decomposition, you need maxrow >= N/P and maxcol = N.  For
force-decomposition, you need maxrow >= N/sqrt(P) and
maxcol >= N/sqrt(P).

I've tried to be pretty careful about detecting memory-overflow kinds
of errors.  If ParaDyn ever crashes or hangs without first spitting
out an error message, due to algorithmic/parallelism problems (as
opposed to physics problems, like too big a timestep or putting 2
atoms on top of each other, or the GS_IRECV problem discussed above),
it's probably a bug, so let me know about it.

To use ParaDyn to measure properties of a system other than the simple
thermodynamic quantities (T, P, etc.) it outputs to the screen and log
file, you have 2 options.  You can simply dump atom positions and/or
velocities to disk and post-process the files to compute the desired
quantities.  Or you can write your own diagnostic routine to compute
the desired quantities on-the-fly as the simulation runs.  The
diagnostic.f file has an example routine that works in this fashion to
compute diffusion coefficients in a grain boundary system.  See the
top of the diagnostic.f file and the "diagnostic" command description
in the OVERVIEW file for more information.  If you don't want any
diagnostics, you should compile with the diagnostic.hold file instead
of the diagnostic.f file.  Note that 2 of the example tests use the
diagnostic.f provided.

You can also add a user-defined force to whatever atoms you want every
time the force routine is called, by adding a userforce routine in the
file userforce.f.

There are two utilities in the tools directory.  For DYNAMO users
there is a dyn2para code that converts DYNAMO configuration files to
ParaDyn input files (atom and velocity dumps).  See the header of the
dyn2para.f file for the usage syntax.  The randomize utility was
discussed above.

----------------------------------------------------------------------------

Understanding ParaDyn:

If you wish to modify ParaDyn or understand its inner workings, you
may find the Crib file useful.  I'd like to think the source code is
so cleanly written as to be self-documenting (ha!).  The MRS paper is
the best high-level overview of what ParaDyn is doing as far as
parallelism is concerned.  If you can't understand something or want
suggestions about modifying some section of code, give me a call or
send e-mail about it and I'll explain (or obfuscate) things more
fully.