\section{Concept}
The {\bf Planet Simulator} is coded for parallel execution
on computers with multiple CPUs or on networked machines.
The implementation uses MPI (Message Passing Interface),
which is available for nearly every operating system
{\url{http://www.mcs.anl.gov/mpi}}.
In order to avoid maintaining two sets of source code
for the parallel and the single-CPU version, all
calls to the MPI routines are encapsulated in a module.
Users who want to compile and execute the parallel
version use the module {\bf mpimod.f90} and the commands
{\bf mpif90} for compiling and {\bf mpirun} for running.
If MPI is not available or the single-CPU
version is sufficient, {\bf mpimod\_stub.f90}
is used instead of {\bf mpimod.f90}.
In this case also remove or comment out the line:
\begin{verbatim}
! use mpi
\end{verbatim}
and set the number of processors to 1:
\begin{verbatim}
parameter(NPRO = 1)
\end{verbatim}
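The following sketch illustrates the encapsulation idea. It is not a copy of
the model source: the routine name {\bf mpbcr} is one of the broadcast routines
mentioned in section {\it Source code} below, and the argument kind and MPI
data type used here are assumptions made for the example.
\begin{verbatim}
! Parallel version (mpimod.f90): forward the call to MPI
subroutine mpbcr(pr)      ! broadcast one real value from the root process
use mpi
real (kind=8) :: pr
integer :: ierr
call mpi_bcast(pr,1,MPI_REAL8,0,MPI_COMM_WORLD,ierr)
end subroutine mpbcr

! Single-CPU version (mpimod_stub.f90): same interface, no communication
subroutine mpbcr(pr)
real (kind=8) :: pr
end subroutine mpbcr
\end{verbatim}
Because both modules provide the same interfaces, the rest of the code
compiles unchanged in either configuration.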
\section{Parallelization in Gridpoint Domain}
The data arrays in the gridpoint domain are either
three-dimensional, e.g.\ gt(NLON, NLAT, NLEV), organized by
longitudes, latitudes, and levels,
or two-dimensional, e.g.\ gp(NLON, NLAT).
The code is organized such that, in the gridpoint domain,
there are no dependencies in the latitudinal direction;
such dependencies are resolved during the Legendre transformations.
Therefore the partitioning of the data is done in latitudes.
The program can use as many CPUs as there are latitudes, with the
extreme case of every CPU doing the computations for a single latitude.
There is, however, the restriction that the number of latitudes
(NLAT) divided by the number of processes (NPRO), giving
the number of latitudes per process (NLPP), must leave no
remainder. E.g.\ a T31 resolution uses $NLAT=48$;
possible values for NPRO are then 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48.
All loops dealing with a latitudinal index look like:
\begin{verbatim}
do jlat = 1 , NLPP
....
enddo
\end{verbatim}
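The partitioning arithmetic can be written down in a few lines. The following
stand-alone sketch uses the parameter names of pumamod.f90 with example values
for a T31 run on 4 processes; it is an illustration, not part of the model:
\begin{verbatim}
program check_partition                  ! sketch only
parameter(NLAT = 48)                     ! number of latitudes (T31)
parameter(NPRO =  4)                     ! number of processes
parameter(NLPP = NLAT / NPRO)            ! latitudes per process, here 12
if (mod(NLAT,NPRO) /= 0) stop 'NLAT must be a multiple of NPRO'
write(*,*) 'latitudes per process: ',NLPP
end program check_partition
\end{verbatim}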
There are, however, many subroutines, most prominently
{\bf calcgp}, that fuse the latitudinal and longitudinal
indices. In all these cases the dimension NHOR is used.
NHOR is defined as $NHOR = NLON * NLPP$ in the
module {\bf pumamod}. The typical gridpoint loop, which looks like:
\begin{verbatim}
do jlat = 1 , NLPP
do jlon = 1 , NLON
gp(jlon,jlat) = ...
enddo
enddo
\end{verbatim}
is then replaced by the faster-executing loop:
\begin{verbatim}
do jhor = 1 , NHOR
gp(jhor) = ...
enddo
\end{verbatim}
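Both loop forms address the same storage: a field dimensioned gp(NLON,NLPP) on
each process can equally be treated as a one-dimensional array of length NHOR,
with the index relation $jhor = (jlat-1) * NLON + jlon$. The following
stand-alone sketch (illustration only, with T31 values and 4 processes)
demonstrates the equivalence:
\begin{verbatim}
program index_mapping                    ! sketch only
parameter(NLON = 96)                     ! longitudes (T31)
parameter(NLPP = 12)                     ! latitudes per process (NLAT=48, NPRO=4)
parameter(NHOR = NLON * NLPP)            ! horizontal points per process
real :: gp(NLON,NLPP)                    ! two-dimensional view
real :: gh(NHOR)                         ! one-dimensional view of the same memory
equivalence (gp,gh)
gp(5,3) = 1.0
write(*,*) gh((3-1)*NLON + 5)            ! prints 1.0
end program index_mapping
\end{verbatim}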
\section{Parallelization in Spectral Domain}
The number of coefficients in the spectral domain (NRSP)
is divided by the number of processes (NPRO), giving
the number of coefficients per process (NSPP).
This number is rounded up to the next integer, so the
last process may get some additional dummy elements
if the division leaves a remainder.
All loops in the spectral domain are organized like:
\begin{verbatim}
do jsp = 1 , NSPP
sp(jsp) = ...
enddo
\end{verbatim}
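The rounding up can be expressed with integer arithmetic as
$NSPP = (NRSP + NPRO - 1) / NPRO$ (integer division). The sketch below uses
illustrative numbers and is not part of the model:
\begin{verbatim}
program spectral_partition                    ! sketch only
parameter(NRSP = 1056)                        ! illustrative coefficient count
parameter(NPRO =    5)                        ! number of processes
parameter(NSPP = (NRSP + NPRO - 1) / NPRO)    ! coefficients per process
write(*,*) 'NSPP           = ',NSPP               ! 212
write(*,*) 'dummy elements = ',NSPP*NPRO - NRSP   ! 4
end program spectral_partition
\end{verbatim}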
\section{Synchronization points}
All processes must communicate and therefore have to
be synchronized at the following events:
\begin{itemize}
\item Legendre transformation:
This involves changing from latitudinal partitioning to
spectral partitioning and thus requires some gather and scatter
operations.
\item Inverse Legendre transformation:
The partitioning changes from spectral to latitudinal
by using gather, broadcast, and scatter operations.
\item Input/Output:
All read and write operations must be done only by
the root process, which gathers and broadcasts or
scatters the information as required.
Code that is to be executed exclusively by the root process is
written as:
\begin{verbatim}
if (mypid == NROOT) then
...
endif
\end{verbatim}
NROOT is typically 0 in MPI implementations;
mypid (my process identification) is assigned by MPI.
\end{itemize}
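As an illustration, a typical root-only output step first gathers the
partitioned gridpoint field on the root process, which then writes it. The
sketch below is not taken from the model source: the name and argument order
of the gather routine ({\bf mpgagp}) are assumptions, and the output unit and
field are invented for the example.
\begin{verbatim}
real :: gp(NHOR)             ! partitioned field, one chunk per process
real :: zf(NLON*NLAT)        ! full field, complete only on the root process

call mpgagp(zf,gp,1)         ! gather one level from all processes (assumed interface)
if (mypid == NROOT) then
   write(40) zf              ! only the root process writes (unit 40 is made up)
endif
\end{verbatim}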
\section{Source code}
Maintaining parallel code requires some discipline.
Here are the most important rules for changing or adding code
to the {\bf Planet Simulator}:
\begin{itemize}
\item Adding namelist parameters:
All namelist parameters must be broadcast after reading
the namelist (subroutines mpbci, mpbcr, mpbcin, mpbcrn);
see the sketch following this list.
\item Adding scalar variables and arrays:
Global variables must be defined in a module header
and initialized.
\item Initialization code:
Initialization code that depends on latitudes or
spectral modes must be executed by the
root process only and then scattered from there
to all other processes.
\item Array dimensions and loop limits:
Always use the parameter constants (NHOR, NLAT, NLEV, etc.)
as defined in pumamod.f90 for array dimensions
and loop limits.
\item Testing:
After significant code changes the program should be tested
in both the single-CPU and the multi-CPU configuration. The results
of a single-CPU run are usually not exactly the same as the
results of a multi-CPU run due to rounding effects,
but the results should show only small
differences during the first timesteps.
\item Synchronization points:
The code is optimized for parallel execution and therefore
minimizes communication overhead. The necessary communication
code is grouped around the Legendre transformations.
If more scatter/gather operations or other communication
routines are to be added, they should be placed
just before or after the calls to the
Legendre transformation. Any other place would degrade
the overall performance by introducing additional
process synchronization.
\end{itemize}
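As a sketch of the first rule, suppose a new namelist parameter is added. The
namelist name, file name, unit number, and variable names below are invented
for the example; mpbci and mpbcr are the broadcast routines listed above,
assumed here to broadcast one integer and one real value, respectively:
\begin{verbatim}
integer :: ndays                         ! hypothetical integer parameter
real    :: solnew                        ! hypothetical real parameter
namelist /example_nl/ ndays, solnew      ! names invented for this sketch

if (mypid == NROOT) then                 ! only the root process reads
   open (11,file='example_namelist')
   read (11,example_nl)
   close(11)
endif
call mpbci(ndays)                        ! broadcast the integer parameter
call mpbcr(solnew)                       ! broadcast the real parameter
\end{verbatim}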