\section{Concept}
{\bf PUMA} is coded for parallel execution
on computers with multiple CPUs or on networked machines.
The implementation uses MPI (Message Passing Interface),
which is available for nearly every operating system
{\url{http://www.mcs.anl.gov/mpi}}.
In order to avoid maintaining two sets of source code
for the parallel and the single-CPU version, all
calls to the MPI routines are encapsulated in a module.
Most takes care of choosing the correct version for compiling.
If MPI is not located by the configure script, or if the single-CPU
version is sufficient, the module {\module mpimod\_dummy.f90}
is used instead of {\module mpimod.f90}.
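The effect of this encapsulation can be pictured with the following
schematic sketch (simplified interfaces, not taken from the actual PUMA
sources): a broadcast wrapper has a real implementation in
{\module mpimod.f90} and an empty stub with the same interface in
{\module mpimod\_dummy.f90}, so the rest of the code calls the same
routine in both configurations.
\begin{verbatim}
   ! Schematic sketch only - simplified interfaces, not the PUMA sources.

   ! parallel version (mpimod.f90): wrap the real MPI call
   subroutine mpbci(k)                  ! broadcast one integer
      use mpi
      integer :: k, ierr
      call MPI_Bcast(k, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
   end subroutine mpbci

   ! single-CPU version (mpimod_dummy.f90): same interface, no operation
   subroutine mpbci(k)
      integer :: k                      ! nothing to do on one process
   end subroutine mpbci
\end{verbatim}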
\section{Parallelization in the Gridpoint Domain}
The data arrays in the gridpoint domain are either
three-dimensional, e.g. gt(NLON, NLAT, NLEV), referring
to an array organized by longitudes, latitudes, and levels,
or two-dimensional, e.g. gp(NLON, NLAT).
The code is organized such that there are no dependencies
in the latitudinal direction while the model is in the gridpoint domain.
Such dependencies are resolved during the Legendre transformations.
Therefore, the data is partitioned by latitude.
The program can use as many CPUs as half of the number of latitudes,
with each CPU doing
the computations for a pair of (North/South) latitudes.
However, there is the restriction that the number of latitudes
(NLAT) divided by the number of processors (NPRO), giving
the number of latitudes per process (NLPP), must have zero
remainder. E.g., a T31 resolution uses $NLAT = 48$;
possible values for NPRO are then 1, 2, 3, 4, 6, 8, 12, and 24.
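This restriction reduces to a single integer division; a minimal
sketch of the corresponding check (illustrative only, not part of the
PUMA sources) could look like:
\begin{verbatim}
   ! Sketch only: verify that NLAT is evenly divisible by NPRO
   ! and derive the number of latitudes per process (NLPP).
   program check_decomposition
      integer, parameter :: NLAT = 48   ! e.g. T31 resolution
      integer, parameter :: NPRO = 12   ! requested number of processes
      integer :: NLPP

      if (mod(NLAT, NPRO) /= 0) then
         print *, 'NLAT must be divisible by NPRO'
         stop
      endif
      NLPP = NLAT / NPRO                ! latitudes per process
      print *, 'NLPP = ', NLPP
   end program check_decomposition
\end{verbatim}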
All loops dealing with a latitudinal index look like:
\begin{verbatim}
   do jlat = 1 , NLPP
      ....
   enddo
\end{verbatim}
There are, however, many subroutines, most prominently
{\sub calcgp}, that fuse the latitudinal and longitudinal
indices. In all these cases the dimension NHOR is used.
NHOR is defined as $NHOR = NLON * NLPP$ in the
{\module pumamod} module. The typical gridpoint loop, which looks like:
\begin{verbatim}
   do jlat = 1 , NLPP
      do jlon = 1 , NLON
         gp(jlon,jlat) = ...
      enddo
   enddo
\end{verbatim}
is replaced by the faster-executing loop:
\begin{verbatim}
   do jhor = 1 , NHOR
      gp(jhor) = ...
   enddo
\end{verbatim}
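Both loops address the data in the same (column-major) storage order,
with the fused index $jhor = (jlat-1) * NLON + jlon$. The following
small, self-contained example (all names invented for illustration)
demonstrates this equivalence:
\begin{verbatim}
   ! Sketch only: the 2-d view gp2(NLON,NLPP) and the 1-d view
   ! gp1(NHOR) hold the data in the same (column-major) order.
   program fused_index
      integer, parameter :: NLON = 4, NLPP = 2, NHOR = NLON * NLPP
      real :: gp2(NLON,NLPP), gp1(NHOR)

      gp2 = 0.0
      gp2(3,2) = 1.0                    ! jlon = 3, jlat = 2
      gp1 = reshape(gp2, (/ NHOR /))    ! fuse both indices into one
      print *, gp1((2-1)*NLON + 3)      ! jhor = (jlat-1)*NLON + jlon -> 1.0
   end program fused_index
\end{verbatim}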
\section{Parallelization in the Spectral Domain}
The number of coefficients in the spectral domain (NRSP)
is divided by the number of processes (NPRO), giving
the number of coefficients per process (NSPP).
This number is rounded up to the next integer, and the
last process may get some additional dummy elements
if there is a remainder in the division operation.
All loops in the spectral domain are organized like:
\begin{verbatim}
   do jsp = 1 , NSPP
      sp(jsp) = ...
   enddo
\end{verbatim}
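The rounded-up division and the resulting number of dummy elements can
be sketched as follows (the values of NRSP and NPRO are illustrative
only):
\begin{verbatim}
   ! Sketch only: round the number of coefficients per process up to
   ! the next integer; the padded total NSPP*NPRO may exceed NRSP, so
   ! the last process owns a few dummy elements.
   program spectral_partition
      integer, parameter :: NRSP = 1056 ! example coefficient count
      integer, parameter :: NPRO = 5    ! example number of processes
      integer :: NSPP, ndummy

      NSPP   = (NRSP + NPRO - 1) / NPRO ! integer division, rounded up
      ndummy = NSPP * NPRO - NRSP       ! dummy elements on the last process
      print *, 'NSPP =', NSPP, ' dummy elements =', ndummy
   end program spectral_partition
\end{verbatim}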
\section{Synchronization points}
All processes must communicate and therefore have to
be synchronized at the following events:
\begin{itemize}
\item Legendre transformation:
This involves changing from latitudinal partitioning to
spectral partitioning with the associated gather and scatter
operations.
\item Inverse Legendre transformation:
The partitioning changes from spectral to latitudinal
by using gather, broadcast, and scatter operations.
\item Input-Output:
All read and write operations must be performed only by
the root process, which gathers and broadcasts or
scatters the desired information
(see the sketch following this list).
Code that is to be executed by the root process exclusively is
written as:
\begin{verbatim}
   if (mypid == NROOT) then
      ...
   endif
\end{verbatim}
NROOT is typically 0 in MPI implementations;
mypid (my process id) is assigned by MPI.
\end{itemize}
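The root-only I/O pattern can be sketched as follows. This is an
illustrative sketch, not the PUMA implementation: the subroutine name
and the file unit are invented, NHOR, NLON, NLAT, mypid, and NROOT are
assumed to be provided by the {\module pumamod} module, and the raw
MPI\_Gather call appears only to keep the sketch self-contained; in
the actual code such communication is hidden behind the routines of
{\module mpimod.f90}.
\begin{verbatim}
   ! Sketch only: gather the latitude stripes of a gridpoint field on
   ! the root process and let only the root process write the result.
   ! write_field and unit 40 are invented for this example; NHOR, NLON,
   ! NLAT, mypid and NROOT are assumed to come from pumamod.
   subroutine write_field(gp)
      use mpi
      use pumamod
      real    :: gp(NHOR)               ! stripe owned by this process
      real    :: gf(NLON*NLAT)          ! full field, valid on root only
      integer :: ierr

      call MPI_Gather(gp, NHOR, MPI_REAL, gf, NHOR, MPI_REAL, &
                      NROOT, MPI_COMM_WORLD, ierr)
      if (mypid == NROOT) write(40) gf  ! only the root process writes
   end subroutine write_field
\end{verbatim}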
\section{Source code}
Discipline is required when maintaining parallel code.
Here are the most important rules for changing or adding code
to {\bf PUMA}:
\begin{itemize}
\item Adding namelist parameters:
All namelist parameters must be broadcast after reading
the namelist (subroutines {\sub mpbci}, {\sub mpbcr},
{\sub mpbcin}, {\sub mpbcrn}); see the sketch following this list.
\item Adding scalar variables and arrays:
Global variables must be defined in a module header
and initialized.
\item Initialization code:
Initialization code that contains dependencies on
latitude or spectral modes must be performed by the
root process only and then scattered from there
to all child processes.
\item Array dimensions and loop limits:
Always use the parameter constants (NHOR, NLAT, NLEV, etc.)
as defined in
{\module pumamod.f90} for array dimensions
and loop limits.
\item Testing:
After significant code changes the program should be tested
in single-CPU and in multi-CPU configurations. The result
of a single-CPU run is usually not exactly the same as the
result of a multi-CPU run due to rounding
effects, but the results should show only small
differences during the first few time steps.
\item Synchronization points:
The code is optimized for parallel execution, and therefore the
communication overhead is minimized by grouping it around the
Legendre transformations.
If more scatter/gather operations or other communication
routines are to be added, they should be placed
just before or after the calls to the
Legendre transformation. Placing them elsewhere would degrade
the overall performance by introducing additional
process synchronization.
\end{itemize}
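The namelist rule from the first item can be sketched as follows. The
namelist group, the variable names, and the file name are invented for
this example, and the interfaces of the broadcast helpers (a single
argument for scalars, value plus length for arrays) are assumed here;
check them against {\module mpimod.f90}.
\begin{verbatim}
   ! Sketch only: read the namelist on the root process, then
   ! broadcast every namelist parameter to all other processes.
   ! The namelist group "example", its variables and the file name
   ! are invented; the broadcast interfaces are assumed.
   if (mypid == NROOT) then
      open (11, file='example_namelist')
      read (11, example)            ! only the root process reads the file
      close(11)
   endif
   call mpbci(nexample)             ! broadcast a namelist integer
   call mpbcr(rexample)             ! broadcast a namelist real
   call mpbcrn(rarray, NLEV)        ! broadcast a real array of length NLEV
\end{verbatim}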