\section{Concept}
The {\bf Planet Simulator} is coded for parallel execution
on computers with multiple CPUs or on networked machines.
The implementation uses MPI (Message Passing Interface),
which is available for nearly every operating system
{\url{http://www.mcs.anl.gov/mpi}}.
In order to avoid maintaining two sets of source code
for the parallel and the single-CPU version, all
calls to the MPI routines are encapsulated in a module.
Users who want to compile and execute the parallel
version use the module
{\bf mpimod.f90} and the commands {\bf mpif90}
for compiling and {\bf mpirun} for running.
If MPI is not available or the single-CPU
version is sufficient, {\bf mpimod\_stub.f90}
is used instead of {\bf mpimod.f90}.
In this case, also remove or comment out the line:
\begin{verbatim}
! use mpi
\end{verbatim}
and set the number of processors to 1:
\begin{verbatim}
parameter(NPRO = 1)
\end{verbatim}
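The encapsulation means that the rest of the model never calls MPI
directly. As an illustration only (the routine name mpbcr is taken from
the broadcast helpers listed in the Source code section, but the exact
interface shown here is an assumption), the parallel module wraps an MPI
call while the stub provides the same interface as a no-op:
\begin{verbatim}
! mpimod.f90 (parallel version) - sketch, not the original source
      subroutine mpbcr(pr)   ! broadcast a real scalar from the root
      use mpi
      real    :: pr
      integer :: ierr
      call mpi_bcast(pr,1,MPI_REAL,0,MPI_COMM_WORLD,ierr)
      end subroutine mpbcr

! mpimod_stub.f90 (single-CPU version) - same interface, no operation
      subroutine mpbcr(pr)
      real    :: pr
      end subroutine mpbcr
\end{verbatim}
Because both modules export the same subroutine names, the rest of the
code compiles unchanged in either configuration.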
\section{Parallelization in Gridpoint Domain}
The data arrays in the gridpoint domain are either
three-dimensional, e.g. gt(NLON, NLAT, NLEV), referring
to an array organized by longitudes, latitudes and levels,
or two-dimensional, e.g. gp(NLON, NLAT).
The code is organized such that, while in the gridpoint domain,
there are no dependencies in the latitudinal direction.
Such dependencies are resolved during the Legendre transformations.
The partitioning of the data is therefore done in latitudes.
The program can use as many CPUs as there are latitudes, with the
extreme case of every CPU doing the computations for a single latitude.
There is, however, the restriction that the number of latitudes
(NLAT) divided by the number of processes (NPRO), giving
the number of latitudes per process (NLPP), must leave no
remainder. For example, a T31 resolution uses $NLAT=48$.
Possible values for NPRO are then 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48.
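Written as parameter definitions in the same style as above (the
particular values are only an example):
\begin{verbatim}
      parameter(NLAT = 48)          ! T31 resolution
      parameter(NPRO =  4)          ! must divide NLAT without remainder
      parameter(NLPP = NLAT / NPRO) ! = 12 latitudes per process
\end{verbatim}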
All loops dealing with a latitudinal index look like:
\begin{verbatim}
do jlat = 1 , NLPP
....
enddo
\end{verbatim}
There are, however, many subroutines, the most prominent being
{\bf calcgp}, that can fuse the latitudinal and longitudinal
indices. In all these cases the dimension NHOR is used.
NHOR is defined as $NHOR = NLON * NLPP$ in the
module {\bf pumamod}. The typical gridpoint loop, which looks like:
\begin{verbatim}
do jlat = 1 , NLPP
do jlon = 1 , NLON
gp(jlon,jlat) = ...
enddo
enddo
\end{verbatim}
is then replaced by the faster executing loop:
\begin{verbatim}
do jhor = 1 , NHOR
gp(jhor) = ...
enddo
\end{verbatim}
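Both loops traverse the same storage, since Fortran arrays are stored
column-major with the longitude index running fastest. A minimal sketch
of the correspondence (the two-dimensional view gp2 is a name introduced
here only for illustration):
\begin{verbatim}
      real    :: gp2(NLON,NLPP)   ! two-dimensional view
      real    :: gp(NHOR)         ! fused view, NHOR = NLON * NLPP
      integer :: jlat, jlon, jhor

      do jlat = 1 , NLPP
        do jlon = 1 , NLON
          jhor = (jlat - 1) * NLON + jlon
          gp(jhor) = gp2(jlon,jlat)   ! identical element in both views
        enddo
      enddo
\end{verbatim}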
\section{Parallelization in Spectral Domain}
The number of coefficients in the spectral domain (NRSP)
is divided by the number of processes (NPRO), giving
the number of coefficients per process (NSPP).
This number is rounded up to the next integer, and the
last process may get some additional dummy elements
if there is a remainder in the division operation.
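In integer arithmetic this rounding up is commonly written as follows;
this is a sketch of the relation, not necessarily the exact statement in
pumamod.f90:
\begin{verbatim}
      parameter(NSPP = (NRSP + NPRO - 1) / NPRO) ! rounded up
      ! the last process then carries NSPP * NPRO - NRSP dummy elements
\end{verbatim}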
All loops in the spectral domain are organized like:
\begin{verbatim}
do jsp = 1 , NSPP
sp(jsp) = ...
enddo
\end{verbatim}
\section{Synchronization points}
All processes must communicate and therefore have to
be synchronized at the following events:
\begin{itemize}
\item Legendre-Transformation:
This involves changing from latitudinal partitioning to
spectral partitioning and thus requires some gather and scatter
operations.
\item Inverse Legendre-Transformation:
The partitioning changes from spectral to latitudinal
by using gather, broadcast, and scatter operations.
\item Input-Output:
All read and write operations must be done only by
the root process, which gathers and broadcasts or
scatters the information as required
(see the sketch following this list).
Code that is to be executed exclusively by the root process is
written like:
\begin{verbatim}
if (mypid == NROOT) then
...
endif
\end{verbatim}
NROOT is typically 0 in MPI implementations;
mypid (my process identification) is assigned by MPI.
\end{itemize}
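A typical output step gathers a partitioned gridpoint field onto the
root process before writing. The following is only a sketch: the gather
routine mpgagp and its argument order are assumptions made for
illustration, not a statement about the actual interface.
\begin{verbatim}
      real :: gp(NHOR)             ! partitioned field on every process
      real :: zfull(NLON*NLAT)     ! complete field, needed on root only

      call mpgagp(zfull, gp, 1)    ! gather one level onto the root process
      if (mypid == NROOT) then
         write(40) zfull           ! only the root process writes
      endif
\end{verbatim}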
\section{Source code}
Maintaining parallel code requires some discipline.
Here are the most important rules for changing or adding code
to the {\bf Planet Simulator}:
\begin{itemize}
\item Adding namelist parameters:
All namelist parameters must be broadcast after reading
the namelist (subroutines mpbci, mpbcr, mpbcin, mpbcrn);
a sketch of this pattern follows this list.
\item Adding scalar variables and arrays:
Global variables must be defined in a module header
and initialized.
\item Initialization code:
Initialization code that contains dependencies on
latitudes or spectral modes must be executed by the
root process only and the results then scattered from there
to all child processes.
\item Array dimensions and loop limits:
Always use the parameter constants (NHOR, NLAT, NLEV, etc.)
as defined in pumamod.f90 for array dimensions
and loop limits.
\item Testing:
After significant code changes the program should be tested
in both the single-CPU and the multi-CPU configuration. The results
of a single-CPU run are usually not exactly the same as the
results of a multi-CPU run due to rounding effects.
But the results should show only small
differences during the first timesteps.
\item Synchronization points:
The code is optimized for parallel execution and therefore
minimizes communication overhead. The necessary communication
code is grouped around the Legendre transformations.
If more scatter/gather operations or other communication
routines are to be added, they should be placed
just before or after the calls to the
Legendre transformation. Any other place would degrade
the overall performance by introducing additional
process synchronization.
\end{itemize}
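A minimal sketch of the namelist rule, assuming that mpbci broadcasts an
integer scalar and mpbcr a real scalar (the subroutine names are those
listed above; their exact interfaces and the namelist shown are
illustrative):
\begin{verbatim}
      integer :: nexample = 0          ! hypothetical integer parameter
      real    :: pexample = 1.0        ! hypothetical real parameter
      namelist /example_nl/ nexample, pexample

      if (mypid == NROOT) then
         open(12, file='example_namelist')
         read(12, example_nl)          ! read on the root process only
         close(12)
      endif
      call mpbci(nexample)             ! broadcast integer to all processes
      call mpbcr(pexample)             ! broadcast real to all processes
\end{verbatim}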