\section{Concept}
{\bf PUMA} is coded for parallel execution
on computers with multiple CPUs or networked machines.
The implementation uses MPI (Message Passing Interface),
which is available for nearly every operating system
({\url{http://www.mcs.anl.gov/mpi}}).
In order to avoid maintaining two sets of source code
for the parallel and the single-CPU version, all
calls to the MPI routines are encapsulated in a module.
MoSt (the model starter) takes care of choosing the correct version for compiling.
If MPI is not located by the configure script, or if the single-CPU
version is sufficient, the module {\module mpimod\_dummy.f90}
is used instead of {\module mpimod.f90}.
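As a minimal sketch of this encapsulation (not taken from the PUMA sources;
NROOT and the exact interfaces are assumptions of this sketch), an integer
broadcast routine such as {\sub mpbci} could look as follows in the two modules:
\begin{verbatim}
! mpimod.f90 (parallel version): wrap the MPI library call
subroutine mpbci(k)      ! broadcast one integer from the root process
implicit none
include 'mpif.h'
integer :: k, ierr
integer, parameter :: NROOT = 0   ! root process id (assumed here)
call mpi_bcast(k,1,MPI_INTEGER,NROOT,MPI_COMM_WORLD,ierr)
end subroutine mpbci

! mpimod_dummy.f90 (single-CPU version): same interface, no communication
subroutine mpbci(k)
implicit none
integer :: k
end subroutine mpbci
\end{verbatim}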
\section{Parallelization in the Gridpoint Domain}
The data arrays in the gridpoint domain are either
three-dimensional, e.g. gt(NLON, NLAT, NLEV), referring
to an array organized by longitudes, latitudes, and levels,
or two-dimensional, e.g. gp(NLON, NLAT).
The code is organized so that there are no dependencies
in the latitudinal direction while the model is in the gridpoint domain.
Such dependencies are resolved during the Legendre transformations.
The data is therefore partitioned by latitude.
The program can use as many CPUs as half of the number of latitudes,
with each CPU doing
the computations for a pair of (North/South) latitudes.
However, there is the restriction that the number of latitudes
(NLAT) divided by the number of processors (NPRO), which gives
the number of latitudes per process (NLPP), must leave zero
remainder; e.g. a T31 resolution uses $NLAT=48$.
Possible values for NPRO are then 1, 2, 3, 4, 6, 8, 12, and 24.
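For example, T31 with $NLAT=48$ run on $NPRO=4$ processes gives
$NLPP = 48/4 = 12$ latitudes per process. A corresponding consistency check
could be sketched as follows (the error handling shown is illustrative only):
\begin{verbatim}
NLPP = NLAT / NPRO               ! latitudes per process
if (mod(NLAT,NPRO) /= 0) then    ! the division must leave no remainder
   write(*,*) 'NLAT is not divisible by NPRO - aborting'
   stop
endif
\end{verbatim}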
All loops dealing with a latitudinal index look like:
\begin{verbatim}
do jlat = 1 , NLPP
....
enddo
\end{verbatim}
There are, however, many subroutines, the most prominent
being {\sub calcgp}, that can fuse latitudinal and longitudinal
indices. In all these cases the dimension NHOR is used.
NHOR is defined as $NHOR = NLON \cdot NLPP$ in the
module {\module pumamod}. A typical gridpoint loop of the form
\begin{verbatim}
do jlat = 1 , NLPP
do jlon = 1 , NLON
gp(jlon,jlat) = ...
enddo
enddo
\end{verbatim}
is then replaced by the faster executing loop:
\begin{verbatim}
do jhor = 1 , NHOR
gp(jhor) = ...
enddo
\end{verbatim}
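Since Fortran stores arrays in column-major order, both loops address the
same memory; the indices are related by $jhor = (jlat-1) \cdot NLON + jlon$.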
\section{Parallelization in the Spectral Domain}
The number of coefficients in the spectral domain (NRSP)
is divided by the number of processes (NPRO), giving
the number of coefficients per process (NSPP).
This number is rounded up to the next integer, and the
last process may get some additional dummy elements
if there is a remainder in the division.
All loops in the spectral domain are organized like:
\begin{verbatim}
do jsp = 1 , NSPP
sp(jsp) = ...
enddo
\end{verbatim}
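The rounded-up division can be written as in the following sketch
(only the arithmetic is shown):
\begin{verbatim}
NSPP = (NRSP + NPRO - 1) / NPRO   ! integer division, rounded up
! the last process then owns NPRO*NSPP - NRSP dummy elements
\end{verbatim}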
\section{Synchronization points}
All processes must communicate and therefore have to
be synchronized at the following events:
\begin{itemize}
\item Legendre transformation:
This involves changing from latitudinal partitioning to
spectral partitioning and the associated gather and scatter
operations.
\item Inverse Legendre transformation:
The partitioning changes from spectral to latitudinal
by using gather, broadcast, and scatter operations.
\item Input-Output:
All read and write operations must only be performed by
the root process, which gathers and broadcasts or
scatters the desired information (see the sketch following this list).
Code that is to be executed by the root process exclusively is
written as:
\begin{verbatim}
if (mypid == NROOT) then
...
endif
\end{verbatim}
NROOT is typically 0 in MPI implementations;
mypid (my process id) is assigned by MPI.
\end{itemize}
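As an illustration of the Input-Output rule, the following sketch shows how a
two-dimensional gridpoint field could be written by the root process. The
gather routine {\sub mpgagp} and its interface (global field, local field,
number of levels), as well as the output unit, are assumptions of this sketch:
\begin{verbatim}
real :: zp(NHOR)        ! process-local part of a gridpoint field
real :: zf(NLON*NLAT)   ! global field, needed on the root process only

call mpgagp(zf,zp,1)    ! gather the local parts on root (interface assumed)
if (mypid == NROOT) then
   write(40) zf         ! only the root process writes to the file
endif
\end{verbatim}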
\section{Source code}
Discipline is required when maintaining parallel code.
Here are the most important rules for changing or adding code
to {\bf PUMA}:
\begin{itemize}
\item Adding namelist parameters:
All namelist parameters must be broadcast after reading
the namelist (subroutines {\sub mpbci}, {\sub mpbcr},
{\sub mpbcin}, {\sub mpbcrn}); see the sketch following this list.
\item Adding scalar variables and arrays:
Global variables must be defined in a module header
and initialized.
\item Initialization code:
Initialization code that contains dependencies on
latitude or spectral modes must be performed by the
root process only and then scattered from there
to all child processes.
\item Array dimensions and loop limits:
Always use the parameter constants (NHOR, NLAT, NLEV, etc.)
defined in {\module pumamod.f90} for array dimensions
and loop limits.
\item Testing:
After significant code changes the program should be tested
both in single-CPU and in multi-CPU configurations. The results
of a single-CPU run are usually not exactly the same as those
of a multi-CPU run due to rounding effects, but the results
should show only small differences during the first few time steps.
\item Synchronization points:
The code is optimized for parallel execution, and the
communication overhead is therefore minimized by grouping it around the
Legendre transformations.
If more scatter/gather operations or other communication
routines have to be added, they should be placed
just before or after the calls to the
Legendre transformation. Placing them elsewhere would degrade
the overall performance by introducing additional
process synchronization.
\end{itemize}
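A sketch of the namelist rule, using a hypothetical integer parameter
nexample in a hypothetical namelist examplenl (the file name and unit
number are likewise illustrative); the value is broadcast with {\sub mpbci}
as listed above:
\begin{verbatim}
integer :: nexample              ! hypothetical namelist parameter
namelist /examplenl/ nexample    ! hypothetical namelist

if (mypid == NROOT) then
   open(15,file='example_namelist')
   read(15,examplenl)            ! only the root process reads the file
   close(15)
endif
call mpbci(nexample)             ! broadcast the value to all processes
\end{verbatim}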