#25 Crash on Lemaitre3

Open
opened 4 years ago by pbarriat · 1 comment

Dear all,

On Lemaitre3, with the standard configuration (see the README: https://gogs.elic.ucl.ac.be/pbarriat/ecearth_patch/src/master/README.md), if you sometimes get this crash:

Sep 16 15:27:19 lm3-w045 ifsmaster-ecconf: (hfi/PSM)[94909]: PSM2 can't open hfi unit: -1 (err=23) 

or this one:

forrtl: Remote I/O error

you should make a small change to the OASIS coupler code.

From your ec-earth repository, open sources/oasis3-mct/lib/psmile/src/mod_oasis_method.F90 and replace the following three statements (at lines 423, 429 and 436):

    WRITE(filename,'(a,i2.2)') 'debug.root.',compid
    WRITE(filename2,'(a,i2.2)') 'debug.notroot.',compid
    WRITE(filename,'(a,i2.2,a,i6.6)') 'debug.',compid,'.',mpi_rank_local

with:

    WRITE(filename,'(a,i2.2)') '/dev/shm/debug.root.',compid
    WRITE(filename2,'(a,i2.2)') '/dev/shm/debug.notroot.',compid
    WRITE(filename,'(a,i2.2,a,i6.6)') '/dev/shm/debug.',compid,'.',mpi_rank_local

Once done, recompile OASIS, IFS and NEMO.

Reason:

The scratch on Lemaitre3 is a BeeGFS file system, which handles small files poorly. At the beginning of a run, OASIS creates many small files within a very short period, and BeeGFS sometimes cannot handle them.

It is therefore better to write these files to RAM (/dev/shm/) instead of your run directory (scratch).
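
Since /dev/shm/ may not be writable on every machine, one possible precaution is to probe it before using it as the prefix. The following is only a sketch under that assumption: the program name, the probe file name oasis_probe.tmp and the variable prefix are hypothetical and not part of OASIS.

```fortran
! Sketch only: check that /dev/shm/ is writable, otherwise fall back
! to the current (run) directory. All names here are hypothetical.
program shm_probe_sketch
   implicit none
   character(len=*), parameter :: shm_dir = '/dev/shm/'
   character(len=256) :: prefix
   integer :: probe_unit, ios

   ! Try to create a throwaway file in /dev/shm/.
   open(newunit=probe_unit, file=shm_dir//'oasis_probe.tmp', &
        status='replace', action='write', iostat=ios)

   if (ios == 0) then
      ! The directory is writable: remove the probe and use the RAM prefix.
      close(probe_unit, status='delete')
      prefix = shm_dir
   else
      ! Fall back to an empty prefix, i.e. the run directory.
      prefix = ''
   end if

   print '(a)', 'debug file prefix: "'//trim(prefix)//'"'
end program shm_probe_sketch
```

In an MPI run, several ranks share a node, so the probe file name would need to include the rank to avoid ranks deleting each other's probe.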

Charles Pelletier commented 4 years ago
Collaborator

The same bug also affected the PARAMOUR NEMO-CCLM coupled setup. The fix described above solved the issue. Thanks PY.

@klein: Would it be possible to add a CPP key (e.g. "BeeGFS") to OASIS, and adapt the code to apply the fix described above when that key is set?
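
As a rough illustration of what such a key could look like, here is a minimal standalone sketch. It is not OASIS code: the key name key_beegfs and the constant debug_dir are hypothetical, and only the format strings, compid and mpi_rank_local come from the statements quoted above. It needs the preprocessor enabled, for example through a .F90 file extension, with the key passed as -Dkey_beegfs.

```fortran
! Sketch only: select the debug file prefix at compile time.
! key_beegfs and debug_dir are hypothetical names, not part of OASIS.
program debug_prefix_sketch
   implicit none
#ifdef key_beegfs
   ! Key set: write the debug files to RAM.
   character(len=*), parameter :: debug_dir = '/dev/shm/'
#else
   ! Key not set: keep the original behaviour (run directory).
   character(len=*), parameter :: debug_dir = ''
#endif
   character(len=64) :: filename
   integer :: compid, mpi_rank_local

   ! Placeholder values; in OASIS these are set by the coupler.
   compid = 1
   mpi_rank_local = 42

   ! Same formats as the original statements, with the prefix prepended.
   write(filename,'(a,a,i2.2)') debug_dir, 'debug.root.', compid
   print '(a)', trim(filename)

   write(filename,'(a,a,i2.2,a,i6.6)') debug_dir, 'debug.', compid, &
        '.', mpi_rank_local
   print '(a)', trim(filename)
end program debug_prefix_sketch
```

Compiled with the key, this prints /dev/shm/debug.root.01 and /dev/shm/debug.01.000042; without it, the same names with no prefix. Keeping the prefix in a single constant would mean the three WRITE statements change only once, whichever file system is used.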
