#25 Crash on Lemaitre3

Open
opened 4 years ago by pbarriat · 1 comment

Dear all,

On Lemaitre3, with the standard configuration (see the README: https://gogs.elic.ucl.ac.be/pbarriat/ecearth_patch/src/master/README.md), you may occasionally get this crash:

Sep 16 15:27:19 lm3-w045 ifsmaster-ecconf: (hfi/PSM)[94909]: PSM2 can't open hfi unit: -1 (err=23)

or this one:

forrtl: Remote I/O error

If so, you need to make a small change to the OASIS coupler code.

From your ec-earth repository, open sources/oasis3-mct/lib/psmile/src/mod_oasis_method.F90 and replace the statements at lines 423, 429 and 436:

    WRITE(filename,'(a,i2.2)') 'debug.root.',compid
    WRITE(filename2,'(a,i2.2)') 'debug.notroot.',compid
    WRITE(filename,'(a,i2.2,a,i6.6)') 'debug.',compid,'.',mpi_rank_local

with:

    WRITE(filename,'(a,i2.2)') '/dev/shm/debug.root.',compid
    WRITE(filename2,'(a,i2.2)') '/dev/shm/debug.notroot.',compid
    WRITE(filename,'(a,i2.2,a,i6.6)') '/dev/shm/debug.',compid,'.',mpi_rank_local

Once done, recompile OASIS, IFS and NEMO.

Reason:

The scratch on Lemaitre3 is a BeeGFS file system, which handles small files poorly. At the beginning of a run, OASIS creates many small files within a very short period, and BeeGFS sometimes cannot cope with them.

It is therefore better to write these files to RAM (/dev/shm/) instead of your run directory (on scratch).

Charles Pelletier commented 4 years ago
Collaborator

The same bug also affected the PARAMOUR NEMO-CCLM coupled setup. The fix described above solved the issue. Thanks PY.

@klein: Would it be possible to add a CPP key (e.g. "BeeGFS") to OASIS, and adapt the code so that the fix described above is applied when that key is defined?
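
For illustration, here is a minimal standalone sketch of how such a key could switch the debug file prefix at compile time. It is not taken from OASIS3-MCT: the key name key_beegfs, the helper debug_prefix and the sample values of compid and mpi_rank_local are all hypothetical. Only the format strings and the /dev/shm/ prefix come from the fix above.

```fortran
! Sketch only: key_beegfs and debug_prefix are hypothetical names,
! not part of OASIS3-MCT. Save as a .F90 file so the preprocessor runs.
program debug_path_demo
   implicit none
   character(len=64) :: filename
   integer :: compid, mpi_rank_local

   ! Sample values standing in for the ones OASIS would provide
   compid = 1
   mpi_rank_local = 42

   ! Same format strings as in the fix, with a switchable prefix
   write(filename,'(a,i2.2)') trim(debug_prefix())//'debug.root.', compid
   print '(a)', trim(filename)

   write(filename,'(a,i2.2,a,i6.6)') trim(debug_prefix())//'debug.', &
        compid, '.', mpi_rank_local
   print '(a)', trim(filename)

contains

   ! Returns '/dev/shm/' when the key is defined, blanks otherwise
   ! (trim() reduces the blank case to an empty prefix).
   function debug_prefix() result(prefix)
      character(len=9) :: prefix
#ifdef key_beegfs
      prefix = '/dev/shm/'
#else
      prefix = ''
#endif
   end function debug_prefix

end program debug_path_demo
```

Built with the key defined (for example with -Dkey_beegfs), this prints /dev/shm/debug.root.01 and /dev/shm/debug.01.000042. Without the key it prints debug.root.01 and debug.01.000042, which matches the current behaviour of writing into the run directory.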
