#25 Crash on Lemaitre3

Open
opened 4 years ago by pbarriat · 1 comments

Dear all,

On Lemaitre3, with the standard configuration (see README), if you sometimes get this crash:

Sep 16 15:27:19 lm3-w045 ifsmaster-ecconf: (hfi/PSM)[94909]: PSM2 can't open hfi unit: -1 (err=23) 

or this one:

forrtl: Remote I/O error

you need to make a small change to the code of the OASIS coupler.

From your ec-earth repository, open sources/oasis3-mct/lib/psmile/src/mod_oasis_method.F90 and replace:

 423                WRITE(filename,'(a,i2.2)') 'debug.root.',compid
 429                WRITE(filename2,'(a,i2.2)') 'debug.notroot.',compid
 436            WRITE(filename,'(a,i2.2,a,i6.6)') 'debug.',compid,'.',mpi_rank_local

with:

 423                WRITE(filename,'(a,i2.2)') '/dev/shm/debug.root.',compid
 429                WRITE(filename2,'(a,i2.2)') '/dev/shm/debug.notroot.',compid
 436            WRITE(filename,'(a,i2.2,a,i6.6)') '/dev/shm/debug.',compid,'.',mpi_rank_local
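
For those who prefer not to edit the file by hand, the replacement above can be sketched as a small `sed` call. This is only an illustration: the function name `patch_oasis_debug_paths` is hypothetical, and it assumes the three `WRITE(filename...` / `WRITE(filename2...` statements look exactly as quoted above (the line numbers may differ in your OASIS version, so the sketch matches on content rather than on line number). A `.bak` copy of the original file is kept.

```shell
# Hypothetical helper (not part of OASIS or EC-Earth): prefix the OASIS debug
# file names with a RAM-backed directory, as described in this issue.
patch_oasis_debug_paths() {
    src="$1"                 # path to mod_oasis_method.F90
    prefix="${2:-/dev/shm}"  # target directory, /dev/shm by default

    # Only touch WRITE(filename,...) and WRITE(filename2,...) statements, and
    # only the 'debug.' literal on them. Running it twice is harmless because
    # the patched literal no longer starts with 'debug.
    sed -E -i.bak "/WRITE\(filename2?,/ s|'debug\.|'${prefix}/debug.|" "$src"
}
```

Usage, from the root of your ec-earth repository: `patch_oasis_debug_paths sources/oasis3-mct/lib/psmile/src/mod_oasis_method.F90`, then check the result with `diff` against the `.bak` file before recompiling.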

Once done, recompile OASIS, IFS and NEMO.

Reason:

The scratch on Lemaitre3 is a BeeGFS file system, which handles small files poorly. At the beginning of a run, OASIS creates many small files in a very short period, and BeeGFS sometimes cannot cope with them.

So it is better to write these files to RAM (/dev/shm/) instead of your run directory (scratch).
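
Since the debug files then live in node-local memory rather than on scratch, it can be worth checking on a compute node that the target directory exists and is writable before launching a run. A minimal sketch, assuming a standard Linux node where `/dev/shm` is a RAM-backed mount; the function name `check_ram_dir` is hypothetical:

```shell
# Hypothetical pre-run check: confirm the RAM-backed directory is usable
# and print how much space is left on it.
check_ram_dir() {
    dir="${1:-/dev/shm}"
    [ -d "$dir" ] && [ -w "$dir" ] || {
        echo "$dir is not a writable directory" >&2
        return 1
    }
    df -h "$dir"   # size, used and available space of the mount holding $dir
}
```

Calling `check_ram_dir` at the top of the job script would make a missing or read-only `/dev/shm` visible before the model starts. Keep in mind that the files written there are only visible on the node that wrote them.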

Charles Pelletier commented 4 years ago
Collaborator

The same bug also affected the PARAMOUR NEMO-CCLM coupled setup. The fix described above solved the issue. Thanks PY.

@klein: Would it be possible to add a "BeeGFS" CPP key to OASIS, and have the code apply the fix described above when that key is set?
