Monday, January 28, 2008

Life in the trenches III

One of my ensembles has died. The run dropped an error file as follows (excerpt):
print_memusage iam 47 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 52301 22403 4114 806 0
print_memusage iam 13 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 52441 22616 4121 806 0
print_memusage iam 36 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 52135 24788 5987 806 0
[0] [MPI Abort by user] Aborting Program!
done.
/opt/lsf/bin/ibrun: line 6: 545 Killed pam -g 1 mvapich_wrapper $*
[0] [MPI Abort by user] Aborting Program!
done.
/opt/lsf/bin/ibrun: line 6: 13870 Killed pam -g 1 mvapich_wrapper $*
[0] [MPI Abort by user] Aborting Program!
done.
/opt/lsf/bin/ibrun: line 6: 3813 Killed pam -g 1 mvapich_wrapper $*
[0] [MPI Abort by user] Aborting Program!
done.

The error file is 1132 lines long. Python would give me more legible information.

I am simply going to hope it has something to do with Lonestar's I/O problems. I need to rerun the exact case, of course.

The ensemble controller somehow knew something was awry. It wasn't written by a very seasoned programmer, and I'm eager to rewrite it, but the time never shows up. Fortunately, this time it appears to have done the right thing.

Meanwhile
  • Charles has convinced me that he really has something going with his current proposal, so I am trying to help firm up the text (for practical purposes, on my own time)
  • I have yet to get sensible output from defineqflux, or to understand how to build it for that matter (see Trenches II)
  • I need to back up my expired files from Chicago
  • I need to document and version control my December work
  • I need to finish building the NCAR diagnostic toolkit now that NCL works
  • I need to round up speakers for scientific software day
  • Somehow we need a local MPI install; yow!
  • Eventually we need a CCSM build on a TACC platform, but don't hold your breath
This is on half-time! I'm not even supposed to be in on Mondays!

Will I ever write a line of my own code again?

Update: The error file tells me nothing whatsoever. Successful runs produce the same messages up to, but not including, the Abort messages; in their place they contain 64 lines consisting of "0".
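For what it's worth, here is a minimal Python sketch of how that comparison between a failed and a successful error file could be automated. It is not what I actually ran. It assumes that rank numbers, memory sizes, and process IDs are the only things that legitimately vary between runs, so it replaces every run of digits with "N" and then reports which line "shapes" appear in only one of the two files. The function names `line_shapes` and `shape_diff` are made up for illustration, and the sample lines are a few lines lifted from the excerpt above plus stand-in "0" lines, not a real pair of files.

```python
import re
from collections import Counter


def line_shapes(lines):
    """Count non-blank lines after replacing each run of digits with 'N'."""
    return Counter(
        re.sub(r"\d+", "N", line.strip()) for line in lines if line.strip()
    )


def shape_diff(failed_lines, ok_lines):
    """Return (shapes only in the failed run, shapes only in the good run)."""
    failed, ok = line_shapes(failed_lines), line_shapes(ok_lines)
    only_failed = {s: n for s, n in failed.items() if s not in ok}
    only_ok = {s: n for s, n in ok.items() if s not in failed}
    return only_failed, only_ok


if __name__ == "__main__":
    # Illustrative samples only; in practice pass open file objects instead.
    failed = [
        "print_memusage iam 47 stepon after dynpkg. -1 in the next line means unavailable",
        "print_memusage: size, rss, share, text, datastack= 52301 22403 4114 806 0",
        "[0] [MPI Abort by user] Aborting Program!",
        "done.",
        "/opt/lsf/bin/ibrun: line 6: 545 Killed pam -g 1 mvapich_wrapper $*",
    ]
    ok = [
        "print_memusage iam 13 stepon after dynpkg. -1 in the next line means unavailable",
        "print_memusage: size, rss, share, text, datastack= 52441 22616 4121 806 0",
        "0",
        "0",
    ]
    only_failed, only_ok = shape_diff(failed, ok)
    for shape, count in only_failed.items():
        print(f"failed only ({count}x): {shape}")
    for shape, count in only_ok.items():
        print(f"ok only     ({count}x): {shape}")
```

Counting shapes rather than diffing line by line sidesteps the fact that the MPI ranks report in a different order on every run, which would make an ordinary diff diverge almost immediately.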

1 comment:

David B. Benson said...

Doesn't seem that way, does it?