Wednesday, January 23, 2008

Life in the Trenches: Part II

This story is probably more illustrative than Part I of some of our productivity sinks:

We have a modified set of CAM parameters that demonstrably improves climate fidelity, and are looking to establish its CO2 doubling sensitivity. This is an atmosphere-only model with six of its magic numbers tweaked by our (UTIG's, i.e., Sen, Stoffa and Jackson's) smart ensemble search algorithm. Our suspicion is that we will see a sensitivity closer to 3 C (increasingly regarded as the most likely value) rather than NCAR's 2.4 C.

Now, CO2 sensitivity is a very common use case, perhaps the most famous of all. However, rather than a data ocean it requires an interactive ocean in order to make any sense.

This use case is so common it merits a section in the User Guide.

You will note that the description of what is going on is strikingly terse. If you haven't hung around climate modelers you will not be able to parse it at all, I'll wager.

So note, if you can, or take my word, if you must, that step 1 seems more or less model independent, though admittedly it is resolution dependent. However, since CAM only supports a few resolutions, it's unclear why I should have to jump through the hoops.

Nevertheless, deep within the CAM file hierarchy you do find the directory in question with a perfectly fine Makefile to build definemldlxl, definemldbdy and defineqflux, which in my case naturally doesn't work.

You sort of have to know to tell Lonestar "module load netcdf" and then run "env" to get the netcdf paths from the environment, which you can then use to create the environment variables the NCAR build file expects. None of this is documented anywhere. It is enough to get through the compile phase, but still no joy at link time:

/opt/apps/netcdf/netcdf-3.6.1/lib/libnetcdf.a(attr.o): In function `dup_NC_attrarrayV':
/home/build/rpms/BUILD/netcdf-3.6.1/src/libsrc/attr.c:199: undefined reference to `_intel_fast_memset'


Scratching my head on this one. Maybe someone at TACC can help; will file a ticket and see what happens.
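
For the record, the environment dance described above looks roughly like this. It is a sketch only: TACC_NETCDF_INC and TACC_NETCDF_LIB are the names TACC's netcdf module sets, while INC_NETCDF and LIB_NETCDF are my assumption about what the NCAR Makefile reads, so check the Makefile before trusting them. The function name is made up for illustration.

```shell
# Run `module load netcdf` first (site-specific, so not repeated here),
# then `env | grep -i netcdf` to see what the module exported.

# export_netcdf_paths: copy the site's netCDF paths into the variables
# the build is ASSUMED to expect (INC_NETCDF, LIB_NETCDF).
export_netcdf_paths() {
    if [ -z "${TACC_NETCDF_INC:-}" ] || [ -z "${TACC_NETCDF_LIB:-}" ]; then
        echo "TACC_NETCDF_INC/TACC_NETCDF_LIB not set; load the netcdf module first" >&2
        return 1
    fi
    INC_NETCDF=$TACC_NETCDF_INC
    LIB_NETCDF=$TACC_NETCDF_LIB
    export INC_NETCDF LIB_NETCDF
}

# Usage:
#   module load netcdf
#   export_netcdf_paths && make
```

This only gets you through the compile phase; it does nothing about the undefined _intel_fast_memset reference at link time.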

Meanwhile, we have tracked down an executable for the last one in a directory of a worker who has left and his data file for the first phase; we are hoping that is good enough. Note, though, that the docs do not say what fields need to be in the output of the file you pass to defineqflux. It turns out that the ones we normally save aren't the right ones.

No problem, you say. This is a runtime variable ('namelist' variable in fortranese) and I won't have to rebuild the whole model for that, will I? I'll just add the missing fields (rummage through the source code to find them... rummage, rummage...) and Bob will be my uncle.

Well, actually this would have cost me two days, except that Lonestar lost two days to a panic and I was off Monday, so it cost me a whole week. You see, if you do a restart run (i.e., a continuation of a prior experiment), the fortran will happily ingest your list of output variables and ignore them (without any message to that effect) in favor of the list you used in the spinup run. No, to have your namelist read and actually used, you have to do a "branch" run rather than a "restart" run, which requires a significantly different set of namelist parameters, and that set is poorly documented. In fact, if you try to do this based on the CAM manual, you will not find sufficient information. The clew, as Sherlock Holmes would say, is here:
This appendix describes a small subset of the namelist variables recognized by CLM 3.0. For more information, please see the CLM User's Guide.

Aha; well that does help. After a while you understand that in "branch" mode the namelist contains two (!) definitions of NREVSN (the restart file name); the namelist apparently can have multiple namespaces, though Fortran jargon most likely calls them something else. In "restart" mode it suffices to use the default value of REST_PFILE, which names a small text file containing what in the other case is the value of NREVSN.

So by close of business today we should be able to get the flux file we need to actually run the SOM version of CAM, the building of which was another story but it's in the past so I won't bother you with it.

Update: Still haven't pursued the build question.

After rerunning the spinup, the run of defineqflux fails, but later in the process, with the message:

nc_put_vara_double -1 failure: Variable not found.

So another 12-hour spinup run is needed, once I figure out what it needs. This is a good example of a bad error message. In a real language it would not be too much trouble to report which variable wasn't found and where. In a reasonable document the requisite fields would be listed. If I were more competent I'd have done a better job perusing the source. 0 for 3.
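
One cheap guard against burning another 12-hour run is to check the history file's header for the fields before handing it to defineqflux. A minimal sketch, assuming the netCDF ncdump utility is on the path and that you already know which fields are wanted (the docs don't say, so the file and field names in the usage line are placeholders). The function name is hypothetical.

```shell
# missing_fields: read `ncdump -h` output on stdin and print each named
# variable that is NOT declared in it. Prints nothing if all are present.
missing_fields() {
    header=$(cat)
    for field in "$@"; do
        # Declarations look like "float NAME(time, lat, lon) ;"
        # or, for scalars, "double NAME ;".
        if ! printf '%s\n' "$header" | grep -Eq "[[:space:]]${field}(\(| ;)"; then
            echo "$field"
        fi
    done
}

# Usage (placeholder names):
#   ncdump -h history_file.nc | missing_fields FIELD_A FIELD_B
```

This only checks that a variable is declared in the file. It says nothing about whether defineqflux is itself at fault.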

Update 1/28: Used default outputs and set it running on Friday. Tried to run defineqflux today, with the same error. I don't think it's a matter of which fields I use. Perhaps the version of defineqflux which I have is defective. I still don't really know how to build it, so, though it is only modestly complicated C code, I can't debug it.

...
Using SST mid-months 8 and 10 to compute QFLUX mid-month 9
dayfact=62
Using SST mid-months 9 and 11 to compute QFLUX mid-month 10
dayfact=60
Using SST mid-months 10 and 0 to compute QFLUX mid-month 11
dayfact=62
nc_put_vara_double -1 failure: Variable not found
Abort


Update 1/29: I have managed to build against a NetCDF library that I myself built from scratch. Still trying to resolve the error, but at least I can put some tracers in the code now.

Meanwhile, a ticket is pending at TACC for how to build against their library.

Update 1/31: Still no joy. Looks very much as if I need the other pieces of the tool chain. But they require the fortran as well as the C libraries. Linking against TACC's library fails as I reported some time ago. My own build of Fortran NetCDF yields

vcc works; but missing link magic for f77 I/O.
- NO fort. None of gcc, cc or vcc generate required names.
- f2c : Use #define f2cFortran, or cc -Df2cFortran
- NAG f90: Use #define NAGf90Fortran, or cc -DNAGf90Fortran
- Absoft UNIX F77: Use #define AbsoftUNIXFortran or cc -DAbsoftUNIXFortran
- Absoft Pro Fortran: Use #define AbsoftProFortran
- Portland Group Fortran: Use #define pgiFortran
- PathScale Fortran: Use #define PATHSCALE_COMPILER"
make[3]: *** [fort-attio.lo] Error 1
make[3]: Leaving directory `/home/utexas/ig/tobis/CAMYA/cam1/models/atm/cam/tools/defineqflux/NCDF/netcdf-3.6.2/fortran'

so here I sit helpless. Two weeks since starting to try to run a standard use case and I am set back to trying to build system-supported libraries. TACC has been episodically helpful but is presently silent.

I am going to try from the beginning on a UTIG machine next.

3 comments:

Anonymous said...

Michael,

I noticed this online.. and I think I can help your problem. I think you just need to tell the compiler where the netcdf library is.

on the command line do a:

module load netcdf

and then on your compile flags (something like CFLAGS in the makefile) you will have to add

-I $(TACC_NETCDF_INC)

and then on the LFLAGS

-L $(TACC_NETCDF_LIB) -lnetcdf


You probably also need to add -lguide to the link line as well. The missing _intel_fastmem, according to the intel website, is what occurs when you have a mis-matching compiled binary. I'd be sure to clean up your install directory before you try to compile again. (remove all .o binary files with a make clean)

let me know if you need help

Evan Turner
eturner@tacc.utexas.edu

Michael Tobis said...

TACC rocks!

Michael Tobis said...

No, actually without the -lguide that is exactly equivalent to what I already did.

With the -lguide it just says it can't find -lguide .