Tuesday, January 29, 2008

My Deep Naiveté

I made a couple of comments on Bryan's blog early this morning. One is a continuation of my usual ranting, and the other is an expression of confusion about what web services have to do with anything scientific coders care about.

Meanwhile I am amused to see my position described as "deeply naive" on a blog by Dan Hughes that I hadn't spotted before. It seems worth following for the in-crowd.

I don't, actually, think a ground-up rewrite would be easy, and I have some ideas for a less ambitious way to prove the pudding for a higher-level abstraction approach. We'll see what NSF says about it.

However, I think the tenacious attachment to the code which Dan himself goes so far as to describe as "very olde, very poorly constructed, undocumented, and yet very actively used" (much further than I would go) is damned peculiar. It seems to imply that what we are doing doesn't matter. Well, if it doesn't matter we should quit, and if it does matter we should bite the bullet and write something clean and maintainable and (dare I say it) even literate.

In the end this matter is too important for anything less than maximal transparency. It completely baffles me that this goal is so thoroughly outside the field of vision of practitioners. I think it's fundamental.

Update: Whoops. I mistook what Dan Hughes was about. He isn't attached to the code. He doesn't work with the codes, he just snipes at them from a distance. He's an "auditor". Likely, he doesn't want to believe that climate modeling is feasible, or at least is playing to an audience that thinks like that. One can hope he does not reach the illogical conclusion that climate protection is not a useful goal of policy pending the substantial improvement of such models.

Let me make it very clear, from observing real, productive climate scientists, that extant models are useful tools for science, and that the scientists are well aware of their imperfections and flaws.

This doesn't mean they/we should be immune from criticism, nor that we have a clear sense of how useful the extant models are for projection. I have a lot of doubts on that score, but the rational, risk-weighted response, as I see it, is to be more worried about our collective future as a consequence, not less.

Further Update: That said, the discussions on Hughes's blog are not without value. I found this one especially interesting.

Further Update: While it might appear that there is no intent by Hughes to provide a constructive alternative (see comment #6 to this thread), he does in fact offer up a strawman proposal for an alternative approach here (linked from #7 in the same thread). His server is slow and unreliable, but it's there on occasion. Reading...

Monday, January 28, 2008

Life in the Trenches: Part III

One of my ensembles has died. The run dropped an error file as follows (excerpt):
print_memusage iam 47 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 52301 22403 4114 806 0
print_memusage iam 13 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 52441 22616 4121 806 0
print_memusage iam 36 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 52135 24788 5987 806 0
[0] [MPI Abort by user] Aborting Program!
done.
/opt/lsf/bin/ibrun: line 6: 545 Killed pam -g 1 mvapich_wrapper $*
[0] [MPI Abort by user] Aborting Program!
done.
/opt/lsf/bin/ibrun: line 6: 13870 Killed pam -g 1 mvapich_wrapper $*
[0] [MPI Abort by user] Aborting Program!
done.
/opt/lsf/bin/ibrun: line 6: 3813 Killed pam -g 1 mvapich_wrapper $*
[0] [MPI Abort by user] Aborting Program!
done.

The error file is 1132 lines long. Python would give me more legible information.
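Something like the following is what I have in mind; a throwaway sketch, and the file name and keywords are purely illustrative, not anything that exists in our setup:

# Quick-and-dirty filter for a huge MPI error file: print only the lines
# around anything that looks like an abort, plus a separator. Illustrative only.
keywords = ("Abort", "Killed", "Error")

lines = open("run.err").readlines()
for i, line in enumerate(lines):
    if any(word in line for word in keywords):
        for context in lines[max(0, i - 2): i + 1]:
            print context,
        print "-" * 40

That would boil 1132 lines down to the dozen that matter, with a little context attached.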

I am simply going to hope it has something to do with Lonestar's I/O problems. I need to rerun the exact case, of course.

The ensemble controller somehow knew something was awry. It wasn't written by a very seasoned programmer, and I'm eager to rewrite it, but the time never shows up. Fortunately, this time it appears to have done the right thing.

Meanwhile
  • Charles has convinced me that he really has something going with his current proposal, so I am trying to help firm up the text (for practical purposes, on my own time)
  • I have yet to get sensible output from defineqflux, or to understand how to build it for that matter (see Trenches II)
  • I need to back up my expired files from Chicago
  • I need to document and version control my December work
  • I need to finish building the NCAR diagnostic toolkit now that NCL works
  • round up speakers for scientific software day
  • somehow we need a local MPI install; yow!
  • eventually we need a CCSM build on a TACC platform, but don't hold your breath
This is on half-time! I'm not even supposed to be in on Mondays!

Will I ever write a line of my own code again?

Update: The error file says nothing whatsoever. Successful runs have the same errors, up to but not including the Abort messages, where instead they contain 64 lines consisting of "0".

Friday, January 25, 2008

Why Python?

Other people have other reasons, but mine is succinctly summarized by Paul Prescod in his article On the Relationship between Python and Lisp. Here's the core of the argument:
Paul Graham says that a language designed for "the masses" is actually being designed for "dufuses". My observation is that some of the smartest people on the planet are dufuses when it comes to programming, (or in some cases just dynamic programming languages) and I am pleased to share a language (and code!) with them.

I usually spend a big chunk of my day in Python. But most professional programmers will not be able to do that until Python is a dominant language. In the meantime they must switch back and forth between Python and the language of their day-job. Python is designed to make that easy. During the day they can pound out accounting code and at night switch to hacking distributed object oriented file systems. Every intuitively named function name makes it that much easier to "swap" Python back into your brain. After we have been away from a language for a while, we are all dufuses, and every choice made in favor of the duffers actually benefits even high-end programmers.

I get paid to share my code with "dufuses" known as domain experts. Using Python, I typically do not have to wrap my code up as a COM object for their use in VB. I do not have to code in a crappy language designed only for non-experts. They do not have to learn a hard language designed only for experts. We can talk the same language. They can read and sometimes maintain my code if they need to.

From a purely altruistic point of view, it feels good to me to know that I am writing code (especially open source code) that can become a lesson for a high school student or other programming beginner. I like to give programming classes to the marketing departments at the companies where I work.

Even hard-core geeks get a certain aesthetic pleasure out of using something that feels minimal and simple because most of the complexity has been hidden away or removed.
Code is a communication medium. The machine is not the only reader. Python is the only language explicitly designed with both beginners and experts in mind. This has direct benefits for the transition from beginner to expert, and it also has direct benefits for development collaboration among groups with distinct expertise.

Here is an example. I have never taken a compiler course and I still consider code compilation to be deep magic, though not as much as I did before Python began boosting my sophistication. Nevertheless, I can understand and appreciate the following.


# romanNumerals.py
#
# Copyright (c) 2006, Paul McGuire
#

from pyparsing import *

def romanNumeralLiteral(numeralString, value):
    return Literal(numeralString).setParseAction(replaceWith(value))

one = romanNumeralLiteral("I",1)
four = romanNumeralLiteral("IV",4)
five = romanNumeralLiteral("V",5)
nine = romanNumeralLiteral("IX",9)
ten = romanNumeralLiteral("X",10)
forty = romanNumeralLiteral("XL",40)
fifty = romanNumeralLiteral("L",50)
ninety = romanNumeralLiteral("XC",90)
onehundred = romanNumeralLiteral("C",100)
fourhundred = romanNumeralLiteral("CD",400)
fivehundred = romanNumeralLiteral("D",500)
ninehundred = romanNumeralLiteral("CM",900)
onethousand = romanNumeralLiteral("M",1000)

# Order matters below: the two-character subtractive forms (IV, IX, XL, XC,
# CD, CM) must be tried before the single letters they begin with.
numeral = ( onethousand | ninehundred | fivehundred | fourhundred |
            onehundred | ninety | fifty | forty | ten | nine | five |
            four | one ).leaveWhitespace()

romanNumeral = OneOrMore(numeral).setParseAction( lambda s,l,t : sum(t) )

print romanNumeral.parseString("XLII") # prints "42"



Try doing that in a dozen or so lines of C++ or Java, mate, so that a random reader could get half a clue as to what was happening! Yes of course the "import" statement hides a great deal of magic, but that's the whole point, see?

Wednesday, January 23, 2008

Life in the Trenches: Part II

This story is probably more illustrative than Part I of some of our productivity sinks:

We have a modified set of CAM parameters that demonstrably improves climate fidelity, and we are looking to establish its CO2 doubling sensitivity. This is an atmosphere-only model with six of its magic numbers tweaked by our (UTIG's, i.e., Sen, Stoffa and Jackson's) smart ensemble search algorithm. Our suspicion is that we will see a sensitivity closer to 3 C (increasingly regarded as the most likely value) than to NCAR's 2.4 C.

Now, CO2 sensitivity is a very common use case, perhaps the most famous of all. However, rather than a data ocean it requires an interactive ocean in order to make any sense.

This use case is so common it merits a section in the User Guide.

You will note that the description of what is going on is strikingly terse. If you haven't hung around climate modelers you will not be able to parse it at all, I'll wager.

So note if you can, or take my word if you must, that step 1 seems more or less model-independent, though admittedly it is resolution-dependent. However, since CAM only supports a few resolutions, it's unclear why I should have to jump through these hoops.

Nevertheless, deep within the CAM file hierarchy you do find the directory in question with a perfectly fine Makefile to build definemldlxl, definemldbdy and defineqflux, which in my case naturally doesn't work.

You sort of have to know to tell Lonestar "module load netcdf" and then "env" to get the netcdf paths from the environment, which you can then use to set the environment variables the NCAR build file expects. None of this is documented anywhere, and it is enough to get through the compile phase, but still no joy at link time.

/opt/apps/netcdf/netcdf-3.6.1/lib/libnetcdf.a(attr.o): In function `dup_NC_attrarrayV':
/home/build/rpms/BUILD/netcdf-3.6.1/src/libsrc/attr.c:199: undefined reference to `_intel_fast_memset'


Scratching my head on this one. Maybe someone at TACC can help; will file a ticket and see what happens.

Meanwhile, we have tracked down, in the directory of a worker who has left, an executable for the last of these tools and his data file for the first phase; we are hoping that is good enough. Note, though, that the docs do not say what fields need to be in the output file you pass to defineqflux. It turns out that the ones we normally save aren't the right ones.

No problem, you say. This is a runtime variable ('namelist' variable in fortranese) and I won't have to rebuild the whole model for that, will I? I'll just add the missing fields (rummage through the source code to find them... rummage, rummage...) and Bob will be my uncle.

Well, actually, this would have cost me two days, except that Lonestar lost two days to a panic and I was off Monday, so it cost me a whole week. You see, if you do a restart run (i.e., a continuation of a prior experiment), the Fortran will happily ingest your list of output variables and ignore them (without any message to that effect) in favor of the list you used in the spinup run. No, to have your namelist read and actually used, you have to do a "branch" run rather than a "restart" run, which requires a significantly different set of namelist parameters that is essentially ill-documented. In fact, if you try to do this based on the CAM manual, you will not find sufficient information. The clew, as Sherlock Holmes would say, is here:
This appendix describes a small subset of the namelist variables recognized by CLM 3.0. For more information, please see the CLM User's Guide.

Aha; well, that does help. After a while you understand that the namelist contains two (!) definitions of NREVSN (the restart file name) in "branch" mode (the namelist apparently can have multiple namespaces, though Fortran jargon most likely calls them something else), whereas in "restart" mode it suffices to use the default values for REST_PFILE, which is a small text file containing what in the other case is the value of NREVSN.

So by close of business today we should be able to get the flux file we need to actually run the SOM version of CAM, the building of which was another story, but it's in the past, so I won't bother you with it.

Update: Still haven't pursued the build question.

After rerunning the spinup, the run of defineqflux fails, but later in the process, with the message:

nc_put_vara_double -1 failure: Variable not found.

So another 12-hour spinup run is needed, once I figure out what it needs. This is a good example of a bad error message. In a real language it would not be too much trouble to report which variable wasn't found and where. In a reasonable document the requisite fields would be listed. If I were more competent I'd have done a better job perusing the source. 0 for 3.
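For what it's worth, here's the kind of failure I'd prefer to see. This is a hypothetical sketch (the object and attribute names are made up; it's not how defineqflux or the netCDF C library behaves), just to show how cheap a useful message is:

# Hypothetical: a helpful failure when a requested variable is missing.
# "outfile" stands in for some netCDF-file-like object with a dict of variables.
def put_variable(outfile, name, data):
    if name not in outfile.variables:
        raise KeyError("variable %r not found in %s; file contains: %s"
                       % (name, outfile.path, sorted(outfile.variables)))
    outfile.variables[name][:] = data

One line of checking, and the user knows which variable is missing and what is actually in the file.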

Update 1/28: Used default outputs and set it running on Friday. Tried to run defineqflux today and got the same error. I don't think it's a matter of which fields I use. Perhaps the version of defineqflux which I have is defective. I still don't really know how to build it, so although it is only modestly complicated C code, I can't debug it.

...
Using SST mid-months 8 and 10 to compute QFLUX mid-month 9
dayfact=62
Using SST mid-months 9 and 11 to compute QFLUX mid-month 10
dayfact=60
Using SST mid-months 10 and 0 to compute QFLUX mid-month 11
dayfact=62
nc_put_vara_double -1 failure: Variable not found
Abort


Update 1/29: I have managed to build against a NetCDF library that I myself built from scratch. Still trying to resolve the error, but at least I can put some tracers in the code now.

Meanwhile, a ticket is pending at TACC for how to build against their library.

Update 1/31: Still no joy. It looks very much as if I need the other pieces of the tool chain. But they require the Fortran as well as the C libraries. Linking against TACC's library fails as I reported some time ago. My own build of the Fortran NetCDF yields

vcc works; but missing link magic for f77 I/O.
- NO fort. None of gcc, cc or vcc generate required names.
- f2c : Use #define f2cFortran, or cc -Df2cFortran
- NAG f90: Use #define NAGf90Fortran, or cc -DNAGf90Fortran
- Absoft UNIX F77: Use #define AbsoftUNIXFortran or cc -DAbsoftUNIXFortran
- Absoft Pro Fortran: Use #define AbsoftProFortran
- Portland Group Fortran: Use #define pgiFortran
- PathScale Fortran: Use #define PATHSCALE_COMPILER"
make[3]: *** [fort-attio.lo] Error 1
make[3]: Leaving directory `/home/utexas/ig/tobis/CAMYA/cam1/models/atm/cam/tools/defineqflux/NCDF/netcdf-3.6.2/fortran'

So here I sit, helpless. Two weeks since starting to try to run a standard use case, and I am set back to trying to build system-supported libraries. TACC has been episodically helpful but is presently silent.

I am going to try from the beginning on a UTIG machine next.

The Cybernetic Perspective

In the original sense, "cybernetics" means the optimal use of information in decision-making.

It emerges that developing a statistics of information requires a change of perspective: from the simple one, where the development of the model is informed by the system under study, to one where the model and the system under study are two components of a larger system. The Wiener filter and its descendant, the Kalman filter, are the best-known examples of this approach, but it is more general.
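To make that concrete, here is a minimal scalar Kalman filter, my own toy sketch (a persistence "model" and made-up noise variances), just to show the model and the observed system living inside one larger update loop:

# One predict/update cycle for a scalar state.
# x, P : prior state estimate and its variance
# z    : new observation of the true system
# q, r : assumed model-error and observation-error variances
def kalman_step(x, P, z, q=0.01, r=0.25):
    # Predict: trivial persistence model; a real model would step the physics.
    x_pred = x
    P_pred = P + q
    # Update: blend forecast and observation by their relative uncertainties.
    K = P_pred / (P_pred + r)           # Kalman gain
    x_new = x_pred + K * (z - x_pred)   # corrected state
    P_new = (1.0 - K) * P_pred          # reduced uncertainty
    return x_new, P_new

# toy usage: observations scattered around a "truth" of 1.0
x, P = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 0.95]:
    x, P = kalman_step(x, P, z)
print x, P

The point is not the algebra; it is that neither the model nor the data stream is privileged. The estimate lives in the combined system.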

I believe this use of the word "model" is not entirely coincident with the software engineering sense as in "model-driven development" though it does mesh more or less well with "Model-View-Controller". In our world the View is relatively trivial and usually done offline as postprocessing; it's the utter absence of the Controller that troubles me.

(Update: That last is really a stretch on second thought. Probably more sellable is the idea that the M in MVC corresponds to system state (variables) and the C corresponds to executable code (statements). In the end the word "model" is just as much a source of confusion as ever.)

Anyway I think the cybernetic perspective has value in getting climate modeling unstuck.

Tuesday, January 22, 2008

Quote of the Day

Chris McAvoy on the ChiPy list:
I don't know for sure, but from a "Zen of Python" perspective, when I
can't do something with Python it usually ends up that what I'm trying
to do shouldn't be done.

Life in the Climate Trenches: Part 1

I've got not one but two build problems going on. If the rest of my life weren't going well I'd be awfully depressed. I really hate wasting time on this sort of mundane frustration.

Lonestar (the supercomputer) is offline for alternate-Tuesday maintenance, so for today I will share build Probleme Numero Un with you. Any skill transfer that can help untangle all this would be welcome, so please comment. Rather than polluting the blogspace with innumerable followups I will just provide updates to this entry.

Tomorrow I'll start documenting Probleme Numero Deux: running a standard CAM use case.

It may be that, from some point of view, we'll be documenting my own ignorance more than any fundamental problem with the software; that will be a learning experience just the same.

You may object that this isn't the sort of thing you want your tax dollars paying PhDs to fumble around with; in that case I would be in agreement with you. In the private sector this would clearly be a systems task and not a scientist task. So I appeal to your mercy as a lousy admin and a potentially productive scientist, albeit of a somewhat eccentric stripe. The sooner I get these things done, the sooner I can be a scientist.

OK, here's the scoop (Unix/POSIX skills required to make any sense out of this):

To participate in NCAR CAM development we need to use their diagnostic package, which in turn depends on NCL. Having a legitimate connection to a US research institution, I have long since jumped through the hoops and am registered on the Earth System Grid. So I download the package and all its dependencies.

OK, so the UTIG disk systems are all cross-mounted, and so, though I have root on the climate research machine, I don't have write privileges on /usr. Our excellent systems folk have set things up so that climate users just define $CLIMATE in their .cshrc, which pulls in anything I put (or conceivably somebody else puts) into /usr/local/climate. All I have to do that is nonstandard is insert "climate/" into the path; I make my own ../bin, ../lib and ../man in there. Everyone interested in climate ends up with it in their path, nobody else has their namespace messed up: a very nice approach.

Alas, the lengthy but very clear build instructions do not work (this is standard practice with NCAR products, alas; apparently they lack the skills and/or funding to actually make their products portable, though the build instructions are usually very hopeful). In the present case the snag occurs even before I get to NCAR's stuff.

NetCDF is already available. I have acquired and built jpeglib and zlib as follows:

cd jpeg-6b
make clean
./configure --prefix=/usr/local/climate
make
make install

which puts five executables into /usr/local/climate/bin and

cd zlib-1.2.3
make clean
./configure --prefix=/usr/local/climate
make
make install


which puts a library into /usr/local/climate/lib

My current snag is building HDF-4, which in the configure phase always stops at:

checking zlib.h usability... yes
checking zlib.h presence... yes
checking for zlib.h... yes
checking for compress2 in -lz... yes
checking jpeglib.h usability... no
checking jpeglib.h presence... no
checking for jpeglib.h... no
checking for jpeg_start_decompress in -ljpeg... no
configure: error: couldn't find jpeg library

I have tried various combinations here. What seems to me ought to work is

./configure --prefix=/usr/local/climate --with-zlib=/usr/local/climate \
--with-jpeg=/usr/local/climate --disable-netcdf

but it seems to ignore the prefix altogether. Even if I change --with-zlib to point to a totally bogus address, it finds the zlib stuff anyway. It seems to be ignoring the flags I am setting.

Working hypothesis: I am building jpeg wrong, and some system setting is overriding my value of --with-zlib.

A less satisfying but easier solution could come from downloading binaries. At this point I simply have to swallow my pride and ask which, if any, of the binaries listed here might work.

Advice would be even more welcome than sympathy.

Update 1/24:

UTIG support advised option 4, which worked more or less smoothly.

Also, from UTIG support:

Oh I didn't see you had your own (up a directory). The command you
were looking for would be

./configure --prefix=/usr/local/climate \
--with-zlib=/usr/local/climate/NCLc3/zlib-1.2.3 \
--with-jpeg=/usr/local/climate/NCLc3/jpeg-6b --disable-netcdf

And I'm idiotically proud of this script, my first conditional in a shell script in a long time:

set hn = `hostname`
if ($hn == 'foo.whatever.utexas.edu') then
echo "climate paths added"
setenv NCARG_ROOT /usr/local/climate/NCLBin/
set path = ( /usr/local/climate/NCLBin/bin/ $path )
endif


So there's finally a happy ending on this one. TACC is still having hardware or configuration problems with Lonestar, which is preventing any progress on the other front.

Tuesday, January 15, 2008

Language of Choice

Presumably everyone knows that Python is on the rise with a bullet. Of course, now that Python is sexy it will probably attract some weaker programmers, diluting its reputation. Such are the costs of great beauty. Right now, though, the upswing seems to be near its maximum rate.

In the Slashdot article discussing this, I saw a reference to the Jargon File's definition of the concept of Language of Choice. FORTRAN is dismissed there as a niche language for scientists, but nothing better is proposed.

So what should the language of choice be for high performance numerical computing?

There are teams working on replacements (Fortress, Chapel, some IBM thing with a forgettable name), but I think much of the high performance work of the future can be done in Python with a code generation strategy. I'm not sure what the target language ought to be for the code generation. Would there be advantages to Fortress or Chapel for this? For contemporary purposes even F77 would work (as I said in the PyNSol papers) but to be honest the F77 target is almost a stunt, just to make a point. I don't expect it would be useful in the long run.

I am currently thinking most of the system should be agnostic as to the target language. If done right, multiple targets can be supported in an architecture-dependent bottom layer. Even Verilog is a candidate. (F90 and onward are not. I'd sooner use BF or Whitespace.)
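Just to make the idea concrete, here is the sort of toy I have in mind; nothing I have actually built, and the names and the "spec" format are entirely hypothetical:

# A tiny "spec" for a 1-D diffusion update, and two pluggable backends that
# emit source text for it. Purely illustrative.
DIFFUSION = ("unew", "u[i] + nu*(u[i+1] - 2.0*u[i] + u[i-1])")

def emit_f77(name, expr, n="n"):
    # fixed-form Fortran 77 loop over interior points (1-based arrays)
    body = expr.replace("[", "(").replace("]", ")")
    return "\n".join(["      do 10 i = 2, %s - 1" % n,
                      "         %s(i) = %s" % (name, body),
                      "   10 continue"])

def emit_c(name, expr, n="n"):
    # the equivalent C loop over interior points (0-based arrays)
    return ("for (i = 1; i < %s - 1; i++) {\n"
            "    %s[i] = %s;\n"
            "}") % (n, name, expr)

print emit_f77(*DIFFUSION)
print emit_c(*DIFFUSION)

The interesting design work is all in the middle layer, which stays the same no matter which of these targets (or some exotic future one) ends up doing the arithmetic.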

Update: Bryan Lawrence has a nice summary of the issues.

Monday, January 14, 2008

Do NCAR or GISS Ask for This?

This crossed my path on a CS mailing list I follow. The context is about whether students need more software engineering or more computer science. It's a bit harsh on us older types but it certainly is an interesting list. I post it so scientific computing types can consider what they may or may not be missing. I should also note that there are practicing scientific coders for whom I have great respect who would find this list ludicrous.
As a software engineer in the industry and as a person who hires developers I have to say that 2 out of every 3 interviews I take from people with a CS or MS in computer science I do not hire because they are completely inadequate as software engineers. More often than not, they do not know about or have distaste for one or more of the following:
  • Design Patterns
  • Agile engineering / programming
  • Extreme Programming
  • Test Driven Development
  • Composition vs. Inheritance
We also give a programming test as a part of the interview process; it is mathematically / algorithmically moderate in challenge, and most interviewees (80%) can solve the problem, but only 15-20% solve it with good programming practices and maybe 5-10% write unit tests. Actually, an interviewee who has good engineering but does not solve the problem is more appealing than someone who solves the problem.

I do recall from my undergraduate program at Purdue that we did have a "Software Engineering" class that did try to drive home the points I mentioned, but I have to say that it did not go far enough.

I would never diminish CS courses, mathematics, or general studies programs ever. I would say that anyone who only takes their CS and never masters the engineering principles either through other course work (several hours worth) or through an internship does not stand much of a chance getting hired at the company I work for.

In the end, I would say that if I knew 6 years ago what I know now, I would have gone for the software engineering program and tried for a dual major in CS if I had the bandwidth.
For what it's worth, I believe that design patterns in practice are overstressed, and that the list isn't especially elegant or complete. That said, I also have the impression that these ideas are only very dimly perceived in the climate modeling centers, and that a great deal of benefit could be gained from them.
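To pick just one item off that list and make it concrete: "composition vs. inheritance" is the observation that plugging collaborators together is usually more flexible than subclassing. A toy sketch, with entirely hypothetical names and no resemblance to any actual model code:

# Inheritance: the model *is a* particular time stepper, so swapping schemes
# means rearranging the class hierarchy.
class LeapfrogStepper(object):
    def step(self, state, dt):
        return state + dt   # stand-in for a real scheme

class InheritedModel(LeapfrogStepper):
    def advance(self, state, dt):
        return self.step(state, dt)

# Composition: the model *has a* stepper handed to it, so schemes can be
# swapped, or faked in a unit test, without touching the model at all.
class ComposedModel(object):
    def __init__(self, stepper):
        self.stepper = stepper
    def advance(self, state, dt):
        return self.stepper.step(state, dt)

print ComposedModel(LeapfrogStepper()).advance(0.0, 0.5)

Trivial here, but the second shape is what makes a component testable in isolation, which is exactly what the big climate codes make so hard.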

Thursday, January 10, 2008

Progress

OLPC is a great achievement if only for this story:
"Microsoft struggles to port Windows to a device originally conceived to run Linux."
Still waiting on mine...