Wednesday, December 26, 2007

Why is Climate Modeling Stuck?

I believe that progress in climate modeling has been relatively limited since the first successes in linking atmosphere and ocean models without flux corrections. (That's about a decade now, long enough to start being cause for concern.) One idea is that tighter codesign of components such as atmosphere and ocean models in the first place would help, and there's something to be said for that, but I don't think that's the core issue.

I suggest that there is a deeper issue based on workflow presumptions. The relationship between the computer science community and the application science community is key. I suggest that the relationship is ill-understood and consequently the field is underproductive.

The relationship between the software development practitioners and the domain scientists is misconstrued by both sides, and both are limited by past experience. Success in such fields as weather modeling and tide prediction provides a context which inappropriately dominates thinking, planning and execution.

Operational codes are the wrong model because scientists do not modify operational codes. Commercial codes are also the wrong model because bankers, CFOs and COOs do not modify operational codes. The primary purpose of scientific codes as opposed to operational codes is to enable science, that is, free experimentation and testing of hypotheses.

Modifiability by non-expert programmers should be, and sadly is not, treated as a crucial design constraint. The application scientist is an expert on physics, perhaps on certain branches of mathematics such as statistics and dynamics, but is typically a journeyman programmer. In general the scientist does not find the abstractions of computer science intrinsically interesting and considers the program to be an expensive and balky laboratory tool.

Being presented with codes that are not designed for modification greatly limits scientific productivity. Some scientists have enormous energy for the task (or the assistance of relatively unambitious and unmarketable Fortran-ready assistants) and take it on with panache, but the sad fact is that they have little idea of what to do or how to do it. This is hardly their fault; they are modifying opaque and unwelcoming bodies of code. Under these daunting circumstances the modifications have the flavor of "one-offs": scripts intended to perform a single calculation, and treated as done more or less when the result "looks reasonable". The key abstractions of computer science and even its key goals are ignored, just as if you were writing a five-liner to, say, flatten a directory tree with some systematic renaming. "Hmm, looks right. OK, next issue."

Thus, while scientific coding has much to learn from the commercial sector, the key use case is rather atypical. The key is in providing an abstraction layer useful to the journeyman programmer, while providing all the verification, validation, replicability, version control and process management the user needs, whether the user knows it or not. As these services are discovered and understood, the value of these abstractions will be revealed, and the entire enterprise will resume its forward progress.

It's my opinion that Python provides not only a platform for this strategy but also an example of it. When a novice Python programmer invokes "a = b + c", a surprisingly large number of things potentially happen. An arithmetic addition is commonly but not inevitably among the consequences and the intentions. The additional machinery is not in the way of the young novice counting apples but is available to the programmer extending the addition operator to support user defined classes.
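To make the "a = b + c" point concrete, here is a minimal sketch of what extending the addition operator can look like. The Quantity class and its units check are hypothetical, invented purely for this illustration; they are not taken from any existing library or from the framework discussed in this post.

```python
class Quantity:
    """A hypothetical value-with-units class, invented for this illustration."""

    def __init__(self, value, units):
        self.value = value
        self.units = units

    def __add__(self, other):
        # Python calls this method when it evaluates `self + other`.
        if not isinstance(other, Quantity):
            # Let Python try the other operand's __radd__, or raise TypeError.
            return NotImplemented
        if self.units != other.units:
            raise ValueError(f"cannot add {self.units} to {other.units}")
        return Quantity(self.value + other.value, self.units)

    def __repr__(self):
        return f"Quantity({self.value!r}, {self.units!r})"


# The novice counting apples never sees any of this machinery.
apples = 2 + 3

# The same `+` syntax dispatches to Quantity.__add__ for the user-defined class.
b = Quantity(1.5, "K")
c = Quantity(0.5, "K")
a = b + c
print(apples, a)
```

The line "a = b + c" reads the same in both cases. The extra behavior (here, refusing to add mismatched units) lives in the class and stays out of the way of anyone who only wants ordinary arithmetic.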

Consider why Matlab is so widely preferred over the much more elegant and powerful Mathematica platform by application scientists. This is because the scientists are not interested in abstractions in their own right; they are interested in the systems they study. Software is seen as a tool to investigate the systems and not as a topic of intrinsic interest. Matlab is (arguably incorrectly) perceived as better than Mathematica because it exposes only abstractions that map naturally onto the application scientist's worldview.

Alas, the application scientist's worldview is largely (see Marshall McLuhan) formed by the tools with which the scientist is most familiar. The key to progress is the Pythonic way, which is to provide great abstraction power without having it get in the way. Scientists learn mathematical abstractions as necessary to understand the phenomena of interest. Computer science properly construed is a branch of mathematics (and not a branch of trade-school mechanics thankyouverymuch) and scientists will take to the more relevant of its abstractions as they become available and their utility becomes clear.

Maybe starting from a blank slate we can get moving again toward a system that can actually make useful regional climate prognoses. It is time we took on the serious task of determining the extent to which such a thing is possible. I also think the strategies I have hinted at here have broad applicability in other sciences.

I am trying to work through enough details of how to extend this Python mojo to scientific programming to make a credible proposal. I think I have enough to work with, but I'll have to treat the details as a trade secret for now. Meanwhile I would welcome comments.

8 comments:

EliRabett said...

Matlab is a hell of a lot cheaper than Mathematica. You can ask a class to buy a Matlab student license.

Michael Tobis said...

True. However UT has site licenses for both, and I am surrounded by Matlab users and know of nobody using Mathematica.

Most of them should switch to Python, of course.

Jordan said...

I've wanted to build a python framework like this for years, but I have no motivation and too many papers I need to get written. If you need some help, I can code my way around a wrapped Fortran or C module OK.

Bryan Lawrence said...

Mmm. Drop me an email ... we are doing a lot of work on python data handling ... and some thinking about distributed (in the cluster sense) data analysis using python ... data analysis isn't that different than climate modelling ... after all prediction is just analysis followed by analysis ... rather a lot of times ...

Meanwhile, take a look at genie if you haven't already (it's not a project I'm involved with, but it may have some relevance to your aims).

Anonymous said...

Matlab comfortingly resembles Fortran whilst Mathematica doesn't! I used to use Mathematica to do symbolic calculus that I could no longer remember how to do with my brain.

I did consider using Python at home for my little personal programming projects, and it looks like a plausible route using NumPy, SciPy and Matlibplot (or something like that) modules and the Ipython interpreter shell. However, I can get stuff done much more easily in Matlab (because I've been using it for 10 years), the IDE allows me to do the things I want to do and there's a pretty good supply of user-written routines for other stuff as well.

Were I to start over then Python looks like a good bet.

More generally, I can see that scientists learning (and applying) more software engineering techniques would be really good. I'm not at all convinced that it would lead to a step change improvement in the 'power' of climate models.

Anonymous said...

Mathematica is good for symbolic algebraic manipulations but it's pretty bad at a lot of things that Matlab is good at. How do you do GARCH in Matlab for instance?

Secondly I strongly believe that Mathematica would be a bad choice for large scale software development. Its production rules method of evaluation would lead to unpredictable results. There are also other issues like variable scoping that it doesn't handle well.

Python's not a bad choice but there would be a need for it to provide the massive number of libraries in Matlab and very nice syntax for dealing with vectors and matrices.

Adam Abernathy said...

As a seasoned computer scientist turned atmospheric scientist, I agree that there is a disconnect between scientific researchers and computer engineers. In choosing software platforms and development environments, what needs to be looked at is the intent of the computation.

Wolfram's Mathematica is a beautiful symbolic platform. When working with extensive calculations and desktop problem solving there is no better choice. Mathematica provides a rapid development environment to solve extensive math & physics problems without too much programming knowledge.

MathWorks' MATLAB excels when you need to create a series of scripts to manipulate extensive datasets. MATLAB offers a "C++" like syntax for programming basic functions and simplistic object-oriented program architecture.

Both of these applications have their scope and their limits. I've seen too many people claim one is better than the other; this simply can't be done. These two applications have their intended purposes and perform them both excellently.

As for Python, it is still a scripting language. Scripting languages will always be limited by the fact that they are high level languages and must execute on top of a compiler and runtime environment.

For the learning curve that Python provides, it could be argued that learning C++ is a better alternative.

What all this boils down to is, you have to pick the best option for your application. With data transport options such as XML, SQL and even simple CSV, data can easily be moved between applications and platforms.