Tuesday, April 14, 2009

Seeking help from a Statistician

Suppose I have distinct points on a cumulative distribution function, at cumulative probabilities 0.01, 0.25, 0.5, 0.75, and 0.99, obtained from an interview. I can impose a constraint that the 0.01 point is not "too close" to the 0.25 point, and similarly for the 0.99 and the 0.75 points.

I need to interpolate between the given points.

Now, of course there is not a unique interpolation, but I guess there is a maximum entropy interpolation that is in some sense closest to a Gaussian. I wonder how I would find out what it is.

Secondly, I would like to find the distribution of the sum of two variables that each have such a distribution. I imagine a Monte Carlo method would work, but if there is a closed form, so much the better. It needs to be reasonably computationally efficient and either available as a Python package or relatively easily programmed and tested.

Thirdly, similarly for a product of two such distributions.

If there is not a closed form, I need a way to take samples of the distribution from Python. It's not hard for me to construct a method that is crudely correct in the sense that the biases are unlikely to matter for the purposes at hand, but it seems someone ought to be able to tell me how to do better.
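A crude method of the kind described above might look like the following sketch. It is not a recommendation from any statistician: it assumes linear interpolation of the quantile function between the five points, ignores the outer 1% tails entirely, and treats the two quantities as independent. The names `make_sampler`, `monte_carlo`, and `empirical_quantile`, and the example numbers, are all made up for illustration.

```python
import bisect
import random

# Cumulative probabilities at which the interview answers were given.
PROBS = [0.01, 0.25, 0.5, 0.75, 0.99]

def make_sampler(values, probs=PROBS, rng=random):
    """Return a function drawing one sample by inverse-transform sampling.

    The quantile function is interpolated linearly between the given points.
    Draws are confined to [values[0], values[-1]], so the outer 2% of the
    probability mass is folded into that range (a deliberately crude choice).
    """
    if any(b <= a for a, b in zip(values, values[1:])):
        raise ValueError("values must be strictly increasing")

    def draw():
        u = rng.uniform(probs[0], probs[-1])
        # Index of the right-hand end of the interval containing u.
        i = min(bisect.bisect_right(probs, u), len(probs) - 1)
        p0, p1 = probs[i - 1], probs[i]
        x0, x1 = values[i - 1], values[i]
        return x0 + (x1 - x0) * (u - p0) / (p1 - p0)

    return draw

def monte_carlo(draw_a, draw_b, op, n=20000):
    """Sorted samples of op(A, B), assuming A and B are independent."""
    return sorted(op(draw_a(), draw_b()) for _ in range(n))

def empirical_quantile(sorted_samples, p):
    """Crude empirical quantile of samples that are already sorted."""
    k = min(int(p * len(sorted_samples)), len(sorted_samples) - 1)
    return sorted_samples[k]

# Hypothetical interview answers for two quantities.
rng = random.Random(0)
a = make_sampler([1.0, 4.0, 5.0, 6.0, 9.0], rng=rng)
b = make_sampler([10.0, 18.0, 20.0, 23.0, 30.0], rng=rng)

total = monte_carlo(a, b, lambda x, y: x + y)
product = monte_carlo(a, b, lambda x, y: x * y)
print([round(empirical_quantile(total, p), 2) for p in PROBS])
print([round(empirical_quantile(product, p), 2) for p in PROBS])
```

The five quantiles printed at the end put the sum and the product back into the same five-point form as the inputs, so the results can be compared directly with a person's estimate of the whole.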

It would be nearly hopeless for a small traffic blog like this to get an answer by itself, but I wonder if Twitter isn't going to prove useful for scientists to answer questions of this sort. So I've tweeted a call for assistance and pointed to this posting. It will be very encouraging if I get a useful response; I have little hangups like this one all the time.

7 comments:

Bob O'Hara said...

For this sort of problem you really need to provide more context. There is obviously an infinite number of possible distributions that would fit perfectly, so we need to know more about the problem. Where is the distribution coming from? What is it a distribution of? Are there any limits on the distribution? Can we assume symmetry? Etc. etc.

John said...

In my experience, the values people give you in an interview like you mentioned are very fuzzy. They could give you substantially different values depending on how long it's been since they've had something to eat! So depending on your circumstances, you could feel free to change the responses a bit for your convenience.

Are your values symmetric? i.e. are the 0.01 and 0.25 values as far from 0.5 as their 0.99 and 0.75 counterparts? If so, maybe you could simply assume a normal distribution. If not, maybe you could use something like a gamma.

If you'd like to discuss this more, please let me know.

Michael Tobis said...

John understands the question. The distribution is simply a coarse Bayesian prior; I am trying to get at people's beliefs about problems that are too complex for formal solution, in order to draw some conclusions from them that themselves are well posed and straightforward.

I am trying to do some math on people's rough estimates. Two purposes come to mind immediately: 1) to detect whether people's estimates of components of a problem are consistent with their estimates of the whole and 2) to obtain consensus estimates. However, I don't like his suggestion of a gamma.

Nothing constrains the values to be symmetric; they are monotonically ordered. That is all. I can think of ways to specify them that would be awkward, though, so I am willing to constrain the 1% points to be in some sense "far" from the 25% points, just to handle that.

If a person specifies a mean and a standard deviation, there are many distributions that would satisfy that, but the one which makes the fewest assumptions, i.e., adds no new information, is the Gaussian. I don't know what the equivalent is if you specify five degrees of freedom rather than two. That is really the hard part of the question.

So one constraint is that if the five points are consistent with a Gaussian, the resulting distribution should in fact be Gaussian. Gamma does not satisfy this.
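One way to test whether five points are consistent with a Gaussian is to convert the cumulative probabilities to standard normal scores and see whether the values fall on a straight line against them. The sketch below does that with an ordinary least-squares line, using only the standard library. The function name `gaussian_fit` is invented here, and how small a residual counts as "consistent" is left as a judgment call.

```python
from statistics import NormalDist

PROBS = [0.01, 0.25, 0.5, 0.75, 0.99]

def gaussian_fit(values, probs=PROBS):
    """Least-squares line values ~ mu + sigma * z through the normal scores z.

    Returns (mu, sigma, max_abs_residual). A residual near zero means the
    points are consistent with a single normal distribution.
    """
    z = [NormalDist().inv_cdf(p) for p in probs]
    n = len(z)
    z_mean = sum(z) / n
    x_mean = sum(values) / n
    sigma = (sum((zi - z_mean) * (xi - x_mean) for zi, xi in zip(z, values))
             / sum((zi - z_mean) ** 2 for zi in z))
    mu = x_mean - sigma * z_mean
    resid = max(abs(xi - (mu + sigma * zi)) for zi, xi in zip(z, values))
    return mu, sigma, resid

# Points taken exactly from a normal distribution with mean 10 and sd 2.
exact = [NormalDist(10, 2).inv_cdf(p) for p in PROBS]
print(gaussian_fit(exact))

# A lopsided set of answers leaves a large residual.
print(gaussian_fit([1, 2, 10, 15, 20]))
```

When the residual is negligible, the fitted `mu` and `sigma` define the normal distribution that should come back out of any interpolation scheme meeting this constraint.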

A less formal constraint is that the curve should be bell-shaped. If the points are (1, 2, 10, 15, 20) this can't be done, since the region from 1 to 2 will have to carry (nearly) as big a fraction of the distribution as the region from 2 to 10. I am not sure what to do in that case. But if a bell-like curve is possible, I want a bell, and if a normal distribution is possible, that is the specific bell that should emerge.
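The objection to (1, 2, 10, 15, 20) can be turned into a rough screen: divide the probability mass of each interval by its width and check that these average densities rise to a single peak and then fall. This is only a necessary condition for a single-peaked curve, it says nothing about the 1% tails, and the function names are invented for this sketch.

```python
PROBS = [0.01, 0.25, 0.5, 0.75, 0.99]

def average_densities(values, probs=PROBS):
    """Probability mass divided by width for each interval between points."""
    return [(p1 - p0) / (x1 - x0)
            for x0, x1, p0, p1 in zip(values, values[1:], probs, probs[1:])]

def could_be_bell(values, probs=PROBS):
    """True if the interval densities rise to a single peak and then fall.

    This is only a screen: a valley in the averages rules out a
    single-peaked density, but passing does not guarantee a nice bell.
    """
    d = average_densities(values, probs)
    peak = d.index(max(d))
    rising = all(a <= b for a, b in zip(d[:peak], d[1:peak + 1]))
    falling = all(a >= b for a, b in zip(d[peak:], d[peak + 1:]))
    return rising and falling

print(average_densities([1, 2, 10, 15, 20]))
print(could_be_bell([1, 2, 10, 15, 20]))  # False: dense, sparse, then denser again
```

For (1, 2, 10, 15, 20) the interval from 1 to 2 is far denser than the one from 2 to 10, and the density then rises again from 10 to 15, so the screen rejects it.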

The right formal constraint is probably some flavor of maximum entropy, but I have no idea how to formalize that.

Bob O'Hara said...

There are skew-normal and skew-t distributions available. But with 5 fuzzy points, I wouldn't want to go any further than specifying location, scale, and skewness. You could look at a skew-normal distribution.

I'm not sure what you're meaning by it being bell-shaped: are you saying that it should be unimodal and symmetric? If so, try a t-distribution.

I'm not sure I'd suggest a skewed t, although they do exist. That feels like over-fitting.

There is also a literature on prior elicitation: Tony O'Hagan is a name to start from. But it's not a literature I'm familiar with.

Michael Tobis said...

I think skew-normal and skew-t may have been the hints I was looking for. Thanks!

Anonymous said...

Like Bob, I'm not too familiar with the prior elicitation literature, but Tony O'Hagan is also an author I'd start with. Try:

http://www.stat.cmu.edu/tr/tr808/tr808.pdf
http://www.shef.ac.uk/content/1/c6/03/09/33/uncertainty_in_elicitation%5B1%5D.pdf
http://www.jpgosling.co.uk/Pub/GoslingThesis.pdf

The latter two take a nonparametric approach which tries to avoid assuming a particular form for the distribution. They also treat uncertainty in the fitted pdf, which is probably important if you only have a few points in the cdf to go by. These methods might be overkill for your problem, though.

James Annan said...

Why not just use piecewise Gaussian approximations for each interval in turn? Obviously this only does the interpolation and not extrapolation, but that's all you asked for! That's easy to do, simple to understand, and if the points were actually taken from a Gaussian then that is what you will get back. If you want a smoother answer, you could transform the cumulative probabilities with the inverse error function (a probit transform), fit a spline (which will be a straight line if the distribution is Gaussian) and convert back.
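One reading of the piecewise suggestion, sketched with the standard library only: map each cumulative probability to its standard normal score, interpolate that score linearly in x within each interval, and map back through the normal CDF. Linear interpolation of the score is the same as fitting one Gaussian through the two endpoints of an interval. The function name is invented here, and a smoother version would replace the linear interpolation with a spline.

```python
from statistics import NormalDist

PROBS = [0.01, 0.25, 0.5, 0.75, 0.99]
STD_NORMAL = NormalDist()

def piecewise_gaussian_cdf(values, probs=PROBS):
    """Return F(x) that follows a Gaussian CDF exactly on each interval.

    Between neighbouring points the normal score z is interpolated linearly
    in x, which fits one normal distribution through the two endpoints.
    """
    z = [STD_NORMAL.inv_cdf(p) for p in probs]

    def cdf(x):
        if not values[0] <= x <= values[-1]:
            raise ValueError("x lies outside the given points (no extrapolation)")
        # Find the interval [values[i], values[i + 1]] containing x.
        i = max(j for j in range(len(values) - 1) if values[j] <= x)
        t = (x - values[i]) / (values[i + 1] - values[i])
        return STD_NORMAL.cdf(z[i] + t * (z[i + 1] - z[i]))

    return cdf

# Points taken from a normal distribution with mean 10 and sd 2 give it back.
target = NormalDist(10, 2)
F = piecewise_gaussian_cdf([target.inv_cdf(p) for p in PROBS])
print(F(9.0), target.cdf(9.0))

# Lopsided points still interpolate monotonically through every given point.
G = piecewise_gaussian_cdf([1, 2, 10, 15, 20])
print([round(G(x), 3) for x in (1, 2, 6, 10, 15, 20)])
```

The resulting CDF is continuous but its density jumps at the interior points whenever neighbouring intervals imply different Gaussians, which is the roughness the spline variant is meant to smooth out.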