How to Fit a Distribution to Data
This article discusses techniques for fitting a given distribution type to historical data. The problem of determining which distribution type best reflects a data set is a bit different and is not covered here.
Generalized regression techniques such as Logistic Regression are used to predict the probability of an outcome from many input variables. This article covers a simpler topic -- estimating a marginal distribution without any conditionality on any inputs. For example, given a set of data between 0 and 1, how would you find the parameters of the best fit Beta distribution?
Once a distribution type has been identified, the parameters to be estimated have been fixed, so that a best-fit distribution is usually defined as the one with the maximum likelihood parameters given the data.
Specific Estimation Formulae
Many textbooks provide parameter estimation formulas or methods for most of the standard distribution types. Use of these are, by far, the easiest and most efficient way to proceed. For example, the parameters of a best-fit Normal distribution are just the sample Mean and sample standard deviation.
The book Uncertainty by Morgan and Henrion, Cambridge University Press, provides parameter estimation formula for many common distributions (Normal, LogNormal, Exponential, Poisson, Gamma, Weibull, Uniform, Triangular, and Beta). Estimation formula for other distribution types can often be found on Wikipedia.
Some estimation formula are summarized here. Data is denoted by «x», and the index of the data by «I».
- m = Mean(x, I)
- s = SDeviation(x, I)
See Normal for details.
- med = Exp(Mean(Ln(x)))
- gsdev = Exp(SDeviation(Ln(x)))
See LogNormal for details.
- m = Mean(x, I)
See Exponential for details.
- m = Mean(x, I)
See Poisson for details.
- a = Mean(x, I)^2/SDeviation(x, I)^2
- b = SDeviation(x, I)^2/Mean(x, I)
See Gamma for details.
The parameters for Weibull are fit using a regression. By re-arranging the CDF of the Weibull and substituting Z = Ln(-Ln(1-F(x))) and Y = Ln(x), the relationship between Z and Y is linear, so we can use Regression to fit Z = mY + b. Thus, the procedure is this:
Index bm := ['b', 'm']; var Fx := (Rank(Data, Data_Index) - 0.5)/Size(Data_Index); var Z := Ln(-Ln(1 - Fx)); var fit := Regression(Z, Array(bm, [1, ln(x)]), I, bm); var m := fit[bm = 'm']; var b := fit[bm = 'b']; ...
Then the Weibull parameters are given by
- k = m
- c = Exp(-b/m)
With the Uniform function, a maximum likelihood criteria suggests using:
- a = Min(x, I) - Epsilon
- b = Max(x, I) + Epsilon
where Epsilon is a very small positive number. However, in this case, this criteria does not correspond to what intuition would consider a best fit. Some people use:
- a = Mean(x, I) - 3*SDeviation(x, I)
- b = Mean(x, I) + 3*SDeviation(x, I)
This does not guarantee that the fitted distribution contains all the points (i.e., the likelihood of the data could be zero). A different approach is to consider that it is highly improbable that the smallest point observed was exactly at the lower bound. Most likely the distance from the lower bound to the smallest point is about half the average distance between adjacent points. This leads to a re-estimation formula of:
- a = Min(x, I) - (Max(x, I)-Min(x, I)) / (2*Size(I))
- b = Max(x, I) + (Max(x, I) - Min(x, I))/(2*Size(I))
Triangular(a, b, c)
- a = (Mean(x, I)^2 - Mean(x, I)^3 - SDeviation(x, I)^2*Mean(x, I))/SDeviation(x, I)^2
- b = (Mean(x, I)*(1-Mean(x , I)^2 - SDeviation(x, I)^2*(1 - Mean(x, I)))/SDeviation(x, I)^2
See Beta for details.
General Parameter Fitting
When a known parameter estimation formula is not available, the Analytica Optimizer can, in general, be used to find the best fit parameters. The technique here consists of these steps:
- If the distribution has more than one parameter, create an index so that we can return the parameters in an array. For example, if you were fitting a Weibull distribution, you would create the index
Index Weibull_Param := ['shape', 'scale']
- Find or create a function to compute the probability (for a discrete dist) or probability density (for a continuous distribution) at a point x. For built-in distributions, these functions are found in the Distribution Densities.ana library. These functions usually start with Dens_... such as Dens_Weibull.
- Create a User-defined function to compute the Log-Likelihood for a data set for your distribution type. For example:
Function Weibull_LogLikelihood(params: array[Weibull_param]; x: Array[I]; I: index) := var k := params[Weibull_parameters = 'shape']; var c := params[Weibull_parameters = 'scale']; if k < 0 or c < 0 then -inf else sum(ln(Dens_Weibull(x, k, c)), I)
- Define the optimization to find the maximum likelihood parameters, using the Log Likelihood function as the objective to maximize, e.g.:
var params := 5; var nlp := NlpDefine(Weibull_parameters, x: params, lb: 0, obj: Weibull_LogLikelihood(params, x, I), maximize: true); LpSolution(nlp)
This is a very general purpose approach that can almost always be applied. In some cases, care must be taken to make sure that parameter combinations visited during the search process that might be out of reasonable range don't cause hard errors to occur.