# RegressionDist

## RegressionDist(Y, B, I, K, C, S)

RegressionDist is similar to Regression(Y, B, I, K), but instead of returning the linear regression coefficients as mid (point) values, it returns them as a probability distribution that reflects the uncertainty in the regression fit and the measurement noise. You can use the uncertain coefficients from RegressionDist to generate a predictive probability distribution on «Y» that reflects this uncertainty.

Suppose you have data where «Y» was produced as:

`Y = Sum(C*B, K) + Normal(0, S)`

«S» is the measurement noise. You have the data («B[I, K]» and «Y[I]»). You might or might not know the measurement noise «S». So you perform a linear regression to obtain an estimate of «C». Because your estimate is obtained from a finite amount of data, your estimate of «C» is itself uncertain. This function returns the coefficients «C» as a distribution (i.e., in Sample mode, it returns a sampling of coefficients indexed by Run and «K»), reflecting the uncertainty in the estimation of these parameters.
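
As a sketch of what such a coefficient distribution looks like, the standard least-squares result for the model above can be written as follows. This is textbook regression theory, assuming independent Gaussian noise with a known standard deviation «S». That RegressionDist uses exactly this form internally is an assumption, not something stated on this page. The symbols $A$ and $\hat{C}$ are introduced only for this illustration.

$$
A_{kk'} = \sum_{i} B_{ik} B_{ik'}, \qquad
\hat{C}_k = \sum_{k'} \left(A^{-1}\right)_{kk'} \sum_{i} B_{ik'} Y_i, \qquad
C \sim \mathrm{Normal}\!\left(\hat{C},\; S^2 A^{-1}\right)
$$

Here $i$ ranges over «I» and $k, k'$ over «K». The covariance $S^2 A^{-1}$ generally shrinks as data points are added, so more data gives a tighter distribution on the coefficients. The result requires $A$ to be invertible.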

## Library

Multivariate Distributions library functions (Multivariate Distributions.ana)

## Examples

If you know the noise level «S» in advance, then you can use historical data as a starting point for building a predictive model of «Y», as follows:

`{ Your model of the dependent variables: }`
`Variable Y := your historical dependent data, indexed by I`
`Variable B := your historical independent data, indexed by I, K`
`Variable X := { indexed by K. Maybe others. Possibly uncertain }`
`Variable S := { the known noise level }`
`Chance C := RegressionDist(Y, B, I, K)`
`Variable Predicted_Y := Sum(C*X, K) + Normal(0, S)`

If you don't know the noise level, then you need to estimate it. You need it for the Normal term of `Predicted_Y` anyway, and estimating it requires a regression. You can therefore pass the regression result into RegressionDist through its optional parameters. The last three lines above are replaced by:

`Variable E_C := Regression(Y, B, I, K)`
`Variable S := RegressionNoise(Y, B, I, K, E_C)`
`Chance C := RegressionDist(Y, B, I, K, E_C)`
`Variable Predicted_Y := Sum(C*X, K) + Normal(0, S)`

If you use RegressionNoise to compute the «S» parameter of RegressionDist, you should use `Mid(RegressionNoise(...))` for that parameter. However, when computing «S» for your prediction, don't use RegressionNoise in context. A better approach: if you don't know the measurement noise in advance, don't supply it as a parameter.
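
Once `Predicted_Y` is defined by either approach, its uncertainty can be summarized with Analytica's statistical functions. The following is a minimal sketch. The variable names `Y_mean`, `Y_sd`, `Y_lo`, and `Y_hi` are hypothetical, and the 5% and 95% levels are arbitrary choices for illustration.

`{ Hypothetical summary variables, not part of the original example: }`
`Variable Y_mean := Mean(Predicted_Y)`
`Variable Y_sd := SDeviation(Predicted_Y)`
`Variable Y_lo := GetFract(Predicted_Y, 0.05)`
`Variable Y_hi := GetFract(Predicted_Y, 0.95)`

`Y_lo` and `Y_hi` bound a 90% predictive interval. Its width reflects both the uncertainty in the coefficients «C» and the measurement noise «S».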

## Errors That Might Result

`Evaluation Error in C:`
`Array is not symmetric in System Function Decompose.`
`while evaluating function Gaussian.`
`Call stack:`
`Gaussian`
`RegressionDist`
`C`

Possible causes:

• One of your independent variables might be zero for every data point. As of Analytica 4.2, RegressionDist is not robust to this singularity. The singularity is inherently problematic: the mean coefficient value for that variable is undefined, and the variance of its coefficient is infinite.
Remedy: Eliminate independent variables that are everywhere zero from the basis before calling RegressionDist.
• Your data (most likely in the basis) contains NaN values.
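
The remedy for the first cause, and a check for the second, might be sketched as follows. This assumes «B» is indexed only by «I» and «K». The names `K2`, `B2`, and `NaN_count` are hypothetical and introduced only for this illustration.

`{ Hypothetical check: should be 0 before running the regression }`
`Variable NaN_count := Sum(Sum(IsNaN(B), I), K)`
`{ Keep only the basis elements that are nonzero for at least one data point }`
`Index K2 := Subset(Sum(Abs(B), I) > 0)`
`Variable B2 := B[K = K2]`
`Chance C := RegressionDist(Y, B2, I, K2)`

With this change, «C» is indexed by `K2` rather than «K», so any prediction needs the same slice, for example `Sum(C * X[K = K2], K2)`.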