Regression

 Release: 4.6  •  5.0  •  5.1  •  5.2  •  5.3

Regression(y, b, I, K)

Given data points for a dependent variable «y» indexed by «I» and data for a "basis" (independent variables) «b» indexed by «I» and basis index «K», it returns coefficients C for a linear model:

Variable C := Regression(Y, B, I, K)
Variable Y_est := Sum(C*B, K)


where Y_est contains estimated values of Y.

Regression uses least-squares estimation, meaning that it minimizes the sum of squares of the residuals (estimation error) -- the difference between actual and estimated values:

Sum((y - Y_est)^2, I)


Parameters:

«y»
Values of the dependent variable, indexed by «I».
«b»
Values of the basis (independent variables), indexed by «I» and «K».
«I»
Index for the data points. Each element of «I» corresponds to a different data point.
«K»
(Optional) Basis index, or list of independent variables and, usually, a constant. This can be omitted when «b» is a scalar or has an implicit index (like a list) for the different basis terms.

See Regression analysis in Analytica User Guide for basic usage.
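As a sketch of what the least-squares computation does, the following plain-Python illustration (not Analytica syntax; all names are hypothetical) fits the one-variable model y = c1*x + c0 using the closed-form solution that minimizes the sum of squared residuals:

```python
# Illustration (not Analytica): least-squares fit of y = c1*x + c0 in plain
# Python, using the closed-form normal-equation solution for one variable.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # The slope minimizes the sum of squared residuals Sum((y - y_est)^2, I).
    c1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    c0 = my - c1 * mx
    return c1, c0

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]          # data generated exactly by y = 2*x + 1
c1, c0 = fit_line(xs, ys)  # recovers c1 = 2.0, c0 = 1.0
```

Because the example data lies exactly on a line, the residuals are all zero and the fit recovers the generating coefficients exactly.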

Basics

The form of a linear regression model is

$y = \sum_{k} c_k b_k(\bar x)$

where for any single data point, $\bar x$ is a vector of values, known as the independent values, $b_k(\bar x)$ are arbitrary functions of $\bar x$ known as the basis functions, and $y$ is the dependent value. Sometimes the basis functions are trivial functions, such as $b_0(x)=1$ and $b_1(x)=x$, for $k=0,1$, which leads to the one-variable linear model:

$y = c_1 x + c_0$

When fitting an nth-degree polynomial of one variable, x, the basis functions are $b_k(x)=x^k$ for $k=0..n$. Or when you have several variables, $x_1, x_2, ..., x_n$, you can fit a hyperplane using $b_0(\bar x)=1$ and $b_k(\bar x)=x_k$ for $k=1..n$. The hyperplane example could be encoded as

Index K := [0,1,2,3,4]
Variable B := Table(K)(1, x1, x2, x3, x4)
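As a plain-Python illustration (not Analytica; all names are hypothetical) of fitting with such a basis, the sketch below builds the polynomial basis $b_k(x)=x^k$ and solves the least-squares normal equations directly:

```python
# Illustration (not Analytica): building the polynomial basis b_k(x) = x^k
# and solving the least-squares normal equations with plain-Python algebra.
def solve(A, rhs):
    # Gauss-Jordan elimination with partial pivoting on the small system A c = rhs.
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def poly_fit(xs, ys, degree):
    basis = [[x ** k for k in range(degree + 1)] for x in xs]  # B[i][k]
    # Normal equations: (B^T B) c = B^T y
    A = [[sum(b[j] * b[k] for b in basis) for k in range(degree + 1)]
         for j in range(degree + 1)]
    rhs = [sum(b[j] * y for b, y in zip(basis, ys)) for j in range(degree + 1)]
    return solve(A, rhs)

coeffs = poly_fit([0, 1, 2, 3], [1, 2, 5, 10], 2)  # data from y = x^2 + 1
```

With data generated exactly by $y = x^2 + 1$, the fit recovers the coefficients [1, 0, 1]. Note that Analytica's Regression uses an SVD-based method rather than the normal equations shown here; the normal-equation form is only the simplest way to state the computation.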

Most regressions include a constant in the basis, e.g., $b_0(\bar x)=1$, although this is not absolutely required. The coefficient associated with the constant term is called the bias, offset or y-intercept, and the constant basis term is called the bias term. Without a constant basis term, the fitted model is forced to pass through the origin. This is appropriate when you know in advance that y must be directly proportional to all your basis terms. But more commonly, leaving out a constant basis term is a mistake.

If you don't include a constant term in your basis, Regression() can return the bias as a second return value, provided that your expression captures the second return value. For example, you can fit a single-variable $y = m*x+b$ model using simply

Local (m, b) := Regression( y, x, I )

The function knows whether the second return value is captured or not. When it isn't, the model must pass through the origin and the value for m will be different. When your independent variables x are already indexed by K and you don't want to map these to a new index having one additional element for the constant basis term, it may be more convenient to capture the bias separately, for example:

Local (c, bias) := Regression( y, x, I, K );
Sum( c * x, K ) + bias

In this case, c is indexed by K, and bias is scalar. When you want to use this pattern, but you want c and bias to be global variables, a pattern that works nicely is

Variable bias := ComputedBy(c)
Variable c :=
        Local a;        (a, bias) := Regression( y, x, I, K )

This captures the bias in the variable bias, but passes the first return value through as the result for c.

Details

Underconstrained Problems

When you do a regression fit, the number of data points, Size(I), should be greater than the number of basis terms, Size(K). When the number of data points is less than the number of basis terms, the problem is under-constrained. Provided that no two data points have the same basis values but different «Y» values, the fit curve in an under-constrained problem passes perfectly through all data points; however, the coefficients in that case are not unique. In the under-constrained case, Analytica issues a warning, since this most likely indicates that the «I» and «K» index parameters were inadvertently swapped. If you ignore the warning, embed the call within an IgnoreWarnings function call, or have the "Show Result Warnings" preference disabled, a set of coefficients that passes through the existing data points is arbitrarily chosen and returned. The algorithm used is computationally inefficient in the under-constrained case where Size(I) << Size(K), i.e., where the number of basis terms is much larger than the number of data points. If you know your problem is highly under-constrained, then you probably do not intend to use a regression.

Secondary Statistics

The Regression function computes the coefficients for the best-fit curve, but it does not compute secondary statistics such as parameter covariances, R-value correlation, or goodness-of-fit.

In what follows, we'll assume that Variable C is the computed regression coefficients, e.g.

Variable C := Regression(Y, B, I, K)

For each data point, the predicted expected value (from the regression) is given by Sum(C*B, K).

However, this prediction provides only the expected value. The RegressionDist function may be used to obtain a distribution over C, and hence a probabilistic estimate of «Y».

The R-squared value, also called the percentage of variance explained by the model, is given by:

Correlation(Y, Sum(C*B, K), I)^2

If your basis «B» might contain NaN or INF values, the corresponding coefficient in C will generally be zero. However, because 0*NaN and 0*INF are indeterminate, the expression Sum(C*B, K) will return NaN in those cases. To avoid this, use the following expression instead:

Sum(If C = 0 Then 0 Else C*B, K)
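A small plain-Python illustration (names hypothetical) of why the guard is needed: IEEE arithmetic defines 0*NaN as NaN, so the unguarded sum is contaminated even though the coefficient is zero:

```python
import math

# Illustration (not Analytica): why Sum(C*B, K) fails when a basis term is
# NaN even though its coefficient is zero, and how the guarded form avoids it.
C = [0.0, 2.0]
B = [math.nan, 3.0]
naive   = sum(c * b for c, b in zip(C, B))                     # nan: 0*nan is nan
guarded = sum(0.0 if c == 0 else c * b for c, b in zip(C, B))  # 6.0
```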

Several of the statistics below require the measurement noise, S, that is, the standard deviation of the error in each measured «Y» value. If you know the measurement noise in advance, then S is given, and may (optionally) be indexed by «I» if the measurement noise varies by data point. If you do not know S in advance, it can be obtained from the RegressionNoise function as RegressionNoise(Y, B, I, K, C)

Alternatively, S may be estimated as

Var Y2 := Sum(C*B, K);
Sqrt(Sum((Y - Y2)^2, I)/(Size(I) - Size(K)))

Estimating S in either of these ways assumes that the noise level is the same for each data point.
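As a plain-Python sketch (names and data hypothetical) of this residual-based estimate of S:

```python
import math

# Illustration (not Analytica): estimating the noise level S from residuals,
# S = Sqrt(Sum((Y - Y_est)^2, I) / (Size(I) - Size(K))).
ys    = [1.0, 3.2, 4.9, 7.1]   # observed Y values
y_est = [1.0, 3.0, 5.0, 7.0]   # predicted values, Sum(C*B, K), per data point
n, k = len(ys), 2              # 4 data points, 2 basis terms
S = math.sqrt(sum((y - e) ** 2 for y, e in zip(ys, y_est)) / (n - k))
```

Dividing by Size(I) - Size(K), rather than Size(I), corrects for the degrees of freedom consumed by the fitted coefficients.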

In a generalized linear regression, the goodness of fit, or merit, is often characterized using a Chi-squared statistic, computed as:

Sum((Y-Sum(C*B, K))^2/S^2, I)

The S used in this metric must be obtained or estimated separately from the data used to compute the Chi-squared statistic. The resulting metric is a single number that follows a Chi-squared distribution with Size(I) - Size(K) degrees of freedom, under the assumption that the underlying data was generated from a linear model of your basis with normally distributed noise having a standard deviation of S. S is allowed to vary with I, although in most cases it does not.
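A plain-Python sketch (names and data hypothetical) of this statistic, with S allowed to vary by data point:

```python
# Illustration (not Analytica): the chi-squared goodness-of-fit statistic
# Sum((Y - Sum(C*B, K))^2 / S^2, I), with a per-point noise level S.
ys    = [1.0, 3.2, 4.9, 7.1]   # observed Y values
y_est = [1.0, 3.0, 5.0, 7.0]   # predicted values per data point
Ss    = [0.1, 0.1, 0.1, 0.1]   # assumed known noise level for each point
chi2 = sum((y - e) ** 2 / s ** 2 for y, e, s in zip(ys, y_est, Ss))
```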

Denoting the above statistic as chi2, the probability that a fit this poor would occur by chance is given by:

GammaI((Size(I) - Size(K))/2, chi2/2)

This metric can be conveniently obtained using the RegressionFitProb function.

Another set of secondary statistics is the covariances of the fitted parameters. The covariance is an estimate of the amount of uncertainty in the parameter estimate given the available data. As the number of data points increases (for a given basis), the variances and covariances tend to decrease. To compute the covariances, a copy of Index «K» is required (since the covariance matrix is square in «K»); hence, you need to create a new index node defined as:

Index K2 := CopyIndex(K)

The co-variances are then computed as:

Invert(Sum(B * B[K = K2]/S^2, I), K, K2)

The diagonal elements of this matrix give the variance in each parameter. Since there is only a finite number of samples, the parameter estimate may be off a bit due to random chance, even if the linear model assumption is correct; this variance indicates how much error exists from random chance at the given data set size.
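For a concrete illustration, the following plain-Python sketch (names hypothetical, not Analytica) computes this covariance matrix for the two-term basis b = (1, x), where the 2x2 matrix inverse can be written out directly:

```python
# Illustration (not Analytica): the parameter covariance matrix
# Invert(Sum(B * B[K = K2] / S^2, I), K, K2), computed directly for the
# two-term basis b = (1, x) using the explicit 2x2 inverse.
xs = [0.0, 1.0, 2.0, 3.0]
S = 0.5                                   # assumed known noise level
basis = [[1.0, x] for x in xs]            # B[i][k]
# M[j][k] = Sum over data points i of B[i][j] * B[i][k] / S^2
M = [[sum(b[j] * b[k] for b in basis) / S ** 2 for k in range(2)]
     for j in range(2)]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
cov = [[ M[1][1] / det, -M[0][1] / det],
       [-M[1][0] / det,  M[0][0] / det]]  # cov[j][k]: parameter covariances
# cov[0][0] is the variance of the intercept, cov[1][1] that of the slope.
```

As a sanity check, cov[1][1] here matches the textbook slope variance $S^2 / \sum_i (x_i - \bar x)^2$.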

With S and CV_C (the covariance of parameters C), a distribution on the expected value of «Y» can be obtained for a given input X (indexed by «K»), using:

Variable Coef := Gaussian(C, CV_C, K, K2)
Variable Y_predicted := Sum(Coef * X, K) + Normal(0, S)

The RegressionDist function returns the uncertain coefficients directly, and is the more convenient function to use when you wish to estimate the uncertainty in your predicted value.

Numeric Limitations

The Regression function is highly robust to the presence of redundant basis terms. For example, if one of the basis functions is a linear combination of a subset of the other basis functions, the coefficients are not uniquely determined. In many other implementations of regression (e.g., in other products), this can lead to numeric instabilities, with very large coefficients and losses of precision from numeric round-off. Analytica uses an SVD-based method for Regression which is extremely robust to these effects, and guarantees good results even when basis terms are redundant.

Prior to build 4.0.0.59, the method that ensures this robustness worked only for basis values up to about 100M; basis values exceeding 100M had to be scaled prior to using Regression. Starting with build 4.0.0.59, Analytica automatically scales basis values, so that large values are handled robustly as well.

Weighted Regression

Weighted regression assigns a non-negative weight to each data point. The weight, «w», therefore is indexed by «I». You may assign a weight of zero to points that you don't want to contribute at all to the result.

One way to interpret these weights is as indicating the relative informativeness of each point -- or, inversely, the level of noise -- when that varies across the data points. Suppose each point comes from a linear model with noise distributed as Normal(0, s/w_i), where the mean is zero, s is an unknown global noise level, and w_i is the weight of the ith data point. Compare this to a non-weighted regression, where all points have the same amount of noise, according to Normal(0, s). A point with weight 0 has infinite standard deviation, and thus no usable information.

The coefficients for a weighted regression are given by

Regression(Y*w, B*w, I, K)

where «Y», «B», «I», and «K» are the customary parameters to regression, and «w» is the relative weighting which is indexed by «I».

Sometimes the raw data may include multiple observations with the same values, and the number of observations itself is included in the data. In this case, if n (indexed by «I») represents the number of observations for each element of «Y» and «B», the weighting w should be Sqrt(n).
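A plain-Python sketch (names hypothetical, not Analytica) of the weighting in the simplest case, a constant-only basis, where the weighted least-squares solution reduces to a mean weighted by w^2:

```python
# Illustration (not Analytica): the effect of Regression(Y*w, B*w, I, K) for
# a constant-only basis b = 1. Minimizing Sum((w*Y - c*w)^2, I) over c gives
# c = Sum(w^2 * Y) / Sum(w^2), and points with w = 0 drop out entirely.
ys = [1.0, 2.0, 100.0]
ws = [1.0, 1.0, 0.0]      # the third point carries no information
c = sum(w * w * y for w, y in zip(ws, ys)) / sum(w * w for w in ws)
```

The outlier with weight zero contributes nothing, so the fitted constant is the mean of the remaining points, 1.5.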

Using weights of 0 and 1 makes it possible to ignore certain points. However, to ignore points where a basis term or the «Y» value might be NaN, you need to test «w» explicitly:

Local Y2 := If w = 0 Then 0 Else Y*w;
Local B2 := If w = 0 Then 0 Else B*w;
Regression(Y2, B2, I, K)

Plotting Regression lines Compared to Data

To overlay the regressed curves on the original data, a good method is to continue using a scatter plot for both data and curves. Create a variable as a list of identifiers -- the first identifier being the original «Y»-value data, the second identifier being the fitted «Y» value (i.e., Sum(C*B, K)). Then plot the result as an XY plot, using your X value variable as an XY Comparison Source. The model RegressionCompare.ana demonstrates this.

Comparison of Alternative Bases

Adding more basis terms to your regression model will improve the fit on your training data, but you may reach a point where the extra terms decrease the quality of predictions on new data. This phenomenon is referred to as over-fitting. As a general rule, the more data points you have, the more basis terms you can include before over-fitting becomes a problem.

One approach that is often employed to evaluate whether the improvement justifies the additional basis terms is to use an F-test. The output of this test is a p-value, which gives the probability that the improvement obtained is just due to sample noise, under the assumption that the extra basis terms do not actually contribute any additional information. A small p-value means that there is statistically significant support that the extra basis terms improve the goodness of fit. One common standard is to accept the model with more basis terms when the p-value is less than 5%.

To compute the F-test p-value, we'll use these variables:

• Basis1 indexed by K1 and «I» : The smaller basis (simpler model)
• Basis2 indexed by K2 and «I»: The larger basis (more complex model)
• «Y» indexed by «I» : The observed values for the dependent variable

Then the regression coefficients for each model are:

Variable c1 := Regression(Y, Basis1, I, K1)
Variable c2 := Regression(Y, Basis2, I, K2)

The forecasted values from each model are:

Variable y1 := Sum(c1*Basis1, K1)
Variable y2 := Sum(c2*Basis2, K2)

And the sum-of-square residuals are:

Variable Rss1 := Sum((y1 - y)^2, I)
Variable Rss2 := Sum((y2 - y)^2, I)

The F-statistic is given by:

Variable Fstat := (Rss1 - Rss2)/Rss2 * (IndexLength(I)-IndexLength(K2))/(IndexLength(K2) - IndexLength(K1))

And the p-value is

Variable pValue := 1 - CumFDist(Fstat, IndexLength(K2) - IndexLength(K1), IndexLength(I) - IndexLength(K2))
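A plain-Python sketch (names and numbers hypothetical, not Analytica) of the F-statistic computation; converting it to a p-value requires the F distribution's CDF, which Analytica provides as CumFDist:

```python
# Illustration (not Analytica): the F-statistic comparing a smaller basis
# (p1 terms) against a larger one (p2 terms), fit over n data points.
def f_statistic(rss1, rss2, n, p1, p2):
    # (relative drop in residual sum of squares) scaled by the ratio of the
    # larger model's residual degrees of freedom to the extra terms added.
    return (rss1 - rss2) / rss2 * (n - p2) / (p2 - p1)

# Hypothetical residual sums of squares for the two models:
Fstat = f_statistic(rss1=12.0, rss2=4.0, n=20, p1=2, p2=4)
```

A large F-statistic (here 16.0) indicates that the extra basis terms reduced the residuals by far more than chance alone would predict, which corresponds to a small p-value.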

Nulls

When a Null appears in «Y», the corresponding point is ignored.

When a Null appears in «B», the corresponding basis term along «K» is not used. The coefficient for that basis term in the final result is Null.

Starting in Analytica 5.0, if every term in «B» for a given point is Null, then the point is not used, and this does not cause any basis terms to be removed from consideration. However, if any non-null appears for that point in «B», then the terms where the Null values appear are removed from the basis.