Statistical functions


A statistical function, such as Mean, Median, or Variance, summarizes a sample of values by a single value. By default, these functions expect their parameter(s) to be a probabilistic value represented by a random sample of values over the Run index. But, you can also apply a statistical function over an array indexed by some other index by specifying that index as an optional parameter. You can also see common statistics of an uncertain variable, including Mean, Standard deviation, Min, and Max, by selecting the Statistics option from the uncertainty views in the Result window.

Statistical functions force prob mode evaluation

Unlike other functions, statistical functions force their main parameter(s) to be evaluated in prob mode (probabilistically), even if the evaluation context is mid mode, unless you provide an index parameter other than Run. Their result is not probabilistic. For example:

Chance X := Normal(0, 1)
Variable X90 := Getfract(X, 90%)
X90 → 1.259

Evaluating variable X90 causes variable X to be evaluated in prob mode, so that Getfract(X, 90%) can estimate the 90th percentile (0.9 fractile) of the distribution for X. X90 itself has only a mid value, and no probabilistic value.

The Mid(x) function is the exception among statistical functions: It always evaluates its parameter X in mid mode, whether its context (where the Mid(x) appears) is deterministic or probabilistic.

Statistics from non-probabilistic arrays

The default usage of statistical functions is over a probability distribution, represented as a random sample indexed by Run. You can also use these functions to compute statistics over an array with a different index by specifying that index explicitly. For example, you could fit a normal distribution to the mean and standard deviation of an array Data with index K:

Index K := 1..1000
Variable Data := Table(K)(123.4, 252.9, 221.4, ...)
Variable Xfitted := Normal(Mean(Data, K), Sdeviation(Data, K))

Xfitted is a normal distribution fitted to Data with the same mean and standard deviation.

Tip

All statistical functions produce estimates from the underlying random sample for each probabilistic quantity. These estimates are not exact, but vary from one evaluation to the next due to the variability inherent in random sampling. So, your results might not exactly match the results shown in the examples here. For greater precision, use a larger sample size. See Selecting the Sample Size.

Statistical Functions and Importance Weighting

By default, statistical functions assume equal weight for each sample or data value. But, you can also specify weights for each sample using optional parameter w, which should be indexed by Run or the i index. That provides a simple and powerful way to compute weighted samples. For details, see Statistical Functions and Importance Weighting.

Statistics and text-valued distributions

Most statistical functions require their parameters to be numerical. A few statistical functions, those that use ordinal (ordered) values, also work on discrete distributions with text values (whose domain is a list of text): Frequency (use Frequency(X, X)), Mid, Min, Max, ProbBands, and Sample. These functions assume the values are ordered as specified in the domain list of labels, e.g., Low, Mid, High.

Notation in statistical formulas

The formulas used to define statistics use this notation:

$ x_i $ The i-th sample value of x
$ \bar x $ The mean of x (see Mean(x))
$ \sigma $ Standard deviation of x (see Sdeviation(x))
m Sample size (or size of parameter i) (see Selecting the Sample Size)

Example model

The examples below use these variables:

Variable Alt_fuel_price := Normal(1.25, 0.1)
Variable Fuel_price := Normal(1.19, 0.1)
Variable Skfuel_price := Beta(4, 2, 1, 1.5)

Mean(x)

Returns an estimate of the mean (or expected value) of «x» if «x» is probabilistic. Otherwise, it returns «x». Mean(x) uses this formula:

$ \frac{1}{m} \sum_{i=1}^{m} x_i = \bar x $

Examples:

Mean(Fuel_price) → 1.19
Mean(Skfuel_price) → 1.33

Median(x)

Returns an estimate of the median of «x» from its sample if «x» is probabilistic. By definition, half the sample is smaller and half is larger than the median. When «x» is non-probabilistic, returns «x». Median(x) is equivalent to GetFract(x, 0.5).

Examples:

Median(Fuel_price) → 1.19

Sdeviation(x)

An estimate of the standard deviation of «x» from its sample if «x» is probabilistic. If «x» is non-probabilistic, it returns 0. The standard deviation is the square root of the variance. Sdeviation(x) uses this formula:

$ \sigma = \sqrt{\frac{1}{m - 1} \sum_{i=1}^{m} (x_i - \bar x)^2} $

Example:

Sdeviation(Fuel_price) → 0.10

Variance(x)

An estimate of the variance of «x» if «x» is probabilistic. If not, it returns 0. The variance is the square of the standard deviation. Variance() uses this formula:

$ \sigma^2 = \frac{1}{m - 1} \sum_{i=1}^{m} (x_i - \bar x)^2 $

Example:

Variance(Fuel_price) → 0.01
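
As an illustration outside Analytica, the Mean, Sdeviation, and Variance formulas above can be sketched in Python. This is only a sketch of the formulas as written, not Analytica's implementation. The function names, the seed, and the sample size of 1000 are assumptions chosen for the example, and the printed values will differ from the Analytica results because the random sample differs.

```python
import math
import random


def mean(xs):
    """Sample mean: (1/m) * sum of x_i."""
    return sum(xs) / len(xs)


def variance(xs):
    """Sample variance with the (m - 1) denominator used in the formula above."""
    m = len(xs)
    if m < 2:
        return 0.0  # a single value has no spread (assumption for this sketch)
    xbar = mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (m - 1)


def sdeviation(xs):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(xs))


# Hypothetical stand-in for a Monte Carlo sample of Normal(1.19, 0.1).
rng = random.Random(42)  # arbitrary seed, for repeatability only
sample = [rng.gauss(1.19, 0.1) for _ in range(1000)]

print(round(mean(sample), 2), round(sdeviation(sample), 2), round(variance(sample), 2))
```

Rerunning with a different seed or a smaller sample shows the sampling variability described in the Tip above.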

Skewness(x)

Returns an estimate of the skewness of «x». Skewness is a measure of the asymmetry of the distribution. A positively skewed distribution has a thicker upper tail than lower tail, while a negatively skewed distribution has a thicker lower tail than upper tail. A normal distribution has a skewness of zero.

Skewness() uses this formula:

$ \frac{1}{m} \sum_{i=1}^{m} [\frac {x_{i} - \bar x}{\sigma}]^3 $

Example:

Skewness(Skfuel_price) → -0.45

Kurtosis(x)

Returns an estimate of the kurtosis of «x».

Kurtosis is a measure of the peakedness of a distribution. A distribution with long thin tails has a positive kurtosis. A distribution with short tails and high shoulders, such as the uniform distribution, has a negative kurtosis. A normal distribution has zero kurtosis. A constant value (with no variation) has a kurtosis of -3.

Kurtosis(x) uses this formula:

$ (\frac{1}{m} \sum_{i=1}^{m} [\frac {x_{i} - \bar x}{\sigma}]^4) - 3 $

Example:

Kurtosis(Skfuel_price) → -0.48
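
The Skewness and Kurtosis formulas above can be sketched in Python as follows. The sketch follows the formulas as written: deviations are standardized using the standard deviation with the m - 1 denominator, and the third or fourth powers are averaged over m. The helper name `_standardized` is hypothetical, and this is not Analytica's implementation.

```python
import math


def _standardized(xs):
    """Return (x_i - mean) / sigma for each x_i, with sigma using the m - 1 denominator."""
    m = len(xs)
    xbar = sum(xs) / m
    sigma = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (m - 1))
    return [(x - xbar) / sigma for x in xs]


def skewness(xs):
    """Average of the cubed standardized deviations."""
    z = _standardized(xs)
    return sum(v ** 3 for v in z) / len(z)


def kurtosis(xs):
    """Average of the fourth powers of the standardized deviations, minus 3."""
    z = _standardized(xs)
    return sum(v ** 4 for v in z) / len(z) - 3


symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))  # symmetric data: no asymmetry
print(kurtosis(symmetric))  # evenly spread data: short tails, negative value
```

Note that the sketch divides by sigma, so it is undefined for a sample with no variation.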

Probability(b)

Returns an estimate of the probability that an uncertain Boolean value «b» is True. See also Probability().

Example:

Probability(Fuel_price < 1.19) → 0.5

GetFract(x, p, I)

Returns an estimate of the «p»th fractile (also known as quantile or percentile) of an uncertain quantity «x». This is the value such that «x» has a probability «p» of being less than that value. If you specify index «I», it returns the fractile of the sample of data over that index instead of the Run index. If «x» is constant over index «I» (for example, a non-probabilistic variable), all fractiles are equal to «x». See GetFract().

The value of «p» must be a number, or array of numbers, between 0 and 1, inclusive.

Examples:

Getfract(x, 0.5) returns an estimate of the median of x.
Getfract(Fuel_price, 0.5) → 1.19

This returns a table containing estimates of the 10th and 90th percentile values, that is, an 80% confidence interval.

Index Fract := [0.1, 0.9]
Getfract(Fuel_price, Fract) →
Fract ▶
0.10 0.90
1.06 1.32
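
To make the idea of estimating a fractile from a sample concrete, here is a minimal Python sketch. It assumes linear interpolation between sorted sample values, which is one common convention; the source does not say which interpolation rule GetFract uses, so the results may differ slightly from Analytica's. The function name `getfract` is borrowed only for readability.

```python
def getfract(xs, p):
    """Estimate the p-th fractile of xs by linear interpolation between order statistics.

    Assumption: this interpolation rule is chosen for illustration only.
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1, inclusive")
    s = sorted(xs)
    pos = p * (len(s) - 1)        # fractional position in the sorted sample
    lo = int(pos)                 # index of the value at or below that position
    hi = min(lo + 1, len(s) - 1)  # index of the next value, clamped at the end
    frac = pos - lo
    return s[lo] + frac * (s[hi] - s[lo])


data = [5, 1, 3, 2, 4]
print(getfract(data, 0.5))                        # the median of the data
print([getfract(data, p) for p in (0.1, 0.9)])    # an 80% interval, as in the example above
```

Passing a list of probabilities, as in the last line, mirrors the way an array-valued «p» yields one fractile per element.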

ProbBands(x)

Returns an estimate of probability or “confidence” bands for «x» if «x» is probabilistic. Otherwise returns «x» for every band. You can specify the probabilities in the Uncertainty Setup dialog, Probability Bands option. See also ProbBands().

Example:

Probbands(Fuel_price) →
Probability ▶
0.05 0.25 0.5 0.75 0.95
1.025 1.123 1.19 1.257 1.355

Covariance(x, y)

Returns an estimate of the covariance of uncertain variables «x» and «y». If «x» or «y» is non-probabilistic, it returns 0. The covariance is a measure of the degree to which «x» and «y» both tend to be in the upper (or lower) end of their ranges at the same time. Covariance() is defined as:

$ \frac{1}{m - 1} \sum_{i=1}^{m} (x_i - \bar x)(y_i - \bar y) $

Suppose you have an array x of uncertain quantities indexed by i:

Index i := 1 .. 5
Variable x := Array(i, […])

You can compute the covariance matrix of each element of x against every other element (over i), thus:

Index j := CopyIndex(i)
Covariance(x, x[i = j])

We create index j as a copy of index i and then create a copy of x that replaces i by j so that the covariance is computed for each slice of x over i against each slice over j. The result is the covariance matrix indexed by i and j. Each diagonal element contains the variance of the variable, since Variance(x) = Covariance(x, x). You can use this same method to generate a correlation matrix using the Correlation() or Rankcorrel() functions described below.
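
The covariance-matrix idea can be sketched in Python, where each "slice" of x is simply a list of sample values. The sketch assumes the m - 1 normalization, which is what makes Covariance(x, x) equal Variance(x) as stated above. The function names and the small data set are hypothetical.

```python
def covariance(xs, ys):
    """Sample covariance with the (m - 1) denominator (assumed, to match Variance)."""
    m = len(xs)
    xbar = sum(xs) / m
    ybar = sum(ys) / m
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (m - 1)


def covariance_matrix(slices):
    """Covariance of every slice against every other slice, like x against x[i = j]."""
    return [[covariance(a, b) for b in slices] for a in slices]


# Three hypothetical slices, each holding a sample of four values.
slices = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
for row in covariance_matrix(slices):
    print(row)
```

The result is symmetric, and each diagonal entry is the variance of the corresponding slice.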

Correlation(x, y)

An estimate of the correlation between the probabilistic expressions «x» and «y». -1 means perfect negative correlation; 0 means no correlation; and 1 means perfect positive correlation.

Correlation(x, y) is a measure of probabilistic dependency between uncertain variables, sometimes known as the Pearson product moment coefficient of correlation, r. It measures the strength of the linear relationship between «x» and «y», using the formula:

$ \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2 \times \sum_i (y_i - \bar y)^2}} $

With sampleSize set to 100 and number format set to two decimal digits:

Correlation(Alt_fuel_price + Fuel_price, Fuel_price) → 0.71

If two distributions are independent, their Correlation approaches 0 as the sample size approaches infinity. But, it may be significantly different from zero (positive or negative) for a small sample size, just due to random noise:

With sampleSize = 20:

Correlation(Normal(1.19, 0.1), Normal(1.19, 0.1)) → -0.28

With sampleSize = 1000:

Correlation(Normal(1.19, 0.1), Normal(1.19, 0.1)) → 0.03
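
The Pearson formula above, and the effect of sample size on the correlation of independent samples, can be illustrated with a short Python sketch. The seed and the helper name `independent_pair` are assumptions for the example, and the printed values will not match the Analytica results shown above, since the random samples differ.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation: sum of cross-deviations over the root of the product of squared deviations."""
    m = len(xs)
    xbar = sum(xs) / m
    ybar = sum(ys) / m
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def independent_pair(rng, n):
    """Two independent samples of size n from Normal(1.19, 0.1)."""
    a = [rng.gauss(1.19, 0.1) for _ in range(n)]
    b = [rng.gauss(1.19, 0.1) for _ in range(n)]
    return a, b


rng = random.Random(7)  # arbitrary seed
for n in (20, 1000):
    a, b = independent_pair(rng, n)
    print(n, round(correlation(a, b), 2))
```

With the small sample, the estimate can land noticeably away from zero; with the larger sample it tends to be much closer to zero.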

Rankcorrel(x, y)

Returns an estimate of the rank-order correlation coefficient between the distributions «x» and «y». «x» and «y» must be probabilistic.

Rankcorrel(x, y), a measure of the dependence between «x» and «y», is sometimes known as Spearman’s rank correlation coefficient, $ r_s $.

Rank-order correlation is measured by computing the ranks of the probability samples, and then computing their correlation. By using the rank order of the samples, the measure of correlation is not affected by skewed distributions or extreme values, and is, therefore, more robust than simple correlation. Rank-order correlation is used for importance analysis.

Example:

With sampleSize = 100:

Rankcorrel(Fuel_price, Alt_fuel_price) → 0.02
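
The two-step procedure described above (rank the samples, then correlate the ranks) can be sketched in Python. Tied values are given the average of their ranks here, which is a common convention but an assumption of this sketch; the source does not say how Rankcorrel handles ties. All function names are hypothetical.

```python
import math


def ranks(xs):
    """1-based ranks of xs; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda k: xs[k])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with xs[order[i]].
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rankcorrel(xs, ys):
    """Rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


x = [1, 2, 3, 4, 100]    # includes an extreme value
y = [1, 8, 27, 64, 125]  # increases with x, but not linearly
print(pearson(x, y), rankcorrel(x, y))
```

Because y increases whenever x does, the rank-order correlation is at its maximum even though the extreme value pulls the simple correlation below 1.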

Frequency(x, i)

If «x» is a discrete uncertain variable, containing numbers or text values, Frequency(x, i) returns an array indexed by «i», giving the frequency, or number of occurrences of discrete values «i». «i» must contain unique values, usually matching the values in «x»; if numeric, the values must be increasing.

If «x» is a continuous uncertain variable and «i» is an index of numbers in increasing order, it returns an array indexed by «i», with the count of values in the sample «x» that are equal to or less than each value of «i» and greater than the previous value of «i».

If «x» is non-probabilistic, Frequency(x, i) returns sampleSize for each value of «i» equal to «x».

Since Frequency() is computed by counting occurrences in the probabilistic sample, it is a function of sampleSize (see Uncertainty Setup dialog). If you want the relative frequency rather than the count of each value, divide the result by SampleSize.

Example (continuous):

Index Index_a := [1.2, 1.25]
Frequency(Fuel_price, Index_a) →
Index_a ▶
1.2 1.25
54 19

Example (discrete):

Index Bern_out := [0, 1]

(Possible outcomes of the Bernoulli Distribution.)

With sampleSize = 100:

Frequency(Bernoulli(0.3), Bern_out) →
Bern_out ▶
0 1
70 30

With sampleSize = 25:

Frequency(Bernoulli(0.3), Bern_out) →
Bern_out ▶
0 1
18 7

(Compare to the Bernoulli example and see Bernoulli().)
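
The two counting rules described for Frequency can be sketched in Python. The continuous case counts values that are less than or equal to each bin value and greater than the previous one; the discrete case counts exact matches. The function names are hypothetical. As an assumption of this sketch, values above the last bin value are not counted, which is one reading of the description above.

```python
from bisect import bisect_left


def frequency_continuous(sample, bins):
    """Count values v with bins[k-1] < v <= bins[k]; bins must be increasing.

    The first bin has no lower bound. Values above the last bin are ignored
    (an assumption of this sketch).
    """
    counts = [0] * len(bins)
    for v in sample:
        k = bisect_left(bins, v)  # index of the first bin value >= v
        if k < len(bins):
            counts[k] += 1
    return counts


def frequency_discrete(sample, values):
    """Count exact occurrences of each listed value in the sample."""
    return [sum(1 for v in sample if v == u) for u in values]


prices = [1.1, 1.19, 1.2, 1.22, 1.3]
print(frequency_continuous(prices, [1.2, 1.25]))

outcomes = [0, 1, 0, 0, 1]
counts = frequency_discrete(outcomes, [0, 1])
print(counts, [c / len(outcomes) for c in counts])  # counts, then relative frequencies
```

Dividing by the sample length in the last line corresponds to dividing the Frequency result by SampleSize to get relative frequencies.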

Mid(x)

Returns the mid value of «x». Unlike other statistical functions, Mid() forces deterministic evaluation in contexts where «x» would otherwise be evaluated probabilistically.

The mid value is calculated by substituting the median for most full probability distributions in the definition of a variable or expression, and using the mid value of any inputs. The mid value of a variable or expression is not necessarily equal to its true median, but is usually close to it.


Example:

Mid(Fuel_price) → 1.19

Sample(x)

Forces «x» to be evaluated probabilistically and returns a sample of values from the distribution of «x» in an array indexed by the system variable Run. If «x» is not probabilistic, it just returns its mid value. The system variable sampleSize specifies the size of this sample. You can set sampleSize in the Uncertainty Setup dialog.

When to use: Use when you want to force probabilistic evaluation.

Example:

Here are the first six values of a sample:

Sample(Fuel_price) →
Iteration(Run) ▶
1 2 3 4 5 6
1.191 1.32 1.19 1.164 1.191 0.962

Statistics(x)

Returns an array of statistics of «x». You can select the statistics to display in the Uncertainty Setup dialog, Statistics option.

Example:

Statistics(Fuel_price) →
Statistics ▶
Min Median Mean Max Std. Dev.
0.93 1.19 1.19 1.45 0.10

PDF(X) and CDF(X)

These functions generate histograms from a sample «X». They are similar to the methods used to generate the probability density function (PDF) and cumulative probability distribution function (CDF) as uncertainty views in a result window as graph or table. But, as functions, they return the resulting histogram as an array available for further processing, display, or export. For example:

PDF(X)
CDF(X)

These functions evaluate X in prob mode, and return an array of points on the density or cumulative distribution respectively.

You can also use PDF and CDF to generate histograms (direct or cumulative) of data that is not uncertain, but indexed by something other than Run. For example, to generate a histogram of Y over index J, specify the index explicitly:

PDF(Y, J)

If it decides that «X» is discrete rather than continuous, PDF generates a probability mass distribution and CDF generates a cumulative mass distribution, with a probability for each discrete value of «X». It uses the same method as the uncertainty views in results to decide if «X» is discrete — if it has text values, if it has many repeated numerical values, or if «X» has a domain attribute that is discrete (see The domain attribute and discrete variables). Alternatively, you can control the result by setting the optional parameter discrete as true or false. For example:

Variable X := Poisson(20)
PDF(X, Discrete: True)

This generates a discrete histogram over X. If X contains text values, i.e., categorical data, you might want to control the order of the categories, e.g., ["Low", "Medium", "High"]. You can do this by specifying the Domain attribute of X as a List of Labels with these values, or as an Index, referring to an Index using them. Alternatively, you can provide PDF or CDF with the optional Domain parameter provided as the list of labels. If X is an expression rather than a variable, this is your only choice.

PDF and CDF have one required parameter:

X The sample data points, indexed by «i».

In addition, PDF and CDF have these optional parameters:

Parameter Description
i The index over which they generate the histogram. By default this is Run (i.e., a Monte Carlo sample) but you can also specify another index to generate a histogram over another dimension.
w The sample weights. Can be used to weight each sample point differently. Defaults to system variable SampleWeighting.
discrete Set true or false to force discrete or continuous treatment. By default, it guesses, usually correctly.
spacingMethod Selects the histogramming method used. Otherwise it uses the system default set in the Uncertainty Setup dialog from the Result menu. Options are:

0 “equal-X”: Equal steps along the «X» axis (values of «X»).

1 “equal-sample-P”: Equal numbers of sample values in each step.

2 “equal-weighted-P”: Equal sum of weights of samples, weighted by «w».

samplesPerStep An integer specifying the number of samples per bin. Otherwise, it uses the default SampleSize set in the Uncertainty Setup dialog from the Result menu.
smoothingMethod Selects the method for estimating the continuous density, Pdf(..) only.

0 : Histogramming.

1 : Kernel Density Smoothing.

2 : Kernel Density Smoothing with given bandwidth.

smoothingFactor Pdf(..) function only. When «smoothingMethod» is 1 or 2, this determines the degree of smoothing. For smoothingMethod: 1, «smoothingFactor» should be a value between -1 and 1 indicating the desired smoothing relative to what Pdf(..) determines to be the optimal bandwidth. A value of 0 indicates that the optimal bandwidth should be used, negative values for more detail, positive values for more smoothing.

When smoothingMethod: 2 is used, Pdf(..) performs a Fast Gaussian Transform (FGT) using the positive value of «smoothingFactor» as the Gaussian bandwidth.

domain A list of numbers or labels, or the identifier of a variable whose Domain attribute should be used to specify the sequence of possible values for discrete distribution. If omitted, it uses the domain from the sample values.
