
# Glossary


### A

Alpha level

The conditional probability of a Type I error in a hypothesis test, given that the null hypothesis is true.

Alternative hypothesis

The opposite of the null hypothesis. Often, the alternative hypothesis represents a new theory the scientist would like to prove. The theory's scientific status becomes stronger if experiments repeatedly show that the null hypothesis is untenable.

Analysis of Variance

A technique that tests for differences between the means of two or more groups by comparing the variation between the groups with the variation within the groups.

### B

Beta Distribution

Used for random variables that take values between 0 and 1. The beta distribution is often used to model the distribution of order statistics and to model events that are defined by minimum and maximum values. It is also used in Bayesian statistics.

The beta distribution with shape parameters a (alpha) and b (beta) has the probability density function:

$$f(x) = \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}, \quad 0 \le x \le 1,$$

where $B(a,b) = \Gamma(a)\,\Gamma(b)/\Gamma(a+b)$ is the beta function.

Between effects

In a repeated measures ANOVA, there will be at least one factor that is measured at each level for every subject. This is a within (repeated measures) factor. For example, in an experiment in which each subject performs the same task twice, trial (or trial number) is a within factor. There may also be one or more factors that are measured at only one level for each subject, such as gender. This type of factor is a between or grouping factor.

Bias

An estimator for a parameter is unbiased if its expected value is the true value of the parameter. Otherwise, the estimator is biased.

Binomial Distribution

Used to describe a process where each outcome can be labeled as an event or a nonevent; for example, an item passes or fails inspection, or a political party wins or loses. Often used in quality control, public opinion surveys, medical research, and insurance.

The binomial distribution with parameters n (number of trials) and p (event probability) has probability mass function:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$

Boxplot

A boxplot is a graph summarizing the distribution of a set of data values. The upper and lower ends of the center box indicate the 75th and 25th percentiles of the data, the center line in the box indicates the median, and the center + indicates the mean. Suspected outliers appear in a boxplot as individual points, marked o or x, outside the box. The o values are known as outside values, and the x values as far outside values.

If the difference (distance) between the 75th and 25th percentiles of the data is H, then the outside values are those values that are more than 1.5H but no more than 3H above the upper quartile, and those values that are more than 1.5H but no more than 3H below the lower quartile. The far outside values are values that are more than 3H above the upper quartile or more than 3H below the lower quartile.
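The outside and far outside rules can be sketched in a few lines of Python. This is only an illustration: the function name `classify_outliers` and the sample data are hypothetical, the quartiles use the standard library's "inclusive" method (packages differ on quartile conventions), and a value exactly 3H beyond a quartile is treated as an outside value.

```python
import statistics

def classify_outliers(values):
    """Split values into outside and far outside values using the 1.5H and 3H rules."""
    # Quartile conventions vary between packages; "inclusive" is an assumption here.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    h = q3 - q1  # H: distance between the 75th and 25th percentiles
    outside, far_outside = [], []
    for v in values:
        # Distance beyond the nearer quartile (0 if the value lies inside the box).
        distance = max(v - q3, q1 - v, 0)
        if distance > 3 * h:
            far_outside.append(v)
        elif distance > 1.5 * h:
            outside.append(v)
    return outside, far_outside

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30]
outside, far_outside = classify_outliers(data)
print(outside, far_outside)
```

For this data the quartiles are 3.5 and 8.5, so H = 5; the value 17 lies between 1.5H and 3H above the upper quartile, and 30 lies more than 3H above it.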

### C

Categorical data

Data that is divided into categories, or distinct groups (in contrast to continuous data that may fall on any point given a common scale).

Examples:

Gender of babies born at the community hospital in July

Grade level of children in a school district

Year of birth of athletes on a soccer team

Type of paint flaws on appliances: peel, scratch, smudge, other

Ratings of automobile handling characteristics on a poor, fair, good, excellent scale

Category Axis

An axis that displays values individually, without necessarily arranging them to scale. (A scale axis, in contrast, displays numerical values to scale.) Bar charts, line charts, and area charts usually have one category axis and at least one scale axis. Scatterplots and histograms do not have a category axis.

Cell

In a contingency table, a cell is an individual combination of possible levels (values) of the factors. For example, if there are two factors, gender with values male and female and risk with values low, medium, and high, then there are 6 cells: males with low risk, males with medium risk, males with high risk, females with low risk, females with medium risk, and females with high risk.

Central tendency

The generalized concept of the "average" value of a distribution. Typical measures of central tendency are the mean, the median, the mode, and the geometric mean.

Centroid

The centroid of a set of multi-dimensional data points is the data point that is the mean of the values in each dimension. For X-Y data, the centroid is the point at (mean of the X values, mean of the Y values). A simple linear regression line always passes through the centroid of the X-Y data.

Chi-Square Distribution

A common distribution used in tests of statistical significance to:

Test how well a sample fits a theoretical distribution.

Test the independence between categorical variables.

The chi-square distribution with v degrees of freedom has the probability density function:

$$f(x) = \frac{x^{v/2 - 1} e^{-x/2}}{2^{v/2}\,\Gamma(v/2)}, \quad x > 0$$

Chi-square test for independence

Pearson's chi-square test for independence for a contingency table tests the null hypothesis that the row classification factor and the column classification factor are independent. The chi-square test for independence compares observed and expected frequencies (counts). The expected frequencies are calculated by assuming the null hypothesis is true.

The chi-square test statistic is the sum, over all cells, of the squared difference between the observed and expected frequencies divided by the corresponding expected frequency. Note that the chi-square statistic is always calculated using the counted frequencies. It cannot be calculated using the observed proportions, unless the total number of subjects (and thus the frequencies) is also known.
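As a minimal sketch of this calculation, the following Python function computes the expected frequencies from the row and column totals and then sums the scaled squared differences. The function name `chi_square_independence` and the example counts are hypothetical.

```python
def chi_square_independence(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the row and column factors are independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

table = [[10, 20], [20, 10]]  # hypothetical 2x2 counts
statistic, dof = chi_square_independence(table)
print(statistic, dof)
```

Every expected count in this table is 15, so each cell contributes 25/15 and the statistic is 20/3 with 1 degree of freedom.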

Confidence interval

A random interval that has a known probability (the "confidence coefficient" or "confidence level") of including the true value of a parameter. Defines an interval within which the true population parameter is likely to lie. It can be thought of as a measure of the precision of a sample statistic.

Contingency Table

If individual values are cross-classified by levels in two different attributes (factors), such as gender and tumor vs no tumor, then a contingency table is the tabulated counts for each combination of levels of the two factors, with the levels of one factor labeling the rows of the table, and the levels of the other factor labeling the columns of the table. For the factors gender and presence of tumor, each with two levels, we would get a 2x2 contingency table, with rows Male and Female, and columns Tumor and No Tumor.

The counts for each cell in the table would be the number of subjects with the corresponding row level of gender and column level of tumor vs no tumor: males with tumors in row 1, column 1; males without tumors in row 1, column 2; females with tumors in row 2, column 1; and females without tumors in row 2, column 2. Contingency tables are also known as cross-tabulations. The most common method of analyzing such tables statistically is to perform a (Pearson) chi-square test for independence.

Cook's distance

Distance between the coefficients calculated with and without the ith observation.

This calculation is

$$D_i = \frac{e_i^2}{p\,s^2} \cdot \frac{h_i}{(1 - h_i)^2}$$

where $e_i$ = residual, $s^2$ = MS Error, $p$ = number of predictors + 1, and $h_i$ = leverage.

Cophenetic Distances

The cophenetic distance between two objects is defined to be the intergroup distance when the objects are first combined into a single cluster in the linkage tree. The correlation between the original distances and the cophenetic distances is sometimes taken as a measure of the appropriateness of a cluster analysis relative to the original data.

Correlation

Correlation is the linear association between two random variables X and Y. It is usually measured by a correlation coefficient, such as Pearson's r, such that the value of the coefficient ranges from -1 to 1.

A positive value of r means that the association is positive; i.e., that if X increases, the value of Y tends to increase linearly, and if X decreases, the value of Y tends to decrease linearly.

A negative value of r means that the association is negative; i.e., that if X increases, the value of Y tends to decrease linearly, and if X decreases, the value of Y tends to increase linearly.

The larger r is in absolute value, the stronger the linear association between X and Y. If r is 0, X and Y are said to be uncorrelated, with no linear association between X and Y. Independent variables are always uncorrelated, but uncorrelated variables need not be independent.
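The following sketch computes Pearson's r directly from its definition and illustrates the last point: a variable that is completely determined by X can still be uncorrelated with it. The function name `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
r_linear = pearson_r(x, [2 * v + 1 for v in x])  # Y is an increasing linear function of X
r_quadratic = pearson_r(x, [v * v for v in x])   # Y depends on X, but not linearly
print(r_linear, r_quadratic)
```

Here `r_linear` is 1 and `r_quadratic` is 0, even though Y = X² is not independent of X.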

### D

Degrees of freedom

A value based on sample size and number of variables in the model. Degrees of freedom of a statistical test are used in the determination of the p-value.

DFITS

A measure of the influence of each observation on the fitted value. Represents roughly the number of standard deviations that the fitted value changes when each case is removed from the data set. Observations with absolute values greater than 2*sqrt(p / n) are considered large and should be examined, where p is the number of predictors (including the constant) and n is the number of observations.

This calculation is:

$$\mathrm{DFITS}_i = e_i \sqrt{\frac{n - p - 1}{\mathrm{SSE}\,(1 - h_i) - e_i^2}} \; \sqrt{\frac{h_i}{1 - h_i}}$$

where $n$ = number of observations, $p$ = number of coefficients, SSE = error sum of squares, $e_i$ = residual, and $h_i$ = leverage.

Distribution

A distribution function (also known as the probability distribution function) of a continuous random variable X is a mathematical relation that gives for each number x, the probability that the value of X is less than or equal to x. For example, a distribution function of height gives, for each possible value of height, the probability that the height is less than or equal to that value.

For discrete random variables, the distribution function is often given as the probability associated with each possible discrete value of the random variable; for instance, the distribution function for a fair coin is that the probability of heads is 0.5 and the probability of tails is 0.5.

### E

Expected Frequencies

For nominal (categorical) data in which the count of items in each category has been tabulated, the observed frequency is the actual count, and the expected frequency is the count predicted by the theoretical distribution underlying the data. For example, if the hypothesis is that a certain plant has yellow flowers 3/4 of the time and white flowers 1/4 of the time, then for 100 plants, the expected frequencies will be 75 for yellow and 25 for white. The observed frequencies will be the actual counts for 100 plants (say, 73 and 27).
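The flower example can be reproduced as a short Python sketch. The variable names are hypothetical; the last line also forms the Pearson chi-square statistic from the observed and expected counts, as a hint of how the two are typically compared.

```python
hypothesized = {"yellow": 0.75, "white": 0.25}  # proportions under the hypothesis
observed = {"yellow": 73, "white": 27}          # actual counts for 100 plants

n = sum(observed.values())
# Expected frequency = hypothesized proportion times the total count.
expected = {color: p * n for color, p in hypothesized.items()}

# Sum of (observed - expected)^2 / expected over the categories.
statistic = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
print(expected, statistic)
```

The expected frequencies come out as 75 and 25, and the statistic is 4/75 + 4/25 = 16/75.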

Exponential Distribution

Most often used to model the behavior of units that have a constant failure rate. The exponential distribution has a wide range of applications in analyzing the reliability and availability of electronic systems, queuing theory, and Markov chains.

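The exponential distribution with rate parameter λ has the probability density function:

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$

Some references parameterize the distribution by the scale 1/λ instead.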

### F

F Distribution

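Used in hypothesis testing to determine whether two population variances are equal. The F distribution is the distribution of the ratio of two independent chi-square random variables, each divided by its degrees of freedom.

The F distribution with degrees of freedom u and v has the probability density function:

$$f(x) = \frac{\Gamma\!\left(\frac{u+v}{2}\right)}{\Gamma\!\left(\frac{u}{2}\right)\Gamma\!\left(\frac{v}{2}\right)} \left(\frac{u}{v}\right)^{u/2} x^{u/2 - 1} \left(1 + \frac{u}{v}x\right)^{-(u+v)/2}, \quad x > 0$$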

Factors

A factor is a single discrete classification scheme for data, such that each item classified belongs to exactly one class (level) for that classification scheme. For example, in a drug experiment involving rats, sex (with levels male and female) or drug received could be factors. A one-way analysis of variance involves a single factor classifying the subjects (e.g., drug received).

### G

Gamma Distribution

Often used to model positively skewed data when random variables are greater than 0. For example, the gamma distribution can describe the time for an electrical component to fail. Most electrical components of a given type will fail around the same time, but a few will take a long time to fail. The gamma distribution is commonly used in reliability and survival studies.

The gamma distribution with shape parameter a (alpha) and scale parameter b (beta) has the probability density function:

$$f(x) = \frac{x^{a-1} e^{-x/b}}{b^{a}\,\Gamma(a)}, \quad x > 0$$

Geometric Mean

For n positive values $x_1, \ldots, x_n$, computed as

$$\left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

Goodness of fit

Goodness-of-fit tests test the conformity of the observed data's empirical distribution function with a posited theoretical distribution function. The Kolmogorov-Smirnov test does this by calculating the maximum vertical distance between the empirical and posited distribution functions.
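A minimal sketch of the Kolmogorov-Smirnov distance, assuming a fully specified posited distribution (here a standard normal, chosen only for illustration). The function name `ks_statistic` and the sample are hypothetical; the sketch computes the statistic only, not its p-value.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Maximum vertical distance between the empirical CDF and a posited CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
d = ks_statistic(sample, NormalDist(0, 1).cdf)
print(d)
```

The distance has to be checked on both sides of each jump of the empirical distribution function, because the largest gap can occur just before or just after a data value.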

### H

Harmonic Mean

For n positive values $x_1, \ldots, x_n$, computed as

$$\frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
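For a quick check of the geometric and harmonic means, Python's standard `statistics` module provides both (Python 3.8 or later is assumed for `geometric_mean`). The data values are hypothetical.

```python
import statistics

values = [1, 2, 4]

geometric = statistics.geometric_mean(values)  # cube root of 1 * 2 * 4
harmonic = statistics.harmonic_mean(values)    # 3 / (1/1 + 1/2 + 1/4)
arithmetic = statistics.mean(values)           # (1 + 2 + 4) / 3

print(harmonic, geometric, arithmetic)
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean; here the three values are 12/7, 2, and 7/3 (up to floating-point rounding).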

Heavy-tailed

A heavy-tailed distribution is one in which the extreme portion of the distribution (the part farthest away from the median) spreads out further relative to the width of the center (middle 50%) of the distribution than is the case for the normal distribution. For a symmetric heavy-tailed distribution like the Cauchy distribution, the probability of observing a value far from the median in either direction is greater than it would be for the normal distribution.

Histogram

A histogram is a graph of grouped (binned) data in which the number of values in each bin is represented by the area of a rectangular box.

Hypothesis test

The acceptance or rejection of an assertion (the null hypothesis) about one or more parameters according to the assertion's compatibility with the data.

### I

Independent

Two random variables are independent if their joint probability density is the product of their individual (marginal) probability densities. Less technically, if two random variables A and B are independent, then the probability of any given value of A is unchanged by knowledge of the value of B. A sample of mutually independent random variables is an independent sample.

Intercept

The constant in a regression equation; the point where a regression line intercepts the vertical axis, if the horizontal axis has a true zero origin.

### K

Kurtosis

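Kurtosis is a measure of the heaviness of the tails in a distribution, relative to the normal distribution. A distribution with negative kurtosis (such as the uniform distribution) is light-tailed relative to the normal distribution, while a distribution with positive kurtosis (such as Student's t distribution with more than 4 degrees of freedom) is heavy-tailed relative to the normal distribution. The population kurtosis is usually defined as

$$\gamma_2 = \frac{E\left[(X - \mu)^4\right]}{\sigma^4} - 3$$

where µ and σ are the population mean and standard deviation. The subtraction of 3 makes the kurtosis of the normal distribution equal to 0.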

### L

Levels

When factors are used to classify subjects, each subject is assigned to one class value; e.g., male or female for the factor sex or the specific treatment given for the factor treatment. These individual class values within a factor are called levels. Each subject is assigned to exactly one level for each factor. Each unique combination of levels for each factor is a cell.

Leverage

Identifies observations with unusual or outlying x-values. Observations with large leverage may exert considerable influence on the fitted value and the model. Leverage values fall between 0 and 1. A leverage value greater than 2p/n or 3p/n, where p is the number of predictors or factors plus the constant and n is the number of observations, is commonly considered large, and the corresponding observation should be examined.

Leverages are obtained from the hat matrix (H), which is an n x n projection matrix specified as:

$$H = X (X'X)^{-1} X'$$

where X is the matrix of x-values.

The leverage of the ith observation is the ith diagonal element, $h_i$, of H.
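As a sketch for the special case of simple linear regression (a constant plus one predictor, so p = 2), the diagonal of the hat matrix reduces to the closed form $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which avoids any matrix inversion. The function name and the data below are hypothetical.

```python
def leverages_simple_regression(xs):
    """Leverages for a model with an intercept and a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # Diagonal of X (X'X)^-1 X' when X has a column of ones and a column of x-values.
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

x = [1, 2, 3, 4, 10]
h = leverages_simple_regression(x)
p = 2  # one predictor plus the constant
flagged = [xi for xi, hi in zip(x, h) if hi > 2 * p / len(x)]
print(h, flagged)
```

The leverages sum to p, and only the x-value 10, which lies far from the others, exceeds the 2p/n cutoff of 0.8.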

Light-tailed

A light-tailed distribution is one in which the extreme portion of the distribution (the part farthest away from the median) spreads out less far relative to the width of the center (middle 50%) of the distribution than is the case for the normal distribution. For a symmetric light-tailed distribution like the uniform distribution, the probability of observing a value far from the median in either direction is smaller than it would be for the normal distribution.

Logistic Distribution

Used as a growth curve and to model binary responses. Used in the fields of biostatistics and economics. The logistic distribution is described by its scale and location parameters. The logistic distribution has no shape parameter, which means that the probability density function has only one shape.

The logistic distribution with location µ and scale s has probability density function:

$$f(x) = \frac{e^{-(x-\mu)/s}}{s\left(1 + e^{-(x-\mu)/s}\right)^{2}}$$

Lognormal Distribution

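Used when random variables are greater than 0. Used for reliability analysis and in financial applications, such as modeling stock behavior.

The lognormal distribution with parameters µ and σ (the mean and standard deviation of ln X) has probability density function:

$$f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x > 0$$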

Lower One-sided

A hypothesis test in which only large deviations in the left (lower) direction from the null hypothesis are considered significant. See also Two-sided.

### M

Maximum likelihood

The method of maximum likelihood is a general method of finding estimated (fitted) values of parameters. Estimates are found such that the joint likelihood function, which for independent observations is the product of the values of the probability density (or mass) function at each observed data value, is as large as possible. The estimation process involves considering the observed data values as constants and the parameter to be estimated as a variable, and then finding the value of the parameter that maximizes the likelihood function, typically by differentiation.
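As an illustration, consider fitting an exponential distribution with rate λ (density λe^(-λx)) to independent observations. The log-likelihood is n ln λ - λ Σx, and setting its derivative to zero gives the estimate n / Σx. The sketch below compares that closed form with a crude grid search; the function name, the data, and the grid are hypothetical.

```python
import math

def exponential_log_likelihood(rate, data):
    """Log of the product of the densities rate * exp(-rate * x) over the data."""
    return len(data) * math.log(rate) - rate * sum(data)

data = [0.5, 1.0, 1.5, 3.0]

# Closed form from setting the derivative of the log-likelihood to zero.
rate_hat = len(data) / sum(data)

# Numerical check: evaluate the log-likelihood on a grid of candidate rates.
grid = [0.01 * k for k in range(1, 500)]
best_on_grid = max(grid, key=lambda r: exponential_log_likelihood(r, data))
print(rate_hat, best_on_grid)
```

The grid maximum lands on the grid point nearest to the closed-form estimate of 2/3, as expected for a log-likelihood with a single peak.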

Mean Deviation

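The mean deviation is the mean of the absolute deviations about the mean. For values $x_1, \ldots, x_n$ with mean $\bar{x}$, the mean deviation is defined by

$$\frac{1}{n}\sum_{i=1}^{n} \left|x_i - \bar{x}\right|$$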

Median

The median of a distribution is the value X such that the probability of an observation from the distribution being below X is the same as the probability of the observation being above X. For a continuous distribution, this is the same as the value X such that the probability of an observation being less than or equal to X is 0.5.

### N

Negative Binomial Distribution

When performing an experiment with only two outcomes, this discrete distribution can model the number of trials necessary to produce a specified number of a certain outcome. It can also model the number of nonevents that occur before you observe the specified number of outcomes.

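The negative binomial distribution with parameters r and p, taken here as the distribution of the number of trials x needed to observe r events that each occur with probability p, has probability mass function:

$$P(X = x) = \binom{x-1}{r-1} p^{r} (1-p)^{x-r}, \quad x = r, r+1, \ldots$$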

Nonparametric tests

Nonparametric tests are tests that do not make distributional assumptions, particularly the usual distributional assumptions of the normal-theory based tests. These include tests that do not involve population parameters at all.

Normal

The normal or Gaussian distribution is a continuous symmetric distribution that follows the familiar bell-shaped curve. The distribution is uniquely determined by its mean and variance. It has been noted empirically that many measurement variables have distributions that are at least approximately normal. Even when a distribution is nonnormal, the distribution of the mean of many independent observations from the same distribution becomes arbitrarily close to a normal distribution as the number of observations grows large. Many frequently used statistical tests make the assumption that the data come from a normal distribution.

Normal Distribution

The normal distribution is the most common statistical distribution because approximate normality arises naturally in many physical, biological, and social measurement situations. Many statistical analyses require that the data come from normally distributed populations.

The normal distribution with mean µ and standard deviation σ has probability density function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Null hypothesis

The null hypothesis for a statistical test is the assumption that the test uses for calculating the probability of observing a result at least as extreme as the one that occurs in the data at hand. For the two-sample unpaired t test, the null hypothesis is that the two population means are equal, and the t test involves finding the probability of observing a t statistic at least as extreme as the one calculated from the data, assuming the null hypothesis is true.

### O

One-sided

A hypothesis test in which only large deviations in the left (Lower One-sided) or the right (Upper One-sided) direction from the null hypothesis are considered significant. See also Two-sided.

Outliers

Outliers are anomalous values in the data. They may be due to recording errors, which may be correctable, or they may be due to the sample not being entirely from the same population. Apparent outliers may also be due to the values being from the same, but nonnormal (in particular, heavy-tailed), population distribution.

### P

Pareto Distribution

The Pareto distribution with shape parameter α, taking the scale (minimum value) as 1, has the probability density function:

$$f(x) = \frac{\alpha}{x^{\alpha+1}}, \quad x \ge 1$$

Percentile

Percentiles, including quantiles, quartiles, and the median, are useful for a detailed study of a distribution. For a set of measurements arranged in order of magnitude, the pth percentile is the value that has p percent of the measurements below it and (100-p) percent above it. The median is the 50th percentile.

Poisson Distribution

Describes the number of times an event occurs in a finite observation space. The Poisson distribution is often used in quality control, reliability/survival studies, and insurance. The Poisson distribution is defined by one parameter: µ. This parameter equals the mean and variance. As µ increases, the Poisson distribution approaches a normal distribution.

The Poisson distribution with parameter µ has probability mass function:

$$P(X = k) = \frac{e^{-\mu}\mu^{k}}{k!}, \quad k = 0, 1, 2, \ldots$$

Population

The population is the universe of all the objects from which a sample could be drawn for an experiment. If a representative random sample is chosen, the results of the experiment should be generalizable to the population from which the sample was drawn, but not necessarily to a larger population. For example, the results of medical studies on males may not be generalizable for females.

P-value

In a statistical hypothesis test, the P-value is the probability of observing a test statistic at least as extreme as the value actually observed, assuming that the null hypothesis is true. This probability is then compared to the pre-selected significance level of the test. If the P-value is smaller than the significance level, the null hypothesis is rejected, and the test result is termed significant.

The P-value depends on both the null hypothesis and the alternative hypothesis. In particular, a test with a one-sided alternative hypothesis will generally have a lower P-value (and thus be more likely to be significant) than a test with a two-sided alternative hypothesis. However, one-sided tests require more stringent assumptions than two-sided tests. They should only be used when those assumptions apply.
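The effect of the alternative hypothesis on the P-value can be sketched for a test statistic that is standard normal under the null hypothesis. That distributional assumption, the function name `z_test_p_values`, and the statistic value 1.8 are all chosen for illustration.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """P-values for a standard normal test statistic under three alternatives."""
    standard_normal = NormalDist()
    lower = standard_normal.cdf(z)      # lower one-sided alternative
    upper = 1 - standard_normal.cdf(z)  # upper one-sided alternative
    two_sided = 2 * min(lower, upper)   # deviations in either direction
    return {"lower": lower, "upper": upper, "two_sided": two_sided}

p = z_test_p_values(1.8)
print(p)
```

With a significance level of 0.05, the upper one-sided P-value (about 0.036) leads to rejection, while the two-sided P-value (about 0.072) does not.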

### Q

Qualitative

Qualitative variables are variables for which an attribute or classification is measured. Examples of qualitative variables are gender or disease state.

Quantitative

Quantitative variables are variables for which a numeric value representing an amount is measured.

### R

Random variables

A random variable is a rule that assigns a value to each possible outcome of an experiment. For example, if an experiment involves measuring the height of people, then each person who could be a subject of the experiment has an associated value, his or her height. A random variable may be discrete (the possible outcomes are finite, as in tossing a coin) or continuous (the values can take any possible value along a range, as in height measurements).

Rank tests

Rank tests are nonparametric tests that are calculated by replacing the data by their rank values. Rank tests may also be applied when the only data available are relative rankings. Examples of rank tests include the Wilcoxon signed rank test, the Mann-Whitney rank sum test and the Kruskal-Wallis test.

Repeated measures

In a repeated measures ANOVA, there will be at least one factor that is measured at each level for every subject in the experiment. This is a within (repeated measures) factor. For example, an experiment in which each subject performs the same task twice is a repeated measures design, with trial (or trial number) as the within factor. If every subject performed the same task twice under each of two conditions, for a total of 4 observations for each subject, then both trial and condition would be within factors.

In a repeated measures design, there may also be one or more factors that are measured at only one level for each subject, such as gender. This type of factor is a between or grouping factor.

Residuals

A residual is the difference between the observed value of a response measurement and the value that is fitted under the hypothesized model. For example, in a two-sample unpaired t test, the fitted value for a measurement is the mean of the sample from which it came, so the residual would be the observed value minus the sample mean.

Root Mean Square

RMS, sometimes called the quadratic mean, is the square root of the mean squared value. For values $x_1, \ldots, x_n$, the RMS value is given by

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$$

### S

Scale

The generalized concept of the variability or dispersion of a distribution. Typical measures of scale are variance, standard deviation, range, and interquartile range. Scale and spread both refer to the same general concept of variability.

Scale Axis

An axis that displays numerical values to scale. (A category axis, in contrast, displays individual values separately and not necessarily to scale.) Bar charts and line charts usually have at least one scale axis, plus one category axis. Scatter Charts have at least two scale axes but no category axis. Histograms have a scale axis and an interval axis.

Shape

The general form of a Distribution, often characterized by its skewness and kurtosis (heavy or light tails relative to a normal distribution).

Significance level

The significance level (also known as the alpha level) of a statistical test is the pre-selected probability of (incorrectly) rejecting the null hypothesis when it is in fact true. Usually a small value such as 0.05 is chosen. If the P-value calculated for a statistical test is smaller than the significance level, the null hypothesis is rejected.

Similarities

The similarity between two clusters i and j is given by

where dm is the maximum value in the original distance matrix

Skewness

Skewness is a lack of symmetry in a distribution. Data from a positively skewed (skewed to the right) distribution have values that are bunched together below the mean, but have a long tail above the mean. (Distributions that are forced to be positive, such as annual income, tend to be skewed to the right.) Data from a negatively skewed (skewed to the left) distribution have values that are bunched together above the mean, but have a long tail below the mean. Population skewness is defined as

$$\gamma_1 = \frac{E\left[(X - \mu)^3\right]}{\sigma^3}$$

where µ and σ are the population mean and standard deviation.

Spread

The generalized concept of the variability of a distribution. Typical measures of spread are variance, standard deviation, range, and interquartile range. Spread and scale both refer to the same general concept of variability.

Standard Deviation

A measure of dispersion around the mean, equal to the square root of the variance. The standard deviation is measured in the same units as the variable itself. If all sample values are multiplied by a positive constant, the sample standard deviation is multiplied by the same constant. For values $x_1, \ldots, x_n$ with mean $\bar{x}$, the standard deviation calculated using the “unbiased” method is given by

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Standard Deviation (biased)

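For values $x_1, \ldots, x_n$ with mean $\bar{x}$, the standard deviation calculated using the “biased” or “n” method is given by

$$s_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$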
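The two standard deviation entries differ only in the divisor. A brief sketch using Python's standard `statistics` module, where `stdev` divides by n - 1 and `pstdev` divides by n; the data values are hypothetical.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, sum of squared deviations 32

unbiased = statistics.stdev(values)  # sqrt(32 / 7), the "n - 1" method
biased = statistics.pstdev(values)   # sqrt(32 / 8) = 2, the "n" method

# Multiplying every value by 3 multiplies the standard deviation by 3.
scaled = statistics.stdev([3 * v for v in values])
print(unbiased, biased, scaled)
```

The "n" method always gives the smaller value, and the gap between the two shrinks as the number of values grows.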

Standardized Residuals

The raw residuals divided by the square root of the expected counts. Standardized residuals, which are also known as Pearson residuals, have a mean of 0 and a standard deviation of 1.

Sum of Squares

The uncorrected sum of squares of the values $x_1, \ldots, x_n$, computed as

$$\sum_{i=1}^{n} x_i^2$$

Sum of Squared Errors

The sum of squares corrected for the mean. For values $x_1, \ldots, x_n$ with mean $\bar{x}$, computed as:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

### T

t distribution

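The Student's t distribution with v degrees of freedom has the probability density function:

$$f(t) = \frac{\Gamma\!\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\;\Gamma\!\left(\frac{v}{2}\right)} \left(1 + \frac{t^2}{v}\right)^{-(v+1)/2}$$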

Test of independence

A test of independence for a contingency table tests the null hypothesis that the row classification factor and the column classification factor are independent. One such test is Pearson's chi-square test for independence.

Triangular Distribution

Used primarily to describe a population for which limited sample data are available.

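The triangular distribution with lower bound a, upper bound b, and mode c (a < c < b) has the probability density function:

$$f(x) = \begin{cases} \dfrac{2(x-a)}{(b-a)(c-a)}, & a \le x \le c \\[2ex] \dfrac{2(b-x)}{(b-a)(b-c)}, & c < x \le b \\[2ex] 0, & \text{otherwise} \end{cases}$$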

Two-sided

A hypothesis test in which large deviations in either direction from the null hypothesis are considered significant.

Type I error

A Type I error occurs if, based on the sample data, we decide to reject the null hypothesis when in fact (i.e., in the population) the null hypothesis is true.

Type II error

A Type II error occurs if, based on the sample data, we decide not to reject the null hypothesis when in fact (i.e., in the population) the null hypothesis is false.

### U

Uniform Distribution

A continuous distribution that describes variables whose probability density is constant over an interval. The uniform distribution is also known as the rectangular distribution.

The uniform distribution with parameters a (lower bound) and b (upper bound) has the probability density function:

$$f(x) = \frac{1}{b-a}, \quad a \le x \le b$$

Upper One-sided

A hypothesis test in which only large deviations in the right (upper) direction from the null hypothesis are considered significant. See also Two-sided.

### V

Variance

A measure of dispersion around the mean, equal to the sum of squared deviations from the mean divided by one less than the number of cases. The variance is measured in units that are the square of those of the variable itself. The difference between a value and the mean is called a deviation from the mean. Thus, the variance approximates the mean of the squared deviations.

When all the values lie close to the mean, the variance is small but never less than zero. When values are more scattered, the variance is larger. If all sample values are multiplied by a constant, the sample variance is multiplied by the square of the constant.

Variance (biased)

For values $x_1, \ldots, x_n$ with mean $\bar{x}$, computed as

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Violation of assumptions

Statistical hypothesis tests generally make assumptions about the population(s) from which the data were sampled. For example, many normal-theory-based tests such as the t test and ANOVA assume that the data are sampled from one or more normal distributions, as well as that the variances of the different populations are the same (homoscedasticity). If test assumptions are violated, the test results may not be valid.

### W

Within effects

In a repeated measures ANOVA, there will be at least one factor that is measured at each level for every subject. This is a within (repeated measures) factor. For example, in an experiment in which each subject performs the same task twice, trial number is a within factor. There may also be one or more factors that are measured at only one level for each subject, such as gender. This type of factor is a between or grouping factor.

### Y

Yates' correction for continuity

Yates' continuity correction improves the approximation of the discrete sample chi-square statistic to a continuous chi-square distribution. The continuity-adjusted chi-square is most useful for small sample sizes. The use of the continuity adjustment is controversial; this chi-square test is more conservative, and more like Fisher's exact test, when your sample size is small. As the sample size increases, the statistic becomes more and more like the Pearson chi-square. The following is Yates' corrected version of Pearson's chi-square statistic:

$$Q_C = \sum_{i}\sum_{j} \frac{\left(\max\left(0,\; |n_{ij} - e_{ij}| - 0.5\right)\right)^2}{e_{ij}}$$

where $n_{ij}$ and $e_{ij}$ are the observed and expected frequencies in row i and column j.

Under the null hypothesis of independence, $Q_C$ has an asymptotic chi-square distribution with (r-1)×(c-1) degrees of freedom.
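A minimal sketch comparing the Pearson and continuity-corrected statistics for a 2x2 table. It assumes the common form of the correction, in which 0.5 is subtracted from each absolute difference between observed and expected counts (floored at zero) before squaring. The function name `chi_square_2x2` and the counts are hypothetical.

```python
def chi_square_2x2(observed, continuity=False):
    """Pearson chi-square statistic, optionally with Yates' continuity correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(count - expected)
            if continuity:
                # Shrink each absolute difference by 0.5, but never below zero.
                diff = max(0.0, diff - 0.5)
            statistic += diff ** 2 / expected
    return statistic

table = [[10, 20], [20, 10]]
pearson = chi_square_2x2(table)
yates = chi_square_2x2(table, continuity=True)
print(pearson, yates)
```

For this table every expected count is 15, so the Pearson statistic is 4 × 25/15 = 20/3 and the corrected statistic is 4 × 20.25/15 = 5.4, which is the more conservative value.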

### Z

Z-score

The number of standard deviations that a value lies from the mean. For a value from a normal distribution, the z-score is found by subtracting the mean of the distribution and dividing by the standard deviation. Most commonly used for test statistics, since the z-score can be referred to tables of the standard normal distribution to determine the p-value.
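As a short sketch, the z-score and the corresponding upper-tail probability can be computed with Python's standard library, which stands in for the printed normal tables. The function name `z_score` and the numbers (a value of 130 from a distribution with mean 100 and standard deviation 15) are hypothetical.

```python
from statistics import NormalDist

def z_score(value, mean, standard_deviation):
    """Number of standard deviations that value lies above (or below) the mean."""
    return (value - mean) / standard_deviation

z = z_score(130, mean=100, standard_deviation=15)

# Probability that a standard normal variable exceeds z (upper-tail area).
upper_tail = 1 - NormalDist().cdf(z)
print(z, upper_tail)
```

Here the z-score is 2.0 and the upper-tail area is about 0.023.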