#### Outlier

Outlier: Sometimes a set of data will have one or more items with unusually large or unusually small values. Such extreme values are called outliers. Outliers often arise from some mistakes in data-gathering or data-recording procedures. It is good practice to inspect a data set…

#### Parameter

Parameter: A parameter is a numerical value that describes one of the characteristics of a probability distribution or population. For example, a binomial distribution is completely specified if the number of trials and probability of success are known. Here, the number of trials and the…

#### Sample

Sample: A sample is a portion of the elements of a population. A sample is chosen to make inferences about the population by examining or measuring the elements in the sample.

#### Sample Space

Sample Space: The set of all possible outcomes of a particular experiment is called the sample space for the experiment. If a coin is tossed twice, the sample space is {HH, HT, TH, TT}, where TH, for example, means getting tails on the first toss…
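The coin-toss example can be enumerated mechanically. The sketch below is only an illustration; the helper name `sample_space` is an assumption, not part of the glossary.

```python
from itertools import product

def sample_space(outcomes, n_trials):
    """List every possible sequence of n_trials outcomes, in order."""
    return ["".join(seq) for seq in product(outcomes, repeat=n_trials)]

# Two tosses of a coin with outcomes H (heads) and T (tails).
two_tosses = sample_space("HT", 2)
print(two_tosses)  # ['HH', 'HT', 'TH', 'TT']
```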

#### Sampling Distribution

Sampling Distribution: When a sample is drawn, some summary value (called a statistic) is usually computed. For example, the sample mean and the sample variance are two statistics. The value of the statistic changes with the sample we have. The probability distribution of the statistic…
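One way to see how a statistic varies from sample to sample is to simulate it. The sketch below assumes a fair six-sided die as the population and the sample mean as the statistic; the function name, sample size, and seed are illustrative choices only.

```python
import random
from statistics import mean

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (with replacement) and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [mean(rng.choices(population, k=sample_size)) for _ in range(n_samples)]

die = [1, 2, 3, 4, 5, 6]
means = sample_means(die, sample_size=10, n_samples=2000)

# The collection of simulated means approximates the sampling distribution
# of the sample mean; its center should be close to the population mean 3.5.
print(round(mean(means), 2))
```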

#### Standard Score

Standard Score: The standard score of an observation is the number of standard deviation units it is above or below the mean. The standard score of an observation is calculated by subtracting the mean from the observation, then dividing by the standard deviation.
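As a minimal sketch of that calculation, the function below subtracts the mean and divides by the standard deviation. It assumes the data are treated as a whole population (so `pstdev` is used); with a sample, `statistics.stdev` would be the usual substitute.

```python
from statistics import mean, pstdev

def standard_score(x, data):
    """Number of standard deviations that x lies above (+) or below (-) the mean of data."""
    return (x - mean(data)) / pstdev(data)

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population standard deviation 2
print(standard_score(9, scores))    # 2.0
print(standard_score(4, scores))    # -0.5
```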

#### State Space

State Space: State space is an abstract space representing possible states of a system. A point in the state space is a vector of the values of all relevant parameters of the system. It is often assumed that the system is dynamic -…

#### Statistic

Statistic: 1. A number measuring something 2. A measure calculated from a sample of data. Contrast "statistic" (drawn from a sample) with "parameter," which is a characteristic of a population. For example, the sample mean is a statistic; the population mean is a parameter of…

#### Statistics

Statistics: 1. A collection of numerical data that measure something. 2. The science of recording, organizing, analyzing and reporting quantitative information. See also: statistic

#### Survival Analysis

Survival Analysis: Survival analysis is concerned with "time-to-event" data. In medical statistics, the data are often in the form of "time-to-death". In the analysis of production or industrial data, "time-to-failure" is a typical application. However, the event of interest need not be either failure or…

#### Time Series

Time Series: Time series data are measurements of a variable taken at regular intervals over time. Time series are represented as sequences of values like x(1), x(2), ... . A wide class of practically important data are represented as time series: economic and social data,…

#### Time Series Analysis

Time Series Analysis: Time series analysis is a branch of statistics dealing with data represented as time series. Time series analysis includes almost all classes of statistical approaches and problems: data description, hypothesis testing, parameter estimation, regression, etc. The practical importance…

#### Transformation

Transformation: Transformation is the conversion of a data set into a transformed data set by the application of a function. The statistical purpose of transformation is to produce a transformed data set that better conforms to the requirements of a statistical procedure. A typical use…

#### Truncation

Truncation: Truncation, generally speaking, means to shorten. In statistics it can mean the process of limiting consideration or analysis to data that meet certain criteria (for example, the patients still alive at a certain point). Or it can refer to a data distribution where values…

#### Uncertainty and Statistics

Uncertainty and Statistics: A main goal of statistics is to quantify or measure uncertainty; this branch of statistics is called "inferential statistics." Classical statistics measures uncertainty using fundamental concepts and theories of probability and randomness. Modern statistics often applies Monte Carlo simulation as…

#### Univariate

Univariate: Univariate analysis involves a single variable of interest.

#### Variables (in design of experiments)

Variables (in design of experiments): Many statistical methods rest on a statistical model which states a relationship Y = f(X1,..,XN) between a dependent variable (Y) and independent variable(s) X1,...,XN. In designed experiments, the dependent variable is often named "response", independent variables manipulated by the experimenter…

#### Z score

Z score: An observation's z-score tells you the number of standard deviations it lies away from the population mean (and in which direction). The calculation is as follows: $z = \frac{x - m}{s}$, where x is the observation itself, m is the mean…

#### Average Deviation

Average Deviation: The average deviation or the average absolute deviation is a measure of dispersion. It is the average of absolute deviations of the individual values from the median or from the mean.
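The definition allows either the mean or the median as the reference point, so the sketch below takes the choice of center as an argument. The function name and the `center` parameter are illustrative assumptions.

```python
from statistics import mean, median

def average_deviation(data, center=mean):
    """Average absolute deviation of the values from center(data)."""
    c = center(data)
    return mean([abs(x - c) for x in data])

values = [2, 4, 4, 4, 5, 5, 7, 9]
print(average_deviation(values))                  # deviations taken from the mean
print(average_deviation(values, center=median))   # deviations taken from the median
```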

#### Box Plot

Box Plot: A box plot is a graph that characterizes the pattern of variation of the data. The plot simultaneously displays several measures of central tendency and dispersion of the data at hand. The box plot provides the following information: (1) the position of the…

#### Chernoff Faces

Chernoff Faces: Chernoff faces are a category of icon plots. Each unit is represented as a schematic face. Variables of interest are represented by particular parameters of the face, e.g. the nose size, eye-to-eye distance, etc.

#### Coefficient of variation

Coefficient of variation: The coefficient of variation is the standard deviation of a data set, divided by the mean of the same data set.

#### Column icon plots

Column icon plots: See sequential icon plots.

#### Correlation Coefficient

Correlation Coefficient: The correlation coefficient indicates the degree of linear relationship between two variables. The correlation coefficient always lies between -1 and +1. -1 indicates perfect linear negative relationship between two variables, +1 indicates perfect positive linear relationship and 0 indicates lack of any linear…

#### Correlation Matrix

Correlation Matrix: A correlation matrix describes correlation among M variables. It is a square symmetrical M×M matrix with the (i,j)th element equal to the correlation coefficient $r_{ij}$ between the ith and the jth variable. The diagonal elements (correlations of variables with themselves) are always equal…
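The structure described above (square, symmetric, one entry per pair of variables) can be sketched in a few lines. The code assumes the Pearson correlation coefficient is the one intended and represents each variable as a plain list; both function names are illustrative.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """M x M matrix whose (i, j) entry is the correlation of variable i with variable j."""
    m = len(columns)
    return [[pearson_r(columns[i], columns[j]) for j in range(m)] for i in range(m)]

# Three variables observed on the same four units.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
R = correlation_matrix(columns)
for row in R:
    print([round(r, 3) for r in row])
```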

#### Correspondence Plot

Correspondence Plot: A correspondence plot represents the results of correspondence analysis (CA). For each category (possible value of a variable), its scores derived by CA for the first two dimensions are depicted as a point on the x-y plane. An interesting feature of the correspondence…

#### Covariance

Covariance: The covariance between two random variables X and Y is the expected value of the product of the variables' deviations from their means. If there is a high probability that large values of X go with large values of Y and small values of…
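For observed data, the expected value in this definition is commonly replaced by an average over the paired observations. The sketch below assumes the population form (dividing by n rather than n - 1); the function name is illustrative.

```python
from statistics import mean

def covariance(x, y):
    """Average product of paired deviations from the means (population form, divisor n)."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

x = [1, 2, 3]
y_up = [2, 4, 6]     # large x goes with large y
y_down = [6, 4, 2]   # large x goes with small y

print(covariance(x, y_up))    # positive
print(covariance(x, y_down))  # negative
```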

#### Descriptive Statistics

Descriptive Statistics: Descriptive statistics refers to statistical techniques used to summarize and describe a data set, and also to the statistics (measures) used in such summaries. Measures of central tendency (e.g. mean, median) and variation (e.g. range, standard deviation) are the main descriptive statistics. Displays…

#### Geometric mean

Geometric mean: The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful in taking averages of ratios. The geometric mean is often used for data which take only on positive…

#### Geometric Mean and Mean (comparison)

Geometric Mean and Mean (comparison): The quantitative distinction between the geometric mean and the mean can be illustrated by the following table:

| Data set | Mean | Geometric Mean |
|---|---|---|
| 1, 1, 1 | 1 | 1 |
| 1, 2, 3 | 2 | ≈ 1.8 |
| 1, 2, 1000 | ≈ 334 | ≈ 12.6 |

The analytical relation…
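A comparison of this kind can be checked numerically. The sketch below computes the geometric mean as the nth root of the product of n positive values and prints it next to the arithmetic mean; the function name and the example data sets are illustrative.

```python
from math import prod
from statistics import mean

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return prod(values) ** (1 / len(values))

for data in ([1, 1, 1], [1, 2, 3], [1, 2, 1000]):
    print(data, round(mean(data), 1), round(geometric_mean(data), 1))
```

A single very large value pulls the arithmetic mean up far more than it pulls the geometric mean.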

#### Gini coefficient

Gini coefficient: The Gini coefficient is used in economics to measure income inequality. Generally speaking, it is used to measure the extent of departure from a perfectly even distribution of income. A "0" indicates no departure, i.e. everyone has the same income. A "1" indicates…

#### Gini's Mean Difference

Gini's Mean Difference: Gini's mean difference is a descriptive statistic, a measure of variation. For a sample of N values, Gini's mean difference is the average of all pairwise absolute differences: $\mathrm{GMD} = \frac{1}{N(N-1)} \sum_{i,j} |x_i - x_j|$; $i, j = 1, \dots, N$; $i \neq$…
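Read as "the average of all pairwise absolute differences", the statistic can be sketched by looping over every ordered pair of distinct positions, which gives N(N-1) terms. The function name is an illustrative assumption.

```python
from itertools import permutations

def gini_mean_difference(values):
    """Average absolute difference over all ordered pairs of distinct observations."""
    n = len(values)
    total = sum(abs(a - b) for a, b in permutations(values, 2))
    return total / (n * (n - 1))

sample = [1, 2, 4]
# Pairwise absolute differences: |1-2| = 1, |1-4| = 3, |2-4| = 2, each counted twice.
print(gini_mean_difference(sample))  # 2.0
```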

#### Hotelling Trace Coefficient

Hotelling Trace Coefficient: The Hotelling Trace coefficient (also called Lawley-Hotelling or Hotelling-Lawley Trace) is a statistic for a multivariate test of mean differences between two groups. The null hypothesis is that centroids don't differ between two groups. The coefficient is equal to Hotelling's T-Square…

#### Hotelling-Lawley Trace

Hotelling-Lawley Trace: See Hotelling Trace coefficient.

#### Icon Plots

Icon Plots: Icon plots are graphical tools for multivariate analysis. They provide graphical representation of observed units described by many variables. Each unit or observation is represented by a small image which depends on the values of the variables of interest. Icon plots…

#### Interquartile Range

Interquartile Range: The difference between the 3rd and 1st quartiles is called the interquartile range and it is used as a measure of variability (dispersion).
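A short sketch of the calculation follows. Quartiles can be defined in several slightly different ways; this example assumes the default ("exclusive") method of Python's `statistics.quantiles`, so other software may give somewhat different values for the same data.

```python
from statistics import quantiles

def interquartile_range(data):
    """Difference between the 3rd and 1st quartiles (default 'exclusive' quantile method)."""
    q1, _q2, q3 = quantiles(data, n=4)
    return q3 - q1

observations = [1, 2, 3, 4, 5, 6, 7]
iqr = interquartile_range(observations)
print(iqr)  # 4.0 (Q1 = 2, Q3 = 6 under this method)
```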

#### Kurtosis

Kurtosis: Kurtosis measures the "heaviness of the tails" of a distribution (compared to a normal distribution). Kurtosis is positive if the tails are "heavier" than for a normal distribution, and negative if the tails are "lighter" than for a normal distribution. The normal distribution…

#### Life Tables

Life Tables: In survival analysis, life tables summarize lifetime data or, generally speaking, time-to-event data. Rows in a life table usually correspond to time intervals, columns to the following categories: (i) not "failed", (ii) "failed", (iii) censored (withdrawn), and the sum of the three called…

#### Likert Scales

Likert Scales: Likert scales are categorical ordinal scales used in social sciences to measure attitude. Measurements on Likert scales usually take on an odd number of values with a middle point, e.g. "strongly agree", "agree", "undecided", "disagree", "strongly disagree". The middle value is usually…

#### Log-log Plot

Log-log Plot: A log-log plot represents observed units described by two variables, say x and y, as a scatter graph. In a log-log plot, the two axes display the logarithm of values of the variables, not the values themselves. If the relationship between…

#### Logit

Logit: Logit is a nonlinear function of probability. If p is the probability of an event, then the corresponding logit is given by the formula: $\operatorname{logit}(p) = \log \frac{p}{1 - p}$. Logit is widely used to construct statistical models, for example in logistic regression…

#### Logit and Odds Ratio

Logit and Odds Ratio: The following relation between the odds ratio and logit is often used for constructing statistical models: $\log \mathrm{OR}(p_1, p_2) = \operatorname{logit}(p_1) - \operatorname{logit}(p_2)$, where $p_1$, $p_2$ are probabilities and $\mathrm{OR}(p_1, p_2)$ is the odds ratio for $p_1$ and $p_2$. See also: Logit…
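The relation can be confirmed numerically for any pair of probabilities strictly between 0 and 1. The sketch below assumes the natural logarithm and uses illustrative function names.

```python
from math import isclose, log

def odds(p):
    """Odds corresponding to probability p, for 0 < p < 1."""
    return p / (1 - p)

def logit(p):
    """Log of the odds (natural logarithm assumed)."""
    return log(odds(p))

def odds_ratio(p1, p2):
    """Ratio of the odds for p1 to the odds for p2."""
    return odds(p1) / odds(p2)

p1, p2 = 0.8, 0.5
lhs = log(odds_ratio(p1, p2))
rhs = logit(p1) - logit(p2)
print(isclose(lhs, rhs))  # True
```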

#### Mean

Mean: For a population or a sample, the mean is the arithmetic average of all values. The mean is a measure of central tendency or location. See also: Expected Value.

#### Mean Deviation

Mean Deviation: See Average Deviation.

#### Mean Squared Error

Mean Squared Error: The mean squared error is a measure of performance of a point estimator. It measures the average squared difference between the estimator and the parameter. For an unbiased estimator, the mean squared error is equal to the variance of the…

#### Median

Median: In a population or a sample, the median is the value that has just as many values above it as below it. If there are an even number of values, the median is the average of the two middle values. The median is a…

#### Mode

Mode: The mode is a value that occurs with the greatest frequency in a population or a sample. It could be considered as the single value most typical of all the values.

#### Moments

Moments: For a random variable x, its Nth moment is the expected value of the Nth power of x, where N is a positive integer. The Nth moment of the deviation of x from its mean is called "the Nth central moment". The 1st moment…

#### Odds Ratio

Odds Ratio: The odds ratio compares two probabilities (or proportions) $P_1$ and $P_2$ in the following way: $q = \frac{P_1/(1-P_1)}{P_2/(1-P_2)}$. If $P_1$ and $P_2$ are equal, the odds ratio is equal to 1.

#### Order Statistics

Order Statistics: The order statistics of a random sample $X_1, X_2, \dots, X_n$ are the sample values placed in ascending order. They are denoted by $X_{(1)}, X_{(2)}, \dots, X_{(n)}$. Here, $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$. For…

#### Path coefficients

Path coefficients: In path analysis and structural equation modeling a path coefficient is the partial correlation coefficient between the dependent variable and an independent variable, adjusted for other independent variables.

#### Pearson correlation coefficient

Pearson correlation coefficient: See correlation coefficient.

#### Percentile

Percentile: In a population or a sample, the Pth percentile is a value such that at least P percent of the values take on this value or less and at least (100-P) percent of the values take on this value or more. See also: quartile,…
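The two-sided condition in this definition can be turned into a direct check. The sketch below picks one value that satisfies it (other conventions, such as interpolating between neighbors, also exist) and then verifies both inequalities; the function names are illustrative and `p` is assumed to be an integer from 0 to 100.

```python
def percentile(values, p):
    """Return one value satisfying the Pth-percentile definition (p an integer, 0..100)."""
    ordered = sorted(values)
    n = len(ordered)
    k = max(1, -(-p * n // 100))  # ceil(p * n / 100) using integer arithmetic
    return ordered[k - 1]

def satisfies_definition(values, p, v):
    """At least p% of values are <= v and at least (100 - p)% of values are >= v."""
    n = len(values)
    at_or_below = sum(x <= v for x in values)
    at_or_above = sum(x >= v for x in values)
    return 100 * at_or_below >= p * n and 100 * at_or_above >= (100 - p) * n

data = list(range(1, 11))   # 1, 2, ..., 10
print(percentile(data, 50))                                   # 5
print(satisfies_definition(data, 50, percentile(data, 50)))   # True
```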

#### Asymptotic Efficiency

Asymptotic Efficiency: For an unbiased estimator, asymptotic efficiency is the limit of its efficiency as the sample size tends to infinity. An estimator with asymptotic efficiency 1.0 is said to be an "asymptotically efficient estimator". Roughly speaking, the precision of an asymptotically efficient estimator tends…

#### Asymptotic Property

Asymptotic Property: An asymptotic property is a property of an estimator that holds as the sample size approaches infinity.

#### Asymptotically Unbiased Estimator

Asymptotically Unbiased Estimator: An asymptotically unbiased estimator is an estimator that is unbiased as the sample size tends to infinity. Some biased estimators are asymptotically unbiased but all unbiased estimators are asymptotically unbiased.

#### Biased Estimator

Biased Estimator: An estimator is a biased estimator if its expected value is not equal to the value of the population parameter being estimated.

#### Coefficient of Determination

Coefficient of Determination: In regression analysis, the coefficient of determination is a measure of goodness-of-fit (i.e. how well or tightly the data fit the estimated model). The coefficient is defined as the ratio of two sums of squares: $r^2 = \frac{SSR}{SST}$, where SSR…

#### Confidence Interval

Confidence Interval: A confidence interval is an interval that brackets a sample estimate and quantifies the uncertainty around this estimate. Since there are a variety of samples that might be drawn from a population, there are likewise a variety of confidence intervals that might be imagined…

#### Efficiency

Efficiency: For an unbiased estimator, efficiency indicates how much its precision is lower than the theoretical limit of precision provided by the Cramer-Rao inequality. A measure of efficiency is the ratio of the theoretically minimal variance to the actual variance of the estimator. This measure…

#### Estimation

Estimation: Estimation is deriving a guess about the actual value of a population parameter (or parameters) from a sample drawn from this population. See also Estimator.

#### Estimator

Estimator: A statistic, measure, or model, applied to a sample, intended to estimate some parameter of the population that the sample came from.

#### Forward Selection

Forward Selection: Forward selection is one of several computer-based iterative variable-selection procedures. It resembles step-wise regression except that a variable added to the model is not permitted to be removed in the subsequent steps. See also Backward elimination.

#### Kaplan-Meier Estimator

Kaplan-Meier Estimator: The Kaplan-Meier estimator is aimed at estimation of the survival function from censored life-time data. The value of the survival function between successive distinct uncensored observations is taken as constant, and the graph of the Kaplan-Meier estimate of the survival function is a…
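A minimal sketch of the product-limit calculation follows, under stated assumptions: each subject has a time and a flag that is 1 for an observed event and 0 for a censored observation, and subjects censored at a given time are still counted as at risk for events at that same time. The function name and data are illustrative, not taken from the glossary.

```python
def kaplan_meier(times, observed):
    """Return [(t, S(t))] at each distinct event time, using the product-limit formula."""
    pairs = sorted(zip(times, observed))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0
        leaving = 0
        # Gather everyone whose recorded time equals t.
        while i < len(pairs) and pairs[i][0] == t:
            events += pairs[i][1]
            leaving += 1
            i += 1
        if events:
            # The estimate changes only at uncensored event times.
            survival *= 1 - events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve

times = [1, 2, 2, 3, 4]
observed = [1, 1, 0, 1, 0]   # 1 = event observed, 0 = censored
km_curve = kaplan_meier(times, observed)
for t, s in km_curve:
    print(t, round(s, 3))
```

Between the listed times the estimate stays constant, which is what produces the step-shaped curve.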

#### Least Squares Method

Least Squares Method: In a narrow sense, the Least Squares Method is a technique for fitting a straight line through a set of points in such a way that the sum of the squared vertical distances from the observed points to the fitted line is…

#### Line of Regression

Line of Regression: The line of regression is the line that best fits the data in simple linear regression, i.e. the line that corresponds to the "best-fit" parameters (slope and intercept) of the regression equation.

#### Linear Regression

Linear Regression: Linear regression is aimed at finding the "best-fit" linear relationship between the dependent variable and independent variable(s). See also: Regression analysis, Simple linear regression, Multiple regression.

#### Logistic Regression

Logistic Regression: Logistic regression is used with binary data when you want to model the probability that a specified outcome will occur. Specifically, it is aimed at estimating parameters a and b in the following model: $L_i = \log \frac{p_i}{1 - p_i} = a + b$…

#### Loglinear regression

Loglinear regression: Loglinear regression is a kind of regression aimed at finding the best fit between the data and a loglinear model. The major assumption of loglinear regression is that a linear relationship exists between the log of the dependent variable and the independent…

#### Margin of Error

Margin of Error: A margin of error typically refers to a range within which an unknown parameter is estimated to fall, given the variation that can arise from one sample to another. For example, in an opinion survey based on a randomly-drawn sample from a…

#### Maximum Likelihood Estimator

Maximum Likelihood Estimator: The method of maximum likelihood is the most popular method for deriving estimators - the value of the population parameter T maximizing the likelihood function is used as the estimate of this parameter. The general idea behind maximum likelihood estimation is to…

#### Multiple Least Squares Regression

Multiple Least Squares Regression: Multiple least squares regression is a special (and the most common) type of multiple regression. It relies on the least squares method to fit the regression model to the data. See also: ordinary least squares regression.

#### Multiple Regression

Multiple Regression: Multiple (linear) regression is a regression technique aimed at finding a linear relationship between the dependent variable and multiple independent variables. (See regression analysis.) The multiple regression model is as follows: $Y_i = B_0 + B_1 X_{1i} + B_2 X_{2i} + \dots + B_m$…

#### Non-parametric Regression

Non-parametric Regression: Non-parametric regression methods are aimed at describing a relationship between the dependent and independent variables without specifying the form of the relationship between them a priori. See also: Regression analysis.

#### Ordinary Least Squares Regression

Ordinary Least Squares Regression: Ordinary least squares regression is a special (and the most common) kind of ordinary linear regression. It is based on the least squares method of finding regression parameters. Technically, the aim of ordinary least squares regression is to find out…

#### Ordinary Linear Regression

Ordinary Linear Regression: See: simple linear regression.

#### Orthogonal Least Squares

Orthogonal Least Squares: In ordinary least squares, we try to minimize the sum of the vertical squared distances between the observed points and the fitted line. In orthogonal least squares, we try to fit a line which minimizes the sum of the squared distances between…

#### Precision

Precision: Precision is the degree of accuracy with which a parameter is estimated by an estimator. Precision is usually measured by the standard deviation of the estimator, which is known as the standard error. For example, the sample mean is used to estimate the population…

#### Regression

Regression: See regression analysis.

#### Regression Analysis

Regression Analysis: Regression analysis provides a "best-fit" mathematical equation for the relationship between the dependent variable (response) and independent variable(s) (covariates). There are two major classes of regression - parametric and non-parametric. Parametric regression requires choice of the regression equation with one or a greater…

#### Residuals

Residuals: Residuals are differences between the observed values and the values predicted by some model. Analysis of residuals allows you to estimate the adequacy of a model for particular data; it is widely used in regression analysis.

#### Resistance

Resistance: Resistance, used with respect to sample estimators, refers to the sensitivity of the estimator to extreme observations. Estimators that do not change much with the addition or deletion of extreme observations are said to be resistant. The median is a resistant estimator…

#### Backward Elimination

Backward Elimination: Backward elimination is one of several computer-based iterative variable-selection procedures. It begins with a model containing all the independent variables of interest. Then, at each step the variable with the smallest F-statistic is deleted (if the F is not higher than the chosen cutoff…

#### Simple Linear Regression

Simple Linear Regression: The simple linear regression is aimed at finding the "best-fit" values of two parameters - A and B in the following regression equation: $Y_i = A X_i + B + E_i, \quad i = 1, \dots, N$, where $Y_i$, $X_i$, and $E_i$ are the values of the…
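As a sketch of how the two parameters might be estimated, the code below uses the standard closed-form least squares solution, with A as the slope and B as the intercept to match the equation above. That "best-fit" means least squares here is an assumption, and the function name is illustrative.

```python
from statistics import mean

def fit_line(x, y):
    """Least squares estimates (A, B) for the model Y = A*X + B + E."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx        # slope
    b = my - a * mx      # intercept
    return a, b

x_obs = [1, 2, 3, 4]
y_obs = [3, 5, 7, 9]     # lies exactly on Y = 2*X + 1, so every E_i is zero
A, B = fit_line(x_obs, y_obs)
print(A, B)  # 2.0 1.0
```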

#### Uplift or Persuasion Modeling

Uplift or Persuasion Modeling: A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments. Here are the steps, in conceptual terms, for a typical uplift…

#### Step-wise Regression

Step-wise Regression: Step-wise regression is one of several computer-based iterative variable-selection procedures. Variables are added one-by-one based on their contribution to R-squared, but first, at each step we determine whether any of the variables (already included in the model) can be removed. If none of…

#### Sufficient Statistic

Sufficient Statistic: Suppose X is a random vector with probability distribution (or density) P(X | V), where V is a vector of parameters, and Xo is a realization of X. A statistic T(X) is called a sufficient statistic if the conditional probability (density) P(X |…

#### Variable-Selection Procedures

Variable-Selection Procedures: In regression analysis, variable-selection procedures are aimed at selecting a reduced set of the independent variables - the ones providing the best fit to the model. The criterion for selecting is usually the following F-statistic: F(x1,...,xp; xp+1) = [SSE(x1,...,xp) - SSE(x1,...,xp, xp+1)] / SSE(x1,...,xp)…

#### Alpha Spending Function

Alpha Spending Function: In the interim monitoring of clinical trials, multiple looks are taken at the accruing results. In such circumstances, akin to multiple testing, the alpha-value at each look must be adjusted in order to preserve the overall Type-1 Error. Alpha spending functions, (the…

#### Attribute

Attribute: In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another. It might be measured in continuous values (e.g. time spent on a web site), or in categorical…

#### Categorical Data

Categorical Data: Categorical data reflect the classification of objects into different categories. For example, people who receive a mail order offer might be classified as "no response," "purchase and pay," "purchase but return the product," and "purchase and neither pay nor return."

#### Cross sectional study

Cross sectional study: Cross sectional studies are those that record data from a sample of subjects at a given point in time. See also cross sectional data, longitudinal study.

#### Cross-sectional Analysis

Cross-sectional Analysis: Cross-sectional analysis is concerned with statistical inference from cross-sectional data.

#### Cross-sectional Data

Cross-sectional Data: Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual. A simple example of cross-sectional data is the gross annual income for each of 1000 randomly chosen households in New York…

#### Cohort data

Cohort data: Cohort data record multiple observations over time for a set of individuals or units tied together by some event (say, born in the same year). See also longitudinal data and panel data.

#### Crossover Design

Crossover Design: In randomized trials, a crossover design is one in which each subject receives each treatment, in succession. For example, subject 1 first receives treatment A, then treatment B, then treatment C. Subject 2 might receive treatment B, then treatment A, then treatment C.…

#### Disproportionate Stratified Random Sampling

Disproportionate stratified random sampling: See Stratified Sampling (method ii).

#### Effect

Effect: In design of experiments, the effect of a factor is an additive term of the model, reflecting the contribution of the factor to the response. See Variables (in design of experiments) for an explanatory example.

#### Effect Size

Effect Size: In a study or experiment with two groups (usually control and treatment), the investigator typically has in mind the magnitude of the difference between the two groups that he or she wants to be able to detect in a hypothesis test. This magnitude,…

#### Error Spending Function

Error Spending Function: See alpha spending function.