2-Tailed vs. 1-Tailed Tests

2-Tailed vs. 1-Tailed Tests: The purpose of a hypothesis test is to avoid being fooled by chance occurrences into thinking that the effect you are investigating (for example, a difference between treatment and control) is real. If you are investigating, say, the difference between an…

A Priori Probability

A Priori Probability: A priori probability is the probability estimate prior to receiving new information. See also Bayes' Theorem and posterior probability.

A-B Test

A-B Test: An A-B test is a classic statistical design in which individuals or subjects are randomly split into two groups and some intervention or treatment is applied - one group gets treatment A, the other treatment B. Typically one of the treatments will be…

Acceptance Region

Acceptance Region: In hypothesis testing, the test procedure partitions all the possible sample outcomes into two subsets (on the basis of whether the observed value of the test statistic is smaller than a threshold value or not). The subset that is considered to be consistent…

Acceptance Sampling

Acceptance Sampling: Acceptance sampling is the use of sampling methods to determine whether a shipment of products or components is of sufficient quality to be accepted.

Acceptance Sampling Plans

Acceptance Sampling Plans: For a shipment or production lot, an acceptance sampling plan defines a sampling procedure and gives decision rules for accepting or rejecting the shipment or lot, based on the sampling results.

Additive effect

Additive effect: An additive effect refers to the role of a variable in an estimated model. A variable that has an additive effect can simply be added to the other terms in a model to determine its effect on the dependent variable. Contrast…

Additive Error

Additive Error: Additive error is the error that is added to the true value and does not depend on the true value itself. In other words, the result of the measurement is considered as a sum of the true value and the additive…

Agglomerative Methods (of Cluster Analysis)

Agglomerative Methods (of Cluster Analysis): In agglomerative methods of hierarchical cluster analysis , the clusters obtained at the previous step are fused into larger clusters. Agglomerative methods start with N clusters comprising a single object, then on each step two clusters from the previous step…

Aggregate Mean

Aggregate Mean: In ANOVA and some other techniques used for analysis of several samples, the aggregate mean is the mean for all values in all samples combined, as opposed to the mean values of the individual samples. The term "aggregate mean" is also used as…

Alpha Level

Alpha Level: See Type I Error.

Alpha Spending Function

Alpha Spending Function: In the interim monitoring of clinical trials, multiple looks are taken at the accruing results. In such circumstances, akin to multiple testing, the alpha-value at each look must be adjusted in order to preserve the overall Type I error. Alpha spending functions (the…

Alternate-Form Reliability

Alternate-Form Reliability: The alternate-form reliability of a survey instrument, like a psychological test, helps to overcome the "practice effect", which is typical of the test-retest reliability . The idea is to change the wording of the survey questions in a functionally equivalent form, or simply…

Alternative Hypothesis

Alternative Hypothesis: In hypothesis testing, there are two competing hypotheses - the null hypothesis and the alternative hypothesis. The null hypothesis usually reflects the status quo (for example, the proposed new treatment is ineffective and the observed results are just due to chance variation). The…

Analysis of Commonality

Analysis of Commonality: Analysis of commonality is a method for causal modeling . In a simple case of two independent variables x1 and x2 , for example, analysis of commonality posits three sources of causation, described by three latent variables: u1 and u2 , which…

Analysis of Covariance (ANCOVA)

Analysis of Covariance (ANCOVA): Analysis of covariance is a more sophisticated method of analysis of variance. It is based on inclusion of supplementary variables (covariates) into the model. This lets you account for inter-group variation associated not with the "treatment" itself, but with covariate(s). Suppose…

Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA): A statistical technique for testing whether three or more samples might come from populations having the same mean; specifically, whether the differences among the samples might be caused by chance variation.
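As a rough illustration of the idea, the one-way ANOVA F statistic compares between-group to within-group variation. This Python sketch (the function name `one_way_f` is illustrative, not a standard API) computes it directly from the sums of squares:

```python
def one_way_f(groups):
    """F statistic for one-way ANOVA: between-group vs. within-group variance."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# identical groups: no between-group variation, so F = 0
print(one_way_f([[1, 2, 3], [1, 2, 3]]))  # 0.0
```

A large F suggests the group means differ by more than chance variation would explain.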

ANCOVA

ANCOVA: See Analysis of Covariance.

ANOVA

ANOVA: See Analysis of Variance.

ARIMA

ARIMA: ARIMA is an acronym for Autoregressive Integrated Moving Average model (also known as the Box-Jenkins model). It is a class of models of random processes in discrete time, or time series. The ARIMA model is widely used in time series analysis. ARIMA model…

Arithmetic Mean

Arithmetic Mean: The arithmetic mean is a synonym of the mean. The word "arithmetic" is used to distinguish this statistic from other statistics having "mean" in their names, like the geometric mean, the harmonic mean, the power mean, the quadratic mean…

Association Rules

Association Rules: Association rules are a method of data mining. The idea is to find statistical associations between items in a large set of items, e.g. items purchased in a supermarket by a customer in one visit. In contrast to deterministic (non-statistical)…

Asymptotic Efficiency

Asymptotic Efficiency: For an unbiased estimator, asymptotic efficiency is the limit of its efficiency as the sample size tends to infinity. An estimator with asymptotic efficiency 1.0 is said to be an "asymptotically efficient estimator". Roughly speaking, the precision of an asymptotically efficient estimator tends…

Asymptotic Property

Asymptotic Property: An asymptotic property is a property of an estimator that holds as the sample size approaches infinity.

Asymptotic Relative Efficiency (of estimators)

Asymptotic Relative Efficiency (of estimators): Unbiased estimators are usually compared in terms of their variances. The limit (as the sample size tends to infinity) of the ratio of the variance of the first estimator to the variance of the second estimator is called the asymptotic…

Asymptotically Unbiased Estimator

Asymptotically Unbiased Estimator: An asymptotically unbiased estimator is an estimator that becomes unbiased as the sample size tends to infinity. Some biased estimators are asymptotically unbiased, and all unbiased estimators are asymptotically unbiased.

Attribute

Attribute: In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another. It might be measured in continuous values (e.g. time spent on a web site), or in categorical…

Autocorrelation

Autocorrelation: See Serial correlation.

Autoregression

Autoregression: Autoregression refers to a special branch of regression analysis aimed at analysis of time series. It rests on autoregressive models - that is, models where the dependent variable is the current value and the independent variables are N previous values of the time series.…

Autoregression and Moving Average (ARMA) Models

Autoregression and Moving Average (ARMA) Models: The autoregression and moving average (ARMA) models are used in time series analysis to describe stationary time series. These models represent time series that are generated by passing white noise through a recursive and through a nonrecursive linear…

Autoregressive (AR) Models

Autoregressive (AR) Models: The autoregressive (AR) models are used in time series analysis to describe stationary time series. These models represent time series that are generated by passing white noise through a recursive linear filter. The output of such a filter…

Average Deviation

Average Deviation: The average deviation or the average absolute deviation is a measure of dispersion. It is the average of absolute deviations of the individual values from the median or from the mean.
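A minimal Python sketch of the calculation (the function name `avg_abs_deviation` is mine, not a standard API; the data are illustrative):

```python
from statistics import mean

def avg_abs_deviation(values, center=None):
    """Average absolute deviation around a center (the mean by default)."""
    if center is None:
        center = mean(values)
    return mean(abs(v - center) for v in values)

# mean of [2, 4, 6, 8] is 5; absolute deviations are 3, 1, 1, 3, so the average is 2
print(avg_abs_deviation([2, 4, 6, 8]))
```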

Average Group Linkage

Average Group Linkage: The average group linkage is a method of calculating distance between clusters in hierarchical cluster analysis. The linkage function specifying the distance between two clusters is computed as the distance between the average values (the mean vectors or centroids) of…

Average Linkage Clustering

Average Linkage Clustering: The average linkage clustering is a method of calculating distance between clusters in hierarchical cluster analysis. The linkage function specifying the distance between two clusters is computed as the average distance between objects from the first cluster and objects from the…

Azure ML

Azure is the Microsoft Cloud Computing Platform and Services. ML stands for Machine Learning, and is one of the services. Like other cloud computing services, you purchase it on a metered basis - as of 2015, there was a per-prediction charge, and a compute time…

Backward Elimination

Backward Elimination: Backward elimination is one of several computer-based iterative variable-selection procedures. It begins with a model containing all the independent variables of interest. Then, at each step, the variable with the smallest F-statistic is deleted (if the F is not higher than the chosen cutoff…

Bag-of-words

Bag-of-words: Bag-of-words is a simplified natural language processing concept. Text documents are parsed and output as collections of words (i.e. stripped of punctuation, etc.). In the bag-of-words concept, the resulting collection of words is considered for further analytics without regard to order, grammar, etc. (but…
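A minimal Python sketch of the idea (the tokenizing regex and the function name `bag_of_words` are my illustrative choices, not a standard): punctuation is stripped and only word counts survive, with order and grammar discarded.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip punctuation, and count word occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bow = bag_of_words("The cat sat. The cat ran!")
print(bow)  # counts: the=2, cat=2, sat=1, ran=1
```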

Bagging

Bagging: In predictive modeling, bagging is an ensemble method that uses bootstrap replicates of the original training data to fit predictive models. For each record, the predictions from all available models are then averaged for the final prediction. For a classification problem, a majority vote…
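The mechanics can be sketched in a few lines of Python. This is a toy illustration, not a production implementation: the "base learner" here simply predicts the mean response of its bootstrap sample, and all names (`bagged_predict`, `fit`, `predict_one`) are mine.

```python
import random
random.seed(0)

def bagged_predict(train, fit, predict_one, x, n_models=25):
    """Average the predictions of models fit on bootstrap replicates of `train`."""
    preds = []
    for _ in range(n_models):
        boot = [random.choice(train) for _ in train]  # sample with replacement
        preds.append(predict_one(fit(boot), x))
    return sum(preds) / len(preds)

# toy base learner: the "model" is just the mean response of its bootstrap sample
fit = lambda rows: sum(y for _, y in rows) / len(rows)
predict_one = lambda model, x: model

train = [(1, 2.0), (2, 4.0), (3, 6.0)]
print(bagged_predict(train, fit, predict_one, x=2))
```

For a classification problem, the final line would take a majority vote over `preds` instead of the average.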

Bandits

Bandits: Bandits refers to a class of algorithms in which users or subjects make repeated choices among, or decisions in reaction to, multiple alternatives. For example, a web retailer might have a set of N ways of presenting an offer. The task of the algorithm…

Bayes' Theorem

Bayes' Theorem: Bayes' theorem is a formula for revising a priori probabilities after receiving new information. The revised probabilities are called posterior probabilities. For example, consider the probability that you will develop a specific cancer in the next year. An estimate of this probability based…
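The revision can be sketched directly in Python. The numbers below (1% prior, 90% test sensitivity, 5% false-positive rate) and the function name `posterior` are illustrative assumptions, not figures from the source:

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# hypothetical numbers: 1% prior, 90% sensitivity, 5% false-positive rate
print(posterior(0.01, 0.90, 0.05))  # ≈ 0.154
```

Even with a fairly accurate test, the posterior stays modest because the prior is small.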

Bernoulli Distribution

Bernoulli Distribution: A random variable x has a Bernoulli distribution with parameter 0 < p < 1 if P(x=1) = p, P(x=0) = 1 - p, and P(x) = 0 for x outside {0, 1}, where P(A) is the probability of outcome A. The parameter…
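A minimal Python sketch of the probability mass function and of drawing a Bernoulli variate (the function names are mine, chosen for illustration):

```python
import random
random.seed(1)

def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli(p) variable; zero off {0, 1}."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0

def bernoulli_draw(p):
    """One draw from a Bernoulli(p) distribution."""
    return 1 if random.random() < p else 0

print(bernoulli_pmf(1, 0.3))  # 0.3
```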

Bernoulli Distribution (Graphical)

Bernoulli Distribution: A random variable x has a Bernoulli distribution with parameter 0 < p < 1 if P(x=1) = p and P(x=0) = 1 - p, where P(A) is the probability of outcome A. The parameter p is often called the "probability of success". For example, a single toss of a coin has…

Beta Distribution

Beta Distribution: Suppose x1, x2, ... , xn are n independent values of a random variable uniformly distributed within the interval [0,1]. If you sort the values in ascending order, then the k-th value will have a beta distribution with parameters a = k, b…

Beta Distribution (Graphical)

Beta Distribution: Suppose x1, x2, ... , xn are n independent values of a random variable uniformly distributed within the interval [0,1]. If you sort the values in ascending order, then the k-th value will have a beta distribution with parameters a = k, b = n - k + 1. The density…

Bias

Bias: A general statistical term meaning a systematic (not random) deviation of an estimate from the true value. A bias of a measurement or a sampling procedure may pose a more serious problem for a researcher than random errors because it cannot be reduced by…

Biased Estimator

Biased Estimator: An estimator is a biased estimator if its expected value is not equal to the value of the population parameter being estimated.

Bimodal

Bimodal: Bimodal literally means "two modes" and is typically used to describe distributions of values that have two centers. For example, the distribution of heights in a sample of adults might have two peaks, one for women and one for men.

Binomial Distribution

Binomial Distribution: Used to describe an experiment, event, or process for which the probability of success is the same for each trial and each trial has only two possible outcomes. If a coin is tossed n number of times, the probability of a certain number…
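The probability of exactly k successes in n trials follows directly from the definition. A minimal Python sketch (the function name `binomial_pmf` is mine):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# probability of exactly 5 heads in 10 fair coin tosses: C(10,5) / 2^10
print(binomial_pmf(5, 10, 0.5))  # ≈ 0.246
```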

Bivariate Normal Distribution

Bivariate Normal Distribution: Bivariate normal distribution describes the joint probability distribution of two variables, say X and Y, that both obey the normal distribution. The bivariate normal is completely specified by 5 parameters: m_x and m_y, the mean values of variables X and Y, respectively;…

Bonferroni Adjustment

Bonferroni Adjustment: Bonferroni adjustment is used in multiple comparison procedures to calculate an adjusted probability α of comparison-wise type I error from the desired probability α_FW of family-wise type I error. The calculation guarantees that the use of the adjusted α in pairwise comparisons keeps…
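The adjustment itself is a one-line division of the family-wise level by the number of comparisons. A Python sketch (the function name `bonferroni_alpha` is illustrative):

```python
def bonferroni_alpha(alpha_family, n_comparisons):
    """Per-comparison alpha that bounds the family-wise type I error rate."""
    return alpha_family / n_comparisons

# 6 pairwise comparisons at a desired family-wise level of 0.05
print(bonferroni_alpha(0.05, 6))  # ≈ 0.00833
```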

Bonferroni Adjustment (Graphical)

Bonferroni Adjustment: Bonferroni adjustment is used in multiple comparison procedures to calculate an adjusted probability of comparison-wise type I error from the desired probability of family-wise type I error. The calculation guarantees that the use of the adjusted probability in pairwise comparisons keeps the actual probability…

Boosting

Boosting: In predictive modeling, boosting is an iterative ensemble method that starts out by applying a classification algorithm and generating classifications. The classifications are then assessed, and a second round of model-fitting occurs in which the records classified incorrectly in the first round are given…

Bootstrapping

Bootstrapping: Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application of the bootstrap is to assess the accuracy of an estimate based on a sample…
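A minimal Python sketch of using the bootstrap to estimate the standard error of the sample mean (the function name `bootstrap_se` and the toy data are illustrative):

```python
import random
from statistics import mean
random.seed(0)

def bootstrap_se(data, stat=mean, n_boot=1000):
    """Standard error of `stat`, estimated from bootstrap resamples."""
    stats = []
    for _ in range(n_boot):
        resample = [random.choice(data) for _ in data]  # with replacement
        stats.append(stat(resample))
    m = mean(stats)
    return (sum((s - m) ** 2 for s in stats) / (n_boot - 1)) ** 0.5

data = [3, 5, 7, 9, 11, 13, 15, 17]
print(bootstrap_se(data))  # close to the formula value s / sqrt(n)
```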

Box Plot

Box Plot: A box plot is a graph that characterizes the pattern of variation of the data. The plot simultaneously displays several measures of central tendency and dispersion of the data at hand. The box plot provides the following information: (1) the position of the…

Box's M

Box's M: Box's M is a statistic which tests the homoscedasticity assumption in MANOVA, that is, the assumption that all covariances are the same for any category. The results should be interpreted with caution because Box's M is not robust - it is very…

Calibration Sample

Calibration Sample: The calibration sample is the subset of the data available to a data mining routine used as the training set.

Canonical Correlation Analysis

Canonical Correlation Analysis: The purpose of canonical correlation analysis is to explain or summarize the relationship between two sets of variables by finding a linear combination of each set of variables that yields the highest possible correlation between the composite variable for set A and…

Canonical root

Canonical root: See discriminant function.

Canonical variates analysis

Canonical variates analysis: Several techniques that seek to illuminate the ways in which sets of variables are related to one another. The term refers to regression analysis, MANOVA, discriminant analysis, and, most often, to canonical correlation analysis.

Categorical Data

Categorical Data: Categorical data reflect the classification of objects into different categories. For example, people who receive a mail order offer might be classified as "no response," "purchase and pay," "purchase but return the product," and "purchase and neither pay nor return."

Categorical Data Analysis

Categorical Data Analysis: Categorical data analysis is a branch of statistics dealing with categorical data. This sort of analysis is of great practical importance because a wide variety of data are of a categorical nature. The most common type of data analyzed in categorical…

Causal analysis

Causal analysis: See causal modeling.

Causal modeling

Causal modeling: Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model:   y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable,…

Census Survey

Census Survey: In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is taken. The main advantage of the census survey (as compared to the sample…

Central Limit Theorem

Central Limit Theorem: The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the probability distribution of the population from which the sample is drawn. If the population distribution is fairly Normally-distributed, this approach…
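The theorem is easy to check by simulation. In this Python sketch (the sample sizes are arbitrary choices), the population is uniform on [0, 1], which is far from bell-shaped, yet the sample means cluster near 0.5 with a spread close to the theoretical value sigma / sqrt(n) = 0.2887 / sqrt(30) ≈ 0.0527:

```python
import random
from statistics import mean, stdev
random.seed(42)

# population: uniform on [0, 1] (sd ≈ 0.2887), far from a bell shape
n, n_samples = 30, 2000
sample_means = [mean(random.random() for _ in range(n))
                for _ in range(n_samples)]

# the distribution of the means is close to Normal(0.5, 0.2887 / sqrt(30))
print(mean(sample_means), stdev(sample_means))
```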

Central Location

Central Location: Central location is a synonym of central tendency.

Central Tendency (Measures)

Central Tendency (Measures): Any measure of central tendency provides a typical value of a set of values. It is usually a value around which the values are grouped. The most widely used measures of central tendency are the (arithmetic) mean, median, trimmed mean,…

Centroid

Centroid: The centroid of several continuous variables is the vector of means of those variables. The concept of centroid plays the same role, for example, in multivariate analysis of variance (MANOVA) as the mean plays in analysis of variance (ANOVA).

CHAID

CHAID: CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a learning sample comprising already-classified objects. An essential feature is the use of the chi-square test for contingency tables to decide which variables are of…

Chebyshev's Theorem

Chebyshev's Theorem: For any positive constant k, the probability that a random variable will take on a value within k standard deviations of the mean is at least 1 - 1/k².
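Because the bound holds for any distribution, it can be checked empirically against a deliberately skewed population. A Python sketch (the exponential population and k = 2 are arbitrary illustrative choices):

```python
import random
from statistics import mean, stdev
random.seed(7)

# empirical check: P(|X - mu| < k*sigma) >= 1 - 1/k^2 for any distribution
data = [random.expovariate(1.0) for _ in range(10_000)]  # skewed population
mu, sigma, k = mean(data), stdev(data), 2.0

within = sum(abs(x - mu) < k * sigma for x in data) / len(data)
print(within, ">=", 1 - 1 / k**2)  # observed share vs. the Chebyshev bound 0.75
```

For well-behaved distributions the observed share is typically much higher than the bound; Chebyshev's theorem is a worst-case guarantee.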

Chernoff Faces

Chernoff Faces: Chernoff faces are a category of icon plots. Each unit is represented as a schematic face. Variables of interest are represented by particular parameters of the face, e.g. the nose size, eye-to-eye distance, etc.

Chi-Square Distribution

Chi-Square Distribution: The square of a random variable having the standard normal distribution is distributed as chi-square with 1 degree of freedom. The sum of squares of n independently distributed standard normal variables has a chi-square distribution with n degrees of freedom. The distribution is typically…

Chi-Square Statistic

Chi-Square Statistic: The chi-square statistic (or χ²-statistic) measures agreement between the observed and hypothetical frequencies. This statistic is computed from two entities: hypothetical probabilities of the values of a discrete random variable, and the observed frequencies of these values - the numbers of observations…

Chi-Square Test

Chi-Square Test: The chi-square test (or χ²-test) is a statistical test for testing the null hypothesis that the distribution of a discrete random variable coincides with a given distribution. It is one of the most popular goodness-of-fit tests. For example, in a supermarket, relative frequencies…
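The test statistic itself is simple to compute by hand. A Python sketch (the function name `chi_square_stat` and the brand-share example are illustrative; in practice the statistic would then be compared to a chi-square critical value):

```python
def chi_square_stat(observed, expected_probs):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over categories."""
    n = sum(observed)
    stat = 0.0
    for obs, p in zip(observed, expected_probs):
        exp = n * p
        stat += (obs - exp) ** 2 / exp
    return stat

# observed purchases across 4 brands vs. a hypothesized 25% share for each
print(chi_square_stat([30, 20, 25, 25], [0.25] * 4))  # 2.0
```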

Circular Icon Plots

Circular Icon Plots: Circular icon plots are a category of icon plots. Each variable is represented by a ray or direction; all rays start in the center. The value of each variable is reflected as the distance from the center. The most common categories…

Classification and Regression Trees (CART)

Classification and Regression Trees (CART): Classification and regression trees (CART) are a set of techniques for classification and prediction. The technique is aimed at producing rules that predict the value of an outcome (target) variable from known values of predictor (explanatory) variables. The predictor variables…

Classification Trees

Classification Trees: Classification trees are one of the CART techniques. The main distinction from regression trees (another CART technique) is that the dependent variable is categorical. One of the oldest methods for classification trees is CHAID.

Cluster Analysis

Cluster Analysis: In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups, or, more precisely, groups whose members are all close to one another on various dimensions being measured. In cluster analysis, one does not start with any a priori…

Clustered Sampling

Clustered Sampling: Clustered sampling is a sampling technique based on dividing the whole population into groups ("clusters"), then using random sampling to select elements from the groups. For example, if the target population is the whole population of a city, a researcher might select 100…

Cochran's Q Statistic

Cochran's Q Statistic: Cochran's Q statistic is computed from replicated measurements data with binary responses. This statistic tests a difference in effects among 2 or more treatments applied to the same set of experimental units. Consider the results of a study of M treatments applied…

Cochran-Mantel-Haenszel (CMH) test

Cochran-Mantel-Haenszel (CMH) test: The Cochran-Mantel-Haenszel (CMH) test compares two groups on a binary response, adjusting for control variables. The initial data are represented as a series of K 2x2 contingency tables, where K is the number of strata. Traditionally, in each table the rows…

Coefficient of Determination

Coefficient of Determination: In regression analysis, the coefficient of determination is a measure of goodness-of-fit (i.e. how well or tightly the data fit the estimated model). The coefficient is defined as the ratio of two sums of squares: r² = SSR / SST, where…
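For simple linear regression fit by least squares, the ratio SSR / SST can be computed directly. A Python sketch (the function name `simple_regression_r2` is illustrative):

```python
def simple_regression_r2(x, y):
    """Fit y = a + b*x by least squares and return r^2 = SSR / SST."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    ssr = sum((f - ybar) ** 2 for f in fitted)   # regression sum of squares
    sst = sum((yi - ybar) ** 2 for yi in y)      # total sum of squares
    return ssr / sst

# a perfect linear relationship gives r^2 = 1
print(simple_regression_r2([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```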

Coefficient of variation

Coefficient of variation: The coefficient of variation is the standard deviation of a data set, divided by the mean of the same data set.
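A one-line Python sketch of the definition (the function name is mine):

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Standard deviation expressed relative to the mean."""
    return stdev(values) / mean(values)

# sd of [10, 20, 30] is 10 and the mean is 20, so CV = 0.5
print(coefficient_of_variation([10, 20, 30]))  # 0.5
```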

Cohen's Kappa

Cohen's Kappa: Cohen's kappa is a measure of agreement for categorical data. It is a special case of the Kappa statistic corresponding to the case of only 2 raters. Historically, this statistic was invented first. Later it was generalized to the case of an…

Cohort data

Cohort data: Cohort data records multiple observations over time for a set of individuals or units tied together by some event (say, born in the same year). See also longitudinal data and panel data.

Cohort study

Cohort study: A cohort study is a longitudinal study that identifies a group of subjects sharing some attributes (a "cohort") then takes measurements on the subjects at various points in time and records data for the group. A cohort study is often used to compare…

Cointegration

Cointegration: Cointegration is a statistical tool for describing the co-movement of data measured over time. The concept of cointegration is widely used in applied time series analysis, especially in econometrics. Two (or a greater number) of nonstationary time series are said to be cointegrated if…

Collaborative filtering

Collaborative filtering: Collaborative filtering algorithms are used to predict whether a given individual might like, or purchase, an item. One popular approach is to find a set of individuals (e.g. customers) whose item preferences (ratings) are similar to those of the given individual over a…

Collinearity

Collinearity: In regression analysis, collinearity of two variables means that strong correlation exists between them, making it difficult or impossible to estimate their individual regression coefficients reliably. The extreme case of collinearity, where the variables are perfectly correlated, is called singularity. See also:…

Column icon plots

Column icon plots: See sequential icon plots.

Comparison-wise Type I Error

Comparison-wise Type I Error: In multiple comparison procedures, the comparison-wise type I error is the probability that, even if the samples come from the same population, you will wrongly conclude that they differ. See also Family-wise type I error.

Complete Block Design

Complete Block Design: In complete block design, every treatment is allocated to every block. In other words, every combination of treatments and conditions (blocks) is tested. For example, an agricultural experiment is aimed at finding the effect of 3 fertilizers (A,B,C) for 5 types of…

Complete Linkage Clustering

Complete Linkage Clustering: The complete linkage clustering (or the farthest neighbor method) is a method of calculating distance between clusters in hierarchical cluster analysis. The linkage function specifying the distance between two clusters is computed as the maximal object-to-object distance, where objects belong…

Complete Statistic

Complete Statistic: A sufficient statistic T is called a complete statistic if no function of it has zero expected value for all distributions concerned unless this function itself is zero for all possible distributions concerned (except possibly a set of measure zero). The property of…

Composite Hypothesis

Composite Hypothesis: A statistical hypothesis which does not completely specify the distribution of a random variable is referred to as a composite hypothesis.

Concurrent Validity

Concurrent Validity: The concurrent validity of survey instruments, like the tests used in psychometrics , is a measure of agreement between the results obtained by the given survey instrument and the results obtained for the same population by another instrument acknowledged as the "gold standard".…

Conditional Probability

Conditional Probability: When probabilities are quoted without specification of the sample space, it could result in ambiguity when the sample space is not self-evident. To avoid this, the sample space can be explicitly made known. The probability of an event A given sample space S,…

Confidence Interval

Confidence Interval: A confidence interval is an interval around a sample estimate that quantifies the uncertainty in that estimate. Since there are a variety of samples that might be drawn from a population, there are likewise a variety of confidence intervals that might be imagined…
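A common textbook construction is the normal-approximation interval for a mean. This Python sketch (the function name `normal_ci`, the z value 1.96 for roughly 95% coverage, and the toy data are illustrative assumptions) shows the mechanics:

```python
from statistics import mean, stdev

def normal_ci(data, z=1.96):
    """Approximate 95% CI for the mean, using the normal approximation."""
    m = mean(data)
    se = stdev(data) / len(data) ** 0.5  # standard error of the mean
    return m - z * se, m + z * se

lo, hi = normal_ci([12, 15, 11, 14, 13, 16, 12, 15])
print(lo, hi)  # interval centered on the sample mean 13.5
```

For small samples, a t critical value would normally replace the z value.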

Consistent Estimator

Consistent Estimator: An estimator is a measure or metric intended to be calculated from a sample drawn from a larger population. A consistent estimator is an estimator with the property that, as the sample size grows, the probability that the estimated value differs from the true value of the population parameter by more than any fixed amount tends to zero.