Serial Correlation

Serial Correlation: In analysis of time series, the Nth order serial correlation is the correlation between the current value and the Nth previous value of the same time series. For this reason serial correlation is often called "autocorrelation".
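
The lag-k (k-th order) serial correlation can be computed with a short Python sketch (an illustration with names of our own choosing, not part of the glossary source):

```python
def serial_correlation(x, k):
    """Lag-k serial correlation: correlation of x(t) with x(t - k)."""
    n = len(x)
    mean = sum(x) / n
    # total variation of the series around its mean
    var = sum((v - mean) ** 2 for v in x)
    # co-variation between values k steps apart
    cov = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return cov / var

# A strictly alternating series has strong negative lag-1 autocorrelation:
r1 = serial_correlation([1, -1, 1, -1, 1, -1], 1)
```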

Stationary time series

Stationary time series: A time series x(t), t=1,..., is said to be stationary if its statistical properties do not depend on time t. A time series may be stationary with respect to one characteristic, e.g. the mean, but not stationary with respect to another,…

Random Walk

Random Walk: A random walk is a process of random steps, motions, or transitions. It might be in one dimension (movement along a line), in two dimensions (movements in a plane), or in three dimensions or more. There are many different types of…
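
A one-dimensional random walk is easy to simulate; the Python sketch below (function name is our own) takes unit steps up or down with equal probability:

```python
import random

def random_walk_1d(n_steps, seed=0):
    """Simulate a 1-D random walk of unit steps, starting at 0."""
    rng = random.Random(seed)  # fixed seed for a reproducible path
    position = 0
    path = [position]
    for _ in range(n_steps):
        position += rng.choice([-1, 1])  # each step: +1 or -1, equally likely
        path.append(position)
    return path

walk = random_walk_1d(100)
```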

Vector time series

Vector time series: Vector time series are a natural generalization of ordinary (scalar) time series. Vector time series are measurements of a vector variable taken at regular intervals over time. They are represented as sequences of vector values like V(1), V(2), ... The simplest…

Convolution of Distribution Functions (Graphical)

Convolution of Distribution Functions: If F1(·) and F2(·) are distribution functions, then the function F(x) = ∫ F1(x - y) dF2(y) is called the convolution of distribution functions F1 and F2. This is often denoted as F = F1 ∗ F2. The convolution provides the distribution function of the sum of two independent random variables…

Concurrent Validity

Concurrent Validity: The concurrent validity of survey instruments, like the tests used in psychometrics, is a measure of agreement between the results obtained by the given survey instrument and the results obtained for the same population by another instrument acknowledged as the "gold standard".…

Content Validity

Content Validity: The content validity of survey instruments, like psychological tests, is assessed by review of the items by trained individuals and/or by individuals from the target population. The individuals make their judgments about the relevance of the items and about the unambiguity of…

Continuous Distribution

Continuous Distribution: A continuous distribution describes probabilistic properties of a random variable which takes on a continuous (not countable) set of values - a continuous random variable. In contrast to discrete distributions, continuous distributions do not ascribe values of probability to possible values…

Construct Validity

Construct Validity: In psychometrics, the construct validity of a survey instrument or psychometric test measures how well the instrument performs in practice from the standpoint of the specialists who use it. In psychology, a construct is a phenomenon or a variable in a model…

Central Location

Central Location: Central location is a synonym of central tendency.

Complete Linkage Clustering

Complete Linkage Clustering: The complete linkage clustering (or the farthest neighbor method) is a method of calculating distance between clusters in hierarchical cluster analysis. The linkage function specifying the distance between two clusters is computed as the maximal object-to-object distance, where objects belong…

Classification Trees

Classification Trees: Classification trees are one of the CART techniques. The main distinction from regression trees (another CART technique) is that the dependent variable is categorical. One of the oldest methods for classification trees is CHAID.

Chi-Square Test

Chi-Square Test: The chi-square test (or χ²-test) is a statistical test for testing the null hypothesis that the distribution of a discrete random variable coincides with a given distribution. It is one of the most popular goodness-of-fit tests. For example, in a supermarket, relative frequencies…

CHAID

CHAID: CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a learning sample comprising already-classified objects. An essential feature is the use of the chi-square test for contingency tables to decide which variables are of…

Chi-Square Statistic

Chi-Square Statistic: The chi-square statistic (or χ²-statistic) measures agreement between the observed and hypothetical frequencies. This statistic is computed from two entities: hypothetical probabilities of the values of a discrete random variable, and the observed frequencies of these values - the numbers of observations…
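
The statistic is the sum over possible values of (observed - expected)² / expected; a minimal Python sketch (names are our own):

```python
def chi_square_statistic(observed, probabilities):
    """Chi-square statistic: sum of (O - E)^2 / E over all possible values."""
    n = sum(observed)                        # total number of observations
    expected = [p * n for p in probabilities]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 60 heads and 40 tails in 100 tosses, tested against a fair coin:
stat = chi_square_statistic([60, 40], [0.5, 0.5])
```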

Central Tendency (Measures)

Central Tendency (Measures): Any measure of central tendency provides a typical value of a set of values. Normally, it is a value around which values are grouped. The most widely used measures of central tendency are the (arithmetic) mean, median, trimmed mean,…

Census Survey

Census Survey: In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is taken. The main advantage of the census survey (as compared to the sample…

Complete Block Design

Complete Block Design: In complete block design, every treatment is allocated to every block. In other words, every combination of treatments and conditions (blocks) is tested. For example, an agricultural experiment is aimed at finding the effect of 3 fertilizers (A,B,C) for 5 types of…

Acceptance Region

Acceptance Region: In hypothesis testing, the test procedure partitions all the possible sample outcomes into two subsets (on the basis of whether the observed value of the test statistic is smaller than a threshold value or not). The subset that is considered to be consistent…

Acceptance Sampling

Acceptance Sampling: Acceptance sampling is the use of sampling methods to determine whether a shipment of products or components is of sufficient quality to be accepted.

Acceptance Sampling Plans

Acceptance Sampling Plans: For a shipment or production lot, an acceptance sampling plan defines a sampling procedure and gives decision rules for accepting or rejecting the shipment or lot, based on the sampling results.

Additive Error

Additive Error: Additive error is the error that is added to the true value and does not depend on the true value itself. In other words, the result of the measurement is considered as a sum of the true value and the additive…

Additive Effect

Additive Effect: An additive effect refers to the role of a variable in an estimated model. A variable that has an additive effect can merely be added to the other terms in a model to determine its effect on the dependent variable. Contrast…

Agglomerative Methods (of Cluster Analysis)

Agglomerative Methods (of Cluster Analysis): In agglomerative methods of hierarchical cluster analysis, the clusters obtained at the previous step are fused into larger clusters. Agglomerative methods start with N clusters, each comprising a single object; then at each step two clusters from the previous step…

Aggregate Mean

Aggregate Mean: In ANOVA and some other techniques used for analysis of several samples, the aggregate mean is the mean for all values in all samples combined, as opposed to the mean values of the individual samples. The term "aggregate mean" is also used as…

Alternate-Form Reliability

Alternate-Form Reliability: The alternate-form reliability of a survey instrument, like a psychological test, helps to overcome the "practice effect", which is typical of the test-retest reliability. The idea is to change the wording of the survey questions in a functionally equivalent form, or simply…

Arithmetic Mean

Arithmetic Mean: The arithmetic mean is a synonym of the mean. The word "arithmetic" is used to discern this statistic from other statistics having "mean" in their names, like the geometric mean, the harmonic mean, the power mean, the quadratic mean…

ARIMA

ARIMA: ARIMA is an acronym for Autoregressive Integrated Moving Average Model (also known as the Box-Jenkins model). It is a class of models of random processes in discrete time, or time series. The ARIMA model is widely used in time series analysis. The ARIMA model…

Autoregression and Moving Average (ARMA) Models

Autoregression and Moving Average (ARMA) Models: The autoregression and moving average (ARMA) models are used in time series analysis to describe stationary time series. These models represent time series that are generated by passing white noise through a recursive and through a nonrecursive linear…

Autoregressive (AR) Models

Autoregressive (AR) Models: The autoregressive (AR) models are used in time series analysis to describe stationary time series. These models represent time series that are generated by passing white noise through a recursive linear filter. The output of such a filter…

Association Rules

Association Rules: Association rules are a method of data mining. The idea is to find a statistical association between some items in a large set of items, e.g. items purchased in a supermarket by a customer in one visit. In contrast to deterministic (non-statistical)…

Average Group Linkage

Average Group Linkage: The average group linkage is a method of calculating distance between clusters in hierarchical cluster analysis. The linkage function specifying the distance between two clusters is computed as the distance between the average values (the mean vectors or centroids) of…

Average Linkage Clustering

Average Linkage Clustering: The average linkage clustering is a method of calculating distance between clusters in hierarchical cluster analysis. The linkage function specifying the distance between two clusters is computed as the average distance between objects from the first cluster and objects from the…

Bernoulli Distribution (Graphical)

Bernoulli Distribution: A random variable x has a Bernoulli distribution with parameter 0 < p < 1 if P(x = 1) = p and P(x = 0) = 1 - p, where P(A) is the probability of outcome A. The parameter p is often called the "probability of success". For example, a single toss of a coin has…
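
Sampling from a Bernoulli distribution amounts to comparing a uniform random draw with p; a minimal Python sketch (names are our own):

```python
import random

def bernoulli_sample(p, n, seed=0):
    """Draw n values from Bernoulli(p): 1 with probability p, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

draws = bernoulli_sample(0.5, 10_000)
observed_rate = sum(draws) / len(draws)  # should be close to p
```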

Beta Distribution (Graphical)

Beta Distribution: Suppose x1, x2, ... , xn are n independent values of a random variable uniformly distributed within the interval [0,1]. If you sort the values in ascending order, then the k-th value will have a beta distribution with parameters k and n - k + 1. The density…

Bias

Bias: A general statistical term meaning a systematic (not random) deviation of an estimate from the true value. A bias of a measurement or a sampling procedure may pose a more serious problem for a researcher than random errors because it cannot be reduced by…

Bonferroni Adjustment

Bonferroni Adjustment: Bonferroni adjustment is used in multiple comparison procedures to calculate an adjusted probability α of comparison-wise type I error from the desired probability of family-wise type I error. The calculation guarantees that the use of the adjusted α in pairwise comparisons keeps the actual probability…
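
In its simplest form the adjustment divides the desired family-wise level by the number of comparisons; a minimal sketch (names are our own):

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance level that caps the family-wise level."""
    return family_alpha / n_comparisons

# To keep family-wise type I error at 0.05 across 10 pairwise comparisons:
adjusted = bonferroni_alpha(0.05, 10)
```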

Calibration Sample

Calibration Sample: The calibration sample is the subset of the data available to a data mining routine used as the training set.

Classification and Regression Trees (CART)

Classification and Regression Trees (CART): Classification and regression trees (CART) are a set of techniques for classification and prediction. The technique is aimed at producing rules that predict the value of an outcome (target) variable from known values of predictor (explanatory) variables. The predictor variables…

White Hat Bias

White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals. The term was coined by Cope and Allison in 2009, and is…

Natural Language

Natural Language: A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term "natural" simply signifies that the reference is not to a programming language (C++, Java, etc.). The context is usually…

Tokenization

Tokenization: In processing unstructured text, tokenization is the step by which the character string in a text segment is turned into units - tokens - for further analysis. Ideally, those tokens would be words, but numbers and other characters can also count as tokens. A…

Z score (Graphical)

Z score: An observation's z-score tells you the number of standard deviations it lies away from the population mean (and in which direction). The calculation is as follows: z = (x - μ)/σ, where x is the observation itself, μ is the mean of the distribution, and σ is the standard deviation of…
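
The computation is a one-liner; a Python sketch (names and example data are our own):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean mu (signed)."""
    return (x - mu) / sigma

# An IQ of 130 on a scale with mean 100 and standard deviation 15:
z = z_score(130, 100, 15)
```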

Weighted Mean (Calculation)

Weighted Mean (Calculation): To simplify calculation of the weighted mean, weights are often standardized to make their sum equal to the unit value, i.e. by dividing every weight by the total sum of all weights: wi' = wi / (w1 + ... + wN). Then, the weighted mean is computed using…

Weighted Mean

Weighted Mean: The weighted mean is a measure of central tendency. The weighted mean of a set of values x1, ..., xN is computed according to the following formula: (w1·x1 + ... + wN·xN) / (w1 + ... + wN), where w1, ..., wN are non-negative coefficients, called "weights", that are ascribed to the corresponding values. Only the…
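
A minimal Python sketch of the formula (names are our own):

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the (non-negative) weights."""
    total_weight = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Course grade: exam score 80 with weight 3, homework score 95 with weight 1
grade = weighted_mean([80, 95], [3, 1])
```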

White Noise

White Noise: White noise is a stationary time series or a stationary random process with zero autocorrelation. In other words, in white noise any pair of values x(t1) and x(t2) taken at different moments t1 and t2 of time are not correlated - i.e. the correlation coefficient is…

Ward's Linkage

Ward's Linkage: Ward's linkage is a method for hierarchical cluster analysis. The idea has much in common with analysis of variance (ANOVA). The linkage function specifying the distance between two clusters is computed as the increase in the "error sum of squares" (ESS) after…

Variate

Variate: The term "variate" is often used as a synonym for "variable". Some definitions require that variate values be numeric. Sometimes "variate" is used as a synonym for "a value of the given variable for a particular element of the sample" - e.g. sex is a…

Variable-Selection Procedures (Graphical)

Variable-Selection Procedures: In regression analysis, variable-selection procedures are aimed at selecting a reduced set of the independent variables - the ones providing the best fit to the model. The criterion for selecting is usually the following F-statistic: F = [(SSE1 - SSE2)/(p2 - p1)] / [SSE2/(n - p2 - 1)], where n is the total number of data…

Validity

Validity: Validity characterises the extent to which a measurement procedure is capable of measuring what it is supposed to measure. Normally, the term "validity" is used in situations where measurement is indirect and cannot be precise in principle, e.g. in psychological IQ tests purporting…

Validation Sample

Validation Sample: The validation sample is the subset of the data available to a data mining routine used as the validation set.

Validation Set

Validation Set: A validation set is a portion of a data set used in data mining to assess the performance of prediction or classification models that have been fit on a separate portion of the same data set (the training set). Typically both the…

Uniform Distribution

Uniform Distribution: The uniform distribution describes probabilistic properties of a continuous random variable that is equally likely to take any value within an interval [a, b], and never takes on values outside this interval. The uniform distribution is characterised by two parameters - the lower and…

t-statistic (Graphical)

t-statistic: A t-statistic is a statistic whose sampling distribution is a t-distribution. Often, the term "t-statistic" is used in a narrower sense - as the standardized difference between a sample mean x̄ and a population mean μ, where N is the sample size: t = (x̄ - μ) / (s/√N), where x̄ and s are the…
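
The narrow-sense (one-sample) t-statistic can be sketched in Python (names and example data are our own):

```python
import math

def t_statistic(sample, mu0):
    """Standardized difference between the sample mean and hypothesized mean mu0."""
    n = len(sample)
    mean = sum(sample) / n
    # unbiased sample variance (divisor n - 1)
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu0) / math.sqrt(s2 / n)

t = t_statistic([5.1, 4.9, 5.3, 5.2, 5.0], 5.0)
```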

Differencing (of Time Series)

Differencing (of Time Series): Differencing of a time series x(t) in discrete time is the transformation of the series to a new time series y(t) whose values are the differences between consecutive values of x(t): y(t) = x(t) - x(t-1). This procedure may be applied consecutively more than once, giving rise…
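
First differencing, applied repeatedly, takes a few lines of Python (names are our own):

```python
def difference(series, order=1):
    """Apply first differencing y(t) = x(t) - x(t-1), `order` times."""
    for _ in range(order):
        series = [series[t] - series[t - 1] for t in range(1, len(series))]
    return series

# A linear trend becomes constant after one differencing:
d1 = difference([2, 4, 6, 8, 10])
```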

Test-Retest Reliability

Test-Retest Reliability: The test-retest reliability of a survey instrument, like a psychological test, is estimated by performing the same survey with the same respondents at different moments of time. The closer the results, the greater the test-retest reliability of the survey instrument. The correlation coefficient…

Negative Binomial

Negative Binomial: The negative binomial distribution is the probability distribution of the number of Bernoulli (yes/no) trials required to obtain r successes. Contrast it with the binomial distribution - the probability of x successes in n trials. Also with the Poisson distribution - the probability…

Trimmed Mean

Trimmed Mean: The trimmed mean is a family of measures of central tendency. The p%-trimmed mean of a set of values is computed by sorting all the values, discarding p% of the smallest and p% of the largest values, and computing the mean of the…
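
A minimal Python sketch (names are our own; the trim count is rounded down by simple truncation here):

```python
def trimmed_mean(values, percent):
    """Mean after discarding `percent`% of the smallest and of the largest values."""
    values = sorted(values)
    k = int(len(values) * percent / 100)   # count trimmed from each end
    kept = values[k:len(values) - k]
    return sum(kept) / len(kept)

# The outlier 100 is discarded by the 10%-trimmed mean:
m = trimmed_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], 10)
```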

Triangular Filter

Triangular Filter: The triangular filter is a linear filter that is usually used as a smoother. The output of the triangular filter at the moment t is the weighted mean of the input values at the adjacent moments of discrete time. In…

Training Set

Training Set: A training set is a portion of a data set used to fit (train) a model for prediction or classification of values that are known in the training set, but unknown in other (future) data. The training set is used in conjunction with…

Systematic Error

Systematic Error: Systematic error is the error that is constant in a series of repetitions of the same experiment or observation. Usually, systematic error is defined as the expected value of the overall error. An example of systematic error is an electronic scale…

t-distribution (Graphical)

t-distribution: A continuous distribution with a single-peaked probability density, symmetrical around the null value and with a bell-curve shape. The t-distribution is specified completely by one parameter - the number of degrees of freedom. If X and Y are independent random variables, X has the standard normal…

Test Set

Test Set: A test set is a portion of a data set used in data mining to assess the likely future performance of a single prediction or classification model that has been selected from among competing models, based on its performance with the validation set.…

Survey

Survey: Statistical surveys are general methods to gather quantitative information about a particular population. "Population" here does not necessarily mean a set of human beings, but may consist of other types of units - firms, households, universities, hospitals, etc. While there are types and forms…

Stochastic Process

Stochastic Process: Stochastic process is a synonym for random process.

Sufficient Statistic (Graphical)

Sufficient Statistic: Suppose X is a random vector with probability distribution (or density) P(X | V), where V is a vector of parameters, and Xo is a realization of X. A statistic T(X) is called a sufficient statistic if the conditional probability (density) does not…

Split-Halves Method

Split-Halves Method: In psychometric surveys, the split-halves method is used to measure the internal consistency reliability of survey instruments, e.g. psychological tests. The idea is to split the items (questions) related to the same construct to be measured, e.g. the anxiety level, and…

Standard error

Standard error: The standard error measures the variability of an estimator (or sample statistic) from sample to sample. There are two approaches to estimating standard error: 1. The bootstrap. With the bootstrap, you take repeated simulated samples (usually resamples from the observed data, of the…

Spline

Spline: A spline is a continuous function which coincides with a polynomial on every subinterval of the whole interval on which it is defined. In other words, splines are functions which are piecewise polynomial. The coefficients of the polynomial differ from interval to interval, but the…

Spectral Analysis

Spectral Analysis: Spectral analysis is concerned with estimation of the spectrum of a stationary random process or a stationary time series from the observed realization(s) of the process (or series). Methods and concepts of spectral analysis play an important role in time series analysis and…

Spectrum

Spectrum: See Fourier spectrum and power spectrum.

Spatial Field

Spatial Field: A spatial field is a function of spatial variables (x, y), or (x, y, z) in 3D cases. A spatial field is called a "scalar field" if the function takes on scalar values. For example, the concentration of a toxic substance in the soil at points with…

Smoothing

Smoothing: Smoothing is a class of time series processing which is intended to reduce noise and to preserve the signal itself. The origin of this term is related to the visual appearance of the time series - it looks smoother after this sort of processing…

Sampling Frame

Sampling Frame: Sampling frame (synonyms: "sample frame", "survey frame") is the actual set of units from which a sample has been drawn: in the case of a simple random sample, all units from the sampling frame have an equal chance to be drawn and to…

Smoother (Smoothing Filter)

Smoother (Smoothing Filter): Smoothers, or smoothing filters, are algorithms for time-series processing that reduce abrupt changes in the time-series and make it look smoother. Smoothers constitute a broad subclass of filters. Like all filters, smoothers may be subdivided into linear and nonlinear. Linear filters reduce…

Smoother (Example)

Smoother (Example): A simple example of a smoother is the moving average procedure. It is based on averaging elements closest in time to the current time. Mathematically this can be expressed by the following simple formula: y(t) = (x(t-k) + ... + x(t+k)) / (2k+1), where x(t) is the input of the smoother, the original…
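
The moving average can be sketched in Python (names are our own; here the window simply shrinks near the ends of the series, one common edge-handling choice):

```python
def moving_average(x, half_width):
    """Average each value with its `half_width` neighbors on each side."""
    n = len(x)
    out = []
    for t in range(n):
        lo = max(0, t - half_width)        # clip the window at the edges
        hi = min(n, t + half_width + 1)
        window = x[lo:hi]
        out.append(sum(window) / len(window))
    return out

smoothed = moving_average([1, 9, 2, 8, 3, 7], 1)
```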

Social Network Analytics

Social Network Analytics: Network analytics applied to connections among humans. Recently it has also come to encompass the analysis of web sites and internet services like Facebook.

Single Linkage Clustering

Single Linkage Clustering: The single linkage clustering method (or the nearest neighbor method) is a method of calculating distance between clusters in hierarchical cluster analysis. The linkage function specifying the distance between two clusters is computed as the minimal object-to-object distance, where objects…

Simple Linear Regression (Graphical)

Simple Linear Regression: The simple linear regression is aimed at finding the "best-fit" values of two parameters - A and B - in the following regression equation: Yi = A + B·Xi + Ei, where Yi, Xi, and Ei are the values of the dependent variable, of the independent variable, and of the…
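
The least-squares estimates have a simple closed form (slope B = Sxy/Sxx, intercept A = mean(Y) - B·mean(X)); a Python sketch (names and example data are our own):

```python
def simple_linear_regression(x, y):
    """Least-squares estimates of intercept A and slope B in Y = A + B*X + E."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    return a, b

# Data generated exactly by Y = 1 + 2*X:
a, b = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
```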

Signal

Signal: The signal is the component of the observed data (e.g. of a time series) that carries useful information. The complementary (opposite) concept is noise. In a narrower sense (e.g. in signal processing) signals are functions of time, as opposed to fields…

Signal Processing

Signal Processing: Signal processing is a branch of applied statistics concerned with analysis of functions of time that take on scalar or vector values. The functions are normally mixtures of signal and noise. A broad range of topics are considered in signal…

Shift Invariance (of Measures)

Shift Invariance (of Measures): Shift invariance is a property of descriptive statistics. If a statistic T is shift-invariant, it possesses the following property for any data set x1, ..., xN and any constant c: T(x1 + c, ..., xN + c) = T(x1, ..., xN) + c. In other words, if a statistic is shift-invariant, then addition of an arbitrary…

Seemingly Unrelated Regressions (SUR)

Seemingly Unrelated Regressions (SUR): Seemingly unrelated regressions (SUR) is a class of multivariate regression (multiple regression) models, normally belonging to the sub-class of linear regression models. A distinctive feature of SUR models is that they consist of several unrelated systems of…

Seasonal Decomposition

Seasonal Decomposition: The seasonal decomposition is a method used in time series analysis to represent a time series as a sum (or, sometimes, a product) of three components - the linear trend, the periodic (seasonal) component, and random residuals. The seasonal decomposition is useful in…

Seasonal Adjustment

Seasonal Adjustment: The seasonal adjustment is used in time series analysis to remove a periodic component with the known period from the observed time series. This adjustment is normally performed through the seasonal decomposition of the time series followed by subtraction of the seasonal component…

Scale Invariance (of Measures)

Scale Invariance (of Measures): Scale invariance is a property of descriptive statistics. If a statistic T is scale-invariant, it has the following property for any sample x1, ..., xN and any non-negative value c: T(c·x1, ..., c·xN) = c·T(x1, ..., xN). In other words, if a statistic…

Sample Survey

Sample Survey: In a sample survey, a sample of units drawn from the population of interest is analyzed. A related concept is the census survey. The main advantage of the sample survey (as compared to the census survey) is that its implementation…

Statistical Significance

Statistical Significance: Outcomes to an experiment or repeated events are statistically significant if they differ from what chance variation might produce. For example - suppose n people are given a medication. If their response to the medication lies outside the range of how samples of…

Sampling

Sampling: Sampling is a process of drawing a sample from a population. Sampling may be performed from both real and hypothetical populations. Examples of sampling from a real population are opinion polls (when a finite number of individuals is chosen from a much bigger…

Robust Filter

Robust Filter: A robust filter is a filter that is not sensitive to input noise values with extremely large magnitude (e.g. those arising due to anomalous measurement errors). The median filter is an example of a robust filter. Linear filters are not robust…

Root Mean Square (Graphical)

Root Mean Square: Root mean square (RMS) of a set of values xi, i=1,...,N is the square root of the mean of the squares of the values: RMS = sqrt((x1² + ... + xN²)/N). RMS is a statistical measure of departure from the null value.
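
A minimal Python sketch (names are our own):

```python
import math

def root_mean_square(values):
    """Square root of the mean of the squared values."""
    return math.sqrt(sum(v ** 2 for v in values) / len(values))

# Departure from zero for values of mixed sign:
rms = root_mean_square([3, -3, 3, -3])
```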

Random Numbers

Random Numbers: Random numbers are the numbers produced by a truly random mechanism (in contrast to pseudo-random numbers ). For example, random numbers with a good degree of randomness may be produced by tossing a coin, recording "0" or "1" (instead of "head" or "tail"),…

Reproducibility

Reproducibility: Reproducibility is the variation of outcomes of an experiment carried out in conditions varying within a typical range, e.g. when measurement is carried out by the same device by different operators, in different laboratories, etc. For example, reproducibility of measurements of mechanical scales is…

Replication

Replication: In statistics, replication is repetition of an experiment or observation in the same or similar conditions. Replication is important because it adds information about the reliability of the conclusions or estimates to be drawn from the data. The statistical methods that assess that reliability…

Repeatability

Repeatability: Repeatability is the variation of outcomes of an experiment carried out in the same conditions, e.g. by the same operator, in the same laboratory. For example, repeatability of measurements of precise mechanical scales is the variation of weight values reported for a given constant…

Replicate

Replicate: A replicate is the outcome of an experiment or observation obtained in the course of its replication. In applied statistics, a set of replicates obtained in a series of replications of the experiment or observations is considered as a sample from a much bigger…

Reliability (in Survey Analysis)

Reliability (in Survey Analysis): In survey analysis, e.g. in psychometrics , reliability is a measure of reproducibility of the survey instrument or test. In other words, reliability is a measure of precision - i.e. it describes the random error of the survey instrument. There are…

Reliability

Reliability: Reliability characterises the capability of a device, unit, or procedure to perform without fault. Reliability is quantified in terms of probability. This probability is related either to an elementary act or to an interval of time or another continuous variable. Because the probability of failure…

Regression Trees

Regression Trees: Regression trees are one of the CART techniques. The main distinction from classification trees (another CART technique) is that the dependent variable is continuous.

Rectangular Filter

Rectangular Filter: The rectangular filter is the simplest linear filter; it is usually used as a smoother. The output of the rectangular filter at the time moment t is the arithmetic mean of the input values corresponding to the moments of time close to…