False Discovery Rate

False Discovery Rate: A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, false positives (Type I errors). The true false discovery rate is not known, since the true state…

Family-wise Type I Error

Family-wise Type I Error: In multiple comparison procedures, family-wise type I error is the probability that, even if all samples come from the same population, you will wrongly conclude that at least one pair of populations differs. If α is the probability of comparison-wise type…
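The entry above is cut off, so the following is only an illustrative sketch. It assumes the k comparisons are independent and each is tested at comparison-wise level alpha, in which case the family-wise error is 1 - (1 - alpha)^k. The function name is hypothetical.

```python
def familywise_error(alpha: float, n_comparisons: int) -> float:
    """Probability of at least one type I error among independent comparisons.

    Assumption: the comparisons are independent, each tested at level alpha.
    """
    return 1.0 - (1.0 - alpha) ** n_comparisons


# With a single comparison the two error rates coincide; with more
# comparisons the family-wise rate grows quickly.
for k in (1, 3, 10):
    print(k, round(familywise_error(0.05, k), 4))
```

In practice pairwise comparisons are rarely fully independent, so this value is best read as a rough guide rather than an exact rate.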

Family-wise Type I Error (Graphical)

Family-wise Type I Error: In multiple comparison procedures, family-wise type I error is the probability that, even if all samples come from the same population, you will wrongly conclude that at least one pair of populations differs. If α is the probability of comparison-wise type I…

Farthest Neighbor Clustering

Farthest Neighbor Clustering: Farthest neighbor clustering is a synonym for complete linkage clustering.

Feature

This term is used synonymously with attribute and variable; strictly, it refers to an independent variable (see dependent and independent variables). The term feature comes from the machine learning community, often in the phrase "feature selection" (which see).

Feature engineering

Feature engineering: In predictive modeling, a key step is to turn available data (which may come from varied sources and be messy) into an orderly matrix of rows (records to be predicted) and columns (predictor variables or features). The feature engineering process involves review of…

Feature Selection

Feature Selection: In predictive modeling, feature selection, also called variable selection, is the process (usually automated) of sorting through variables to retain variables that are likely to be informative in prediction, and discard or combine those that are redundant. “Features” is a term used by…

Features vs. Variables

Features vs. Variables: The predictors in a predictive model are sometimes given different terms by different disciplines. Traditional statisticians think in terms of variables. The machine learning community calls them features (also attributes or inputs). There is a subtle difference in meaning. In predictive modeling,…

Filter

Filter: A filter is an algorithm for processing a time series or random process. There are two major classes of problems solved by filters: 1. To estimate the current value of a time series (X(t), t = 1,2, ...), which is not directly…

Finite Mixture Models

Finite Mixture Models: Outside social research, the term "finite mixture models" is often used as a synonym for "latent class models" in latent class analysis.

Finite Sample Space

Finite Sample Space: If a sample space contains a finite number of elements, then the sample space is said to be a finite sample space. The sample space for the experiment of a toss of a coin is a finite sample space. It has only…

Fisher's Exact Test

Fisher's Exact Test: Fisher's exact test is historically the first permutation test. It is used with two samples of binary data, and tests the null hypothesis that the two samples are drawn from populations with equal but unknown proportions of "successes" (e.g. proportion of patients…
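As a rough illustration, the sketch below computes a one-sided p-value for a 2x2 table by summing hypergeometric probabilities over all tables with the same margins that show at least as many successes in the first sample. The function name and the table layout (a, b = successes and failures in sample 1; c, d = successes and failures in sample 2) are assumptions made for this example, not part of the entry.

```python
from math import comb


def fisher_exact_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    row and column totals in which sample 1 has a or more successes.
    """
    row1 = a + b          # size of sample 1
    row2 = c + d          # size of sample 2
    col1 = a + c          # total number of successes
    total_tables = comb(row1 + row2, col1)
    p_value = 0.0
    for x in range(a, min(row1, col1) + 1):
        p_value += comb(row1, x) * comb(row2, col1 - x) / total_tables
    return p_value


# 3 of 4 successes in sample 1 versus 1 of 4 in sample 2.
print(fisher_exact_one_sided(3, 1, 1, 3))
```

A two-sided version would need a rule for which tables count as "as extreme" in the other direction; that choice is left out of this sketch.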

Fixed Effects

Fixed Effects: The term "fixed effects" (as contrasted with "random effects") is related to how particular coefficients in a model are treated - as fixed or random values. Which approach to choose depends on both the nature of the data and the objective of the…

Fixed Effects (Graphical)

Fixed Effects: The term "fixed effects" (as contrasted with "random effects") is related to how particular coefficients in a model are treated - as fixed or random values. Which approach to choose depends on both the nature of the data and the objective of the…

Fleming Procedure

Fleming Procedure: The Fleming procedure (or O'Brien-Fleming multiple testing procedure) is a simple multiple testing procedure for comparing two treatments when the response to treatment is dichotomous. This procedure is used in clinical trials. The procedure provides an opportunity to terminate the trial early…

Forward Selection

Forward Selection: Forward selection is one of several computer-based iterative variable-selection procedures. It resembles step-wise regression except that a variable added to the model is not permitted to be removed in the subsequent steps. See also Backward elimination.

Fourier Spectrum

Fourier Spectrum: Any continuous function $x(t)$ defined on a finite interval of length $T$ can be represented as a weighted sum of cosine functions with periods $T, T/2, T/3, \ldots$: $x(t) = \sum_{i=0}^{\infty} A_i \cos(2\pi f_i t + \varphi_i)$, where $f_i = i/T$ is the frequency of the i-th Fourier component; $A_i$ is the amplitude of the i-th component; $\varphi_i$ is the phase of…

Frequency Distribution

Frequency Distribution: A frequency distribution is a tabular summary of a set of data showing the frequency (or number) of items in each of several non-overlapping classes (or bins). This definition is applicable to both quantitative and categorical (qualitative) data. For quantitative data, the classes…

Frequency Interpretation of Probability

Frequency Interpretation of Probability: The frequency interpretation of probability is the most widely held of several ways of interpreting the meaning of the concept of "probability". According to this interpretation the probability of an event is the proportion of times the said event occurs when…

Functional Data Analysis (FDA)

Functional Data Analysis (FDA): In functional data analysis (FDA), data are considered as continuous functions (or curves). This is in contrast to multivariate statistics, where data are considered as vectors (finite sets of values). Real data are usually collected as discrete samples. In FDA, such…

Gamma Distribution

Gamma Distribution: A random variable x is said to have a gamma distribution with parameters $\alpha > 0$ and $\lambda > 0$ if its probability density p(x) is $p(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}$ for $x > 0$, and $p(x) = 0$ otherwise.

Gamma Distribution (Graphical)

Gamma Distribution: A random variable x is said to have a gamma distribution with parameters $\alpha > 0$ and $\lambda > 0$ if its probability density p(x) is $p(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}$ for $x > 0$, and $p(x) = 0$ otherwise.

Gaussian Filter

Gaussian Filter: The Gaussian filter is a linear filter that is usually used as a smoother. The output of the Gaussian filter at the moment is the weighted mean of the input values, and the weights are defined by formula where is the "distance"…

General Association Statistic

General Association Statistic: The general association statistic is one of the statistics used in the generalized Cochran-Mantel-Haenszel tests. It is applicable when both the "treatment" and the "response" variables are measured on a nominal scale. If the treatment and response variables are independent…

General Linear Model

General Linear Model: General (or generalized) linear models (GLM), in contrast to linear models, allow you to describe both additive and non-additive relationships between a dependent variable and N independent variables. The independent variables in GLM may be continuous as well as discrete. (The dependent…

General Linear Model for a Latin Square

General Linear Model for a Latin Square: In design of experiments, a Latin square is a three-factor experiment in which, for each pair of factors, any combination of factor values occurs only once. Consider the following Latin Square, B C D A C D…

General Linear Model for a Latin Square (Graphical)

General Linear Model for a Latin Square: In design of experiments, a Latin square is a three-factor experiment in which, for each pair of factors, any combination of factor values occurs only once. Consider the following Latin Square, where rows correspond to 4 values…

Generalized Cochran-Mantel-Haenszel tests

Generalized Cochran-Mantel-Haenszel tests: The generalized Cochran-Mantel-Haenszel tests are a family of tests aimed at detecting association between two categorical variables observed in K strata. The initial data are represented as a series of K RxC contingency tables, where K is the number of…

Geometric Distribution

Geometric Distribution: A random variable x obeys the geometric distribution with parameter p (0<p<1) if $P\{x=k\} = p(1-p)^k$, $k = 0, 1, 2, \ldots$. If a random variable obeys the Bernoulli distribution with probability of success p, then x might be the number of trials before the first…

Geometric Distribution (Graphical)

Geometric Distribution: A random variable x obeys the geometric distribution with parameter p (0<p<1) if $P\{x=k\} = p(1-p)^k$, $k = 0, 1, 2, \ldots$. If a random variable obeys the Bernoulli distribution with probability of success p, then x might be the number of trials before the first "success" occurs.
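A small sketch of this definition: the probability mass function p(1-p)^k, with k counting the failed Bernoulli trials before the first success, compared against a simulation. The function names, the seed, and the value p = 0.3 are illustrative choices only.

```python
import random


def geometric_pmf(k: int, p: float) -> float:
    """P{x = k} = p * (1 - p)**k for k = 0, 1, 2, ..."""
    return p * (1.0 - p) ** k


def failures_before_first_success(p: float, rng: random.Random) -> int:
    """Run Bernoulli(p) trials and count the failures before the first success."""
    failures = 0
    while rng.random() >= p:   # a trial succeeds with probability p
        failures += 1
    return failures


rng = random.Random(0)
p = 0.3
draws = [failures_before_first_success(p, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
theoretical_mean = (1.0 - p) / p   # mean of this form of the distribution

print(geometric_pmf(0, p), geometric_pmf(1, p))
print(sample_mean, theoretical_mean)
```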

Geometric mean

Geometric mean: The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful in taking averages of ratios. The geometric mean is often used for data which take only on positive…
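The calculation can be sketched as follows. The version below averages logarithms instead of multiplying all values directly, which gives the same result for positive values and avoids overflow for long lists; that choice, and the function name, are assumptions of this example.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


for data in ([1, 1, 1], [1, 2, 3], [1, 2, 1000]):
    arithmetic = sum(data) / len(data)
    print(data, round(arithmetic, 1), round(geometric_mean(data), 1))
```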

Geometric Mean and Mean (comparison)

Geometric Mean and Mean (comparison): The quantitative distinction between the geometric mean and the mean can be illustrated by the following table:

Data set      Mean     Geometric Mean
1, 1, 1       1        1
1, 2, 3       2        ≈ 1.8
1, 2, 1000    ≈ 334    ≈ 12.6

The analytical relation…

Gini coefficient

Gini coefficient: The Gini coefficient is used in economics to measure income inequality. Generally speaking, it is used to measure the extent of departure from a perfectly even distribution of income. A "0" indicates no departure, i.e. everyone has the same income. A "1" indicates…

Gini coefficient (Graphical)

Gini coefficient: The Gini coefficient is used in economics to measure income inequality. Generally speaking, it is used to measure the extent of departure from a perfectly even distribution of income. A "0" indicates no departure, i.e. everyone has the same income. A "1" indicates…

Gini's Mean Difference

Gini's Mean Difference: Gini's mean difference is a descriptive statistic, a measure of variation. For a sample of N values the Gini's mean difference is the average of all pairwise absolute differences: $GMD = \frac{1}{N(N-1)} \sum_{i \neq j} |x_i - x_j|$; $i, j = 1, \ldots, N$; $i \neq$…
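A direct, unoptimized sketch of this average over all ordered pairs with i different from j (the function name is hypothetical):

```python
def gini_mean_difference(values):
    """Average absolute difference over all ordered pairs (i, j) with i != j."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(
        abs(values[i] - values[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))


print(gini_mean_difference([1, 2, 3]))
```

The double loop is quadratic in N, which is fine for small samples; a sorted-data formula would be preferable for large ones.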

Goodness-of-Fit Test

Goodness-of-Fit Test: A goodness-of-fit test is a statistical test to determine whether there is a significant difference between the observed frequency distribution and a theoretical probability distribution which is hypothesized to describe the observed distribution.

Granger Causation

Granger Causation: Granger causation is a definition of causal relation between vectors in vector time series. Let us define $H_t$ as the history up to and including the discrete time t, and denote by $Y_t$ the random vector Y at time t. Granger…

Hadoop

Hadoop: As data processing requirements grew beyond the capacities of even large computers, distributed computing systems were developed to spread the load to multiple computers. Hadoop is a distributed computing system with two key features: (1) it is open source, and (2) it can use…

Harmonic Mean

Harmonic Mean: Harmonic mean is a measure of central location. The harmonic mean of N positive values $x_1, \ldots, x_N$ is defined by the formula $H = N / \sum_{i=1}^{N} (1/x_i)$. Let the path between two cities A and B be divided into N parts of equal length. One drives the i-th part at velocity $v_i$.…
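A short sketch of the harmonic mean, applied to the driving situation: when a route is split into parts of equal length and each part is driven at its own speed, the overall average speed is the harmonic mean of the speeds. The speeds 30 and 60 and the function name are made-up illustrations.

```python
def harmonic_mean(values):
    """N divided by the sum of reciprocals of N positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean requires positive values")
    return len(values) / sum(1.0 / v for v in values)


# Two equal-length parts driven at 30 and 60 (same speed unit).
speeds = [30.0, 60.0]
part_length = 10.0
total_time = sum(part_length / v for v in speeds)
average_speed = part_length * len(speeds) / total_time

print(harmonic_mean(speeds), average_speed)
```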

Hazard Function

Hazard Function: In medical statistics, the hazard function is a relationship between a proportion and time. The proportion (also called the hazard rate) is the proportion of subjects who die in an increment of time starting at time "t" from among those who have survived…

HDFS

HDFS: HDFS is the Hadoop Distributed File System. It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

Heteroscedasticity

Heteroscedasticity: Heteroscedasticity generally means unequal variation of data, e.g. unequal variance. For special cases see heteroscedasticity in regression, heteroscedasticity in hypothesis testing. See also: homoscedasticity

Heteroscedasticity in hypothesis testing

Heteroscedasticity in hypothesis testing: In hypothesis testing, heteroscedasticity means a situation in which the variance is different for compared samples. Heteroscedasticity complicates testing because most tests rest on the assumption of equal variance. See also: homoscedasticity in hypothesis testing

Heteroscedasticity in regression

Heteroscedasticity in regression: In regression analysis, heteroscedasticity means a situation in which the variance of the dependent variable varies across the data. Heteroscedasticity complicates analysis because many methods in regression analysis are based on an assumption of equal variance. See also: homoscedasticity in regression…

Hierarchical Cluster Analysis

Hierarchical Cluster Analysis: Hierarchical cluster analysis (or hierarchical clustering) is a general approach to cluster analysis, in which the object is to group together objects or records that are "close" to one another. A key component of the analysis is repeated calculation of distance…

Hierarchical Linear Modeling

Hierarchical Linear Modeling: Hierarchical linear modeling is an approach to analysis of hierarchical (nested) data - i.e. data represented by categories, sub-categories, ..., individual units (e.g. school -> classroom -> student). At the first stage, we choose a linear model (level 1 model) and fit…

Hierarchical Loglinear Models

Hierarchical Loglinear Models: Hierarchical linear modeling is an approach to analysis of hierarchical (nested) data - i.e. data represented by categories, sub-categories, ..., individual units (e.g. school -> classroom -> student). At the first stage, we choose a linear model (level 1 model) and fit…

Histogram

Histogram: A histogram is a graph of a dataset, composed of a series of rectangles. The width of these rectangles is proportional to the range of values in a class or bin, all bins being the same width. For example, values lying between 1 and…

Hold-Out Sample

A hold-out sample is a random sample from a data set that is withheld and not used in the model fitting process. After the model is fit to the main data (the "training" data), it is then applied to the hold-out sample. This gives an…
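The workflow can be sketched as below: withhold a random part of the data, fit a model to the rest, then measure error on the withheld part. The synthetic data, the 20% hold-out fraction, the straight-line model, and all function names are assumptions made for this illustration.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=0):
    """Randomly withhold a fraction of the records; return (training, holdout)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(points, slope, intercept):
    """Root mean squared prediction error on the given points."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return (sum(squared) / len(squared)) ** 0.5


# Synthetic data: y = 2x + 1 plus noise.
noise = random.Random(42)
data = [(float(x), 2.0 * x + 1.0 + noise.gauss(0, 0.5)) for x in range(100)]

training, holdout = split_holdout(data)
slope, intercept = fit_line(training)          # the model never sees the hold-out
holdout_rmse = rmse(holdout, slope, intercept)
print(len(training), len(holdout), round(holdout_rmse, 3))
```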

Homoscedasticity

Homoscedasticity: Homoscedasticity generally means equal variation of data, e.g. equal variance. For special cases see homoscedasticity in regression, homoscedasticity in hypothesis testing. See also: heteroscedasticity

Homoscedasticity in hypothesis testing

Homoscedasticity in hypothesis testing: In hypothesis testing, homoscedasticity means a situation in which the variance is the same for all the compared samples. Homoscedasticity facilitates testing because most tests rest on the assumption of equal variance. See also: heteroscedasticity, heteroscedasticity in…

Homoscedasticity in regression

Homoscedasticity in regression: In regression analysis, homoscedasticity means a situation in which the variance of the dependent variable is the same for all the data. Homoscedasticity facilitates analysis because most methods are based on the assumption of equal variance. See also: heteroscedasticity in…

Hotelling Trace Coefficient

Hotelling Trace Coefficient: The Hotelling Trace coefficient (also called Lawley-Hotelling or Hotelling-Lawley Trace) is a statistic for a multivariate test of mean differences between two groups. The null hypothesis is that centroids don't differ between two groups. The coefficient is equal to Hotelling's T-Square…

Hotelling's T-Square

Hotelling's T-Square: Hotelling's T-square is a statistic for a multivariate test of differences between the mean values of two groups. The null hypothesis is that centroids don't differ between two groups. Hotelling's T-square is used in multiple analysis of variance (MANOVA), and in…

Hypothesis

Hypothesis: A (statistical) hypothesis is an assertion or conjecture about the distribution of one or more random variables. For example, an experimenter may pose the hypothesis that the outcomes from treatment A and treatment B belong to the same population or distribution. If the hypothesis…

Hypothesis Testing

Hypothesis Testing: Hypothesis testing (also called "significance testing") is a statistical procedure for discriminating between two statistical hypotheses - the null hypothesis (H0) and the alternative hypothesis (Ha, often denoted as H1). Hypothesis testing, in a formal logic sense, rests on the presumption of…

Icon Plots

Icon Plots: Icon plots are graphical tools for multivariate analysis. They provide graphical representation of observed units described by many variables. Each unit or observation is represented by a small image which depends on the values of the variables of interest. Icon plots…

Image Processing

Image Processing: In image processing, the initial data are images - functions of two coordinates. Normally, images are represented in discrete form as two-dimensional arrays of image elements, or "pixels" - i.e. sets of non-negative values, ordered by two indexes - (rows)…

Independent Events

Independent Events: Two events A and B are said to be independent if P(AB) = P(A)·P(B). To put it differently, events A and B are independent if the occurrence or non-occurrence of A does not influence the occurrence or non-occurrence of B and vice-versa. For…
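The product rule can be checked by brute-force enumeration. The sketch below uses two fair dice, an example chosen here for illustration and not taken from the entry: "first die is even" is independent of "the sum is 7" but not of "the sum is at least 10".

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def probability(event):
    """Exact probability of an event given as a predicate on an outcome."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


def are_independent(event_a, event_b):
    """Check P(AB) == P(A) * P(B) exactly."""
    p_both = probability(lambda o: event_a(o) and event_b(o))
    return p_both == probability(event_a) * probability(event_b)


def first_even(o):
    return o[0] % 2 == 0


def sum_is_seven(o):
    return o[0] + o[1] == 7


def sum_at_least_ten(o):
    return o[0] + o[1] >= 10


print(are_independent(first_even, sum_is_seven))
print(are_independent(first_even, sum_at_least_ten))
```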

Independent Random Variables

Independent Random Variables: Two or more random variables are said to be independent if their joint distribution (density) is the product of their marginal distributions (densities).

Inferential Statistics

Inferential Statistics: Inferential statistics is the body of statistical techniques that deal with the question "How reliable is the conclusion or estimate that we derive from a set of data?" The two main techniques are confidence intervals and hypothesis tests.

Interaction effect

Interaction effect: An interaction effect refers to the role of a variable in an estimated model, and its effect on the dependent variable. A variable that has an interaction effect will have a different effect on the dependent variable, depending on the level of some…

Interim Monitoring

Interim Monitoring: In clinical trials of medical treatments or devices, a traditional fixed sample design establishes a fixed number of subjects or outcomes that must be observed. In a trial that uses interim monitoring, the sample size is not fixed in advance. Rather, periodic looks…

Internal Consistency Reliability

Internal Consistency Reliability: The internal consistency reliability of survey instruments (e.g. psychological tests), is a measure of reliability of different survey items intended to measure the same characteristic. For example, there are 5 different questions (items) related to anxiety level. Each question implies…

Interobserver Reliability

Interobserver Reliability: The interobserver reliability of a survey instrument, like a psychological test, measures agreement between two or more subjects rating the same object, phenomenon, or concept. For example, 5 critics are asked to evaluate the quality of 10 different works of art…

Interquartile Range

Interquartile Range: The difference between the 3rd and 1st quartiles is called the interquartile range and it is used as a measure of variability (dispersion).
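A minimal sketch using Python's standard library. Quartiles can be defined in several slightly different ways; the "inclusive" interpolation method chosen below is an assumption of this example, and other conventions may give somewhat different values for small samples.

```python
import statistics


def interquartile_range(values):
    """Difference between the 3rd and 1st quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(interquartile_range(data))
```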

Interval Scale

Interval Scale: An interval scale is a measurement scale in which a certain distance along the scale means the same thing no matter where on the scale you are, but where "0" on the scale does not represent the absence of the thing being measured.…

Intraobserver Reliability

Intraobserver Reliability: Intraobserver reliability indicates how stable responses obtained from the same respondent at different time points are. The greater the difference between the responses, the smaller the intraobserver reliability of the survey instrument. The correlation coefficient between the responses obtained at different…

Jackknife

Jackknife: The jackknife is a general non-parametric method for estimation of the bias and variance of a statistic (which is usually an estimator) using only the sample itself. The jackknife is considered as the predecessor of the bootstrapping techniques. With a sample of size N,…
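Since the entry is cut off before describing the procedure, the following is a sketch of the commonly used leave-one-out form: recompute the statistic N times, each time omitting one observation, and derive bias and variance estimates from those N values. The function names are hypothetical.

```python
def jackknife(values, statistic):
    """Leave-one-out jackknife estimates of bias and variance of a statistic.

    Returns (bias_estimate, variance_estimate).
    """
    n = len(values)
    full_estimate = statistic(values)
    # Statistic recomputed with the i-th observation removed.
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    loo_mean = sum(leave_one_out) / n
    bias = (n - 1) * (loo_mean - full_estimate)
    variance = (n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out)
    return bias, variance


def sample_mean(values):
    return sum(values) / len(values)


bias, variance = jackknife([1.0, 2.0, 3.0, 4.0, 5.0], sample_mean)
print(bias, variance)
```

For the sample mean, the jackknife variance coincides with the usual estimate s²/N (sample variance divided by N), which makes a convenient sanity check.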

Joint Probability Density

Joint Probability Density: A function f(x,y) is called the joint probability density of random variables X and Y if and only if for any region A on the xy-plane $P((X,Y) \in A) = \iint_A f(x,y)\,dx\,dy$.

Joint Probability Distribution

Joint Probability Distribution: If X and Y are discrete random variables, the function f(x,y) which gives the probability that X = x and Y = y for each pair of values (x,y) within the range of values of X and Y is called the joint…

k-Means Clustering

k-Means Clustering: The k-means clustering method is used in non-hierarchical cluster analysis. The goal is to divide the whole set of objects into a predefined number (k) of clusters. The criterion for such subdivision is normally the minimal dispersion inside clusters - e.g. the…
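One common way to pursue this goal is the iterative scheme usually called Lloyd's algorithm: assign each object to its nearest centroid, recompute each centroid as the mean of its members, and repeat until assignments stop changing. The entry does not name a specific algorithm, so this choice, the random initialization, and the function names are assumptions of the sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, assignments) after iterating to convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # start from k distinct data points
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing moved: converged
            break
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                     # keep old centroid if cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, assignments = k_means(points, k=2)
print(sorted(centroids), assignments)
```

The result can depend on the starting centroids; running the procedure from several random starts and keeping the solution with the smallest within-cluster dispersion is a common safeguard.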

k-Nearest neighbor

k-Nearest neighbor: K-nearest-neighbor (K-NN) is a machine learning predictive algorithm that relies on calculation of distances between pairs of records. The algorithm is used in classification problems where training data are available with known target values. The algorithm takes each record and assigns…

k-Nearest Neighbors Classification

k-Nearest Neighbors Classification: The k-nearest neighbors (k-NN) classification is a method of classification that uses a training set chosen from the data as a point of reference in classifying observations. The idea of the method is to find the k elements of the training set…
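A compact sketch of this idea: find the k training records closest to the new observation and assign the class that occurs most often among them. Euclidean distance, majority voting, the toy data, and the function names are assumptions of this example.

```python
from collections import Counter


def squared_distance(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_classify(training_set, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.

    training_set: list of (features, label) pairs.
    """
    nearest = sorted(
        training_set, key=lambda record: squared_distance(record[0], query)
    )[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


training_set = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"), ((9.0, 9.0), "high"), ((8.5, 9.5), "high"),
]
print(knn_classify(training_set, (1.2, 1.4)))
print(knn_classify(training_set, (8.8, 8.7)))
```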

k-Nearest Neighbors Prediction

k-Nearest Neighbors Prediction: The k-nearest neighbors (k-NN) prediction is a method to predict a value of a target variable in a given record, using as a reference point a training set of similar objects. The basic idea is to choose k objects from the training…

Kalman Filter

Kalman Filter: Kalman filter is a class of linear filters for predicting and/or smoothing time series. The value of the time series is usually a vector in a state space. Kalman filter is optimal for filtering many types of Markov chains.…

Kalman Filter (Equations)

Kalman Filter (Equations): The basic mathematics behind the idea of Kalman filter may be described as follows - Consider, for example, a Markov chain - i.e. a random series with Markov property - described by the following equation: (1) where - is the…

Kaplan-Meier Estimator

Kaplan-Meier Estimator: The Kaplan-Meier estimator is aimed at estimation of the survival function from censored life-time data. The value of the survival function between successive distinct uncensored observations is taken as constant, and the graph of the Kaplan-Meier estimate of the survival function is a…
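A sketch of the usual product-limit calculation: at each distinct uncensored time, multiply the running survival estimate by (1 - deaths / number at risk). The input format, a list of (time, event) pairs with event False meaning censored, and the function name are assumptions of this example.

```python
def kaplan_meier(observations):
    """Return [(time, survival_estimate)] at each distinct uncensored time.

    observations: list of (time, event) pairs; event is True for an observed
    failure and False for a censored observation.
    """
    event_times = sorted({time for time, event in observations if event})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for time, _event in observations if time >= t)
        deaths = sum(1 for time, event in observations if time == t and event)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# The subject censored at time 2 leaves the risk set without causing a step.
data = [(1, True), (2, False), (3, True), (4, True)]
print(kaplan_meier(data))
```

Between the listed times the estimate stays constant, which is what produces the characteristic step shape of the curve.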

Kappa Statistic

Kappa Statistic: Kappa statistic is a generic term for several similar measures of agreement used with categorical data. Typically it is used in assessing the degree to which two or more raters, examining the same data, agree when it comes to assigning the data…

Kolmogorov-Smirnov One-sample Test

Kolmogorov-Smirnov One-sample Test: The Kolmogorov-Smirnov one-sample test is a goodness-of-fit test, and tests whether an observed dataset is consistent with a hypothesized theoretical distribution. The test involves specifying the cumulative frequency distribution which would occur given the theoretical distribution and comparing that with the observed…

Kolmogorov-Smirnov Test

Kolmogorov-Smirnov Test: See: Kolmogorov-Smirnov one-sample test and Kolmogorov-Smirnov two-sample test

Kolmogorov-Smirnov Two-sample Test

Kolmogorov-Smirnov Two-sample Test: The Kolmogorov-Smirnov two-sample test is a test of the null hypothesis that two independent samples have been drawn from the same population (or from populations with the same distribution). The test uses the maximal difference between cumulative frequency distributions of two samples…
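The test statistic itself, the maximal difference between the two empirical cumulative distributions, can be sketched as follows. Only the statistic is computed here; turning it into a p-value is not covered. Function names are hypothetical.

```python
def empirical_cdf(sample, x):
    """Fraction of sample values less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)


def ks_two_sample_statistic(sample_1, sample_2):
    """Largest absolute gap between the two empirical cumulative distributions.

    The gap can only change at observed values, so it is enough to check
    every value in the pooled sample.
    """
    pooled = list(sample_1) + list(sample_2)
    return max(
        abs(empirical_cdf(sample_1, x) - empirical_cdf(sample_2, x))
        for x in pooled
    )


print(ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```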

Kruskal-Wallis Test

Kruskal-Wallis Test: The Kruskal-Wallis test is a nonparametric test for finding if three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.

Kurtosis

Kurtosis: Kurtosis measures the "heaviness of the tails" of a distribution (compared to a normal distribution). Kurtosis is positive if the tails are "heavier" than for a normal distribution, and negative if the tails are "lighter" than for a normal distribution. The normal distribution…
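The sign convention described here matches what is often called excess kurtosis: the fourth central moment divided by the squared variance, minus 3. The sketch below uses simple population-style moments (dividing by n) with no small-sample correction; both that choice and the function name are assumptions.

```python
def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


light_tails = [-1.0, 1.0]                        # all mass at two points
heavy_tails = [-5.0] + [0.0] * 8 + [5.0]         # mostly central, rare extremes
print(excess_kurtosis(light_tails), excess_kurtosis(heavy_tails))
```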

Label

Label: A label is a category into which a record falls, usually in the context of predictive modeling. Label, class and category are different names for discrete values of a target (outcome) variable. "Label" typically has the added connotation that the label is something applied…

Latent Class Analysis (LCA)

Latent Class Analysis (LCA): Latent class analysis is concerned with deriving information about categorical latent variables from observed values of categorical manifest variables. In other words, LCA deals with fitting latent class models - a subclass of the latent variable models - to…

Latent Class Cluster Analysis

Latent Class Cluster Analysis: The latent class cluster analysis is a branch of the latent class analysis where the latent variable is considered as a single categorical variable taking on t possible values, corresponding to t classes.

Latent Class Factor Analysis

Latent Class Factor Analysis: The latent class factor analysis is a branch of the latent class analysis where the latent variable is a vector of several categorical variables, usually dichotomous variables.

Latent Profile Analysis (LPA)

Latent Profile Analysis (LPA): Latent profile analysis is concerned with deriving information about categorical latent variables from the observed values of continuous manifest variables. In other words, LPA deals with fitting latent profile models (a special kind of latent variable models) to…

Latent Structure Models

Latent Structure Models: Latent structure models is a generic term for a broad set of categories of statistical models. This set includes factor analysis models, covariance structure models, latent profile analysis models, latent trait analysis models, latent class analysis models, and some others. Each category…

Latent Trait Analysis (LTA)

Latent Trait Analysis (LTA): Latent trait analysis is concerned with deriving information about continuous latent variables from the observed values of categorical manifest variables. In other words, LTA deals with fitting latent trait models (a special kind of latent variable models) to…

Latent Variable

Latent Variable: A latent variable describes an unobservable construct and cannot be observed or measured directly. Latent variables are essential elements of latent variable models. A latent variable can be categorical or continuous. The opposite concept is the manifest variable.
