Get "Word of the Week" in your inbox.
Jun 13, 2019
Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.
May 27, 2019
There are various ways to recommend additional products to an online purchaser, and the most effective ones rely on prior purchase or rating history 
Apr 19, 2019
A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor).
Apr 12, 2019
To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing synonyms, like these:
Apr 8, 2019
To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these:
Mar 18, 2019
When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records. One of them is the “yacht owner” problem.
Mar 1, 2019
Rectangular data are the staple of statistical and machine learning models. Rectangular data are multivariate cross-sectional data (i.e. not time series or repeated measures) in which each column is a variable (feature), and each row is a case or record.
Feb 15, 2019
Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample. It can occur in numerous situations; here are just a few:
Jan 24, 2019
A “Likert scale” is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale. For example, a response could range from “agree strongly” through “agree somewhat” and “disagree somewhat” on to “disagree strongly.” Two key decisions the survey designer faces are
Jan 17, 2019
A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category. Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial, or automotive, and there would be three dummy variables created:
Dec 20, 2018
Curbstoning, to an established auto dealer, is the practice of unlicensed dealers selling cars from the streetside, the cars parked along the curb. With the pretense of being individuals selling a car on their own, and with no fixed location, such dealers avoid the fixed costs and regulations that burden regular used-car dealers and, in the eyes of those dealers, constitute unfair competition. Hence the numerous web sites touting the honesty and fair deals you get from your neighborhood used-car dealer, and warning you against the allure of curbstoners.
To a statistician, curbstoning means something completely different: it is the practice of fabricating survey data.
Dec 19, 2018
Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects. From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2. However, …
Dec 14, 2018
QUESTION: The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent). A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud. The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud"). If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
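The answer follows from Bayes' rule; a minimal sketch of the calculation, using the rates stated in the question:

```python
# Bayes' rule for the fraud-classification question above.
p_fraud = 0.10              # prior: 10% of claims are fraudulent
p_flag_given_fraud = 0.90   # 90% of fraudulent claims are flagged
p_flag_given_ok = 0.20      # 1 in 5 non-fraud claims is mistakenly flagged

# P(fraud | flagged) = P(flag | fraud) P(fraud) / P(flag)
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_ok * (1 - p_fraud)
posterior = p_flag_given_fraud * p_fraud / p_flag
print(round(posterior, 3))  # → 0.333
```

So a flagged claim is actually fraudulent only about one time in three: the false positives from the 90% of legitimate claims swamp the true positives.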
Dec 10, 2018
Churn is a term used in marketing to refer to the departure, over time, of customers. Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.). A customer who leaves, for whatever reason, “churns.”
Nov 26, 2018
The Receiver Operating Characteristic (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1's and 0's (for example, fraudulent insurance claims and non-fraudulent ones). It plots two quantities: the true positive rate on the vertical axis against the false positive rate on the horizontal axis.
Nov 19, 2018
A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you're measuring, who you're measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition of the problem under study does not change once the data collection starts.
A retrospective study is one in which you look backwards at data that have already been collected or generated, to answer a scientific (usually medical) problem.
Jun 19, 2018
The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.
May 30, 2018
I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
May 14, 2018
"Bag" refers to "bootstrap aggregating": repeatedly drawing bootstrap samples from a dataset and aggregating the results of statistical models applied to those samples. (A bootstrap sample is a resample drawn with replacement.)
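The idea can be sketched in a few lines; the data values here are made up, and the aggregated statistic is simply the mean rather than a full model:

```python
import random
import statistics

random.seed(0)
data = [4, 8, 15, 16, 23, 42]   # hypothetical dataset

# Draw B bootstrap samples (resamples with replacement) and aggregate
# a statistic -- here the mean -- across them: "bootstrap aggregating".
B = 1000
boot_means = [
    statistics.mean(random.choices(data, k=len(data)))  # one bootstrap sample
    for _ in range(B)
]
bagged_estimate = statistics.mean(boot_means)
```

In real bagging the per-sample statistic would be the prediction of a fitted model (e.g. a decision tree), and the aggregation a vote or average over models.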
May 8, 2018
Today's Words of the Week are convolution and tensor, key components of deep learning.
Mar 21, 2018
Benford's law describes an expected distribution of the first digit in many naturally occurring datasets.
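The expected first-digit probabilities are given by the standard formula P(d) = log10(1 + 1/d), which can be tabulated directly:

```python
import math

# Benford's law: P(first digit = d) = log10(1 + 1/d), for d = 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# The leading digit 1 appears about 30% of the time, 9 under 5%.
print(round(benford[1], 3), round(benford[9], 3))
```

The nine probabilities necessarily sum to 1, since the product of the ratios (d+1)/d telescopes to 10.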
Mar 7, 2018
Contingency tables are tables of counts of events or things, cross-tabulated by row and column.
Feb 28, 2018
Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, where it refers to parameters of the prior distribution.
Feb 21, 2018
Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.
Feb 14, 2018
As applied to statistical models, "overfit" means the model fits the training data too closely: it is fitting noise, not signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:
Feb 7, 2018
To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.
Jan 31, 2018
The easiest way to think of a spline is to first think of linear regression: a single linear relationship between an outcome variable and various predictor variables.
Jun 14, 2016
Logit is a nonlinear function of probability. If p is the probability of an event, then the corresponding logit is given by the formula: logit(p) = ln[p / (1 - p)].
Logit is widely used to construct statistical models, for example in logistic regression.
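The transform and its inverse (the logistic function) are one-liners; a small sketch:

```python
import math

def logit(p):
    """Log-odds of probability p: ln(p / (1 - p)), for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Logistic function, mapping log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))
```

Note that logit(0.5) = 0: even odds correspond to zero log-odds, and the logit maps the interval (0, 1) onto the whole real line, which is what makes it useful for building regression models of probabilities.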
Promoting better understanding of statistics throughout the world.
The Institute for Statistics Education offers an extensive glossary of statistical terms, available to all for reference and research. We will provide a statistical term every week, delivered directly to your inbox. To improve your own statistical knowledge, sign up here.
Jun 7, 2016
Intraobserver reliability indicates the stability of responses obtained from the same respondent at different time points. The greater the difference between the responses, the smaller the intraobserver reliability of the survey instrument.
The correlation coefficient between the responses obtained at different time points from the same respondent is often used as a quantitative measure of the intraobserver reliability.
May 31, 2016
Two events A and B are said to be independent if P(A∩B) = P(A)P(B). To put it differently, events A and B are independent if the occurrence or nonoccurrence of A does not influence the occurrence or nonoccurrence of B, and vice versa. For example, if I toss a coin and you toss a coin, the probability that I get heads is not influenced by the outcome on your coin, so the two events are independent. If two events are not independent, they are said to be dependent events.
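The coin example can be checked exactly by enumerating the sample space of two fair tosses; a small sketch using exact fractions:

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair coin tosses: HH, HT, TH, TT.
space = list(product("HT", repeat=2))

def prob(event):
    """Exact probability of an event over the equally likely outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

A = lambda o: o[0] == "H"          # my coin shows heads
B = lambda o: o[1] == "H"          # your coin shows heads
both = lambda o: A(o) and B(o)     # the intersection A∩B

# Independence: P(A∩B) = P(A)·P(B) = 1/2 · 1/2 = 1/4.
print(prob(both) == prob(A) * prob(B))
```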
May 24, 2016
Residuals are differences between the observed values and the values predicted by some model. Analysis of residuals allows you to estimate the adequacy of a model for particular data; it is widely used in regression analysis.
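A small sketch with made-up data: fit a least-squares line by hand and compute the residuals. With an intercept in the model, the residuals always sum to zero.

```python
# Hypothetical data roughly following y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

# Ordinary least-squares slope and intercept, computed directly.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar

# Residual = observed value minus the model's predicted value.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
```

Plotting these residuals against x (or against the predictions) is the usual first check of model adequacy: any visible pattern suggests the model is missing structure.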
May 17, 2016
The concurrent validity of survey instruments, like the tests used in psychometrics, is a measure of agreement between the results obtained by the given survey instrument and the results obtained for the same population by another instrument acknowledged as the "gold standard".
Concurrent validity is often quantified by the correlation coefficient between the two sets of measurements obtained for the same target population: the measurements performed by the evaluating instrument and by the standard instrument.
For example, a researcher has developed a new fast IQ test that requires only 5 minutes per subject, as compared to 90 minutes for the test acknowledged as the gold standard. The researcher administers both tests to each person in a group of, say, 50. The outcome is 50 pairs of IQ scores, one pair for each person: the score obtained by the new test and the score obtained by the gold standard test. The value of the correlation between the two sets of scores is a quantitative measure of the concurrent validity of the new IQ test.
Concurrent validity is a form of criterion validity.
May 10, 2016
Normality is a property of a random variable that is distributed according to the normal distribution.
Normality plays a central role in both theoretical and practical statistics: a great number of theoretical statistical methods rest on the assumption that the data, or test statistics derived from a sample of data, are normally distributed. For this reason, in practical statistics, data are very frequently tested for normality.
May 3, 2016
In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.
Apr 26, 2016
A corpus is a body of documents to be used in a text mining task. Some corpora are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms. More typically, the corpus is a body of documents for a specific text mining task, e.g. a set of maintenance tickets, or a group of discovery documents in a legal case, for which a classification model is needed.
Apr 19, 2016
Weighted kappa is a measure of agreement for categorical data. It is a generalization of the kappa statistic to situations in which the categories are not equal in some respect; that is, the disagreements are weighted by an objective or subjective function.
Apr 12, 2016
Rank correlation is a method of finding the degree of association between two variables. The calculation for the rank correlation coefficient is the same as that for the Pearson correlation coefficient, but it uses the ranks of the observations rather than their numerical values. This method is useful when the data are not available in numerical form but there is sufficient information to rank the data.
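That definition translates directly into code: replace each value by its rank, then apply the ordinary Pearson formula. A minimal sketch (with no handling of tied values, which real implementations must address):

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Rank (Spearman) correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))
```

Any monotonically related pair of variables gets a rank correlation of 1, even when the raw-value relationship is far from linear.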
Apr 5, 2016
In latent variable models, a manifest variable (or indicator) is an observable variable, i.e. a variable that can be measured directly. A manifest variable can be continuous or categorical.
The opposite concept is the latent variable.
Mar 29, 2016
Fisher's exact test is (historically) the first permutation test. It is used with two samples of binary data, and tests the null hypothesis that the two samples are drawn from populations with equal but unknown proportions of "successes" (e.g. the proportion of patients recovered without complications among the patients receiving drug A and the patients receiving drug B).
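The test can be computed exactly from hypergeometric probabilities. A sketch, assuming the common two-sided convention of summing the probabilities of all tables (with the same margins) no more probable than the observed one:

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def p_table(x):
        # Hypergeometric probability of the table with x in the top-left
        # cell, given the fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum over all tables at least as "extreme" (no more probable).
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)
```

For Fisher's classic "lady tasting tea" layout [[3, 1], [1, 3]], this gives 34/70 ≈ 0.486, matching standard statistical software.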
Mar 22, 2016
Homoscedasticity generally means equal variation of data, e.g. equal variance.
Mar 15, 2016
Posterior probability is a revised probability that takes into account new available information. For example, let there be two urns, urn A having 5 black balls and 10 red balls, and urn B having 10 black balls and 5 red balls. If an urn is selected at random, the probability that urn A is chosen is 0.5. This is the a priori probability. If we are given the additional information that a ball drawn at random from the selected urn was black, what is the probability that the chosen urn is urn A? Posterior probability takes this additional information into account and revises the probability downward from 0.5 to 0.333 according to Bayes' theorem, because a black ball is more probable from urn B than from urn A.
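The revision from 0.5 to 0.333 follows directly from Bayes' theorem; a quick check using the urn contents given above:

```python
# Bayes' theorem for the two-urn example.
p_A = p_B = 0.5                  # prior: either urn is equally likely
p_black_given_A = 5 / 15         # urn A: 5 black balls out of 15
p_black_given_B = 10 / 15        # urn B: 10 black balls out of 15

# Total probability of drawing a black ball.
p_black = p_A * p_black_given_A + p_B * p_black_given_B

# P(A | black) = P(black | A) P(A) / P(black)
posterior_A = p_A * p_black_given_A / p_black
print(round(posterior_A, 3))  # → 0.333
```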
Mar 8, 2016
In an experiment, an arm is a treatment protocol: for example, drug A, or placebo. In medical trials, an arm corresponds to a patient group receiving a specified therapy. The term is also relevant for bandit algorithms for web testing, where an arm consists of a specific web treatment or offer. Assigning a web visitor to an arm (per an algorithm) is analogous to pulling one of several arms on a slot machine (also called, in this context, a "multi-armed bandit").
Mar 1, 2016
A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued. An example might be a binary matrix used to power web searches: columns representing search terms and rows representing searches, with cells populated by 1's or 0's (presence or absence of the term in that row's search). Obviously most values are going to be 0, since each search involves only a tiny minority of terms. Computational methods can compress sparse matrices (taking advantage of the large expanse of 0-valued entries, which needn't all be represented individually), rendering computation feasible for the large datasets required in web search prediction.
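One simple compression scheme is the "dictionary of keys" representation, which stores only the nonzero cells. A minimal sketch (production libraries use more elaborate formats, but the idea is the same):

```python
class SparseMatrix:
    """Dictionary-of-keys sparse matrix: only nonzero cells are stored."""

    def __init__(self, nrows, ncols):
        self.shape = (nrows, ncols)
        self.cells = {}                  # (row, col) -> nonzero value

    def __setitem__(self, key, value):
        if value:
            self.cells[key] = value
        else:
            self.cells.pop(key, None)    # zeros are simply not stored

    def __getitem__(self, key):
        return self.cells.get(key, 0)    # an absent cell reads as 0

# A million searches by 50,000 terms: 50 billion logical cells,
# but memory is used only for the 1's actually present.
m = SparseMatrix(1_000_000, 50_000)
m[12, 345] = 1                           # term 345 occurred in search 12
```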
Feb 25, 2016
We continue our effort to shed light on potentially confusing usage of terms in the different data science communities.
In statistics, a sample is a collection of observations or records. It is often, but not always, randomly drawn. In matrix form, the rows are records (subjects), columns are variables, and cell values are the values for a particular variable for a particular subject. The sample is the matrix: a collection of rows with their values.
In machine learning and artificial intelligence, a sample might refer to the above, but it also might refer to a single record (row).
Feb 16, 2016
With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities.
In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation. When there are multiple variables in an analysis, normalization (also called standardization) removes scale as a factor. For example, it would ensure that the analysis does not change if a particular distance were measured in feet instead of miles.
In the database community, normalization refers to the process of organizing data into a relational database with tables that key to each other, and minimize redundancy.
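The statistical/machine-learning sense of normalization can be sketched in a few lines; the distances below are made up, but they illustrate the scale-invariance point:

```python
import statistics

def normalize(values):
    """Z-score standardization: subtract the mean, divide by the SD."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

miles = [1.0, 2.5, 4.0, 10.0]
feet = [v * 5280 for v in miles]   # the same distances on a different scale

# After normalization, the two versions of the variable are identical:
# the unit of measurement (feet vs. miles) has been removed as a factor.
print(normalize(miles) == normalize(feet) or
      all(abs(a - b) < 1e-9 for a, b in zip(normalize(miles), normalize(feet))))
```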
Feb 9, 2016
The Kolmogorov-Smirnov one-sample test is a goodness-of-fit test: it tests whether an observed dataset is consistent with a hypothesized theoretical distribution. The test involves specifying the cumulative frequency distribution which would occur given the theoretical distribution and comparing it with the observed cumulative frequency distribution.
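The comparison boils down to the K-S statistic D, the largest vertical gap between the empirical and theoretical cumulative distribution functions. A sketch of the statistic itself (converting D to a p-value requires the K-S distribution, omitted here):

```python
def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: D = sup |F_emp - F_theor|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check the
        # gap on both sides of the jump.
        d = max(d, f - (i - 1) / n, i / n - f)
    return d

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)   # CDF of Uniform(0, 1)
```

For example, the sample [0.1, 0.2, 0.3] tested against Uniform(0, 1) gives D = 0.7: the empirical CDF reaches 1.0 already at x = 0.3, where the uniform CDF is only 0.3.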
Feb 2, 2016
Cohort data records multiple observations over time for a set of individuals or units tied together by some event (say, born in the same year). See also longitudinal data and panel data.
Jan 26, 2016
A loss function specifies a penalty for an incorrect estimate from a statistical model. Typical loss functions might specify the penalty as a function of the difference between the estimate and the true value, or simply as a binary value depending on whether the estimate is accurate within a certain range.
Jan 19, 2016
Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model. In other words, endogenous variables have explicit causes within the model.
The concept of an endogenous variable is fundamental in path analysis and structural equation modeling.
The complementary concept is the exogenous variable.
Note: the classification of a particular variable as endogenous depends on the chosen causal model; the same variable may be endogenous in one model and exogenous in another model based on the same set of variables.
Jan 12, 2016
Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables.
Consider for example a simple linear model:
y = a0 + a1*x1 + a2*x2 + e

where y is the dependent variable, x1 and x2 are independent variables, and e is the contribution of all other variables and factors. Linear regression analysis allows you to establish the proportion of the variance of y explained by variables x1 and x2 combined.
Methods of causal analysis attempt to partition the combined effect of x1 and x2 into meaningful and mutually exclusive components. Path analysis and commonality analysis are examples of causal modeling techniques.
Strictly speaking, the actual causal relations cannot be derived unambiguously from such data. The term "causal" should be understood as a metaphor for some mathematical relations between the variables, or as only one of many reasonable models for the actual causal relations.
Jan 5, 2016
A time series x_t is said to be nonstationary if its statistical properties depend on time. The opposite concept is a stationary time series. Most real-world time series are nonstationary.
An example of a nonstationary time series is a record of readings of the atmospheric temperature measured every 10 seconds with some random errors that have a constant distribution with zero mean. At any given time point the mean of the readings is equal to the true temperature. On the other hand, the mean value itself changes with time, because the true temperature varies with time.
Dec 22, 2015
Six sigma means literally six standard deviations. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any given point in time if the mean of the sample at that time is within six standard deviations of the overall process mean to that point. In this case, "standard deviation" means the standard deviation of the sample mean. Six sigmas (= six standard deviations) is a very broad range, and the use of six sigmas, rather than three sigmas, was popularized by Motorola. It poses substantial demands on the manufacturing process to limit variability of output so that a six-sigma-wide band lies within the limits of an acceptable process.
Nov 24, 2015
Psychometrics or psychological testing is concerned with quantification (measurement) of human characteristics, behavior, performance, health, etc., as well as with design and analysis of studies based on such measurements. An example of the problems being solved in psychometrics is the measurement of intelligence via "IQ" scores.
Statistical methods are widely used in psychometrics. Some of the methods (e.g. factor analysis) were developed first in psychometric research. Psychometrics uses various types of surveys and tests to obtain the primary data, and a broad spectrum of statistical methods to analyze the data. Controlled experiments and design of experiment methods are widely used in psychometrics (in contrast, say, to econometrics).
Nov 17, 2015
Azure is the Microsoft Cloud Computing Platform and Services. ML stands for Machine Learning, one of the services. Like other cloud computing services, you purchase it on a metered basis: as of 2015, there was a per-prediction charge and a compute time charge. As of October 2015, it featured a couple dozen algorithms and could be used in studio mode for piloting, or via API for production.
Nov 10, 2015
Categorical variables are non-numeric "category" variables, e.g. color. Ordered categorical variables are category variables that have a quantitative dimension that can be ordered but is not on a regular scale. Doctors rate pain on a scale of 1 to 10: a "2" has no particular numeric content, nor does a "3," but we can say that 3 represents more pain than 2 (though probably not 50% more, nor, necessarily, the same increment that moving from 9 to 10 represents). Ordered categorical variables can often be successfully used in statistical modeling, even though they may not meet the strict requirements associated with a particular method.
Nov 3, 2015
Bimodal literally means "two modes" and is typically used to describe distributions of values that have two centers. For example, the distribution of heights in a sample of adults might have two peaks, one for women and one for men.
Oct 27, 2015
HDFS is the Hadoop Distributed File System. It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.
Oct 20, 2015
The Kruskal-Wallis test is a nonparametric test of whether three or more independent samples come from populations having the same distribution. It is a nonparametric version of one-way ANOVA.
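The test statistic H is computed from the ranks of the pooled observations. A minimal sketch, ignoring the correction for tied values that full implementations apply:

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction in this sketch)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i for i, v in enumerate(pooled, start=1)}  # assumes no ties
    n = len(pooled)
    # H = 12 / (N(N+1)) * sum_i (R_i^2 / n_i) - 3(N+1),
    # where R_i is the sum of ranks in group i.
    h = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * h - 3 * (n + 1)
```

For three well-separated groups such as [1, 2, 3], [4, 5, 6], [7, 8, 9], the statistic is large (H = 7.2 here), which would lead to rejecting the hypothesis of identical distributions when compared against the appropriate chi-square reference.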
Oct 13, 2015
A statistical technique that helps in inferring whether three or more samples might come from populations having the same mean; specifically, whether the differences among the samples might be caused by chance variation.
Oct 6, 2015
A two-tailed test is a hypothesis test in which the null hypothesis is rejected if the observed sample statistic is more extreme than the critical value in either direction (higher than the positive critical value or lower than the negative critical value). A two-tailed test has two critical regions.
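For a standard normal test statistic, the two-tailed p-value doubles the single-tail probability beyond |z|; a minimal sketch (the helper name two_tailed_p is just for illustration):

```python
import math

# Two-tailed p-value for a standard normal test statistic:
# double the probability beyond |z| in one tail.
def two_tailed_p(z):
    upper = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # P(Z <= |z|)
    return 2 * (1 - upper)

p = two_tailed_p(1.96)  # roughly 0.05
```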
Sep 29, 2015
In psychometric surveys, the split-halves method is used to measure the internal consistency reliability of survey instruments, e.g. psychological tests.
The idea is to split the items (questions) related to the same construct being measured, e.g. the anxiety level, and to compare the results obtained from the two resulting subsets of items. The closer the results, i.e. the scores of the construct being measured (e.g. the anxiety level), the greater the internal consistency reliability of the survey instrument.
The correlation coefficient between the two sets of measurements is often used as a quantitative measure of the internal consistency reliability of a survey instrument.
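A minimal sketch of that last step, correlating hypothetical odd-item and even-item half scores for five respondents (all numbers made up):

```python
# Split-half reliability sketch: correlation between scores on two halves
# of a hypothetical questionnaire, one entry per respondent.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

odd_half = [10, 14, 9, 16, 12]   # scores on odd-numbered items
even_half = [11, 15, 8, 17, 13]  # scores on even-numbered items
r = pearson_r(odd_half, even_half)  # high r = high internal consistency
```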
Sep 22, 2015
In survival analysis, life tables summarize lifetime data or, generally speaking, time-to-event data. Rows in a life table usually correspond to time intervals, and columns to the following categories: (i) not "failed", (ii) "failed", (iii) censored (withdrawn); the sum of the three is called "the number at risk". Each cell contains the number of units in each category (column) for a given interval (row).
Sep 15, 2015
Truncation, generally speaking, means to shorten. In statistics it can mean the process of limiting consideration or analysis to data that meet certain criteria (for example, the patients still alive at a certain point). Or it can refer to a data distribution in which values above or below a certain point have been eliminated (or cannot occur). It can also refer to the elimination (not rounding) of digits beyond a certain number of places past the decimal.
Sep 8, 2015
This test is used for testing the significance of unplanned pairwise comparisons. When you do multiple significance tests, the chance of finding a "significant" difference just by chance increases. Tukey's HSD test is one of several methods of ensuring that the chance of finding a significant difference in any comparison (under a null model) is maintained at the alpha level of the test. In other words, it preserves "familywise Type I error."
Sep 1, 2015
A robust filter is a filter that is not sensitive to input noise values of extremely large magnitude (e.g. those arising from anomalous measurement errors). The median filter is an example of a robust filter. Linear filters are not robust: their output may be degraded by a small fraction of large errors in the input.
Aug 25, 2015
Hypothesis testing (also called "significance testing") is a statistical procedure for discriminating between two statistical hypotheses: the null hypothesis (H_{0}) and the alternative hypothesis (H_{a}, often denoted H_{1}). In a formal logic sense, hypothesis testing rests on the presumption of validity of the null hypothesis; that is, the null hypothesis is rejected only if the data at hand testify strongly enough against it.
Aug 18, 2015
Kurtosis measures the "heaviness of the tails" of a distribution (compared to a normal distribution). Kurtosis is positive if the tails are "heavier" than for a normal distribution, and negative if the tails are "lighter" than for a normal distribution. The normal distribution has a kurtosis of zero.
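The zero-for-normal convention above is "excess" kurtosis: the fourth standardized moment minus 3. A minimal sketch on illustrative data:

```python
# Excess kurtosis (normal distribution = 0): the fourth standardized
# moment minus 3. Light-tailed data give a negative value.
def excess_kurtosis(xs):
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / var ** 2 - 3

k = excess_kurtosis([1, 2, 3, 4, 5])  # flat, light-tailed data
```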
Aug 11, 2015
A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, not significant (a TypeI error). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).
Jul 23, 2015
The signal is the component of the observed data that carries useful information.
Jul 14, 2015
Nonparametric regression methods are aimed at describing a relationship between the dependent and independent variables...
Jul 7, 2015
A nominal scale is really a list of categories to which objects can be classified.
Jun 30, 2015
The noise is the component of the observed data (e.g. of a time series) that is random and carries no useful information.
Jun 23, 2015
The single linkage clustering method (or the nearest neighbor method) is a method of calculating distance between clusters in hierarchical cluster analysis.
Jun 16, 2015
In a network analysis context, "edge" refers to a link or connection between two entities in a network.
Jun 9, 2015
The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available. Individuals and teams then compete to develop the best performing model.
Jun 2, 2015
The linear model is ubiquitous in classical statistics, yet reallife data rarely follow a purely linear pattern.
May 26, 2015
Association rules, also called "market basket analysis," is a data mining method applied to transaction data.
May 19, 2015
This week's word is actually a letter. R is a statistical computing and programming language and environment, a derivative of the commercial S-PLUS program, which, in turn, was an offshoot of S from Bell Labs.
May 12, 2015
With the advent of Big Data and data mining, statistical methods like regression and CART have been repurposed to use as tools in predictive modeling.
May 5, 2015
The Netflix prize was a famous early application of crowdsourcing to predictive modeling.
Apr 28, 2015
An AB test is a classic statistical design in which individuals or subjects are randomly split into two groups and some intervention or treatment is applied.
Apr 21, 2015
In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
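A minimal sketch of the forecast on a hypothetical series, with w = 3:

```python
# Moving-average forecast: the forecast for the next period is the
# mean of the last w observed values.
def moving_average_forecast(series, w):
    return sum(series[-w:]) / w

history = [20, 22, 21, 23, 24]           # hypothetical observations
forecast_next = moving_average_forecast(history, w=3)  # mean of 21, 23, 24
```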
Apr 14, 2015
In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.
Apr 7, 2015
A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).
Mar 31, 2015
RMSE is root mean squared error. In predicting a numerical outcome with a statistical model, predicted values rarely match actual outcomes exactly.
Mar 24, 2015
A label is a category into which a record falls, usually in the context of predictive modeling. Label, class and category are different names for discrete values of a target (outcome) variable.
Mar 17, 2015
Spark is a second generation computing environment that sits on top of a Hadoop system, supporting the workflows that leverage a distributed file system.
Mar 10, 2015
Bandits refers to a class of algorithms in which users or subjects make repeated choices among, or decisions in reaction to, multiple alternatives.
Mar 3, 2015
In discrete response models, overdispersion occurs when there is more correlation in the data than is allowed by the assumptions that the model makes.
Feb 24, 2015
In a classification model, the confusion matrix shows the counts of correct and erroneous classifications. In a binary classification problem, the matrix consists of 4 cells.
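Those 4 cells can be counted directly; a sketch on made-up actual and predicted labels:

```python
# 2x2 confusion matrix counted by hand on made-up binary labels.
# Rows: actual class; columns: predicted class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
matrix = [[tp, fn], [fp, tn]]
```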
Feb 20, 2015
A strip transect is a small subsection of a geographically defined study area, typically chosen randomly.
Feb 17, 2015
In a classic statistical experiment, treatment(s) and placebo are applied to randomly assigned subjects, and, at the end of the experiment, outcomes are compared.
Feb 10, 2015
Classification and regression trees, applied to data with known values for an outcome variable, derive models with rules like "If taxable income < $80,000, if no Schedule C income, if standard deduction taken, then no audit."
Feb 3, 2015
The predictors in a predictive model are sometimes given different terms by different disciplines. Traditional statisticians think in terms of variables.
Jan 27, 2015
In logistic regression, we seek to estimate the relationship between predictor variables Xi and a binary response variable. Specifically, we want to estimate the probability p that the response variable will be a 0 or a 1.
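A sketch of how a fitted model turns the linear predictor into that probability, using hypothetical coefficients b0 and b1 (not estimates from any real data):

```python
import math

# Logistic regression link: log(p / (1 - p)) = b0 + b1 * x,
# so p = 1 / (1 + exp(-(b0 + b1 * x))). Coefficients are illustrative.
def predicted_prob(x, b0=-1.0, b1=0.5):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

p = predicted_prob(2.0)  # linear predictor is 0 here, so p = 0.5
```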
Jan 20, 2015
Bayesian statistics typically incorporates new information (e.g. from a diagnostic test, or a recently drawn sample) to answer a question of the form "What is the probability that..."
Jan 13, 2015
Consider two (or more) samples subjected to different treatments. A permutation test assesses whether,
Jan 6, 2015
One avid reader took issue with a recent definition of "quasi experiment." I had defined it
Dec 30, 2014
In social science research, particularly in the qualitative literature on program evaluation, the term "quasi-experiment" refers to studies that do not involve the application of treatments via random assignment of subjects.
Dec 23, 2014
In survey research, curbstoning refers to the deliberate fabrication of survey interview data by the interviewer.
Dec 16, 2014
Bag-of-words is a simplified natural language processing concept.
Dec 9, 2014
In language processing, stemming is the process of taking multiple forms of the same word and reducing them to the same basic core form.
Dec 2, 2014
Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).
Nov 25, 2014
In predictive modeling, a key step is to turn available data (which may come from varied sources and be messy) into an orderly matrix of rows (records to be predicted) and columns (predictor variables or features).
Nov 18, 2014
A full Bayesian classifier is a supervised learning technique that assigns a class to a record by finding other records with attributes just like its own, and finding the most prevalent class among them.
Nov 11, 2014
In computer science, MapReduce is a procedure that prepares data for parallel processing on multiple computers.
Nov 4, 2014
Likert scales are categorical ordinal scales used in social sciences to measure attitude. A typical example is a set of response options ranging from "strongly agree" to "strongly disagree."
Oct 28, 2014
A node is an entity in a network. In a social network, it would be a person. In a digital network, it would be a computer or device.
Oct 21, 2014
Latent variable models postulate unobserved (latent) variables that account for relationships among the statistical properties of observable variables.
Oct 14, 2014
K-nearest-neighbor (KNN) is a machine learning predictive algorithm that relies on calculation of distances between pairs of records.
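A minimal sketch of KNN classification on toy two-dimensional records (points and labels made up): find the k records closest to a new point and take a majority vote.

```python
from collections import Counter

# KNN sketch: classify a new point by majority vote among its k nearest
# training records (Euclidean distance). Toy data for illustration.
def knn_predict(train, new_point, k=3):
    # train: list of ((x, y), label) pairs
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    nearest = sorted(train, key=lambda rec: dist(rec[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b")]
label = knn_predict(train, (1, 1), k=3)
```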
Oct 7, 2014
The kappa statistic measures the extent to which different raters or examiners differ when looking at the same data and assigning categories.
Sep 30, 2014
Censoring in time-series data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.
Sep 23, 2014
Survival analysis is a set of methods used to model and analyze survival data, also called time-to-event data.
Sep 16, 2014
The probability distribution for X is the possible values of X and their associated probabilities. With two separate discrete random variables, X and Y, the joint probability distribution is the function f(x,y)
Sep 9, 2014
With a sample of size N, the jackknife involves calculating N values of the estimator, with each value calculated on the basis of the entire sample less one observation.
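A minimal sketch on toy data; each leave-one-out value recomputes the estimator (here, the mean) with one observation removed:

```python
# Jackknife sketch: N leave-one-out estimates of the mean on toy data.
data = [2, 4, 6, 8]

loo_means = [
    sum(data[:i] + data[i + 1:]) / (len(data) - 1)  # mean without obs i
    for i in range(len(data))
]
jackknife_estimate = sum(loo_means) / len(loo_means)
```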
Sep 2, 2014
In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results, say response to a medication.
Aug 26, 2014
A NoSQL database is distinguished mainly by what it is not 
Aug 19, 2014
A similarity matrix shows how similar records are to each other.
Aug 12, 2014
Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).
Aug 5, 2014
A holdout sample is a random sample from a data set that is withheld and not used in the model fitting process. After the model...
Jul 29, 2014
Heteroscedasticity generally means unequal variation of data, e.g. unequal variance. More specifically,
Jul 22, 2014
Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which
Jul 15, 2014
The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful in taking averages of ratios.
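A minimal sketch (math.prod requires Python 3.8+):

```python
import math

# Geometric mean: multiply the n values together, take the nth root.
# Useful for averaging ratios such as growth factors.
def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

g = geometric_mean([2, 8])  # product is 16; square root of 16 is 4
```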
Jul 8, 2014
Hierarchical linear modeling is an approach to the analysis of hierarchical (nested) data, i.e. data represented by categories, subcategories, ..., individual units (e.g. school > classroom > student).
Jul 1, 2014
In medical statistics, the hazard function is a relationship between a proportion and time.
Jun 24, 2014
The Fleming procedure (or O'Brien-Fleming multiple testing procedure) is a simple multiple testing procedure for comparing two treatments when the response to treatment is dichotomous. This procedure...
Jun 17, 2014
In a directed network, connections between nodes are directional. For example..
Jun 10, 2014
An adjacency matrix describes the relationships in a network. Nodes are listed in the top..
Jun 3, 2014
The exponential distribution is a model for the length of intervals between two consecutive random events in time or
May 27, 2014
Error is the deviation of an estimated quantity from its true value, or, more precisely,
May 20, 2014
Stepwise regression is one of several computer-based iterative variable-selection procedures.
May 13, 2014
Regularization refers to a wide variety of techniques used to bring structure to statistical models in the face of data size, complexity and sparseness.
May 6, 2014
SQL stands for structured query language, a high level language for querying relational databases, extracting information.
Apr 29, 2014
A Markov chain is a probability system that governs transition among states or through successive events.
Apr 22, 2014
MapReduce is a programming framework to distribute the computing load of very large data and problems to multiple computers.
Apr 15, 2014
As data processing requirements grew beyond the capacities of even large computers, distributed computing systems were developed to spread the load to multiple computers.
Apr 8, 2014
The curse of dimensionality is the affliction caused by adding variables to multivariate data models.
Apr 1, 2014
A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products.
Mar 25, 2014
Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called independent variables.
Mar 18, 2014
Statistical distance is a measure calculated between two records that are typically part of a larger dataset, where rows are records and columns are variables. To calculate...
Mar 11, 2014
In predictive modeling, the goal is to make predictions about outcomes on a casebycase basis: an insurance claim will be fraudulent or not, a tax return will be correct or in error, a subscriber...
Mar 4, 2014
In the machine learning community, a decision tree is a branching set of rules used to classify a record, or predict a continuous value for a record. For example
Feb 25, 2014
In predictive modeling, feature selection, also called variable selection, is the process (usually automated) of sorting through variables to retain variables that are likely...
Feb 18, 2014
In predictive modeling, bagging is an ensemble method that uses bootstrap replicates of the original training data to fit predictive models.
Feb 11, 2014
In predictive modeling, boosting is an iterative ensemble method that starts out by applying a classification algorithm and generating classifications.
Feb 5, 2014
In predictive modeling, ensemble methods refer to the practice of taking multiple models and averaging their predictions.
Jan 28, 2014
The expected value of a random variable is, in a simple sense, a probability-weighted arithmetic mean of its possible values.
Jan 21, 2014
Exact tests are hypothesis tests that are guaranteed to produce Type I error at or below the nominal alpha level of the test when conducted on samples drawn from a null model.
Jan 14, 2014
In statistical models, error or residual is the deviation of the estimated quantity from its true value: the greater the deviation, the greater the error.
Jan 7, 2014
Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model.
Dec 31, 2013
In a study or experiment with two groups (usually control and treatment), the investigator typically has in mind the magnitude of the difference between the two groups that he or she wants to be able to detect in a hypothesis test.
Dec 24, 2013
In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results, say response to a medication.
Dec 17, 2013
In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true, i.e. of saying an effect or event is statistically significant when it is not.
Dec 10, 2013
A time series x(t), t = 1, 2, ..., is considered to be stationary if its statistical properties do not depend on time t.
Dec 3, 2013
Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).
Nov 26, 2013
Data mining is concerned with finding latent patterns in large databases.
Nov 19, 2013
An observation's z-score tells you the number of standard deviations it lies away from the population mean (and in which direction).
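A minimal sketch, using illustrative values for the population mean and standard deviation:

```python
# z-score: how many standard deviations an observation lies from the
# population mean, and in which direction.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

z = z_score(x=130, mu=100, sigma=15)  # e.g. an IQ-style scale
```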
Nov 12, 2013
In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups, or, more precisely, groups whose members are all close to one another on various dimensions being measured.
Nov 5, 2013
In psychology, a construct is a phenomenon or a variable in a model that is not directly observable or measurable; intelligence is a classic example.
Oct 29, 2013
Collaborative filtering algorithms are used to predict whether a given individual might like, or purchase, an item.
Oct 22, 2013
Longitudinal data records multiple observations over time for a set of individuals or units. A typical..
Oct 15, 2013
Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual. A simple...
Oct 8, 2013
Tokenization is an initial step in natural language processing. It involves breaking down a text into a series of basic units, typically words. For example...
Oct 1, 2013
A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term...
Sep 24, 2013
White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals.
Sep 17, 2013
An edge is a link between two people or entities in a network that can be
Sep 10, 2013
Stratified sampling is a method of random sampling.
Sep 3, 2013
Quoting probabilities without specifying the sample space can result in ambiguity when the sample space is not self-evident.
Aug 27, 2013
A discrete distribution is one in which the data can only take on certain values, for example integers. A continuous distribution is one in which data can take on any value within a specified range (which may be infinite).
Aug 20, 2013
The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the probability distribution of the population from which the sample is drawn.
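A quick simulation sketch: means of samples drawn from a decidedly non-normal uniform(0, 1) population cluster tightly around the population mean of 0.5 (seed fixed for reproducibility):

```python
import random

# CLT illustration: sample means from a uniform(0, 1) population.
# The distribution of the means is approximately normal, centered at 0.5.
random.seed(0)
n, reps = 50, 2000
sample_means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]
grand_mean = sum(sample_means) / reps  # should sit very close to 0.5
```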
Aug 13, 2013
Classification and regression trees (CART) are a set of techniques for classification and prediction.
Aug 6, 2013
CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.
Jul 30, 2013
In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is taken.
Jul 23, 2013
Discriminant analysis is a method of distinguishing between classes of objects. The objects are typically represented as rows in a matrix.
Jul 16, 2013
Also called the training sample, training set, or calibration sample. The context is predictive modeling (also called supervised data mining), where you have data with multiple predictor variables and a single known outcome or target variable.
Jul 9, 2013
A general statistical term meaning a systematic (not random) deviation of an estimate from the true value.
Jul 2, 2013
One of several computer-based iterative procedures for selecting variables to use in a model. The process begins...
Jun 25, 2013
Outcomes to an experiment or repeated events are statistically significant if they differ from what chance variation might produce.
Jun 18, 2013
In multiple comparison procedures, familywise Type I error is the probability that, even if all samples come from the same population, you will wrongly conclude
Jun 11, 2013
A cohort study is a longitudinal study that identifies a group of subjects sharing some attributes (a "cohort") then
Jun 4, 2013
The coefficient of variation is the standard deviation of a data set, divided by the mean of the same data set.
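A minimal sketch, using the population standard deviation on illustrative data:

```python
# Coefficient of variation: standard deviation divided by the mean.
# Population (divide-by-n) standard deviation used here.
def coeff_of_variation(xs):
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sd / m

cv = coeff_of_variation([2, 4, 6])  # spread relative to the mean of 4
```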
May 28, 2013
In regression analysis, the coefficient of determination is a measure of goodness-of-fit (i.e. how well or tightly the data fit the estimated model). The coefficient is
May 21, 2013
An estimator is a measure or metric intended to be calculated from a sample drawn from a larger population...
May 14, 2013
In regression analysis, collinearity of two variables means that strong correlation exists between them, making it difficult or impossible to estimate their individual regression coefficients reliably.
May 7, 2013
A cohort study is a longitudinal study that identifies a population or large group (a "cohort") then draws a sample from the population at various points in time and records data for the sample.
Apr 30, 2013
The centroid is a measure of center in multidimensional space.
Apr 23, 2013
Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application
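A minimal sketch of the resampling step on toy data (seed fixed for reproducibility): each bootstrap sample is drawn with replacement, and the spread of the resampled means estimates the variability of the sample mean.

```python
import random

# Bootstrap sketch: resample with replacement to gauge the variability
# of the sample mean. Toy data for illustration.
random.seed(1)
data = [3, 5, 7, 9, 11]
boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]  # same size, with replacement
    boot_means.append(sum(resample) / len(resample))
```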
Apr 16, 2013
A Binomial distribution is used to describe an experiment, event, or process for which the probability of success is the same for each trial and each trial has only two possible outcomes.
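The probability of exactly k successes in n such trials follows directly from that definition; a minimal sketch (math.comb requires Python 3.8+):

```python
import math

# Binomial probability: P(exactly k successes in n trials), with the
# same success probability p on every trial.
def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

prob = binom_pmf(k=2, n=4, p=0.5)  # exactly 2 heads in 4 fair coin flips
```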
Apr 9, 2013
A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments.
Apr 2, 2013
Network analytics is the science of describing and, especially, visualizing the connections among objects.
Mar 26, 2013
Multiplicity issues arise in a number of contexts, but they generally boil down to the same thing: repeated looks at a data set in different ways, until something "statistically significant" emerges.
Mar 19, 2013
Support vector machines are used in data mining (predictive modeling, to be specific) for classification of records, by learning from training data.
Mar 12, 2013
In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another. It might
Mar 5, 2013
The negative binomial distribution is the probability distribution of the number of Bernoulli (yes/no) trials required to obtain r successes.
Feb 26, 2013
A random walk is a process of random steps, motions, or transitions. It might be in one dimension (movement along a line), in two dimensions (movements in a plane), or in three dimensions or more.
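A minimal sketch of a one-dimensional walk, with each step +1 or -1 with equal probability (seed fixed for reproducibility):

```python
import random

# One-dimensional random walk: each step moves +1 or -1 along a line
# with equal probability.
random.seed(42)
position = 0
path = [position]
for _ in range(100):
    position += random.choice([-1, 1])
    path.append(position)
```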
Feb 19, 2013
Cover time is the expected number of steps in a random walk required
Feb 12, 2013
is a general computer-intensive approach used in estimating the accuracy of statistical models.
Feb 5, 2013
(also called dissimilarity matrix) describes pairwise distinction between M objects.
Jan 29, 2013
in discrete time is the transformation of the series to a new time series where the values are the differences between consecutive values of the original series.
Jan 22, 2013
(outcome or variable) means "having only two possible values", e.g.
Jan 7, 2013
A probability density function is a curve used
Jan 1, 2013
In predictive modeling, data partitioning is the division of the data available for analysis into two or three non-overlapping
Dec 27, 2012
Nov 12, 2012
New Editor of Journal of Statistics Education
Oct 28, 2012
Read Peter's Letter to the Editor in Saturday's Washington Post.
Jun 28, 2012
Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.
May 24, 2012
Evidence shows that there is no significant difference between taking an online introductory statistics course and a traditional in-person class.
May 18, 2012
Facebook began trading around 11:30 this morning, and I spent 8 minutes
May 14, 2012
Newly elected American Statistical Association (ASA) Fellow, and recognized for his outstanding professional contributions to and leadership in the field of statistical science.
Apr 23, 2012
Arizona's immigration law goes before the Supreme Court this week...
Mar 15, 2012
I saw this job posting a while ago, and, in my next life,
Feb 21, 2012
David Unwin, Emeritus Chair in Geography, Birkbeck College, University of London (and instructor at Statistics.com!) will be awarded the Association of American Geographers (AAG) Ronald F. Abler Distinguished Service Honors at the upcoming annual meeting next week.
Feb 13, 2012
February 12 was the 80th anniversary of the birth of Julian Simon, an early pioneer in resampling methods.
Jan 17, 2012
Statistics for Future Presidents: Steve Pierson, Director of Science Policy at the ASA, wrote an interesting blog post wondering how statistics for future presidents (or policymakers more generally) would compare with the recommended statistical skills/concepts for others. Take a look and let him know!
Jan 6, 2012
Teaching Geographic Information Science and Technology in Higher Education, 2012 (Wiley)
Nov 29, 2011
The stories of the prospective Facebook IPO, and of prior IPOs from LinkedIn, Pandora, and Groupon, all involve "data scientists". Read an interview with Monica Rogati, Senior Data Scientist at LinkedIn, to see the connection.
Oct 25, 2011
Dr. Michelle Everson is recognized for her outstanding contributions to and innovation in the teaching of elementary statistics.
Sep 30, 2011
John Elder's presentations on common data mining mistakes are a must-see if you have any experience or plans in the data mining arena.
Sep 13, 2011
"Any claim coming from an observational study is most likely to be wrong." Thus begins "Deming, data and observational studies," just published in "Significance Magazine" (Sept. 2011).
Aug 31, 2011
I was watching a Washington Nationals game on TV a couple of days ago, and the concept of "expected value" ...
Jul 15, 2011
A neurosurgeon, pathologist and epidemiologist are each told to examine a can of sardines on a table in a closed room, and present a report.
Jun 14, 2011
What do teenagers want? More importantly for the music industry, what music will they buy?
Apr 5, 2011
Advertisers shy away from round numbers, believing that $99 appears significantly cheaper than $100...
Mar 22, 2011
Did the NCAA get the March Madness rankings right? Check out SportsMeasures.com
Jan 24, 2011
What does Matt Asher's article "Attack of the Hair Trigger Bees" have to do with global warming? Matt Asher runs statisticsblog.com ...
Jan 13, 2011
The first Gallup Poll was published in October 1935. In America Speaks,
Jan 5, 2011
Thinking about careers that use statistics? The job title "catastrophe modeling assistant" caught my eye recently in a job announcement. ...
Dec 27, 2010
One of my gifts this holiday season was "The Drunkard's Walk: How Randomness Rules Our Lives,"