## SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

## SPLINE

The easiest way to think of a spline is to first think of linear regression – a single linear relationship between an outcome variable and various predictor variables.
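A minimal sketch of that starting point (hypothetical data, assuming NumPy): a single linear fit over the whole range of the predictor. A spline, by contrast, would fit separate low-degree polynomials on sub-ranges between "knots" and join them smoothly.

```python
import numpy as np

# Hypothetical data: one outcome y and one predictor x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Linear regression: one slope and one intercept for the entire range
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2))  # close to 2.0
```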

## OVERFIT

As applied to statistical models, “overfit” means the model fits the training data too closely, capturing noise rather than signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:
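The same effect can be seen numerically (a sketch with made-up points, assuming NumPy): a degree-4 polynomial passes through five data points exactly, i.e. zero training error, while a straight line leaves some residual error but would generalize better.

```python
import numpy as np

# Five hypothetical noisy points
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.7, 2.2, 4.1, 3.8])

# Degree-1 fit: some error on the training data
line = np.poly1d(np.polyfit(x, y, deg=1))

# Degree-4 polynomial: interpolates all five points exactly -- the
# classic overfit, "too accurate" on the data it was trained on
curve = np.poly1d(np.polyfit(x, y, deg=4))

print(np.max(np.abs(curve(x) - y)) < 1e-6)   # True: zero training error
print(np.max(np.abs(line(x) - y)) > 0.1)     # True: the line has error
```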

## Week #18 – n

In statistics, “n” denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

## Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task. Some corpora are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms. More typically, the corpus is a body of documents for a specific text mining task – e.g. a set of …

## Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about the underlying causal relationships between the dependent and independent variables. Consider, for example, a simple linear model: y = a0 + a1 x1 + a2 x2 + e, where y is the dependent variable, x1 and x2 are independent variables, and e is the contribution of all other …

## Week #10 – Arm

In an experiment, an arm is a treatment protocol – for example, drug A, or placebo. In medical trials, an arm corresponds to a patient group receiving a specified therapy. The term is also relevant for bandit algorithms for web testing, where an arm consists of a specific web treatment or offer. Assigning a web …

## Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued. An example might be a binary matrix used to power web searches – columns representing search terms and rows representing searches, and cells populated by 1’s or 0’s (presence or absence …
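Sparse-matrix formats exploit this emptiness by storing only the nonzero cells. A minimal sketch (hypothetical search data, assuming SciPy):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical search data: rows = searches, columns = search terms,
# 1 = term present in the search, 0 = absent
dense = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
])

# CSR format stores only the nonzero entries, not the full grid of cells
sparse = csr_matrix(dense)
print(sparse.nnz)  # 3 nonzero cells out of 15
```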

## Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records. It is often, but not always, randomly drawn. In matrix form, the rows are records (subjects), columns are variables, and cell values are the values …
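The statistical sense of the word can be shown in a couple of lines (a sketch with a hypothetical population, using Python's standard library):

```python
import random

# Hypothetical "population" of 10 records
records = list(range(10))

# A statistical sample: 4 records drawn at random, without replacement
sample = random.sample(records, k=4)
print(len(sample))  # 4
```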

## Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation. When there are multiple variables in an analysis, normalization (also called standardization) removes …

## Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

## Week #42 – Kruskal-Wallis Test

The Kruskal-Wallis test is a nonparametric test of whether three or more independent samples come from populations having the same distribution. It is a nonparametric counterpart of one-way ANOVA.
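A minimal sketch of running the test (made-up samples, assuming SciPy):

```python
from scipy.stats import kruskal

# Three hypothetical independent samples
a = [6.1, 7.2, 5.9, 6.8]
b = [6.0, 7.1, 6.2, 6.6]
c = [6.3, 6.9, 6.1, 6.5]

# H0: all three samples come from populations with the same distribution
stat, p = kruskal(a, b, c)
print(p > 0.05)  # True: similar groups, so we fail to reject H0
```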

## Week #32 – False Discovery Rate

A “discovery” is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, false positives (Type I errors). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).
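A standard way to *control* the false discovery rate (not described in the entry above, included here as a sketch) is the Benjamini-Hochberg procedure: sort the p-values and declare as discoveries all tests up to the largest k whose sorted p-value is at most (k/m)·q.

```python
import numpy as np

def bh_discoveries(p_values, q=0.05):
    """Benjamini-Hochberg sketch: flag discoveries so the expected
    false discovery rate is at most q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest k with p_(k) <= (k/m) * q; those k tests are discoveries
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    flags = np.zeros(m, dtype=bool)
    flags[order[:k]] = True
    return flags

# Hypothetical p-values from five hypothesis tests
print(bh_discoveries([0.001, 0.008, 0.039, 0.2, 0.9]))
```

With these made-up p-values only the first two clear their thresholds (0.01 and 0.02), so only they are flagged.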

## Week #23 – Netflix Contest

The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available.  Individuals and teams then compete to develop the best performing model.

## Week #20 – R

This week’s word is actually a letter.  R is a statistical computing and programming language and environment, an open-source implementation of the S language developed at Bell Labs; the commercial S-PLUS program is another offshoot of S.

## Week #16 – Moving Average

In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average of the values from the w periods ending at time t-1.
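As a sketch (hypothetical monthly series, assuming NumPy), the one-step-ahead forecast is just the mean of the last w observations:

```python
import numpy as np

def moving_average_forecast(series, w):
    # Forecast for time t = mean of the w values ending at time t-1
    return np.mean(series[-w:])

# Hypothetical monthly series; forecast the next period with w = 3
demand = [120, 130, 125, 135, 140, 138]
print(moving_average_forecast(demand, w=3))  # mean of 135, 140, 138
```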

## Week #15 – Interaction term

In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.
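In code, an interaction term is just an extra column equal to the product of the two variables. A sketch with noise-free synthetic data (assuming NumPy) in which the true interaction coefficient is 4:

```python
import numpy as np

# Hypothetical data where the effect of x1 depends on the level of x2
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 200)
x2 = rng.uniform(0, 1, 200)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2  # true interaction = 4

# Design matrix with an interaction column x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coefs, 6))  # recovers [1, 2, 3, 4]
```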

## Week #14 – Naive forecast

A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).
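The most common example is the "last value" forecast, which needs no model at all (hypothetical data):

```python
# A naive forecast: simply repeat the most recent observed value
def naive_forecast(series):
    return series[-1]

sales = [100, 105, 103, 110]
print(naive_forecast(sales))  # 110
```

Naive forecasts are useful as baselines: a statistical model that cannot beat them is adding nothing.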

## Week #9 – Overdispersion

In discrete response models, overdispersion occurs when there is more variation in the data than the model’s assumptions allow.
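For example, a Poisson model assumes the variance equals the mean; count data whose variance exceeds the mean are overdispersed relative to it. A simulation sketch (assuming NumPy; the gamma-mixed Poisson is one standard way to generate overdispersed counts):

```python
import numpy as np

rng = np.random.default_rng(42)

# Poisson counts: variance roughly equals the mean (no overdispersion)
poisson = rng.poisson(lam=5.0, size=10_000)

# Letting the rate itself vary (a gamma-Poisson mixture, i.e. negative
# binomial) yields counts whose variance exceeds their mean
rates = rng.gamma(shape=2.0, scale=2.5, size=10_000)  # mean rate still 5
overdispersed = rng.poisson(lam=rates)

print(poisson.var() / poisson.mean())            # near 1
print(overdispersed.var() / overdispersed.mean())  # well above 1
```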