#### Week #34 – Hypothesis Testing

Hypothesis testing (also called "significance testing") is a statistical procedure for discriminating between two statistical hypotheses: the null hypothesis (H0) and the alternative hypothesis (Ha, often denoted H1). In a formal-logic sense, hypothesis testing rests on the presumption that the null hypothesis is valid; that is, the null hypothesis is rejected only if the data at hand testify strongly enough against it.

#### Week #33 – Kurtosis

Kurtosis measures the "heaviness of the tails" of a distribution (compared to a normal distribution). Kurtosis is positive if the tails are "heavier" than for a normal distribution, and negative if the tails are "lighter" than for a normal distribution. Under this convention (often called excess kurtosis), the normal distribution has kurtosis of zero.
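
As a rough illustration, the sketch below computes kurtosis as the fourth standardized moment minus 3, the convention under which a normal distribution scores zero. The function name and the two toy samples are assumptions made up for this example, and the population (divide-by-n) moments are used for simplicity.

```python
def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (population moments, divide by n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

# Hypothetical samples: evenly spread values versus mostly-central values
# with two extreme observations.
flat = [1, 2, 3, 4, 5, 6]
heavy = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]

print(excess_kurtosis(flat))   # negative: lighter tails than a normal
print(excess_kurtosis(heavy))  # positive: heavier tails than a normal
```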

#### Week #32 – False Discovery Rate

A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries for which the null hypothesis is, in reality, true (each such discovery is a Type I error). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).

#### Week #31 – Skewness

#### Week #30 – Icon Plots

#### Week #29 – Signal

The signal is the component of the observed data that carries useful information.

#### Week #28 – Non-parametric Regression

Non-parametric regression methods are aimed at describing a relationship between the dependent and independent variables...

#### Week #27 – Nominal scale

A nominal scale is really a list of categories to which objects can be classified.

#### Week #26 – Noise

The noise is the component of the observed data (e.g. of a time series) that is random and carries no useful information.

#### Week #25 – Nearest Neighbor Clustering

The single linkage clustering method (or the nearest neighbor method) is a method of calculating distance between clusters in hierarchical cluster analysis.
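
Under single linkage, the distance between two clusters is taken to be the smallest distance between any member of one and any member of the other. The sketch below illustrates that rule; the function name, the use of Euclidean distance, and the sample points are assumptions chosen for the example.

```python
import math

def single_linkage_distance(cluster_a, cluster_b):
    """Smallest Euclidean distance between any pair of points, one per cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical two-dimensional clusters.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (9, 0)]

# The closest pair is (1, 0) and (4, 0).
print(single_linkage_distance(cluster_a, cluster_b))  # 3.0
```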

#### Week #24 – Edge

In a network analysis context, "edge" refers to a link or connection between two entities in a network.

#### Week #23 – Netflix Contest

The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available.  Individuals and teams then compete to develop the best performing model.

#### Week #22 – Splines

The linear model is ubiquitous in classical statistics, yet real-life data rarely follow a purely linear pattern.

#### Week #21 – Association Rules

Association rules, also called "market basket analysis," is a data mining method applied to transaction data.

#### Week #20 – R

This week's word is actually a letter.  R is a statistical computing and programming language and environment, an open-source implementation of the S language that originated at Bell Labs; the commercial S-PLUS program is another offshoot of S.

#### Week #19 – Prediction vs. Explanation

With the advent of Big Data and data mining, statistical methods like regression and CART have been repurposed for use as tools in predictive modeling.

#### Week #18 – Netflix Prize

The Netflix prize was a famous early application of crowdsourcing to predictive modeling.

#### Week #17 – A-B Test

An A-B test is a classic statistical design in which individuals or subjects are randomly split into two groups and some intervention or treatment is applied.

#### Week #16 – Moving Average

In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
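
The definition can be sketched in a few lines. Here `series` is assumed to hold the observations up to and including time t-1, so the returned value is the forecast for time t; the function name and sample figures are made up for illustration.

```python
def moving_average_forecast(series, w):
    """Forecast the next period as the mean of the last w observed values."""
    if w <= 0 or w > len(series):
        raise ValueError("w must be between 1 and the number of observations")
    return sum(series[-w:]) / w

# Hypothetical observations for periods 1 through 4.
series = [10, 12, 14, 16]

# Forecast for period 5 with a window of w = 3: (12 + 14 + 16) / 3.
print(moving_average_forecast(series, 3))  # 14.0
```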

#### Week #15 – Interaction term

In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.

#### Week #14 – Naive forecast

A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).

#### Week #13 – RMSE

RMSE is root mean squared error.  In predicting a numerical outcome with a statistical model, predicted values rarely match actual outcomes exactly.
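
Reading the name from right to left gives the recipe: take the errors, square them, average them, and take the square root. A minimal sketch, with a function name and sample values assumed for the example:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical outcomes and predictions; every prediction is off by exactly 1.
actual = [3, 5, 7, 9]
predicted = [4, 4, 8, 8]

print(rmse(actual, predicted))  # 1.0
```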

#### Week #12 – Label

A label is a category into which a record falls, usually in the context of predictive modeling.  Label, class and category are different names for discrete values of a target (outcome) variable.

#### Week #7.5 – Strip transect

A strip transect is a small subsection of a geographically-defined study area, typically chosen randomly.

#### Week #11 – Spark

Spark is a second generation computing environment that sits on top of a Hadoop system, supporting the workflows that leverage a distributed file system.

#### Week #10 – Bandits

Bandits refers to a class of algorithms in which users or subjects make repeated choices among, or decisions in reaction to, multiple alternatives.

#### Week #9 – Overdispersion

In discrete response models, overdispersion occurs when the data show more variability than the model's assumptions allow, often because of correlation among the observations.

#### Week #8 – Confusion matrix

In a classification model, the confusion matrix shows the counts of correct and erroneous classifications.  In a binary classification problem, the matrix consists of 4 cells.
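
For the binary case, the four cells can be tallied as in the sketch below. The cell labels (TP, FP, FN, TN), the function name, and the sample data are assumptions chosen for illustration.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    cells = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the record is actually positive.
            cells["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: an error if the record is actually positive.
            cells["FN" if a == positive else "TN"] += 1
    return cells

# Hypothetical actual classes and model classifications.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

cells = confusion_matrix(actual, predicted)
print(cells)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```

The correct classifications are the TP and TN cells; the erroneous ones are FP and FN.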

#### Week #7 – Multiple looks

In a classic statistical experiment, treatment(s) and placebo are applied to randomly assigned subjects, and, at the end of the experiment, outcomes are compared.

#### Week #6 – Pruning the tree

Classification and regression trees, applied to data with known values for an outcome variable, derive models with rules like "If taxable income <\$80,000, if no Schedule C income, if standard deduction taken, then no-audit."

#### Week #5 – Features vs. Variables

The predictors in a predictive model are sometimes given different terms by different disciplines.  Traditional statisticians think in terms of variables.

#### Week #4 – Logistic Regression

In logistic regression, we seek to estimate the relationship between predictor variables Xi and a binary response variable.  Specifically, we want to estimate the probability p that the response variable will be a 0 or a 1.
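
Logistic regression models the log-odds of the response as a linear function of the predictors, so a probability is recovered by passing that linear combination through the logistic function. The sketch below shows only this final step; the intercept and coefficient are hypothetical numbers, not estimates fitted to any data, and the function name is an assumption.

```python
import math

def predicted_probability(intercept, coefficients, x):
    """Probability that the response is 1, given predictor values x."""
    # Linear predictor (the log-odds).
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    # The logistic function maps any real z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model with one predictor: log-odds = -3.0 + 0.05 * x.
intercept, coefficients = -3.0, [0.05]

p_low = predicted_probability(intercept, coefficients, [20])
p_high = predicted_probability(intercept, coefficients, [100])
print(p_low, p_high)  # the probability rises with x because the coefficient is positive
```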

#### Week #3 – Prior and posterior

Bayesian statistics typically incorporates new information (e.g. from a diagnostic test, or a recently drawn sample) to answer a question of the form "What is the probability that..."

#### Week #2 – Permutation test

Consider two (or more) samples subjected to different treatments.  A permutation test assesses whether,

#### Week #1 – Quasi-experiment (revisited)

One avid reader took issue with a recent definition of "quasi experiment."  I had defined it

#### Week #52 – Quasi-experiment

In social science research, particularly in the qualitative literature on program evaluation, the term "quasi-experiment" refers to studies that do not involve the application of treatments via random assignment of subjects.

#### Week #51 – Curb-stoning

In survey research, curb-stoning refers to the deliberate fabrication of survey interview data by the interviewer.

#### Week #50 – Bag-of-words

Bag-of-words is a simplified natural language processing concept.

#### Week #49 – Stemming

In language processing, stemming is the process of taking multiple forms of the same word and reducing them to the same basic core form.

#### Week #48 – Structured vs. unstructured data

Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).

#### Week #47 – Feature engineering

In predictive modeling, a key step is to turn available data (which may come from varied sources and be messy) into an orderly matrix of rows (records to be predicted) and columns (predictor variables or features).

#### Week #46 – Naive bayes classifier

A full Bayesian classifier is a supervised learning technique that assigns a class to a record by finding other records whose attributes match its own, and finding the most prevalent class among them.

#### Week #45 – MapReduce

In computer science, MapReduce is a procedure that prepares data for parallel processing on multiple computers.

#### Week #44 – Likert scales

Likert scales are categorical ordinal scales used in social sciences to measure attitude.  A typical example is a set of response options ranging from "strongly agree" to "strongly disagree."

#### Week #43 – Node

A node is an entity in a network.  In a social network, it would be a person.  In a digital network, it would be a computer or device.

#### Week #42 – Latent Variable Models

Latent variable models postulate some relationship between the statistical properties of observable variables.

#### Week #41 – K-nearest neighbor

K-nearest-neighbor (K-NN) is a machine learning predictive algorithm that relies on calculation of distances between pairs of records.

#### Word #40 – Kappa Statistic

The kappa statistic measures the extent to which different raters or examiners agree, beyond what chance alone would produce, when looking at the same data and assigning categories.

#### Word #39 – Censoring

Censoring in time-series data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.

#### Word #38 – Survival Analysis

Survival analysis is a set of methods used to model and analyze survival data, also called time-to-event data.

#### Word #37 – Joint Probability Distribution

The probability distribution for X is the possible values of X and their associated probabilities. With two separate discrete random variables, X and Y, the joint probability distribution is the function f(x,y)

#### Word #36 – The Jackknife

With a sample of size N, the jackknife involves calculating N values of the estimator, with each value calculated on the basis of the entire sample less one observation.
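
The leave-one-out step can be sketched as follows. The estimator used here is the sample mean, chosen only for illustration; the function names and sample values are assumptions.

```python
def mean(values):
    return sum(values) / len(values)

def jackknife_values(sample, estimator):
    """Return N estimates, each computed on the sample with one observation removed."""
    return [estimator(sample[:i] + sample[i + 1:]) for i in range(len(sample))]

# Hypothetical sample of size N = 4.
sample = [1.0, 2.0, 3.0, 4.0]

loo = jackknife_values(sample, mean)
print(loo)  # four leave-one-out means, from 3.0 (dropping 1.0) down to 2.0 (dropping 4.0)
```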

#### Word #35 – Interim Monitoring

In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results - say, response to a medication.

#### Word #34 – NoSQL

A NoSQL database is distinguished mainly by what it is not -

#### Word #33 – Similarity matrix

A similarity matrix shows how similar records are to each other.

#### Word #32 – Predictive modeling

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).

#### Word #31 – Hold-out sample

A hold-out sample is a random sample from a data set that is withheld and not used in the model fitting process.  After the model...

#### Week #30 – Heteroscedasticity

Heteroscedasticity generally means unequal variation of data, e.g. unequal variance.  More specifically,

#### Week #29 – Goodness-of-fit

Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which

#### Week #28 – Geometric Mean

The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful in taking averages of ratios.
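
A direct translation of the definition, assuming all values are positive (as ratios such as growth factors are). The function name and the sample growth factors are made up for this sketch.

```python
import math

def geometric_mean(values):
    """Multiply all n values together, then take the nth root of the product."""
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical year-over-year growth factors: doubling, then an eightfold rise.
growth_factors = [2, 8]

# The product is 16, and its square root is 4: growing fourfold each year
# for two years gives the same overall result.
print(geometric_mean(growth_factors))  # 4.0
```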

#### Week #27 – Hierarchical Linear Models

Hierarchical linear modeling is an approach to analysis of hierarchical (nested) data - i.e. data represented by categories, sub-categories, ..., individual units (e.g. school -> classroom -> student).

#### Week #26 – Hazard Function

In medical statistics, the hazard function is a relationship between a proportion and time.

#### Week #25 – Fleming multiple testing procedure

The Fleming procedure (or O'Brien-Fleming multiple testing procedure) is a simple multiple testing procedure for comparing two treatments when the response to treatment is dichotomous. This procedure...

#### Week #24 – Directed vs. Undirected Network

In a directed network, connections between nodes are directional. For example...

#### Week #23 – Adjacency Matrix

An adjacency matrix describes the relationships in a network. Nodes are listed in the top...

#### Week #22 – Exponential Distribution

The exponential distribution is a model for the length of intervals between two consecutive random events in time or

#### Week #21 – Error

Error is the deviation of an estimated quantity from its true value, or, more precisely,

#### Week #20 – Step-wise Regression

Step-wise regression is one of several computer-based iterative variable-selection procedures.

#### Week #19 – Regularization

Regularization refers to a wide variety of techniques used to bring structure to statistical models in the face of data size, complexity and sparseness.

#### Week #18 – SQL

SQL stands for structured query language, a high-level language for querying relational databases and extracting information.

#### Week #17 – Markov Chain Monte Carlo (MCMC)

A Markov chain is a probability system that governs transition among states or through successive events.

#### Week #16 – MapReduce

MapReduce is a programming framework to distribute the computing load of very large data and problems to multiple computers.

#### Week #15 – Hadoop

As data processing requirements grew beyond the capacities of even large computers, distributed computing systems were developed to spread the load to multiple computers.

#### Week #14 – Curse of Dimensionality

The curse of dimensionality is the affliction caused by adding variables to multivariate data models.

#### Week #13 – Data Product

A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products.

#### Week #12 – Dependent and Independent Variables

Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called independent variables.

#### Week #11 – Distance

Statistical distance is a measure calculated between two records that are typically part of a larger dataset, where rows are records and columns are variables.  To calculate...

#### Week #10 – Decile Lift

In predictive modeling, the goal is to make predictions about outcomes on a case-by-case basis:  an insurance claim will be fraudulent or not, a tax return will be correct or in error, a subscriber...

#### Week #9 – Decision Trees

In the machine learning community, a decision tree is a branching set of rules used to classify a record, or predict a continuous value for a record.  For example

#### Week #8 – Feature Selection

In predictive modeling, feature selection, also called variable selection, is the process (usually automated) of sorting through variables to retain variables that are likely...

#### Week #7 – Bagging

In predictive modeling, bagging is an ensemble method that uses bootstrap replicates of the original training data to fit predictive models.

#### Week #6 – Boosting

In predictive modeling, boosting is an iterative ensemble method that starts out by applying a classification algorithm and generating classifications.

#### Week #5 – Ensemble Methods

In predictive modeling, ensemble methods refer to the practice of taking multiple models and averaging their predictions.

#### Week #4 – Expected value

The expected value of a random variable, in a simple sense, is nothing but the arithmetic mean.
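
For a discrete random variable, the expected value is the probability-weighted average of its possible values; when all values are equally likely, this reduces to the ordinary arithmetic mean. A minimal sketch, with the function name and the fair-die example assumed for illustration:

```python
def expected_value(distribution):
    """Probability-weighted average of a discrete distribution.

    `distribution` maps each possible value to its probability.
    """
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in distribution.items())

# A fair six-sided die: each face has probability 1/6.
die = {face: 1 / 6 for face in range(1, 7)}

# Equal probabilities, so this matches the arithmetic mean of 1..6 (about 3.5).
print(expected_value(die))
```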

#### Week #3 – Exact Tests

Exact tests are hypothesis tests that are guaranteed to produce Type-I error at or below the nominal alpha level of the test when conducted on samples drawn from a null model.

#### Week #2 – Error

In statistical models, error or residual is the deviation of the estimated quantity from its true value: the greater the deviation, the greater the error.

#### Week #1 – Endogenous variable

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model.

#### Week #53 – Effect size

In a study or experiment with two groups (usually control and treatment), the investigator typically has in mind the magnitude of the difference between the two groups that he or she wants to be able to detect in a hypothesis test.

#### Week #52 – Alpha spending function

In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results - say, response to a medication.

#### Week #51 – Type 1 error

In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true -- of saying an effect or event is statistically significant when it is not.

#### Week #50 – Stationary time series

A time series x(t), t = 1, 2, ..., is considered to be stationary if its statistical properties do not depend on time t.

#### Week #49 – Data partitioning

Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).
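
One simple way to produce such a partition is to shuffle the records and cut the shuffled list at two points. The 60/20/20 split, the fixed seed, and the function name below are assumptions for this sketch, not a recommended standard.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Randomly split records into non-overlapping training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

# Hypothetical data set of ten record IDs.
records = list(range(10))
train, valid, test = partition(records)
print(len(train), len(valid), len(test))  # 6 2 2
```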

#### Week #48 – Data Mining

Data mining is concerned with finding latent patterns in large databases.

#### Week #47 – Z-score

An observation's z-score tells you the number of standard deviations it lies away from the population mean (and in which direction).
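
In code, the z-score is the observation minus the mean, divided by the standard deviation; the sign gives the direction. The population mean of 100 and standard deviation of 15 below are hypothetical values chosen for the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean (signed)."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# Hypothetical population: mean 100, standard deviation 15.
print(z_score(130, 100, 15))  # 2.0  (two standard deviations above the mean)
print(z_score(85, 100, 15))   # -1.0 (one standard deviation below the mean)
```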

#### Week #46 – Cluster Analysis

In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups, or, more precisely, groups whose members are all close to one another on various dimensions being measured.

#### Week #45 – Construct validity

In psychology, a construct is a phenomenon or a variable in a model that is not directly observable or measurable; intelligence is a classic example.

#### Week #44 – Collaborative filtering

Collaborative filtering algorithms are used to predict whether a given individual might like, or purchase, an item.

#### Week #43 – Longitudinal data

Longitudinal data records multiple observations over time for a set of individuals or units. A typical...

#### Week #42 – Cross-sectional data

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual.  A simple...

#### Week #41 – Tokenization

Tokenization is an initial step in natural language processing.  It involves breaking down a text into a series of basic units, typically words. For example...
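
A very simple word-level tokenizer might look like the sketch below. The rule used here (lowercase the text, then keep runs of letters, digits, and apostrophes) is an assumption made for illustration; practical tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation and whitespace."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical input text.
tokens = tokenize("The cat sat. The cat's hat!")
print(tokens)  # ['the', 'cat', 'sat', 'the', "cat's", 'hat']
```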
