Jan 20, 2015

Bayesian statistics typically incorporates new information (e.g. from a diagnostic test, or a recently drawn sample) to answer a question of the form "What is the probability that..."

Jan 13, 2015

Consider two (or more) samples subjected to different treatments. A permutation test assesses whether,

Jan 6, 2015

One avid reader took issue with a recent definition of "quasi experiment." I had defined it

Dec 30, 2014

In social science research, particularly in the qualitative literature on program evaluation, the term "quasi-experiment" refers to studies that do not involve the application of treatments via random assignment of subjects.

Dec 23, 2014

In survey research, curb-stoning refers to the deliberate fabrication of survey interview data by the interviewer.

Dec 16, 2014

Bag-of-words is a simplified natural language processing concept.

Dec 9, 2014

In language processing, stemming is the process of taking multiple forms of the same word and reducing them to the same basic core form.

Dec 2, 2014

Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).

Nov 25, 2014

In predictive modeling, a key step is to turn available data (which may come from varied sources and be messy) into an orderly matrix of rows (records to be predicted) and columns (predictor variables or features).

Nov 18, 2014

A full Bayesian classifier is a supervised learning technique that assigns a class to a record by finding other records with attributes just like it has, and finding the most prevalent class among them.
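
The matching-and-voting idea in this definition can be sketched in a few lines of Python. This is only an illustration: the function name `exact_bayes_classify`, the toy loan records, and the choice to return `None` when no identical record exists are all assumptions, not part of the original entry.

```python
from collections import Counter

def exact_bayes_classify(records, labels, query):
    """Return the most prevalent class among records identical to `query`."""
    # Keep the labels of every training record whose attributes match exactly
    matches = [label for rec, label in zip(records, labels) if rec == query]
    if not matches:
        return None  # assumption: no identical record means no class is assigned
    return Counter(matches).most_common(1)[0][0]

# Hypothetical training data: (owns_home, income_band) -> loan outcome
records = [("yes", "high"), ("yes", "high"), ("no", "low"), ("yes", "high")]
labels = ["repay", "default", "default", "repay"]

print(exact_bayes_classify(records, labels, ("yes", "high")))
```

With many attributes, exact matches become rare, which is one reason the approach is seldom practical as written.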

Nov 11, 2014

In computer science, MapReduce is a procedure that prepares data for parallel processing on multiple computers.

Nov 4, 2014

Likert scales are categorical ordinal scales used in social sciences to measure attitude. A typical example is a set of response options ranging from "strongly agree" to "strongly disagree."

Oct 28, 2014

A node is an entity in a network. In a social network, it would be a person. In a digital network, it would be a computer or device.

Oct 21, 2014

Latent variable models postulate some relationship between the statistical properties of observable variables.

Oct 14, 2014

K-nearest-neighbor (K-NN) is a machine learning predictive algorithm that relies on calculation of distances between pairs of records.
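
A minimal sketch of the distance-based idea, assuming Euclidean distance and a simple majority vote for classification (the entry does not specify either). The function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    # Distance from the query to every training record (Euclidean, by assumption)
    dists = [(math.dist(rec, query), lab) for rec, lab in zip(train, labels)]
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]

# Hypothetical two-feature records with known classes
train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]

print(knn_predict(train, labels, (1.1, 1.0), k=3))
```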

Oct 7, 2014

The kappa statistic measures the extent to which different raters or examiners agree, beyond what chance alone would produce, when looking at the same data and assigning categories.
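
As a rough illustration, the sketch below computes Cohen's kappa for two raters, which is one common form of the statistic and an assumption here, since the entry does not say which variant is meant. The function name and the ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same category
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) when chance agreement is exactly 1
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of four items by two examiners
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

A value of 1 indicates perfect agreement, and a value near 0 indicates agreement no better than chance.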

Sep 30, 2014

Censoring in time-series data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.

Sep 23, 2014

Survival analysis is a set of methods used to model and analyze survival data, also called time-to-event data.

Sep 16, 2014

The probability distribution for X is the possible values of X and their associated probabilities. With two separate discrete random variables, X and Y, the joint probability distribution is the function f(x,y)

Sep 9, 2014

With a sample of size N, the jackknife involves calculating N values of the estimator, with each value calculated on the basis of the entire sample less one observation.
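
The leave-one-out calculation can be sketched as follows. The function name `jackknife_values` and the use of the mean as the default estimator are assumptions made for illustration.

```python
from statistics import mean

def jackknife_values(sample, estimator=mean):
    """Return N estimator values, each computed with one observation left out."""
    return [estimator(sample[:i] + sample[i + 1:]) for i in range(len(sample))]

# Hypothetical sample of size N = 4, passed as a list
sample = [2, 4, 6, 8]
print(jackknife_values(sample))
```

The spread of these N values is what the jackknife then uses to assess the bias or variability of the estimator.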

Sep 2, 2014

In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results - say, response to a medication.

Aug 26, 2014

A NoSQL database is distinguished mainly by what it is not -

Aug 19, 2014

A similarity matrix shows how similar records are to each other.

Aug 12, 2014

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).

Aug 5, 2014

A hold-out sample is a random sample from a data set that is withheld and not used in the model fitting process. After the model...

Jul 29, 2014

Heteroscedasticity generally means unequal variation of data, e.g. unequal variance. More specifically,

Jul 22, 2014

Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which

Jul 15, 2014

The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful in taking averages of ratios.
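
The calculation described here translates directly into code. This sketch assumes all values are positive; the function name and the growth factors are hypothetical.

```python
import math

def geometric_mean(values):
    """Multiply all n values together, then take the nth root of the product."""
    return math.prod(values) ** (1 / len(values))

# Hypothetical year-over-year growth ratios (+10%, +20%, -10%)
growth_factors = [1.10, 1.20, 0.90]
print(geometric_mean(growth_factors))
```

For long lists, averaging the logarithms and exponentiating avoids overflow in the product; Python's standard library also offers `statistics.geometric_mean`.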

Jul 8, 2014

Hierarchical linear modeling is an approach to analysis of hierarchical (nested) data - i.e. data represented by categories, sub-categories, ..., individual units (e.g. school -> classroom -> student).

Jul 1, 2014

In medical statistics, the hazard function is a relationship between a proportion and time.

Jun 24, 2014

The Fleming procedure (or *O'Brien-Fleming multiple testing procedure*) is a simple multiple testing procedure for comparing two treatments when the response to treatment is dichotomous. This procedure...

Jun 17, 2014

In a directed network, connections between nodes are directional. For example..

Jun 10, 2014

An adjacency matrix describes the relationships in a network. Nodes are listed in the top..

Jun 3, 2014

The exponential distribution is a model for the length of intervals between two consecutive random events in time or

May 27, 2014

Error is the deviation of an estimated quantity from its true value, or, more precisely,

May 20, 2014

Step-wise regression is one of several computer-based iterative variable-selection procedures.

May 13, 2014

Regularization refers to a wide variety of techniques used to bring structure to statistical models in the face of data size, complexity and sparseness.

May 6, 2014

SQL stands for structured query language, a high-level language for querying relational databases and extracting information.

Apr 29, 2014

A Markov chain is a probability system that governs transition among states or through successive events.

Apr 22, 2014

MapReduce is a programming framework to distribute the computing load of very large data and problems to multiple computers.

Apr 15, 2014

As data processing requirements grew beyond the capacities of even large computers, distributed computing systems were developed to spread the load to multiple computers.

Apr 8, 2014

The curse of dimensionality is the affliction caused by adding variables to multivariate data models.

Apr 1, 2014

A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products.

Mar 25, 2014

Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called independent variables.

Mar 18, 2014

Statistical distance is a measure calculated between two records that are typically part of a larger dataset, where rows are records and columns are variables. To calculate...

Mar 11, 2014

In predictive modeling, the goal is to make predictions about outcomes on a case-by-case basis: an insurance claim will be fraudulent or not, a tax return will be correct or in error, a subscriber...

Mar 4, 2014

In the machine learning community, a decision tree is a branching set of rules used to classify a record, or predict a continuous value for a record. For example

Feb 25, 2014

In predictive modeling, feature selection, also called variable selection, is the process (usually automated) of sorting through variables to retain variables that are likely...

Feb 18, 2014

In predictive modeling, bagging is an ensemble method that uses bootstrap replicates of the original training data to fit predictive models.

Feb 11, 2014

In predictive modeling, boosting is an iterative ensemble method that starts out by applying a classification algorithm and generating classifications.

Feb 5, 2014

In predictive modeling, ensemble methods refer to the practice of taking multiple models and averaging their predictions.

Jan 28, 2014

The expected value of a random variable, in a simple sense, is nothing but the arithmetic mean.

Jan 21, 2014

Exact tests are hypothesis tests that are guaranteed to produce Type-I error at or below the nominal alpha level of the test when conducted on samples drawn from a null model.

Jan 14, 2014

In statistical models, error or residual is the deviation of the estimated quantity from its true value: the greater the deviation, the greater the error.

Jan 7, 2014

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model.

Dec 31, 2013

In a study or experiment with two groups (usually control and treatment), the investigator typically has in mind the magnitude of the difference between the two groups that he or she wants to be able to detect in a hypothesis test.

Dec 24, 2013

In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results - say, response to a medication.

Dec 17, 2013

In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true -- of saying an effect or event is statistically significant when it is not.

Dec 10, 2013

A time series x(t); t=1,... is considered to be stationary if its statistical properties do not depend on time t.

Dec 3, 2013

Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).
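
A minimal sketch of a three-way partition, assuming a random shuffle and a 60/20/20 split; the proportions, the function name `partition`, and the fixed seed are illustrative choices, not something the entry prescribes.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and cut them into non-overlapping train/validation/test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```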

Nov 26, 2013

Data mining is concerned with finding latent patterns in large databases.

Nov 19, 2013

An observation's z-score tells you the number of standard deviations it lies away from the population mean (and in which direction).
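
In code, the z-score is a one-line calculation. The function name and the example numbers (a population with mean 100 and standard deviation 15) are hypothetical.

```python
def z_score(x, population_mean, population_sd):
    """Number of standard deviations x lies from the mean; the sign gives the direction."""
    return (x - population_mean) / population_sd

print(z_score(130, 100, 15))  # 2.0, i.e. two standard deviations above the mean
print(z_score(85, 100, 15))   # -1.0, i.e. one standard deviation below the mean
```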

Nov 12, 2013

In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups, or, more precisely, groups whose members are all close to one another on various dimensions being measured.

Nov 5, 2013

In psychology, a construct is a phenomenon or a variable in a model that is not directly observable or measurable - intelligence is a classic example.

Oct 29, 2013

Collaborative filtering algorithms are used to predict whether a given individual might like, or purchase, an item.

Oct 22, 2013

Longitudinal data records multiple observations over time for a set of individuals or units. A typical..

Oct 15, 2013

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual. A simple...

Oct 8, 2013

Tokenization is an initial step in natural language processing. It involves breaking down a text into a series of basic units, typically words. For example...

Oct 1, 2013

A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term...

Sep 24, 2013

White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals.

Sep 17, 2013

An edge is a link between two people or entities in a network that can be

Sep 10, 2013

Stratified sampling is a method of random sampling.

Sep 3, 2013

When probabilities are quoted without specification of the sample space, it could result in ambiguity when the sample space is not self-evident.

Aug 27, 2013

A discrete distribution is one in which the data can only take on certain values, for example integers. A continuous distribution is one in which data can take on any value within a specified range (which may be infinite).

Aug 20, 2013

The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the probability distribution of the population from which the sample is drawn.

Aug 13, 2013

Classification and regression trees (CART) are a set of techniques for classification and prediction.

Aug 6, 2013

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.

Jul 30, 2013

In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is taken.

Jul 23, 2013

Discriminant analysis is a method of distinguishing between classes of objects. The objects are typically represented as rows in a matrix.

Jul 16, 2013

Also called the training sample, training set, calibration sample. The context is predictive modeling (also called supervised data mining) - where you have data with multiple predictor variables and a single known outcome or target variable.

Jul 9, 2013

A general statistical term meaning a systematic (not random) deviation of an estimate from the true value.

Jul 2, 2013

One of several computer-based iterative procedures for selecting variables to use in a model. The process begins...

Jun 25, 2013

Outcomes to an experiment or repeated events are statistically significant if they differ from what chance variation might produce.

Jun 18, 2013

In multiple comparison procedures, family-wise type I error is the probability that, even if all samples come from the same population, you will wrongly conclude

Jun 11, 2013

A cohort study is a longitudinal study that identifies a group of subjects sharing some attributes (a "cohort") then

Jun 4, 2013

The coefficient of variation is the standard deviation of a data set, divided by the mean of the same data set.
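
A short sketch of the calculation. It assumes the population standard deviation (`statistics.pstdev`); the entry does not say whether the sample or the population form is intended, and `statistics.stdev` could be swapped in for the former. The data are hypothetical.

```python
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """Standard deviation of the data divided by the mean of the same data."""
    # Assumption: population standard deviation; meaningful only for a nonzero mean
    return pstdev(data) / mean(data)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(coefficient_of_variation(data))
```

Because the result is unitless, it allows variability to be compared across data sets measured on different scales.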

May 28, 2013

In regression analysis, the coefficient of determination is a measure of goodness-of-fit (i.e. how well or tightly the data fit the estimated model). The coefficient is

May 21, 2013

An estimator is a measure or metric intended to be calculated from a sample drawn from a larger population...

May 14, 2013

In regression analysis, collinearity of two variables means that strong correlation exists between them, making it difficult or impossible to estimate their individual regression coefficients reliably.

May 7, 2013

A cohort study is a longitudinal study that identifies a population or large group (a "cohort") then draws a sample from the population at various points in time and records data for the sample.

Apr 30, 2013

The centroid is a measure of center in multi-dimensional space.

Apr 23, 2013

Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application

Apr 16, 2013

A Binomial distribution is used to describe an experiment, event, or process for which the probability of success is the same for each trial and each trial has only two possible outcomes.

Apr 9, 2013

A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments.

Apr 2, 2013

Network analytics is the science of describing and, especially, visualizing the connections among objects.

Mar 26, 2013

Multiplicity issues arise in a number of contexts, but they generally boil down to the same thing: repeated looks at a data set in different ways, until something "statistically significant" emerges.

Mar 19, 2013

Support vector machines are used in data mining (predictive modeling, to be specific) for classification of records, by learning from training data.

Mar 12, 2013

In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another. It might

Mar 5, 2013

The negative binomial distribution is the probability distribution of the number of Bernoulli (yes/no) trials required to obtain *r* successes.
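
Under this "number of trials" definition, the probability that the *r*-th success arrives on trial *n* is C(n-1, r-1) p^r (1-p)^(n-r), where p is the success probability on each trial. The sketch below evaluates it; the function name and parameter names are hypothetical.

```python
from math import comb

def neg_binomial_pmf(n, r, p):
    """Probability that the r-th success occurs exactly on trial n."""
    if n < r:
        return 0.0  # r successes cannot occur in fewer than r trials
    # The last trial must be a success; the other r-1 successes fall among the first n-1 trials
    return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

# Chance that the second head appears on the third toss of a fair coin
print(neg_binomial_pmf(3, 2, 0.5))
```

Some texts and software define the distribution as the number of failures before the *r*-th success instead, which shifts the values by *r*.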

Feb 26, 2013

A random walk is a process of random steps, motions, or transitions. It might be in one dimension (movement along a line), in two dimensions (movements in a plane), or in three dimensions or more.

Feb 19, 2013

Cover time is the expected number of steps in a random walk required

Feb 12, 2013

is a general computer-intensive approach used in estimating the accuracy of statistical models.

Feb 5, 2013

(also called dissimilarity matrix) describes pairwise distinction between M objects.

Jan 29, 2013

in discrete time is the transformation of the series to a new time series where the values are the differences between consecutive values of the original series.
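
The transformation described above (replacing a series by the differences between its consecutive values) can be sketched as follows; the function name and the example series are hypothetical.

```python
def difference(series):
    """Return the series of differences between consecutive values."""
    return [b - a for a, b in zip(series, series[1:])]

print(difference([3, 5, 4, 10]))  # [2, -1, 6]
```

The result has one fewer value than the original series.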

Jan 22, 2013

(outcome or variable) means "having only two possible values", e.g.

Jan 1, 2013

In predictive modeling, data partitioning is the division of the data available for analysis into two or three non-overlapping

Dec 27, 2012

Promoting better understanding of statistics throughout the world.

Nov 12, 2012

New Editor of Journal of Statistics Education

Oct 28, 2012

Read Peter's Letter to the Editor in Saturday's **Washington Post**.

Jun 28, 2012

Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.

May 24, 2012

Evidence shows that there is no significant difference between taking an online introductory statistics course and a traditional in-person class.

May 18, 2012

Facebook began trading around 11:30 this morning, and I spent 8 minutes

May 14, 2012

Newly elected American Statistical Association (ASA) Fellow, and recognized for his outstanding professional contributions to and leadership in the field of statistical science.

Apr 23, 2012

Arizona's immigration law goes before the Supreme Court this week...

Mar 15, 2012

I saw this job posting a while ago, and, in my next life,

Feb 21, 2012

David Unwin, Emeritus Chair in Geography, Birkbeck College, University of London (and instructor at Statistics.com!) will be awarded the Association of American Geographers (AAG) *Ronald F. Abler Distinguished Service Honors* at the upcoming annual meeting next week.

Feb 13, 2012

February 12 was the 80th anniversary of the birth of Julian Simon, an early pioneer in resampling methods.

Jan 17, 2012

Statistics for Future Presidents - Steve Pierson, Director of Science Policy at ASA, wrote an interesting blog post wondering how statistics for future presidents (or policymakers more generally) would compare with the recommended statistical skills/concepts for others. Take a look and let him know!

Jan 6, 2012

*Teaching Geographic Information Science and Technology in Higher Education*, 2012 (Wiley)

Nov 29, 2011

The story of the prospective Facebook IPO, and prior IPOs from LinkedIn, Pandora, and Groupon, all involve "data scientists." Read an interview with Monica Rogati, Senior Data Scientist at LinkedIn, to see the connection.

Oct 25, 2011

Dr. Michelle Everson is recognized for her outstanding contributions to and innovation in the teaching of elementary statistics.

Sep 30, 2011

John Elder's presentations on common data mining mistakes are a must-see if you have any experience or plans in the data mining arena.

Sep 13, 2011

"Any claim coming from an observational study is most likely to be wrong."Â Thus begins "Deming, data and observational studies," just published in "Significance Magazine" (Sept. 2011).

Aug 31, 2011

I was watching a Washington Nationals game on TV a couple of days ago, and the concept of "expected value" ...

Jul 15, 2011

A neurosurgeon, pathologist and epidemiologist are each told to examine a can of sardines on a table in a closed room, and present a report.

Jun 14, 2011

What do teenagers want? More importantly for the music industry, what music will they buy?

Apr 5, 2011

Advertisers shy away from round numbers, believing that $99 appears significantly cheaper than $100...

Mar 22, 2011

Did the NCAA get the March Madness rankings right? Check out SportsMeasures.com

Jan 24, 2011

What does Matt Asher's article "Attack of the Hair Trigger Bees" have to do with global warming? Matt Asher runs statisticsblog.com ...

Jan 13, 2011

The first Gallup Poll was published in October 1935. In *America Speaks*,

Jan 5, 2011

Thinking about careers that use statistics? The job title "catastrophe modeling assistant" caught my eye recently in a job announcement. ...

Dec 27, 2010

One of my gifts this holiday season was "A Drunkard's Walk: How Randomness Rules Our Lives,"

