Week #42 – Cross-sectional data

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual.  A simple...

Week #41 – Tokenization

Tokenization is an initial step in natural language processing.  It involves breaking down a text into a series of basic units, typically words. For example...
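
As a rough illustration of breaking text into word units, here is a minimal sketch using a regular expression. The function name `tokenize` and the rule itself (lowercase everything, keep internal apostrophes, drop punctuation) are assumptions made for this example; real tokenizers handle many more cases.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens using one simple rule."""
    # \w+ matches a run of letters, digits, or underscores; the optional
    # group keeps an internal apostrophe so "isn't" stays a single token.
    return re.findall(r"\w+(?:'\w+)?", text.lower())


print(tokenize("Tokenization isn't hard: split the text, keep the words."))
```

Punctuation is simply discarded here; whether that is appropriate depends on the downstream task.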

Week #40 – Natural Language

A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term...

Week #39 – White Hat Bias

White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals.

Week #35 – Continuous vs. Discrete Distributions

A discrete distribution is one in which the data can only take on certain values, for example integers.  A continuous distribution is one in which data can take on any value within a specified range (which may be infinite).

Week #34 – Central Limit Theorem

The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the shape of the population distribution from which the sample is drawn (provided that distribution has finite variance).
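
The theorem can be checked informally by simulation. The sketch below assumes an exponential population (mean 1, strongly right-skewed) and compares sample means for a few sample sizes; the function names, the choice of population, and the number of draws are illustrative assumptions only.

```python
import random
import statistics


def sample_means(n, draws=2000, seed=0):
    """Draw `draws` samples of size n from an exponential population
    (mean 1) and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.mean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(draws)
    ]


def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of their mean."""
    centre = statistics.mean(values)
    spread = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - centre) <= spread)
    return inside / len(values)


# Columns: sample size, spread of the sample means, share within one SD.
for n in (1, 5, 30):
    means = sample_means(n)
    print(n, round(statistics.stdev(means), 3), round(share_within_one_sd(means), 3))
```

For a Normal distribution, about 68% of values fall within one standard deviation of the mean, so the last column should drift toward roughly 0.68 as n grows, while the spread of the means shrinks.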

Week #32 – CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.

Week #31 – Census

In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is taken.

Week #30 – Discriminant analysis

Discriminant analysis is a method of distinguishing between classes of objects.  The objects are typically represented as rows in a matrix.

Week #29 – Training data

Training data are also called the training sample, training set, or calibration sample. The context is predictive modeling (also called supervised data mining), where you have data with multiple predictor variables and a single known outcome or target variable.

Week #28 – Bias

A general statistical term meaning a systematic (not random) deviation of an estimate from the true value.

Week #17 – Bootstrapping

Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application
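
A minimal sketch of the resampling idea, assuming the statistic of interest is the mean and that its variability is summarized by a standard error. The function name `bootstrap_se`, the number of replicates, and the data values are hypothetical.

```python
import random
import statistics


def bootstrap_se(data, stat=statistics.mean, reps=1000, seed=0):
    """Estimate the standard error of `stat` by resampling `data`
    with replacement `reps` times."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate is a resample of the same size as the original data.
    estimates = [stat(rng.choices(data, k=n)) for _ in range(reps)]
    return statistics.stdev(estimates)


observed = [2, 4, 4, 5, 7, 9, 10, 12]  # made-up values for illustration
print(bootstrap_se(observed))
```

Passing a different function as `stat` (for example `statistics.median`) applies the same procedure to another statistic.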

Week #16 – Binomial Distribution

A Binomial distribution describes the number of successes in a fixed number of independent trials, where the probability of success is the same for each trial and each trial has only two possible outcomes.
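
For concreteness, the probability of exactly k successes in n independent trials with success probability p is C(n, k) p^k (1 - p)^(n - k). A small sketch of that formula follows; the function name `binomial_pmf` is chosen for this example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(2, 4, 0.5))
```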

Week #15 – Uplift or Persuasion Modeling

A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments.

Week #13 – Multiplicity issues

Multiplicity issues arise in a number of contexts, but they generally boil down to the same thing:  repeated looks at a data set in different ways, until something "statistically significant" emerges.
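
One way to see why repeated looks matter: if each test has a false-positive rate alpha and the tests are independent (a simplifying assumption made here), the chance that at least one comes out "significant" by chance alone is 1 - (1 - alpha)^k for k tests. The function name below is illustrative.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across `num_tests`
    independent tests, each run at significance level `alpha`."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20):
    print(k, round(familywise_error_rate(k), 3))
```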

Week #12 – Support vector machines

Support vector machines are used in data mining (predictive modeling, to be specific) for classification of records, by learning from training data.

Week #11 – Attribute

In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another.  It might

Week #10 – Negative Binomial

The negative binomial distribution is the probability distribution of the number of Bernoulli (yes/no) trials required to obtain r successes.
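
Under this "number of trials" definition, the probability that the r-th success arrives on trial n is C(n - 1, r - 1) p^r (1 - p)^(n - r), where p is the per-trial success probability. A short sketch follows; the function name is an assumption for this example.

```python
from math import comb


def negative_binomial_pmf(n, r, p):
    """Probability that the r-th success occurs on exactly the n-th
    Bernoulli trial, with success probability p per trial."""
    if n < r:
        return 0.0  # r successes cannot occur in fewer than r trials
    # The last trial must be a success; the other r - 1 successes can
    # fall anywhere among the first n - 1 trials.
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)


# Probability that the 2nd head appears on the 3rd toss of a fair coin.
print(negative_binomial_pmf(3, 2, 0.5))
```

Some references instead define the distribution as the number of failures before the r-th success, which shifts the count by r.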

Week #9 – Random Walk

A random walk is a process of random steps, motions, or transitions.  It might be in one dimension (movement along a line), in two dimensions (movements in a plane), or in three dimensions or more.
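
The one-dimensional case can be sketched in a few lines, assuming each step is +1 or -1 with equal probability. The function name and the equal-probability, unit-step rule are choices made for this illustration.

```python
import random


def random_walk_1d(steps, seed=None):
    """Return the positions visited by a walk along a line that starts
    at 0 and moves +1 or -1 with equal probability at each step."""
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(steps):
        position += rng.choice((-1, 1))
        path.append(position)
    return path


print(random_walk_1d(10, seed=42))
```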

Week #5 – Differencing of a Time Series

Differencing of a time series in discrete time is the transformation of the series to a new time series whose values are the differences between consecutive values of the original series.
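
A minimal sketch of the transformation: each new value is the current observation minus the previous one, so the result is one element shorter than the input. The function name `difference` and its `lag` parameter are assumptions for this example (a lag of 1 gives consecutive differences).

```python
def difference(series, lag=1):
    """Return the differenced series: series[i] - series[i - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


print(difference([3, 5, 4, 8]))        # consecutive differences
print(difference([10, 12, 14, 16]))    # a linear trend becomes constant
```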

Week #1 – Data Partitioning

In predictive modeling, data partitioning is the division of the data available for analysis into two or three non-overlapping
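
The excerpt above is cut off, so the following is only a sketch under stated assumptions: the records are shuffled at random and split into three non-overlapping parts, here called training, validation, and test. Those names, the 60/20/20 proportions, and the function name `partition` are all assumptions for illustration.

```python
import random


def partition(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle `records` and split them into three non-overlapping
    lists sized by `fractions` (the last list takes the remainder)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```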

Churn Trigger

Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.

Randomized Trials on online learning

Evidence shows that there is no significant difference between taking an online introductory statistics course and a traditional in-person class.

Facebook IPO

Facebook began trading around 11:30 this morning, and I spent 8 minutes

Congratulations to Thomas Lumley!

Newly elected American Statistical Association (ASA) Fellow, and recognized for his outstanding professional contributions to and leadership in the field of statistical science.

Immigration

Arizona's immigration law goes before the Supreme Court this week...

Julian Simon birthday

February 12 was the 80th anniversary of the birth of Julian Simon, an early pioneer in resampling methods.

Statistics for Future Presidents

Steve Pierson, Director of Science Policy at ASA, wrote an interesting blog post wondering how statistics for future presidents (or policymakers more generally) would compare with the recommended statistical skills and concepts for others. Take a look and let him know!

The Data Scientist

The story of the prospective Facebook IPO, and the stories of prior IPOs from LinkedIn, Pandora, and Groupon, all involve "data scientists". Read an interview with Monica Rogati, Senior Data Scientist at LinkedIn, to see the connection.

Popular Mistakes in Data Mining

John Elder's presentations on common data mining mistakes are a must-see if you have any experience or plans in the data mining arena.

Coffee causes cancer?

"Any claim coming from an observational study is most likely to be wrong." Thus begins "Deming, data and observational studies," just published in "Significance Magazine" (Sept. 2011).

The sacrifice bunt

I was watching a Washington Nationals game on TV a couple of days ago, and the concept of "expected value" ...

Epidemiologist joke

A neurosurgeon, pathologist and epidemiologist are each told to examine a can of sardines on a table in a closed room, and present a report.

The Power of Round

Advertisers shy away from round numbers, believing that $99 appears significantly cheaper than $100...

March Madness

Did the NCAA get the March Madness rankings right? Check out SportsMeasures.com

Bees on the attack

What does Matt Asher's article "Attack of the Hair Trigger Bees" have to do with global warming? Matt Asher runs statisticsblog.com ...

Catastrophe Modeling Assistant

Thinking about careers that use statistics? The job title "catastrophe modeling assistant" caught my eye recently in a job announcement. ...

Random Monkeys

One of my gifts this holiday season was "The Drunkard's Walk: How Randomness Rules Our Lives,"
