Week #5 – Ensemble Methods

In predictive modeling, ensemble methods refer to the practice of taking multiple models and averaging their predictions.
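
As a rough illustration of the averaging idea, here is a minimal Python sketch. The function name `ensemble_average` and the three sets of predictions are hypothetical, made up for this example; real ensembles often weight models or combine them in other ways.

```python
from statistics import mean

def ensemble_average(predictions_by_model):
    """Average per-record predictions across several models.

    predictions_by_model: list of equal-length lists, one list per model.
    """
    if not predictions_by_model:
        raise ValueError("need predictions from at least one model")
    n_records = len(predictions_by_model[0])
    if any(len(p) != n_records for p in predictions_by_model):
        raise ValueError("all models must predict the same records")
    # zip(*...) groups together the predictions made for the same record
    return [mean(record) for record in zip(*predictions_by_model)]

# Hypothetical predictions from three models for the same four records
model_a = [10.0, 12.0, 9.0, 14.0]
model_b = [11.0, 11.0, 10.0, 13.0]
model_c = [12.0, 13.0, 8.0, 15.0]
averaged = ensemble_average([model_a, model_b, model_c])
```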


Dialects

When talking to several people, do you address them as "you guys"? "Y'all"? Just "you"? And is the carbonated soft drink "soda" or "pop"? Maps based on survey responses to questions like this were published in the Harvard Dialect Survey in 2003. Josh Katz took…

Needle in a Haystack

What's the probability that the NSA examined the metadata for your phone number in 2013? According to John Inglis, Deputy Director at the NSA, it's about 0.00001, or 1 in 100,000. A surprisingly small number, given what we've all been reading in the media about…

Week #3 – Exact Tests

Exact tests are hypothesis tests that are guaranteed to produce Type-I error at or below the nominal alpha level of the test when conducted on samples drawn from a null model.

Week #2 – Error

In statistical models, error or residual is the deviation of the estimated quantity from its true value: the greater the deviation, the greater the error.

Week #1 – Endogenous variable

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model.

Week #53 – Effect size

In a study or experiment with two groups (usually control and treatment), the investigator typically has in mind the magnitude of the difference between the two groups that he or she wants to be able to detect in a hypothesis test.

Week #51 – Type 1 error

In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true, that is, of saying an effect or event is statistically significant when it is not.

Personality regions

There are Red States and Blue States. The three blue states of the Pacific coast constitute the Left Coast. For Colin Woodard, Yankeedom comprises both New England and the Great Lakes. If you're into accessories, there's the Bible Belt, the Rust Belt, and the Stroke…

Week #49 – Data partitioning

Data partitioning in data mining is the division of all the available data into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).
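
One way to picture the three-way split is the short Python sketch below. The function name `partition`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration only; the appropriate proportions depend on the data and the problem.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle records and split them into non-overlapping
    training, validation, and test sets."""
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

train, valid, test = partition(range(100))
```

Because the three slices are taken from one shuffled list, every record lands in exactly one set.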

Week #46 – Cluster Analysis

In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups, or, more precisely, groups whose members are all close to one another on various dimensions being measured.

Week #45 – Construct validity

In psychology, a construct is a phenomenon or a variable in a model that is not directly observable or measurable; intelligence is a classic example.

Terrorist Clusters

The "righteous vengeance gun attack" is just one of 10 types of terrorism identified by Chenoweth and Lowham via statistical clustering techniques. Another cluster is "bombings of a public population where a liberation group takes responsibility." You can read about the 10 clusters, and the…

Statistics.com Partners With CrowdANALYTIX to Offer New Online Course With Crowdsource Contest As Project

Crowdsourcing, using the power of the crowd to solve problems, has been used for many functions and tasks, including predictive modeling (like the 2009 Netflix Contest). Typically, problems are broadcast to an unknown group of statistical modelers on the Internet, and solutions are sought. Every…

Week #42 – Cross-sectional data

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual.  A simple...

Week #41 – Tokenization

Tokenization is an initial step in natural language processing.  It involves breaking down a text into a series of basic units, typically words. For example...
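
As a separate illustration (not the example elided in the excerpt), a very simple word tokenizer might look like the Python sketch below. The function name `tokenize` and the rule it uses (lowercase runs of letters and digits, with internal apostrophes kept) are assumptions for this sketch; practical tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens.

    A token is a run of letters/digits, optionally joined by
    internal apostrophes (so "don't" stays one token).
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

tokens = tokenize("Don't panic: tokenization is an initial step in NLP.")
```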

Week #40 – Natural Language

A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term...

Week #39 – White Hat Bias

White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals.

Week #35 – Continuous vs. Discrete Distributions

A discrete distribution is one in which the data can only take on certain values, for example integers.  A continuous distribution is one in which data can take on any value within a specified range (which may be infinite).

Week #34 – Central Limit Theorem

The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the probability distribution of the population from which the sample is drawn.

Week #32 – CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.

Week #31 – Census

In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is taken.

Week #30 – Discriminant analysis

Discriminant analysis is a method of distinguishing between classes of objects.  The objects are typically represented as rows in a matrix.

Week #29 – Training data

Also called the training sample, training set, or calibration sample. The context is predictive modeling (also called supervised data mining), where you have data with multiple predictor variables and a single known outcome or target variable.

Mutual Attraction

Mutual attraction is a dominant force in the universe. Gravity binds the moon to the earth, the earth to the sun, the sun to the galaxy, and one galaxy to another. And yet the universe is expanding; the result is a larger universe comprised of…

Week #28 – Bias

A general statistical term meaning a systematic (not random) deviation of an estimate from the true value.

Week #17 – Bootstrapping

Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application…
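
To make the resampling idea concrete, here is a minimal Python sketch that estimates the standard error of a statistic by bootstrapping. The function name `bootstrap_se`, the default of 1,000 resamples, the fixed seed, and the sample values are all assumptions for illustration, not a prescribed procedure.

```python
import random
from statistics import mean, stdev

def bootstrap_se(data, statistic=mean, n_resamples=1000, seed=0):
    """Estimate the standard error of `statistic` by resampling
    `data` with replacement."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    n = len(data)
    estimates = [
        statistic(rng.choices(data, k=n))  # one resample of size n, with replacement
        for _ in range(n_resamples)
    ]
    # The spread of the resampled statistics approximates its variability
    return stdev(estimates)

# Hypothetical observed sample
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
se_of_mean = bootstrap_se(sample)
```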

Week #16 – Binomial Distribution

A Binomial distribution is used to describe an experiment, event, or process for which the probability of success is the same for each trial and each trial has only two possible outcomes.

Week #15 – Uplift or Persuasion Modeling

A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments.

Week #13 – Multiplicity issues

Multiplicity issues arise in a number of contexts, but they generally boil down to the same thing:  repeated looks at a data set in different ways, until something "statistically significant" emerges.

Week #12 – Support vector machines

Support vector machines are used in data mining (predictive modeling, to be specific) for classification of records, by learning from training data.

Week #11 – Attribute

In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another. It might…

Week #10 – Negative Binomial

The negative binomial distribution is the probability distribution of the number of Bernoulli (yes/no) trials required to obtain r successes.
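
Under this definition, and assuming independent trials that each succeed with the same probability p (an assumption not spelled out in the excerpt), the probability that exactly n trials are needed is C(n-1, r-1) p^r (1-p)^(n-r) for n >= r. A small Python sketch, with the hypothetical function name `neg_binomial_pmf`:

```python
from math import comb

def neg_binomial_pmf(n, r, p):
    """Probability that exactly n Bernoulli trials are needed
    to obtain r successes, with success probability p per trial."""
    if r < 1 or not 0 < p <= 1:
        raise ValueError("need r >= 1 and 0 < p <= 1")
    if n < r:
        return 0.0  # r successes cannot occur in fewer than r trials
    # The last trial must be a success; the other r-1 successes
    # fall anywhere among the first n-1 trials.
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

# Chance that three successes take exactly three fair-coin trials
prob = neg_binomial_pmf(3, 3, 0.5)
```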

Week #9 – Random Walk

A random walk is a process of random steps, motions, or transitions.  It might be in one dimension (movement along a line), in two dimensions (movements in a plane), or in three dimensions or more.
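
The one-dimensional case can be simulated in a few lines of Python. This sketch assumes the simplest version, in which each step is +1 or -1 with equal probability; the function name `random_walk_1d` and the seed are hypothetical choices for the example.

```python
import random

def random_walk_1d(n_steps, seed=0):
    """Simulate a simple random walk along a line, starting at 0.

    Returns the list of positions visited, including the start.
    """
    rng = random.Random(seed)  # seeded so the path is reproducible
    position = 0
    path = [position]
    for _ in range(n_steps):
        position += rng.choice((-1, 1))  # step left or right with equal chance
        path.append(position)
    return path

path = random_walk_1d(100)
```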

Week #5 – Differencing of a Time Series

Differencing of a time series in discrete time is the transformation of the series into a new time series whose values are the differences between consecutive values of the original series.
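
In code, first-order differencing amounts to subtracting each value from the one that follows it. A minimal Python sketch, with the hypothetical function name `difference` and made-up values:

```python
def difference(series):
    """Return the first differences of a list of values:
    element i of the result is series[i + 1] - series[i]."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

original = [3, 5, 4, 8]
differenced = difference(original)
```

The differenced series is one element shorter than the original, since the first value has no predecessor.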

Week #1 – Data Partitioning

In predictive modeling, data partitioning is the division of the data available for analysis into two or three non-overlapping…

Churn Trigger

Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.

Randomized Trials on online learning

Evidence shows that there is no significant difference between taking an online introductory statistics course and a traditional in-person class.

Facebook IPO

Facebook began trading around 11:30 this morning, and I spent 8 minutes…

Congratulations to Thomas Lumley!

Newly elected American Statistical Association (ASA) Fellow, and recognized for his outstanding professional contributions to and leadership in the field of statistical science.


Immigration

Arizona's immigration law goes before the Supreme Court this week...

Julian Simon birthday

February 12 was the 80th anniversary of the birth of Julian Simon, an early pioneer in resampling methods.

Statistics for Future Presidents

Steve Pierson, Director of Science Policy at ASA, wrote an interesting blog post wondering how statistics for future presidents (or policymakers more generally) would compare with the recommended statistical skills and concepts for others. Take a look and let him know!

The Data Scientist

The stories of the prospective Facebook IPO and of prior IPOs from LinkedIn, Pandora, and Groupon all involve "data scientists". Read an interview with Monica Rogati, Senior Data Scientist at LinkedIn, to see the connection.

Popular Mistakes in Data Mining

John Elder's presentations on common data mining mistakes are a must-see if you have any experience or plans in the data mining arena.

Coffee causes cancer?

"Any claim coming from an observational study is most likely to be wrong." Thus begins "Deming, data and observational studies," just published in "Significance Magazine" (Sept. 2011).

The sacrifice bunt

I was watching a Washington Nationals game on TV a couple of days ago, and the concept of "expected value" ...

Epidemiologist joke

A neurosurgeon, pathologist and epidemiologist are each told to examine a can of sardines on a table in a closed room, and present a report.

The Power of Round

Advertisers shy away from round numbers, believing that $99 appears significantly cheaper than $100...

March Madness

Did the NCAA get the March Madness rankings right? Check out SportsMeasures.com

Bees on the attack

What does Matt Asher's article "Attack of the Hair Trigger Bees" have to do with global warming? Matt Asher runs statisticsblog.com ...

Catastrophe Modeling Assistant

Thinking about careers that use statistics? The job title "catastrophe modeling assistant" caught my eye recently in a job announcement. ...

Random Monkeys

One of my gifts this holiday season was "The Drunkard's Walk: How Randomness Rules Our Lives,"…
