#### Week #40 – Natural Language

A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term...

#### Week #39 – White Hat Bias

White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals.

#### Week #38 – Edge

An edge is a link between two people or entities in a network that can be

#### Week #37 – Stratified Sampling

Stratified sampling is a method of random sampling.
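
As a rough illustration, one common form of stratified sampling splits the population into non-overlapping groups (strata) and draws a simple random sample from each. The sketch below assumes proportional allocation (the same fraction from every stratum); the function name `stratified_sample`, its arguments, and the toy population are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Draw the same fraction, at random and without replacement, from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)  # group records by stratum label
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one record per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 urban and 20 rural records.
population = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
sample = stratified_sample(population, key=lambda rec: rec[0], fraction=0.1, seed=1)
```

With a 10% fraction, the sample keeps the population's 80/20 split between the two strata.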

#### Week #36 – Conditional Probability

Quoting probabilities without specifying the sample space can create ambiguity when the sample space is not self-evident.

#### Week #35 – Continuous vs. Discrete Distributions

A discrete distribution is one in which the data can only take on certain values, for example integers.  A continuous distribution is one in which data can take on any value within a specified range (which may be infinite).

#### Week #34 – Central Limit Theorem

The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the probability distribution of the population from which the sample is drawn (provided that population has finite variance).
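
A small simulation can make this concrete. The sketch below is only an illustration, not a proof: it draws from an exponential population (chosen here as an example of a strongly skewed distribution) and compares the skewness of the sample means for samples of size 1 and size 50. The helper names `sample_means` and `skewness`, the seed, and the sample sizes are all assumptions made for the example.

```python
import random
import statistics

def sample_means(draw, sample_size, n_samples):
    """Return the means of n_samples independent samples, each of sample_size draws."""
    return [
        statistics.fmean([draw() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

def skewness(values):
    """Population skewness: 0 for a symmetric distribution such as the Normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

rng = random.Random(0)  # fixed seed so the run is repeatable

def draw_exponential():
    """One draw from a right-skewed exponential population with rate 1."""
    return rng.expovariate(1.0)

skew_n1 = skewness(sample_means(draw_exponential, 1, 5000))
skew_n50 = skewness(sample_means(draw_exponential, 50, 5000))
print(f"skewness of sample means, n=1: {skew_n1:.2f}; n=50: {skew_n50:.2f}")
```

The skewness of the means should be much closer to zero at n=50 than at n=1, which is the pattern the theorem predicts.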

#### Week #33 – Classification and Regression Trees (CART)

Classification and regression trees (CART) are a set of techniques for classification and prediction.

#### Week #32 – CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.

#### Week #31 – Census

In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is taken.

#### Week #30 – Discriminant analysis

Discriminant analysis is a method of distinguishing between classes of objects.  The objects are typically represented as rows in a matrix.

#### Week #29 – Training data

Training data is also called the training sample, training set, or calibration sample. The context is predictive modeling (also called supervised data mining), where you have data with multiple predictor variables and a single known outcome or target variable.

#### Week #28 – Bias

A general statistical term meaning a systematic (not random) deviation of an estimate from the true value.

#### Week #27 – Backward Elimination

One of several computer-based iterative procedures for selecting variables to use in a model.  The process begins...

#### Week #26 – Statistical Significance

Outcomes of an experiment or repeated events are statistically significant if they differ by more than chance variation might plausibly produce.

#### Week #25 – Family-wise Type I Error

In multiple comparison procedures, family-wise type I error is the probability that, even if all samples come from the same population, you will wrongly conclude

#### Week #24 – Cohort study

A cohort study is a longitudinal study that identifies a group of subjects sharing some attributes (a "cohort") then

#### Week #23 – Coefficient of variation

The coefficient of variation is the standard deviation of a data set, divided by the mean of the same data set.
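
The definition translates directly into a few lines of code. This sketch assumes the population standard deviation (`statistics.pstdev`); a sample standard deviation would give a slightly larger value. The function name `coefficient_of_variation` and the data are made up for illustration.

```python
import statistics

def coefficient_of_variation(data):
    """Standard deviation divided by the mean of the same data set."""
    mean = statistics.fmean(data)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is zero")
    return statistics.pstdev(data) / mean  # assumption: population standard deviation

# Illustrative data with mean 5 and population standard deviation 2.
cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
print(cv)
```

Because the result is a ratio, it has no units, which makes it convenient for comparing variability across data sets measured on different scales.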

#### Week #22 – Coefficient of Determination

In regression analysis, the coefficient of determination is a measure of goodness-of-fit (i.e. how well or tightly the data fit the estimated model).  The coefficient is

#### Week #21 – Consistent Estimator

An estimator is a measure or metric intended to be calculated from a sample drawn from a larger population...

#### Week #20 – Collinearity

In regression analysis, collinearity of two variables means that strong correlation exists between them, making it difficult or impossible to estimate their individual regression coefficients reliably.

#### Week #19 – Cohort study

A cohort study is a longitudinal study that identifies a population or large group (a "cohort") then draws a sample from the population at various points in time and records data for the sample.

#### Week #18 – Centroid

The centroid is a measure of center in multi-dimensional space.
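
Assuming the usual definition (the mean of each coordinate taken separately), a centroid can be computed as in the sketch below. The function name `centroid` and the example points are illustrative only.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty collection of equal-length points."""
    if not points:
        raise ValueError("centroid of an empty set of points is undefined")
    n = len(points)
    # zip(*points) regroups the data by dimension: all x values, all y values, ...
    return tuple(sum(coords) / n for coords in zip(*points))

# Corners of a square with side 2; its center is (1, 1).
center = centroid([(0, 0), (2, 0), (2, 2), (0, 2)])
print(center)
```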

#### Week #17 – Bootstrapping

Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application

#### Week #16 – Binomial Distribution

A Binomial distribution is used to describe an experiment, event, or process for which the probability of success is the same for each trial and each trial has only two possible outcomes.
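
For a concrete number, the probability of exactly k successes can be computed from the standard binomial formula. The sketch assumes a fixed number n of independent trials, which the entry above does not state explicitly; the function name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 4 tosses of a fair coin.
prob_two_heads = binomial_pmf(2, 4, 0.5)
print(prob_two_heads)
```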

#### Week #15 – Uplift or Persuasion Modeling

A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments.

#### Week #14 – Network Analytics

Network analytics is the science of describing and, especially, visualizing the connections among objects.

#### Week #13 – Multiplicity issues

Multiplicity issues arise in a number of contexts, but they generally boil down to the same thing:  repeated looks at a data set in different ways, until something "statistically significant" emerges.

#### Week #12 – Support vector machines

Support vector machines are used in data mining (predictive modeling, to be specific) for classification of records, by learning from training data.

#### Week #11 – Attribute

In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another.  It might

#### Week #10 – Negative Binomial

The negative binomial distribution is the probability distribution of the number of Bernoulli (yes/no) trials required to obtain r successes.
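
Under the "number of trials" convention used above, the r-th success lands on trial n only if the first n − 1 trials contain exactly r − 1 successes and trial n is itself a success. The sketch below codes that up, assuming independent trials with a constant success probability p; the function name `negative_binomial_pmf` is hypothetical. (Some software instead counts failures before the r-th success, which shifts the values by r.)

```python
from math import comb

def negative_binomial_pmf(n, r, p):
    """P(the r-th success occurs on trial n) for independent trials with success probability p."""
    if n < r:
        return 0.0  # r successes cannot fit in fewer than r trials
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

# With r = 1 this reduces to the geometric distribution:
# first success on the third toss of a fair coin.
prob_third_toss = negative_binomial_pmf(3, 1, 0.5)
print(prob_third_toss)
```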

#### Week #9 – Random Walk

A random walk is a process of random steps, motions, or transitions.  It might be in one dimension (movement along a line), in two dimensions (movements in a plane), or in three dimensions or more.
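
The one-dimensional case is easy to simulate. This minimal sketch assumes the simplest version, with unit steps left or right chosen with equal probability; the function name `random_walk_1d` and the seed are illustrative.

```python
import random

def random_walk_1d(n_steps, seed=None):
    """Return the positions visited by a walk on the integers, starting at 0."""
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(n_steps):
        position += rng.choice((-1, 1))  # step left or right with equal probability
        path.append(position)
    return path

path = random_walk_1d(20, seed=42)
print(path)
```

Higher-dimensional walks follow the same pattern, with the step chosen among the possible directions in the plane or in space.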

#### Week #8 – Cover Time

Cover time is the expected number of steps in a random walk required

#### Week #7 – Cross-Validation

Cross-validation is a general computer-intensive approach used in estimating the accuracy of statistical models.
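
One widely used variant is k-fold cross-validation: the records are split into k parts, and each part in turn is held out for testing while the model is fit on the rest. The sketch below shows only the splitting step and assumes the records are already in random order; the function name `k_fold_indices` is hypothetical, and fitting and scoring a model are left out.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for each of k roughly equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_records))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), "training records,", len(test_idx), "held out")
```

Averaging the model's error over the held-out folds gives the accuracy estimate.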

#### Week #6 – Distance Matrix

A distance matrix (also called a dissimilarity matrix) describes the pairwise distinctions between M objects.
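
For M objects the result is an M × M table. The sketch below assumes the objects are numeric points and uses Euclidean distance as the dissimilarity measure, though other measures are possible; the function name `distance_matrix` and the points are illustrative.

```python
from math import dist

def distance_matrix(objects):
    """Return an M x M list of lists of pairwise Euclidean distances."""
    return [[dist(a, b) for b in objects] for a in objects]

points = [(0, 0), (3, 4), (6, 8)]
matrix = distance_matrix(points)
for row in matrix:
    print(row)
```

The diagonal is zero (each object is at distance zero from itself) and the matrix is symmetric.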

#### Week #5 – Differencing of a Time Series

Differencing of a time series in discrete time is the transformation of the series to a new time series where the values are the differences between consecutive values of the original series.
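
In code, first-order differencing is a one-liner. The function name `difference` and the example series are made up for illustration.

```python
def difference(series):
    """Return the differences between consecutive values of the series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

original = [3, 5, 9, 8]
differenced = difference(original)
print(differenced)
```

The differenced series is one value shorter than the original, since the first observation has no predecessor.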

#### Week #4 – Dichotomous

Dichotomous (outcome or variable) means "having only two possible values", e.g.

#### Week #2 – Density Function

A probability density function is a curve used

#### Week #1 – Data Partitioning

In predictive modeling, data partitioning is the division of the data available for analysis into two or three non-overlapping

#### 2013 – The International Year of Statistics

Promoting better understanding of statistics throughout the world.

#### Congratulations to Michelle Everson!

New Editor of Journal of Statistics Education

#### Airline passenger screening can be random

Read Peter's Letter to the Editor in Saturday's Washington Post.

#### Churn Trigger

Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.

#### Randomized Trials on online learning

Evidence shows that there is no significant difference between taking an online introductory statistics course and a traditional in-person class.

Facebook began trading around 11:30 this morning, and I spent 8 minutes

#### Congratulations to Thomas Lumley!

Newly elected American Statistical Association (ASA) Fellow, and recognized for his outstanding professional contributions to and leadership in the field of statistical science.

#### Immigration

Arizona's immigration law goes before the Supreme Court this week...

#### Revisiting Catastrophe Modeling Assistant

I saw this job posting a while ago, and, in my next life,

#### Congratulations to David Unwin – Honors of the Association of American Geographers

David Unwin, Emeritus Chair in Geography, Birkbeck College, University of London (and instructor at Statistics.com!) will be awarded the Association of American Geographers (AAG) Ronald F. Abler Distinguished Service Honors at the upcoming annual meeting next week.

#### Julian Simon birthday

February 12 was the 80th anniversary of the birth of Julian Simon, an early pioneer in resampling methods.

#### Statistics for Future Presidents

Steve Pierson, Director of Science Policy at the ASA, wrote an interesting blog post wondering how statistics for future presidents (or policymakers more generally) would compare with the statistical skills and concepts recommended for others. Take a look and let him know!

#### Congratulations to David Unwin on a New Edited Volume

Teaching Geographic Information Science and Technology in Higher Education, 2012 (Wiley)

#### The Data Scientist

The stories of the prospective Facebook IPO and of prior IPOs from LinkedIn, Pandora, and Groupon all involve "data scientists". Read an interview with Monica Rogati, Senior Data Scientist at LinkedIn, to see the connection.

#### Congratulations to Michelle Everson for winning the 2011 Waller Education Award.

Dr. Michelle Everson is recognized for her outstanding contributions to and innovation in the teaching of elementary statistics.

#### Popular Mistakes in Data Mining

John Elder's presentations on common data mining mistakes are a must-see if you have any experience or plans in the data mining arena.

#### Coffee causes cancer?

"Any claim coming from an observational study is most likely to be wrong." Thus begins "Deming, data and observational studies," just published in "Significance Magazine" (Sept. 2011).

#### The sacrifice bunt

I was watching a Washington Nationals game on TV a couple of days ago, and the concept of "expected value" ...

#### Epidemiologist joke

A neurosurgeon, pathologist and epidemiologist are each told to examine a can of sardines on a table in a closed room, and present a report.

#### What do teenagers want?

What do teenagers want? More importantly for the music industry, what music will they buy?

#### The Power of Round

Advertisers shy away from round numbers, believing that \$99 appears significantly cheaper than \$100...

Did the NCAA get the March Madness rankings right? Check out SportsMeasures.com

#### Bees on the attack

What does Matt Asher's article "Attack of the Hair Trigger Bees" have to do with global warming? Matt Asher runs statisticsblog.com ...

#### The First Gallup Poll

The first Gallup Poll was published in October, 1935. In America Speaks,

#### Catastrophe Modeling Assistant

Thinking about careers that use statistics? The job title "catastrophe modeling assistant" caught my eye recently in a job announcement. ...

#### Random Monkeys

One of my gifts this holiday season was "The Drunkard's Walk: How Randomness Rules Our Lives,"