#### Course Spotlight: Customer Analytics in R

"The customer is always right" was the motto Selfridge's department store coined in 1909. "We'll tell the customer what they want" was Madison Avenue's mantra starting in the 1950's. Now data scientists like Karolis Urbonas help companies like Amazon (where he works in Europe as…

#### Course Spotlight: Predictive Analytics

Predicting whether an internet user will click on a link or buy a product, whether an insurance claim is fraudulent, whether a home mortgage will be paid on time (or early), how much a house will sell for, what internet ad you should see next,…

#### Course Spotlight: Spatial Statistics Using R

Have you ever needed to analyze data with a spatial component? Geographic clusters of disease, crimes, animals, plants, events? Or describing the spatial variation of something, and perhaps correlating it with some other predictor? Assessing whether the geographic distribution of something departs from randomness? Location data…

#### “Money and Brains” and “Furs and Station Wagons”

"Money and Brains" and "Furs and Station Wagons" were evocative customer shorthands that the marketing company Claritas came up with over a half century ago. These names, which facilitated the work of marketers and sales people, were shorthand descriptions of segments of customers identified through…

#### Course Spotlight: Text Mining

The term text mining is used with two different meanings in computational statistics: Using predictive modeling to label many documents (e.g. legal docs might be "relevant" or "not relevant") - this is what we call text mining. Using grammar and syntax to parse the…

#### CONVOLUTION and TENSOR

Today's Words of the Week are convolution and tensor, key components of deep learning.

#### BENFORD’S LAW

Benford's law describes an expected distribution of the first digit in many naturally-occurring datasets.
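
As a rough illustration of the expected distribution, the sketch below computes the Benford probabilities from the standard formula P(d) = log10(1 + 1/d) and compares them with the leading digits of the first 200 powers of 2. The function names and the choice of test sequence are assumptions made for this example, not part of the original entry.

```python
import math

def benford_probabilities():
    """Expected first-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_frequencies(values):
    """Observed share of each leading digit (1-9) among the nonzero values."""
    counts = {d: 0 for d in range(1, 10)}
    total = 0
    for v in values:
        v = abs(v)
        if v == 0:
            continue  # zero has no leading significant digit
        # In scientific notation the first character is the leading digit.
        digit = int(f"{v:e}"[0])
        counts[digit] += 1
        total += 1
    if total == 0:
        raise ValueError("need at least one nonzero value")
    return {d: c / total for d, c in counts.items()}

expected = benford_probabilities()
# Illustrative data only: powers of 2 are known to track Benford's law closely.
observed = first_digit_frequencies([2 ** k for k in range(1, 201)])

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

Under the formula, a leading 1 is expected about 30% of the time and a leading 9 less than 5% of the time.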

#### CONTINGENCY TABLES

Contingency tables are tables of counts of events or things, cross-tabulated by row and column.
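
A minimal sketch of the cross-tabulation, using made-up treatment and outcome labels (the data and the `contingency_table` helper are hypothetical, for illustration only):

```python
from collections import Counter

def contingency_table(row_values, col_values):
    """Cross-tabulate two equal-length sequences of category labels."""
    counts = Counter(zip(row_values, col_values))
    row_labels = sorted(set(row_values))
    col_labels = sorted(set(col_values))
    # One row per row label, one cell per column label; missing pairs count as 0.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

# Hypothetical records: one treatment and one outcome per subject.
treatment = ["drug", "drug", "placebo", "placebo", "drug", "placebo"]
outcome = ["improved", "improved", "not improved",
           "improved", "not improved", "not improved"]

row_labels, col_labels, table = contingency_table(treatment, outcome)
print(col_labels)
for label, row in zip(row_labels, table):
    print(label, row)
```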

#### HYPERPARAMETER

Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, to refer to parameters of the prior distribution.

#### SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

#### SPLINE

The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables.

#### NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.

#### OVERFIT

As applied to statistical models, "overfit" means the model fits the training data too closely, capturing noise rather than signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:

“The goal is to turn data into information, and information into insight.” – Carly Fiorina, former CEO, Hewlett-Packard Co. Speech given at Oracle OpenWorld “Data is the new science. Big data holds the answers.” – Pat Gelsinger, CEO, EMC, Big Bets on Big Data, Forbes “Hiding within those…

#### Week #18 – n

In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

#### Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…

#### Historical Spotlight: Eugenics – journey to the dark side at the dawn of statistics

April 27 marks the 80th anniversary of the death of Karl Pearson, who contributed to statistics the correlation coefficient, principal components, the (increasingly-maligned) p-value, and much more. Pearson was one of a trio of founding fathers of modern statistics, the others being Francis Galton and…

#### Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2…

#### Week #10 – Arm

In an experiment, an arm is a treatment protocol - for example, drug A, or placebo.   In medical trials, an arm corresponds to a patient group receiving a specified therapy.  The term is also relevant for bandit algorithms for web testing, where an arm consists…

#### Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued.  An example might be a binary matrix used to power web searches - columns representing search terms and rows representing searches,…

#### Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records.  It is often, but not always, randomly drawn.  In matrix form, the rows are records…

#### Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation.  When there are…

#### Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

#### Week #42 – Kruskal-Wallis Test

The Kruskal-Wallis test is a nonparametric test for finding if three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.

#### Week #32 – False Discovery Rate

A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, not significant (a Type-I error). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).

#### Week #23 – Netflix Contest

The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available.  Individuals and teams then compete to develop the best performing model.

#### Week #20 – R

This week's word is actually a letter.  R is a statistical computing and programming language and program, a derivative of the commercial S-PLUS program, which, in turn, was an offshoot of S from Bell Labs.

#### Week #16 – Moving Average

In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
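
The definition can be sketched in a few lines. The function name and the sample series are assumptions for illustration; the last value returned is the forecast for the first period after the data ends.

```python
def moving_average_forecast(series, w):
    """Forecast for each time t as the mean of the w values ending at t-1."""
    if w < 1 or w > len(series):
        raise ValueError("w must be between 1 and len(series)")
    forecasts = []
    # t runs over every period that has w earlier observations,
    # including the first period after the series ends.
    for t in range(w, len(series) + 1):
        window = series[t - w:t]  # observations at times t-w, ..., t-1
        forecasts.append(sum(window) / w)
    return forecasts

# Hypothetical series of five observations, window of w = 3 periods.
series = [10, 12, 14, 16, 18]
forecasts = moving_average_forecast(series, w=3)
print(forecasts)  # forecasts for periods 4, 5 and 6 (1-based)
```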

#### Week #15 – Interaction term

In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.
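
One way to see the joint effect is with a model of the form y = b0 + b1·x1 + b2·x2 + b3·x1·x2, where the product x1·x2 is the interaction term. The coefficients below are invented for illustration; the point of the sketch is that the effect of x1 on the prediction changes with the value of x2.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Linear model with an interaction term; coefficients are hypothetical."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Effect of raising x1 from 0 to 1, at two different levels of x2.
effect_when_x2_is_0 = predict(1, 0) - predict(0, 0)
effect_when_x2_is_1 = predict(1, 1) - predict(0, 1)

print(effect_when_x2_is_0)  # equals b1
print(effect_when_x2_is_1)  # equals b1 + b3
```

Without the b3 term, both differences would equal b1 regardless of x2.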

#### Week #14 – Naive forecast

A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).

#### Week #9 – Overdispersion

In discrete response models, overdispersion occurs when there is more variability in the data (often a result of correlation among observations) than is allowed by the assumptions that the model makes.

#### Week #8 – Confusion matrix

In a classification model, the confusion matrix shows the counts of correct and erroneous classifications.  In a binary classification problem, the matrix consists of 4 cells.
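
For the binary case, the four cells can be tallied as in this sketch. The labels, the data, and the helper name are assumptions for the example, with 1 treated as the positive class.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical actual classes and model predictions for eight records.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cells = confusion_matrix(actual, predicted)
accuracy = (cells["TP"] + cells["TN"]) / len(actual)
print(cells, accuracy)
```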

#### Week #5 – Features vs. Variables

The predictors in a predictive model are sometimes given different terms by different disciplines.  Traditional statisticians think in terms of variables.

#### Course Spotlight: The Text Analytics Sequence

Text analytics or text mining is the natural extension of predictive analytics, and Statistics.com's text analytics program starts Feb. 6. Text analytics is now ubiquitous and yields insight in: Marketing: Voice of the customer, social media analysis, churn analysis, market research, survey analysis Business: Competitive…

#### Course Spotlight: Constrained Optimization

Say you operate a tank farm (to store and sell fuel). How much of each fuel grade should you buy? You have specified flow and storage capacities, constraints on what types of fuels can be stored in which tanks, prior contractual obligations about minimum monthly…

#### College Credit Recommendation

Statistics.com Receives College Recommendation from the American Council on Education (ACE) College Credit Recommendation for Online Data Science Courses from The Institute for Statistics Education at Statistics.com LLC The American Council on Education's College Credit Recommendation Service (ACE CREDIT) has evaluated and recommended college credit…

#### Week #48 – Structured vs. unstructured data

Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).

#### Big Data and Clinical Trials in Medicine

There was an interesting article a couple of weeks ago in the New York Times magazine section on the role that Big Data can play in treating patients -- discovering things that clinical trials are too slow, too expensive, and too blunt to find. The…

#### Week #39 – Censoring

Censoring in time-series data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.

#### Industry Spotlight: The brand premium for Chanel and Harvard

The classic illustration of the power of brand is perfume - expensive perfumes may cost just a few dollars to produce but can be sold for more than \$500 due to the cachet afforded by the brand. David Malan's Computer Science course at Harvard, CSCI…

#### Week #32 – Predictive modeling

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).

#### Week #29 – Goodness-of-fit

Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which

#### Week #23 – Adjacency Matrix

An adjacency matrix describes the relationships in a network. Nodes are listed in the top..

#### Convoys

Ever wonder why, in World War II, ships in convoys were safer than ships traveling on their own? Most people assume it was due to the protection afforded by military escort vessels, of which there was a limited supply (insufficient to protect ships traveling on…

#### Needle in a Haystack

What's the probability that the NSA examined the metadata for your phone number in 2013? According to John Inglis, Deputy Director at the NSA, it's about 0.00001, or 1 in 100,000. A surprisingly small number, given what we've all been reading in the media about…

#### Week #51 – Type 1 error

In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true -- of saying an effect or event is statistically significant when it is not.

#### Predictive Modeling and Typhoon Relief

The devastation wrought by Super-Typhoon Haiyan in the Philippines is the biggest test yet for the nascent technology of "artificial intelligence disaster response," a phrase used by Patrick Meier, a pioneer in the field. When disaster strikes, a flood of social media posts and tweets…

#### Personality regions

There are Red States and Blue States. The three blue states of the Pacific coast constitute the Left Coast. For Colin Woodard, Yankeedom comprises both New England and the Great Lakes. If you're into accessories, there's the Bible Belt, the Rust Belt, and the Stroke…

#### Week #49 – Data partitioning

Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).
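
A simple random three-way split might look like the sketch below. The 60/20/20 proportions, the fixed seed, and the `partition` helper are assumptions chosen for illustration; other proportions are common.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Randomly split records into non-overlapping training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

# Hypothetical dataset: 100 record IDs.
train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Because each record lands in exactly one slice of the shuffled list, the three sets cannot overlap.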

#### Statistics.com Partners With CrowdANALYTIX to Offer New Online Course With Crowdsource Contest As Project

Crowdsourcing, using the power of the crowd to solve problems, has been used for many functions and tasks, including predictive modeling (like the 2009 Netflix Contest). Typically, problems are broadcast to an unknown group of statistical modelers on the Internet, and solutions are sought. Every…

#### Week #43 – Longitudinal data

Longitudinal data records multiple observations over time for a set of individuals or units. A typical..

#### Week #42 – Cross-sectional data

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual.  A simple...

#### Week #32 – CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.

#### Illuminate, Iterate, Involve, Involvement, Iteration, Insight

I did not start off in the field of statistics; my first real job was as a diplomat. And from my undergraduate days I recall a professor who taught a cultural history of Russia. He was one of the world's top experts. Possessed of a…

#### Week #29 – Training data

Also called the training sample, training set, calibration sample.  The context is predictive modeling (also called supervised data mining) -  where you have data with multiple predictor variables and a single known outcome or target variable.

#### Mutual Attraction

Mutual attraction is a dominant force in the universe. Gravity binds the moon to the earth, the earth to the sun, the sun to the galaxy, and one galaxy to another. And yet the universe is expanding; the result is a larger universe comprised of…

#### Week #18 – Centroid

The centroid is a measure of center in multi-dimensional space.
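
In practice the centroid of a set of points is the point whose coordinates are the means of the corresponding coordinates. A minimal sketch, with made-up three-dimensional points and a hypothetical helper name:

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

# Hypothetical points in three-dimensional space.
points = [(0, 0, 0), (2, 4, 6), (4, 2, 0)]
center = centroid(points)
print(center)
```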

#### 2013 – The International Year of Statistics

Promoting better understanding of statistics throughout the world.

#### Congratulations to Michelle Everson!

New Editor of Journal of Statistics Education

#### Airline Passenger Screening Can Be Random

Read Peter's Letter to the Editor in Saturday's Washington Post.

#### Churn Trigger

Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.

#### Randomized Trials on online learning

Evidence shows that there is no significant difference between taking an online introductory statistics course and a traditional in-person class.

Facebook began trading around 11:30 this morning, and I spent 8 minutes

#### Congratulations to Thomas Lumley!

Newly elected American Statistical Association (ASA) Fellow, and recognized for his outstanding professional contributions to and leadership in the field of statistical science.

#### Immigration

Arizona's immigration law goes before the Supreme Court this week...

#### Revisiting Catastrophe Modeling Assistant

I saw this job posting a while ago, and, in my next life,

#### Julian Simon birthday

February 12 was the 80th anniversary of the birth of Julian Simon, an early pioneer in resampling methods.

#### Statistics for Future Presidents

Statistics for Future Presidents - Steve Pierson, Director of Science Policy at ASA, wrote an interesting blog post wondering how statistics for future presidents (or policymakers more generally) would compare with the recommended statistical skills/concepts for others. Take a look and let him know!

#### Congratulations to David Unwin on a New Edited Volume

Teaching Geographic Information Science and Technology in Higher Education, 2012 (Wiley)

#### The Data Scientist

The stories of the prospective Facebook IPO and the prior IPOs of LinkedIn, Pandora, and Groupon all involve "data scientists".  Read an interview with Monica Rogati, Senior Data Scientist at LinkedIn, to see the connection.

#### Congratulations to Michelle Everson for winning the 2011 Waller Education Award.

Dr. Michelle Everson is recognized for her outstanding contributions to and innovation in the teaching of elementary statistics.

#### Popular Mistakes in Data Mining

John Elder's presentations on common data mining mistakes are a must-see if you have any experience or plans in the data mining arena.

#### Coffee causes cancer?

"Any claim coming from an observational study is most likely to be wrong." Thus begins "Deming, data and observational studies," just published in "Significance Magazine" (Sept. 2011).

#### The sacrifice bunt

I was watching a Washington Nationals game on TV a couple of days ago, and the concept of "expected value" ...

#### Epidemiologist joke

A neurosurgeon, pathologist and epidemiologist are each told to examine a can of sardines on a table in a closed room, and present a report.

#### What do teenagers want?

What do teenagers want? More importantly for the music industry, what music will they buy?

#### The Power of Round

Advertisers shy away from round numbers, believing that \$99 appears significantly cheaper than \$100...

Did the NCAA get the March Madness rankings right? Check out SportsMeasures.com

#### Bees on the attack

What does Matt Asher's article "Attack of the Hair Trigger Bees" have to do with global warming? Matt Asher runs statisticsblog.com ...

#### The First Gallup Poll

The first Gallup Poll was published in October, 1935. In America Speaks,

#### Catastrophe Modeling Assistant

Thinking about careers that use statistics? The job title "catastrophe modeling assistant" caught my eye recently in a job announcement. ...

#### Random Monkeys

One of my gifts this holiday season was "The Drunkard's Walk: How Randomness Rules Our Lives,"