# Blog

## Week #32 – Predictive modeling

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) from a set of predictor variables (e.g. income, house value, outstanding debt).
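As a rough illustration of the idea, here is a minimal sketch that fits a one-predictor logistic regression by gradient descent. The data, the debt-to-income predictor, and the names `records`, `fit_logistic`, and `predict_proba` are all hypothetical, chosen only for the example; logistic regression is just one of many models that could play this role.

```python
import math

# Hypothetical toy data (not real figures): (debt-to-income ratio, defaulted? 1 = yes, 0 = no).
records = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
           (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data, lr=0.5, epochs=5000):
    """Fit weight w and intercept b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            error = sigmoid(w * x + b) - y  # predicted probability minus actual outcome
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    """Predicted probability of default for a new predictor value x."""
    w, b = model
    return sigmoid(w * x + b)


model = fit_logistic(records)
print("P(default | ratio = 0.15):", round(predict_proba(model, 0.15), 3))
print("P(default | ratio = 0.85):", round(predict_proba(model, 0.85), 3))
```

The fitted model assigns a low default probability to the low-ratio applicant and a high one to the high-ratio applicant, which is the "predict the target from the predictors" step in miniature.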

## Week #29 – Goodness-of-fit

Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which…

## Week #23 – Adjacency Matrix

An adjacency matrix describes the relationships in a network. Nodes are listed in the top…

## Convoys

Ever wonder why, in World War II, ships in convoys were safer than ships traveling on their own? Most people assume it was due to the protection afforded by military escort vessels, of which there was a limited supply (insufficient to protect ships traveling on their own). Actually, most of the benefit came from the…

## Dialects

When talking to several people, do you address them as “you guys”? “Y’all”? Just “you”? And is the carbonated soft drink “soda” or “pop”? Maps based on survey responses to questions like this were published in the Harvard Dialect Survey in 2003. Josh Katz took the data and produced extended visualizations and, last month, a…

## Needle in a Haystack

What’s the probability that the NSA examined the metadata for your phone number in 2013? According to John Inglis, Deputy Director at the NSA, it’s about 0.00001, or 1 in 100,000. A surprisingly small number, given what we’ve all been reading in the media about NSA’s massive data collection effort. Of course, that’s an unconditional…

## Week #51 – Type I error

In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true, that is, of saying an effect or event is statistically significant when it is not.
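A small simulation can make this concrete. The sketch below repeatedly draws samples for which the null hypothesis (mean = 0) really is true, runs a two-sided z-test at a chosen significance level, and counts how often the test rejects anyway. The function name `type_i_error_rate`, the sample size, the known standard deviation of 1, and the seed are all assumptions made for the illustration.

```python
import random
from statistics import NormalDist


def type_i_error_rate(trials=4000, n=30, alpha=0.05, seed=0):
    """Estimate how often a two-sided z-test rejects a null hypothesis that is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        # The null is true by construction: data come from a normal with mean 0, sd 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5  # z statistic with known sd = 1
        if abs(z) > z_crit:
            rejections += 1  # a Type I error: rejecting a true null
    return rejections / trials


rate = type_i_error_rate()
print("Estimated Type I error rate:", rate)
```

Under these assumptions the estimated rate should land near the chosen alpha, since alpha is by design the probability of a Type I error when the null hypothesis holds.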

## Predictive Modeling and Typhoon Relief

The devastation wrought by Super-Typhoon Haiyan in the Philippines is the biggest test yet for the nascent technology of “artificial intelligence disaster response,” a phrase used by Patrick Meier, a pioneer in the field. When disaster strikes, a flood of social media posts and tweets ensues. There is useful information in the data flood, but…

## Personality regions

There are Red States and Blue States. The three blue states of the Pacific coast constitute the Left Coast. For Colin Woodard, Yankeedom comprises both New England and the Great Lakes. If you’re into accessories, there’s the Bible Belt, the Rust Belt, and the Stroke Belt. In the Journal of Personality and Social Psychology (online…

## Week #49 – Data partitioning

Data partitioning in data mining is the division of the available data into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).
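A three-way split of this kind might be sketched as follows. The function name `partition`, the 60/20/20 proportions, and the fixed seed are assumptions for illustration; the appropriate fractions depend on how much data is available.

```python
import random


def partition(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into non-overlapping training, validation, and test sets."""
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains after the first two sets
    return train, valid, test


train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before slicing keeps any ordering in the original data (by date, by customer ID, and so on) from leaking into the split, and slicing a single shuffled list guarantees the three sets do not overlap.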

## Terrorist Clusters

The “righteous vengeance gun attack” is just one of 10 types of terrorism identified by Chenoweth and Lowham via statistical clustering techniques. Another cluster is “bombings of a public population where a liberation group takes responsibility.” You can read about the 10 clusters, and the 44 dichotomous variables (suicide or not, bombing or not, religious…

## Statistics.com Partners With CrowdANALYTIX to Offer New Online Course With Crowdsource Contest As Project

Crowdsourcing, using the power of the crowd to solve problems, has been used for many functions and tasks, including predictive modeling (like the 2009 Netflix Contest). Typically, problems are broadcast to an unknown group of statistical modelers on the Internet, and solutions are sought. Every crowdsourced project harnesses the power of the community to find…

## Week #43 – Longitudinal data

Longitudinal data records multiple observations over time for a set of individuals or units. A typical…

## Week #42 – Cross-sectional data

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual. A simple…

## Week #32 – CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.
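The chi-squared test at the heart of the method can be sketched in a few lines. The example below computes the Pearson chi-squared statistic between each candidate predictor and the class label and picks the predictor with the strongest association as the first split. This is a simplified illustration only: a full CHAID implementation also merges similar categories and compares adjusted p-values rather than raw statistics. The data and the names `chi_squared`, `owns_home`, `region`, and `default` are hypothetical.

```python
from collections import Counter


def chi_squared(xs, ys):
    """Pearson chi-squared statistic for the contingency table of two categorical lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed count for each (predictor, class) cell
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    stat = 0.0
    for x, x_count in x_totals.items():
        for y, y_count in y_totals.items():
            expected = x_count * y_count / n  # count expected if x and y were independent
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat


# Hypothetical training sample of already-classified objects.
owns_home = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
region = ["east", "west", "east", "west", "east", "west", "east", "west"]
default = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]

candidates = {"owns_home": owns_home, "region": region}
scores = {name: chi_squared(values, default) for name, values in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> split first on", best)
```

Both candidate predictors here have two categories, so comparing raw statistics gives the same ranking as comparing p-values; with predictors of differing numbers of categories the p-values would be needed.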

## Illuminate, Iterate, Involve, Involvement, Iteration, Insight

I did not start off in the field of statistics; my first real job was as a diplomat. And from my undergraduate days I recall a professor who taught a cultural history of Russia. He was one of the world’s top experts. Possessed of a tremendous store of knowledge (a leading author in the field,…

## Week #29 – Training data

Training data is also called the training sample, training set, or calibration sample. The context is predictive modeling (also called supervised data mining), where you have data with multiple predictor variables and a single known outcome or target variable.