# Our Latest Blogs


#### Week #47 – Feature engineering

In predictive modeling, a key step is to turn available data (which may come from varied sources and be messy) into an orderly matrix of rows (records to be predicted) and columns (predictor variables or features).

*November 7, 2014*

#### Week #46 – Naive bayes classifier

A full Bayesian classifier is a supervised learning technique that assigns a class to a record by finding other records with attributes just like it has, and finding the most prevalent class among them.

*November 7, 2014*

#### Week #45 – MapReduce

In computer science, MapReduce is a procedure that prepares data for parallel processing on multiple computers.

*November 7, 2014*

#### Big Data and Clinical Trials in Medicine

There was an interesting article a couple of weeks ago in the New York Times magazine section on the role that Big Data can play in treating patients -- discovering things that clinical trials are too slow, too expensive, and too blunt to find. The…

*October 20, 2014*

#### Week #44 – Likert scales

Likert scales are categorical ordinal scales used in social sciences to measure attitude. A typical example is a set of response options ranging from "strongly agree" to "strongly disagree."

*October 10, 2014*

#### Week #43 – Node

A node is an entity in a network. In a social network, it would be a person. In a digital network, it would be a computer or device.

*October 10, 2014*

#### Week #42 – Latent Variable Models

Latent variable models postulate some relationship between the statistical properties of observable variables.

*October 1, 2014*

#### Week #41 – K-nearest neighbor

K-nearest-neighbor (K-NN) is a machine learning predictive algorithm that relies on calculation of distances between pairs of records.

*October 1, 2014*
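
The distance-and-vote idea in the K-NN entry can be made concrete with a minimal from-scratch sketch. This is illustrative only, not code from the original post; `knn_predict`, the toy points, and k=3 are all assumptions:

```python
# Hypothetical minimal k-NN classifier (names and data are illustrative)
from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    # Euclidean distance from the query to every training record
    dists = sorted((math.dist(row, query), lbl) for row, lbl in zip(train, labels))
    top_k = [lbl for _, lbl in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Two toy clusters of 2-D records
train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train, labels, (2, 2)))  # → A
print(knn_predict(train, labels, (8, 7)))  # → B
```

In practice the records would be scaled first, since raw Euclidean distance lets large-valued variables dominate.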

#### Word #40 – Kappa Statistic

The kappa statistic measures the extent to which different raters or examiners differ when looking at the same data and assigning categories.

*September 19, 2014*

#### Word #39 – Censoring

Censoring in time-series data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.

*September 19, 2014*

#### Word #38 – Survival Analysis

Survival analysis is a set of methods used to model and analyze survival data, also called time-to-event data.

*September 18, 2014*

#### Word #37 – Joint Probability Distribution

The probability distribution for X is the possible values of X and their associated probabilities. With two separate discrete random variables, X and Y, the joint probability distribution is the function f(x,y)

*September 18, 2014*

#### Word #36 – The Jackknife

With a sample of size N, the jackknife involves calculating N values of the estimator, with each value calculated on the basis of the entire sample less one observation.

*September 18, 2014*

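
The leave-one-out recomputation described in the jackknife entry fits in a few lines. A hedged sketch, not from the original post; `jackknife_values` and the toy data are assumptions:

```python
import statistics

def jackknife_values(data, estimator):
    """Recompute `estimator` N times, leaving out one observation each time."""
    return [estimator(data[:i] + data[i + 1:]) for i in range(len(data))]

data = [2.0, 4.0, 6.0, 8.0]
vals = jackknife_values(data, statistics.mean)
print(vals)  # four leave-one-out means; e.g. vals[0] is the mean with 2.0 omitted
```

The spread of these N values is then used to estimate the bias and standard error of the estimator.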

#### Word #35 – Interim Monitoring

In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results - say, response to a medication.

*September 18, 2014*

#### Industry Spotlight: The brand premium for Chanel and Harvard

The classic illustration of the power of brand is perfume - expensive perfumes may cost just a few dollars to produce but can be sold for more than $500 due to the cachet afforded by the brand. David Malan's Computer Science course at Harvard, CSCI…

*September 11, 2014*

#### Industry Spotlight: SAS is back

The big news from the SAS world this summer was the release, on May 28, of the SAS University Edition, which brings the effective price for a single user edition of SAS down from around $10,000 to $0. It does most of the things that…

*August 21, 2014*

#### Twitter Sentiment vs. Survey Methods

Nobody expects Twitter feed sentiment analysis to give you unbiased results the way a well-designed survey will. A Pew Research study found that Twitter political opinion was, at times, much more liberal than that revealed by public opinion polls, while it was more conservative at…

*August 4, 2014*

#### Internet of Things

Boston, August 3 2014: Bill Ruh, GE Software Center, says that the Internet of Things, 30 billion machines talking to one another, will dwarf the impact of the consumer internet. Speaking at the Joint Statistical Meetings today, Ruh predicted that the marriage of the IoT…

*August 3, 2014*

#### Word #34 – NoSQL

A NoSQL database is distinguished mainly by what it is not -

*July 28, 2014*

#### Word #33 – Similarity matrix

A similarity matrix shows how similar records are to each other.

*July 28, 2014*

#### Word #32 – Predictive modeling

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).

*July 28, 2014*

#### Word #31 – Hold-out sample

A hold-out sample is a random sample from a data set that is withheld and not used in the model fitting process. After the model...

*July 28, 2014*

#### Week #30 – Heteroscedasticity

Heteroscedasticity generally means unequal variation of data, e.g. unequal variance. More specifically,

*July 28, 2014*

#### Week #29 – Goodness-of-fit

Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which

*July 15, 2014*

#### Week #28 – Geometric Mean

The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful in taking averages of ratios.

*July 15, 2014*
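
The multiply-then-take-the-nth-root recipe in the geometric mean entry translates directly to code. A minimal sketch (the function name and ratios are illustrative, not from the original post):

```python
import math

def geometric_mean(values):
    """Multiply all n values together, then take the nth root (values must be positive)."""
    return math.prod(values) ** (1 / len(values))

# Averaging growth ratios: +50%, -20%, +10% over three periods
ratios = [1.5, 0.8, 1.1]
print(round(geometric_mean(ratios), 4))  # ≈ 1.097, the equivalent steady per-period ratio
```

Python 3.8+ also ships `statistics.geometric_mean`, which computes the same quantity.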

#### Week #27 – Hierarchical Linear Models

Hierarchical linear modeling is an approach to analysis of hierarchical (nested) data - i.e. data represented by categories, sub-categories, ..., individual units (e.g. school -> classroom -> student).

*June 6, 2014*

#### Week #26 – Hazard Function

In medical statistics, the hazard function is a relationship between a proportion and time.

*June 6, 2014*

#### Week #25 – Fleming multiple testing procedure

The Fleming procedure (or *O'Brien-Fleming multiple testing procedure*) is a simple multiple testing procedure for comparing two treatments when the response to treatment is dichotomous. This procedure...

*May 30, 2014*

#### Week #24 – Directed vs. Undirected Network

In a directed network, connections between nodes are directional. For example...

*May 30, 2014*

#### Week #23 – Adjacency Matrix

An adjacency matrix describes the relationships in a network. Nodes are listed in the top...

*May 30, 2014*

#### Week #22 – Exponential Distribution

The exponential distribution is a model for the length of intervals between two consecutive random events in time or

*May 30, 2014*

#### Week #21 – Error

Error is the deviation of an estimated quantity from its true value, or, more precisely,

*May 29, 2014*

#### Week #20 – Step-wise Regression

Step-wise regression is one of several computer-based iterative variable-selection procedures.

*May 16, 2014*

#### Week #19 – Regularization

Regularization refers to a wide variety of techniques used to bring structure to statistical models in the face of data size, complexity and sparseness.

*May 9, 2014*

#### Week #18 – SQL

SQL stands for structured query language, a high-level language for querying relational databases and extracting information.

*March 28, 2014*

#### Week #17 – Markov Chain Monte Carlo (MCMC)

A Markov chain is a probability system that governs transition among states or through successive events.

*March 14, 2014*

#### Week #16 – MapReduce

MapReduce is a programming framework to distribute the computing load of very large data and problems to multiple computers.

*March 14, 2014*
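
The map/shuffle/reduce division of labor described in the MapReduce entries can be sketched in-process with the classic word count. This is a hypothetical single-machine illustration of the programming model, not the distributed framework itself:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # map: emit one (word, 1) pair per word in the document
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):
    # shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # reduce: combine each key's values; here, sum the counts
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["to be or not to be", "to do is to be"]
pairs = list(chain.from_iterable(map_phase(d) for d in docs))
counts = reduce_phase(shuffle(pairs))
print(counts["to"])  # → 4
```

The point of the design is that map and reduce calls are independent per key, so a framework like Hadoop can farm them out to many machines.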

#### Week #15 – Hadoop

As data processing requirements grew beyond the capacities of even large computers, distributed computing systems were developed to spread the load to multiple computers.

*March 14, 2014*

#### Week #14 – Curse of Dimensionality

The curse of dimensionality is the affliction caused by adding variables to multivariate data models.

*March 14, 2014*

#### Week #13 – Data Product

A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products.

*February 21, 2014*

#### Convoys

Ever wonder why, in World War II, ships in convoys were safer than ships traveling on their own? Most people assume it was due to the protection afforded by military escort vessels, of which there was a limited supply (insufficient to protect ships traveling on…

*February 19, 2014*

#### Week #12 – Dependent and Independent Variables

Statistical models normally specify how one set of variables, called dependent variables, functionally depends on another set of variables, called independent variables.

*February 12, 2014*

#### Week #11 – Distance

Statistical distance is a measure calculated between two records that are typically part of a larger dataset, where rows are records and columns are variables. To calculate...

*February 5, 2014*

#### Week #10 – Decile Lift

In predictive modeling, the goal is to make predictions about outcomes on a case-by-case basis: an insurance claim will be fraudulent or not, a tax return will be correct or in error, a subscriber...

*February 5, 2014*

#### Week #9 – Decision Trees

In the machine learning community, a decision tree is a branching set of rules used to classify a record, or predict a continuous value for a record. For example

*February 5, 2014*

#### Week #8 – Feature Selection

In predictive modeling, feature selection, also called variable selection, is the process (usually automated) of sorting through variables to retain variables that are likely...

*February 5, 2014*

#### Week #7 – Bagging

In predictive modeling, bagging is an ensemble method that uses bootstrap replicates of the original training data to fit predictive models.

*February 5, 2014*

#### Week #6 – Boosting

In predictive modeling, boosting is an iterative ensemble method that starts out by applying a classification algorithm and generating classifications.

*February 5, 2014*

#### Week #5 – Ensemble Methods

In predictive modeling, ensemble methods refer to the practice of taking multiple models and averaging their predictions.

*February 5, 2014*

#### Dialects

When talking to several people, do you address them as "you guys"? "Y'all"? Just "you"? And is the carbonated soft drink "soda" or "pop?" Maps based on survey responses to questions like this were published in the Harvard Dialect Survey in 2003. Josh Katz took…

*January 21, 2014*

#### Needle in a Haystack

What's the probability that the NSA examined the metadata for your phone number in 2013? According to John Inglis, Deputy Director at the NSA, it's about 0.00001, or 1 in 100,000. A surprisingly small number, given what we've all been reading in the media about…

*January 10, 2014*

#### Week #4 – Expected value

The expected value of a random variable, in a simple sense, is nothing but the arithmetic mean.

*December 9, 2013*
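
For a discrete random variable, the "arithmetic mean" view of expected value is the probability-weighted sum E[X] = Σ x·P(X=x). A small illustrative sketch (the function and examples are assumptions, not from the original post):

```python
def expected_value(dist):
    """E[X] = sum of x * P(X = x); `dist` maps outcomes to probabilities summing to 1."""
    return sum(x * p for x, p in dist.items())

# A bet paying $100 with probability 0.5, else $0
print(expected_value({100: 0.5, 0: 0.5}))  # → 50.0
```

For a long run of identically distributed draws, the sample's arithmetic mean converges to this value.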

#### Week #3 – Exact Tests

Exact tests are hypothesis tests that are guaranteed to produce Type-I error at or below the nominal alpha level of the test when conducted on samples drawn from a null model.

*December 9, 2013*

#### Week #2 – Error

In statistical models, error or residual is the deviation of the estimated quantity from its true value: the greater the deviation, the greater the error.

*December 9, 2013*

#### Week #1 – Endogenous variable

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model.

*December 9, 2013*

#### Week #53 – Effect size

In a study or experiment with two groups (usually control and treatment), the investigator typically has in mind the magnitude of the difference between the two groups that he or she wants to be able to detect in a hypothesis test.

*December 9, 2013*

#### Week #52 – Alpha spending function

In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results - say, response to a medication.

*November 22, 2013*

#### Week #51 – Type 1 error

In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true -- of saying an effect or event is statistically significant when it is not.

*November 22, 2013*

#### Predictive Modeling and Typhoon Relief

The devastation wrought by Super-Typhoon Haiyan in the Philippines is the biggest test yet for the nascent technology of "artificial intelligence disaster response," a phrase used by Patrick Meier, a pioneer in the field. When disaster strikes, a flood of social media posts and tweets…

*November 11, 2013*

#### Personality regions

There are Red States and Blue States. The three blue states of the Pacific coast constitute the Left Coast. For Colin Woodward, Yankeedom comprises both New England and the Great Lakes. If you're into accessories, there's the Bible Belt, the Rust Belt, and the Stroke…

*October 30, 2013*

#### Week #50 – Stationary time series

A time series x(t); t=1,... is considered to be stationary if its statistical properties do not depend on time t.

*October 25, 2013*

#### Week #49 – Data partitioning

Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).

*October 25, 2013*

#### Week #48 – Data Mining

Data mining is concerned with finding latent patterns in large databases.

*October 25, 2013*

#### Week #47 – Z-score

An observation's z-score tells you the number of standard deviations it lies away from the population mean (and in which direction).

*October 25, 2013*
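
The z-score entry reduces to the formula z = (x − μ) / σ. A minimal sketch assuming a small population where the mean and standard deviation come out exactly (names and data are illustrative):

```python
import statistics

def z_score(x, population):
    """(x - mu) / sigma, using the population standard deviation."""
    mu = statistics.mean(population)
    sigma = statistics.pstdev(population)
    return (x - mu) / sigma

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population sd 2
print(z_score(9, data))  # → 2.0 (two standard deviations above the mean)
```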

#### Week #46 – Cluster Analysis

In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups, or, more precisely, groups whose members are all close to one another on various dimensions being measured.

*October 21, 2013*

#### Week #45 – Construct validity

In psychology, a construct is a phenomenon or a variable in a model that is not directly observable or measurable - intelligence is a classic example.

*October 7, 2013*

#### Week #44 – Collaborative filtering

Collaborative filtering algorithms are used to predict whether a given individual might like, or purchase, an item.

*October 7, 2013*

#### Terrorist Clusters

The "righteous vengeance gun attack" is just one of 10 types of terrorism identified by Chenoweth and Lowham via statistical clustering techniques. Another cluster is "bombings of a public population where a liberation group takes responsibility." You can read about the 10 clusters, and the…

*September 24, 2013*

#### Statistics.com Partners With CrowdANALYTIX to Offer New Online Course With Crowdsource Contest As Project

Crowdsourcing, using the power of the crowd to solve problems, has been used for many functions and tasks, including predictive modeling (like the 2009 Netflix Contest). Typically, problems are broadcast to an unknown group of statistical modelers on the Internet, and solutions are sought. Every…

*September 10, 2013*

#### Week #43 – Longitudinal data

Longitudinal data records multiple observations over time for a set of individuals or units. A typical...

*August 20, 2013*

#### Week #42 – Cross-sectional data

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual. A simple...

*August 20, 2013*

#### Week #41 – Tokenization

Tokenization is an initial step in natural language processing. It involves breaking down a text into a series of basic units, typically words. For example...

*August 20, 2013*

#### Week #40 – Natural Language

A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term...

*August 20, 2013*

#### Week #39 – White Hat Bias

White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals.

*August 20, 2013*

#### Week #38 – Edge

An edge is a link between two people or entities in a network that can be

*July 24, 2013*

#### Week #37 – Stratified Sampling

Stratified sampling is a method of random sampling.

*July 24, 2013*

#### Week #36 – Conditional Probability

When probabilities are quoted without specification of the sample space, ambiguity can result if the sample space is not self-evident.

*July 24, 2013*

#### Week #35 – Continuous vs. Discrete Distributions

A discrete distribution is one in which the data can only take on certain values, for example integers. A continuous distribution is one in which data can take on any value within a specified range (which may be infinite).

*July 24, 2013*

#### Week #34 – Central Limit Theorem

The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the probability distribution of the population from which the sample is drawn.

*July 24, 2013*
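
The central limit theorem is easy to see by simulation. A hypothetical sketch (not from the original post): draw many sample means from a decidedly non-Normal population and watch their spread shrink as n grows:

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    # one sample mean from a Uniform(0, 1) population (flat, not bell-shaped)
    return statistics.mean(random.random() for _ in range(n))

# The spread of the sample means shrinks roughly like 1/sqrt(n), and their
# distribution looks increasingly bell-shaped as n grows
for n in (1, 10, 100):
    means = [sample_mean(n) for _ in range(2000)]
    print(n, round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` for each n makes the approach to Normality visible directly.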

#### Week #33 – Classification and Regression Trees (CART)

Classification and regression trees (CART) are a set of techniques for classification and prediction.

*July 24, 2013*

#### Week #32 – CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.

*July 23, 2013*

#### Week #31 – Census

In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is taken.

*July 23, 2013*

#### Illuminate, Iterate, Involve, Involvement, Iteration, Insight

I did not start off in the field of statistics; my first real job was as a diplomat. And from my undergraduate days I recall a professor who taught a cultural history of Russia. He was one of the world's top experts. Possessed of a…

*July 12, 2013*

#### Week #30 – Discriminant analysis

Discriminant analysis is a method of distinguishing between classes of objects. The objects are typically represented as rows in a matrix.

*July 10, 2013*

#### Week #29 – Training data

Also called the training sample, training set, or calibration sample. The context is predictive modeling (also called supervised data mining) - where you have data with multiple predictor variables and a single known outcome or target variable.

*July 10, 2013*

#### Mutual Attraction

Mutual attraction is a dominant force in the universe. Gravity binds the moon to the earth, the earth to the sun, the sun to the galaxy, and one galaxy to another. And yet the universe is expanding; the result is a larger universe comprised of…

*June 19, 2013*

#### Week #28 – Bias

A general statistical term meaning a systematic (not random) deviation of an estimate from the true value.

*June 14, 2013*

#### Week #27 – Backward Elimination

One of several computer-based iterative procedures for selecting variables to use in a model. The process begins...

*June 14, 2013*

#### Week #26 – Statistical Significance

Outcomes to an experiment or repeated events are statistically significant if they differ from what chance variation might produce.

*June 14, 2013*

#### Week #25 – Family-wise Type I Error

In multiple comparison procedures, family-wise type I error is the probability that, even if all samples come from the same population, you will wrongly conclude

*May 6, 2013*

#### Week #24 – Cohort study

A cohort study is a longitudinal study that identifies a group of subjects sharing some attributes (a "cohort") then

*April 29, 2013*

#### Week #23 – Coefficient of variation

The coefficient of variation is the standard deviation of a data set, divided by the mean of the same data set.

*April 29, 2013*
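
Because the coefficient of variation divides out the units, it lets you compare spread across variables on different scales. A small illustrative sketch (the data are made up, not from the original post):

```python
import statistics

def coefficient_of_variation(data):
    """Standard deviation divided by the mean (unitless, so comparable across scales)."""
    return statistics.stdev(data) / statistics.mean(data)

heights_cm = [160, 170, 180]
weights_kg = [55, 70, 85]
print(round(coefficient_of_variation(heights_cm), 3))  # → 0.059
print(round(coefficient_of_variation(weights_kg), 3))  # → 0.214
```

Here the weights vary far more, relative to their mean, than the heights do, even though both standard deviations are two-digit numbers.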

#### Week #22 – Coefficient of Determination

In regression analysis, the coefficient of determination is a measure of goodness-of-fit (i.e. how well or tightly the data fit the estimated model). The coefficient is

*April 29, 2013*

#### Week #21 – Consistent Estimator

An estimator is a measure or metric intended to be calculated from a sample drawn from a larger population...

*April 5, 2013*

#### Week #20 – Collinearity

In regression analysis, collinearity of two variables means that strong correlation exists between them, making it difficult or impossible to estimate their individual regression coefficients reliably.

*April 5, 2013*

#### Week #19 – Cohort study

A cohort study is a longitudinal study that identifies a population or large group (a "cohort") then draws a sample from the population at various points in time and records data for the sample.

*April 5, 2013*

#### Week #18 – Centroid

The centroid is a measure of center in multi-dimensional space.

*April 5, 2013*

#### Week #17 – Bootstrapping

Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application

*April 5, 2013*
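
The resample-with-replacement idea in the bootstrapping entry fits in a few lines. A hedged sketch, not the original post's code; `bootstrap_se`, the seed, and the toy data are all assumptions:

```python
import random
import statistics

random.seed(7)

def bootstrap_se(data, statistic, n_boot=1000):
    """Standard error of `statistic`, estimated by resampling with replacement."""
    replicates = [statistic(random.choices(data, k=len(data))) for _ in range(n_boot)]
    return statistics.stdev(replicates)

data = [3.1, 4.2, 5.0, 5.8, 6.4, 7.3, 8.1]
print(round(bootstrap_se(data, statistics.mean), 3))  # roughly s/sqrt(n) for the mean
```

For the sample mean this should come out close to the textbook formula s/√n; the payoff is that the same recipe works for statistics with no closed-form standard error, such as the median.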

#### Week #16 – Binomial Distribution

A Binomial distribution is used to describe an experiment, event, or process for which the probability of success is the same for each trial and each trial has only two possible outcomes.

*April 5, 2013*
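
Under those two conditions (constant success probability, two outcomes per trial), the probability of exactly k successes in n trials is C(n, k)·p^k·(1−p)^(n−k). A minimal illustrative sketch (not from the original post):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials, each succeeding with probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Chance of exactly 2 heads in 4 fair coin flips
print(binomial_pmf(2, 4, 0.5))  # → 0.375
```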

#### Week #15 – Uplift or Persuasion Modeling

A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments.

*April 5, 2013*