The classic illustration of the power of brand is perfume - expensive perfumes may cost just a few dollars to produce but can be sold for more than $500 due to the cachet afforded by the brand. David Malan's Computer Science course at Harvard, CSCI…
The big news from the SAS world this summer was the release, on May 28, of the SAS University Edition, which brings the effective price for a single-user edition of SAS down from around $10,000 to $0. It does most of the things that…
Nobody expects Twitter feed sentiment analysis to give you unbiased results the way a well-designed survey will. A Pew Research study found that Twitter political opinion was, at times, much more liberal than that revealed by public opinion polls, while it was more conservative at…
Boston, August 3, 2014: Bill Ruh of the GE Software Center says that the Internet of Things, 30 billion machines talking to one another, will dwarf the impact of the consumer internet. Speaking at the Joint Statistical Meetings today, Ruh predicted that the marriage of the IoT…
A NoSQL database is distinguished mainly by what it is not -
A similarity matrix shows how similar records are to each other.
Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).
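As a minimal sketch of the idea (the data values and the simple nearest-neighbor rule here are hypothetical stand-ins for a real model), a toy predictor that classifies default/no-default from two predictor variables:

```python
import math

# Toy training data (hypothetical values): each record is
# (income, outstanding_debt) with a known target: 1 = default, 0 = no-default.
training = [
    ((20.0, 90.0), 1),
    ((25.0, 80.0), 1),
    ((80.0, 10.0), 0),
    ((95.0, 5.0),  0),
]

def predict(record):
    """Predict the target for a new record with a 1-nearest-neighbor rule:
    return the known outcome of the closest training record."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    nearest = min(training, key=lambda pair: dist(pair[0], record))
    return nearest[1]

print(predict((22.0, 85.0)))  # near the default cluster -> 1
```

In practice the model would be fit to many records and far more predictors, but the workflow is the same: learn from records with known outcomes, then predict the target for new records.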
A hold-out sample is a random sample from a data set that is withheld and not used in the model fitting process. After the model...
Heteroscedasticity generally means unequal variation in the data, i.e. unequal variance. More specifically,
Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which
The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful in taking averages of ratios.
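The definition above translates directly into a few lines of code (computed on the log scale for numerical stability, an equivalent form of the nth root of the product):

```python
import math

def geometric_mean(values):
    # nth root of the product of n values, via the log scale:
    # exp(mean of logs) == (product of values) ** (1/n)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Averaging two growth ratios, 2.0 and 8.0:
print(geometric_mean([2.0, 8.0]))  # ~4.0 (vs. an arithmetic mean of 5.0)
```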
Hierarchical linear modeling is an approach to analysis of hierarchical (nested) data - i.e. data represented by categories, sub-categories, ..., individual units (e.g. school -> classroom -> student).
In medical statistics, the hazard function relates a proportion to time: it gives the instantaneous rate of an event (e.g. death) at time t, conditional on survival up to time t.
In a directed network, connections between nodes are directional. For example...
An adjacency matrix describes the relationships in a network. Nodes are listed in the top...
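A minimal sketch of an adjacency matrix for a small directed network (the node names and edges are illustrative): rows index the source node, columns the destination, and a 1 marks a connection.

```python
# Hypothetical three-node directed network: an edge (i, j) means "i links to j".
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]

index = {name: k for k, name in enumerate(nodes)}
n = len(nodes)
adjacency = [[0] * n for _ in range(n)]
for src, dst in edges:
    adjacency[index[src]][index[dst]] = 1  # row = source, column = destination

for name, row in zip(nodes, adjacency):
    print(name, row)
# A [0, 1, 1]
# B [0, 0, 1]
# C [1, 0, 0]
```

Note the matrix is not symmetric: A links to B, but B does not link back to A, reflecting the directedness of the network.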
The exponential distribution is a model for the length of intervals between two consecutive random events in time or
Error is the deviation of an estimated quantity from its true value, or, more precisely,
Stepwise regression is one of several computer-based iterative variable-selection procedures.
Regularization refers to a wide variety of techniques used to bring structure to statistical models in the face of data size, complexity and sparseness.
SQL stands for Structured Query Language, a high-level language for querying relational databases and extracting information.
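A small self-contained illustration, using Python's built-in sqlite3 module (the table and values are made up): a query extracts just the rows and columns that satisfy its conditions.

```python
import sqlite3

# An in-memory toy database (table name and contents are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, state TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", "MA", 120.0), ("Bob", "NY", 75.0), ("Cid", "MA", 40.0)],
)

# Extract the names of Massachusetts customers with a balance over 50
rows = conn.execute(
    "SELECT name FROM customers WHERE state = 'MA' AND balance > 50"
).fetchall()
print(rows)  # [('Ann',)]
```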
A Markov chain is a probability system that governs transition among states or through successive events.
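The defining feature is that the probability of the next state depends only on the current state, via a table of transition probabilities. A minimal sketch with a made-up two-state weather chain:

```python
# Transition probabilities for a two-state weather chain (numbers illustrative):
# each row gives the probabilities of moving from that state to each state.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def step(dist):
    """Advance a probability distribution over states by one transition."""
    return {s: sum(dist[r] * P[r][s] for r in P) for s in P}

# Starting from a sunny day, the chance of sun two days later:
dist = {"sunny": 1.0, "rainy": 0.0}
dist = step(step(dist))
print(round(dist["sunny"], 4))  # 0.74  (= 0.8*0.8 + 0.2*0.5)
```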
MapReduce is a programming framework to distribute the computing load of very large data and problems to multiple computers.
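The pattern can be sketched on a single machine with the classic word-count example (in a real framework such as Hadoop, the map and reduce tasks run in parallel on many machines, with the shuffle handled by the framework):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in a document
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big problems", "big computers"]
pairs = [pair for d in docs for pair in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```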
As data processing requirements grew beyond the capacities of even large computers, distributed computing systems were developed to spread the load to multiple computers.
The curse of dimensionality is the affliction caused by adding variables to multivariate data models: as dimensions are added, the data become increasingly sparse in the expanded space, and distances between records become less informative.
A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products.
Ever wonder why, in World War II, ships in convoys were safer than ships traveling on their own? Most people assume it was due to the protection afforded by military escort vessels, of which there was a limited supply (insufficient to protect ships traveling on…
Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called independent variables.
Statistical distance is a measure calculated between two records that are typically part of a larger dataset, where rows are records and columns are variables. To calculate...
In predictive modeling, the goal is to make predictions about outcomes on a case-by-case basis: an insurance claim will be fraudulent or not, a tax return will be correct or in error, a subscriber...
In the machine learning community, a decision tree is a branching set of rules used to classify a record, or predict a continuous value for a record. For example
In predictive modeling, feature selection, also called variable selection, is the process (usually automated) of sorting through variables to retain variables that are likely...
In predictive modeling, bagging is an ensemble method that uses bootstrap replicates of the original training data to fit predictive models.
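A minimal sketch of the mechanics, with made-up data and a deliberately simple base learner (a 1-nearest-neighbor classifier standing in for the decision trees usually used): each model is fit to a bootstrap replicate, and predictions are combined by majority vote.

```python
import random

random.seed(1)

# Hypothetical training data: ((x, y) coordinates, class label)
training = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0),
            ((9.0, 9.0), 1), ((10.0, 9.0), 1), ((9.0, 10.0), 1)]

def fit_bootstrap_model(data):
    # A bootstrap replicate: sample n records with replacement
    replicate = [random.choice(data) for _ in range(len(data))]
    def model(x):
        # 1-NN prediction within this replicate
        nearest = min(replicate,
                      key=lambda p: (p[0][0] - x[0])**2 + (p[0][1] - x[1])**2)
        return nearest[1]
    return model

models = [fit_bootstrap_model(training) for _ in range(25)]

def bagged_predict(x):
    # Ensemble prediction: majority vote over the bootstrap models
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

print(bagged_predict((9.5, 9.5)))
```

With unstable base learners such as deep decision trees, averaging over bootstrap replicates reduces the variance of the predictions.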
In predictive modeling, boosting is an iterative ensemble method that starts out by applying a classification algorithm and generating classifications.
In predictive modeling, ensemble methods refer to the practice of taking multiple models and averaging their predictions.
What's the probability that the NSA examined the metadata for your phone number in 2013? According to John Inglis, Deputy Director at the NSA, it's about 0.00001, or 1 in 100,000. A surprisingly small number, given what we've all been reading in the media about…
The expected value of a random variable is, in a simple sense, its long-run average: the mean of its possible values, weighted by their probabilities.
Exact tests are hypothesis tests that are guaranteed to produce Type-I error at or below the nominal alpha level of the test when conducted on samples drawn from a null model.
In statistical models, error or residual is the deviation of the estimated quantity from its true value: the greater the deviation, the greater the error.
Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model.
In a study or experiment with two groups (usually control and treatment), the investigator typically has in mind the magnitude of the difference between the two groups that he or she wants to be able to detect in a hypothesis test.
In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true -- of saying an effect or event is statistically significant when it is not.
The devastation wrought by Super-Typhoon Haiyan in the Philippines is the biggest test yet for the nascent technology of "artificial intelligence disaster response," a phrase used by Patrick Meier, a pioneer in the field. When disaster strikes, a flood of social media posts and tweets…
There are Red States and Blue States. The three blue states of the Pacific coast constitute the Left Coast. For Colin Woodard, Yankeedom comprises both New England and the Great Lakes. If you're into accessories, there's the Bible Belt, the Rust Belt, and the Stroke…
A time series x(t), t = 1, 2, ..., is considered stationary if its statistical properties do not depend on time t.
Data partitioning in data mining is the division of the available data into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to estimate performance on new data).
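The mechanics are simply a shuffle followed by cuts; a minimal sketch with a common (but by no means mandatory) 60/20/20 split:

```python
import random

random.seed(42)
records = list(range(100))  # stand-ins for 100 data records

# Shuffle, then cut into 60% training / 20% validation / 20% test
shuffled = records[:]
random.shuffle(shuffled)
train = shuffled[:60]
validation = shuffled[60:80]
test = shuffled[80:]

print(len(train), len(validation), len(test))  # 60 20 20
```

The shuffle ensures each partition is a random sample, and the slices guarantee the three sets do not overlap.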
Data mining is concerned with finding latent patterns in large databases.
In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups, or, more precisely, groups whose members are all close to one another on various dimensions being measured.
In psychology, a construct is a phenomenon or a variable in a model that is not directly observable or measurable - intelligence is a classic example.
Collaborative filtering algorithms are used to predict whether a given individual might like, or purchase, an item.
The "righteous vengeance gun attack" is just one of 10 types of terrorism identified by Chenoweth and Lowham via statistical clustering techniques. Another cluster is "bombings of a public population where a liberation group takes responsibility." You can read about the 10 clusters, and the…
Crowdsourcing, using the power of the crowd to solve problems, has been used for many functions and tasks, including predictive modeling (like the 2009 Netflix Contest). Typically, problems are broadcast to an unknown group of statistical modelers on the Internet, and solutions are sought. Every…
Longitudinal data records multiple observations over time for a set of individuals or units. A typical...
Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual. A simple...
Tokenization is an initial step in natural language processing. It involves breaking down a text into a series of basic units, typically words. For example...
A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term...
White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals.
An edge is a link between two people or entities in a network that can be
Stratified sampling is a method of random sampling in which the population is first divided into non-overlapping subgroups (strata), and a random sample is then drawn from each stratum.
A discrete distribution is one in which the data can only take on certain values, for example integers. A continuous distribution is one in which data can take on any value within a specified range (which may be infinite).
The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the probability distribution of the population from which the sample is drawn.
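The theorem is easy to see by simulation; here a decidedly non-normal population (uniform on [0, 1]) still yields sample means that cluster around the population mean with standard deviation close to the theoretical sigma/sqrt(n):

```python
import random
import statistics

random.seed(0)

n = 30          # size of each sample
reps = 2000     # number of samples drawn

# The distribution of means of many uniform samples
means = [statistics.mean(random.random() for _ in range(n)) for _ in range(reps)]

# Population mean is 0.5; population sd is 1/sqrt(12), so the standard
# error of the mean should be close to (1/sqrt(12))/sqrt(30) ~ 0.053
print(round(statistics.mean(means), 2))   # 0.5
print(round(statistics.stdev(means), 2))  # 0.05
```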
Classification and regression trees (CART) are a set of techniques for classification and prediction.
CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.
In a census survey, all units in the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is examined.
Discriminant analysis is a method of distinguishing between classes of objects. The objects are typically represented as rows in a matrix.
Training data: also called the training sample, training set, or calibration sample. The context is predictive modeling (also called supervised data mining), where you have data with multiple predictor variables and a single known outcome or target variable.
Bias is a general statistical term meaning a systematic (not random) deviation of an estimate from the true value.
One of several computer-based iterative procedures for selecting variables to use in a model. The process begins...
Outcomes of an experiment or repeated events are statistically significant if they differ from what chance variation alone could plausibly produce.
An estimator is a measure or metric intended to be calculated from a sample drawn from a larger population...
The centroid is a measure of center in multi-dimensional space.
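Concretely, the centroid of a set of points is their component-wise mean; a short sketch:

```python
def centroid(points):
    """Component-wise mean of a set of points in d-dimensional space."""
    d = len(points[0])
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(d))

# The centroid of a triangle's vertices is its center of mass:
print(centroid([(0, 0), (2, 0), (1, 3)]))  # (1.0, 1.0)
```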
A Binomial distribution is used to describe an experiment, event, or process for which the probability of success is the same for each trial and each trial has only two possible outcomes.
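The probability of exactly k successes in n such trials has a closed form, C(n, k) * p^k * (1-p)^(n-k), which is a one-liner in code:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n trials, success probability p per trial)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 5 heads in 10 fair coin flips:
print(binom_pmf(5, 10, 0.5))  # 0.24609375
```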
A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments.
Network analytics is the science of describing and, especially, visualizing the connections among objects.
Multiplicity issues arise in a number of contexts, but they generally boil down to the same thing: repeated looks at a data set in different ways, until something "statistically significant" emerges.
Support vector machines are used in data mining (predictive modeling, to be specific) for classification of records, by learning from training data.
In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another. It might
The negative binomial distribution is the probability distribution of the number of Bernoulli (yes/no) trials required to obtain r successes.
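Under this definition (counting total trials, not failures), the probability that the r-th success arrives on exactly trial n follows from the constraint that the last trial must be a success, with the other r-1 successes falling anywhere among the first n-1 trials:

```python
from math import comb, isclose

def neg_binom_pmf(n, r, p):
    """P(the r-th success occurs on trial n), success probability p per trial:
    C(n-1, r-1) * p**r * (1-p)**(n-r), for n = r, r+1, ..."""
    return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

# With r = 1 this reduces to the geometric distribution:
print(neg_binom_pmf(3, 1, 0.5))  # 0.125

# The probabilities over n = r, r+1, ... sum to 1:
total = sum(neg_binom_pmf(n, 3, 0.4) for n in range(3, 200))
print(isclose(total, 1.0))  # True
```

Note that some software (e.g. SciPy's nbinom) instead parameterizes by the number of failures before the r-th success, so the two conventions differ by a shift of r.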
A random walk is a process of random steps, motions, or transitions. It might be in one dimension (movement along a line), in two dimensions (movements in a plane), or in three dimensions or more.
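The one-dimensional case can be simulated in a few lines; each step moves the walker +1 or -1 with equal probability:

```python
import random

random.seed(7)

def random_walk(steps):
    """Simulate a one-dimensional random walk: each step is +1 or -1
    with equal probability. Returns the full sequence of positions."""
    position = 0
    path = [position]
    for _ in range(steps):
        position += random.choice((-1, 1))
        path.append(position)
    return path

path = random_walk(1000)
print(len(path))  # 1001 positions: the origin plus one per step
```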
Cover time is the expected number of steps in a random walk required
is a general computer-intensive approach used in estimating the accuracy of statistical models.
A distance matrix (also called a dissimilarity matrix) describes pairwise distinction between M objects.
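A pairwise distance matrix can be sketched directly; here Euclidean distance between three illustrative points (any dissimilarity measure could take its place):

```python
import math

# M = 3 illustrative points in the plane
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

M = len(points)
# Entry D[i][j] is the Euclidean distance between object i and object j
D = [[math.dist(points[i], points[j]) for j in range(M)] for i in range(M)]

for row in D:
    print([round(v, 1) for v in row])
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```

The diagonal is zero (each object is at distance 0 from itself) and the matrix is symmetric, two properties shared by most dissimilarity measures.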
Differencing a time series in discrete time is the transformation of the series to a new time series whose values are the differences between consecutive values of the original series.
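The transformation described above (the first difference) is a one-liner:

```python
def difference(series):
    """First difference of a time series: x[t] - x[t-1] for t = 1, 2, ..."""
    return [b - a for a, b in zip(series, series[1:])]

print(difference([3, 5, 9, 9, 6]))  # [2, 4, 0, -3]
```

Differencing is commonly used to remove trend: a series with a steady linear trend becomes roughly constant after one difference.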
Binary or dichotomous (outcome or variable) means "having only two possible values", e.g.
A probability density function is a curve used
In predictive modeling, data partitioning is the division of the data available for analysis into two or three non-overlapping