Distance

Distance: Statistical distance is a measure calculated between two records that are typically part of a larger dataset, where rows are records and columns are variables. To calculate Euclidean distance, one possible distance metric, the steps are: 1. [Typically done, but not always] Convert all the values in each...
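As a minimal sketch of the Euclidean distance named in this entry, the Python function below treats two records as equal-length lists of numeric values and takes the square root of the summed squared differences. The function name `euclidean_distance` and the sample records are illustrative assumptions, and the optional preprocessing step that the entry mentions is not shown.

```python
import math

def euclidean_distance(record_a, record_b):
    """Euclidean distance between two records with the same numeric variables."""
    if len(record_a) != len(record_b):
        raise ValueError("records must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(record_a, record_b)))

# Hypothetical records: two rows with three variables each.
row_1 = [1.0, 2.0, 3.0]
row_2 = [4.0, 6.0, 3.0]
distance = euclidean_distance(row_1, row_2)  # sqrt(3^2 + 4^2 + 0^2) = 5.0
```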

Decision Trees

Decision Trees: In the machine learning community, a decision tree is a branching set of rules used to classify a record, or predict a continuous value for a record. For example, one path in a tree modeling customer churn (abandonment of subscription) might look like this: IF payment is month-to-month,...

Feature Selection

Feature Selection: In predictive modeling, feature selection, also called variable selection, is the process (usually automated) of sorting through variables to retain variables that are likely to be informative in prediction, and discard or combine those that are redundant. “Features” is a term used by the machine learning community, sometimes...

Bagging

Bagging: In predictive modeling, bagging is an ensemble method that uses bootstrap replicates of the original training data to fit predictive models. For each record, the predictions from all available models are then averaged for the final prediction. For a classification problem, a majority vote of the models is used....
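The bagging procedure described in this entry might be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the base model is a one-predictor least-squares line chosen only for brevity, and the names `fit_line`, `bagged_models`, `bagged_prediction`, and `majority_vote`, as well as the data, are hypothetical.

```python
import random
from collections import Counter

def fit_line(xs, ys):
    """Least-squares line for one predictor; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample (all x equal): fall back to predicting the mean.
        return lambda x: mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def bagged_models(xs, ys, n_models, rng):
    """Fit one model per bootstrap replicate (rows sampled with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_prediction(models, x):
    """Average the predictions of all models (regression case)."""
    return sum(model(x) for model in models) / len(models)

def majority_vote(labels):
    """Most common predicted class (classification case)."""
    return Counter(labels).most_common(1)[0][0]

rng = random.Random(0)                # fixed seed so the sketch is repeatable
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]      # hypothetical noise-free training data
models = bagged_models(xs, ys, n_models=25, rng=rng)
prediction = bagged_prediction(models, 5.0)
vote = majority_vote(["churn", "stay", "churn"])
```

Because the hypothetical data lie exactly on a line, every replicate recovers nearly the same fit; with noisy data the individual models would differ, and the averaging is what reduces their variance.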

Decile Lift

Decile Lift: In predictive modeling, the goal is to make predictions about outcomes on a case-by-case basis: an insurance claim will be fraudulent or not, a tax return will be correct or in error, a subscriber will terminate a subscription or not, a customer will purchase $X, etc. Lift is...

Boosting

Boosting: In predictive modeling, boosting is an iterative ensemble method that starts out by applying a classification algorithm and generating classifications. The classifications are then assessed, and a second round of model-fitting occurs in which the records classified incorrectly in the first round are given a higher weight in the...

Ensemble Methods

Ensemble Methods: In predictive modeling, ensemble methods refer to the practice of taking multiple models and averaging their predictions. In the case of classification models, the average can be that of a probability score attached to the classification. Models can differ with respect to algorithms used (e.g., neural net, logistic regression), settings...
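One way to picture the averaging of probability scores is the short Python sketch below. The scores are made-up values standing in for the output of two already-fitted models, and the 0.5 cutoff used to turn an averaged score into a class is an assumption of this example rather than something the entry specifies.

```python
def average_scores(scores_per_model):
    """Average, record by record, the probability scores from several models."""
    n_models = len(scores_per_model)
    return [sum(scores) / n_models for scores in zip(*scores_per_model)]

# Hypothetical probability scores for three records from two models.
neural_net_scores = [0.9, 0.2, 0.6]
logistic_scores = [0.7, 0.4, 0.2]

ensemble_scores = average_scores([neural_net_scores, logistic_scores])
# Assumed cutoff of 0.5 to convert each averaged score into a class label.
ensemble_classes = [1 if score >= 0.5 else 0 for score in ensemble_scores]
```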

A Priori Probability

A Priori Probability: A priori probability is the probability estimate prior to receiving new information. See also Bayes' Theorem and posterior probability.

Bayes' Theorem

Bayes' Theorem: Bayes' theorem is a formula for revising a priori probabilities after receiving new information. The revised probabilities are called posterior probabilities. For example, consider the probability that you will develop a specific cancer in the next year. An estimate of this probability based on general population data would...
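The revision from a priori to posterior probability can be sketched in Python for the simple case of a yes/no hypothesis H and one piece of new information E. The function name `posterior_probability` and all numbers below are hypothetical; they are not taken from the example in the entry.

```python
def posterior_probability(prior, likelihood, likelihood_if_not):
    """Bayes' theorem for a yes/no hypothesis H and new information E.

    prior             : P(H), the a priori probability
    likelihood        : P(E | H)
    likelihood_if_not : P(E | not H)
    """
    # Total probability of observing E under both possibilities.
    evidence = prior * likelihood + (1.0 - prior) * likelihood_if_not
    return prior * likelihood / evidence

# Hypothetical numbers: a 1% prior, revised after information that is
# much more likely when H is true (0.9) than when it is not (0.05).
posterior = posterior_probability(prior=0.01, likelihood=0.9, likelihood_if_not=0.05)
```

With these assumed inputs the posterior is larger than the prior but still well below one half, because the hypothesis was rare to begin with.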

Bootstrapping

Bootstrapping: Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application of the bootstrap is to assess the accuracy of an estimate based on a sample of data from a larger...
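A minimal sketch of the idea, assuming the statistic of interest is the sample mean: draw many resamples with replacement, recompute the statistic on each, and use the spread of those values as an estimate of its variability. The function name `bootstrap_standard_error`, the data, and the number of resamples are illustrative choices.

```python
import random
import statistics

def bootstrap_standard_error(data, statistic, n_resamples, rng):
    """Standard deviation of a statistic across bootstrap resamples."""
    estimates = []
    for _ in range(n_resamples):
        # Sample with replacement, same size as the observed data.
        resample = rng.choices(data, k=len(data))
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

rng = random.Random(42)                     # fixed seed for repeatability
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # hypothetical observed data
se_of_mean = bootstrap_standard_error(sample, statistics.mean, 2000, rng)
```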

Categorical Data Analysis

Categorical Data Analysis: Categorical data analysis is a branch of statistics dealing with categorical data. This sort of analysis is of great practical importance because a wide variety of data are of a categorical nature. The most common type of data analyzed in categorical data analysis is contingency table...

Collinearity

Collinearity: In regression analysis, collinearity of two variables means that a strong correlation exists between them, making it difficult or impossible to estimate their individual regression coefficients reliably. The extreme case of collinearity, where the variables are perfectly correlated, is called singularity. See also: Multicollinearity
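A simple way to check for collinearity between two predictors is to compute their Pearson correlation, as in the sketch below. The helper `pearson_correlation` and the three variables are hypothetical; `x3` is constructed as an exact multiple of `x1` to illustrate the perfectly correlated case that the entry calls singularity.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly 2 * x1: strongly correlated
x3 = [3.0 * v for v in x1]        # exactly 3 * x1: perfectly correlated

r_near = pearson_correlation(x1, x2)
r_perfect = pearson_correlation(x1, x3)
```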

Complete Statistic

Complete Statistic: A sufficient statistic T is called a complete statistic if no function of it has zero expected value for all distributions concerned unless this function itself is zero for all possible distributions concerned (except possibly a set of measure zero). The property of completeness of a statistic guarantees...

Contingency Table

Contingency Table: A contingency table is a tabular representation of categorical data. A contingency table usually shows frequencies for particular combinations of values of two discrete random variables X and Y. Each cell in the table represents a mutually exclusive combination of X-Y values. For example, consider a...
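To make the cell-by-cell counting concrete, here is a small Python sketch that builds a contingency table from a list of (X, Y) pairs. The function name `contingency_table` and the observations are invented for illustration and are unrelated to the example in the entry.

```python
from collections import Counter

def contingency_table(pairs):
    """Nested dict of frequencies: table[x][y] = count of the (x, y) pair."""
    counts = Counter(pairs)
    x_levels = sorted({x for x, _ in pairs})
    y_levels = sorted({y for _, y in pairs})
    # Counter returns 0 for combinations that never occur.
    return {x: {y: counts[(x, y)] for y in y_levels} for x in x_levels}

# Hypothetical observations: (plan type, churned?) for six customers.
observations = [
    ("monthly", "yes"), ("monthly", "yes"), ("monthly", "no"),
    ("annual", "no"), ("annual", "no"), ("annual", "yes"),
]
table = contingency_table(observations)
total = sum(count for row in table.values() for count in row.values())
```

Since each observation falls in exactly one cell, the cell counts sum to the number of observations.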
