Deep Learning

Deep learning refers to neural networks with many layers. These networks are especially well suited to image and voice recognition, and to unsupervised tasks involving complex, unstructured data.

Prospective Versus Retrospective

Prospective vs. Retrospective: A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you're measuring, who you're measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition of the...

y-hat

The estimated or predicted values in a regression or other predictive model are termed the y-hat values. The "y" is used because y is the outcome or dependent variable in the model equation, and a "hat" symbol (circumflex) placed over the variable name is the statistical designation of an estimated value.
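
As a rough illustration of where y-hat values come from, the sketch below fits a straight line to a few data points by ordinary least squares and then computes the predicted value for each record. The data and the helper name `fit_line` are hypothetical, chosen only for this example; any predictive model would produce y-hat values in the same way.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

intercept, slope = fit_line(x, y)

# y_hat holds the model's estimate of y for each record.
y_hat = [intercept + slope * xi for xi in x]

# The residual for each record is the actual value minus its y-hat value.
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
print(y_hat)
print(residuals)
```

The observed `y` values stay as they are; only `y_hat` depends on the fitted model.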

Azure ML

Azure is Microsoft's cloud computing platform and set of services. ML stands for Machine Learning, which is one of those services. Like other cloud computing services, you purchase it on a metered basis: as of 2015, there was a per-prediction charge and a compute-time charge. As of October 2015,...

Ordered categorical data

Ordered categorical data: Categorical variables are non-numeric "category" variables, e.g. color. Ordered categorical variables are category variables that have a quantitative dimension that can be ordered but is not on a regular scale. Doctors rate pain on a scale of 1 to 10 - a "2" has no particular...

Bimodal

Bimodal: Bimodal literally means "two modes" and is typically used to describe distributions of values that have two centers. For example, the distribution of heights in a sample of adults might have two peaks, one for women and one for men.

HDFS

HDFS: HDFS is the Hadoop Distributed File System. It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

Netflix Prize

Netflix Prize: The Netflix Prize was a famous early application of crowdsourcing to predictive modeling. In 2006, Netflix published customer movie rating data and challenged analysts to come up with a predictive model that would improve Netflix's prediction of what your rating would be for a given movie. Various individuals...

Prediction vs. Explanation

Prediction vs. Explanation: With the advent of Big Data and data mining, statistical methods like regression and CART have been repurposed as tools in predictive modeling. When statistical models are used as a tool of research, the goal is to explain relationships in a dataset, and make inference...

A-B Test

A-B Test: An A-B test is a classic statistical design in which individuals or subjects are randomly split into two groups and some intervention or treatment is applied - one group gets treatment A, the other treatment B. Typically one of the treatments will be a control (i.e. nothing new),...

RMSE

RMSE: RMSE is root mean squared error. In predicting a numerical outcome with a statistical model, predicted values rarely match actual outcomes exactly. The difference between predicted and actual is the error (or residual). To calculate RMSE, square each error, take the average, then take the square root....
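
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `rmse` and the sample values are assumptions made for the example, not part of the glossary entry.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Error (residual) for each record: predicted minus actual.
    errors = [p - a for a, p in zip(actual, predicted)]
    # Square each error, take the average, then take the square root.
    mean_squared_error = sum(e * e for e in errors) / len(errors)
    return math.sqrt(mean_squared_error)


# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

# Errors are -1, 0, 1, 2; squared errors average to 1.5.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, one large miss raises RMSE more than several small ones, and the result is expressed in the same units as the outcome variable.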

Label

Label: A label is a category into which a record falls, usually in the context of predictive modeling. Label, class and category are different names for discrete values of a target (outcome) variable. "Label" typically has the added connotation that the label is something applied by a human to model-training...

Strip transect

Strip transect: A strip transect is a small subsection of a geographically defined study area, typically chosen randomly. For example, Manly (Introduction to Ecological Sampling, CRC) discusses using randomly selected strips 3 meters wide and 20 meters long, which are carefully examined and in which the number of deer pellets is counted. The area...

Spark

Spark: Spark is a second-generation computing environment that sits on top of a Hadoop system, supporting the workflows that leverage a distributed file system. It improves on the performance of the initial Hadoop computational paradigm, MapReduce, via fast functional programming capabilities and the use of in-memory caching.

Bandits

Bandits: Bandits refers to a class of algorithms in which users or subjects make repeated choices among, or decisions in reaction to, multiple alternatives. For example, a web retailer might have a set of N ways of presenting an offer. The task of the algorithm is to efficiently and accurately...

Multiple looks

Multiple looks: In a classic statistical experiment, treatment(s) and placebo are applied to randomly assigned subjects, and, at the end of the experiment, outcomes are compared. With multiple looks, the investigator does not wait until the end of the experiment; outcomes are compared at earlier stages. The more...

Pruning the tree

Pruning the tree: Classification and regression trees, applied to data with known values for an outcome variable, derive models with rules like "If taxable income <$80,000, if no Schedule C income, if standard deduction taken, then no-audit." Pruning is the process of truncating the rules (= pruning the branches...

Features vs. Variables

Features vs. Variables: The predictors in a predictive model are sometimes given different terms by different disciplines. Traditional statisticians think in terms of variables. The machine learning community calls them features (also attributes or inputs). There is a subtle difference in meaning. In predictive modeling, depending on the nature of...

Prior and posterior

Prior and posterior: Bayesian statistics typically incorporates new information (e.g. from a diagnostic test, or a recently drawn sample) to answer a question of the form "What is the probability that..." The answer to this question is referred to as the "posterior" probability, arrived at by modifying a "prior" probability...
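
To make the prior-to-posterior step concrete, here is a small sketch of Bayes' rule for the diagnostic-test case. All numbers (a 1% prior, 95% sensitivity, 5% false positive rate) and the function name `posterior_given_positive` are hypothetical, picked only to illustrate the arithmetic.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(condition | positive test), computed with Bayes' rule.

    prior:               P(condition) before seeing the test result
    sensitivity:         P(positive | condition)
    false_positive_rate: P(positive | no condition)
    """
    true_positive = sensitivity * prior
    false_positive = false_positive_rate * (1.0 - prior)
    # Total probability of a positive result is the denominator.
    return true_positive / (true_positive + false_positive)


# Hypothetical inputs, for illustration only.
prior = 0.01
posterior = posterior_given_positive(
    prior, sensitivity=0.95, false_positive_rate=0.05
)
print(posterior)  # roughly 0.16
```

Under these assumed numbers the positive result raises the probability from 1% to roughly 16%: the new information modifies the prior but does not replace it.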

Curb-stoning

Curb-stoning: In survey research, curb-stoning refers to the deliberate fabrication of survey interview data by the interviewer. Often this is done to avoid the work of actually conducting the surveys. Statistical methods have been developed that can help to identify data that is the product of curb-stoning.