#### Week #23 – Intraobserver Reliability

Intraobserver reliability indicates how stable the responses obtained from the same respondent are at different time points. The greater the difference between the responses, the smaller the intraobserver reliability of the survey instrument. The correlation coefficient between the responses obtained at different time points from the same respondent is often…

#### Week #22 – Independent Events

Two events A and B are said to be independent if P(A∩B) = P(A)·P(B). To put it differently, events A and B are independent if the occurrence or non-occurrence of A does not influence the occurrence or non-occurrence of B, and vice versa. For example, if…
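
As a quick made-up illustration in Python: two events on a fair die roll that satisfy the definition exactly.

```python
from fractions import Fraction

# Rolling a fair die: A = "even number", B = "number <= 4".
outcomes = range(1, 7)
A = {x for x in outcomes if x % 2 == 0}   # {2, 4, 6}
B = {x for x in outcomes if x <= 4}       # {1, 2, 3, 4}

p_A = Fraction(len(A), 6)        # 1/2
p_B = Fraction(len(B), 6)        # 2/3
p_AB = Fraction(len(A & B), 6)   # |{2, 4}| / 6 = 1/3

# A and B are independent: P(A ∩ B) equals P(A) * P(B).
print(p_AB == p_A * p_B)  # True
```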

#### Week #21 – Residuals

Residuals are differences between the observed values and the values predicted by some model. Analysis of residuals allows you to estimate the adequacy of a model for particular data; it is widely used in regression analysis.
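
A minimal sketch with numpy (the data here are invented for illustration): fit a line, then take observed minus predicted.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit a straight line y ≈ b1*x + b0 by least squares.
b1, b0 = np.polyfit(x, y, deg=1)
predicted = b1 * x + b0
residuals = y - predicted  # observed minus predicted

# With an intercept in the model, least-squares residuals sum to ~0.
print(abs(residuals.sum()) < 1e-8)  # True
```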

#### Week #20 – Concurrent Validity

The concurrent validity of survey instruments, like the tests used in psychometrics, is a measure of agreement between the results obtained by the given survey instrument and the results obtained for the same population by another instrument acknowledged as the "gold standard". The concurrent validity…

#### Week #19 – Normality

Normality is a property of a random variable that is distributed according to the normal distribution. Normality plays a central role in both theoretical and practical statistics: a great number of theoretical statistical methods rest on the assumption that the data, or test statistics derived from…

#### Week #18 – n

In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

#### Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…

#### Week #16 – Weighted Kappa

Weighted kappa is a measure of agreement for categorical data. It is a generalization of the kappa statistic to situations in which the categories are not equal in some respect - that is, disagreements are weighted by an objective or subjective function.
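
For illustration, scikit-learn's `cohen_kappa_score` takes a `weights` argument; the ratings below are invented toy data.

```python
from sklearn.metrics import cohen_kappa_score

# Two raters scoring the same 8 items on an ordered 1-3 scale;
# both disagreements are between adjacent categories.
rater1 = [1, 2, 3, 3, 2, 1, 2, 3]
rater2 = [1, 2, 3, 2, 2, 1, 3, 3]

plain = cohen_kappa_score(rater1, rater2)
# Linear weights penalize a 1-vs-3 disagreement more than a 2-vs-3 one,
# so adjacent-only disagreements yield a higher weighted kappa here.
weighted = cohen_kappa_score(rater1, rater2, weights="linear")
print(plain < weighted)  # True
```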

#### Historical Spotlight: Eugenics – journey to the dark side at the dawn of statistics

April 27 marks the 80th anniversary of the death of Karl Pearson, who contributed to statistics the correlation coefficient, principal components, the (increasingly-maligned) p-value, and much more. Pearson was one of a trio of founding fathers of modern statistics, the others being Francis Galton and…

#### Week #15 – Rank Correlation Coefficient

Rank correlation is a method of finding the degree of association between two variables. The calculation of the rank correlation coefficient is the same as that for the Pearson correlation coefficient, but it is carried out on the ranks of the observations rather than their numerical values. This…
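
The point is easy to verify with scipy on toy data: Spearman's rho equals Pearson's r computed on the ranks.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

x = [10, 20, 30, 40, 1000]   # one extreme value
y = [1, 3, 2, 4, 5]

rho, _ = spearmanr(x, y)
# Spearman's rho is just Pearson's r applied to the ranks.
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))
print(abs(rho - r_on_ranks) < 1e-9)  # True
```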

#### Week #14 – Manifest Variable

In latent variable models, a manifest variable (or indicator) is an observable variable - i.e. a variable that can be measured directly. A manifest variable can be continuous or categorical. The opposite concept is the latent variable.

#### Week #13 – Fisher's Exact Test

Fisher's exact test is, historically, the first permutation test. It is used with two samples of binary data, and tests the null hypothesis that the two samples are drawn from populations with equal but unknown proportions of "successes" (e.g. proportion of patients recovered without complications…
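
A sketch with scipy's `fisher_exact`; the 2×2 counts below are invented for illustration.

```python
from scipy.stats import fisher_exact

# 2x2 table: rows = treatment A / treatment B,
# columns = recovered / not recovered.
table = [[8, 2],
         [1, 9]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(p_value < 0.05)  # True: proportions this different are unlikely by chance
```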

#### Week #12 – Homoscedasticity

Homoscedasticity generally means equal variation of data, e.g. equal variance.

#### Week #11 – Posterior Probability

Posterior probability is a revised probability that takes into account new available information. For example, let there be two urns, urn A having 5 black balls and 10 red balls and urn B having 10 black balls and 5 red balls. Now if an urn…
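
Continuing the urn example, the posterior can be computed with Bayes' rule (a sketch; it assumes an urn is chosen at random and a black ball is drawn).

```python
from fractions import Fraction

# Urn A: 5 black, 10 red.  Urn B: 10 black, 5 red.
# An urn is chosen at random (prior 1/2 each) and a black ball is drawn.
prior_A = prior_B = Fraction(1, 2)
p_black_given_A = Fraction(5, 15)
p_black_given_B = Fraction(10, 15)

# Bayes' rule: posterior is prior x likelihood, normalized.
p_black = prior_A * p_black_given_A + prior_B * p_black_given_B
posterior_A = prior_A * p_black_given_A / p_black
print(posterior_A)  # 1/3 - the black ball shifts belief toward urn B
```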

#### Week #4 – Loss Function

A loss function specifies a penalty for an incorrect estimate from a statistical model. Typical loss functions might specify the penalty as a function of the difference between the estimate and the true value, or simply as a binary value depending on whether the estimate…
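
Three common penalties, sketched as Python functions (the values are invented):

```python
# Loss as a function of the difference between estimate and truth...
def squared_loss(estimate, truth):
    return (estimate - truth) ** 2

def absolute_loss(estimate, truth):
    return abs(estimate - truth)

# ...or simply a binary penalty: right or wrong.
def zero_one_loss(estimate, truth):
    return 0 if estimate == truth else 1

est, truth = 7.0, 10.0
print(squared_loss(est, truth),   # 9.0
      absolute_loss(est, truth),  # 3.0
      zero_one_loss(est, truth))  # 1
```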

#### Week #3 – Endogenous Variable

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model. In other words, endogenous variables have explicit causes within the model. The concept of endogenous variable is fundamental in path analysis and structural equation…

#### Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2…

#### Week #1 – Nonstationary time series

A time series x_t is said to be nonstationary if its statistical properties depend on time. The opposite concept is a stationary time series. Most real-world time series are nonstationary. An example of a nonstationary time series is a record of readings of the…

#### Week #10 – Arm

In an experiment, an arm is a treatment protocol - for example, drug A, or placebo.   In medical trials, an arm corresponds to a patient group receiving a specified therapy.  The term is also relevant for bandit algorithms for web testing, where an arm consists…

#### Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued.  An example might be a binary matrix used to power web searches - columns representing search terms and rows representing searches,…

#### Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records.  It is often, but not always, randomly drawn.  In matrix form, the rows are records…

#### Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation.  When there are…

#### Week #6 – Kolmogorov-Smirnov One-sample Test

The Kolmogorov-Smirnov one-sample test is a goodness-of-fit test that assesses whether an observed dataset is consistent with a hypothesized theoretical distribution. The test involves specifying the cumulative frequency distribution that would occur under the theoretical distribution and comparing it with the observed cumulative frequency distribution.
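
A sketch with scipy's `kstest` on simulated data: one sample that really is standard normal, and one that clearly is not.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(42)

# A sample that really is standard normal...
normal_sample = rng.normal(size=200)
stat_good, p_good = kstest(normal_sample, "norm")

# ...and one that clearly is not (uniform on [0, 1]).
uniform_sample = rng.uniform(0.0, 1.0, size=200)
stat_bad, p_bad = kstest(uniform_sample, "norm")

print(p_bad < 0.001)  # True: the uniform sample is decisively rejected
```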

#### Week #5 – Cohort Data

Cohort data records multiple observations over time for a set of individuals or units tied together by some event (say, born in the same year). See also longitudinal data and panel data.

#### Week #51 – Circular Icon Plots

#### Week #50 – Six-Sigma

Six sigma means literally six standard deviations. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any…

#### Week #47 – Psychometrics

Psychometrics or psychological testing is concerned with quantification (measurement) of human characteristics, behavior, performance, health, etc., as well as with design and analysis of studies based on such measurements. An example of the problems being solved in psychometrics is the measurement of intelligence via "IQ"…

#### Week #46 – Azure ML

Azure is Microsoft's cloud computing platform and services. ML stands for Machine Learning, which is one of those services. Like other cloud computing services, you purchase it on a metered basis - as of 2015, there was a per-prediction charge, and a compute time…

#### Week #45 – Ordered categorical data

Categorical variables are non-numeric "category" variables, e.g. color.  Ordered categorical variables are category variables that have a quantitative dimension that can be ordered but is not on a regular scale.  Doctors rate pain on a scale of 1 to 10 - a "2" has no…

#### Week #44 – Bimodal

Bimodal literally means "two modes" and is typically used to describe distributions of values that have two centers.  For example, the distribution of heights in a sample of adults might have two peaks, one for women and one for men.

#### Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

#### Week #42 – Kruskal – Wallis Test

The Kruskal-Wallis test is a nonparametric test of whether three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.
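
A sketch with scipy (the three samples are invented; the third is deliberately shifted):

```python
from scipy.stats import kruskal

# Three independent samples; are they plausibly from one distribution?
group1 = [2.1, 2.4, 2.0, 2.6]
group2 = [2.2, 2.5, 2.3, 2.1]
group3 = [9.1, 9.4, 9.0, 9.6]   # clearly shifted

statistic, p_value = kruskal(group1, group2, group3)
print(p_value < 0.05)  # True: the shifted group is detected
```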

#### Week #41 – Analysis of Variance (ANOVA)

A statistical technique for making inferences about whether three or more samples might come from populations having the same mean - specifically, whether the differences among the samples might be caused by chance variation.
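
A one-way ANOVA sketch with scipy's `f_oneway` (invented samples, one with a much higher mean):

```python
from scipy.stats import f_oneway

group1 = [5.1, 4.9, 5.3, 5.0]
group2 = [5.2, 5.0, 4.8, 5.1]
group3 = [7.9, 8.1, 8.0, 8.2]   # much higher mean

f_stat, p_value = f_oneway(group1, group2, group3)
print(p_value < 0.01)  # True: differences too large for chance variation
```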

#### Week #40 – Two-Tailed Test

A two-tailed test is a hypothesis test in which the null hypothesis is rejected if the observed sample statistic is more extreme than the critical value in either direction (higher than the positive critical value or lower than the negative critical value). A two-tailed test…

#### Week #39 – Split-Halves Method

In psychometric surveys, the split-halves method is used to measure the internal consistency reliability of survey instruments, e.g. psychological tests. The idea is to split the items (questions) related to the same construct to be measured, e.g. the anxiety level, and to compare the results…

#### Week #38 – Life Tables

In survival analysis, life tables summarize lifetime data or, generally speaking, time-to-event data. Rows in a life table usually correspond to time intervals, columns to the following categories: (i) not "failed", (ii) "failed", (iii) censored (withdrawn), and the sum of the three called "the number…

#### Week #37 – Truncation

Truncation, generally speaking, means to shorten. In statistics it can mean the process of limiting consideration or analysis to data that meet certain criteria (for example, the patients still alive at a certain point). Or it can refer to a data distribution where values above…

#### Week #36 – Tukey's HSD (Honestly Significant Differences) Test

This test is used for testing the significance of unplanned pairwise comparisons. When you do multiple significance tests, the chance of finding a "significant" difference just by chance increases. Tukey's HSD test is one of several methods of ensuring that the chance of finding a…

#### Week #35 – Robust Filter

A robust filter is a filter that is not sensitive to input noise values with extremely large magnitude (e.g. those arising due to anomalous measurement errors). The median filter is an example of a robust filter. Linear filters are not robust - their output may…

#### Week #34 – Hypothesis Testing

Hypothesis testing (also called "significance testing") is a statistical procedure for discriminating between two statistical hypotheses - the null hypothesis (H0) and the alternative hypothesis (Ha, often denoted H1). Hypothesis testing, in a formal logic sense, rests on the presumption of validity of the null hypothesis - that is, the null hypothesis is rejected only if the data at hand testify strongly enough against it.

#### Week #33 – Kurtosis

Kurtosis measures the "heaviness of the tails" of a distribution, as compared to a normal distribution. Kurtosis is positive if the tails are "heavier" than those of a normal distribution, and negative if the tails are "lighter" than those of a normal distribution. The normal distribution has kurtosis of zero.
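
A quick simulation illustrates this; scipy's `kurtosis` reports excess kurtosis, i.e. 0 for a normal distribution.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)

normal_data = rng.normal(size=100_000)
heavy_tailed = rng.standard_t(df=4, size=100_000)   # heavier tails than normal
uniform_data = rng.uniform(size=100_000)            # lighter tails than normal

print(kurtosis(heavy_tailed) > 0,   # True: positive for heavy tails
      kurtosis(uniform_data) < 0)   # True: negative for light tails
```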

#### Week #32 – False Discovery Rate

A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries for which, in reality, the null hypothesis holds (Type-I errors). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).

#### Week #31 – Skewness

#### Week #30 – Icon Plots

#### Week #29 – Signal

The signal is the component of the observed data that carries useful information.

#### Week #28 – Non-parametric Regression

Non-parametric regression methods are aimed at describing a relationship between the dependent and independent variables...

#### Week #27 – Nominal scale

A nominal scale is really a list of categories to which objects can be classified.

#### Week #26 – Noise

The noise is the component of the observed data (e.g. of a time series) that is random and carries no useful information.

#### Week #25 – Nearest Neighbor Clustering

The single linkage clustering method (or the nearest neighbor method) is a method of calculating distance between clusters in hierarchical cluster analysis.

#### Week #24 – Edge

In a network analysis context, "edge" refers to a link or connection between two entities in a network.

#### Week #23 – Netflix Contest

The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available.  Individuals and teams then compete to develop the best performing model.

#### Week #22 – Splines

The linear model is ubiquitous in classical statistics, yet real-life data rarely follow a purely linear pattern.

#### Week #21 – Association Rules

Association rules, also called "market basket analysis," are a data mining method applied to transaction data.

#### Week #20 – R

This week's word is actually a letter.  R is a statistical computing and programming language and program, a derivative of the commercial S-PLUS program, which, in turn, was an offshoot of S from Bell Labs.

#### Week #19 – Prediction vs. Explanation

With the advent of Big Data and data mining, statistical methods like regression and CART have been repurposed as tools in predictive modeling.

#### Week #18 – Netflix Prize

The Netflix prize was a famous early application of crowdsourcing to predictive modeling.

#### Week #17 – A-B Test

An A-B test is a classic statistical design in which individuals or subjects are randomly split into two groups and some intervention or treatment is applied.

#### Week #16 – Moving Average

In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
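
A minimal sketch (the demand series is invented): the forecast for the next period is the mean of the last w observed values.

```python
# Moving-average forecast: the forecast for time t is the mean
# of the w most recent observed values (periods t-w .. t-1).
def moving_average_forecast(series, w):
    return sum(series[-w:]) / w

demand = [20, 22, 19, 23, 21, 24]
forecast_next = moving_average_forecast(demand, w=3)
print(forecast_next)  # (23 + 21 + 24) / 3 = 22.666...
```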

#### Week #15 – Interaction term

In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.

#### Week #14 – Naive forecast

A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).

#### Week #13 – RMSE

RMSE is root mean squared error.  In predicting a numerical outcome with a statistical model, predicted values rarely match actual outcomes exactly.
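
The calculation takes a few lines of Python (the values here are toys):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(errors) / len(errors))

actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
print(round(rmse(actual, predicted), 4))  # sqrt(mean of [0.25, 0, 4, 1])
```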

#### Week #12 – Label

A label is a category into which a record falls, usually in the context of predictive modeling.  Label, class and category are different names for discrete values of a target (outcome) variable.

#### Week #7.5 – Strip transect

A strip transect is a small subsection of a geographically-defined study area, typically chosen randomly.

#### Week #11 – Spark

Spark is a second generation computing environment that sits on top of a Hadoop system, supporting the workflows that leverage a distributed file system.

#### Week #10 – Bandits

Bandits refers to a class of algorithms in which users or subjects make repeated choices among, or decisions in reaction to, multiple alternatives.

#### Week #9 – Overdispersion

In discrete response models, overdispersion occurs when there is more variation in the data than is allowed by the assumptions the model makes (often because the observations are correlated).

#### Week #8 – Confusion matrix

In a classification model, the confusion matrix shows the counts of correct and erroneous classifications.  In a binary classification problem, the matrix consists of 4 cells.
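
The four cells can be tallied directly; the labels below are invented.

```python
from collections import Counter

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) pair: the four cells of a binary
# confusion matrix (TP, FN, FP, TN).
counts = Counter(zip(actual, predicted))
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]

print([[tp, fn], [fp, tn]])  # [[3, 1], [1, 3]]
```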

#### Week #7 – Multiple looks

In a classic statistical experiment, treatment(s) and placebo are applied to randomly assigned subjects, and, at the end of the experiment, outcomes are compared.

#### Week #6 – Pruning the tree

Classification and regression trees, applied to data with known values for an outcome variable, derive models with rules like "If taxable income < $80,000, if no Schedule C income, if standard deduction taken, then no-audit."

#### Week #5 – Features vs. Variables

The predictors in a predictive model are sometimes given different terms by different disciplines.  Traditional statisticians think in terms of variables.

#### Week #4 – Logistic Regression

In logistic regression, we seek to estimate the relationship between predictor variables Xi and a binary response variable.  Specifically, we want to estimate the probability p that the response variable will be a 0 or a 1.
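
A sketch with scikit-learn; the study-hours data below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied (predictor X) vs. pass/fail (binary response y).
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# Estimated probability p that the response is 1 for a new observation.
p_pass = model.predict_proba([[3.2]])[0, 1]
print(p_pass > 0.5)  # True: 3.2 hours is well above the fitted boundary
```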

#### Week #3 – Prior and posterior

Bayesian statistics typically incorporates new information (e.g. from a diagnostic test, or a recently drawn sample) to answer a question of the form "What is the probability that..."

#### Week #2 – Permutation test

Consider two (or more) samples subjected to different treatments.  A permutation test assesses whether,

#### Week #1 – Quasi-experiment (revisited)

One avid reader took issue with a recent definition of "quasi experiment."  I had defined it

#### Course Spotlight: The Text Analytics Sequence

Text analytics or text mining is the natural extension of predictive analytics, and Statistics.com's text analytics program starts Feb. 6. Text analytics is now ubiquitous and yields insight in: Marketing: Voice of the customer, social media analysis, churn analysis, market research, survey analysis Business: Competitive…

#### Course Spotlight: Constrained Optimization

Say you operate a tank farm (to store and sell fuel). How much of each fuel grade should you buy? You have specified flow and storage capacities, constraints on what types of fuels can be stored in which tanks, prior contractual obligations about minimum monthly…

#### Week #52 – Quasi-experiment

In social science research, particularly in the qualitative literature on program evaluation, the term "quasi-experiment" refers to studies that do not involve the application of treatments via random assignment of subjects.

#### Week #51 – Curb-stoning

In survey research, curb-stoning refers to the deliberate fabrication of survey interview data by the interviewer.

#### College Credit Recommendation

Statistics.com receives college credit recommendation from the American Council on Education (ACE): College Credit Recommendation for Online Data Science Courses from The Institute for Statistics Education at Statistics.com LLC. The American Council on Education's College Credit Recommendation Service (ACE CREDIT) has evaluated and recommended college credit…

#### Week #50 – Bag-of-words

Bag-of-words is a simplified natural language processing concept.

#### Week #49 – Stemming

In language processing, stemming is the process of taking multiple forms of the same word and reducing them to the same basic core form.
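
A deliberately crude sketch of the idea - real stemmers, such as the Porter stemmer, use far more careful rules:

```python
# Toy stemmer: strip a few common English suffixes, keeping a minimum
# core length so short words are left alone.
def crude_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["jumping", "jumped", "jumps", "jump"]
print([crude_stem(w) for w in words])  # ['jump', 'jump', 'jump', 'jump']
```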

#### Week #48 – Structured vs. unstructured data

Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).

#### Week #47 – Feature engineering

In predictive modeling, a key step is to turn available data (which may come from varied sources and be messy) into an orderly matrix of rows (records to be predicted) and columns (predictor variables or features).

#### Week #46 – Naive Bayes Classifier

A full Bayesian classifier is a supervised learning technique that assigns a class to a record by finding other records with attributes just like its own, and taking the most prevalent class among them.

#### Week #45 – MapReduce

In computer science, MapReduce is a procedure that prepares data for parallel processing on multiple computers.
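
A minimal single-machine sketch of the map / shuffle / reduce pattern, using the classic word-count example (the documents are invented):

```python
from collections import defaultdict
from itertools import chain

documents = ["big data big models", "big data"]

# Map: each document emits (word, 1) pairs - these could run in parallel.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word - also parallelizable per key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'models': 1}
```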

#### Big Data and Clinical Trials in Medicine

There was an interesting article a couple of weeks ago in the New York Times magazine section on the role that Big Data can play in treating patients -- discovering things that clinical trials are too slow, too expensive, and too blunt to find. The…

#### Week #44 – Likert scales

Likert scales are categorical ordinal scales used in social sciences to measure attitude.  A typical example is a set of response options ranging from "strongly agree" to "strongly disagree."

#### Week #43 – Node

A node is an entity in a network.  In a social network, it would be a person.  In a digital network, it would be a computer or device.

#### Week #42 – Latent Variable Models

Latent variable models postulate some relationship between the statistical properties of observable variables.

#### Week #41 – K-nearest neighbor

K-nearest-neighbor (K-NN) is a machine learning predictive algorithm that relies on calculation of distances between pairs of records.

#### Word #40 – Kappa Statistic

The kappa statistic measures the extent to which different raters or examiners agree, beyond what would be expected by chance, when looking at the same data and assigning categories.

#### Word #39 – Censoring

Censoring in time-to-event data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.

#### Word #38 – Survival Analysis

Survival analysis is a set of methods used to model and analyze survival data, also called time-to-event data.

#### Word #37 – Joint Probability Distribution

The probability distribution for X lists the possible values of X and their associated probabilities. With two separate discrete random variables, X and Y, the joint probability distribution is the function f(x,y)

#### Word #36 – The Jackknife

With a sample of size N, the jackknife involves calculating N values of the estimator, with each value calculated on the basis of the entire sample less one observation.
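
For the sample mean, the N leave-one-out recomputations look like this (toy sample):

```python
# Jackknife for the mean: recompute the estimator N times,
# each time leaving out one observation.
sample = [4.0, 6.0, 5.0, 9.0]
n = len(sample)

leave_one_out = [
    sum(sample[:i] + sample[i + 1:]) / (n - 1) for i in range(n)
]
print(leave_one_out)  # leave-out means: 20/3, 6.0, 19/3, 5.0
```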

#### Word #35 – Interim Monitoring

In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results - say, response to a medication.

#### Industry Spotlight: The brand premium for Chanel and Harvard

The classic illustration of the power of brand is perfume - expensive perfumes may cost just a few dollars to produce but can be sold for more than $500 due to the cachet afforded by the brand. David Malan's Computer Science course at Harvard, CSCI…

#### Industry Spotlight: SAS is back

The big news from the SAS world this summer was the release, on May 28, of the SAS University Edition, which brings the effective price for a single user edition of SAS down from around $10,000 to $0. It does most of the things that…

#### Word #34 – NoSQL

A NoSQL database is distinguished mainly by what it is not -

#### Word #33 – Similarity matrix

A similarity matrix shows how similar records are to each other.
