#### Week #42 – Kruskal-Wallis Test

The Kruskal-Wallis test is a nonparametric test for determining whether three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.
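
As a rough illustration of how the test statistic is built, the sketch below pools the samples, ranks them (tied values share an average rank), and computes the usual H statistic. The function name `kruskal_wallis_h` and the sample data are hypothetical, and no tie correction is applied, so treat it as a minimal sketch rather than a full implementation.

```python
from itertools import chain


def kruskal_wallis_h(*samples):
    """Return the Kruskal-Wallis H statistic (hypothetical helper, no tie correction)."""
    pooled = sorted(chain.from_iterable(samples))
    n_total = len(pooled)

    # Assign midranks: tied values share the average of the ranks they occupy.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j

    # H = 12 / (N (N + 1)) * sum(R_k^2 / n_k) - 3 (N + 1),
    # where R_k is the rank sum of sample k and n_k its size.
    rank_term = sum(sum(rank_of[x] for x in s) ** 2 / len(s) for s in samples)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)


# Made-up data: three clearly separated groups.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 3))  # 7.2
```

A large H suggests the samples do not share a distribution; in practice H is usually compared against a chi-square distribution with (number of samples - 1) degrees of freedom.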

October 8, 2015

A statistical technique that helps infer whether three or more samples might come from populations having the same mean; specifically, whether the differences among the samples might be caused by chance variation.

October 8, 2015

A two-tailed test is a hypothesis test in which the null hypothesis is rejected if the observed sample statistic is more extreme than the critical value in either direction (higher than the positive critical value or lower than the negative critical value). A two-tailed test…

September 29, 2015

In psychometric surveys, the split-halves method is used to measure the internal consistency reliability of survey instruments, e.g. psychological tests. The idea is to split the items (questions) related to the same construct to be measured, e.g. the anxiety level, and to compare the results…

September 22, 2015

In survival analysis, life tables summarize lifetime data or, generally speaking, time-to-event data. Rows in a life table usually correspond to time intervals, columns to the following categories: (i) not "failed", (ii) "failed", (iii) censored (withdrawn), and the sum of the three called "the number…

September 15, 2015

Truncation, generally speaking, means to shorten. In statistics it can mean the process of limiting consideration or analysis to data that meet certain criteria (for example, the patients still alive at a certain point). Or it can refer to a data distribution where values above…

August 18, 2015

This test is used for testing the significance of unplanned pairwise comparisons. When you do multiple significance tests, the chance of finding a "significant" difference just by chance increases. Tukey's HSD test is one of several methods of ensuring that the chance of finding a…

August 18, 2015

A robust filter is a filter that is not sensitive to input noise values with extremely large magnitude (e.g. those arising due to anomalous measurement errors). The median filter is an example of a robust filter. Linear filters are not robust - their output may…

August 18, 2015

Hypothesis testing (also called "significance testing") is a statistical procedure for discriminating between two statistical hypotheses - the null hypothesis (H_{0}) and the alternative hypothesis (H_{a}, often denoted as H_{1}). Hypothesis testing, in a formal logic sense, rests on the presumption of validity of the null hypothesis - that is, the null hypothesis is rejected only if the data at hand testify strongly enough against it.

August 18, 2015

Kurtosis measures the "heaviness of the tails" of a distribution (compared to a normal distribution). Kurtosis is positive if the tails are "heavier" than for a normal distribution, and negative if the tails are "lighter" than for a normal distribution. Under this convention (often called excess kurtosis), the normal distribution has kurtosis of zero.
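
The sign convention can be checked numerically. The sketch below assumes the common moment-based definition (fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores zero); the function name `excess_kurtosis` and the data are illustrative only.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Two equally likely points: the lightest possible tails.
light = excess_kurtosis([-1, 1])
# Mostly zeros with two extreme values: heavy tails.
heavy = excess_kurtosis([-10, 0, 0, 0, 0, 0, 0, 0, 0, 10])
print(light, heavy)  # -2.0 2.0
```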

August 17, 2015

A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, not significant (a Type-I error). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).

August 6, 2015

The signal is the component of the observed data that carries useful information.

July 23, 2015

Non-parametric regression methods are aimed at describing a relationship between the dependent and independent variables...

June 11, 2015

A nominal scale is really a list of categories to which objects can be classified.

June 11, 2015

The noise is the component of the observed data (e.g. of a time series) that is random and carries no useful information.

June 11, 2015

The single linkage clustering method (or the nearest neighbor method) is a method of calculating distance between clusters in hierarchical cluster analysis.
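
A minimal sketch of the idea: under single linkage, the distance between two clusters is the distance between their closest pair of points. Euclidean distance and the name `single_linkage_distance` are assumptions made for illustration.

```python
import math


def single_linkage_distance(cluster_a, cluster_b):
    """Smallest Euclidean distance between any point in A and any point in B."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


# Made-up clusters on a line: the nearest pair is (1, 0) and (4, 0).
a = [(0, 0), (1, 0)]
b = [(4, 0), (9, 0)]
d = single_linkage_distance(a, b)
print(d)  # 3.0
```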

June 11, 2015

In a network analysis context, "edge" refers to a link or connection between two entities in a network.

June 4, 2015

The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available. Individuals and teams then compete to develop the best performing model.

June 4, 2015

The linear model is ubiquitous in classical statistics, yet real-life data rarely follow a purely linear pattern.

May 26, 2015

Association rules, also called "market basket analysis," is a data mining method applied to transaction data.

May 26, 2015

This week's word is actually a letter. R is a statistical computing and programming language and program, a derivative of the commercial S-PLUS program, which, in turn, was an offshoot of S from Bell Labs.

May 19, 2015

With the advent of Big Data and data mining, statistical methods like regression and CART have been repurposed for use as tools in predictive modeling.

April 28, 2015

The Netflix prize was a famous early application of crowdsourcing to predictive modeling.

April 28, 2015

An A-B test is a classic statistical design in which individuals or subjects are randomly split into two groups and some intervention or treatment is applied.

April 28, 2015

When Apple CEO Tim Cook finally unveiled his company's new Apple Watch in a widely-publicized rollout earlier this month, most of the press coverage centered on its cost ($349 to start) and whether it would be as popular among consumers as the iPod or iMac. Nitin Indurkhya saw…

March 24, 2015

In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
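
The definition translates directly into code. In this sketch (the function name `moving_average_forecast` and the data are hypothetical), the forecast for period t is the mean of the w observations ending at t-1, and periods with fewer than w prior observations get no forecast.

```python
def moving_average_forecast(series, w):
    """Forecast for index t is the mean of series[t-w:t]; None if too little history.

    The returned list has one extra entry: the forecast for the first
    period after the end of the series.
    """
    forecasts = []
    for t in range(len(series) + 1):
        if t < w:
            forecasts.append(None)  # not enough prior observations yet
        else:
            forecasts.append(sum(series[t - w:t]) / w)
    return forecasts


result = moving_average_forecast([10, 20, 30, 40], w=2)
print(result)  # [None, None, 15.0, 25.0, 35.0]
```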

March 9, 2015

In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.

March 9, 2015

A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).

March 9, 2015

RMSE is root mean squared error. In predicting a numerical outcome with a statistical model, predicted values rarely match actual outcomes exactly.
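
Reading the name from right to left gives the computation: take the errors, square them, average them, then take the square root. A minimal sketch with made-up numbers (the function name `rmse` is illustrative):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 1, 0 and -2, so the mean squared error is 5/3.
score = rmse([3, 5, 7], [2, 5, 9])
print(round(score, 4))  # 1.291
```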

March 9, 2015

A label is a category into which a record falls, usually in the context of predictive modeling. Label, class and category are different names for discrete values of a target (outcome) variable.

March 9, 2015

A strip transect is a small subsection of a geographically-defined study area, typically chosen randomly.

February 20, 2015

Spark is a second generation computing environment that sits on top of a Hadoop system, supporting the workflows that leverage a distributed file system.

February 20, 2015

Bandits refers to a class of algorithms in which users or subjects make repeated choices among, or decisions in reaction to, multiple alternatives.

February 20, 2015

In discrete response models, overdispersion occurs when there is more correlation in the data than is allowed by the assumptions that the model makes.

January 30, 2015

In a classification model, the confusion matrix shows the counts of correct and erroneous classifications. In a binary classification problem, the matrix consists of 4 cells.
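
For the binary case, the four cells can be tallied as below. The cell labels (TP, FN, FP, TN), the function name, and the example data are conventions assumed here for illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FN", "FP", "TN")}


actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
cells = confusion_counts(actual, predicted)
print(cells)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 2}
```

The correct classifications are the TP and TN cells; the erroneous ones are FN and FP.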

January 30, 2015

In a classic statistical experiment, treatment(s) and placebo are applied to randomly assigned subjects, and, at the end of the experiment, outcomes are compared.

January 16, 2015

Classification and regression trees, applied to data with known values for an outcome variable, derive models with rules like "If taxable income <$80,000, if no Schedule C income, if standard deduction taken, then no-audit."

January 16, 2015

The predictors in a predictive model are sometimes given different terms by different disciplines. Traditional statisticians think in terms of variables.

January 16, 2015

In logistic regression, we seek to estimate the relationship between predictor variables Xi and a binary response variable. Specifically, we want to estimate the probability p that the response variable will be a 1 rather than a 0.
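
Once coefficients have been estimated, the probability comes from passing a linear combination of the predictors through the logistic function. The sketch below covers only that last step; the intercept and coefficient values are made up, not fitted to any data.

```python
import math


def predict_probability(intercept, coefficients, x):
    """p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients for two predictors.
p = predict_probability(-1.0, [0.5, 0.25], [2.0, 4.0])
print(round(p, 3))  # 0.731
```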

January 16, 2015

Bayesian statistics typically incorporates new information (e.g. from a diagnostic test, or a recently drawn sample) to answer a question of the form "What is the probability that..."

January 16, 2015

Consider two (or more) samples subjected to different treatments. A permutation test assesses whether,

January 9, 2015

One avid reader took issue with a recent definition of "quasi experiment." I had defined it

January 9, 2015

Text analytics or text mining is the natural extension of predictive analytics, and Statistics.com's text analytics program starts Feb. 6. Text analytics is now ubiquitous and yields insight in: Marketing: Voice of the customer, social media analysis, churn analysis, market research, survey analysis Business: Competitive…

December 9, 2014

Say you operate a tank farm (to store and sell fuel). How much of each fuel grade should you buy? You have specified flow and storage capacities, constraints on what types of fuels can be stored in which tanks, prior contractual obligations about minimum monthly…

December 9, 2014

In social science research, particularly in the qualitative literature on program evaluation, the term "quasi-experiment" refers to studies that do not involve the application of treatments via random assignment of subjects.

December 5, 2014

In survey research, curb-stoning refers to the deliberate fabrication of survey interview data by the interviewer.

December 5, 2014

Statistics.com Receives College Recommendation from the American Council on Education (ACE) College Credit Recommendation for Online Data Science Courses from The Institute for Statistics Education at Statistics.com LLC The American Council on Education's College Credit Recommendation Service (ACE CREDIT) has evaluated and recommended college credit…

December 3, 2014

Bag-of-words is a simplified natural language processing concept.

November 7, 2014

In language processing, stemming is the process of taking multiple forms of the same word and reducing them to the same basic core form.

November 7, 2014

Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).

November 7, 2014

In predictive modeling, a key step is to turn available data (which may come from varied sources and be messy) into an orderly matrix of rows (records to be predicted) and columns (predictor variables or features).

November 7, 2014

A full Bayesian classifier is a supervised learning technique that assigns a class to a record by finding other records with attributes just like it has, and finding the most prevalent class among them.
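
A minimal sketch of that procedure: find the training records whose attributes match the new record exactly, then return the most common class among them. The record layout (a tuple of attributes plus a label) and the function name are assumptions made for illustration.

```python
from collections import Counter


def exact_bayes_classify(records, new_attrs):
    """Return the most prevalent class among records with identical attributes."""
    matches = [label for attrs, label in records if attrs == new_attrs]
    if not matches:
        return None  # no record matches exactly, so no class can be assigned
    return Counter(matches).most_common(1)[0][0]


# Made-up training records: (attributes, class).
records = [
    (("high", "owner"), "accept"),
    (("high", "owner"), "accept"),
    (("high", "owner"), "decline"),
    (("low", "renter"), "decline"),
]
label = exact_bayes_classify(records, ("high", "owner"))
print(label)  # accept
```

With many attributes, exact matches quickly become rare, which is why the `None` branch matters in practice.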

November 7, 2014

In computer science, MapReduce is a procedure that prepares data for parallel processing on multiple computers.

November 7, 2014

There was an interesting article a couple of weeks ago in the New York Times magazine section on the role that Big Data can play in treating patients -- discovering things that clinical trials are too slow, too expensive, and too blunt to find. The…

October 20, 2014

Likert scales are categorical ordinal scales used in social sciences to measure attitude. A typical example is a set of response options ranging from "strongly agree" to "strongly disagree."

October 10, 2014

A node is an entity in a network. In a social network, it would be a person. In a digital network, it would be a computer or device.

October 10, 2014

Latent variable models postulate some relationship between the statistical properties of observable variables.

October 1, 2014

K-nearest-neighbor (K-NN) is a machine learning predictive algorithm that relies on calculation of distances between pairs of records.
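
As a sketch of how the distance calculation drives the prediction, the example below classifies a new record by majority vote among its k closest training records. Euclidean distance, majority voting, and the names used are assumptions; other distance measures and voting schemes are possible.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated groups.
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"),
]
print(knn_predict(train, (1.5, 1.5)))  # a
print(knn_predict(train, (8.5, 8.0)))  # b
```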

October 1, 2014

The kappa statistic measures the extent to which different raters or examiners differ when looking at the same data and assigning categories.

September 19, 2014

Censoring in time-series data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.

September 19, 2014

Survival analysis is a set of methods used to model and analyze survival data, also called time-to-event data.

September 18, 2014

The probability distribution for X is the possible values of X and their associated probabilities. With two separate discrete random variables, X and Y, the joint probability distribution is the function f(x,y)

September 18, 2014

With a sample of size N, the jackknife involves calculating N values of the estimator, with each value calculated on the basis of the entire sample less one observation.
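
The leave-one-out step can be sketched in a few lines. The estimator here defaults to the mean purely for illustration; any function of a sample could be passed in its place.

```python
from statistics import mean


def jackknife_estimates(sample, estimator=mean):
    """Return N estimates, each computed on the sample with one observation removed."""
    sample = list(sample)
    return [estimator(sample[:i] + sample[i + 1:]) for i in range(len(sample))]


estimates = jackknife_estimates([2, 4, 6, 8])
print(len(estimates))  # 4
print(estimates[0], estimates[-1])  # 6 4
```

The spread of these N values is what the jackknife uses to assess the variability of the estimator.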

September 18, 2014

In the interim monitoring of clinical trials, multiple looks are taken at the accruing patient results - say, response to a medication.

September 18, 2014

The classic illustration of the power of brand is perfume - expensive perfumes may cost just a few dollars to produce but can be sold for more than $500 due to the cachet afforded by the brand. David Malan's Computer Science course at Harvard, CSCI…

September 11, 2014

The big news from the SAS world this summer was the release, on May 28, of the SAS University Edition, which brings the effective price for a single user edition of SAS down from around $10,000 to $0. It does most of the things that…

August 21, 2014

Nobody expects Twitter feed sentiment analysis to give you unbiased results the way a well-designed survey will. A Pew Research study found that Twitter political opinion was, at times, much more liberal than that revealed by public opinion polls, while it was more conservative at…

August 4, 2014

Boston, August 3 2014: Bill Ruh, GE Software Center, says that the Internet of Things, 30 billion machines talking to one another, will dwarf the impact of the consumer internet. Speaking at the Joint Statistical Meetings today, Ruh predicted that the marriage of the IoT…

August 3, 2014

A NoSQL database is distinguished mainly by what it is not -

July 28, 2014

A similarity matrix shows how similar records are to each other.

July 28, 2014

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).

July 28, 2014

A hold-out sample is a random sample from a data set that is withheld and not used in the model fitting process. After the model...

July 28, 2014

Heteroscedasticity generally means unequal variation of data, e.g. unequal variance. More specifically,

July 28, 2014

Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which

July 15, 2014

The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful in taking averages of ratios.
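
A direct translation of that recipe, assuming strictly positive values (the function name `geometric_mean` and the inputs are illustrative):

```python
import math


def geometric_mean(values):
    """Multiply all n values together, then take the nth root of the product."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    return math.prod(values) ** (1 / len(values))


gm = geometric_mean([2, 8])
print(gm)  # 4.0
```

For long lists, averaging the logarithms and exponentiating the result avoids overflow in the product.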

July 15, 2014

Hierarchical linear modeling is an approach to analysis of hierarchical (nested) data - i.e. data represented by categories, sub-categories, ..., individual units (e.g. school -> classroom -> student).

June 6, 2014

In medical statistics, the hazard function is a relationship between a proportion and time.

June 6, 2014

The Fleming procedure (or *O'Brien-Fleming multiple testing procedure*) is a simple multiple testing procedure for comparing two treatments when the response to treatment is dichotomous. This procedure...

May 30, 2014

In a directed network, connections between nodes are directional. For example..

May 30, 2014

An adjacency matrix describes the relationships in a network. Nodes are listed in the top..

May 30, 2014

The exponential distribution is a model for the length of intervals between two consecutive random events in time or

May 30, 2014

Error is the deviation of an estimated quantity from its true value, or, more precisely,

May 29, 2014

Step-wise regression is one of several computer-based iterative variable-selection procedures.

May 16, 2014

Regularization refers to a wide variety of techniques used to bring structure to statistical models in the face of data size, complexity and sparseness.

May 9, 2014

SQL stands for structured query language, a high-level language for querying relational databases and extracting information.

March 28, 2014

A Markov chain is a probability system that governs transition among states or through successive events.
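
One way to picture this is a table of transition probabilities, where each row gives the chance of moving from one state to each possible next state. The two-state weather example and the `step` helper below are made up for illustration.

```python
def step(distribution, transition):
    """Advance a probability distribution over states by one transition."""
    states = list(transition)
    return {
        nxt: sum(distribution[cur] * transition[cur][nxt] for cur in states)
        for nxt in states
    }


# Hypothetical transition probabilities; each row sums to 1.
transition = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}
today = {"sunny": 1.0, "rainy": 0.0}
tomorrow = step(today, transition)
print(tomorrow)  # {'sunny': 0.9, 'rainy': 0.1}
```

Applying `step` repeatedly shows how the chain evolves; the next state depends only on the current one.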

March 14, 2014

MapReduce is a programming framework to distribute the computing load of very large data and problems to multiple computers.

March 14, 2014

As data processing requirements grew beyond the capacities of even large computers, distributed computing systems were developed to spread the load to multiple computers.

March 14, 2014

The curse of dimensionality is the affliction caused by adding variables to multivariate data models.

March 14, 2014

A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products.

February 21, 2014

Ever wonder why, in World War II, ships in convoys were safer than ships traveling on their own? Most people assume it was due to the protection afforded by military escort vessels, of which there was a limited supply (insufficient to protect ships traveling on…

February 19, 2014

Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called independent variables.

February 12, 2014

Statistical distance is a measure calculated between two records that are typically part of a larger dataset, where rows are records and columns are variables. To calculate...

February 5, 2014

In predictive modeling, the goal is to make predictions about outcomes on a case-by-case basis: an insurance claim will be fraudulent or not, a tax return will be correct or in error, a subscriber...

February 5, 2014

In the machine learning community, a decision tree is a branching set of rules used to classify a record, or predict a continuous value for a record. For example

February 5, 2014

In predictive modeling, feature selection, also called variable selection, is the process (usually automated) of sorting through variables to retain variables that are likely...

February 5, 2014

In predictive modeling, bagging is an ensemble method that uses bootstrap replicates of the original training data to fit predictive models.

February 5, 2014

In predictive modeling, boosting is an iterative ensemble method that starts out by applying a classification algorithm and generating classifications.

