Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual. A simple...
Tokenization is an initial step in natural language processing. It involves breaking down a text into a series of basic units, typically words. For example...
A natural language is what most people outside the field of computer science think of as just a language (Spanish, English, etc.). The term...
White Hat Bias is bias leading to distortion in, or selective presentation of, data that is considered by investigators or reviewers to be acceptable because it is in the service of righteous goals.
An edge is a link between two people or entities in a network that can be
Stratified sampling is a method of random sampling.
A discrete distribution is one in which the data can only take on certain values, for example integers. A continuous distribution is one in which data can take on any value within a specified range (which may be infinite).
The central limit theorem states that the sampling distribution of the mean approaches Normality as the sample size increases, regardless of the probability distribution of the population from which the sample is drawn.
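The theorem can be checked informally by simulation. The sketch below is only an illustration under assumed choices: the skewed Exponential(1) population, the function names, and the sample sizes are not taken from the glossary entry. It draws many samples, averages each one, and measures how skewed the resulting means are. The skewness should shrink toward zero (the value for a Normal distribution) as the sample size grows.

```python
import random
import statistics


def sample_means(n, reps=2000, seed=0):
    """Return `reps` sample means, each from `n` draws of an Exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


def skewness(xs):
    """Population skewness: the average cubed z-score (0 for a symmetric distribution)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


if __name__ == "__main__":
    for n in (2, 10, 50):
        means = sample_means(n)
        print(n, round(skewness(means), 3))
```

The population here is strongly right-skewed, so any drift of the means toward symmetry comes from averaging, not from the population itself.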
Classification and regression trees (CART) are a set of techniques for classification and prediction.
CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.
In a census survey, all units from the population of interest are analyzed. A related concept is the sample survey, in which only a subset of the population is analyzed.
Discriminant analysis is a method of distinguishing between classes of objects. The objects are typically represented as rows in a matrix.
Also called the training sample, training set, or calibration sample. The context is predictive modeling (also called supervised data mining), where you have data with multiple predictor variables and a single known outcome or target variable.
A general statistical term meaning a systematic (not random) deviation of an estimate from the true value.
One of several computer-based iterative procedures for selecting variables to use in a model. The process begins...
Outcomes of an experiment or repeated events are statistically significant if they differ by more than chance variation might produce.
An estimator is a measure or metric intended to be calculated from a sample drawn from a larger population...
The centroid is a measure of center in multi-dimensional space.
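As a minimal sketch, assuming the usual definition of the centroid as the coordinate-wise mean of a set of points (the function name `centroid` is illustrative only):

```python
def centroid(points):
    """Return the coordinate-wise mean of equal-length points as a tuple."""
    n = len(points)
    # zip(*points) groups the first coordinates together, then the second, etc.
    return tuple(sum(coords) / n for coords in zip(*points))


# The four corners of a 2-by-2 square average to its middle.
print(centroid([(0, 0), (2, 0), (2, 2), (0, 2)]))  # (1.0, 1.0)
```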
A Binomial distribution is used to describe an experiment, event, or process for which the probability of success is the same for each trial and each trial has only two possible outcomes.
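A short sketch of the corresponding probability calculation, using the standard Binomial formula and assuming independent trials (the name `binomial_pmf` is illustrative, not part of the glossary):

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exactly one head in two fair coin flips: HT or TH out of four equally likely outcomes.
print(binomial_pmf(1, 2, 0.5))  # 0.5
```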
A combination of treatment comparisons (e.g. send a sales solicitation, or send nothing) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments.
Network analytics is the science of describing and, especially, visualizing the connections among objects.
Multiplicity issues arise in a number of contexts, but they generally boil down to the same thing: repeated looks at a data set in different ways, until something "statistically significant" emerges.
Support vector machines are used in data mining (predictive modeling, to be specific) for classification of records, by learning from training data.
In data analysis or data mining, an attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another. It might
The negative binomial distribution is the probability distribution of the number of Bernoulli (yes/no) trials required to obtain r successes.
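Under this "number of trials" definition, the r-th success lands on trial n only if the first n - 1 trials contain exactly r - 1 successes and trial n is itself a success. The sketch below encodes that standard formula, assuming independent trials. The function name and arguments are illustrative.

```python
from math import comb


def neg_binomial_pmf(n, r, p):
    """Probability that the r-th success occurs on trial n (success probability p)."""
    if n < r:
        return 0.0  # r successes cannot occur in fewer than r trials
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)


# With r = 1 this reduces to the geometric distribution:
# first success on the third fair-coin flip means fail, fail, succeed.
print(neg_binomial_pmf(3, 1, 0.5))  # 0.125
```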
A random walk is a process of random steps, motions, or transitions. It might be in one dimension (movement along a line), in two dimensions (movements in a plane), or in three dimensions or more.
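The one-dimensional case can be sketched as follows, assuming the simplest version in which each step is +1 or -1 with equal probability (the step rule and the function name are assumptions made for illustration):

```python
import random


def random_walk_1d(n_steps, seed=None):
    """Return the positions visited by a simple symmetric walk on the integers, starting at 0."""
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(n_steps):
        position += rng.choice((-1, 1))  # move one unit left or right
        path.append(position)
    return path


if __name__ == "__main__":
    print(random_walk_1d(10, seed=42))
```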
Cover time is the expected number of steps in a random walk required
is a general computer-intensive approach used in estimating the accuracy of statistical models.
(also called dissimilarity matrix) describes pairwise distinction between M objects.
in discrete time is the transformation of the series to a new time series where the values are the differences between consecutive values of the original series.
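The transformation described above can be sketched in a few lines. This is an illustrative first-order version. The function name `difference` is an assumption, and the result is one element shorter than the input.

```python
def difference(series):
    """Return the differences between consecutive values of a series."""
    return [b - a for a, b in zip(series, series[1:])]


print(difference([3, 5, 4, 8]))  # [2, -1, 4]
```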
(outcome or variable) means "having only two possible values", e.g.
A probability density function is a curve used
In predictive modeling, data partitioning is the division of the data available for analysis into two or three non-overlapping
Promoting better understanding of statistics throughout the world.
New Editor of Journal of Statistics Education
Read Peter's Letter to the Editor in Saturday's Washington Post.
Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.
Evidence shows that there is no significant difference between taking an online introductory statistics course and a traditional in-person class.
Facebook began trading around 11:30 this morning, and I spent 8 minutes
Newly elected American Statistical Association (ASA) Fellow, and recognized for his outstanding professional contributions to and leadership in the field of statistical science.
Arizona's immigration law goes before the Supreme Court this week...
I saw this job posting a while ago, and, in my next life,
David Unwin, Emeritus Chair in Geography, Birkbeck College, University of London (and instructor at Statistics.com!) will be awarded the Association of American Geographers (AAG) Ronald F. Abler Distinguished Service Honors at the upcoming annual meeting next week.
February 12 was the 80th anniversary of the birth of Julian Simon, an early pioneer in resampling methods.
Statistics for Future Presidents - Steve Pierson, Director of Science Policy at ASA, wrote an interesting blog post wondering how statistics for future presidents (or policymakers more generally) would compare with the recommended statistical skills and concepts for others. Take a look and let him know!
Teaching Geographic Information Science and Technology in Higher Education, 2012 (Wiley)
The stories of the prospective Facebook IPO and the prior IPOs from LinkedIn, Pandora, and Groupon all involve "data scientists". Read an interview with Monica Rogati, Senior Data Scientist at LinkedIn, to see the connection.
Dr. Michelle Everson is recognized for her outstanding contributions to and innovation in the teaching of elementary statistics.
John Elder's presentations on common data mining mistakes are a must-see if you have any experience or plans in the data mining arena.
"Any claim coming from an observational study is most likely to be wrong." Thus begins "Deming, data and observational studies," just published in "Significance Magazine" (Sept. 2011).
I was watching a Washington Nationals game on TV a couple of days ago, and the concept of "expected value" ...
A neurosurgeon, pathologist and epidemiologist are each told to examine a can of sardines on a table in a closed room, and present a report.
What do teenagers want? More importantly for the music industry, what music will they buy?
Advertisers shy away from round numbers, believing that $99 appears significantly cheaper than $100...
Did the NCAA get the March Madness rankings right? Check out SportsMeasures.com
What does Matt Asher's article "Attack of the Hair Trigger Bees" have to do with global warming? Matt Asher runs statisticsblog.com ...
The first Gallup Poll was published in October, 1935. In America Speaks,
Thinking about careers that use statistics? The job title "catastrophe modeling assistant" caught my eye recently in a job announcement. ...