Blog

Course Spotlight: Text Mining

The term text mining is sometimes used with two different meanings in computational statistics: using predictive modeling to label many documents (e.g. legal docs might be “relevant” or “not relevant”), which is what we call text mining; and using grammar and syntax to parse the meaning of individual documents, for which we use the term natural… Continue reading “Course Spotlight: Text Mining”

SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

OVERFIT

As applied to statistical models, “overfit” means the model fits the data at hand too closely, capturing noise rather than signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:
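A minimal sketch of the same idea, in case it helps to see it in numbers. The data are made up for illustration (a straight-line signal y = 2x + 1 plus hand-picked noise), and the helper names `lagrange_predict` and `fit_line` are hypothetical, not taken from the original post. The polynomial passes through every training point, while the simple line does not, yet the line should do far better on a new point.

```python
# Illustrative data: signal y = 2x + 1 plus small hand-picked noise (assumed values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.5, 2.3, 5.6, 6.5, 9.8, 10.4]


def true_signal(x):
    """The noise-free relationship assumed for this example."""
    return 2.0 * x + 1.0


def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial that passes through every (xs, ys) point at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(xs, ys)

# Largest error on the data used for fitting.
poly_train_err = max(abs(lagrange_predict(xs, ys, x) - y) for x, y in zip(xs, ys))
line_train_err = max(abs(slope * x + intercept - y) for x, y in zip(xs, ys))

# Error against the noise-free signal at a point not used for fitting.
x_new = 6.0
poly_new_err = abs(lagrange_predict(xs, ys, x_new) - true_signal(x_new))
line_new_err = abs(slope * x_new + intercept - true_signal(x_new))

print(f"training error: polynomial={poly_train_err:.3f}, line={line_train_err:.3f}")
print(f"new-data error: polynomial={poly_new_err:.3f}, line={line_new_err:.3f}")
```

The choice of a new point just outside the training range is deliberate: high-degree polynomials tend to swing most sharply there, which makes the contrast easy to see.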

Quotes about Data Science

“The goal is to turn data into information, and information into insight.” – Carly Fiorina, former CEO, Hewlett-Packard Co. Speech given at Oracle OpenWorld. “Data is the new science. Big data holds the answers.” – Pat Gelsinger, CEO, EMC, Big Bets on Big Data, Forbes. “Hiding within those mounds of data is knowledge that could change the life… Continue reading “Quotes about Data Science”

Historical Spotlight: Eugenics – journey to the dark side at the dawn of statistics

April 27 marks the 80th anniversary of the death of Karl Pearson, who contributed to statistics the correlation coefficient, principal components, the (increasingly-maligned) p-value, and much more. Pearson was one of a trio of founding fathers of modern statistics, the others being Francis Galton and Ronald Fisher. Galton, Pearson and Fisher were deeply involved with… Continue reading “Historical Spotlight: Eugenics – journey to the dark side at the dawn of statistics”

Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records. It is often, but not always, randomly drawn. In matrix form, the rows are records (subjects), columns are variables, and cell values are the values… Continue reading “Week #8 – Homonyms department: Sample”

Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation. When there are multiple variables in an analysis, normalization (also called standardization) removes… Continue reading “Week #7 – Homonyms department: Normalization”
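As a rough illustration of the statistical sense of normalization described in that excerpt (subtract the mean, divide by the standard deviation), here is a small sketch. The function name `normalize` and the income and age values are made up for the example, and the use of the sample standard deviation (dividing by n − 1) is an assumption, since the excerpt does not say which version is meant.

```python
import statistics


def normalize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption, see above
    return [(v - mean) / sd for v in values]


# Two hypothetical variables measured on very different scales.
incomes = [32000.0, 45000.0, 51000.0, 78000.0, 120000.0]
ages = [23.0, 35.0, 41.0, 52.0, 67.0]

z_incomes = normalize(incomes)
z_ages = normalize(ages)

# After normalization both variables have mean 0 and standard deviation 1,
# so they are expressed on a common scale.
print([round(z, 2) for z in z_incomes])
print([round(z, 2) for z in z_ages])
```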