Have you ever needed to analyze data with a spatial component? Geographic clusters of disease, crimes, animals, plants, events? Or to describe the spatial variation of something, and perhaps correlate it with some other predictor? To assess whether the geographic distribution of something departs from randomness? Location data is ubiquitous, as are maps drawn by GIS software. Skilled… Continue reading “Course Spotlight: Spatial Statistics Using R”
Yearly Archives: 2018
“Money and Brains” and “Furs and Station Wagons”
“Money and Brains” and “Furs and Station Wagons” were evocative customer shorthands that the marketing company Claritas came up with over a half century ago. These names, which facilitated the work of marketers and salespeople, were shorthand descriptions of segments of customers identified through statistical cluster analysis. Cluster analysis is also used in market… Continue reading ““Money and Brains” and “Furs and Station Wagons””
Course Spotlight: Text Mining
The term text mining is sometimes used with two different meanings in computational statistics: (1) using predictive modeling to label many documents (e.g. legal docs might be “relevant” or “not relevant”), which is what we call text mining; (2) using grammar and syntax to parse the meaning of individual documents, for which we use the term natural… Continue reading “Course Spotlight: Text Mining”
CONVOLUTION and TENSOR
Today’s Words of the Week are convolution and tensor, key components of deep learning.
BENFORD’S LAW
Benford’s law describes an expected distribution of the first digit in many naturally-occurring datasets.
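As a rough illustration, the expected first-digit proportions under Benford's law are log10(1 + 1/d) for d = 1 through 9. The sketch below compares them with the first digits of the first 1,000 powers of two. The choice of powers of two as example data and the function names are assumptions made for this illustration, not something from the original post.

```python
import math
from collections import Counter

def benford_expected():
    """Expected proportion of each leading digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(value):
    """Return the first significant digit of a nonzero number."""
    text = str(abs(value)).lstrip("0.")
    return int(text[0])

def first_digit_frequencies(values):
    """Observed proportion of each leading digit, skipping zeros."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Illustrative data only: powers of two are a standard sequence
# whose leading digits follow Benford's law closely.
powers_of_two = [2 ** n for n in range(1000)]

expected = benford_expected()
observed = first_digit_frequencies(powers_of_two)

for d in range(1, 10):
    print(f"digit {d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

A large gap between observed and expected proportions in real data is sometimes treated as a prompt for a closer look, though not as proof of a problem on its own.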
CONTINGENCY TABLES
Contingency tables are tables of counts of events or things, cross-tabulated by row and column.
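A minimal sketch of building such a table from raw records, using only the Python standard library. The records, labels, and the `contingency_table` helper are hypothetical, made up for this illustration.

```python
from collections import Counter

# Hypothetical records: one (exposure, outcome) pair per observed individual.
records = [
    ("exposed", "ill"), ("exposed", "ill"), ("exposed", "well"),
    ("unexposed", "ill"), ("unexposed", "well"), ("unexposed", "well"),
    ("unexposed", "well"),
]

def contingency_table(pairs):
    """Cross-tabulate (row, column) pairs into a table of counts."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

row_labels, col_labels, table = contingency_table(records)

# Print the table with row and column labels.
print(" " * 10 + "".join(f"{c:>8}" for c in col_labels))
for label, row in zip(row_labels, table):
    print(f"{label:<10}" + "".join(f"{n:>8}" for n in row))
```

Each cell holds the number of records that fall into that row and column combination, so the cell counts sum to the number of records.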
HYPERPARAMETER
In machine learning, hyperparameter refers, loosely speaking, to user-set parameters; in Bayesian statistics, it refers to parameters of the prior distribution.
SAMPLE
Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.
SPLINE
The easiest way to think of a spline is to first think of linear regression – a single linear relationship between an outcome variable and various predictor variables.
NLP
To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.
OVERFIT
As applied to statistical models, “overfit” means the model fits the data at hand too closely and is fitting noise, not signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data.
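The same point can be sketched in code. The example below uses made-up data (a straight line, y = 2x, plus small alternating noise) and compares a polynomial that passes through every point with an ordinary least-squares line. The data, the function names, and the use of Lagrange interpolation to get the zero-error polynomial are all assumptions for illustration.

```python
# Made-up training data: y = 2x plus alternating noise of +/- 0.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

def polynomial_predict(xs, ys, x_new):
    """Evaluate the degree len(xs)-1 polynomial through all points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(xs, ys)

# Error on the training points: the polynomial reproduces them, the line does not.
poly_train_error = max(abs(polynomial_predict(xs, ys, x) - y) for x, y in zip(xs, ys))
line_train_error = max(abs(slope * x + intercept - y) for x, y in zip(xs, ys))

# Error on a new point, where the underlying signal is 2 * 6 = 12.
x_new, y_true = 6.0, 12.0
poly_new_error = abs(polynomial_predict(xs, ys, x_new) - y_true)
line_new_error = abs(slope * x_new + intercept - y_true)

print(f"training error: polynomial {poly_train_error:.3f}, line {line_train_error:.3f}")
print(f"new-point error: polynomial {poly_new_error:.3f}, line {line_new_error:.3f}")
```

The polynomial wins on the training data and loses badly on the new point, because it has bent itself to follow the noise.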