To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origins in several different fields, which leads to potentially confusing homonyms like these:
When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records. One of them is the “yacht owner” problem.
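A quick sketch of the issue (the records here are made up): with a rare attribute like yacht ownership, a similarity measure that counts shared absences (0–0 matches) makes nearly everyone look alike, which is why measures such as Jaccard similarity ignore joint absences:

```python
import numpy as np

# Two hypothetical consumer records: 1 = has the attribute, 0 = doesn't.
# Attributes: owns_car, owns_home, owns_boat, owns_yacht, owns_plane
a = np.array([1, 1, 0, 0, 0])
b = np.array([1, 0, 0, 0, 0])

# Simple matching counts 0-0 agreements: rare attributes like yacht
# ownership make almost any two records look similar.
matching = np.mean(a == b)

# Jaccard similarity ignores joint absences (0-0 pairs).
both = np.sum((a == 1) & (b == 1))
either = np.sum((a == 1) | (b == 1))
jaccard = both / either

print(f"simple matching: {matching:.2f}")  # 0.80
print(f"jaccard:         {jaccard:.2f}")   # 0.50
```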
Rectangular data are the staple of statistical and machine learning models. Rectangular data are multivariate cross-sectional data (i.e., not time-series or repeated-measures data) in which each column is a variable (feature), and each row is a case or record.
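A minimal illustration (the data are hypothetical):

```python
import pandas as pd

# Rectangular data: each row is one record (here, a customer),
# each column is one variable (feature).
df = pd.DataFrame({
    "age":       [34, 51, 29],
    "region":    ["north", "south", "north"],
    "purchases": [3, 7, 1],
})
print(df.shape)  # (3, 3): 3 records x 3 variables
```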
Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample. It can occur in numerous situations; here are just a few:
A “Likert scale” is used in self-report rating surveys to allow respondents to express an opinion or assessment of something on a gradient scale. For example, a response could range from “agree strongly” through “agree somewhat” and “disagree somewhat” on to “disagree strongly.” Two key decisions the survey designer faces are
How many gradients to allow, and
Whether to include a neutral midpoint
A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category. Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
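A sketch of how this might look in Python, using pandas’ get_dummies (the policy data are made up):

```python
import pandas as pd

# Hypothetical policies with a three-category variable
policies = pd.DataFrame({"type": ["residential", "commercial", "automotive",
                                  "residential"]})

# One dummy (0/1) column per category
dummies = pd.get_dummies(policies["type"], dtype=int)
print(dummies)
#    automotive  commercial  residential
# 0           0           0            1
# 1           0           1            0
# 2           1           0            0
# 3           0           0            1
```

In a regression model, one of the three columns would typically be dropped, since its value is implied by the other two.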
Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb. With a pretense of being an individual selling a car on his or her own, and with no fixed location, such dealers avoid the fixed costs and regulations that licensed dealers face.
Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects. From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2. However, …
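A toy simulation (the population, trait rate, and friendship model here are all hypothetical) of how a seed’s characteristics can propagate through a snowball sample:

```python
import random

random.seed(1)

# Hypothetical population: 500 people, 10% have a rare trait, and
# friendships form preferentially among people who share the trait
# (homophily), which is what lets the seed's characteristics propagate.
N = 500
trait = [random.random() < 0.10 for _ in range(N)]
friends = {i: set() for i in range(N)}
for i in range(N):
    while len(friends[i]) < 5:
        j = random.randrange(N)
        if j != i and (trait[i] == trait[j] or random.random() < 0.2):
            friends[i].add(j)
            friends[j].add(i)

def snowball(seed, size):
    """Grow a sample by adding contacts of people already sampled."""
    sample, frontier = {seed}, [seed]
    while len(sample) < size and frontier:
        person = frontier.pop(0)
        for f in friends[person]:
            if f not in sample and len(sample) < size:
                sample.add(f)
                frontier.append(f)
    return sample

for seed in (0, 1, 2):
    s = snowball(seed, 50)
    print(f"seed {seed}: trait rate in sample = "
          f"{sum(trait[i] for i in s) / len(s):.2f}")
print(f"true trait rate = {sum(trait) / N:.2f}")
```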
QUESTION: The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent). A consultant has proposed a machine learning system to review claims and classify them as fraud or non-fraud. The system is 90% effective in detecting fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as “fraud”). If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
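One way to work this out (a sketch, not necessarily the intended solution path) is Bayes’ rule:

```python
# Bayes' rule with the numbers from the question:
p_fraud = 0.10                # prior: 10% of claims are fraudulent
p_flag_given_fraud = 0.90     # 90% of fraudulent claims are flagged
p_flag_given_ok = 0.20        # 1 in 5 non-fraud claims wrongly flagged

p_flag = (p_flag_given_fraud * p_fraud
          + p_flag_given_ok * (1 - p_fraud))      # 0.09 + 0.18 = 0.27
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag
print(f"{p_fraud_given_flag:.3f}")  # 0.333: only one flagged claim in three
```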
Churn is a term used in marketing to refer to the departure, over time, of customers. Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.). A customer who leaves, for whatever reason, “churns.”
The Receiver Operating Characteristic (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s: for example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities: the true positive rate (the share of actual 1’s correctly identified) on the vertical axis, against the false positive rate (the share of actual 0’s wrongly classified as 1’s) on the horizontal axis.
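A minimal sketch with scikit-learn (the labels and model scores are made up):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels (1 = fraud) and model scores
y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr.round(2), tpr.round(2))))   # points on the ROC curve
print(roc_auc_score(y_true, y_score))          # area under the curve
```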
A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you’re measuring, whom you’re measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition of the problem under study does not change once the study is underway.
It is 100 years since R. A. Fisher introduced the concept of “variance” (in his 1918 paper “The Correlation Between Relatives on the Supposition of Mendelian Inheritance”).
“Bag” refers to “bootstrap aggregating”: repeatedly drawing bootstrap samples from a dataset and aggregating the results of statistical models applied to those samples. (A bootstrap sample is a resample drawn with replacement.)
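A minimal hand-rolled sketch of the idea (for real work, scikit-learn’s BaggingClassifier packages this up); the dataset here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)

# "Bag": fit one model per bootstrap sample
trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))  # resample with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# ...then aggregate the models' predictions by majority vote
votes = np.mean([t.predict(X) for t in trees], axis=0)
print("bagged accuracy:", np.mean((votes >= 0.5) == y))
```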
I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
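A minimal sketch in numpy (the data are hypothetical), using the bootstrap to get a percentile confidence interval for the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=10, size=100)   # hypothetical skewed sample

# Each bootstrap sample is the same size as the data, drawn with replacement
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(5000)]

# Percentile bootstrap 90% confidence interval for the mean
print(np.percentile(boot_means, [5, 95]))
```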
Today’s Words of the Week are convolution and tensor, key components of deep learning.
Benford’s law describes an expected distribution of the first digit in many naturally-occurring datasets.
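The expected distribution is P(first digit = d) = log10(1 + 1/d); a quick check:

```python
import math

# Benford's law: P(first digit = d) = log10(1 + 1/d)
for d in range(1, 10):
    print(d, round(math.log10(1 + 1 / d), 3))
# 1 begins about 30.1% of values, 9 only about 4.6%
```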
Contingency tables are tables of counts of events or things, cross-tabulated by row and column.
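A quick sketch with pandas (the claims data are made up):

```python
import pandas as pd

# Hypothetical records of claims
df = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "south", "north"],
    "outcome": ["paid", "denied", "paid", "paid", "denied", "paid"],
})

# Counts cross-tabulated by row (region) and column (outcome)
print(pd.crosstab(df["region"], df["outcome"]))
# outcome  denied  paid
# region
# north         1     2
# south         1     2
```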
Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, where it refers to parameters of the prior distribution.
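A quick illustration of the machine learning usage (the model choice here is arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators and max_depth are hyperparameters: set by the user before
# training, not learned from the data (the tree splits themselves are
# the learned parameters).
model = RandomForestClassifier(n_estimators=200, max_depth=5)
```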