
Quasi-experiment

Quasi-experiment: In social science research, particularly in the qualitative literature on program evaluation, the term "quasi-experiment" refers to studies that do not involve the application of treatments via random assignment of subjects. They are also called observational studies. A quasi-experiment (or observational study) does involve the application of a treatment,...

Bag-of-words

Bag-of-words: Bag-of-words is a simplified natural language processing concept. Text documents are parsed and output as collections of words (i.e. stripped of punctuation, etc.). In the bag-of-words concept, the resulting collection of words is considered for further analytics without regard to order, grammar, etc. (but the multiple occurrence of words...
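As a minimal sketch of the idea (standard-library Python only; the example sentence is invented), a document can be reduced to a bag of words by stripping punctuation, lowercasing, and counting occurrences, discarding order and grammar but keeping multiplicity:

```python
import string
from collections import Counter

def bag_of_words(text):
    """Parse a document into a word-count bag: strip punctuation,
    lowercase, split on whitespace, and count occurrences."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return Counter(cleaned.split())

bag = bag_of_words("The cat sat. The cat ran!")
# word order and grammar are gone, but multiple occurrences are counted
```

Here `bag["the"]` and `bag["cat"]` are both 2, even though the original sentences are no longer recoverable.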

Stemming

Stemming: In processing unstructured text, stemming is the process of converting multiple forms of the same word into one stem, to simplify the task of analyzing the processed text. For example, in the previous sentence, "processing," "process," and "processed" would all be converted to the single stem "process."
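A deliberately crude suffix-stripping stemmer can illustrate the idea (real stemmers, such as the Porter algorithm, use far more rules; this sketch handles only the example words above):

```python
def crude_stem(word):
    """Very crude suffix stripper for illustration only: remove a few
    common English suffixes, leaving a stem of reasonable length."""
    for suffix in ("ing", "ed", "es"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    # strip a plural "s", but not a double "ss" as in "process"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

stems = [crude_stem(w) for w in ["processing", "processed", "process"]]
# all three forms reduce to the single stem "process"
```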

Structured vs. unstructured data

Structured vs. unstructured data: Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features). Or data that is in a form that can be extracted and turned...

Feature engineering

Feature engineering: In predictive modeling, a key step is to turn available data (which may come from varied sources and be messy) into an orderly matrix of rows (records to be predicted) and columns (predictor variables or features). The feature engineering process involves review of the data by a domain...

Naive Bayes classifier

Naive Bayes classifier: A full Bayesian classifier is a supervised learning technique that assigns a class to a record by finding other records with attributes just like it has, and finding the most prevalent class among them. Naive Bayes (NB) recognizes that finding exact matches is unlikely to be feasible...
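As a minimal sketch of the naive Bayes idea (the weather-style records and labels below are invented for illustration), the classifier multiplies a class prior by per-attribute conditional probabilities instead of searching for exact matches; add-one smoothing here is an assumption, not part of the entry:

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Fit naive Bayes on categorical records: class counts plus
    per-(class, attribute-position) value counts."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for rec, y in zip(records, labels):
        for i, v in enumerate(rec):
            value_counts[(y, i)][v] += 1
    return class_counts, value_counts

def predict_nb(model, record):
    """Assign the class maximizing prior * product of per-attribute
    conditional probabilities (with add-one smoothing)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for y, n in class_counts.items():
        score = n / total
        for i, v in enumerate(record):
            score *= (value_counts[(y, i)][v] + 1) / (n + 2)
        if score > best_score:
            best, best_score = y, score
    return best

records = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(records, labels)
```

A query such as `predict_nb(model, ("sunny", "hot"))` scores each class on the attributes independently, which is the "naive" independence assumption.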

Node

Node: A node is an entity in a network. In a social network, it would be a person. In a digital network, it would be a computer or device. Nodes can be of different types in the same network - a criminal network might contain individuals, organizations, locations, aliases, etc....

k-Nearest neighbor

k-Nearest neighbor: k-nearest-neighbor (k-NN) is a machine learning predictive algorithm that relies on calculation of distances between pairs of records. The algorithm is used in classification problems where training data are available with known target values. The algorithm takes each record and assigns it the class to which...
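A minimal k-NN sketch (the 2-D training points and class labels below are invented): compute the distance from the query record to every training record, take the k closest, and assign the majority class among them:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Assign `query` the majority class among its k nearest training
    records by Euclidean distance. `train` is a list of (point, class)."""
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(cls for _, cls in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b")]
```

For example, `knn_classify(train, (1.5, 1.5), k=3)` finds three "a" neighbors and returns "a".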

NoSQL

NoSQL: A NoSQL database is distinguished mainly by what it is not - it is not a structured relational database format that links multiple separate tables. NoSQL stands for "not only SQL," meaning that SQL, or structured query language, is not needed to extract and organize information. NoSQL databases tend to...

Predictive Modeling

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.). Many of the techniques used (e.g. regression, logistic regression,...

Directed vs. Undirected Network

Directed vs. Undirected Network: In a directed network, connections between nodes are directional. For example, in a Twitter network, Smith might follow Jones but that does not mean that Jones follows Smith. Each directional relationship would have an edge to represent it, typically with an arrow. In an undirected network,...
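A small sketch of the distinction, extending the entry's Twitter example (the third account name, "Lee", is invented): a directed network can be stored as an adjacency mapping, and the relationship Smith → Jones does not imply Jones → Smith:

```python
# Directed edges: follows[x] is the set of accounts x follows.
follows = {
    "Smith": {"Jones", "Lee"},
    "Jones": {"Lee"},
    "Lee": set(),
}

def is_mutual(a, b):
    """True only when the directed relationship runs both ways -
    the situation an undirected edge would represent."""
    return b in follows.get(a, set()) and a in follows.get(b, set())
```

Here Smith follows Jones, but `is_mutual("Smith", "Jones")` is false because Jones does not follow Smith back.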

Regularization

Regularization: Regularization refers to a wide variety of techniques used to bring structure to statistical models in the face of data size, complexity and sparseness. Advances in digital processing, storage and retrieval have led to huge and growing data sets ("Big Data"). Regularization is used to allow models to...
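One common regularization technique is ridge (L2) penalization; as a minimal sketch (the data points are invented), for a no-intercept least-squares fit the penalty has a closed form: minimizing sum((y - b*x)^2) + lam*b^2 gives b = Sxy / (Sxx + lam), so the penalty shrinks the coefficient toward zero:

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope of a no-intercept least-squares fit with an
    L2 (ridge) penalty of strength lam: b = Sxy / (Sxx + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1, 2, 3], [2, 4, 6]
# lam = 0 recovers ordinary least squares (slope 2.0);
# increasing lam shrinks the slope toward zero
```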

SQL

SQL: SQL stands for structured query language, a high-level language for querying relational databases and extracting information. For example, SQL provides the syntax rules that can translate a query like this into a form that can be submitted to the database: "Find all sales of products X and Y in...
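A minimal sketch of such a query, using Python's built-in sqlite3 module (the `sales` table, its columns, and the rows are hypothetical): the English request "find all sales of products X and Y" becomes a `SELECT` with a `WHERE ... IN` condition:

```python
import sqlite3

# Hypothetical sales table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("X", "East", 100.0), ("Y", "West", 250.0), ("Z", "East", 75.0)],
)

# "Find all sales of products X and Y" expressed in SQL:
rows = conn.execute(
    "SELECT product, region, amount FROM sales WHERE product IN ('X', 'Y')"
).fetchall()
```

The query returns only the X and Y rows; the Z sale is filtered out by the `WHERE` clause.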

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC): A Markov chain is a probability system that governs transitions among states or through successive events. For example, in the American game of baseball, the probability of reaching base differs depending on the "count" -- the number of balls and strikes facing the batter. 3...
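A minimal sketch of the Markov-chain half of the term (the two weather states and their transition probabilities are invented): the next state depends only on the current state, and repeatedly applying the transition probabilities drives the distribution over states toward a stationary distribution:

```python
# Toy two-state Markov chain; rows of transition probabilities sum to 1.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step_distribution(dist):
    """Advance the probability distribution over states by one transition."""
    out = {s: 0.0 for s in transitions}
    for state, p in dist.items():
        for nxt, q in transitions[state].items():
            out[nxt] += p * q
    return out

dist = {"sunny": 1.0, "rainy": 0.0}
for _ in range(50):
    dist = step_distribution(dist)
# dist converges to the stationary distribution (2/3 sunny, 1/3 rainy)
```

MCMC methods exploit this convergence: a chain is constructed whose stationary distribution is the one to be sampled from.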

MapReduce

MapReduce: In computer science, MapReduce is a procedure that prepares data for parallel processing on multiple computers. The "map" function sorts the data, and the "reduce" function generates frequencies of items. The combined overall system parcels the data out to multiple processors and manages the tasks. Apache...
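The classic illustration is word counting; as a single-machine sketch of the pattern (not how Hadoop itself is implemented), the map phase emits (word, 1) pairs, the pairs are sorted by key, and the reduce phase sums each group into a frequency:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    return [(word, 1) for doc in documents for word in doc.split()]

def reduce_phase(pairs):
    """Shuffle/sort the pairs by key, then reduce each group to a count."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["a b a", "b c"]))
```

In a real deployment the map and reduce calls run on many machines at once, with the framework handling the parceling out of data and the sort between phases.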

Hadoop

Hadoop: As data processing requirements grew beyond the capacities of even large computers, distributed computing systems were developed to spread the load to multiple computers. Hadoop is a distributed computing system with two key features: (1) it is open source, and (2) it can use low-cost commodity computers in its...

Curse of Dimensionality

Curse of Dimensionality: The curse of dimensionality is the affliction caused by adding variables to multivariate data models. As variables are added, the data space becomes increasingly sparse, and classification and prediction models fail because the available data are insufficient to provide a useful model across so many variables. An...
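The sparseness can be demonstrated with a small sketch (point counts and dimensions chosen arbitrarily): holding the number of random points fixed, average pairwise distance grows as dimensions are added, so any fixed-size neighborhood of a record contains fewer and fewer of the other records:

```python
import math
import random

def mean_pairwise_distance(n_points, dims, seed=0):
    """Average Euclidean distance among random points in the unit cube.
    With points held fixed, this grows with dimension - the data space
    becomes increasingly sparse."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return sum(dists) / len(dists)

low = mean_pairwise_distance(50, 2)    # 50 points in 2 dimensions
high = mean_pairwise_distance(50, 100)  # same 50 points' worth of data in 100
# high is several times larger than low: neighbors have receded
```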

Data Product

Data Product: A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products. For example, at large web-based retail organizations like Amazon, shopping carts are used...

Feature

Feature: This term is used synonymously with attribute and variable; strictly speaking, it is an independent variable (see dependent and independent variables). The term feature comes from the machine learning community, often in the phrase "feature selection" (which see).
