
As data continue to grow faster than either population or economic activity, organizations are stepping up their efforts to deal with the data deluge and use it to capture value. The methods used to analyze data are multiplying too, which creates an expanding set of terms (including some buzzwords) used to describe them.
This is a field in flux, and different people may have different conceptions of what terms mean. Comments on this page and its "definitions" are welcome. Since many of these terms are subsets of others, or overlapping, the clearest approach is to start with the more specific terms and move to the more general.
Basically the same thing as predictive modeling, but less specific and technical. Often used to describe the field more generally.
Another synonym for predictive modeling.
Data mining methods not involving the prediction of an outcome based on training models on data where the outcome is known. Unsupervised methods include cluster analysis, association rules, outlier detection, dimension reduction and more.
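As one illustration of an unsupervised method, association rules are commonly scored by support (how often the items appear together) and confidence (how often the consequent appears given the antecedent). The sketch below uses invented baskets and hypothetical helper names (`support`, `confidence`); it is a minimal illustration, not the interface of any particular library.

```python
# Toy market-basket data; every item and basket here is invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Score the hypothetical rule "bread -> milk". No outcome variable is involved.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

No record is labeled with a known outcome here; the rule is scored purely from co-occurrence, which is what makes the method unsupervised.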
Covers nearly all of the above methods, and also carries the mantle of a well-established profession dating back to the mid-1800s. Although statisticians work on "big data" problems, the field of statistics has traditionally concentrated on focused research studies (e.g. drug trials).
Refers to the huge amounts of data that large businesses and other organizations collect and store. It might be unstructured text (streams of tweets) or structured quantitative data (transaction databases). In the 1990s organizations began making efforts to extract useful information from these data. The challenges of big data lie mainly in the pre-analysis stage, in the IT domain.
Our friend Gregory Piatetsky-Shapiro, Editor and Analytics/Data Mining Expert at KDnuggets, conducted the following poll:
| What will replace "Big Data" as a hot buzzword? [262 voters] | Votes |
| --- | --- |
| Smart Data | 76 |
| Big Analytics | 73 |
| Data+ | 26 |
| Linked Data | 25 |
| Internet of Things | 23 |
| Power Data | 9 |
| Good Data | 5 |
| Other | 28 |
For the full report, go to http://www.kdnuggets.com/polls/2012/what-will-replace-big-data.html
Analytics in which computers "learn" from data to produce models or rules that apply to those data and to other similar data. Predictive modeling techniques such as neural nets, classification and regression trees (decision trees), naive Bayes, k-nearest neighbor, and support vector machines are generally included. One characteristic of these techniques is that the form of the resulting model is flexible, and adapts to the data. Statistical modeling methods that have highly structured model forms, such as linear regression, logistic regression and discriminant analysis are generally not considered part of machine learning. Unsupervised learning methods such as association rules and clustering are also considered part of machine learning.
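To make the "flexible model form" point concrete, here is a minimal sketch of k-nearest neighbor classification. The training points, labels, and the function name `knn_predict` are all invented for illustration. Note that no equation is fitted: the "model" is simply the stored data plus a voting rule.

```python
from collections import Counter
import math

# Hypothetical training data: (feature_1, feature_2) paired with a known label.
training = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.3), "a"),
    ((4.0, 4.2), "b"),
    ((4.1, 3.9), "b"),
    ((3.8, 4.0), "b"),
]


def knn_predict(point, data, k=3):
    """Label `point` by majority vote among its k nearest training records."""
    nearest = sorted(data, key=lambda rec: math.dist(point, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


pred_low = knn_predict((1.1, 1.0), training)
pred_high = knn_predict((3.9, 4.1), training)
print(pred_low, pred_high)
```

By contrast, a linear regression would commit in advance to a fixed functional form and estimate only its coefficients.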
The science of describing and, especially, visualizing the connections among objects. The objects might be human, biological or physical. Graphical representation is a crucial part of the process; Wayne Zachary's classic 1977 network diagram of a karate club reveals the centrality of two individuals, and presages the club's subsequent split into two clubs. The key elements are the nodes (circles, representing individuals) and edges or links (lines representing connections).

(Wayne Zachary. An information flow model for conflict and fission in small groups, Journal of Anthropological Research, 33(4):452–473, 1977; cited in D. Easley & J. Kleinberg, Networks, Crowds, and Markets: Reasoning about a Highly Connected World, Cambridge University Press, 2010, available also at http://www.cs.cornell.edu/home/kleinber/networks-book/ where this figure is drawn from.)
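The node-and-edge representation can be sketched in a few lines. The example below uses a small made-up edge list (not Zachary's karate club data) and computes degree centrality, one simple way of measuring how central a node is; the names are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge list, NOT Zachary's data: each pair is one undirected link.
edges = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
    ("bob", "cat"),
    ("eve", "fay"), ("eve", "gus"), ("eve", "dan"),
]

# Adjacency sets: node -> set of directly connected nodes.
neighbours = defaultdict(set)
for u, v in edges:
    neighbours[u].add(v)
    neighbours[v].add(u)

n = len(neighbours)
# Degree centrality: a node's link count divided by the maximum possible (n - 1).
degree_centrality = {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# List nodes from most to least central, breaking ties alphabetically.
for node, score in sorted(degree_centrality.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{node}: {score:.2f}")
```

In this toy graph two nodes tie for the highest score, loosely echoing the two central individuals in the karate club diagram.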
Network analytics applied to connections among humans. Recently it has also come to encompass the analysis of web sites and internet services like Facebook.
Statistical or machine learning methods applied to web data such as page views, hits, clicks, and conversions (sales), generally with a view to learning what web presentations are most effective in achieving the organizational goal (usually sales). This goal might be to sell products and services on a site, to serve and sell advertising space, to purchase advertising on other sites, or to collect contact information. Key challenges in web analytics are the volume and constant flow of data, and the navigational complexity and sometimes lengthy gaps that precede users' relevant web decisions.
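A very simple starting point is the conversion rate for each web presentation. The sketch below assumes a hypothetical event log of `(visitor_id, page_variant, event)` tuples; the variant names and all values are invented, and real logs would be far larger and messier.

```python
from collections import defaultdict

# Hypothetical event log: (visitor_id, page_variant, event). All values invented.
events = [
    (1, "layout_a", "view"), (1, "layout_a", "click"), (1, "layout_a", "sale"),
    (2, "layout_a", "view"),
    (3, "layout_a", "view"), (3, "layout_a", "click"),
    (4, "layout_b", "view"), (4, "layout_b", "sale"),
    (5, "layout_b", "view"), (5, "layout_b", "click"), (5, "layout_b", "sale"),
    (6, "layout_b", "view"),
    (7, "layout_a", "view"),
]

visitors = defaultdict(set)  # variant -> visitors who viewed it
buyers = defaultdict(set)    # variant -> visitors who converted (made a purchase)
for visitor, variant, event in events:
    if event == "view":
        visitors[variant].add(visitor)
    elif event == "sale":
        buyers[variant].add(visitor)

# Conversion rate: share of a variant's visitors who went on to buy.
conversion_rate = {v: len(buyers[v]) / len(visitors[v]) for v in visitors}
for variant, rate in sorted(conversion_rate.items()):
    print(f"{variant}: {rate:.2f}")
```

Counting distinct visitors rather than raw events keeps a single user's repeated views from inflating the denominator.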
A combination of treatment comparisons (e.g. send a sales solicitation to one group, send nothing to another group) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments. Here are the steps, in conceptual terms, for a typical uplift model:
1. Conduct an A-B test, where B is the control
2. Combine all the data from both groups
3. Divide the data into a number of segments, each having roughly similar numbers of subjects who got treatment A and control. Tree-based methods are typically used for this.
4. The segments should be drawn such that, within each segment, the response to treatment A is substantially different from the response to control.
5. Considering each segment as the modeling unit, build a model that predicts whether a subject will respond positively to treatment A.
The challenge (and the novelty) is to recognize that the model cannot operate on individual cases. Each subject gets either treatment A or control, but not both, so the "uplift" from treatment A compared to control cannot be observed at the individual level, only at the group level. Hence the need for the segments described in steps 3 and 4.
Note: Traditional A-B testing would stop at step 1, and apply the more successful treatment to all subjects.
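The group-level idea behind steps 2 through 4 can be sketched roughly as follows. This is a simplification under stated assumptions: the records are invented, and the segments ("young", "older") are supplied by hand, whereas in practice they would typically be found by a tree-based method. Uplift is taken here as the response rate under treatment A minus the response rate under control within each segment.

```python
from collections import defaultdict

# Hypothetical combined A-B test records: (segment, group, responded).
# "A" is the treatment and "B" is control. Segments are given by hand here;
# in practice they would typically come from a tree-based method (step 3).
records = [
    ("young", "A", 1), ("young", "A", 1), ("young", "A", 0), ("young", "A", 1),
    ("young", "B", 0), ("young", "B", 1), ("young", "B", 0), ("young", "B", 0),
    ("older", "A", 0), ("older", "A", 1), ("older", "A", 0), ("older", "A", 0),
    ("older", "B", 1), ("older", "B", 0), ("older", "B", 1), ("older", "B", 0),
]

# (segment, group) -> [number of responders, number of subjects]
counts = defaultdict(lambda: [0, 0])
for segment, group, responded in records:
    counts[(segment, group)][0] += responded
    counts[(segment, group)][1] += 1


def rate(segment, group):
    """Response rate for one group within one segment."""
    responders, subjects = counts[(segment, group)]
    return responders / subjects


segments = sorted({segment for segment, _, _ in records})
# Uplift is only observable per segment, never per individual subject.
uplift = {s: rate(s, "A") - rate(s, "B") for s in segments}
# Segments where treatment A appears to help are candidates for targeting.
targets = [s for s in segments if uplift[s] > 0]

for s in segments:
    print(f"{s}: uplift={uplift[s]:+.2f}")
print("target segments:", targets)
```

With these made-up numbers the treatment helps one segment and appears to hurt the other, which is exactly the pattern a plain A-B comparison would average away.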
Reference: "Real World Uplift Modelling with Significance-Based Uplift Trees," by N. J. Radcliffe and P. D. Surry, available as a white paper at stochasticsolutions.com/