HDFS is the Hadoop Distributed File System. It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.
Yearly Archives: 2015
Week #42 – Kruskal-Wallis Test
The Kruskal-Wallis test is a nonparametric test for finding if three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.
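As a rough illustration of how the test statistic is built, the sketch below pools the samples, ranks them, and computes the H statistic from the rank sums. The function name `kruskal_wallis_h` and the sample data are invented for this example, and the sketch assigns average ranks to ties but does not apply the usual tie correction factor.

```python
from itertools import chain

def kruskal_wallis_h(*samples):
    """Return the Kruskal-Wallis H statistic (average ranks, no tie correction)."""
    pooled = sorted(chain.from_iterable(samples))
    # Map each distinct value to its average rank in the pooled data (ranks start at 1).
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    total = sum(sum(ranks[x] for x in s) ** 2 / len(s) for s in samples)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical data: three independent samples of three observations each.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 3))
```

For larger samples, H is usually compared against a chi-square distribution with (number of groups - 1) degrees of freedom; with groups this small that approximation is only rough.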
Week #32 – False Discovery Rate
A “discovery” is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries for which the null hypothesis is, in reality, true (Type I errors). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).
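In a simulation, where the true state of each hypothesis is known by construction, the proportion of false discoveries in a batch of tests can be counted directly. The sketch below does that; the function name and both lists are hypothetical, and with real data the `null_is_true` column would not be available.

```python
def false_discovery_proportion(significant, null_is_true):
    """Share of significant results whose null hypothesis was actually true."""
    discoveries = [null for sig, null in zip(significant, null_is_true) if sig]
    if not discoveries:
        return 0.0  # convention: no discoveries means no false discoveries
    return sum(discoveries) / len(discoveries)

# Hypothetical outcome of 8 tests; the truth column is invented for illustration.
significant  = [True, True, True, True, False, False, True, False]
null_is_true = [False, True, False, False, True, True, True, False]

fdp = false_discovery_proportion(significant, null_is_true)
print(fdp)  # 2 false discoveries out of 5 discoveries
```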
Week #23 – Netflix Contest
The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available. Individuals and teams then compete to develop the best performing model.
Week #20 – R
This week’s word is actually a letter. R is a statistical computing and programming language and environment. It is an open-source implementation of the S language from Bell Labs, which was also the basis of the commercial S-PLUS program.
Be Smarter Than Your Devices: Learn About Big Data
When Apple CEO Tim Cook finally unveiled his company’s new Apple Watch in a widely publicized rollout earlier this month, most of the press coverage centered on its cost ($349 to start) and whether it would be as popular among consumers as the iPod or iMac. Nitin Indurkhya saw things differently. “I think the most significant…
Week #16 – Moving Average
In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
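A minimal sketch of this definition follows: the forecast for the next period is the mean of the most recent w observations. The function name and the `sales` series are made up for the example.

```python
def moving_average_forecast(series, w):
    """Forecast the next period as the mean of the last w observations."""
    if w < 1 or w > len(series):
        raise ValueError("w must be between 1 and len(series)")
    return sum(series[-w:]) / w

# Hypothetical series: observations for periods 1..5; we forecast period 6.
sales = [10, 12, 14, 16, 18]
forecast = moving_average_forecast(sales, 3)  # mean of 14, 16, 18
print(forecast)
```

A larger w smooths more heavily but reacts more slowly to recent changes in the series.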
Week #15 – Interaction term
In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.
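To make this concrete, the sketch below uses a linear model with two predictors and their product as the interaction term. The coefficient values are arbitrary, chosen only for illustration; the point is that the effect of a one-unit change in x1 depends on the level of x2.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=3.0, b12=0.5):
    """Linear model with an interaction term: b0 + b1*x1 + b2*x2 + b12*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

# Effect of raising x1 by one unit at two different levels of x2.
effect_at_x2_0 = predict(1, 0) - predict(0, 0)  # equals b1
effect_at_x2_4 = predict(1, 4) - predict(0, 4)  # equals b1 + b12 * 4
print(effect_at_x2_0, effect_at_x2_4)
```

Without the `b12` term, both differences would be equal to `b1` regardless of x2.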
Week #14 – Naive forecast
A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).
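One common example, sketched here as an assumption rather than the only form a naive forecast can take, is to carry the most recent observation forward unchanged. The function name and data are hypothetical.

```python
def naive_forecast(series, horizon=1):
    """Repeat the most recent observation for each future period."""
    if not series:
        raise ValueError("series must not be empty")
    return [series[-1]] * horizon

# Hypothetical daily temperatures; forecast the next three days.
temperatures = [21, 23, 22, 25]
print(naive_forecast(temperatures, horizon=3))
```

Forecasts like this are mostly useful as a baseline: a more elaborate model should at least beat them.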
Week #9 – Overdispersion
In discrete response models, overdispersion occurs when the data show more variability than the model's assumptions allow, often because observations are correlated rather than independent.
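For count data modeled as Poisson, where the variance is assumed to equal the mean, a quick informal check is the ratio of sample variance to sample mean. The sketch below computes that ratio; the function name and the counts are invented for illustration, and a ratio well above 1 is only a hint of overdispersion, not a formal test.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    return variance(counts) / mean(counts)

# Hypothetical counts that cluster near 0 and near 10.
counts = [0, 0, 1, 1, 2, 8, 9, 11]
index = dispersion_index(counts)
print(round(index, 2))
```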
Week #8 – Confusion matrix
In a classification model, the confusion matrix shows the counts of correct and erroneous classifications. In a binary classification problem, the matrix consists of four cells: true positives, false positives, true negatives, and false negatives.
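The cells of a binary confusion matrix can be tallied by counting (actual, predicted) pairs, as in the sketch below. The labels 1 and 0, the function name, and the data are assumptions made for this example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for binary labels 1 (positive) and 0 (negative)."""
    pairs = Counter(zip(actual, predicted))
    return {
        "TP": pairs[(1, 1)],  # actual positive, predicted positive
        "FN": pairs[(1, 0)],  # actual positive, predicted negative
        "FP": pairs[(0, 1)],  # actual negative, predicted positive
        "TN": pairs[(0, 0)],  # actual negative, predicted negative
    }

# Hypothetical labels for eight cases.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
cells = confusion_matrix(actual, predicted)
print(cells)
```

The correct classifications are the TP and TN counts; the erroneous ones are FP and FN.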
Week #5 – Features vs. Variables
The predictors in a predictive model go by different names in different disciplines. Traditional statisticians think in terms of variables.