Statistical Word of the Week

Apr 23, 2013

Week #17 - Bootstrapping

Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling. A common application of the bootstrap is to assess the accuracy of an estimate based on a sample of data from a larger population. Consider the sample mean. The best way to find out how lots of different sample means turn out is to actually draw them. Lacking the real universe to draw samples from, we need a proxy universe that embodies everything we know about the real universe, and which we can use to draw samples from.

One resampling technique is to replicate the sample data a huge number of times to create a proxy universe based entirely on our sample. After all, the sample itself usually embodies everything we know about the population that spawned it, so it's often the best starting point for creating an artificial proxy universe, from which we can draw resamples, and observe the distribution of the statistic of interest.

A shortcut is to simply sample with replacement from the original sample. By sampling with replacement, each sample observation has 1/n probability of being selected each time - just as if you were drawing without replacement from an infinitely large replicated universe. This technique is called the bootstrap.
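The two routes described above can be put side by side in a short Python sketch. It is only an illustration: the data values, the function names, and the choice of 1,000 copies for the replicated universe are all assumptions made for the example, not part of the original description.

```python
import random

def draw_from_replicated_universe(sample, size, copies=1000, rng=None):
    """Replicate the sample `copies` times, then draw `size` values
    WITHOUT replacement from that large proxy universe."""
    if rng is None:
        rng = random.Random()
    universe = list(sample) * copies
    return rng.sample(universe, size)

def draw_with_replacement(sample, size, rng=None):
    """The shortcut: draw `size` values WITH replacement from the sample itself.
    Each observation has probability 1/n of being picked on every draw."""
    if rng is None:
        rng = random.Random()
    return rng.choices(list(sample), k=size)

# Made-up data, for illustration only.
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9]
rng = random.Random(17)  # fixed seed so the run is repeatable

resample_a = draw_from_replicated_universe(sample, len(sample), rng=rng)
resample_b = draw_with_replacement(sample, len(sample), rng=rng)
print(resample_a)
print(resample_b)
```

With a finite number of copies the first route only approximates the 1/n selection probability; the approximation tightens as the number of copies grows, which is why drawing with replacement works as a shortcut.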

Drawing resamples with replacement from the observed data, we record the means found in a large number of resamples. Looking over this set of means, we can read off the values that bound the central 90% or 95% of the entries. That range is a bootstrap confidence interval.
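As a rough sketch of that procedure in Python: the code below records the mean of each resample and then reads off the values bounding the central 90% of the recorded means. The data, the function names, the number of resamples, and the simple index-based percentile rule are assumptions chosen for illustration; other percentile conventions would give slightly different endpoints.

```python
import math
import random
import statistics

def bootstrap_means(sample, n_resamples=10_000, seed=None):
    """Return the means of `n_resamples` resamples drawn with replacement."""
    rng = random.Random(seed)
    data = list(sample)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.90):
    """Values that bound the central `level` share of the sorted entries."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lower = ordered[math.floor(tail * last)]
    upper = ordered[math.ceil((1.0 - tail) * last)]
    return lower, upper

# Made-up data, for illustration only.
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9]

means = bootstrap_means(sample, n_resamples=10_000, seed=17)
low, high = percentile_interval(means, level=0.90)
print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"90% bootstrap interval: ({low:.2f}, {high:.2f})")
```

Passing `level=0.95` gives the wider 95% interval from the same set of recorded means.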

For comparison: The Classical Statistics World

In classical statistics, we still invoke the concept of the larger universe. However, rather than creating a proxy universe and actually drawing from it, classical statistics works from a mathematical description of this larger universe, based on information provided by the sample (typically mean and standard deviation), and assumptions of normality.
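For contrast, here is a minimal sketch of a classical interval for the mean, built from the sample mean, the sample standard deviation, and a normal quantile. It assumes a normal approximation throughout; for small samples a t-based interval is the more usual classical choice, and it would be somewhat wider. The data and the function name are made up for the example.

```python
import math
import statistics

def normal_theory_interval(sample, level=0.90):
    """Interval for the mean: mean +/- z * s / sqrt(n), assuming normality."""
    data = list(sample)
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.645 for 90%
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Made-up data, for illustration only.
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9]

low, high = normal_theory_interval(sample, level=0.90)
print(f"90% normal-theory interval: ({low:.2f}, {high:.2f})")
```

No resampling takes place here: the interval comes entirely from the mathematical description of the universe implied by the two summary statistics.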

It is important to note that both the resampling and classical approaches make inferences about the larger population starting from the same point - the observed sample. If the observed sample is way off base, both approaches are in trouble.

Reasons for using a bootstrap approach include the fact that it makes no assumption concerning the distribution of the data, and the fact that it can assess the variability of virtually any statistic.
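To illustrate the second point, the sketch below bootstraps a statistic other than the mean. It takes the statistic as a function argument, so the median (used here) could be swapped for a trimmed mean, a ratio, or anything else computable from a resample. The data, the function name, and the number of resamples are assumptions for the example.

```python
import random
import statistics

def bootstrap_statistic(sample, statistic, n_resamples=5000, seed=None):
    """Apply `statistic` to each of `n_resamples` resamples drawn with replacement."""
    rng = random.Random(seed)
    data = list(sample)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Made-up data, for illustration only.
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9]

medians = bootstrap_statistic(sample, statistics.median, seed=17)

# The spread of the recorded medians estimates the variability of the sample median.
median_se = statistics.stdev(medians)
print(f"sample median: {statistics.median(sample):.2f}")
print(f"bootstrap standard error of the median: {median_se:.2f}")
```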

The bootstrap procedure was suggested by Julian Simon, an economist, in a 1969 research methods text. Bradley Efron coined the term "bootstrap" in 1979 and, starting that year, developed and elaborated the method in the statistical literature.

Promoting better understanding of statistics throughout the world.

The Institute for Statistics Education offers an extensive glossary of statistical terms, available to all for reference and research. We will provide a statistical term every week, delivered directly to your inbox. To improve your own statistical knowledge, sign up here.
