Glossary

Bootstrapping

Bootstrapping is sampling with replacement from observed data to estimate the variability in a statistic of interest. See also permutation tests, a related form of resampling.

A common application of the bootstrap is to assess the accuracy of an estimate based on a sample of data from a larger population. Consider the sample mean. The most direct way to find out how much sample means vary from one sample to the next is to actually draw many samples and compute the mean of each.

Lacking the real universe to draw samples from, we need a proxy universe that embodies everything we know about the real universe, and from which we can draw samples.

One resampling technique is to replicate the sample data a huge number of times to create a proxy universe based entirely on our sample. After all, the sample itself usually embodies everything we know about the population that spawned it, so it's often the best starting point for creating an artificial proxy universe, from which we can draw resamples, and observe the distribution of the statistic of interest.

A shortcut is to simply sample with replacement from the original sample. By sampling with replacement, each of the n sample observations has probability 1/n of being selected on every draw, just as if you were drawing without replacement from an infinitely large replicated universe. This technique is called the bootstrap.
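The equivalence between the two ways of drawing can be sketched in a few lines of Python. This is only an illustration: the data values, the function names, and the choice of 1000 copies are assumptions made for the example, not part of the original entry.

```python
import random

def replicated_universe_draw(sample, copies, rng):
    """Draw len(sample) values WITHOUT replacement from a proxy universe
    built by replicating the sample `copies` times."""
    universe = list(sample) * copies
    return rng.sample(universe, len(sample))

def bootstrap_resample(sample, rng):
    """Draw len(sample) values WITH replacement from the sample itself."""
    return rng.choices(sample, k=len(sample))

rng = random.Random(0)                        # fixed seed so the sketch is repeatable
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9]   # made-up observations

proxy_draw = replicated_universe_draw(sample, 1000, rng)
resample = bootstrap_resample(sample, rng)

print(proxy_draw)
print(resample)
```

Both functions return a list the same size as the original sample, made up only of observed values, some of which may appear more than once. As the number of copies grows, the first approach behaves more and more like the second, which is why the shortcut is normally used.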

Drawing resamples with replacement from the observed data, we record the mean of each of a large number of resamples. Looking over this set of means, we can read off the values that bound the central 90% or 95% of the entries; that range is a bootstrap confidence interval.
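A minimal sketch of this percentile procedure follows, using only the Python standard library. The function name, the default of 10,000 resamples, the rounding rule for choosing the cut-off positions, and the data are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_percentile_interval(sample, stat=statistics.mean,
                                  n_resamples=10_000, level=0.90, seed=0):
    """Return (lower, upper) bounds that enclose the central `level`
    share of the bootstrap replicates of `stat`."""
    rng = random.Random(seed)
    n = len(sample)
    # Compute the statistic on each resample, then sort the results.
    replicates = sorted(stat(rng.choices(sample, k=n))
                        for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lo_idx = max(0, round(tail * n_resamples))
    hi_idx = min(n_resamples - 1, round((1.0 - tail) * n_resamples) - 1)
    return replicates[lo_idx], replicates[hi_idx]

sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.9, 12.6]  # made-up observations
lower, upper = bootstrap_percentile_interval(sample, level=0.90)
print(f"90% bootstrap interval for the mean: ({lower:.2f}, {upper:.2f})")
```

Changing `level` to 0.95 gives the wider 95% interval. Other ways of turning bootstrap replicates into an interval exist; this sketch shows only the simple "read off the percentiles" approach described above.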

For comparison: The Classical Statistics World

In classical statistics, we still invoke the concept of the larger universe. However, rather than creating a proxy universe and actually drawing from it, classical statistics works from a mathematical description of this larger universe, based on information provided by the sample (typically mean and standard deviation), and assumptions of normality.
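For contrast, a formula-based interval for the mean might be computed as in the sketch below. It assumes a normal model and uses a normal critical value for simplicity; with small samples a t critical value would usually be preferred. The function name and data are illustrative assumptions.

```python
import math
import statistics

def normal_theory_interval(sample, level=0.90):
    """Interval for the mean built from the sample mean and standard
    deviation under a normality assumption (normal critical value)."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)   # estimated standard error
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std_err, mean + z * std_err

sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.9, 12.6]  # made-up observations
lower, upper = normal_theory_interval(sample, level=0.90)
print(f"90% normal-theory interval for the mean: ({lower:.2f}, {upper:.2f})")
```

No resampling happens here: the only inputs are the sample size, mean, and standard deviation, plus the assumed shape of the distribution.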

It is important to note that both the resampling and classical approaches make inferences about the larger population starting from the same point – the observed sample. If the observed sample is way off base, both approaches are in trouble.

Reasons for using a bootstrap approach include the fact that it does not require assuming a particular distribution (such as the normal) for the data, and the fact that it can assess the variability of virtually any statistic.
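To illustrate the second point, the same resampling loop can estimate the variability of statistics other than the mean, such as the median or the range, simply by passing in a different function. The sketch below is an assumption-laden illustration: the function name, the skewed data, and the number of resamples are invented for the example.

```python
import random
import statistics

def bootstrap_std_error(sample, stat, n_resamples=5_000, seed=0):
    """Estimate the standard error of `stat` as the standard deviation
    of its values across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = [stat(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(replicates)

# Made-up, strongly skewed observations where normality would be doubtful.
skewed = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 4.8, 9.5, 22.0]

se_median = bootstrap_std_error(skewed, statistics.median)
se_range = bootstrap_std_error(skewed, lambda xs: max(xs) - min(xs))
print(f"bootstrap SE of the median: {se_median:.3f}")
print(f"bootstrap SE of the range:  {se_range:.3f}")
```

Nothing in the function depends on which statistic is supplied, which is the practical meaning of "virtually any statistic".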

The bootstrap procedure was first suggested by Julian Simon, an economist, in 1969. Bradley Efron coined the term “bootstrap” in 1979 and, from that year on, developed and elaborated the method in the statistical literature.
