THE NEW BIOSTATISTICS OF RESAMPLING
Julian L. Simon and Peter Bruce
INTRODUCTION
A large proportion of articles in important medical journals
nowadays employ probabilistic-statistical machinery; for example,
42 percent of original articles in a 1978-79 sample from the New
England Journal of Medicine did so (Emerson and Colditz, 1992).
These statistical devices are not understood well (if at all) by
many clinicians and researchers. Furthermore, such methods are
often not used correctly for the context in which they are em-
ployed. For example, a study of 50 articles in the New England
Journal of Medicine which used the t statistic to compare the
means of three or more groups found that more than half the uses
were not appropriate (Godfrey, 1992).
Even if the researcher has chosen a sound technique, the
reader is hard put to understand what the researcher has done.
Consider this recent not-atypical example in the American Journal
of Public Health:
The BMDP statistical package was used to determine
statistical significance by entering age, gender, and
smoking status (current smoker, former smoker, never
smoker) as variables in a polychotomous logistic
regression model and examining the P values associated
with the regression coefficients...[The] model showed
that age (P<.001) and gender (P=.02) were significantly
related to smoking status. (Hensrud and Sprafka, 1993,
p. 415).
The reason that these conventional techniques are daunting
to the reader, and often misused by researchers, is that they are
inherently deep and complicated mathematically. Even the rela-
tively-simple t test for the comparison of two sample means is
built upon a difficult body of formulae such as the Normal ap-
proximation, which contains unintuitive elements such as pi and e
(the base of natural logarithms). And even this relatively-
simple statistical test can only properly be used after consult-
ing a body of rules about when it is and is not applicable; the
process resembles cooking with a cookbook, and among students is
widely known as "pluginski". As the example above shows, the
reader is asked to take on faith that the test used is
appropriate and the program is correct; no details are given that
would allow even the sophisticated reader to make an informed
judgment.
In recent decades, an entirely different approach to statis-
tical testing has developed. At a theoretical level the resam-
pling method has taken the world of mathematical statistics by
storm. Resampling is at least as efficient as the formulaic
method in most situations. More important for purposes here,
resampling is transparently clear to both researcher and reader,
which reduces the likelihood that "Type 4 error" will occur -
that is, use of the wrong method; this intellectual advantage has
been shown in controlled experiments (Simon, Atkinson, and Shevo-
kas, 1976).
We shall first show the method in action, comparing resam-
pling procedures with the standard formulaic treatment of the
same data in a standard biostatistics text. After these specific
examples we provide a general procedure for the resampling meth-
od, and then end with discussion of the general properties of the
method.
INFARCTION AND CHOLESTEROL: RESAMPLING VERSUS CONVENTIONAL
Let's consider one of the simplest numerical examples of
probabilistic-statistical reasoning given toward the front of a
standard book on medical statistics (Kahn and Sempos, 1989).
Using data from the Framingham study, the authors ask: What is
an appropriate "confidence interval" on the observed ratio of
"relative risk" (a measure which is defined below, closely
related to the odds ratio) of the development of myocardial
infarction 16 years after the study began, for men ages 35-44
with serum cholesterol either above 250, or equal to or below
250? The raw data are shown in Table 1.
Table 1 (Kahn and Sempos Table 3-8, p. 61)
The reader of the text is provided with five pages of alge-
bra leading to a formula which is not only cumbersome to use but
also mathematically opaque except to a mathematical
statistician; moreover, it applies only if the risk is less than 10
percent and the "data set is large enough", a statistics journal
reference being provided in case the data set is not large
enough. Then the reader is given an "alternative method that is
very much easier to calculate", but which "we cannot explain in
terms of elementary statistics" (p. 62).
Rather than addressing the relative-risk problem immediate-
ly, let's work into it slowly, using the same data but breaking
the problem into parts to which we apply simpler procedures.
Hypothesis Tests With Measured Data
Consider this classic question about the Framingham serum
cholesterol data: What is the degree of surety that there is a
difference in myocardial infarction rates between the high- and
low-cholesterol groups?
The statistical logic begins by asking: How likely is it that
the two observed groups "really" came from the same "population"
with respect to infarction rates? Operationally, we address this
issue by asking how likely it is that two groups as different in
disease rates as the observed groups would be produced by the
same "statistical universe".
Key step: we assume that the relevant "benchmark" or "null-
hypothesis" population (universe) is the composite of the two
observed groups. That is, if there really were no "true" differ-
ence in infarction rates between the two serum-cholesterol
groups, and the observed disease differences occurred just be-
cause of sampling variation, the most reasonable representation
of the population from which they came is the composite of the
two observed groups.
Therefore, we compose a hypothetical "benchmark" universe
containing (135 + 470 =) 605 men at risk, and designate (10 + 21
=) 31 of them as infarction cases. We want to determine how
likely it is that a universe like this one would produce - just
by chance - two groups that differ as much as do the actually
observed groups. That is, how often would random sampling from
this universe produce one sub-sample of 135 containing a large
enough number of infarctions, and the other sub-sample of 470
producing few enough infarctions, that the difference in occur-
rence rates would be as high as the observed difference of .029?
(10/135 = .074, and 21/470 = .045).
So far, everything that has been said applies both to the
conventional formulaic method and to the "new statistics" resam-
pling method. But the logic is seldom explained to the reader of
a piece of research - if indeed the researcher her/himself grasps
what the formula is doing. And if one just grabs for a formula
with a prayer that it is the right one, one need never analyze
the statistical logic of the problem at hand.
Now we tackle this problem with a method that you would
think of yourself if you began with the following mind-set: How
can I simulate the mechanism whose operation I wish to under-
stand? These steps will do the job:
1. Fill an urn with 605 balls, 31 red and the rest (605 -
31 = 574) green.
2. Draw one sample of 135 (simulating the high serum-
cholesterol group), one ball at a time and throwing it back
after it is drawn to keep the simulated probability of an infarc-
tion the same throughout the sample; record the number of reds.
Then do the same with another sample of 470 (the low serum-
cholesterol group).
3. Calculate the difference in infarction rates for the two
simulated groups, and compare it to the actual difference
of .029; if the simulated difference is that large, record "Yes"
for this trial; if not, record "No".
4. Repeat steps 2 and 3 until a total of (say) 400 or 1000
trials have been completed. Compute the frequency with which the
simulated groups produce a difference as great as actually ob-
served. This frequency is an estimate of the probability that a
difference as great as that actually observed in Framingham would
occur even if serum cholesterol has no effect upon myocardial
infarction.
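The four steps above can be sketched as a short simulation. This is not the authors' RESAMPLING STATS program of Figure 1, only an illustrative translation of the same urn model into Python; the trial count and random seed are arbitrary choices.

```python
import random

def simulate_difference(n_trials=1000, seed=0):
    """Urn model: 605 men, 31 infarctions, sampled with replacement
    into groups of 135 and 470; estimate how often chance alone
    produces a rate difference as large as the observed .029."""
    rng = random.Random(seed)
    p = 31 / 605                        # infarction rate in the composite universe
    observed_diff = 10/135 - 21/470     # roughly .029
    count = 0
    for _ in range(n_trials):
        high = sum(rng.random() < p for _ in range(135))   # high-cholesterol group
        low = sum(rng.random() < p for _ in range(470))    # low-cholesterol group
        if high/135 - low/470 >= observed_diff:
            count += 1
    return count / n_trials
```

Run many times with different seeds, the estimate settles near the ten percent figure reported in the text.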
The procedure above can be carried out with balls in a
ceramic urn in a few hours. Yet it is natural to seek the added
convenience of the computer to draw the samples. Therefore, we
illustrate in Figure 1 how a simple computer program handles this
problem. We use our own RESAMPLING STATS, but it can be executed
in other languages as well, though usually with more complexity
and less clarity.
Figure 1
The results of the test using this program may be seen in
the histogram in Figure 1. We find - perhaps surprisingly - that
a difference as large as observed would occur by chance fully 10
percent of the time. (If we were not guided by the theoretical
expectation that high serum cholesterol produces heart disease,
we might include the 10 percent difference going in the other
direction, giving a 20 percent chance). Even a ten percent chance
is sufficient to strongly call into question the conclusion that
high serum cholesterol is dangerous. At a minimum, this statis-
tical result should call for more research before taking any
strong action clinically or otherwise.
Where should one look to determine which procedures should
be used to deal with a problem such as set forth above? Unlike
the formulaic approach, the basic source is not a manual which
sets forth a menu of formulas together with sets of rules about
when they are appropriate. Rather, you consult your own under-
standing about what it is that is happening in (say) the Framing-
ham situation, and the question that needs to be answered, and
then you construct a "model" that is as faithful to the facts as
is possible. The urn-sampling described above is such a model
for the case at hand.
To connect up what we have done with the conventional ap-
proach, we apply a z test (conceptually similar to the t test,
but applicable to yes-no data; it is the Normal-distribution
approximation to the large binomial distribution) and we find
that the results are much the same as the resampling result - an
eleven percent probability.
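For comparison, the conventional z test just mentioned can be written out explicitly. This is a minimal sketch of the standard pooled-proportion z test, not the text's exact computation; the precise p-value depends on details such as continuity corrections.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_proportions(x1, n1, x2, n2):
    """One-tailed z test for a difference of two proportions, using
    the pooled proportion under the null hypothesis."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                        # 31/605 here
    se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))      # null standard error
    z = (p1 - p2) / se
    return z, 1 - NormalDist().cdf(z)                     # one-tailed p-value

z, p = z_test_two_proportions(10, 135, 21, 470)
```

The resulting probability is close to the resampling estimate, as the text reports.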
Someone may ask: Why do a resampling test when you can use
a standard device like a z or t test? The great advantage of
resampling is that it avoids "Type 4 error" - using the wrong
method. The researcher is more likely to arrive at sound
conclusions with resampling because s/he can understand what s/he
is doing, instead of blindly grabbing a formula which may be in
error.
The textbook drawn from here is an excellent one; the diffi-
culty of the presentation is an inescapable consequence of the
formulaic approach to probability and statistics. The body of
complex algebra and tables that only a rare expert understands
down to the foundations constitutes an impenetrable wall to
understanding. Yet without such understanding, there can be only
rote practice, which leads to frustration and error.
Confidence Intervals for the Counted Data
Consider for now just the data for the sub-group of 135
high-cholesterol men. A second classic statistical question is
as follows: How much confidence should we have that if we were to
take a much larger sample than was actually obtained, the mean
(actually the proportion 10/135 = .07) would be in some vicinity
of the observed sample mean? Let us first carry out a resampling
procedure to answer the question, waiting until afterwards to
discuss the logic of the inference.
1. Construct an urn containing 135 balls - 10 black (in-
farction) and 125 red (no infarction) to simulate the universe as
we guess it to be.
2. Mix, choose a ball, record its color, replace it, and
repeat 135 times (to simulate a sample of 135 men).
3. Record the number of black balls among the 135 drawings.
4. Repeat steps 2 and 3 perhaps 1000 times, and observe how
much the number of blacks varies from sample to sample. We arbi-
trarily denote the boundary lines that include 45 percent of the
hypothetical samples on each side of the sample mean as the 90
percent "confidence intervals" around the mean of the actual
population.
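The four steps above translate directly into a short simulation. Again this is only an illustrative Python sketch of the urn procedure, not the Figure 2 program; the 1000-trial count and the seed are arbitrary.

```python
import random

def resample_counts(n=135, infarctions=10, trials=1000, seed=0):
    """Draw `trials` resamples of size n, with replacement, from an
    urn of 10 black (infarction) and 125 red balls; return the
    boundaries that enclose the middle 90 percent of the counts."""
    rng = random.Random(seed)
    p = infarctions / n                                   # 10/135, about .074
    counts = sorted(sum(rng.random() < p for _ in range(n))
                    for _ in range(trials))
    lo = counts[int(0.05 * trials)]                       # 5th percentile
    hi = counts[int(0.95 * trials)]                       # 95th percentile
    return lo, hi
```

The wide spread of the simulated counts is the point: with only 10 observed cases, the count varies greatly from sample to sample.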
Figure 2 shows how this can be done easily on the computer,
together with the results.
Figure 2
The variation in the histogram in Figure 2 highlights the
fact that a sample containing only 10 cases of infarction is very
small, and the number of observed cases - or the proportion of
cases - necessarily varies greatly from sample to sample.
Perhaps the most important implication of this statistical
analysis, then, is that we badly need to collect additional data.
This is a classic problem in confidence intervals, found in
all subject fields. For example, at the beginning of the first
chapter of a best-selling book in business statistics, Wonnacott
and Wonnacott use the example of a 1988 presidential poll. The
language used in the cholesterol-infarction example above is
exactly the same as the language used for the Bush-Dukakis poll
except for labels and numbers.
Also typically, the text gives a formula without explaining
it, and says that it is "fully derived" eight chapters later
(Wonnacott and Wonnacott, 1990, p. 5). With resampling, one
never needs such a formula, and never needs to defer the
explanation.
The philosophic logic of confidence intervals is quite deep
and controversial, less obvious than for the hypothesis test.
The key idea is that we can estimate for any given universe the
probability P that a sample's mean will fall within any given
distance D of the universe's mean; we then turn this around and
assume that if we know the sample mean, the probability is P that
the universe mean is within distance D of it. This inversion is
more slippery than it may seem. But the logic is exactly the
same for the formulaic method and for resampling. The only
difference is how one estimates the probabilities - either with a
numerical resampling simulation, or with a formula or other
deductive mathematical device (such as counting and partitioning
all the possibilities, as Galileo did when he answered a gam-
bler's question about three dice.) And when one uses the resam-
pling method, the probabilistic calculations are the least de-
manding part of the work. One then has mental capacity available
to focus on the crucial part of the job - framing the original
question soundly, choosing a way to model the facts so as to
properly resample the actual situation, and drawing appropriate
inferences from the simulation.
If you have understood the general logic of the procedures
used up until this point, you are in command of all the necessary
conceptual knowledge to construct your own tests to answer any
statistical question. A lot more practice, working on a variety
of problems, obviously would help. But the key elements are
simple: 1) Model the real situation accurately, 2) experiment
with the model, and 3) compare the results of the model with the
observed results.
Confidence Intervals on Relative Risk With Resampling
Now we are ready to calculate - with full understanding -
the confidence intervals on relative risk that the text sought.
Recall that the observed sample of 135 high cholesterol men had
10 infarctions (a proportion of .074), and the sample of 470 low
cholesterol men had 21 infarctions (a proportion of .045). We
estimate the relative risk of high cholesterol as .074/.045. Let
us frame the question this way: If we were to randomly draw a
sample from the universe of high-cholesterol men that is best
estimated from our data (a .074 infarction rate), and a sample
from the universe of low-cholesterol men (a .045 infarction
rate), and do this again and again, within which bounds would
the relative risk calculated from that simulation fall (say) 95
percent of the time?
The operation is quite the same as that for a single confi-
dence interval estimated above except that we do the operation
for both sub-samples at once, and then calculate the ratio bet-
ween their results. As before, we would like to know what would
happen if we could take additional samples from the universes
that spawned our actual samples. Lacking the resources to do so,
we let those original samples "stand in" for the universes from
which they came, serving as proxy "substitute universes." We can
imagine replicating each sample element millions of times to
"bootstrap" these "proxy universes." Paralleling the real world,
we take simulated samples of the same size as our original sam-
ples. (Actually, we can skip replicating each sample element a
million times and achieve the same resampling effect by sampling
with replacement from our original samples -- that way, the
chance that a sample element will be drawn will remain the same
from draw to draw.) We count the number of infarctions in each
of our resamples, and for the pair of resamples, we calculate the
relative risk measure and keep score of this result. We then
take additional pairs of resamples, each time calculating the
relative risk measure.
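The paired bootstrap just described can be sketched as follows. This is an illustrative Python version of the procedure, not the Figure 3 program; trial count and seed are arbitrary choices.

```python
import random

def bootstrap_relative_risk(trials=1000, seed=0):
    """Bootstrap confidence interval for relative risk: resample with
    replacement from each observed group, form the ratio of infarction
    rates each time, and take the middle 95 percent of the ratios."""
    rng = random.Random(seed)
    high = [1]*10 + [0]*125        # high-cholesterol sample (10/135 infarctions)
    low = [1]*21 + [0]*449         # low-cholesterol sample (21/470 infarctions)
    ratios = []
    for _ in range(trials):
        h = sum(rng.choice(high) for _ in range(135)) / 135
        l = sum(rng.choice(low) for _ in range(470)) / 470
        if l > 0:                  # avoid division by zero (vanishingly rare)
            ratios.append(h / l)
    ratios.sort()
    return ratios[int(0.025 * len(ratios))], ratios[int(0.975 * len(ratios))]
```

The resulting interval agrees closely with the formulaic intervals quoted from Kahn and Sempos.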
We may compare our results in Figure 3 - a confidence
interval extending from 0.69 to 3.4 - to the results given in
Kahn and Sempos, which are 0.79 to 3.5, 0.80 to 3.4, and 0.79 to
3.7 from three different formulas (pp. 62-63); the agreement is
close.
Figure 3
It is interesting that this may be the first time a calcula-
tion of relative risk using resampling has ever been published.
And it therefore should be a contribution to the statistics
literature comparable with the formulaic approaches published in
earlier years. But because the procedure is worked out here on
an ad hoc basis, and does not seem to be very difficult, it
probably is not worth publishing separately. We point this out
because resampling routinely produces entirely new procedures at
least as powerful as the previously-existing formulaic proce-
dures. These resampling procedures also have the advantage of
being fully understood even by persons who are not professional
statisticians but who think hard about their subject matter, and
then create appropriate procedures by working from first princi-
ples and modeling their actual research situations with care and
understanding. Even underclasspersons in a state university are
able to do this; one would expect persons in medical school or
beyond it to be at least equally capable. That is the true revo-
lution wrought by resampling.
SOME OTHER ILLUSTRATIONS
A Measured-Data Example: Test of a Drug to Prevent Low Birthweight
The Framingham infarction-cholesterol examples worked with
yes-no "count" data. Let us therefore consider some
illustrations of the use of resampling with measured data.
Another leading textbook (Rosner, 1982, p. 257) gives the
example of a test of the hypothesis that drug A prevents low
birthweights. The data for the treatment and control groups are
shown in Table 2. Here is a resampling approach to the
problem:
Table 2
1. If the drug has no effect, our best guess about the
"universe" of birthweights is that it is composed of (say) a
million each of the observed weights, lumped together. In other
words, in the absence of any other information or compelling
theory, we assume that the combination of our samples is our best
estimate of the universe. Hence write each of the birthweights
on a card, and put them into a hat. Drawing them one by one and
then replacing them is the operational equivalent of a very large
(but equal) number of each birthweight.
2. Repeatedly draw two samples of 15 each, and check how
frequently the observed difference is as large or larger than the
actual difference.
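The two steps above can be sketched in code. Since the Table 2 birthweights are not reproduced here, the function below is generic and the data in the usage test are hypothetical placeholders, not Rosner's values; following step 1, the null universe is the combined data sampled with replacement.

```python
import random

def two_sample_test(treatment, control, trials=1000, seed=0):
    """Resampling test of a difference in means: draw, with
    replacement, from the combined data (the 'million copies of each
    card' universe) and redeal into two groups of the original
    sizes, counting how often the simulated difference matches or
    exceeds the observed one."""
    rng = random.Random(seed)
    observed = sum(treatment)/len(treatment) - sum(control)/len(control)
    pooled = treatment + control
    count = 0
    for _ in range(trials):
        redealt = [rng.choice(pooled) for _ in range(len(pooled))]
        t = redealt[:len(treatment)]
        c = redealt[len(treatment):]
        if sum(t)/len(t) - sum(c)/len(c) >= observed:
            count += 1
    return count / trials
```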
We find in Figure 4 that only 1% of the pairs of
hypothetical resamples produced means that differed by as much
as .82. We therefore conclude that the observed difference is
unlikely to have occurred by chance.
Figure 4
Matched-Patients Test of Three Treatments
There have been several recent three-way tests of treatments
for depression: drug versus cognitive therapy versus combined
drug and cognitive therapy. Consider this procedure for a
proposed test in which 31 triplets of people have been matched
within each triplet by sex, age, and years of education. The three treatments
are to be chosen randomly within each triplet. Assume that the
outcomes on a series of tests were ranked from best (#1) to worst
(#3) within each triplet, and assume that the combined drug-and-
therapy regime has the highest average rank. How sure can we be
that the observed result would not occur by chance?
In hypothetical Table 3 the average rank for the drug and
therapy regime is 1.74. Is it possible that the regimes do not
differ with respect to effectiveness, and that the drug and
therapy regime came out with the best rank just by the luck of
the draw? We test by asking "If there is no difference, what is
the probability of getting an average rank this good, just by
chance?"
Table 3
Figure 5 shows a program for a resampling procedure that
repeatedly produces 31 sets of ranks randomly selected among the
numbers 1, 2 and 3, and averages the ranks for each treatment.
We can then observe whether an average of 1.74 is unusually low,
and hence should not be ascribed to chance.
Figure 5
In 1000 repetitions of the simulation, 5% yielded average
ranks as low as the observed value. This is evidence that
something besides chance might be at work here. (The result is
at the borderline of the traditional 5% "level of significance"
(a p-value of .05), supposedly set arbitrarily by the great
statistician R.A. Fisher on the grounds that a 1-in-20 happening
is too coincidental to ignore.) That is, the resampling test
suggests that it would be very unlikely for one of the treatment
regimes to achieve, just by chance, results as much better than
the other two regimes as are actually observed.
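The simulation described above can be sketched in a few lines. This is an illustrative Python version, not the Figure 5 program; it assumes the observed average rank of 1.74 corresponds to a total rank of 54 (54/31 is approximately 1.74), and the trial count and seed are arbitrary.

```python
import random

def rank_simulation(triplets=31, trials=1000, seed=0):
    """Within each triplet, assign the ranks 1, 2, 3 at random (a
    shuffle, i.e. sampling without replacement); count how often one
    pre-chosen treatment's total rank is as low as the observed
    total of 54 (average rank about 1.74)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        total = 0
        for _ in range(triplets):
            ranks = [1, 2, 3]
            rng.shuffle(ranks)
            total += ranks[0]    # rank landing on the drug-plus-therapy arm
        if total <= 54:          # as good as the observed average of ~1.74
            count += 1
    return count / trials
```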
An interesting feature of this problem is that it would be
hard to find a conventional test that would handle this three-way
comparison in an efficient manner. Certainly it would be impossi-
ble to find a test that would not require formulae and tables
that only a talented professional statistician could manage
satisfactorily, and even the professional is not likely to fully
understand those formulaic procedures.
A DEFINITION AND GENERAL PROCEDURE FOR RESAMPLING
A statistical procedure manipulates some replica of the
physical process in which you are interested. A resampling
method simulates (models) the process with easy-to-handle sym-
bols. The resampler postulates a universe composed of the ob-
served data, which are then used to produce new hypothetical
samples whose properties are then examined. That is, one exam-
ines how the universe behaves, comparing the outcomes to a crite-
rion that we choose.
Here is an "operational definition" of resampling: Using
the entire set of data you have in hand, produce new samples of
simulated data, and examine the results of those samples. That's
it in a nutshell.
VARIETIES OF RESAMPLING METHODS
A resampling test may be constructed for almost any statis-
tical inference. Every real-life situation can be modeled by
symbols of some sort, and one may experiment with this model to
obtain resampling trials. The most important contraindication
is insufficient data to perform a useful resampling test, in
which case a conventional test - which makes up for the absence
of observations with an assumed theoretical distribution - may
produce more accurate results if the universe from which the data
are selected resembles the chosen theoretical distribution.
Exploration of the properties of resampling tests is an active
field of research at present.
For the main tasks in statistical inference - hypothesis
testing and confidence intervals - the appropriate resampling
test often is immediately obvious, as seen in the case of choles-
terol and infarction rates above.
(Technical note to biostatisticians: Two sorts of procedures
are especially well-suited to resampling: 1) When the size of
the universe is properly assumed fixed, or for other reasons
sampling without replacement is called for, it is appropriate to
sample from among the possible permutations of the data; this is
an adaptation of Ronald Fisher's "exact" test (confusingly, also
called a "randomization" test). The three-way drug test above is
an illustration; the rank of one member of a triplet affects the
possible ranks of the other two members, and hence the sampling
is done "without replacement". 2) The bootstrap procedure is
appropriate when the size of the universe is properly assumed not
to be fixed in size, and the measurement of one entity in the
sample does not affect the measurement of another entity. This
device - for which there is no analog in conventional formulaic
statistics - is illustrated by the birthweight test above.)
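The distinction drawn in the technical note can be made concrete in a few lines; this sketch, with a tiny made-up sample, shows the two sampling schemes side by side.

```python
import random

rng = random.Random(0)
data = [1, 2, 3, 4, 5]               # a tiny illustrative sample

# Permutation-style resample (sampling WITHOUT replacement): every
# element appears exactly once; only the arrangement changes. This is
# the scheme of the three-way drug test.
perm = rng.sample(data, len(data))

# Bootstrap-style resample (sampling WITH replacement): elements may
# repeat or drop out, simulating an unlimited "proxy universe". This
# is the scheme of the birthweight test.
boot = [rng.choice(data) for _ in data]
```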
Resampling is a much simpler intellectual task than the
formulaic method, because simulation obviates the need to calcu-
late the number of possible ways that the event in which you are
interested - an infarction, say, or a birth of a certain size -
can or cannot occur. In technical terms, resampling does not
require computation of the "sample space" or any part of it. In
all but the most elementary problems where simple permutations
and combinations suffice, such calculations require advanced
training and delicate judgment; these calculations are the root
of the mathematical and conceptual difficulty of conventional
formulaic statistics.
Resampling avoids the complex abstraction of sample-space
calculations by substituting the particular information about how
elements in the sample are generated randomly in a specific
event, as learned from the actual circumstances; the analytic
method does not use this information. In the case of the gam-
blers prior to Galileo, resampling used the (assumed) fact that
three fair dice are thrown with an equal chance of any outcome,
and the gamblers took advantage of experience with many such
events performed one at a time; in contrast, Galileo made no use of the
actual stochastic element of the situation, and gained no infor-
mation from a sample of such trials, but rather replaced all
possible sequences by exhaustive computation.
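The contrast can be made concrete with the classic question usually attributed to Galileo's gamblers: why a sum of 10 on three dice appears more often than a sum of 9. A minimal sketch of both routes:

```python
import random
from itertools import product

# Galileo's route: exhaustively enumerate all 6^3 = 216 ordered
# outcomes of three dice and count the partitions.
outcomes = list(product(range(1, 7), repeat=3))
exact_10 = sum(sum(d) == 10 for d in outcomes) / len(outcomes)   # 27/216
exact_9 = sum(sum(d) == 9 for d in outcomes) / len(outcomes)     # 25/216

# The resampling route: simply throw three simulated dice many times,
# one trial at a time, and keep score -- no enumeration required.
rng = random.Random(0)
trials = 10000
sim_10 = sum(sum(rng.randint(1, 6) for _ in range(3)) == 10
             for _ in range(trials)) / trials
```

Both routes reach the same answer; the simulation requires no computation of the sample space.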
The resampling method is not theoretically inferior to the
formulaic method. Resampling is not "just" a stochastic-
simulation approximation to formulas. It is a quite different
route to the same endpoint, using different intellectual
processes and utilizing different sorts of inputs; both resam-
pling and formulaic calculation are shortcuts to estimation of
the sample space and its partitions. Resampling's much lesser
intellectual difficulty is its central advantage: it improves
the probability that the user will
arrive at a sound solution to a problem - the ultimate criterion
for all except for pure mathematicians.
The applicability of resampling is especially great in
biostatistics because of the small and irregular samples so
common in clinical research.
THE PLACE OF RESAMPLING IN THE REALM OF KNOWLEDGE
Probability theory and its offspring, inferential
statistics, constitute perhaps the most frustrating branch of
human knowledge.
Right from its beginnings in the seventeenth century,
the great mathematical discoverers knew that the probabilistic
way of thinking -- which we'll call "prob-stats" for short --
offers enormous power to improve our decisions and the quality of
our lives. Yet until very recently, when the resampling method
came along, scholars were unable to convert this powerful body of
theory into a tool that laypersons could and would use freely in
daily work and personal life. Instead, only professional
statisticians feel themselves in comfortable command of the prob-
stats way of thinking. The most frequent applications are by
medical and social scientists, who know that prob-stats is indis-
pensable to their work yet too often fear and misuse it.
Resampling is now fully accepted theoretically. The
publication of advanced papers exploring its properties is
proceeding at a breathtaking rate throughout the world. And
controlled studies show that people ranging from engineers and
scientists down to seventh graders quickly handle more problems
correctly than they do with conventional methods. Furthermore,
in contrast to the older conventional statistics, which is a
painful and humiliating experience for most students at all
levels, the published studies show that students enjoy resampling
statistics. But resampling has not yet penetrated very far
into the classroom, for a variety of institutional and historical
reasons.
Resampling in Medical Education
Prob-stats is the bane of medical students as well as all
other students required to study it; the statistics course is a
painful rite of passage -- like fraternity paddling -- on the way
to a degree. Afterwards, the subject is happily put out of mind
forever.
Yet the practice of medicine becomes more and more dependent
upon a knowledge of statistics. Physicians like to say that they
practice on the basis of "clinical knowledge". Yet in an ever-
growing proportion of situations, choice of treatment comes
straight from research studies whose conclusions depend on sta-
tistical tests. Without a sound understanding of inference, a
physician cannot evaluate such studies and sort out which to rely
upon.
Teaching physicians statistics has been an impossible nut to
crack. As one statistician wrote about her attempt to teach
medical students conventional statistical methods: "I gazed into
the sea of glazed eyes and forlorn faces, shocked by the looks of
naked fear my appearance at the lectern prompted" (Vaisrub,
1990).
Students of probability and statistics simply memorize the
rules. Most users of prob-stats select their methods blindly,
understanding little or nothing of the basis for choosing one
method rather than another, and simply push the buttons for one
or another easily available computer operation. This often leads
to wildly inappropriate practices, and contributes to the
damnation of statistics.
The statistical community has made valiant attempts to
ameliorate the situation. Great statisticians have struggled to
find interesting and understandable ways to teach prob-stats.
Learned committees and professional associations have wrung their
hands in despair, and spent millions of dollars creating televi-
sion series and text books. Despite successes, these campaigns
to promote prob-stats have largely failed. The enterprise smash-
es up against an impenetrable wall - the body of complex algebra
and tables that only a rare expert understands right down to the
foundations. For example, almost no one can write the formula
for the "Normal" distribution that is at the heart of most sta-
tistical tests. Even fewer understand its meaning. Yet without
such understanding, there can be only rote learning.
The resampling method, in combination with the personal
computer, promises to cure this disease, and finally realize the
great potential of statistics and probability.
In the absence of formulae, black-box computer programs, and
cryptic tables, the resampling approach forces you to directly
address the problem at hand. Then, instead of asking "Which
formula should I use?" one begins to ask more profound questions
such as "Why is something `significant' if it occurs 4% of the
time by chance, yet not `significant' if a random process pro-
duces it 8% of the time?"
About "Exactness"
Earlier we suggested that the likelihood of arriving at a
sound answer with a valid method, rather than using an incorrect
method, is more important scientifically than any likely
inexactness from the resampling simulation method. But even that
concedes too much: The formulaic method itself is in no way
perfectly exact; rather, it rests on approximations. The Normal
distribution itself is only an approximation to the binomial.
And often there are approximations in computing formulas.
[There also is a certain irony in the common objection that
resampling is not "exact" because the results are "only" a sam-
ple. The basis of all statistical work is sample data drawn from
actual populations. Statisticians have only recently managed to
win battles against those bureaucrats and social scientists who,
out of ignorance of statistics, believed that only a complete
census of a country's population, or examination of every volume
in a library, could give satisfactory information about unemploy-
ment rates or book sizes. Indeed, samples are sometimes even
more accurate than censuses. Yet many of those same statisti-
cians have been skittish about simulated samples of data points
taken from the sample space - drawn far more randomly than the
data themselves, even at best. They tend to want a complete
"census" of the sample space, even when sampling is more likely
to arrive at a correct answer because it is intellectually simpler
(as with the gamblers and Galileo).]
CONCLUSION
Probabilistic analysis is crucial in medicine, perhaps more
so than in any other discipline. Judgments about whether to use
one treatment or another, or to allow a new medicine on the
market, require that the decision-maker assess chance variability
in the data. But until now, the practice and teaching of proba-
bilistic statistics, with its abstruse structure of mathematical
formulas cum tables of values based on restrictive assumptions
concerning data distributions -- all of which separate the user
from the actual data or physical process under consideration --
have kept the full fruits of statistical understanding from the
medical community.
Estimating probabilities with conventional mathematical
methods is often so complex that the process scares many people.
And properly so, because the difficulties lead to frequent
errors. The statistical profession has long expressed grave
concern about the widespread use of conventional tests whose
foundations are poorly understood. The recent ready availability
of statistical computer packages that can easily perform
conventional tests with a single command, irrespective of whether
the user understands what is going on or whether the test is
appropriate, has exacerbated this problem. This has led teachers
to emphasize descriptive statistics and even ignore inferential
statistics.
Beneath every formal statistical procedure there lies a
physical process. Resampling methods allow one to work directly
with the underlying physical model by simulating it. The term
"resampling" refers to the use of the given data, or a data
generating mechanism such as a die, to produce new samples, the
results of which can then be examined. Resampling estimates
probabilities by numerical experiments instead of with formulae
-- by flipping coins or picking numbers from a hat, or with the
same operations simulated on a computer.
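As an illustration of estimating a probability by numerical experiment
rather than by formula, here is a minimal sketch in Python (ours, not
part of the article, which uses RESAMPLING STATS): it estimates the
chance of getting eight or more heads in ten coin flips.

```python
import random

random.seed(1)  # fixed seed, only so this illustration is repeatable

trials = 10000
count = 0
for _ in range(trials):
    heads = sum(random.randint(0, 1) for _ in range(10))  # flip ten coins
    if heads >= 8:
        count += 1

estimate = count / trials
print(estimate)  # near the exact binomial value, 56/1024 = 0.0547
```

No normal approximation, pi, or e is needed; the simulation simply
repeats the physical process many times and counts.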
The resampling method enables people to obtain the benefits
of statistics and probability theory without the shortcomings of
conventional methods, because it is free of mathematical formulas
and restrictive assumptions and is easy to understand and use,
especially in conjunction with the computer language and program
RESAMPLING STATS.
It is the overall approach - the propensity to turn first to
resampling methods to handle practical problems - that most
clearly distinguishes resampling from conventional statistics.
In addition, some resampling methods are new in themselves, the
result of the basic resample-it tendency of the past quarter
century.
Resampling replaces the complex mathematical calculations
about the size of the sample space and its parts by simulating
the conditions that produce the individual events; the informa-
tion about these concrete conditions is not used by the formulaic
method. This very different intellectual method is the source of
its clarity and simplicity.
REFERENCES
Edgington, Eugene S., Randomization Tests, Marcel Dekker, N.
Y., 1980
Efron, Bradley, and Diaconis, Persi; "Computer Intensive
Methods in Statistics," Scientific American, May, 1983, pp. 116-
130.
Emerson, John D., and Graham A. Colditz, "Use of Statistical
Analysis in the New England Journal of Medicine", in John C.
Bailar III and Frederick Mosteller, Medical Uses of Statistics
(Boston: NEJM Books, 1992), pp. 45-57.
Godfrey, Katherine, "Comparing the Means of Several Groups",
in John C. Bailar III and Frederick Mosteller, Medical Uses of
Statistics (Boston: NEJM Books, 1992), pp. 233-258.
Hensrud, Donald D., and J. Michael Sprafka, "The Smoking
Habits of Minnesota Physicians", American Journal of Public
Health, vol 83, March, 1993, 415-417.
Kahn, Harold A., and Christopher T. Sempos, Statistical
Methods in Epidemiology (New York: Oxford, 1989)
Noreen, Eric W., Computer Intensive Methods for Testing
Hypotheses, (New York: Wiley, 1989)
Rosner, Bernard, Fundamentals of Biostatistics, (Boston:
Duxbury, 1982)
Simon, Julian L., Basic Research Methods in Social Science
(New York: Random House, 1969; 3rd edition, 1985, with Paul
Burstein)
Simon, Julian L., Atkinson, David T., and Shevokas, Carolyn,
"Probability and Statistics: Experimental Results of a Radically
Different Teaching Method," American Mathematical Monthly, v.
83, No. 9, Nov. 1976
Simon, Julian L., and Bruce, Peter C., "Resampling: Everyday
Statistical Tool," Chance, v. 4, #1, 1991
Simon, Julian L., Resampling: Probability and Statistics a
Radically Different Way (Belmont, CA: Wadsworth, forthcoming
1993).
Vaisrub, Naomie, Chance, Winter, 1990, p. 53
Wonnacott, Thomas H. and Ronald J. Wonnacott, Introductory
Statistics for Business and Economics 4th edition (New York:
Wiley, 1990).
URN 31#1 574#2 men An urn called "men" with 31 ones
(=infarctions) and 574 twos
(=no infarction)
REPEAT 1000                    Do 1000 trials
SAMPLE 135 men high Sample (with replacement!) 135
of the numbers in this urn, give
this group the name "high"
SAMPLE 470 men low Same for a group of 470, call
it low
COUNT high =1 a Count infarctions in first group
DIVIDE a 135 aa Express as a proportion
COUNT low =1 b Count infarctions in second
group
DIVIDE b 470 bb Express as a proportion
SUBTRACT aa bb c Find the difference in
infarction rates
SCORE c z Keep score of this difference
END
HISTOGRAM z
COUNT z >=.029 k How often was the resampled
difference >= the observed
difference?
DIVIDE k 1000 kk Convert this result to a
proportion
PRINT kk
[Histogram: frequency of the 1,000 resampled differences between
groups (x-axis: difference in proportion with infarction, -0.1 to
0.1), roughly bell-shaped and centered near 0]
kk = 0.102 (the proportion of resample pairs
with a difference >= .029)
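The RESAMPLING STATS program above can be mirrored in ordinary Python.
This sketch is ours, not the article's; it codes infarction as 1 and
no infarction as 0 (rather than 1 and 2), which lets the counts be
summed directly.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

urn = [1] * 31 + [0] * 574           # 31 infarctions among all 605 men, pooled
trials = 1000
count = 0
for _ in range(trials):
    high = [random.choice(urn) for _ in range(135)]   # resampled "high" group
    low = [random.choice(urn) for _ in range(470)]    # resampled "low" group
    diff = sum(high) / 135 - sum(low) / 470           # difference in rates
    if diff >= 0.029:                # the observed difference
        count += 1

p = count / trials
print(p)
```

The printed proportion should land in the neighborhood of the
article's kk = 0.102, varying somewhat from run to run.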
URN 10#1 125#0 men An urn (called "men") with
ten 1's (infarctions)
and 125 0's (no infarction)
REPEAT 1000 Do 1000 trials
SAMPLE 135 men a Sample (with replacement) 135
numbers from the urn, put them in
"a"
COUNT a =1 b Count the infarctions
DIVIDE b 135 c Express as a proportion
SCORE c z Keep score of the result
END End the trial, go back and repeat
HISTOGRAM z Produce a histogram of all trial
results
PERCENTILE z (2.5 97.5) k Determine the 2.5th and 97.5th
percentiles of all trial results;
these points enclose 95% of the
results
PRINT k
[Histogram: frequency of the 1,000 resampled proportions with
infarction (x-axis: 0 to 0.2), centered near the observed proportion
of 10/135 = 0.074]
k = 0.037037 0.11852
(This is the 95% confidence interval, enclosing 95% of the resam-
ple results)
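The same bootstrap confidence interval can be sketched in Python
(again our code, not the article's, with infarction coded as 1):

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

urn = [1] * 10 + [0] * 125           # 10 infarctions among 135 high-cholesterol men
props = []
for _ in range(1000):
    resample = [random.choice(urn) for _ in range(135)]
    props.append(sum(resample) / 135)

props.sort()
lo, hi = props[24], props[974]       # 2.5th and 97.5th percentiles of 1000 results
print(lo, hi)
```

The two printed values bracket the observed proportion of 0.074 and
should fall near the article's interval of (0.037, 0.119).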
URN 10#1 125#0 high The universe of 135 high cholesterol
men, 10 of whom (1's) have infarctions
URN 21#1 449#0 low The universe of 470 low cholesterol
men, 21 of whom (1's) have infarctions
REPEAT 1000 Repeat the steps that follow 1000
times
SAMPLE 135 high high$ Sample 135 (with replacement) from
the high cholesterol universe, and
put them in "high$" [the "$"
suffix just indicates a resampled
counterpart to the actual sample]
SAMPLE 470 low low$ Similarly for 470 from
the low cholesterol universe
COUNT high$ =1 a Count the infarctions in the first
resampled group
DIVIDE a 135 aa Convert to a proportion
COUNT low$ =1 b Count the infarctions in the second
resampled group
DIVIDE b 470 bb Convert to a proportion
DIVIDE aa bb c Divide the proportions to calculate
relative risk
SCORE c z Keep score of this result
END End the trial, go back and repeat
HISTOGRAM z Produce a histogram of trial results
PERCENTILE z (2.5 97.5) k Find the percentiles that
bound 95% of the trial results
PRINT k
[Histogram: frequency of the 1,000 resampled relative risks (x-axis:
0 to 6), right-skewed and centered near the observed relative risk of
about 1.7]
Results (estimated 95% confidence interval):
k = 0.68507 3.3944
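A Python sketch of the relative-risk bootstrap (ours, not the
article's code):

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

high = [1] * 10 + [0] * 125          # 135 high-cholesterol men, 10 infarctions
low = [1] * 21 + [0] * 449           # 470 low-cholesterol men, 21 infarctions
ratios = []
for _ in range(1000):
    a = sum(random.choice(high) for _ in range(135)) / 135
    b = sum(random.choice(low) for _ in range(470)) / 470
    if b > 0:                        # guard against dividing by zero
        ratios.append(a / b)

ratios.sort()
n = len(ratios)
lo, hi = ratios[int(0.025 * n)], ratios[int(0.975 * n)]
print(lo, hi)
```

The printed interval should be close to the article's (0.69, 3.39);
because it straddles 1.0, the relative risk is not statistically
significant at the 5 percent level.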
NUMBERS (6.9 7.6 7.3 7.6 6.8 7.2 8.0 5.5 5.8 7.3 8.2 6.9 6.8 5.7
8.6) treat
NUMBERS (6.4 6.7 5.4 8.2 5.3 6.6 5.8 5.7 6.2 7.1 7.0 6.9 5.6 4.2
6.8) control
CONCAT treat control all Combine all observations in
same vector
REPEAT 1000 Do 1000 simulations
SAMPLE 15 all treat$ Take a resample of 15 from all
birthweights (the $ indicates
a resampling counterpart to a
real sample)
SAMPLE 15 all control$ Take a second, similar resample
MEAN treat$ mt Find the means of the two
resamples
MEAN control$ mc
SUBTRACT mt mc dif Find the difference between the
means of the two resamples
SCORE dif z Keep score of the result
END End the simulation experiment,
go back and repeat
HISTOGRAM z Produce a histogram of the
resample differences
COUNT z >= .82 k How often did resample
differences equal or exceed the
observed difference of .82?
DIVIDE k 1000 kk Convert this result to a
proportion
PRINT kk
[Histogram: frequency of the 1,000 resample differences in mean
birthweight (x-axis: -1.5 to 1.5 pounds), roughly bell-shaped and
centered near 0]
Result: Only 1.3% of the pairs of resamples produced means that
differed by as much as .82. We can conclude that the observed
difference is unlikely to have occurred by chance.
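The birthweight comparison in Python (our sketch, not the article's
code); like the RESAMPLING STATS program, it resamples with
replacement from the combined data:

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

treat = [6.9, 7.6, 7.3, 7.6, 6.8, 7.2, 8.0, 5.5, 5.8, 7.3, 8.2, 6.9, 6.8, 5.7, 8.6]
control = [6.4, 6.7, 5.4, 8.2, 5.3, 6.6, 5.8, 5.7, 6.2, 7.1, 7.0, 6.9, 5.6, 4.2, 6.8]
combined = treat + control           # pool all 30 birthweights

count = 0
for _ in range(1000):
    t = [random.choice(combined) for _ in range(15)]   # resampled "treatment"
    c = [random.choice(combined) for _ in range(15)]   # resampled "control"
    if sum(t) / 15 - sum(c) / 15 >= 0.82:              # observed mean difference
        count += 1

p = count / 1000
print(p)
```

Only a percent or so of the resampled pairs should differ by as much
as 0.82, in line with the article's result of 1.3 percent.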
REPEAT 1000 Do 1000 simulations
GENERATE 31 (1 2 3) ranks Generate 31 numbers, each
number a 1, 2 or 3, to
simulate random assignment of
ranks 1-3 to the drug/
therapy alternative
MEAN ranks rankmean Take the mean of these 31
SCORE rankmean z Keep score of the mean
END End the simulation, go back
and repeat
HISTOGRAM z Produce a histogram of the
rank means
COUNT z <=1.74 k How often was the mean rank as
good as (<=) 1.74, the observed
value?
PRINT k
[Histogram: frequency of the 1,000 simulated mean ranks (x-axis: 1.4
to 2.6), roughly bell-shaped and centered near 2, the expected mean
rank under random assignment]
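The rank-mean simulation translates directly into Python (our sketch,
not the article's code):

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

count = 0
for _ in range(1000):
    # assign a random rank of 1, 2, or 3 to the drug/therapy
    # alternative in each of the 31 triplets
    ranks = [random.choice([1, 2, 3]) for _ in range(31)]
    if sum(ranks) / 31 <= 1.74:      # mean rank as good as the observed 1.74
        count += 1

p = count / 1000
print(p)
```

The printed proportion estimates how often pure chance would produce
a mean rank as favorable as the observed 1.74.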
Development of Myocardial Infarction in Framingham after 16 Years
Men Age 35-44, by Level of Serum Cholesterol
Serum cholesterol (mg%)    Developed MI    Did not develop MI    Total
>250 10 125 135
<=250 21 449 470
Source: Shurtleff, D. The Framingham Study: An Epidemiologic
investigation of Cardiovascular Disease, Section 26. Washington,
DC, U.S. Government Printing Office. Cited in Kahn and Sempos
(1989), p. 61, Table 3-8
Birthweights in a Clinical Trial to Test a Drug
for Preventing Low Birthweights
Baby Weight (lb)
Patient Treatment group Control group
1 6.9 6.4
2 7.6 6.7
3 7.3 5.4
4 7.6 8.2
5 6.8 5.3
6 7.2 6.6
7 8.0 5.8
8 5.5 5.7
9 5.8 6.2
10 7.3 7.1
11 8.2 7.0
12 6.9 6.9
13 6.8 5.6
14 5.7 4.2
15 8.6 6.8
Source: Rosner, Table 8.7
Observed Rank of Treatments, by Effectiveness (Hypothetical)
Treatment
Triplet Group Drug Therapy Drug/Therapy
1 3 1 2
2 2 3 1
3 1 3 2
. . . .
. . . .
. . . .
31 2 1 3
Avg. rank 2.29 1.98 1.74