THE NEW BIOSTATISTICS OF RESAMPLING

                        Julian L. Simon and Peter Bruce


                                  INTRODUCTION

             A large proportion of articles in important medical journals

        nowadays employ probabilistic-statistical machinery; for example,

        42 percent of original articles in a 1978-79 sample from the New

        England Journal of Medicine did so (Emerson and Colditz, 1992).

        These statistical devices are not understood well (if at all) by

        many clinicians and researchers.  Furthermore, such methods are

        often not used correctly for the context in which they are em-

        ployed.  For example, a study of 50 articles in the New England

        Journal of Medicine which used the t statistic to compare the

        means of three or more groups found that more than half the uses

        were not appropriate (Godfrey, 1992).

             Even if the researcher has chosen a sound technique, the

        reader is hard put to understand what the researcher has done.

        Consider this recent not-atypical example in the American Journal
        of Public Health:
             The BMDP statistical package was used to determine
             statistical significance by entering age, gender, and
             smoking status (current smoker, former smoker, never
             smoker) as variables in a polychotomous logistic
             regression model and examining the P values associated
             with the regression coefficients...[The] model showed
             that age (P<.001) and gender (P=.02) were significantly
             related to smoking status. (Hensrud and Sprafka, 1993,
             p. 415).


             The reason that these conventional techniques are daunting

        to the reader, and often misused by researchers, is that they are

        inherently deep and complicated mathematically.   Even the rela-

        tively-simple t test for the comparison of two sample means is

        built upon a difficult body of formulae such as the Normal ap-

        proximation, which contains unintuitive elements such as pi and e

        (the base of natural logarithms).  And even this relatively-

        simple statistical test can only properly be used after consult-

        ing a body of rules about when it is and is not applicable; the

        process resembles cooking with a cookbook, and among students is

        widely known as "pluginski".  As the example above shows, the

        reader is asked to take on faith that the test used is

        appropriate and the program is correct; no details are given that

        would allow even the sophisticated reader to make an informed

        judgment.

             In recent decades, an entirely different approach to statis-

        tical testing has developed.  At a theoretical level the resam-

        pling method has taken the world of mathematical statistics by

        storm.  Resampling is at least as efficient as the formulaic

        method in most situations.  More important for purposes here,

        resampling is transparently clear to both researcher and reader,

        which reduces the likelihood that "Type 4 error" will occur -

        that is, use of the wrong method; this intellectual advantage has

        been shown in controlled experiments (Simon, Atkinson, and Shevo-

        kas, 1976).

             We shall first show the method in action, comparing resam-

        pling procedures with the standard formulaic treatment of the

        same data in a standard biostatistics text.  After these specific

        examples we provide a general procedure for the resampling meth-

        od, and then end with discussion of the general properties of the

        method.


           INFARCTION AND CHOLESTEROL: RESAMPLING VERSUS CONVENTIONAL

             Let's consider one of the simplest numerical examples of

        probabilistic-statistical reasoning given toward the front of a

        standard book on medical statistics (Kahn and Sempos, 1989).

        Using data from the Framingham study, the authors ask:  What is

        an appropriate "confidence interval" on the observed ratio of

        "relative risk" (a measure which is defined below, closely

        related to the odds ratio) of the development of myocardial

        infarction 16 years after the study began, for men ages 35-44

        with serum cholesterol either above 250, or equal to or below

        250?  The raw data are shown in Table 1.

                   Table 1 (Kahn and Sempos Table 3-8, p. 61)

             The reader of the text is provided with five pages of alge-

        bra leading to a formula which is not only cumbersome to use but

        also mathematically opaque except to a mathematical

        statistician; moreover, it applies only if the risk is less than 10

        percent and the "data set is large enough", a statistics journal

        reference being provided in case the data set is not large

        enough.  Then the reader is given an "alternative method that is

        very much easier to calculate", but which "we cannot explain in

        terms of elementary statistics" (p. 62).

             Rather than addressing the relative-risk problem immediate-

        ly, let's work into it slowly, using the same data but breaking

        the problem into parts to which we apply simpler procedures.


        Hypothesis Tests With Counted Data

             Consider this classic question about the Framingham serum

        cholesterol data:  What is the degree of surety that there is a

        difference in myocardial infarction rates between the high- and

        low-cholesterol groups?

             The statistical logic begins by asking:  How likely is it that

        the two observed groups "really" came from the same "population"

        with respect to infarction rates?  Operationally, we address this

        issue by asking how likely it is that two groups as different in

        disease rates as the observed groups would be produced by the

        same "statistical universe".

             Key step: we assume that the relevant "benchmark" or "null-

        hypothesis" population (universe) is the composite of the two

        observed groups.  That is, if there really were no "true" differ-

        ence in infarction rates between the two serum-cholesterol

        groups, and the observed disease differences occurred just be-

        cause of sampling variation, the most reasonable representation

        of the population from which they came is the composite of the

        two observed groups.

             Therefore, we compose a hypothetical "benchmark" universe

        containing (135 + 470 =) 605 men at risk, and designate (10 + 21

        =) 31 of them as infarction cases.  We want to determine how

        likely it is that a universe like this one would produce - just

        by chance - two groups that differ as much as do the actually

        observed groups.  That is, how often would random sampling from

        this universe produce one sub-sample of 135 containing a large

        enough number of infarctions, and the other sub-sample of 470

        producing few enough infarctions, that the difference in occur-

        rence rates would be as high as the observed difference of .029?

        (10/135 = .074, and 21/470 = .045).

             So far, everything that has been said applies both to the

        conventional formulaic method and to the "new statistics" resam-

        pling method.  But the logic is seldom explained to the reader of

        a piece of research - if indeed the researcher her/himself grasps

        what the formula is doing.  And if one just grabs for a formula

        with a prayer that it is the right one, one need never analyze

        the statistical logic of the problem at hand.

             Now we tackle this problem with a method that you would

        think of yourself if you began with the following mind-set:  How

        can I simulate the mechanism whose operation I wish to under-

        stand?  These steps will do the job:

             1.  Fill an urn with 605 balls, 31 red and the rest (605 -

        31 = 574) green.

             2.  Draw one sample of 135 (simulating the high serum-

        cholesterol group), one ball at a time and throwing it back

        after it is drawn to keep the simulated probability of an infarc-

        tion the same throughout the sample; record the number of reds.

        Then do the same with another sample of 470 (the low serum-

        cholesterol group).

             3.  Calculate the difference in infarction rates for the two

        simulated groups, and compare it to the actual difference

        of .029; if the simulated difference is that large, record "Yes"

        for this trial; if not, record "No".

             4.  Repeat steps 2 and 3 until a total of (say) 400 or 1000

        trials have been completed.  Compute the frequency with which the

        simulated groups produce a difference as great as actually ob-

        served.  This frequency is an estimate of the probability that a

        difference as great as that actually observed in Framingham would

        occur even if serum cholesterol has no effect upon myocardial

        infarction.
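
             The four steps above can be sketched in a general-purpose
        language.  The following Python fragment is a minimal illustration
        only, not the RESAMPLING STATS program; the function name, its
        parameters, and the fixed random seed are assumptions made for
        this sketch.

```python
import random

def simulate_rate_difference(cases_high=10, n_high=135,
                             cases_low=21, n_low=470,
                             trials=1000, seed=0):
    """Estimate how often chance alone produces a difference in
    infarction rates at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sketch repeats
    # Step 1: the benchmark urn, 31 red balls among 605.
    p_null = (cases_high + cases_low) / (n_high + n_low)
    observed = cases_high / n_high - cases_low / n_low  # about .029

    yes = 0
    for _ in range(trials):
        # Step 2: draw each group with replacement, so the chance of
        # a red ball stays the same on every draw.
        reds_high = sum(rng.random() < p_null for _ in range(n_high))
        reds_low = sum(rng.random() < p_null for _ in range(n_low))
        # Step 3: compare the simulated difference with the observed one.
        if reds_high / n_high - reds_low / n_low >= observed:
            yes += 1
    # Step 4: the frequency of "Yes" trials.
    return yes / trials

print(simulate_rate_difference())
```

             The value printed is the proportion of trials scored "Yes";
        it varies somewhat with the seed and the number of trials.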

             The procedure above can be carried out with balls in a

        ceramic urn in a few hours.  Yet it is natural to seek the added

        convenience of the computer to draw the samples.  Therefore, we

        illustrate in Figure 1 how a simple computer program handles this

        problem. We use our own RESAMPLING STATS, but it can be executed

        in other languages as well, though usually with more complexity

        and less clarity.

                                    Figure 1

             The results of the test using this program may be seen in

        the histogram in Figure 1.  We find - perhaps surprisingly - that

        a difference as large as observed would occur by chance fully 10

        percent of the time. (If we were not guided by the theoretical

        expectation that high serum cholesterol produces heart disease,

        we might include the 10 percent difference going in the other

        direction, giving a 20 percent chance). Even a ten percent chance

        is sufficient to strongly call into question the conclusion that

        high serum cholesterol is dangerous.  At a minimum, this statis-

        tical result should call for more research before taking any

        strong action clinically or otherwise.

             Where should one look to determine which procedures should

        be used to deal with a problem such as set forth above?  Unlike

        the formulaic approach, the basic source is not a manual which

        sets forth a menu of formulas together with sets of rules about

        when they are appropriate.  Rather, you consult your own under-

        standing about what it is that is happening in (say) the Framing-

        ham situation, and the question that needs to be answered, and

        then you construct a "model" that is as faithful to the facts as

        is possible.  The urn-sampling described above is such a model

        for the case at hand.

             To connect up what we have done with the conventional ap-

        proach, we apply a z test (conceptually similar to the t test,

        but applicable to yes-no data; it is the Normal-distribution

        approximation to the large binomial distribution) and we find

        that the results are much the same as the resampling result - an

        eleven percent probability.
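
             For readers who wish to check the conventional figure, here
        is a minimal Python sketch of a one-sided z test on the two
        proportions.  Which standard error was used for the eleven
        percent figure is not stated here, so both common variants are
        computed; the variable names are assumptions of the sketch.  The
        unpooled form comes out close to eleven percent, the pooled form
        somewhat lower.

```python
import math

def upper_tail(z):
    """Probability that a standard Normal variate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

x1, n1 = 10, 135   # high-cholesterol group
x2, n2 = 21, 470   # low-cholesterol group
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Variant A: standard error from the pooled (null-hypothesis) proportion.
pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# Variant B: standard error from each group's own proportion.
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

p_pooled = upper_tail(diff / se_pooled)
p_unpooled = upper_tail(diff / se_unpooled)
print(round(p_pooled, 3), round(p_unpooled, 3))
```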

             Someone may ask:  Why do a resampling test when you can use

        a standard device like a z or t test?  The great advantage of

        resampling is that it avoids "Type 4 error" - using the wrong

        method.  The researcher is more likely to arrive at sound

        conclusions with resampling because s/he can understand what s/he

        is doing, instead of blindly grabbing a formula which may be in

        error.

             The textbook drawn from here is an excellent one; the diffi-

        culty of the presentation is an inescapable consequence of the

        formulaic approach to probability and statistics.  The body of

        complex algebra and tables that only a rare expert understands

        down to the foundations constitutes an impenetrable wall to

        understanding.  Yet without such understanding, there can be only

        rote practice, which leads to frustration and error.


        Confidence Intervals for the Counted Data

             Consider for now just the data for the sub-group of 135

        high-cholesterol men.  A second classic statistical question is

        as follows:  How much confidence should we have that if we were

        to take a much larger sample than was actually obtained, the mean

        (actually the proportion 10/135 = .07) would be in some vicinity

        of the observed sample mean?  Let us first carry out a resampling

        procedure to answer the questions, waiting until afterwards to

        discuss the logic of the inference.

             1.  Construct an urn containing 135 balls - 10 black (in-

        farction) and 125 red (no infarction) to simulate the universe as

        we guess it to be.

             2.  Mix, choose a ball, record its color, replace it, and

        repeat 135 times (to simulate a sample of 135 men).

             3.  Record the number of black balls among the 135 drawings.

             4.  Repeat steps 2-3 perhaps 1000 times, and observe how

        much the number of blacks varies from sample to sample.  We arbi-

        trarily denote the boundary lines that include 45 percent of the

        hypothetical samples on each side of the sample mean as the 90

        percent "confidence intervals" around the mean of the actual

        population.
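
             The urn procedure above might be written in Python as
        follows.  This is a sketch under stated assumptions: the function
        name, the seed, and the simple rule used to read off the 5th and
        95th percentiles are choices made for this illustration.

```python
import random

def bootstrap_counts(cases=10, n=135, trials=1000, seed=0):
    """Return the sorted infarction counts from repeated resamples."""
    rng = random.Random(seed)             # fixed seed (assumption)
    urn = [1] * cases + [0] * (n - cases)  # step 1: 10 black, 125 red
    counts = []
    for _ in range(trials):
        # Steps 2-3: draw n balls with replacement, count the blacks.
        counts.append(sum(rng.choices(urn, k=n)))
    return sorted(counts)

counts = bootstrap_counts()
# Step 4: cut off 5 percent of the trials in each tail, leaving 90 percent.
lower = counts[int(0.05 * len(counts))]
upper = counts[int(0.95 * len(counts)) - 1]
print(lower, upper, lower / 135, upper / 135)
```

             The last two numbers printed express the same bounds as
        proportions of the 135 men.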

             Figure 2 shows how this can be done easily on the computer,

        together with the results.

                                    Figure 2

             The variation in the histogram in Figure 2 highlights the

        fact that a sample containing only 10 cases of infarction is very

        small, and the number of observed cases - or the proportion of

        cases - necessarily varies greatly from sample to sample.

        Perhaps the most important implication of this statistical

        analysis, then, is that we badly need to collect additional data.

             This is a classic problem in confidence intervals, found in

        all subject fields.  For example, at the beginning of the first

        chapter of a best-selling book in business statistics, Wonnacott

        and Wonnacott use the example of a 1988 presidential poll. The

        language used in the cholesterol-infarction example above is

        exactly the same as the language used for the Bush-Dukakis poll

        except for labels and numbers.

             Also typically, the text gives a formula without explaining

        it, and says that it is "fully derived" eight chapters later

        (Wonnacott and Wonnacott, 1990, p. 5).  With resampling, one

        never needs such a formula, and never needs to defer the

        explanation.

             The philosophic logic of confidence intervals is quite deep

        and controversial, less obvious than for the hypothesis test.

        The key idea is that we can estimate for any given universe the

        probability P that a sample's mean will fall within any given

        distance D of the universe's mean; we then turn this around and

        assume that if we know the sample mean, the probability is P that

        the universe mean is within distance D of it.  This inversion is

        more slippery than it may seem.  But the logic is exactly the

        same for the formulaic method and for resampling.  The only

        difference is how one estimates the probabilities - either with a

        numerical resampling simulation, or with a formula or other

        deductive mathematical device (such as counting and partitioning

        all the possibilities, as Galileo did when he answered a gam-

        bler's question about three dice.)  And when one uses the resam-

        pling method, the probabilistic calculations are the least de-

        manding part of the work.  One then has mental capacity available

        to focus on the crucial part of the job - framing the original

        question soundly, choosing a way to model the facts so as to

        properly resample the actual situation, and drawing appropriate

        inferences from the simulation.

             If you have understood the general logic of the procedures

        used up until this point, you are in command of all the necessary

        conceptual knowledge to construct your own tests to answer any

        statistical question.  A lot more practice, working on a variety

        of problems, obviously would help.  But the key elements are

        simple:  1) Model the real situation accurately, 2) experiment

        with the model, and 3) compare the results of the model with the

        observed results.


        Confidence Intervals on Relative Risk With Resampling

             Now we are ready to calculate - with full understanding -

        the confidence intervals on relative risk that the text sought.

        Recall that the observed sample of 135 high cholesterol men had

        10 infarctions (a proportion of .074), and the sample of 470 low

        cholesterol men had 21 infarctions (a proportion of .045).  We

        estimate the relative risk of high cholesterol as .074/.045.  Let

        us frame the question this way:  If we were to randomly draw a

        sample from the universe of high-cholesterol men that is best

        estimated from our data (7.4 percent infarctions), and a sample

        from the universe of low-cholesterol men (4.5 percent infarc-

        tions), and do this again and again, within which bounds would

        the relative risk calculated from that simulation fall (say) 95

        percent of the time?

             The operation is quite the same as that for a single confi-

        dence interval estimated above except that we do the operation

        for both sub-samples at once, and then calculate the ratio bet-

        ween their results.  As before, we would like to know what would

        happen if we could take additional samples from the universes

        that spawned our actual samples.  Lacking the resources to do so,

        we let those original samples "stand in" for the universes from

        which they came, serving as proxy "substitute universes."  We can

        imagine replicating each sample element millions of times to

        "bootstrap" these "proxy universes."  Paralleling the real world,

        we take simulated samples of the same size as our original sam-

        ples.  (Actually, we can skip replicating each sample element a

        million times and achieve the same resampling effect by sampling

        with replacement from our original samples -- that way, the

        chance that a sample element will be drawn will remain the same

        from draw to draw.)  We count the number of infarctions in each

        of our resamples, and for the pair of resamples, we calculate the

        relative risk measure and keep score of this result.  We then

        take additional pairs of resamples, each time calculating the

        relative risk measure.
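
             A minimal Python sketch of this paired resampling follows.
        The names, the seed, and the percentile rule are assumptions of
        the sketch, as is the decision to skip any trial in which the
        low-cholesterol resample contains no infarctions (the ratio is
        then undefined; with these data that is extremely rare).

```python
import random

def bootstrap_relative_risk(high=(10, 135), low=(21, 470),
                            trials=2000, seed=0):
    """Return sorted relative-risk values from paired resamples.
    Each argument is (number of infarctions, group size)."""
    rng = random.Random(seed)  # fixed seed (assumption)
    high_urn = [1] * high[0] + [0] * (high[1] - high[0])
    low_urn = [1] * low[0] + [0] * (low[1] - low[0])
    ratios = []
    for _ in range(trials):
        # Resample each group with replacement, at its original size.
        rate_high = sum(rng.choices(high_urn, k=high[1])) / high[1]
        rate_low = sum(rng.choices(low_urn, k=low[1])) / low[1]
        if rate_low == 0:
            continue  # ratio undefined; trial skipped (assumption)
        ratios.append(rate_high / rate_low)
    return sorted(ratios)

ratios = bootstrap_relative_risk()
# Bounds that enclose the middle 95 percent of the simulated ratios.
rr_low = ratios[int(0.025 * len(ratios))]
rr_high = ratios[int(0.975 * len(ratios)) - 1]
print(round(rr_low, 2), round(rr_high, 2))
```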

             We may compare our results in Figure 3 - a confidence

        interval extending from 0.69 to 3.4 - to the results given in

        Kahn and Sempos, which are 0.79 to 3.5, 0.80 to 3.4, and 0.79 to

        3.7 from three different formulas (pp. 62-63); the agreement is

        close.

                                    Figure 3

             It is interesting that this may be the first time a calcula-

        tion of relative risk using resampling has ever been published.

        And it therefore should be a contribution to the statistics

        literature comparable with the formulaic approaches published in

        earlier years.  But because the procedure is worked out here on

        an ad hoc basis, and does not seem to be very difficult, it

        probably is not worth publishing separately. We point this out

        because resampling routinely produces entirely new procedures at

        least as powerful as the previously-existing formulaic proce-

        dures. These resampling procedures also have the advantage of

        being fully understood even by persons who are not professional

        statisticians but who think hard about their subject matter, and

        then create appropriate procedures by working from first princi-

        ples and modeling their actual research situations with care and

        understanding. Even underclasspersons in a state university are

        able to do this; one would expect persons in medical school or

        beyond it to be at least equally capable. That is the true revo-

        lution wrought by resampling.


                            SOME OTHER ILLUSTRATIONS

        A Measured-Data Example:  Test of a Drug to Prevent Low Birthweight

             The Framingham infarction-cholesterol examples worked with

        yes-no "count" data.  Let us therefore consider some

        illustrations of the use of resampling with measured data.

             Another leading textbook (Rosner, 1982, p. 257) gives the

        example of a test of the hypothesis that drug A prevents low

        birthweights.  The data for the treatment and control groups are

        shown in Table 2.   Here is a resampling approach to the

        problem:

                                     Table 2

             1.  If the drug has no effect, our best guess about the

        "universe" of birthweights is that it is composed of (say) a

        million each of the observed weights, lumped together.  In other

        words, in the absence of any other information or compelling

        theory, we assume that the combination of our samples is our best

        estimate of the universe.  Hence write each of the birthweights

        on a card, and put them into a hat.  Drawing them one by one and

        then replacing them is the operational equivalent of a very large

        (but equal) number of each birthweight.

             2. Repeatedly draw two samples of 15 each, and check how

        frequently the difference between the resampled means is as large

        as or larger than the actual difference.
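
             The two steps above can be sketched as a short Python
        function.  It is an illustration under assumptions: the function
        and argument names are hypothetical, the comparison is taken to
        be one-sided (treatment mean minus control mean), and no data
        are built in, so the two columns of birthweights must be supplied
        by the caller.

```python
import random
from statistics import mean

def resample_mean_difference(treatment, control, trials=1000, seed=0):
    """Fraction of resampled pairs whose difference in means is at
    least as large as the actual treatment-minus-control difference."""
    rng = random.Random(seed)  # fixed seed (assumption)
    # Step 1: lump both groups into one hat of cards.
    hat = list(treatment) + list(control)
    actual = mean(treatment) - mean(control)
    hits = 0
    for _ in range(trials):
        # Step 2: draw two samples with replacement, each the same
        # size as the corresponding original group.
        a = rng.choices(hat, k=len(treatment))
        b = rng.choices(hat, k=len(control))
        if mean(a) - mean(b) >= actual:
            hits += 1
    return hits / trials
```

             Called with the treatment and control birthweights, the
        function returns the proportion of resampled pairs that differ
        by at least the actual amount.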

             We find in Figure 4 that only 1% of the pairs of

        hypothetical resamples produced means that differed by as much

        as .82.  We therefore conclude that the observed difference is

        unlikely to have occurred by chance.

                                    Figure 4


        Matched-Patients Test of Three Treatments

             There have been several recent three-way tests of treatments

        for depression: drug versus cognitive therapy versus combined

        drug and cognitive therapy.  Consider this procedure for a

        proposed test in which 31 triplets of people have been matched within

        triplet by sex, age, and years of education. The three treatments

        are to be chosen randomly within each triplet.  Assume that the

        outcomes on a series of tests were ranked from best (#1) to worst

        (#3) within each triplet, and assume that the combined drug-and-

        therapy regime has the highest average rank.  How sure can we be

        that the observed result would not occur by chance?

             In hypothetical Table 3 the average rank for the drug and

        therapy regime is 1.74.  Is it possible that the regimes do not

        differ with respect to effectiveness, and that the drug and

        therapy regime came out with the best rank just by the luck of

        the draw?  We test by asking "If there is no difference, what is

        the probability of getting an average rank this good, just by

        chance?"

                                     Table 3

             Figure 5 shows a program for a resampling procedure that

        repeatedly produces 31 sets of ranks randomly selected among the

        numbers 1, 2 and 3, and averages the ranks for each treatment.

        We can then observe whether an average of 1.74 is unusually low,

        and hence should not be ascribed to chance.
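
             A procedure of this kind might look as follows in Python.
        This is a sketch, not the program of Figure 5.  Two assumptions
        are made: only the average rank of the one designated regime
        (the combined treatment, held in position 0) is scored, and the
        simulated average is compared with 1.74 after rounding to two
        decimals, since the observed value is reported to two decimals.

```python
import random

def simulate_average_rank(triplets=31, observed=1.74,
                          trials=1000, seed=0):
    """Fraction of trials in which the designated treatment's average
    rank is as low as the observed average."""
    rng = random.Random(seed)  # fixed seed (assumption)
    hits = 0
    for _ in range(trials):
        totals = [0, 0, 0]  # rank totals for the three treatments
        for _ in range(triplets):
            ranks = [1, 2, 3]
            rng.shuffle(ranks)  # each rank used once within a triplet
            for treatment in range(3):
                totals[treatment] += ranks[treatment]
        average = totals[0] / triplets  # designated (combined) regime
        if round(average, 2) <= observed:
            hits += 1
    return hits / trials

print(simulate_average_rank())
```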

                                    Figure 5

             In 1000 repetitions of the simulation, 5% yielded average

        ranks as low as the observed value.  This is evidence that

        something besides chance might be at work here.  (The result is

        at the borderline of the traditional 5% "level of significance"

        (a p-value of .05), supposedly set arbitrarily by the great

        statistician R.A. Fisher on the grounds that a 1-in-20 happening

        is too coincidental to ignore.)   That is, the resampling test

        suggests that it would be very unlikely for one of the treatment

        regimes to achieve, just by chance, results as much better than

        the other two regimes as are actually observed.

             An interesting feature of this problem is that it would be

        hard to find a conventional test that would handle this three-way

        comparison in an efficient manner. Certainly it would be impossi-

        ble to find a test that would not require formulae and tables

        that only a talented professional statistician could manage

        satisfactorily, and even the professional is not likely to fully

        understand those formulaic procedures.

                A DEFINITION AND GENERAL PROCEDURE FOR RESAMPLING

             A statistical procedure manipulates some replica of the

        physical process in which you are interested.  A resampling

        method simulates (models) the process with easy-to-handle sym-

        bols.  The resampler postulates a universe composed of the ob-

        served data, which are then used to produce new hypothetical

        samples whose properties are then examined.  That is, one exam-

        ines how the universe behaves, comparing the outcomes to a crite-

        rion that we choose.

             Here is an "operational definition" of resampling:  Using

        the entire set of data you have in hand, produce new samples of

        simulated data, and examine the results of those samples.  That's

        it in a nutshell.


                         VARIETIES OF RESAMPLING METHODS

             A resampling test may be constructed for almost any statis-

        tical inference.  Every real-life situation can be modeled by

        symbols of some sort, and one may experiment with this model to

        obtain resampling trials.  The most important counterindication

        is insufficient data to perform a useful resampling test, in

        which case a conventional test - which makes up for the absence

        of observations with an assumed theoretical distribution - may

        produce more accurate results if the universe from which the data

        are selected resembles the chosen theoretical distribution.

        Exploration of the properties of resampling tests is an active

        field of research at present.

             For the main tasks in statistical inference - hypothesis

        testing and confidence intervals - the appropriate resampling

        test often is immediately obvious, as seen in the case of choles-

        terol and infarction rates above.

             (Technical note to biostatisticians: Two sorts of procedures

        are especially well-suited to resampling:  1) When the size of

        the universe is properly assumed fixed, or for other reasons

        sampling without replacement is called for, it is appropriate to

        sample from among the possible permutations of the data;  this is

        an adaptation of Ronald Fisher's "exact" test (confusingly, also

        called a "randomization" test).  The three-way drug test above is

        an illustration; the rank of one member of a triplet affects the

        possible ranks of the other two members, and hence the sampling

        is done "without replacement".  2) The bootstrap procedure is

        appropriate when the size of the universe is properly assumed not

        to be fixed in size, and the measurement of one entity in the

        sample does not affect the measurement of another entity.   This

        device - for which there is no analog in conventional formulaic

        statistics - is illustrated by the birthweight test above.)
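
             The distinction between the two sorts of sampling can be
        seen in a few lines of Python.  The sketch assumes the standard
        library's random module; the variable names are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed (assumption)
data = [1, 2, 3]        # for instance, the three ranks in one triplet

# Permutation (without replacement): every original value appears
# exactly once, only the order changes.
permuted = rng.sample(data, k=len(data))

# Bootstrap (with replacement): each draw is independent, so a value
# may appear several times or not at all.
bootstrapped = rng.choices(data, k=len(data))

print(permuted, bootstrapped)
```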

             Resampling is a much simpler intellectual task than the

        formulaic method, because simulation obviates the need to calcu-

        late the number of possible ways that the event in which you are

        interested - an infarction, say, or a birth of a certain size -

        can or cannot occur.  In technical terms, resampling does not

        require computation of the "sample space" or any part of it.  In

        all but the most elementary problems where simple permutations

        and combinations suffice, such calculations require advanced

        training and delicate judgment; these calculations are the root

        of the mathematical and conceptual difficulty of conventional

        formulaic statistics.

             Resampling avoids the complex abstraction of sample-space

        calculations by substituting the particular information about how

        elements in the sample are generated randomly in a specific

        event, as learned from the actual circumstances; the analytic

        method does not use this information.  In the case of the gam-

        blers prior to Galileo, resampling used the (assumed) facts that

        three fair dice are thrown with an equal chance of any outcome,

        and they took advantage of experience with many such events

        performed one at a time; in contrast, Galileo made no use of the

        actual stochastic element of the situation, and gained no infor-

        mation from a sample of such trials, but rather replaced all

        possible sequences by exhaustive computation.
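
             The two routes can be set side by side in a short Python
        sketch.  The particular question used, why three dice total 10
        more often than 9, is the one usually associated with Galileo's
        analysis and is an assumption as far as this article goes; the
        variable names and the number of simulated throws are likewise
        illustrative.

```python
import itertools
import random

# Deductive route: enumerate all 6**3 = 216 equally likely outcomes
# and count those that total 9 and those that total 10.
ways = {9: 0, 10: 0}
for roll in itertools.product(range(1, 7), repeat=3):
    total = sum(roll)
    if total in ways:
        ways[total] += 1

# Resampling route: throw three simulated dice many times and
# record how often each total turns up.
rng = random.Random(0)  # fixed seed (assumption)
trials = 20000
throws = [sum(rng.randint(1, 6) for _ in range(3)) for _ in range(trials)]
freq = {total: throws.count(total) / trials for total in (9, 10)}

print(ways, freq)
```

             Enumeration gives 25 ways to throw a 9 and 27 ways to throw
        a 10; the simulated frequencies approximate 25/216 and 27/216.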

             The resampling method is not theoretically inferior to the

        formulaic method.  Resampling is not "just" a stochastic-

        simulation approximation to formulas.  It is a quite different

        route to the same endpoint, using different intellectual

        processes and utilizing different sorts of inputs; both resam-

        pling and formulaic calculation are shortcuts to estimation of

        the sample space and its partitions.  Its much lesser

        intellectual difficulty is the source of the central advantage of

        resampling.  It improves the probability that the user will

        arrive at a sound solution to a problem - the ultimate criterion

        for all except for pure mathematicians.

             The applicability of resampling is especially great in

        biostatistics because of the small and irregular samples so

        common in clinical research.

                THE PLACE OF RESAMPLING IN THE REALM OF KNOWLEDGE

             Probability theory and its offspring, inferential

        statistics, constitute perhaps the most frustrating branch of

        human knowledge.

             Right from its beginnings in the seventeenth century,

        the great mathematical discoverers knew that the probabilistic

        way of thinking -- which we'll call "prob-stats" for short --

        offers enormous power to improve our decisions and the quality of

        our lives.  Yet until very recently, when the resampling method

        came along, scholars were unable to convert this powerful body of

        theory into a tool that laypersons could and would use freely in

        daily work and personal life.  Instead, only professional

        statisticians feel themselves in comfortable command of the prob-

        stats way of thinking.  The most frequent applications are by

        medical and social scientists, who know that prob-stats is indis-

        pensable to their work yet too often fear and misuse it.

             Resampling is now fully accepted theoretically.  The

        publication of advanced papers exploring its properties is

        proceeding at a breathtaking rate throughout the world. And

        controlled studies show that people ranging from engineers and

        scientists down to seventh graders quickly handle more problems

        correctly with resampling than they do with conventional

        methods.  Furthermore,

        in contrast to the older conventional statistics, which is a

        painful and humiliating experience for most students at all

        levels, the published studies show that students enjoy resampling

        statistics.  But resampling has not yet penetrated very far

        into the classroom, for a variety of institutional and historical

        reasons.


        Resampling in Medical Education

             Prob-stats is the bane of medical students as well as all

        other students required to study it; the statistics course is a

        painful rite of passage -- like fraternity paddling -- on the way

        to a degree.  Afterwards, the subject is happily put out of mind

        forever.

             Yet the practice of medicine becomes more and more dependent

        upon a knowledge of statistics.  Physicians like to say that they

        practice on the basis of "clinical knowledge". Yet in an ever-

        growing proportion of situations, choice of treatment comes

        straight from research studies whose conclusions depend on sta-

        tistical tests.  Without a sound understanding of inference, a

        physician cannot evaluate such studies and sort out which to rely

        upon.

             Teaching physicians statistics has been an impossible nut to

        crack.  As one statistician wrote about her attempt to teach

        medical students conventional statistical methods:  "I gazed into

        the sea of glazed eyes and forlorn faces, shocked by the looks of

        naked fear my appearance at the lectern prompted" (Vaisrub,

        1990).

             Students of probability and statistics simply memorize the

        rules.  Most users of prob-stats select their methods blindly,

        understanding little or nothing of the basis for choosing one

        method rather than another, and simply push the buttons for one

        or another easily available computer operation.  This often leads

        to wildly inappropriate practices, and contributes to the

        damnation of statistics.

             The statistical community has made valiant attempts to

        ameliorate the situation.  Great statisticians have struggled to

        find interesting and understandable ways to teach prob-stats.

        Learned committees and professional associations have wrung their

        hands in despair, and spent millions of dollars creating

        television series and textbooks.  Despite successes, these campaigns

        to promote prob-stats have largely failed.  The enterprise smash-

        es up against an impenetrable wall - the body of complex algebra

        and tables that only a rare expert understands right down to the

        foundations.  For example, almost no one can write the formula

        for the "Normal" distribution that is at the heart of most sta-

        tistical tests.  Even fewer understand its meaning.  Yet without

        such understanding, there can be only rote learning.

             The resampling method, in combination with the personal

        computer, promises to cure this disease, and finally realize the

        great potential of statistics and probability.

             In the absence of formulae, black-box computer programs, and

        cryptic tables, the resampling approach forces one to address

        the problem at hand directly.  Then, instead of asking "Which

        formula should I use?" one begins to ask more profound questions

        such as "Why is something `significant' if it occurs 4% of the

        time by chance, yet not `significant' if a random process pro-

        duces it 8% of the time?"


        About "Exactness"

             Earlier we suggested that the likelihood of arriving at a

        sound answer with a valid method, rather than using an incorrect

        method, is more important scientifically than any likely

        inexactness from the resampling simulation method.  But even that

        concedes too much:  The formulaic method itself is in no way

        perfectly exact; rather, it rests on approximations. The Normal

        distribution itself is only an approximation to the binomial.

        And often there are approximations in computing formulas.
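
             A small Python sketch may make the point concrete.  The

        numbers are hypothetical (20 trials, success probability 0.1, at

        least 5 successes) and the function names are ours.  It compares

        an exact binomial tail, the Normal approximation with a

        continuity correction, and a simulated estimate.

```python
import math
import random


def exact_binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))


def normal_approx_tail(n, p, k):
    """Normal approximation to P(X >= k), with continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (k - 0.5 - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def simulated_tail(n, p, k, trials=20_000, seed=0):
    """Resampling estimate: run the n trials many times and keep score."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        if successes >= k:
            hits += 1
    return hits / trials


n, p, k = 20, 0.1, 5   # hypothetical values, chosen only for illustration
print("exact    ", exact_binomial_tail(n, p, k))
print("normal   ", normal_approx_tail(n, p, k))
print("simulated", simulated_tail(n, p, k))
```

             In a small, skewed case such as this one the formula-based

        Normal figure differs visibly from the exact value, while the

        simulated figure differs from it only by sampling error.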

             [There also is a certain irony in the common objection that

        resampling is not "exact" because the results are "only" a sam-

        ple.  The basis of all statistical work is sample data drawn from

        actual populations.  Statisticians have only recently managed to

        win battles against those bureaucrats and social scientists who,

        out of ignorance of statistics, believed that only a complete

        census of a country's population, or examination of every volume

        in a library, could give satisfactory information about unemploy-

        ment rates or book sizes.  Indeed, samples are sometimes even

        more accurate than censuses.  Yet many of those same statisti-

        cians have been skittish about simulated samples of data points

        taken from the sample space - drawn far more randomly than the

        data themselves, even at best.  They tend to want a complete

        "census" of the sample space, even when sampling is more likely

        to arrive at a correct answer because it is intellectually

        simpler (as with the gamblers and Galileo).]


                                   CONCLUSION

             Probabilistic analysis is crucial in medicine, perhaps more

        so than in any other discipline.  Judgments about whether to use

        one treatment or another, or to allow a new medicine on the

        market, require that the decision-maker assess chance variability

        in the data.  But until now, the practice and teaching of proba-

        bilistic statistics, with its abstruse structure of mathematical

        formulas cum tables of values based on restrictive assumptions

        concerning data distributions -- all of which separate the user

        from the actual data or physical process under consideration --

        have kept the full fruits of statistical understanding from the

        medical community.

             Estimating probabilities with conventional mathematical

        methods is often so complex that the process scares many people.

        And properly so, because the difficulties lead to frequent

        errors.  The statistical profession has long expressed grave

        concern about the widespread use of conventional tests whose

        foundations are poorly understood.  The recent ready availability

        of statistical computer packages that can easily perform

        conventional tests with a single command, irrespective of whether

        the user understands what is going on or whether the test is

        appropriate, has exacerbated this problem.  This has led teachers

        to emphasize descriptive statistics and even ignore inferential

        statistics.

             Beneath every formal statistical procedure there lies a

        physical process.  Resampling methods allow one to work directly

        with the underlying physical model by simulating it.  The term

        "resampling" refers to the use of the given data, or a data

        generating mechanism such as a die, to produce new samples, the

        results of which can then be examined. Resampling estimates

        probabilities by numerical experiments instead of with formulae

        -- by flipping coins or picking numbers from a hat, or with the

        same operations simulated on a computer.

             The resampling method enables people to obtain the benefits

        of statistics and probability theory without the shortcomings of

        conventional methods, because it is free of mathematical formulas

        and restrictive assumptions and is easy to understand and use,

        especially in conjunction with the computer language and program

        RESAMPLING STATS.

             It is the overall approach - the propensity to turn first to

        resampling methods to handle practical problems - that most

        clearly distinguishes resampling from conventional statistics.

        In addition, some resampling methods are new in themselves, the

        result of the basic resample-it tendency of the past quarter

        century.

             Resampling replaces complex mathematical calculations about

        the size of the sample space and its parts with simulation of

        the conditions that produce the individual events; the formulaic

        method makes no use of information about these concrete

        conditions.  This very different intellectual approach is the

        source of resampling's clarity and simplicity.







                                   REFERENCES

             Edgington, Eugene S., Randomization Tests, Marcel Dekker, N.

        Y., 1980

             Efron, Bradley, and Diaconis, Persi, "Computer Intensive

        Methods in Statistics,"  Scientific American, May, 1983, pp. 116-

        130.

             Emerson, John D., and Graham A. Colditz, "Use of Statistical

        Analysis in the New England Journal of Medicine", in John C.

        Bailar III and Frederick Mosteller, Medical Uses of Statistics

        (Boston:  NEJM Books, 1992), pp. 45-57.

             Godfrey, Katherine, "Comparing the Means of Several Groups",

        in John C. Bailar III and Frederick Mosteller, Medical Uses of

        Statistics (Boston:  NEJM Books, 1992), pp. 233-258.

             Hensrud, Donald D., and J. Michael Sprafka, "The Smoking

        Habits of Minnesota Physicians", American Journal of Public

        Health, vol 83, March, 1993, 415-417.

             Kahn, Harold A., and Christopher T. Sempos, Statistical

        Methods in Epidemiology (New York:  Oxford, 1989)

             Noreen, Eric W., Computer Intensive Methods for Testing

        Hypotheses,  (New York: Wiley, 1989)

             Rosner, Bernard, Fundamentals of Biostatistics,  (Boston:

        Duxbury, 1982)

             Simon, Julian L., Basic Research Methods in Social Science,

        1969,  (New York: Random House, 1989; 3rd Edition, 1985, with

        Paul Burstein)

             Simon, Julian L., Atkinson, David T., and Shevokas, Carolyn,

        "Probability and Statistics:  Experimental Results of a Radically

        Different Teaching Method,"  American Mathematical Monthly, v.

        83, No. 9, Nov. 1976

             Simon, Julian L., and Bruce, Peter C., "Resampling: Everyday

        Statistical Tool," Chance, v. 4, #1, 1991

             Simon, Julian L., Resampling:  Probability and Statistics a

        Radically Different Way  (Belmont, CA:  Wadsworth, forthcoming

        1993).

             Vaisrub, Naomie, Chance, Winter, 1990, p. 53

             Wonnacott, Thomas H. and Ronald J. Wonnacott, Introductory

        Statistics for Business and Economics 4th edition (New York:

        Wiley, 1990).








        URN 31#1 574#2 men             An urn called "men" with 31 ones
                                       (=infarctions) and 574 twos
                                       (=no infarction)
        REPEAT 1000                    Do 1000 trials
          SAMPLE 135 men high          Sample (with replacement!) 135
                                       of the numbers in this urn, give
                                       this group the name "high"
          SAMPLE 470 men low           Same for a group of 470, call
                                       it low
          COUNT high =1 a              Count infarctions in first group
          DIVIDE a 135 aa              Express as a proportion
          COUNT low =1 b               Count infarctions in second
                                       group
          DIVIDE b 470 bb              Express as a proportion
          SUBTRACT aa bb c             Find the difference in
                                       infarction rates
          SCORE c z                    Keep score of this difference
        END
        HISTOGRAM z
        COUNT z >=.029 k               How often was the resampled
                                       difference >= the observed
                                       difference?
        DIVIDE k 1000 kk               Convert this result to a
                                       proportion
        PRINT kk


          200+
             +
             +
        F    +
        r    +
        e 150+
        q    +
        u    +
        e    +
        n    +                     **
        c 100+                     **
        y    +                  ** ***
             +                  ** ***
        *    +                  ******
             +                  ****** *
        Z  50+                ***********
             +                ***********
             +               ************ **
             +              ****************
             +           *********************
            0+-------------------------------------------
               |^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|
             -0.1      -0.05       0       0.05       0.1
                          Difference between resamples
                          (proportion with infarction)


        kk       =      0.102  (the proportion of resample pairs

                                with a difference >= .029)
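
        For readers without RESAMPLING STATS, the program above can be
        approximated in Python.  This is a sketch rather than the
        authors' code: the function name, the seed, and the use of
        random.choices for sampling with replacement are our
        assumptions, and a different seed will give a somewhat
        different proportion.

```python
import random


def resampled_differences(trials=1000, seed=0):
    """Difference in infarction rates between two groups drawn from one urn."""
    rng = random.Random(seed)
    urn = [1] * 31 + [2] * 574              # 1 = infarction, 2 = no infarction
    diffs = []
    for _ in range(trials):
        high = rng.choices(urn, k=135)      # sample with replacement
        low = rng.choices(urn, k=470)
        diffs.append(high.count(1) / 135 - low.count(1) / 470)
    return diffs


observed = 10 / 135 - 21 / 470              # about .029, from the table
diffs = resampled_differences()
kk = sum(d >= 0.029 for d in diffs) / len(diffs)
print(kk)
```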








        URN 10#1 125#0 men           An urn (called "men") with
                                     ten 1's (infarctions)
                                     and 125 0's (no infarction)
        REPEAT 1000                  Do 1000 trials
          SAMPLE 135 men a           Sample (with replacement) 135
                                     numbers from the urn, put them in
                                     "a"
          COUNT a =1 b               Count the infarctions
          DIVIDE b 135 c             Express as a proportion
          SCORE c z                  Keep score of the result
        END                          End the trial, go back and repeat
        HISTOGRAM z                  Produce a histogram of all trial
                                     results
        PERCENTILE z (2.5 97.5) k    Determine the 2.5th and 97.5th
                                     percentiles of all trial results;
                                     these points enclose 95% of the
                                     results
        PRINT k


        F    +
        r    +
        e 150+
        q    +                *
        u    +              * *
        e    +              * **
        n    +             ** **
        c 100+             ** ** *
        y    +           * ** ** *
             +           * ** ** **
        *    +           * ** ** **
             +           * ** ** **
        Z  50+           * ** ** **
             +        * ** ** ** ** **
             +        * ** ** ** ** ** *
             +       ** ** ** ** ** ** *
             +       ** ** ** ** ** ** ** *
            0+-------------------------------------------
               |^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|
               0       0.05       0.1      0.15       0.2
                    Proportion with infarction

        k        =   0.037037    0.11852


        (This is the 95% confidence interval, enclosing 95% of the resam-
        ple results)
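
        A Python sketch of the same confidence-interval procedure
        follows.  The helper names and the seed are ours, and the
        nearest-rank percentile rule is an assumption; RESAMPLING STATS
        may compute percentiles slightly differently.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def bootstrap_proportions(trials=1000, seed=0):
    """Resample 135 men with replacement and score the infarction rate."""
    rng = random.Random(seed)
    urn = [1] * 10 + [0] * 125              # 1 = infarction, 0 = none
    results = []
    for _ in range(trials):
        resample = rng.choices(urn, k=135)
        results.append(sum(resample) / 135)
    return results


z = bootstrap_proportions()
low, high = percentile(z, 2.5), percentile(z, 97.5)
print(low, high)
```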







        URN 10#1 125#0 high        The universe of 135 high cholesterol
                                   men, 10 of whom (1's) have infarctions
        URN 21#1 449#0 low         The universe of 470 low cholesterol
                                   men, 21 of whom (1's) have infarctions
        REPEAT 1000                Repeat the steps that follow 1000
                                   times
          SAMPLE 135 high high$    Sample 135 (with replacement) from
                                   the high cholesterol universe, and
                                   put them in "high$" [the "$"
                                   suffix just indicates a resampled
                                   counterpart to the actual sample]
          SAMPLE 470 low low$      Similarly for 470 from
                                   the low cholesterol universe
          COUNT high$ =1 a         Count the infarctions in the first
                                   resampled group
          DIVIDE a 135 aa          Convert to a proportion
          COUNT low$ =1 b          Count the infarctions in the second
                                   resampled group
          DIVIDE b 470 bb          Convert to a proportion
          DIVIDE aa bb c           Divide the proportions to calculate
                                   relative risk
          SCORE c z                Keep score of this result
        END                        End the trial, go back and repeat
        HISTOGRAM z                Produce a histogram of trial results
        PERCENTILE z (2.5 97.5) k     Find the percentiles that
                                      bound 95% of the trial results
        PRINT k


        F    +                *
        r    +                *
        e  75+                * *
        q    +                * *
        u    +                * *
        e    +             **** *  *
        n    +             **** *  *
        c  50+          *  ******  *
        y    +          *  *********
             +          ************
        *    +          ************* *
             +         ****************   *
        Z  25+         ****************   *
             +         ********************
             +       * ********************  *
             +       *********************** *    *
             +     ****************************** * *   *
            0+---------------------------------------------------------------
               |^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|
               0         1         2         3         4         5         6
                           Relative risk
        Results (estimated 95% confidence interval):

        k        =    0.68507     3.3944
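
        The relative-risk interval can be sketched in Python along the
        same lines.  The names, the seed, the nearest-rank percentile
        rule, and the guard against an empty low-cholesterol resample
        are our assumptions, not part of the program above.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def bootstrap_relative_risks(trials=1000, seed=0):
    """Resample each group from its own universe and score the risk ratio."""
    rng = random.Random(seed)
    high = [1] * 10 + [0] * 125             # high cholesterol, 10 infarctions
    low = [1] * 21 + [0] * 449              # low cholesterol, 21 infarctions
    ratios = []
    for _ in range(trials):
        high_rate = sum(rng.choices(high, k=135)) / 135
        low_rate = sum(rng.choices(low, k=470)) / 470
        # Guard (ours): a low-group resample with no infarctions would
        # divide by zero, so it is scored as an infinite ratio.
        ratios.append(high_rate / low_rate if low_rate > 0 else math.inf)
    return ratios


z = bootstrap_relative_risks()
lower, upper = percentile(z, 2.5), percentile(z, 97.5)
print(lower, upper)
```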






        NUMBERS (6.9 7.6 7.3 7.6 6.8 7.2 8.0 5.5 5.8 7.3 8.2 6.9 6.8 5.7
                 8.6) treat
        NUMBERS (6.4 6.7 5.4 8.2 5.3 6.6 5.8 5.7 6.2 7.1 7.0 6.9 5.6 4.2
                 6.8) control
        CONCAT treat control all         Combine all observations in
                                         same vector
        REPEAT 1000                      Do 1000 simulations
          SAMPLE 15 all treat$           Take a resample of 15 from all
                                         birthweights (the $ indicates
                                         a resampling counterpart to a
                                         real sample)
          SAMPLE 15 all control$         Take a second, similar resample
          MEAN treat$  mt                Find the means of the two
                                         resamples
          MEAN control$ mc
          SUBTRACT mt mc dif             Find the difference between the
                                         means of the two resamples
          SCORE dif z                    Keep score of the result
        END                              End the simulation experiment,
                                         go back and repeat
        HISTOGRAM z                      Produce a histogram of the
                                         resample differences
        COUNT z >= .82 k                 How often did resample
                                         differences equal or exceed the
                                         observed difference of .82?
        DIVIDE k 1000 kk                 Convert this result to a
                                         proportion
        PRINT kk

        F    +
        r    +
        e  75+
        q    +
        u    +
        e    +
        n    +                          * *   * *
        c  50+                          * * * *** *
        y    +                          ***********
             +                         ************ *
        *    +                        ***************
             +                       ****************
        Z  25+                    * ******************* *
             +                    * *********************
             +                   ** **********************
             +                 ******************************
             +               ***********************************
            0+---------------------------------------------------------------
               |^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|
             -1.5       -1       -0.5        0        0.5        1        1.5
                            Resample differences in pounds

        Result:  Only 1.3% of the pairs of resamples produced means that
        differed by as much as .82.  We can conclude that the observed
        difference is unlikely to have occurred by chance.
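
        The birthweight test can be sketched in Python as follows.  The
        data are those of the program above; the function names and the
        seed are ours, and the proportion printed will vary a little
        from one seed to another.

```python
import random

treat = [6.9, 7.6, 7.3, 7.6, 6.8, 7.2, 8.0, 5.5, 5.8, 7.3, 8.2, 6.9, 6.8,
         5.7, 8.6]
control = [6.4, 6.7, 5.4, 8.2, 5.3, 6.6, 5.8, 5.7, 6.2, 7.1, 7.0, 6.9, 5.6,
           4.2, 6.8]


def mean(values):
    return sum(values) / len(values)


def resampled_mean_differences(trials=1000, seed=0):
    """Draw two resamples of 15 from the pooled weights; score the gap."""
    rng = random.Random(seed)
    pooled = treat + control                # combine all observations
    diffs = []
    for _ in range(trials):
        treat_star = rng.choices(pooled, k=15)     # with replacement
        control_star = rng.choices(pooled, k=15)
        diffs.append(mean(treat_star) - mean(control_star))
    return diffs


observed = mean(treat) - mean(control)      # about .82 pounds
z = resampled_mean_differences()
kk = sum(d >= 0.82 for d in z) / len(z)
print(observed, kk)
```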








        REPEAT 1000                     Do 1000 simulations
          GENERATE 31 (1 2 3) ranks     Generate 31 numbers, each
                                        number a 1, 2 or 3, to
                                        simulate random assignment of
                                        ranks 1-3 to the drug/
                                        therapy alternative
          MEAN ranks rankmean           Take the mean of these 31
          SCORE rankmean z              Keep score of the mean
        END                             End the simulation, go back
                                        and repeat
        HISTOGRAM z                     Produce a histogram of the
                                        rank means
        COUNT z <=1.74 k                How often is the mean rank as
                                        good as or better than 1.74,
                                        the observed value?
        PRINT k


          100+
             +                            *    *
             +                            *    *
        F    +                            *  * *
        r    +                         *  ** * *
        e  75+                         *  ** * *
        q    +                         ** ** * ** *
        u    +                         ** ** * ** *
        e    +                         ** ** * ** *
        n    +                         ** ** * ** **
        c  50+                         ** ** * ** **
        y    +                       * ** ** * ** ** *
             +                       * ** ** * ** ** *
        *    +                     * * ** ** * ** ** *
             +                    ** * ** ** * ** ** * *
        Z  25+                    ** * ** ** * ** ** * **
             +                  * ** * ** ** * ** ** * **
             +                  * ** * ** ** * ** ** * ** * *
             +             *    * ** * ** ** * ** ** * ** * *  *
             +          * ** ** * ** * ** ** * ** ** * ** * ** *
            0+---------------------------------------------------------------
               |^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|^^^^^^^^^|
              1.4       1.6       1.8        2        2.2       2.4       2.6
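
        A Python sketch of the rank simulation follows.  The function
        name and seed are ours; as in the program above, each of the 31
        ranks is drawn independently from 1, 2 and 3.

```python
import random


def simulated_rank_means(trials=1000, groups=31, seed=0):
    """Mean of 31 ranks assigned at random, repeated many times."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        ranks = [rng.choice((1, 2, 3)) for _ in range(groups)]
        means.append(sum(ranks) / groups)
    return means


z = simulated_rank_means()
k = sum(m <= 1.74 for m in z)               # same cutoff as the program
print(k, k / len(z))
```

        One caution about the cutoff: with 31 groups, an average rank
        reported as 1.74 presumably corresponds to a rank sum of 54
        (54/31 is about 1.742), in which case a cutoff of exactly 1.74
        does not count that sum itself.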






        Development of Myocardial Infarction in Framingham after 16 Years
                  Men Age 35-44, by Level of Serum Cholesterol



        Serum cholesterol       Developed MI  Did not develop MI    Total
        (mg%)

        >250                          10           125               135
        <=250                         21           449               470



        Source: Shurtleff, D.  The Framingham Study: An Epidemiologic
        Investigation of Cardiovascular Disease, Section 26.  Washington,
        DC, U.S. Government Printing Office.  Cited in Kahn and Sempos
        (1989), p. 61, Table 3-8









                 Birthweights in a Clinical Trial to Test a Drug
                         for Preventing Low Birthweights


                                         Baby Weight (lb)
        Patient             Treatment group     Control group
             1                   6.9                 6.4
             2                   7.6                 6.7
             3                   7.3                 5.4
             4                   7.6                 8.2
             5                   6.8                 5.3
             6                   7.2                 6.6
             7                   8.0                 5.8
             8                   5.5                 5.7
             9                   5.8                 6.2
             10                  7.3                 7.1
             11                  8.2                 7.0
             12                  6.9                 6.9
             13                  6.8                 5.6
             14                  5.7                 4.2
             15                  8.6                 6.8


        Source: Rosner, Table 8.7




          Observed Rank of Treatments, by Effectiveness (Hypothetical)


                                         Treatment

        Triplet Group          Drug    Therapy    Drug/Therapy

             1                   3         1         2
             2                   2         3         1
             3                   1         3         2
             .                   .         .         .
             .                   .         .         .
             .                   .         .         .
             31                  2         1         3
                  Avg. rank     2.29     1.97      1.74