
Big Sample, Unreliable Result

Which would you rather have?  A large sample that is biased, or a representative sample that is small?  The American Statistical Association committee that reviewed the 1948 Kinsey report on male sexual behavior, based on interviews with over 5000 men, left no doubt of their preference for the latter.  The statisticians –  William Cochran, Frederick Mosteller, John Tukey, and W. O. Jenkins – were leaders in their profession, and identified multiple sources of bias in the Kinsey data collection effort.  Participation was voluntary, and generated to some degree by referral, leading to self-selection bias. Prison populations were substantially over-represented. One result was an over-estimate of the prevalence of homosexuality among men.  Tukey dismissively said that he would put greater stock in a randomly selected sample of 3 than in 300 selected by Kinsey.

Nonetheless, Sample Size Matters

On the other hand, sample size does matter, even if it is secondary to proper sample selection methods.  As Daniel Kahneman put it in Thinking, Fast and Slow:

The exaggerated faith in small samples is only one example of a more general illusion – we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify.

The smaller the sample, the more it is prone to misinterpretation.  Random variation makes it unreliable as a tool for estimation, and also gives scope for interesting chance events to attract the attention of the investigator.

How Big?

How big should your sample be?  You can find general guidance associated with particular tasks (polling, auditing, behavioral studies), but a more analytical approach exists, based on the principles of statistical inference.

This approach presumes that you are gathering data to investigate a hypothesis, typically concerning the effect some condition or treatment has on subjects, an effect that shows up in a difference between, or among, groups that experience different treatments or conditions.  The basic idea is to gather a sample that is big enough to assure you that, if the effect you are investigating exists, your study will find it. This involves balancing three parameters set by the user:

  • Effect size
  • Level of significance
  • Power

Setting the Parameters

Effect size:  The smaller the effect size you hope to find, the bigger the sample needed.  A useful analogy is finding stars with a telescope – the dimmer the star, the bigger the telescope you need to distinguish it.  “Effect size” is the difference you hope exists in the population(s) you are investigating. For continuous numeric data, it would be expressed as a difference in the means of the distributions.  What does “find” mean? Here it means concluding that there is a statistically significant difference, or effect. For example, if you are testing two different colors for a “buy” button on a web site, finding a difference means that the observed difference between the two groups of web users shown different colors is statistically significant at a pre-chosen level of significance.  
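To make this concrete, here is a minimal sketch of computing a raw effect and its standardized counterpart (Cohen's d) from two groups. The conversion-score numbers for the two button colors are made up purely for illustration:

```python
import math
import statistics

# Hypothetical scores for users shown two button colors
# (illustrative numbers only, not real data)
blue = [4.1, 3.8, 4.5, 4.0, 4.3, 3.9]
green = [4.6, 4.4, 4.9, 4.2, 4.8, 4.5]

# Raw effect: the difference in sample means
raw_effect = statistics.mean(green) - statistics.mean(blue)

# Pooled standard deviation, then the standardized effect size (Cohen's d)
n1, n2 = len(blue), len(green)
s1, s2 = statistics.stdev(blue), statistics.stdev(green)
pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
cohens_d = raw_effect / pooled_sd

print(f"raw effect: {raw_effect:.3f}, Cohen's d: {cohens_d:.2f}")
```

Standardizing by the pooled standard deviation is convenient because power calculations often take the effect size in standard-deviation units rather than in the raw units of measurement.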

Level of Significance:  The “tighter” the definition of statistical significance (e.g. 0.01 instead of 0.05), the bigger the sample needed.  P-values and the whole idea of statistical significance have fallen into some disfavor as a result of their abuse.  As the number of academic researchers seeking to publish papers has risen, the p-value has become a “necessary and sufficient” publishing criterion, opening the door to great numbers of published studies whose only “virtue” is a statistically significant result, and which lack practical significance or proper study design.  Nevertheless, determining sample size requires specifying a level of statistical significance as the criterion for validating a finding.

Power:  Power is the probability of achieving a statistically significant result in a sample study, if the specified effect size is real in the population being studied.  For example, if a medication has a real effect of reducing blood pressure by 10%, and you conduct a study (at your specified significance level) between a medication group and a control group, power is the probability that the study will return a result of “significant.”  Note that the study does not necessarily have to yield a 10% difference between the two groups – rather it simply has to yield a statistically significant difference. The more power you seek, the bigger the sample needed.


Specifying the three parameters is an exercise in tradeoffs.  The smaller the effect you want to be able to find, and the greater the power (probability of finding that effect), the bigger the sample you need.  If your initial goals with respect to these key parameters yield a sample requirement that is beyond your budget or capability, you must compromise something; that is, be willing to set a larger effect size threshold (meaning that you might well miss a desired effect), or you must tolerate a lower power, or both.  The level of statistical significance is not so malleable; it is usually set by external requirements, e.g. regulators or journal publishers who often specify a traditional level of 5%.


Setting the three parameters is a necessary, but not sufficient, condition for determining sample size.  A fourth factor affecting sample size is the variance in the data. This, of course, is not a parameter set by the user.  The greater the variance in the data, the greater the sample size needed to identify a given effect of interest. Thus, any estimate of required sample size must necessarily incorporate an assumption about variance in the data.  This might be estimated from earlier samples of data, or from knowledge about the process or population involved.
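Under common textbook assumptions – a two-sided, two-sample comparison of means with known, equal variance, using the normal approximation – the interplay of the four factors can be sketched in a few lines. This is a rough sketch with illustrative parameter values, not a substitute for dedicated power software:

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means (a simplified textbook formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. ~0.84 for power = 0.80
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    return math.ceil(n)

# Illustrative: detect a half-standard-deviation effect
print(sample_size_per_group(effect=0.5, sd=1.0))
```

Note how the formula captures the tradeoffs described above: halving the effect size you want to detect quadruples the required sample, and raising either the power or the variance raises the sample size as well.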

Putting it All Together

Once you have some estimate of the variance in the data, you can visualize the procedure to calculate sample size via a resampling simulation procedure, illustrated here for the case of two samples with continuous numeric data:

  1. Specify the desired effect size, level of significance, and power
  2. Specify two random data generators that produce normally distributed data from populations whose means differ by the desired effect size, with variance as estimated from prior information*
  3. Generate two samples of size n, one from each of the data generators
  4. Conduct a significance test on the two samples; record whether the difference is significant
  5. Repeat steps 3-4, say, 1000 times; note what proportion of the time the difference is significant – this is the estimated power
  6. If the power is right on target, n is the appropriate sample size; if the power is too low, increase the sample size; if it is higher than needed, you can reduce it
  7. Iteratively try different values of n until the power is where you need it

*If you actually have real data appropriate to the study, you can substitute two bootstrap generators (one shifted by the effect size) for the normally distributed data generators.
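The steps above can be sketched in a short simulation. This version uses NumPy for the normal data generators and SciPy's two-sample t-test for the significance test; the effect size, variance, and sample sizes are illustrative:

```python
import numpy as np
from scipy.stats import ttest_ind

def estimated_power(n, effect=0.5, sd=1.0, alpha=0.05, n_sims=1000, seed=42):
    """Steps 2-5: repeatedly draw two normal samples whose means differ
    by `effect`, test each pair, and return the share of significant
    results -- the estimated power."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, sd, n)          # control group
        b = rng.normal(effect, sd, n)       # treatment group, shifted by the effect size
        if ttest_ind(a, b).pvalue < alpha:  # step 4: significance test
            hits += 1
    return hits / n_sims

# Steps 6-7: try increasing values of n until estimated power reaches the target
target = 0.80
for n in (20, 40, 60, 80):
    p = estimated_power(n)
    print(f"n = {n}: estimated power = {p:.2f}")
    if p >= target:
        break
```

With real data in hand, the two `rng.normal(...)` calls could be replaced by bootstrap resampling from the observed data (shifting one resample by the effect size), as the footnote above suggests.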

In most cases, power and sample size will be calculated by software using analytic formulas, though the bootstrap simulation approach can be used where the situation or the statistic of interest does not fit the data scenarios the software supports.