Estimating the number of samples required

ERDC TN-DOER-C15

July 2000

to as stratified random sampling (Lubin, Williams, and Lin 1995). The sampling method will

ultimately be selected based on the greatest confidence in capturing representative data, quality and

availability of existing information on which to base the method selection, and cost considerations.

Additional discussion can be found in U.S. Environmental Protection Agency/U.S. Army Corps of

Engineers (1995).

Estimating the number of samples required. Ultimately, the number of samples obtained will

be determined by cost considerations. The upper threshold will almost certainly be set by the

number of samples required to determine the desired parameter (e.g., contaminant concentrations,

percent sand) with a specified degree of confidence. If a normally distributed sample can be

assumed, then from the empirical rule, approximately 95 percent of the values will lie within 1.96 s of

the mean, where s is the standard deviation of the sample. An acceptable margin of error can then

be used to estimate the number of samples required. For example, to estimate the mean concentration

of a constituent at a selected depth to within 10 mg/kg at the 95 percent confidence level, set the

margin of error of the mean, 1.96 s/√n for n samples, equal to 10 mg/kg:

$$1.96\,\frac{s}{\sqrt{n}} = 10 \qquad (1)$$

Solving for n gives the number of samples required to determine the mean within 10 mg/kg, at the

95 percent confidence level. Higher or lower confidence levels can be used. Further discussion

can be found in Mendenhall and Beaver (1994). The obvious disadvantage to this method is that

some idea of the variability of the data to be obtained is required prior to sampling. One could use

results from analysis of selected samples taken within the CDF to estimate s and determine how

many additional samples should be analyzed. (The standard deviation for the subsample can be

calculated directly, or the range of the data can be used to estimate s (Appendix I).) If no data are

available, an action level can be used as an estimated value for the variance. Such an iterative

approach is described by Lubin, Williams, and Lin (1995) using a mathematical relation for

estimating sample numbers that does not use the mean, but does incorporate acceptable error levels

(α and β). However, environmental data are typically highly variable (large s), which may result

in unrealistically high numbers of samples required. Additionally, these approaches require the

assumption of a normal distribution, which is not typical of most environmental data. The geometric

alternative variance can be used to estimate required sample size for lognormally distributed data;

this approach is further described in Lubin, Williams, and Lin (1995). Another alternative is to

sample sequentially, evaluating data as they are generated and continuing to sample until a definitive

threshold is achieved at a desired confidence level. The sequential approach and additional methods

for estimating required sample numbers for different grid configurations and confidence levels are

described in Lubin, Williams, and Lin (1995).
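
As an illustration of Equation 1, the following Python sketch solves for n, given an estimate of s and an acceptable margin of error. The function name required_samples, its parameters, and the pilot concentrations are hypothetical and are not taken from the cited references. The calculation assumes normally distributed data and uses the normal critical value (about 1.96 at the 95 percent confidence level) even though s is itself estimated from a small pilot set, so the result is best read as an approximate planning figure.

```python
import math
from statistics import NormalDist, stdev


def required_samples(s, margin, confidence=0.95):
    """Smallest whole n with z * s / sqrt(n) <= margin (Equation 1 solved for n)."""
    if s <= 0 or margin <= 0:
        raise ValueError("s and margin must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Two-sided critical value of the standard normal distribution,
    # approximately 1.96 when confidence is 0.95.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Rearranging z * s / sqrt(n) = margin gives n = (z * s / margin) ** 2;
    # round up because a fraction of a sample cannot be collected.
    return math.ceil((z * s / margin) ** 2)


# Hypothetical pilot concentrations (mg/kg), for illustration only.
pilot = [112.0, 95.0, 140.0, 88.0, 131.0, 104.0]

s_pilot = stdev(pilot)  # sample standard deviation of the pilot data
n_total = required_samples(s_pilot, margin=10.0)
n_additional = max(0, n_total - len(pilot))

print(f"s = {s_pilot:.1f} mg/kg, total n = {n_total}, additional = {n_additional}")
```

Because n grows with the square of s/margin, doubling the variability or halving the acceptable margin of error roughly quadruples the number of samples, which is why highly variable environmental data can lead to impractically large sample counts.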

Several of the nonparametric data analysis methods require a minimum number of samples and

observations to be valid, or require equally paired numbers of observations between samples to be

compared. For example, the Kruskal-Wallis H-test (nonparametric ANOVA) requires at least three

samples with at least three observations per sample. When there are more than six observations per

sample, the distribution of the H statistic is well approximated by the chi-square distribution

(McBean and Rovers 1998). The STATSS (Lubin, Williams, and Lin 1995) guidance document

provides simple guidance for determining the number of samples required for a specified error level

or confidence interval.
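
The minimum-size conditions for the Kruskal-Wallis H-test noted above can be checked before any analysis is run. The Python sketch below is illustrative only: the function names and the example data are hypothetical, the H statistic is computed from the standard rank-sum formula without a correction for ties, and the thresholds (at least three samples, at least three observations per sample, more than six observations per sample for the chi-square approximation) are those stated in the text.

```python
def average_ranks(values):
    """Rank values starting at 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(samples):
    """Return (H, chi_square_ok) for a list of samples (no tie correction)."""
    if len(samples) < 3:
        raise ValueError("at least three samples are required")
    if any(len(s) < 3 for s in samples):
        raise ValueError("each sample needs at least three observations")

    pooled = [x for s in samples for x in s]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / sample size) over all samples.
    total = 0.0
    start = 0
    for s in samples:
        rank_sum = sum(ranks[start:start + len(s)])
        total += rank_sum ** 2 / len(s)
        start += len(s)

    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    # The chi-square approximation is treated as adequate only when every
    # sample has more than six observations.
    chi_square_ok = all(len(s) > 6 for s in samples)
    return h, chi_square_ok


# Hypothetical percent-sand values from three sampling locations.
h, chi_square_ok = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h, chi_square_ok)
```

When chi_square_ok is False, H would be compared against exact tabulated critical values rather than the chi-square distribution.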