2.2: Estimating Population Characteristics in Surveys
To provide a practical illustration of the different methods of survey sampling, assume that the investigator wishes to estimate the percentage of adult cows (beef and dairy) in a large geographic area that have antibodies to enzootic bovine leukosis virus. The unit of concern is the cow, and the true but unknown percentage of reactor cows in the target population is the parameter to be estimated. N represents the number of cows in the population and n the number of cows in the sample.
2.2.1 Nonprobability Sampling
Nonprobability sampling is a collection of methods that do not rely on formal random techniques to identify the units to be included in the sample. Some nonprobability methods include judgment sampling, convenience sampling, and purposive sampling.
In judgment sampling, representative units of the population are selected by the investigator. In convenience sampling, the sample is selected because it is easy to obtain; for example, local herds, kennels, or volunteers may be used. Using convenience or judgment sampling often produces biased results, although some people believe they can select representative samples. This drawback and the inability to quantitatively predict the sample's expected performance suggest these methods should rarely be used for survey purposes. In purposive sampling, the selection of units is based on known exposure or disease status. Purposive sampling is often used to select units for analytic observational studies, but it is inadequate for obtaining data to estimate population parameters.
Examples of the application of nonprobability sampling to estimate the prevalence of enzootic bovine leukosis virus include the selection of cows from what the investigator thinks are representative herds and the selection of cows from herds owned by historically cooperative or nearby farmers.
The following sampling methods belong to a class known as probability samples. The discussion assumes that sampling is performed without replacement; hence an individual element can only be chosen once.
2.2.2 Simple Random Sampling
In simple random sampling, one selects a fixed percentage of the population using a formal random process; for example, flipping a coin, rolling a die, drawing numbers from a hat, or using random number generators or random number tables. ("Random" is often used to describe a variety of haphazard, convenience and/or purposive sampling methods, but here it refers to the formal statistical procedure.) Strictly speaking, a formal random selection procedure is required for the investigator to calculate the precision of the sample estimate, as measured by the standard error of the mean. In practice, formal random sampling provides the investigator with assurance that the sample should be representative of the population being investigated, and for the parameter being estimated, confidence intervals are calculated on this premise. Despite mathematical and theoretical advantages, simple random sampling is often more difficult to use in the field than systematic sampling (described in 2.2.3). Consider the procedure for selecting a sample of 10% of feedlot steers as they pass through a handling facility. In simple random sampling, a list of randomly obtained numbers (representing, for example, the animals' identification, i.e., ear tags, or the order of the animals through a handling facility) would be prepared beforehand to identify the animals for the sample. The practicalities of using such a list in a field situation (e.g., losing count of animals and/or continuously having to refer to a list of numbers) may make this type of sampling inappropriate.
To obtain a simple random sample of cows for estimating the prevalence of enzootic bovine leukosis antibodies, one would obtain a list of n random numbers between 1 and N, each number identifying a cow in the sampling frame. Thus the cows selected would be distributed randomly throughout the sampled population.
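The selection just described can be sketched in a few lines of Python. This is only an illustration under the assumption that the cows in the sampling frame are numbered 1 to N; the function name `simple_random_sample`, the seed, and the example sizes are hypothetical and not part of the original text.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Return n distinct identifiers drawn at random from 1..N (without replacement)."""
    rng = random.Random(seed)
    # random.sample never repeats an element, matching sampling without replacement
    return sorted(rng.sample(range(1, N + 1), n))

# Hypothetical frame of 8000 cows and a sample of 800
sample_ids = simple_random_sample(8000, 800, seed=1)
print(len(sample_ids), sample_ids[:5])
```

Sorting the identifiers is a convenience for field use only; it does not affect which cows are selected.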
2.2.3 Systematic Random Sampling
In systematic sampling, the n sampling units are selected from the sampling frame at regular intervals (e.g., every fifth farm or every third animal); thus the interval k is 5 or 3, respectively. If k is fixed initially, n will vary with N; whereas if n is fixed initially, k becomes the integer nearest to N/n. When systematic methods are used, the starting point in the first interval is selected on a formal random basis. Systematic sampling is a practical way to obtain a representative sample, and it ensures that the sampling units are distributed evenly over the entire population. There are two major disadvantages of this method. First, it is possible that the characteristic being estimated is related to the interval itself. For example, in estimating the prevalence of respiratory disease in swine at slaughter, one might systematically select a day of the week (e.g., Wednesday) to examine lungs. If swine slaughtered on Wednesdays were not representative of swine slaughtered on the other days of the week (e.g., because of local market customs), a biased result would be obtained. The second disadvantage is the difficulty of quantitatively assessing the variability of estimates obtained by systematic random sampling. In practice, one uses methods appropriate for simple random sampling to obtain these estimates.
If \(N/k\) is not an integer, some bias will result in the sample estimate because some animals (elements) will have more impact on the mean than others. This is of little concern if N is large and k is small relative to N. To prevent this bias, select the desired k and draw a random number (RN) between 1 and N; then divide RN by k and note the remainder. This remainder identifies the starting point between 1 and k (i.e., a remainder of 0 means the starting point is the kth individual, a remainder of 2 means the second individual, and so forth) (Levy and Lemeshow 1980, p 76).
In sampling to estimate the prevalence of antibodies to enzootic bovine leukosis virus, using a list of all N cows in the area in question (the sampling frame), the initial animal to be tested would be selected randomly from the first N/n animals. Subsequently, every kth cow would be tested. In selecting 10% of steers, one could randomly select a number between 1 and 10 (say 6) and then the 6th, 16th, 26th, etc. animal through the facility would be included in the sample.
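A minimal Python sketch of systematic selection follows, using the remainder rule for the starting point described above. It assumes the frame is numbered 1 to N; the function name `systematic_sample` and the example values are hypothetical.

```python
import random

def systematic_sample(N, k, seed=None):
    """Select every k-th unit from a frame numbered 1..N, with a random start."""
    rng = random.Random(seed)
    rn = rng.randint(1, N)      # random number between 1 and N (inclusive)
    start = rn % k or k         # a remainder of 0 means start at the k-th unit
    return list(range(start, N + 1, k))

# Selecting 10% of steers: k = 10; a start of 6 would give 6, 16, 26, ...
print(systematic_sample(N=95, k=10, seed=3)[:5])
```

Because N = 95 is not a multiple of k here, the sample contains either 9 or 10 animals depending on the starting point.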
2.2.4 Stratified Random Sampling
In stratified sampling, prior to selection, the sampling frame is divided into strata based on factors likely to influence the level of the characteristic (e.g., prevalence of antibodies) being estimated. Then a simple random or systematic random sample is selected within each stratum.
Stratified sampling is more flexible than simple random sampling because a different sampling percentage can be used in the various strata (e.g., 2% in one stratum and 5% in another). Also, the precision of the sample estimate may be improved, because only the within-stratum variation contributes to the variation (standard error) of the mean in stratified sampling; whereas in simple random sampling both the within-stratum and the between-stratum variation are present. A graphic illustration of this feature is shown in Figure \(\PageIndex{1}\).
In simple random sampling, the variability of the estimate of prevalence has components related to both within-herd type and between-herd type variation in prevalence. In stratified random sampling, the variability of the estimate has components related to only the within-herd type variation in prevalence; hence its variability is expected to be less than that obtained in simple random sampling. For example, in Figure 2.1 the variability of the prevalence in beef herds, about the mean for beef herds, and the variation of the prevalence in dairy herds, about the mean for dairy herds, are much smaller than if type of herd is ignored and the variation of herd disease prevalence about the overall mean is calculated. Variation (see Table 2.1) of the mean (estimate of prevalence) is calculated using the standard formula for the variance or its square root, the standard deviation. The standard deviation of a mean is referred to as a standard error.

The obvious disadvantage of stratified sampling is that the status of all sampling units, with respect to the factors forming the strata, must be known prior to drawing the sample. In general, the number of factors used for stratification should be limited to those likely to have a major impact on the value of the characteristic (e.g., prevalence of antibodies) being estimated.
As an example of this method, and given that dairy cows are likely to have a higher rate of enzootic bovine leukosis antibodies than beef cows, one should obtain a more precise estimate of the population mean (prevalence) if strata were formed based on type of cow. Also, if 60% of the cow population N comprised dairy cows, 60% of the sample n should be dairy cows. This is called proportional weighting, and it keeps the arithmetic involved in calculating the sample statistic simple. Cows would be selected within each stratum by using simple random or systematic random sampling methods.
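Proportional weighting amounts to a one-line calculation, sketched below in Python. The stratum sizes and total sample size are hypothetical figures chosen to match the 60% dairy example; the function name `proportional_allocation` is likewise an assumption.

```python
def proportional_allocation(stratum_sizes, n):
    """Split a total sample n across strata in proportion to stratum size.

    Rounding may leave the total one or two units away from n in general.
    """
    N = sum(stratum_sizes.values())
    return {name: round(n * size / N) for name, size in stratum_sizes.items()}

# Hypothetical population: 60% dairy and 40% beef cows; total sample of 800
alloc = proportional_allocation({"dairy": 4800, "beef": 3200}, 800)
print(alloc)  # {'dairy': 480, 'beef': 320}
```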
In the sampling methods discussed, the sampling unit and the unit of concern are the same (i.e., a cow). These methods are well suited for sampling from laboratory files or from relatively small groups of identifiable animals. However, the practical difficulty of obtaining a list (the sampling frame) of all cows in a large geographic area such as a province or state is a drawback. Additionally, with stratified sampling, the appropriate characteristics of each sampling unit must be identified (e.g., as dairy or beef in the previous example). To overcome these problems, allow flexibility in sampling strategy, and decrease the cost of the sampling, it is often easier to initially sample herds or other natural aggregates of animals within the area, although individual animals are the units of concern. Two of the more common sampling methods used for this purpose are cluster and multistage sampling.
2.2.5 Cluster Sampling
In cluster sampling, the initial sampling unit is larger than the unit of concern (usually the individual). Clusters of individuals often arise naturally (e.g., litters, pens, or herds) or they may be formed artificially (e.g., geographic clusters). Administrative units such as counties may also be used as artificial clusters for sampling purposes. The clusters (sampling units) can be selected by systematic, simple, or stratified random methods; all individuals within the sampling units are tested.
Sometimes the group, be it a herd, pen, or litter, is the unit of concern, and therefore is not considered to be a cluster. Some examples of this situation are investigations to classify herds as to whether they are infected with enzootic bovine leukosis; estimation of the mean somatic cell count for dairy herds using bulk tank milk samples; and estimation of the mean herd milk production or days to conception.
In the bovine leukosis example, a cluster sample could be obtained by taking a simple random sample of all herds in the sampled population and testing all cows within the selected herds. From the formula in Table 2.1, note that the variability of the mean of the cluster sample is a function of the between-herd variance and the number of clusters m in the sample, not the number of animals in the sample.
2.2.6 Multistage Sampling
This method is similar to cluster sampling except that sampling takes place at all stages. As an example of two-stage sampling, one would begin as in cluster sampling by selecting a sample of the primary units (e.g., herds) listed in the sampling frame. Then within each primary unit, a sample of secondary units (e.g., animals) would be selected. Thus the difference between cluster and two-stage sampling is that subsampling within the primary units is conducted in the latter method.
Multistage sampling is used because of its practical advantages and flexibility. The number of primary (\(n_1\)) and secondary units (\(n_2\)) may be varied to account for different costs of sampling primary versus secondary units as well as the variability of the characteristic being estimated between primary units and between secondary units within primary units (see 2.2.9).
To continue with the bovine leukosis example, one could proceed in the same manner as cluster sampling, but after selection of the herds (the primary units), a simple or systematic random sample of cows within each herd (the secondary units) would be selected. This process could be extended to three-stage sampling by selecting small geographic areas as the primary units, selecting herds within these areas as secondary units, and finally selecting animals within the herds as tertiary units. Whenever possible, one should select each stage's sampling units with probability proportional to the number of individuals they contain. This minimizes the error of estimate and stabilizes the sample size. The main disadvantage of cluster and multistage samples is that more individuals may be required in the sample to obtain the same precision as would be expected if individuals could be selected with simple random sampling.
As an illustration of multistage sampling, suppose that in the bovine leukosis example there are M farms (say 120) and N animals (say 8000) in the population. The objective is to estimate the proportion of animals having enzootic bovine leukosis antibodies using a sample size of 800 (n = 800). The sampling frame would have the format shown in Table 2.2.
Suppose the number of primary sampling units (farms) to be selected is 40 (\(n_1\)) and, on average, 20 (\(n_2\)) secondary units (animals) will be selected from within each primary unit. (Note that \(n_1 \times n_2 = n\).) If the number of animals in each herd was unknown, one could take a simple or systematic random sample of 40 herds and randomly select a fixed percentage (i.e., 30% = Mn/(mN), where m is the number of herds sampled) of the animals in each herd for testing. When the number of animals in each herd is known, a more optimal procedure is to sample the primary units with probability proportional to their size, and then to select a fixed number of animals from each herd. In this example, the initial step is to randomly select 40 numbers within the range of 1 to 8000. Each of the random numbers will identify a farm according to the cumulative number column. Subsequently, 20 animals may be randomly selected from each farm. Both of these procedures give each individual the same probability of being selected. Since it is assumed that sampling is without replacement, if a farm is identified twice, another should be selected randomly. (Technically it would be better to randomly select twice the number of animals from that herd.) If fewer than 20 animals are present in a specified herd, the practical solution is to test all available animals.
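The claim that both procedures give each animal the same selection probability can be checked with the figures of the example. The notation is partly assumed: m denotes the number of herds sampled (m = 40), and \(H_i\), a symbol not used in the text, denotes the number of animals in herd i. The second line is approximate, since it ignores the special handling of herds identified twice and of herds with fewer than 20 animals.

\[\begin{aligned}
\text{fixed percentage:} \quad & \frac{m}{M} \times \frac{M n}{m N}=\frac{40}{120} \times 0.30=\frac{n}{N}=0.10 \\
\text{proportional to size:} \quad & n_1 \frac{H_i}{N} \times \frac{n_2}{H_i}=\frac{n_1 n_2}{N}=\frac{40 \times 20}{8000}=0.10
\end{aligned}\]

In both cases the herd size cancels, so every animal has the same 1-in-10 chance of entering the sample.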
A modification of this method to ensure that each farm may be selected only once is the use of systematic random techniques. For example, the selection interval k is found by dividing the total number of animals N by \(n_1\) (in this case, k = 8000/40 = 200). A number is then selected randomly from the range 1 to k (e.g., 151). The remaining 39 numbers (351, 551, etc.) would identify the farms to include in the sample. This process will select a farm only once, providing the interval k is greater than the number of animals on the largest farm.
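The systematic version of sampling with probability proportional to size can be sketched in Python as follows. The herd sizes are invented so that M = 120 and N = 8000 as in the example; the function name `pps_systematic` is hypothetical, and herds are identified by their zero-based position in the list rather than by a farm number.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_systematic(herd_sizes, n1, seed=None):
    """Pick n1 herds with probability proportional to size, using a fixed interval k."""
    cumulative = list(accumulate(herd_sizes))   # the "cumulative number" column
    N = cumulative[-1]
    k = N // n1                                 # selection interval
    rng = random.Random(seed)
    start = rng.randint(1, k)                   # random start in 1..k
    numbers = [start + i * k for i in range(n1)]
    # Each number identifies the first herd whose cumulative count reaches it.
    return [bisect_left(cumulative, x) for x in numbers]

# Hypothetical frame: 40 herds of 100 cows and 80 herds of 50 cows (M = 120, N = 8000)
herd_sizes = [100] * 40 + [50] * 80
chosen = pps_systematic(herd_sizes, n1=40, seed=7)
print(len(chosen), len(set(chosen)))   # 40 40
```

No herd is chosen twice here because the interval k = 200 exceeds the largest herd size (100), which is the condition stated above.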
2.2.7 Calculating the Estimate
The point estimate of the prevalence of reactors in the population, the parameter P(T+), is the test-positive proportion in the sample, the statistic p(T+) or \(\hat{p}\). To calculate this statistic, the test positives are counted and their number is divided by the sample size. (This assumes a proportionally weighted sample when stratified sampling is used, which is self-weighting in terms of the mean. The same approach is also used for estimates obtained from cluster or multistage samples. See Snedecor and Cochran 1980 for details.) Calculating the estimate of a population mean (say average milk production) is performed in an analogous manner (see 3.6).
In the enzootic bovine leukosis example, if 125 of 2000 cows were test-positive, the estimate of the prevalence of reactors in the population would be \(\hat{p} = 125/2000 = 0.063\) or 6.3%. If a simple random sample or systematic random sample were used to obtain the sample, the variability of the point estimate would be:
\[\begin{aligned}
\text{Variance: } V(\hat{p}) & =\hat{p}(1-\hat{p}) / n=0.063 \times 0.937 / 2000=0.295 \times 10^{-4} \\
\text{Standard error: } SE(\hat{p}) & =V(\hat{p})^{1 / 2}=0.0054 \ (0.54 \%)
\end{aligned}\]
These estimates could be written as 6.3% ± 0.5% (SE). With moderately large sample sizes, 68% of all possible sample means will be within 1 standard error of the true mean, 95% within 1.96 standard errors, and 99% within 2.6 standard errors. The calculation of a confidence interval as an extension of the above facts is described in 3.6. More complex calculations are required to determine the variability of means obtained from cluster or two-stage samples (see Table 2.1). Since the clusters are rarely of equal size, the reader can use the formula shown in Table 2.1 for the initial calculations, but should consult one of the reference texts for details of more accurate methods.
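The calculation for a simple or systematic random sample can be reproduced with a short Python sketch. The function name `prevalence_estimate` is hypothetical. The code carries the unrounded estimate 0.0625 through the calculation, whereas the text rounds to 0.063 first, so the variance differs slightly in the third significant figure while the standard error agrees to the precision quoted.

```python
import math

def prevalence_estimate(positives, n):
    """Point estimate, variance and standard error of a proportion under simple random sampling."""
    p_hat = positives / n
    variance = p_hat * (1 - p_hat) / n
    return p_hat, variance, math.sqrt(variance)

p_hat, var, se = prevalence_estimate(125, 2000)
print(f"{p_hat:.4f} {se:.4f}")   # 0.0625 0.0054

# Approximate 95% limits: the estimate plus or minus 1.96 standard errors
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
```

This formula does not apply to cluster or multistage samples, for which the text refers to Table 2.1.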
2.2.8 Sample Size Considerations
Accurate determinations of the sample size required for a survey can be quite detailed, and most complex surveys will require the assistance of a statistician. For less complex surveys one of the following formulas should provide suitable estimates. To determine the sample size n necessary to estimate the prevalence of reactors P(T+) in a population (the mean of a qualitative variable, morbidity rates or mortality rates, see 3.2 and 3.3), the investigator must provide an educated guess of the probable level of reactors \(\hat{P}\) (read "P hat"), and must specify how close to P(T+) the estimate should be.
Suppose the available evidence suggests that approximately 30% (\(\hat{P} = 0.3\)) of the cow population will have antibodies to enzootic bovine leukosis. Also, assume the investigator wishes the survey estimate to be within 6% of the true level 95% of the time. (6% is termed the allowable error, or required precision, and is represented in the following formula by L.) Then the required sample size is:
\[\begin{aligned}
n & =4 \hat{P} \hat{Q} / L^2 \quad \text { where } \hat{Q}=1-\hat{P} \\
& =4 \times 0.3 \times 0.7 / 0.06^2=0.84 / 0.0036=233
\end{aligned}\]
Thus approximately 230 cows would be needed for the survey.
In general, the number of animals in the population has little influence on the required sample size except when n is greater than 0.1N. For example, if the herd contained only 200 cows (N = 200), the required number of cows is found using the reciprocal of 1/n* + 1/N, where n* is the above sample size estimate. In this instance, the number required to obtain the same precision is the reciprocal of 1/233 + 1/200 = 1/108; thus the required sample is approximately 108 animals (Cannon and Roe 1982).
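Both steps, the large-population formula and the adjustment for a small population, are easy to script. The sketch below is illustrative only; the function names `sample_size_proportion` and `adjust_for_population` are hypothetical.

```python
def sample_size_proportion(p_guess, L, z_squared=4.0):
    """n = z^2 * P * Q / L^2 for a large population (z^2 = 4 approximates 1.96^2)."""
    return z_squared * p_guess * (1 - p_guess) / L ** 2

def adjust_for_population(n_star, N):
    """Adjustment for a small population: the reciprocal of 1/n* + 1/N."""
    return 1 / (1 / n_star + 1 / N)

n_star = sample_size_proportion(0.3, 0.06)   # about 233
n_adj = adjust_for_population(233, 200)      # about 108
print(round(n_star), round(n_adj))           # 233 108
```

Trying other values of the guessed prevalence and the allowable error L shows how quickly the required sample grows as L shrinks.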
When determining the sample size necessary to estimate the mean of a quantitative variable (e.g., production parameters, see 3.6), the investigator needs to supply an estimate of the standard deviation or variance of that variable in the target population and specify how close to the mean the sample estimate should be. Suppose reproductive efficiency as measured by the calving-to-conception interval is the event of interest. Assume that the available evidence suggests that the standard deviation of this interval is 20 days, and the investigator wishes the sample to provide an estimate within 5 days of the true average 95% of the time. Then S = 20 and L = 5, and the required sample size is:
\[n=4 S^2 / L^2=4 \times 20^2 / 5^2=1600 / 25=64\]
Thus approximately 64 cows are required for the survey. The number 4 in the previous formulas is the approximate square of Z = 1.96, which provides a 95% confidence level. If the investigator wished to be 99% certain that the results would be within ± L of the true level, 6.6 (the approximate square of Z = 2.56) should be substituted for 4. The reader is encouraged to experiment with different values in each of the above formulas to assist in understanding the consequences of these changes. In using the above formulas, it is assumed that the sampling unit is the same as the unit of concern. When using cluster or multistage sampling, an upward adjustment in the sample size may be required to obtain the desired precision in the estimate. If the disease is not very contagious and/or the within-primary-unit correlation coefficient is small, a two to three times increase in the sample size should be appropriate. For very contagious diseases, the necessary sample size may have to be increased five to seven times (Leech and Sellers 1979). These increases are based on rules of thumb, and more accurate formulas as described in 2.2.9 should be used when the appropriate information on the within- and between-herd variances is available.
2.2.9 Cost Considerations in Survey Design
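To experiment with the multiplier and with the rule-of-thumb increase for cluster or multistage designs, a sketch such as the following may help. The function name `sample_size_mean` and its `inflation` argument are hypothetical conveniences, not part of the original formulas.

```python
def sample_size_mean(S, L, z=1.96, inflation=1.0):
    """n = z^2 * S^2 / L^2, optionally multiplied by a rule-of-thumb inflation factor."""
    return inflation * (z * S / L) ** 2

print(round(sample_size_mean(20, 5, z=2.0)))               # 64, using z^2 = 4 as in the text
print(round(sample_size_mean(20, 5, z=1.96)))              # 61, using the unrounded 95% multiplier
print(round(sample_size_mean(20, 5, z=2.576)))             # 106, for 99% confidence
print(round(sample_size_mean(20, 5, z=2.0, inflation=3)))  # 192, a threefold increase
```

The 99% figure uses Z = 2.576; substituting 6.6 for 4 as the text suggests gives essentially the same answer.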
Frequently, the investigator must perform the sampling under monetary as well as practical and biologic constraints. Thus, rather than only specifying the precision of the estimate, the investigator may seek to obtain the highest precision for a specified cost or, conversely, the least cost for a specified precision.
Simple probability sampling procedures are not particularly flexible in terms of meeting monetary constraints, other than altering (usually reducing) the total number of sampling units studied. However, stratified sampling allows the investigator to select different numbers of units from different strata, depending on the relative costs associated with sampling in each stratum. The basic rule is to reduce the number of samples in strata with high sampling costs and to increase the number in strata with lower sampling costs. The optimal stratified sample will have stratum weights proportional to \(N_j S_j / C_j^{1/2}\), where \(N_j\) is the number in the population in stratum j, \(S_j\) is the standard deviation of the parameter being measured in stratum j, and \(C_j\) is the cost of sampling in stratum j. If the resulting sample is not proportionally weighted according to the population structure, the calculation of the sample mean should be done using the weighting formula in Table 2.1.
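As an illustration of these stratum weights, the following Python sketch allocates a total sample across two strata. All numbers (stratum sizes, standard deviations, and costs) are invented for the example, and the function name `optimal_allocation` is hypothetical.

```python
import math

def optimal_allocation(strata, n):
    """Allocate n across strata in proportion to N_j * S_j / sqrt(C_j)."""
    raw = {name: N_j * S_j / math.sqrt(C_j) for name, (N_j, S_j, C_j) in strata.items()}
    total = sum(raw.values())
    return {name: n * w / total for name, w in raw.items()}

# Hypothetical strata: (population size, standard deviation, cost per unit sampled)
strata = {"dairy": (4800, 0.5, 4.0), "beef": (3200, 0.3, 16.0)}
weights = optimal_allocation(strata, 800)
print({name: round(v) for name, v in weights.items()})
```

In this invented case the beef stratum is both less variable and more costly to sample, so it receives far fewer than the 320 animals that proportional weighting would assign.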
Cluster sampling is often used because of practical difficulties in obtaining a sampling frame in which the individual is the sampling unit. Thus circumventing these "practical difficulties" by using cluster sampling is really a reaction to economic constraints. For example, it may cost less to sample 4000 swine using cluster sampling than to sample 1000 using random sampling, although the precision of the estimate obtained by the latter may be greater than that obtained using cluster sampling with more individuals.
The most flexible sampling method to take account of cost factors is multistage sampling. In two-stage sampling one may vary the number of primary and secondary units selected according to the costs of sampling primary units (e.g., herds) as well as the costs of sampling secondary units (e.g., animals within a herd). In the enzootic bovine leukosis example, the cost of traveling to a herd to obtain samples may be large relative to the cost of obtaining a sample from an individual cow once on the farm. This would suggest an increase in the number of secondary units (cows) and a decrease in the number of primary units (herds) to reduce the total cost of sampling. The balance between primary and secondary sampling units can be investigated formally. If c is the total monies available for sampling, \(c_1\) the cost of sampling primary units, and \(c_2\) the cost of sampling secondary units, the relationship between these costs and the numbers of primary and secondary units is:
\[c=c_1 n_1+c_2 n_1 n_2\]
The appropriate number of secondary units n2 to select, minimizing costs for a given precision, or vice-versa (Snedecor and Cochran 1980), is found using:
\[n_2=\left(c_1 s_2^2 / c_2 s_1^2\right)^{1 / 2}\]
The number of primary units \(n_1\) may then be found using the cost equation above, since c, \(c_1\), \(c_2\), and \(n_2\) are known. If \(c_1 = c_2\), then \(n_2\) is merely a function of the respective variances; namely, \(n_2 = (s_2^2/s_1^2)^{1/2}\).
Suppose a person wished to estimate the blood globulin level in mature dairy cows. Assume that the total money available for the project (c) is $10,000, that it will cost an average of $100 per farm (\(c_1\)) to sample each herd (this includes travel costs), and that the cost per cow (\(c_2\)) is $10 once at the herd (this includes the cost of blood vials, needles, technician time, and laboratory analysis). Assume also that the between-herd variability (\(s_1\)) in globulin concentration is 8 g/l and the within-herd (cow-to-cow) variability (\(s_2\)) is 4 g/l. On this basis,
\[n_2=\left(100 \times 4^2 / 10 \times 8^2\right)^{1 / 2}=2.5^{1 / 2}=1.6\]
Since \(n_2\) should be an integer, round 1.6 to 2 cows per herd. Now, solve the initial cost equation for \(n_1\).
\[\begin{aligned}
10,000 & =100 n_1+10 \times 2 n_1=120 n_1 \\
n_1 & =83
\end{aligned}\]
Thus, approximately 80-85 herds would be used, taking 2 cows per herd.
Despite the high cost per herd, the relatively large between-herd variability dictates that a large number of herds are required. In this instance, if \(c_1 = c_2\), the ratio \((s^2_2/s^2_1)^{1/2}\) equals 0.5, indicating that one animal (the minimum number) per herd should be selected.
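The two-stage calculation can be wrapped in a small Python function for trying other costs and variances. This is a sketch under the assumptions of the example; the function name `two_stage_design` is hypothetical, and rounding \(n_2\) to the nearest integer with a minimum of one is a simplification chosen here.

```python
import math

def two_stage_design(c, c1, c2, s1, s2):
    """Return (n1, n2): number of herds and cows per herd for a total budget c.

    c1, c2: cost per primary and per secondary unit.
    s1, s2: between- and within-primary-unit standard deviations.
    """
    n2_exact = math.sqrt((c1 * s2 ** 2) / (c2 * s1 ** 2))
    n2 = max(1, round(n2_exact))     # nearest integer, but at least one animal per herd
    n1 = c / (c1 + c2 * n2)          # solved from c = c1*n1 + c2*n1*n2
    return n1, n2

n1, n2 = two_stage_design(c=10_000, c1=100, c2=10, s1=8, s2=4)
print(int(n1), n2)                                   # 83 2
print(two_stage_design(10_000, 10, 10, 8, 4)[1])     # 1 cow per herd when c1 = c2
```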