Random sampling: cluster sampling

With this post dedicated to cluster sampling, we conclude our first block of posts on random sampling. With our next post, we will launch into nonrandom sampling methods, which are used most commonly in online research.

Cluster sampling is a method that makes the most of groups or clusters in the population that correctly represent the total population in relation to the characteristic that we wish to measure. In other words, all of the variability that exists in a population is contained within the population. When this is the case, we can select just a few of these clusters to conduct our study.

Let’s look at this method from another point of view. In most of the methods we’ve seen so far, the sampling units have coincided with the units to be studied (individuals). With cluster sampling, however, the sampling units are groups of units to be studied, which can be very beneficial when it comes to minimizing the cost of the sampling process. Of course, there’s a trade-off: this technique usually entails less precision, since there is a lack of heterogeneity among the clusters.

THE SAMPLING PROCESS

The first step in applying this method is defining the clusters. This involves identifying a characteristic that enables us to divide the population into discrete groups (with no overlap) and to include every individual in a group (none can be left out) in such a way that there is no difference between the groups in relation to what we want to measure. Once we have defined these clusters, we can randomly select a few to study.

One characteristic often used to define clusters is geography. For example, if we want to study what percent of the Argentine population smokes, we could divide the entire population into provinces and study just a few of them. Provided that we have no reason to think that smoking rate changes from one province to another, this solution enables us to concentrate our sampling efforts on a single geographic location. If we’re going to conduct the study through personal interview, this could amount to major savings on travel expenses.

Once we have defined the clusters, the next step is to select the clusters that are going to be studied through either simple random sampling or systematic sampling.

Finally, once we have selected the clusters to be studied, we can research all of the subjects that make up the clusters, or even apply a new sampling process within the cluster—for example, we could obtain a sample through simple random sampling or systematic sampling. If we opt to do this, we’re dealing with a two-stage sampling process: in the first stage, we select the cluster, and in the second we select the individuals within the cluster. If, on the other hand, we study all of the individuals within the clusters, we call it single-stage cluster sampling.

STRATIFIED AND CLUSTER SAMPLING

The idea of cluster sampling is reminiscent of stratified sampling. In both cases, we divide the population into groups. Yet in one sense, these methods’ underlying approaches are in opposition.

Stratified sampling is especially suitable when the groups (strata) have a high level of internal homogeneity and are very different among themselves. In that case, it’s good to make sure that our sample is representative of all the strata. With cluster sampling, it’s quite the opposite: we want the groups into which we divide the population to be very similar, so that there is no major difference between studying individuals in one group or another.

So, despite the fact that both methods divide up the population (into strata or clusters), the individual selection process is radically different.

• This method’s greatest advantage is operational: selecting a cluster to study is typically easier and more affordable than creating a random or systematic sample. For example, we saw above how using geographic clusters can amount to significant savings on travel.
• Strangely enough, it is common for studies conducted online to continue to think in terms of regions, even though there is no operational incentive to do so; very much to the contrary, this approach heightens the risk of imprecision due to differences between the regions studied and the rest of population. This practice is the unjustified bequest of techniques that were good for live interviews, but that make no sense for other methods.
• The chief disadvantage of using cluster sampling is the notable risk that the clusters may not be truly homogeneous among each other. In the above example about Argentine smokers, perhaps one of the provinces is more inclined to smoke because it is more urban, or for cultural reasons, or due to any number of other possible factors.

THE EFFECTIVENESS OF CLUSTER SAMPLING

How does this method stack up against those that we saw before? As with stratified sampling, how well this method works depends on the “relatedness” between variance within the clusters and variance outside the clusters.

This relatedness is expressed with an intracluster correlation coefficient (δ), which is defined as the linear correlation coefficient between all the pairs of values for the variable in the study measured over the cluster units and extended to all of the clusters. Ultimately, this coefficient is a measure of homogeneity within the clusters.

The lower the intracluster correlation coefficient δ, the greater the effectiveness of the cluster sampling. Bear in mind that the goal is for the clusters to be just as heterogeneous as the whole sample, so that the selection of a given cluster will yield the same information as the random selection of individuals from the entire population.

If we compare simple random sampling with cluster sampling, we can demonstrate that if δ=0, both methods are equivalent. This condition implies that the clusters are just as heterogeneous as the population as a whole. The worst-case scenario would be if δ=+1, and the best-case scenario would be δ=-1/(M-1), where M is the size of the cluster. But normally, δ is always going to be greater than zero, since it’s normal for the units within a cluster to bear a certain resemblance to one another.

Another way to see the impact of this problem is to calculate the sample size necessary for cluster sampling to achieve the same level of precision as simple random sampling. This is expressed as

nc = na (1 + (M-1) δ)

where nc is the sample size in cluster sampling and nais the sample size that we would need for simple random sampling. Therefore, the factor (1+(M-1) δ) is the sample size variation that we would need in order to use clusters. The variation is generally an increase. This fact is known as the design effect.

We hope that this post has helped you better understand this random sampling method. Check out the links below to read the other articles that form this series: