Sampling Techniques in Data Analysis

Considerable emphasis is given to the analytical methods and algorithms used in data science projects to extract meaningful insights from data. But equally important (arguably even more so) is the data preparation that happens before a project begins: the quality of the data is the bedrock on which any data analysis or machine learning project is built. It would be naive to expect quality outputs from an analysis with subpar data inputs; garbage in, garbage out, as the saying goes. It is therefore essential to ensure that the data samples collected are of sufficient quality. But how do you choose the appropriate sampling technique for your data?

In this post I intend to provide an overview of some sampling techniques for data collection, and give suggestions on how to pick the most suitable method for your data. The sampling methods I will describe here are as follows:
- Simple random sampling
- Stratified sampling
- Cluster sampling
- Systematic sampling
Each method has its advantages and disadvantages, and some methods are more suitable than others depending on the needs of the data. This post will describe these sampling techniques in detail, and give examples of use cases where each is recommended.
Simple Random Sampling
Simple random sampling (SRS) does exactly what the name suggests: the sample is selected from the population at random, with every element having the same chance of being chosen, irrespective of other considerations such as population characteristics. This is generally effective when the population is relatively homogeneous, i.e. each element in the population is expected to be similar to the others.
The advantage is that, because selection is random, it is difficult to introduce selection bias: a large enough sample would be expected to be representative of the overall population, which is ideal if the end goal is to model general population behaviours. The main drawback is that smaller subgroups within the population may be underrepresented in the sample, or missed altogether. In cases like this a simple random sample may not be fit for purpose.

An example of this is picking residents in a town at random to conduct a survey on public health. A statistician may approach this task by first getting a list of all town residents, assigning a number to each person, then using a random number generator to select a sample of residents for the survey. However, if the survey is particularly interested in the health of the town's very elderly residents (i.e. those over 90 years old), this method may exclude that small subpopulation entirely, in which case SRS should be discarded as a viable option for the survey.
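
The numbered-list procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the function name simple_random_sample, the register of 5000 resident IDs and the sample size of 200 are all made up for the example.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct elements uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, k)


# Hypothetical town register: one ID number per resident.
residents = list(range(1, 5001))

# Select 200 residents for the survey; every resident is equally likely.
survey = simple_random_sample(residents, k=200, seed=42)
print(len(survey), "residents selected")
```

Note that nothing in this draw guarantees that any particular subgroup appears in the result, which is the weakness discussed above.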
Stratified Sampling
By comparison, stratified sampling addresses the potential underrepresentation issues of SRS directly, by first dividing the population into distinct subgroups (or strata) based on their characteristics. Returning to the example of the town health survey, the strata may be defined by age group, or further subdivided by gender or income. Random samples are then taken from each stratum to build the sample for analysis.
This is a practical approach when you want to ensure adequate representation from each subgroup. Depending on the needs of the survey, a statistician may either take an equal number of individuals from each stratum, or select a number of individuals in proportion to the stratum's share of the total population; in this way a survey-taker can maintain proportional representation in their survey. However, it may be difficult to sort the population into clearly defined strata, which makes creating a stratified sample more complex than creating a simple random sample.
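
As a rough sketch of the proportional variant, the Python below groups records by stratum and then draws a random sample from each group. The function stratified_sample, the age bands and the population counts are hypothetical, and the min_per_stratum floor is an assumption added so that a very small stratum is never left empty.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_of, fraction, seed=None, min_per_stratum=1):
    """Sample the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)

    # Step 1: divide the population into strata.
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    # Step 2: take a simple random sample within each stratum.
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(min_per_stratum, round(len(members) * fraction))
        k = min(k, len(members))  # never ask for more than the stratum holds
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical residents as (id, age_band) pairs.
residents = (
    [(i, "18-49") for i in range(3000)]
    + [(i, "50-89") for i in range(3000, 4900)]
    + [(i, "90+") for i in range(4900, 5000)]
)

sample = stratified_sample(residents, stratum_of=lambda r: r[1], fraction=0.04, seed=1)
counts = Counter(band for _, band in sample)
print(counts)
```

With these made-up numbers a 4% fraction works out to 120, 76 and 4 residents from the three bands, so the over-90 group is always present. Taking an equal number from each stratum instead would only require replacing the computation of k with a fixed value.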

Cluster Sampling
With cluster sampling the population is first grouped into clusters, then whole clusters are randomly selected for the sample. Cluster sampling is similar to stratified sampling in that the population is segmented before any random selection takes place. However, rather than randomly selecting individuals from every subgroup, it is the subgroups themselves that are randomly selected.
Cluster groupings are typically based on factors such as proximity, with the central guideline being that each cluster must be distinct from the others. Returning to the town health survey, clusters may be based on neighbourhoods or even households, where some or all members of each selected household are added to the sample. Another example would be a production environment, where entire batches of product are chosen at random for sampling rather than selecting individual units from each batch. This is more convenient than inspecting every unit on the assembly line. One thing to be careful of is ensuring that the clusters do not overlap, so that each element belongs to one cluster only; otherwise sampling errors may result.
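
The production-batch example might look like the following in Python. This is a sketch of the one-stage form, in which every unit of a selected batch is kept; the function cluster_sample and the batch and unit names are invented for illustration. Storing the clusters as a dictionary of lists assumes each unit has already been assigned to exactly one batch.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """One-stage cluster sampling: pick whole clusters at random, keep every element."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # randomise over cluster names
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical production run: 20 batches of 50 units each.
batches = {
    f"batch-{b:02d}": [f"batch-{b:02d}/unit-{u:02d}" for u in range(50)]
    for b in range(20)
}

picked = cluster_sample(batches, n_clusters=4, seed=3)
for name, units in picked.items():
    print(name, len(units), "units")
```

The random choice is made over batches, not over units, which is the key difference from the stratified approach.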

As well as this, cluster sampling can lose precision through a clustering effect: elements within the same cluster tend to be correlated, which leads to larger standard errors than an SRS of the same size would give. There are methods to adjust for this, but they add complexity to the sampling process.
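
One widely used way to quantify this clustering effect is Kish's design effect, which the post does not name and which is offered here only as an illustration. Assuming clusters of equal size m and an intra-cluster correlation rho, the variance is inflated by a factor of 1 + (m - 1) * rho relative to SRS. The function names and the example values below are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish's design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Size of the simple random sample that would give the same precision."""
    return n / design_effect(cluster_size, icc)


# Hypothetical survey: 200 units in clusters of 50, with rho assumed to be 0.05.
deff = design_effect(cluster_size=50, icc=0.05)
n_eff = effective_sample_size(n=200, cluster_size=50, icc=0.05)
print(f"design effect: {deff:.2f}, effective sample size: {n_eff:.0f}")
```

Under these assumed values the design effect is about 3.45, so 200 clustered units carry roughly the information of 58 independently sampled ones. When rho is zero the design effect is exactly 1 and nothing is lost.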
Systematic Sampling
Lastly, systematic sampling involves selecting a starting point within the population and then choosing every nth item after it to add to the sample. This is particularly convenient for large populations where a list is available. An example is post-processing measurement on a production line, where every 10th product running through the tool is measured for defects. In this example, 10% of the total population ends up in the sample, providing ongoing quality control of the machine's processing.

The benefits of this approach include simplicity and efficiency of data gathering, while maintaining uniform coverage of the population. Unfortunately this method is sensitive to element ordering: if there is a pattern that repeats cyclically in the population, and its period lines up with the sampling interval, the sample may be biased.
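
A short Python sketch makes both the method and its weakness concrete. The function systematic_sample is a made-up name, choosing the starting point at random within the first interval is an assumption (the text only says a starting point is selected), and the fault pattern is invented so that it repeats with exactly the same period as the sampling interval.

```python
import random


def systematic_sample(items, step, seed=None):
    """Take every step-th item after a random start in the first interval."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    start = random.Random(seed).randrange(step)  # start index in [0, step)
    return items[start::step]


def is_defective(unit):
    """Hypothetical periodic fault: one unit in every run of 10 is bad."""
    return unit % 10 == 3


units = list(range(1000))  # units in production order
measured = systematic_sample(units, step=10, seed=7)

true_rate = sum(map(is_defective, units)) / len(units)
sample_rate = sum(map(is_defective, measured)) / len(measured)
print(f"measured {len(measured)} units")
print(f"true defect rate {true_rate:.2f}, sampled defect rate {sample_rate:.2f}")
```

Because the fault recurs every 10 units and the sample also steps by 10, the sample contains either every defective unit or none of them, so the sampled rate is 0 or 1 while the true rate is 0.1. Picking a step that shares no period with the process, or shuffling the order first where that is possible, avoids this trap.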
Choosing an Appropriate Sampling Method
How can you determine the most suitable sampling technique for your data? There are many factors to consider when choosing a sampling technique, which are often specific to the type of analysis to be conducted. While no one specific method fits all scenarios, the following statements are good rules of thumb for choosing a sampling method:
- All elements in the population are equally important. Sample bias must be minimised. The sample needs to be representative of the general population. Subgroups within the population are not a focus for the data collection → Use simple random sampling
- Subgroups need to be represented in the data collection. Dividing the population into strata addresses possible biasing concerns → Use stratified sampling
- The population is naturally organised into clusters. There is little or no similarity within clusters that may lead to bias. Clusters are independent of each other → Use cluster sampling
- The population is well-organised and ordered. All elements in the population are equally important. There is no risk of a repeating pattern in the data which may lead to bias → Use systematic sampling
This isn't an exhaustive procedure for choosing a sampling method, and there may be other factors to consider, but these rules of thumb cover most common cases. Ultimately the question comes down to which data matter in the data collection process, whether potential biases are addressed, and what practical limitations apply to data gathering. The optimal sampling technique will sufficiently address these concerns. So long as you bear this in mind when choosing your sampling method, you can be more confident of obtaining high-quality data for your purposes.