Introduction to statistical sampling and resampling

One of the deepest wishes of any researcher is to have data on the entire population they intend to study.
Studying the whole population gives the researcher a complete understanding of the phenomenon under study, because it means collecting information on every individual in that population.
This is unrealistic in most cases, for both practical and theoretical reasons.
For example, let's consider the population of people living in the city of Rome, Italy. If our study needs responses from these individuals, it could be impossible to locate and contact all of them due to real-world limitations.
We would need to find each resident and ask for a response, and repeat this for everyone in Rome.
This could prove too expensive and time-consuming, whether for a solo researcher or for a team.
For this reason, it is often necessary to use a sample as an approximation of the population.
Sampling is not as straightforward as one might think. There are non-obvious definitions and nuances that are worth knowing about. In fact, formalizing a clear and well-thought-out sampling strategy can have a huge impact on your study, both for you and for your team.
In this article you'll learn about statistical sampling and how it can hugely affect the outcomes of your studies and experiments.
Here's what you'll learn by reading this article:
- what a sample is and how it differs from the population at the statistical level
- why sampling is necessary in most cases
- what makes a sample representative of the population and which factors affect that property
- some of the most relevant sampling techniques, with examples and images of how to think about them in a research design setting
Let's get started!
Let's define what a sample really is
A sample is nothing more than a subset of the population you wish to study. Unlike the population, which is the entire group of individuals or objects that you want to analyze, the sample is only a portion of it.
Studying a sample lets us draw conclusions about the population: what we measure in the sample is an approximation of what we would expect to find in the population as a whole.
That's it.
Now, the nuances emerge when one asks questions such as
- what are the inclusion criteria for my experiment?
- what kind of sampling strategy best suits this scenario?
- how do I collect the data from this sample?
and more. So while defining what a sample is is easy, sampling itself is quite complex and affects the whole research project you are working on.
Why is it difficult to study a population?
The reasons can be numerous, but some of the most common are:
- the population is too large and you just can't grab data from all of the individuals
- lack of resources, such as time and money, to collect data on the whole population
- difficulty in identifying all the individuals belonging to the population
- inability to collect data on some individuals, due to some form of inaccessibility
Many other reasons depend on the specific project.
In these cases, statistical sampling becomes a practical and efficient solution for estimating population characteristics. Once data is collected from the sample, it can be used to make inferences about the larger population.
What does "representative sample" mean?
A representative sample is a subset of the population that is expected to share the relevant characteristics of the population itself.
Given some metric (like the height of the individuals or their scores on some test), a representative sample is one that accurately reflects the distribution that metric would show if the whole population were studied instead.
Most of the time the researcher will have no idea how the population is distributed for a given metric, so they could use a sampling technique such as random sampling, which seeks to ensure that each individual in the population has an equal probability of being included in the sample.
But it is not that simple, since the only empirical way for the researcher to validate their sampling technique is through observation and experimentation.
For example, you could use random sampling because you think that individuals in the population are quite similar to one another. Experimentation could then reveal that there are instead separate groups that differ considerably from one another and that are worth studying individually.
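To make this concrete, here is a minimal Python sketch under made-up assumptions: a simulated population of test scores with two hidden groups (the labels "A" and "B", the means, and the sizes are hypothetical and not taken from any real study). A simple random sample gives a plausible overall mean, yet splitting the same sample by group shows how different the groups are.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: two hidden groups with different test scores.
population = (
    [("A", rng.gauss(60, 5)) for _ in range(5000)]
    + [("B", rng.gauss(85, 5)) for _ in range(5000)]
)

# Simple random sample of 200 individuals, ignoring group membership.
sample = rng.sample(population, 200)
overall_mean = statistics.mean(score for _, score in sample)

# Split the sampled scores by group to reveal the hidden structure.
scores_by_group = {}
for group, score in sample:
    scores_by_group.setdefault(group, []).append(score)
group_means = {g: statistics.mean(s) for g, s in scores_by_group.items()}

print(f"overall mean: {overall_mean:.1f}")
for group, mean in sorted(group_means.items()):
    print(f"group {group} mean: {mean:.1f}")
```

The overall mean falls between the two group means and describes neither group well, which is the kind of finding that would suggest studying the groups separately.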
Factors that impact sampling
Here is a list of factors that impact the quality of the sampling and its ability to correctly approximate the population:
- The reference population: the choice of the sample depends on the knowledge of the reference population, i.e. the group of people, objects, events or data from which the sample is to be extracted.
- The sampling method: There are several sampling methods, including simple random sampling, stratified sampling, cluster sampling, systematic sampling and quota-based sampling. The choice of method depends on the characteristics of the population and the objectives of the study.
- The sample size: The sample size depends on the level of precision and reliability required for the estimates. In general, the larger the sample, the more precise the estimates, provided the sampling method is unbiased. This is because the sample resembles the reference population more and more closely as the number of individuals in it increases.
- Inclusion criteria: The inclusion criteria used may affect the representativeness of the sample and the accuracy of the estimates. It is important to use appropriate screening procedures to avoid selection bias.
- How data is collected: How data is collected (for example, telephone interviews, online questionnaires, field observations) can affect data quality and sample representativeness.
- The time over which the data is collected: This can influence the representativeness of the sample, as the characteristics of the population can vary over time.
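The effect of sample size in the list above can be checked with a small simulation. This is only a sketch: the population is simulated (10,000 heights with made-up parameters), and `spread_of_sample_means` is a hypothetical helper that draws many simple random samples of a given size and measures how much their means vary.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, repetitions=300, seed=0):
    """Standard deviation of the sample mean over repeated simple random samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)
        means.append(sum(sample) / sample_size)
    return statistics.stdev(means)

# Hypothetical population: 10,000 simulated heights in cm (made-up parameters).
population_rng = random.Random(42)
population = [population_rng.gauss(170, 10) for _ in range(10_000)]

spreads = {n: spread_of_sample_means(population, n) for n in (10, 100, 1000)}
for n, spread in spreads.items():
    print(f"n={n:>4}  spread of the sample mean: {spread:.2f} cm")
```

Under these assumptions the spread shrinks roughly with the square root of the sample size, so a sample a hundred times larger gives an estimate about ten times more precise.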
Sampling techniques
We will now go through a list of sampling techniques that help the researcher approximate the population. Again, there is no "best" technique; you should understand them and adapt them to the case you are working on.
- random sampling
- stratified sampling
- systematic sampling
- quota-based sampling
- bootstrapping
Let's go through them one by one.
Random sampling
Simple random sampling is one of the most widely used sampling techniques and involves randomly selecting individuals from the population, such that each individual has an equal chance of being included in the sample.
This sampling technique is useful when the population is homogeneous and there is no reason to divide it into groups. Furthermore, simple random sampling is relatively easy to implement and requires no specialized knowledge.
However, simple random sampling has some limitations, such as the difficulty of ensuring representativeness of the sample when the population is highly heterogeneous.
In this case, it is important to have domain-specific knowledge to understand and correctly deal with this heterogeneity.
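As a minimal sketch, simple random sampling can be done with Python's standard library. The function name `simple_random_sample` and the list of 1,000 numeric IDs are assumptions made for illustration; in practice the list would be your sampling frame.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct items, each with the same chance of selection."""
    rng = random.Random(seed)
    # random.Random.sample selects without replacement.
    return rng.sample(population, sample_size)

# Hypothetical sampling frame: IDs of 1,000 individuals.
population_ids = list(range(1, 1001))
sample = simple_random_sample(population_ids, 50, seed=7)
print(sorted(sample)[:10])
```

Passing a seed makes the draw reproducible, which helps when the selection has to be documented or audited.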
Here's an image on how random sampling works

Stratified sampling
Stratified sampling is a sampling technique that involves dividing the population into homogeneous groups, called strata, based on one or more characteristics.
Once the strata are defined, a simple random sample is selected within each stratum.
This technique is useful when the population is heterogeneous in the characteristics of interest and you want to ensure that each stratum is adequately represented in the sample.
For example, if you wanted to study a company's customer satisfaction across different regions, you could divide the population by region and select a simple random sample of customers from each region.
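The customer example could be sketched as follows. The customer records, the region names, and the helper `stratified_sample` are all hypothetical. The sketch assumes proportional allocation (the same fraction is drawn from every stratum), which is only one of several possible choices.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, stratum_of, fraction, seed=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        # Take at least one record so that no stratum is left out.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical customers spread unevenly across three regions.
customers = (
    [{"id": i, "region": "north"} for i in range(600)]
    + [{"id": i, "region": "south"} for i in range(600, 900)]
    + [{"id": i, "region": "islands"} for i in range(900, 1000)]
)

sample = stratified_sample(customers, lambda c: c["region"], fraction=0.1, seed=1)
counts = Counter(c["region"] for c in sample)
print(counts)
```

Each region appears in the sample in the same proportion as in the customer list, something a simple random sample of the same size would only achieve on average.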

Systematic sampling
Systematic sampling is a sampling method in which population items are arranged in an ordered list and one item is selected every k items (for example, every tenth item) starting from a random starting point.
This sampling method is used when the list of population items is already available and a random selection of a sample is required.
Systematic sampling is useful when the population is large and drawing each individual at random would require too much time or resources.
However, systematic sampling can be biased if the sampling interval k coincides with a periodic pattern in the population. For example, picking every seventh day from a list of daily records always returns the same day of the week.
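Here is a minimal sketch of systematic sampling, assuming the ordered list fits in memory. The function name `systematic_sample` and the example data are hypothetical. The second part of the example shows the bias problem with a list that repeats every 7 items, sampled with an interval of 7.

```python
import random

def systematic_sample(ordered_items, k, seed=None):
    """Select every k-th item after a random start among the first k positions."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    rng = random.Random(seed)
    start = rng.randrange(k)
    return ordered_items[start::k]

# Hypothetical ordered list of 100 items, one selected every 10.
items = list(range(100))
sample = systematic_sample(items, k=10, seed=3)
print(sample)

# Periodic pattern: day-of-week codes 0..6 repeated over 10 weeks.
weekdays = [day % 7 for day in range(70)]
biased_sample = systematic_sample(weekdays, k=7, seed=3)
print(set(biased_sample))  # a single weekday, whatever the random start
```

When a periodic pattern like this is plausible, choosing an interval that does not match the period, or switching to another technique, avoids the problem.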

Quota-based sampling
Quota sampling is a non-probability sampling method in which individuals are selected in order to obtain a proportional representation of the characteristics of the reference population.
With this method, the population is divided into categories or "quotas" based on some characteristics of interest (for example, gender, age, education, geographic region) and the number of individuals to be selected for each quota is determined based on the proportions of the population.
The selection of subjects within each quota can be done using a random or non-random sampling method, depending on the needs of the study.
The advantage of quota-based sampling is that it yields a sample that proportionally represents the characteristics of the reference population, even if the choice of subjects within each quota is not made randomly.
This sampling method is often used in opinion polls, as it allows a representative sample to be reached relatively quickly and cheaply. However, quota sampling may be influenced by the knowledge and opinions of the recruiter who selects subjects for the sample, and may therefore be subject to selection bias.
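The procedure can be sketched in Python as below. Everything here is hypothetical: the age groups, their population shares, and the two helpers `proportional_quotas` and `quota_sample`. The sketch assumes the non-random variant, where respondents are accepted in the order they arrive until each quota is full.

```python
from collections import Counter

def proportional_quotas(population_shares, sample_size):
    """Turn population shares into per-category quotas.

    Rounding may make the quotas sum to slightly more or less than sample_size.
    """
    return {cat: round(share * sample_size) for cat, share in population_shares.items()}

def quota_sample(respondents, category_of, quotas):
    """Accept respondents in arrival order until every quota is filled."""
    remaining = dict(quotas)
    sample = []
    for person in respondents:
        category = category_of(person)
        if remaining.get(category, 0) > 0:
            sample.append(person)
            remaining[category] -= 1
        if not any(remaining.values()):
            break  # all quotas are full
    return sample

# Hypothetical population shares by age group.
shares = {"18-34": 0.3, "35-54": 0.4, "55+": 0.3}
quotas = proportional_quotas(shares, sample_size=20)

# Hypothetical stream of respondents, in the order they were reached.
groups = ["18-34", "35-54", "55+"]
respondents = [{"id": i, "age_group": groups[i % 3]} for i in range(60)]

sample = quota_sample(respondents, lambda p: p["age_group"], quotas)
print(Counter(p["age_group"] for p in sample))
```

Because the first people reached fill the quotas, the sample matches the population on age group but may still differ on anything the quotas do not control, which is where the selection bias mentioned above comes from.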

Bootstrapping
Bootstrapping is a resampling technique that approximates the sampling distribution of a statistic by repeatedly drawing elements at random, with replacement, from the sample you have already collected. Each resample usually has the same size as the original sample.
This process is repeated many times, creating a large number of resamples. A statistic (such as the mean or the median) is computed on each resample, and the collection of these values approximates how that statistic would vary across samples drawn from the population.
Bootstrapping is useful when you want to estimate the accuracy of a sample statistic or of a machine learning model. Instead of making assumptions about the population distribution, bootstrapping uses the distribution of the statistic across the resamples to estimate the standard error and confidence intervals.
Bootstrapping is especially useful when the population distribution is unknown or when it is not possible to obtain repeated samples from the original population.
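A minimal sketch of the bootstrap for the mean is shown below. The observed data are simulated with made-up parameters, and `bootstrap_distribution` and `percentile_interval` are hypothetical helper names. The interval uses the simple percentile method, which is one of several ways to build a bootstrap confidence interval.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_resamples=2000, seed=None):
    """Compute `statistic` on many resamples drawn with replacement from `sample`."""
    rng = random.Random(seed)
    size = len(sample)
    # random.Random.choices draws with replacement.
    return [statistic(rng.choices(sample, k=size)) for _ in range(n_resamples)]

def percentile_interval(values, confidence=0.95):
    """Percentile interval: cut equal tails off the sorted bootstrap values."""
    ordered = sorted(values)
    tail = (1 - confidence) / 2
    lower_index = int(tail * len(ordered))
    upper_index = len(ordered) - 1 - lower_index
    return ordered[lower_index], ordered[upper_index]

# Hypothetical observed sample: 40 simulated measurements (made-up parameters).
data_rng = random.Random(5)
observed = [data_rng.gauss(100, 15) for _ in range(40)]

boot_means = bootstrap_distribution(observed, statistics.mean, seed=11)
standard_error = statistics.stdev(boot_means)
low, high = percentile_interval(boot_means)

print(f"sample mean: {statistics.mean(observed):.1f}")
print(f"bootstrap standard error: {standard_error:.2f}")
print(f"95% percentile interval: ({low:.1f}, {high:.1f})")
```

Any other statistic, such as `statistics.median`, can be passed in place of the mean without changing the rest of the code.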

Conclusion
In this article we have seen how statistical sampling is a fundamental concept during the research process.
We have seen how sampling can help obtain information on the population of interest more efficiently than collecting data on the entire population, and how the result depends on the researcher's knowledge of the reference domain and on the various biases they are exposed to.
We also discussed commonly used sampling techniques, including random, stratified, systematic, and quota sampling, as well as bootstrapping.
Finally, we underlined the importance of a correct definition of the reference population and of the choice of the most appropriate sampling method for the objectives of the study.
Hope this intro helps you with your personal growth, domain knowledge and your projects.
Until the next article!