We live in the world of big data. It’s the new oil that fuels our engines while it also helps us reach the destination. Perhaps one day it will also drive us there. Well, that’s already happening.
But big data means... well, big. Sometimes too big to compute in full. Even the latest CPU is not going to cut it when a data set is too large to process in a reasonable time. Who can blame it? Nature has imposed physical limitations on our devices, and we have found our workarounds.
The term data sampling should be self-explanatory: we are simply picking a sample from data. The data is typically so large that we cannot process all of it, either by hand or with the equipment we have. Hence a sample is taken, which is a subset of that data. The original data set from which the sample is taken is called the population. The image below depicts the relation between population and sample.
Creating a sample can look easy on paper, but that depends entirely on the original population. When we choose a sample, we expect it to reflect the characteristics of the original population. But that is not a given. If the population is uniform, a simple random sample will do. Most of the time, though, the population is more complex and requires special techniques to create samples. These sampling techniques are broadly divided into two categories: probability sampling and non-probability sampling. Before we go there, let’s discuss a few scenarios where sampling has been helpful to us in real life.
On average, a country has about 40 million people as of 2021. Although that number is skewed by big nations like India, China, or the United States, most countries still have millions of people. It is important for a government to be able to estimate the population of its country, because the most important socio-economic factors are gauged with the population as the base. Counting each person is slow and expensive, and by the time the count is done, demographics would have changed.
So how do we estimate the population? We take a sample and extrapolate from it to the whole population. That is easier said than done, because population density differs from one part of the country to another, and a density measured in one region cannot be applied to another. We have to define the boundaries where population density changes considerably, take a separate localized sample within each region, and then combine the regional results to get the final population. Whether this counts as probability or non-probability sampling depends on how the localized samples are chosen, which we will get to later.
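The extrapolation described above can be sketched in a few lines of Python. This is only a rough illustration, assuming the country has already been split into regions of similar density. The region names, areas, and sampled densities are made-up numbers, and `estimate_population` is a hypothetical helper, not an official census method.

```python
from statistics import mean

# Hypothetical regions: total area in sq km, plus the people-per-sq-km
# figures measured in a few sampled plots inside each region.
regions = {
    "urban": {"area_km2": 100, "sampled_densities": [900, 1100, 1000]},
    "rural": {"area_km2": 1000, "sampled_densities": [8, 12, 10]},
}

def estimate_population(regions):
    """Extrapolate each region from its own sample, then add the results."""
    total = 0.0
    for region in regions.values():
        average_density = mean(region["sampled_densities"])
        total += region["area_km2"] * average_density
    return total

estimate = estimate_population(regions)
print(f"Estimated population: {estimate:,.0f}")
```

Each region is extrapolated from its own local sample, so a dense city does not distort the estimate for a sparse countryside.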
Food sampling is done to check whether food contains harmful contaminants. It is also used to check whether the manufacturer stays true to the key ingredients the food is supposed to contain. Just like the case above, we cannot check millions of individual containers, so a sample is selected. This is a good example of probability sampling. We pick containers at random, so every container has an equal chance of being inspected. If production is uniform, this sample will be representative of the whole population. We will discuss probability sampling in more detail below.
At the most basic level, sampling methods are divided into two groups: probability sampling and non-probability sampling. In probability sampling, items are selected at random and every member of the population has a known, non-zero chance of being selected. In non-probability sampling, selection is not random; it depends on factors such as availability or judgement, so the chance of any member being selected is unknown. We will discuss 4 popular methods from each category.
As mentioned above, probability sampling relies on random selection, so each item in the population has a known chance of ending up in the sample. The food inspection scenario is a good example. There are 4 types of probability sampling methods.
- Simple random sampling
- Systematic sampling
- Stratified sampling
- Cluster sampling
Simple random sampling is just what it sounds like. We pick random items from the population without any particular order. The only limiting factor is the sample size. The larger the sample, the more closely it tends to resemble the original population.
If all items in the population can be picked with equal probability, making a random sample is the easiest way. It reduces selection bias since we are not following any particular order. However, we also risk missing items with characteristics that interest us, especially when those items are rare.
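As a minimal sketch of simple random sampling, Python’s standard `random` module is enough. The population of 100 numbered items and the helper name `simple_random_sample` are assumptions made for illustration.

```python
import random

# Hypothetical population: 100 items identified by the numbers 1..100.
population = list(range(1, 101))

def simple_random_sample(items, k, seed=None):
    """Draw k distinct items; every item is equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(items, k)

sample = simple_random_sample(population, k=10, seed=42)
print(sample)
```

Passing a seed is optional. It is used here only so that the same sample can be reproduced later.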
In this type of sampling, the first item is selected randomly. After that, items are selected at a fixed interval. In the example below, item 2 was randomly selected as the starting point, and each subsequent item is taken 2 items apart.
This sampling method helps us spread the sample across the entire population, so that it does not depend on only one part of it. However, if the population has a repeating pattern that happens to line up with the chosen interval (which is rare), the sample will be biased.
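Systematic sampling can be sketched with a random start followed by a fixed step. The population of 20 numbered items and the function name `systematic_sample` are assumptions for illustration, and the interval is read here as "take every second item".

```python
import random

def systematic_sample(items, interval, seed=None):
    """Pick a random start within the first interval, then every interval-th item."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # index 0 .. interval-1
    return items[start::interval]

# Hypothetical population: items numbered 1..20.
population = list(range(1, 21))
sample = systematic_sample(population, interval=2, seed=0)
print(sample)
```

With an interval of 2 the sample contains either all the odd or all the even items, which also shows how a pattern in the data that matches the interval would end up skewing the sample.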
In this type of sampling, the population is divided into subgroups called strata, based on specific features like gender or age. A random sample is then drawn from each subgroup. In the example below, we treat circles, pentagons, triangles, and squares as subgroups and randomly select an item from each one for our sample.
This method of sampling is useful when we want to create a random sample based on an underlying characteristic or a feature. The feature could simply be something like gender or locality.
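The shapes example can be sketched as follows. The population of labelled shapes is invented for illustration, and `stratified_sample` is a hypothetical helper that draws the same number of items from every stratum.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, per_stratum, seed=None):
    """Group items into strata by key, then draw randomly within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for name in sorted(strata):  # fixed order keeps the result repeatable
        members = strata[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical population of (shape, id) pairs.
population = (
    [("circle", i) for i in range(6)]
    + [("pentagon", i) for i in range(4)]
    + [("triangle", i) for i in range(5)]
    + [("square", i) for i in range(3)]
)
sample = stratified_sample(population, key=lambda item: item[0], per_stratum=1, seed=1)
print(sample)
```

Drawing a fixed number per stratum is one possible design. Another common choice is to draw from each stratum in proportion to its size.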
Cluster sampling is the last of the four probability sampling methods. In this method, the population is divided into subgroups called clusters, and one or more clusters are selected at random. The members of the selected clusters make up our sample.
Here we have divided the whole population into 3 clusters. The third cluster is chosen as the sample. We can choose different clusters as our sample to get more in-depth knowledge of our population. This type of sampling is also useful when we want to understand more about a specific part of the population.
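A minimal sketch of cluster sampling, assuming the population has already been split into three named clusters. The cluster names, their contents, and the helper `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters=1, seed=None):
    """Pick whole clusters at random and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population, already divided into 3 clusters.
clusters = {
    "cluster_1": [1, 2, 3, 4],
    "cluster_2": [5, 6, 7],
    "cluster_3": [8, 9, 10, 11, 12],
}
sample = cluster_sample(clusters, n_clusters=1, seed=3)
print(sample)
```

Unlike stratified sampling, the random choice is made between groups, not inside them, so the whole selected cluster ends up in the sample.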
Non-probability sampling methods are used when items are not selected at random, so the chance of any particular item ending up in the sample is unknown. The national population example shows why care is needed. Population density varies throughout a nation, so if one finds 10 people per square km in a particular locality, there is no guarantee that 10 people per square km exist throughout the nation. If the localities are chosen at random within each region, the localized sampling is still probability sampling. It becomes non-probability sampling when localities are picked because they are convenient or because someone judged them typical. We will discuss 4 types of non-probability sampling.
- Convenience sampling
- Quota sampling
- Judgement sampling
- Snowball sampling
In this method, the sample is made up of individuals who are both available and willing to take part. A good example of convenience sampling would be exit polls. The primary problem with convenience sampling is that it is prone to bias, because the people who are easy to reach may differ from the rest of the population.
In quota sampling, the sample is built to meet pre-established quotas for some feature of the population. For example, suppose we want to do a survey on women’s welfare products. Our sample should then use gender as a pre-established feature. Additionally, if we only want to interview adults, age can be another feature. Within each quota, respondents are usually picked by convenience, not at random.
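The survey example can be sketched as a simple filter that fills a quota in arrival order. The respondents, the quota of 2 adult women, the age threshold of 18, and the helper names `quota_sample` and `group_of` are all assumptions made for illustration.

```python
def quota_sample(respondents, quotas, key):
    """Accept respondents in arrival order until every quota is filled."""
    remaining = dict(quotas)
    sample = []
    for person in respondents:
        group = key(person)
        if remaining.get(group, 0) > 0:
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas are full
    return sample

def group_of(person):
    # Assumption: "adult" means 18 or older.
    return (person["gender"], "adult" if person["age"] >= 18 else "minor")

# Hypothetical respondents in the order they were approached.
respondents = [
    {"name": "A", "gender": "female", "age": 34},
    {"name": "B", "gender": "male", "age": 41},
    {"name": "C", "gender": "female", "age": 16},
    {"name": "D", "gender": "female", "age": 52},
    {"name": "E", "gender": "female", "age": 29},
]
quotas = {("female", "adult"): 2}
sample = quota_sample(respondents, quotas, key=group_of)
print([person["name"] for person in sample])
```

Respondent E is never considered because the quota is already full, which shows why the order in which people are approached can bias the result.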
In this method, we leave it to an expert to create the sample. The expert judges the members of the population and selects those who are suited to be in the sample. The quality of the sample rests entirely on the expertise of the judge, and it is prone to bias.
In this sampling method, the individuals already included in our sample are asked to refer other potential members. This method is employed when members of the population are hard to identify or reach up front. The name snowball simply means the sample gets bigger like a rolling snowball.
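The referral process can be sketched as a breadth-first walk over a referral map. The names, the `referrals` dictionary, and the helper `snowball_sample` are hypothetical, and real referral data would of course be collected from people, not read from a dictionary.

```python
from collections import deque

def snowball_sample(seeds, referrals, max_size):
    """Start from seed members and keep adding the people they refer."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for referred in referrals.get(person, []):
            if referred not in seen:  # do not recruit anyone twice
                seen.add(referred)
                queue.append(referred)
    return sample

# Hypothetical map of who refers whom.
referrals = {
    "ana": ["ben", "cho"],
    "ben": ["dee"],
    "cho": ["ana", "eli"],
    "dee": [],
}
sample = snowball_sample(["ana"], referrals, max_size=4)
print(sample)
```

The sample can only reach people connected to the initial seeds, which is one reason this method is treated as non-probability sampling.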
We have discussed a total of 8 types of sampling methods in this post. The right method depends on the population and what we want to do with it. The very first question to ask is whether every member of the population can be given a known chance of being selected, for example by drawing at random from a complete list. If that is not possible, the probability methods can be dismissed, and a non-probability method can be chosen depending on the situation at hand.
For probability sampling, a simple random sample is the most common and popular choice, and it reduces selection bias. But no one is restricted from using the other methods. One is also free to try different types of probability sampling to see which gives the best result.