Sampling Methods in R

Sangeetha Natarajan
Published in Analytics Vidhya
May 24, 2021


Photo by Paul Bergmeir on Unsplash

What is sampling?

Suppose we have a population of size N. A sample is simply a subset of the data taken from that population, and the process of selecting a sample is known as sampling.

Why sampling?

Sampling is useful in either of the following two scenarios:

  1. When the entire population data is not available

In this case, we might have to use a sample of data available to make inferences about the entire population.

  2. When the population data is too big

In this case, any of the sampling techniques below can be used to draw samples from the population, and inferences can then be made from those samples.

Note: It is important to ensure that a representative sample is chosen, since an unrepresentative sample may lead to inferences that do not hold for the population.

Sampling Methods:

Sampling methods are classified into two major categories:

  1. Probabilistic Sampling
  2. Non-probabilistic Sampling

Probabilistic Sampling:

In this type of sampling, units are chosen at random, and each unit in the population has a known probability of being selected.

Probabilistic Sampling is further classified as

  1. Simple Random Sampling
  2. Systematic Sampling
  3. Stratified Sampling
  4. Cluster Sampling

1. Simple Random Sampling:

Simple random sampling is one of the most popular and frequently used sampling methods. In simple random sampling, every case in the population has an equal probability of being selected for the sample.

A random sample can be obtained by labeling all the cases sequentially and generating uniform random numbers to select the cases from the population.
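The labelling procedure described above can be sketched in base R. This is only an illustration of the idea, not the internals of any particular function; the `population` vector, the sizes `N` and `n`, and the seed are hypothetical choices.

```r
# Hypothetical population of N = 20 cases, labelled 1..N
set.seed(42)                       # arbitrary seed, only for reproducibility
N <- 20
n <- 5
population <- paste0("case_", seq_len(N))

# Attach one uniform random number to every label ...
u <- runif(N)

# ... and keep the n labels with the smallest random numbers.
# Every subset of size n is equally likely, so this is a simple
# random sample without replacement.
selected_labels <- order(u)[seq_len(n)]
srs <- population[selected_labels]
srs
```

Ranking the random numbers, rather than rounding them to labels, avoids picking the same case twice.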

A simple random sample in R can be generated as below using the sample() function.

The sample function is defined as below

sample(x, size, replace = FALSE, prob = NULL)

A sample of size 10 from the numbers 1 to 10 can be generated as below.

> sample(1:10, 10)
 [1]  8  2  9 10  3  7  6  5  1  4

As seen above, the sample is generated without replacement by default, i.e., an item once picked will not be picked again. Samples with replacement can be created by setting the replace parameter to TRUE as below.

> sample(1:10, replace = TRUE)
 [1]  8  8  8  9  3  5  3  2 10  7

As we can see in the sample above, 8 appears three times and 3 appears twice.

In simple random sampling, all the items have an equal probability (p = 0.1 in the above case) of being selected. The sample() function, however, also lets us assign unequal selection probabilities to the items.

Let's say we have a vector with 2 items (red, green). By default, both red and green have a 50% (p = 0.5) chance of being selected. If we need more reds than greens in the sample, this can be done using the ‘prob’ parameter of sample() as below.

> sample(c("red", "green"), 10, replace = TRUE, prob = c(0.6, 0.4))
 [1] "red"   "red"   "red"   "red"   "red"   "red"   "green" "red"   "green"
[10] "red"

As we can see, there are more reds than greens in the sample, since the probability of red being chosen (p = 0.6) is set higher than that of green (p = 0.4).
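With only 10 draws the split between red and green can vary a lot from run to run. A rough way to see the effect of the prob argument is to draw a much larger sample and look at the observed shares. The sample size of 10000 and the seed below are arbitrary choices for illustration.

```r
set.seed(2021)   # arbitrary seed, only for reproducibility

# Draw many items with replacement, red being favoured 60/40
draws <- sample(c("red", "green"), size = 10000, replace = TRUE,
                prob = c(0.6, 0.4))

# Observed share of each colour; these should be close to 0.6 and 0.4
red_share   <- mean(draws == "red")
green_share <- mean(draws == "green")
c(red = red_share, green = green_share)
```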

2. Systematic Sampling:

Systematic sampling is used in situations where the population data is an ordered list or is arranged in time. For example, to analyze the average sales of a shop on Sundays, a systematic sample can be formed by picking the sales data of every 7th day (Sunday) from the daily records.

In other words, in systematic sampling, individuals are chosen at fixed intervals from the population data. To create a sample of size n from a population of size p, the fixed interval k is taken as p/n,

i.e., k = p/n

For example, for a population of size 1000, to create a sample of size 100 the interval is k = 1000/100 = 10, so every 10th item, counted from a random starting point among the first 10 items, is included in the sample.

Now, let's see how to create a systematic sample in R.

Systematic sample in R:

To create a systematic sample in R, the S.SY() function of the “TeachingSampling” package is used.

install.packages("TeachingSampling")  # only needed once
library(TeachingSampling)
# units sold per day over 14 days, labelled "<day>-<units>"
P <- c("Mon-8", "Tues-4", "Wed-4", "Thurs-6", "Fri-7", "Sat-45", "Sun-34",
       "Mon-21", "Tues-11", "Wed-34", "Thurs-16", "Fri-10", "Sat-17", "Sun-19")
# systematic sample from a population of 14, taking every 2nd item of the population P
# (the starting point is chosen at random, so the selected days can differ between runs)
systematic_sample <- S.SY(14, 2)
systematic_sample
> P[systematic_sample]
[1] "Mon-8" "Wed-4" "Fri-7" "Sun-34" "Tues-11" "Thurs-16" "Sat-17"

Using the above R code, from a population P which contains the units sold on all the days of the week over a period of 14 days, we have created a systematic sample with just the units sold on alternate days.
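The fixed-interval rule k = p/n can also be sketched in base R without any package. This is a minimal sketch that assumes p is an exact multiple of n; the variable names and the seed are illustrative only.

```r
set.seed(7)                  # arbitrary seed, only for reproducibility
p <- 1000                    # population size
n <- 100                     # desired sample size
k <- p / n                   # sampling interval, here 10

# Random starting point among the first k items,
# then every k-th item after it
start <- sample(k, 1)
idx <- seq(from = start, by = k, length.out = n)

head(idx)                    # first few selected positions
```

The vector idx holds positions, so a population stored in a vector x would be sampled with x[idx].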

Note: Systematic sampling is easier to implement than simple random sampling. However, for a given starting point the rest of the sample is fully determined, so items that do not fall on the chosen interval are never selected and many combinations of items can never occur in a sample. Also, if the population has periodic trends, the effectiveness of the systematic sample depends on the relationship between the period of the trend and the sampling interval.
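The point about periodic trends can be illustrated with a small simulation. The data below is entirely hypothetical: 8 weeks of daily sales in which weekends sell far more than weekdays. The starting point is fixed at day 6 only to make the effect easy to see; in practice it would be chosen at random.

```r
set.seed(11)   # arbitrary seed, only for reproducibility

# Hypothetical daily sales for 8 weeks with a strong weekly pattern
days  <- rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), times = 8)
units <- ifelse(days %in% c("Sat", "Sun"), 40, 10) + rpois(length(days), 2)

start <- 6     # day 6 is the first Saturday

# An interval equal to the period of the trend hits the same weekday every time
every_7th <- seq(from = start, to = length(days), by = 7)
# An interval that does not match the period spreads over several weekdays
every_5th <- seq(from = start, to = length(days), by = 5)

unique(days[every_7th])      # only "Sat"
table(days[every_5th])

# Compare the sample means with the population mean
c(population = mean(units),
  every_7th  = mean(units[every_7th]),
  every_5th  = mean(units[every_5th]))
```

Here the every-7th-day sample sees only Saturdays, so its mean overstates the average daily sales.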

3. Stratified Sampling:

In stratified sampling, the population is divided into smaller subgroups based on characteristics that describe the population well, such as age, sex, or income. The groups thus formed are known as strata (singular: stratum).

For example, to analyze the amount of time spent by male and female users in sending messages per day, the strata could be taken as male and female users, and random sampling can be used to select items within the male and female strata.

Note: Stratified sampling generally gives more precise estimates than simple random sampling, but its biggest disadvantage is that it requires knowledge of the appropriate characteristics of the population (which are not always available), and it can be difficult to decide which characteristics to stratify by.

Stratified Sampling in R:

Using dplyr

Let's see how to create a stratified sample from the iris dataset with 3 observations from each species.

library(dplyr)
set.seed(1)
# draw 3 rows at random within each species
# (sample_n() is superseded in dplyr 1.0.0 and later; slice_sample(n = 3) is the newer equivalent)
iris %>%
  group_by(Species) %>%
  sample_n(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.3         3.0          1.1         0.1 setosa
2          5.7         3.8          1.7         0.3 setosa
3          5.2         3.5          1.5         0.2 setosa
4          5.7         3.0          4.2         1.2 versicolor
5          5.2         2.7          3.9         1.4 versicolor
6          5.0         2.3          3.3         1.0 versicolor
7          6.5         3.0          5.2         2.0 virginica
8          6.4         2.8          5.6         2.2 virginica
9          7.4         2.8          6.1         1.9 virginica

Using strata():

The same kind of stratified sample can also be created using the strata() function of the sampling package, as below.

library(sampling)
# size gives the number of units to draw from each stratum, in the order the strata appear in the data
stratas <- strata(iris, c("Species"), size = c(3, 3, 3), method = "srswor")
stratas
       Species ID_unit Prob Stratum
17      setosa      17 0.06       1
24      setosa      24 0.06       1
45      setosa      45 0.06       1
65  versicolor      65 0.06       2
79  versicolor      79 0.06       2
95  versicolor      95 0.06       2
114  virginica     114 0.06       3
116  virginica     116 0.06       3
128  virginica     128 0.06       3

Note: In the above function, the method argument specifies how individual units are selected within each stratum. The following methods are available:

srswor: simple random sampling without replacement

srswr: simple random sampling with replacement

poisson: Poisson sampling

systematic: systematic sampling
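For the messaging example mentioned earlier in this section, the same idea can be sketched in base R. The `users` data frame, its column names, the stratum sizes and the seed are all hypothetical and serve only to show the mechanics of drawing a simple random sample inside each stratum.

```r
set.seed(99)   # arbitrary seed, only for reproducibility

# Hypothetical user data: gender is the stratification variable
users <- data.frame(
  user_id = 1:200,
  gender  = rep(c("female", "male"), times = c(120, 80)),
  minutes = round(runif(200, min = 5, max = 60))
)

n_per_stratum <- 10

# Split the row numbers by stratum, then draw a simple random
# sample without replacement inside each stratum
rows_by_stratum <- split(seq_len(nrow(users)), users$gender)
picked <- unlist(lapply(rows_by_stratum, function(rows) {
  rows[sample.int(length(rows), n_per_stratum)]
}))

stratified <- users[picked, ]
table(stratified$gender)
```

Every stratum contributes to the sample, which is the defining property of stratified sampling.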

4. Cluster Sampling:

Cluster sampling is generally used in cases where the population data is geographical in nature or when there are some predefined groups within the population based on demographics, habits, background, etc.

In cluster sampling, the population is first divided into small groups known as clusters, and then clusters are chosen at random to create a sample.

If all the elements of the chosen clusters are included in the sample, it is known as single-stage cluster sampling; if only a random selection of elements from each chosen cluster is included, it is known as two-stage cluster sampling.

For example, suppose that an organization wants to analyze the side effects of a drug across the United States. In this case, two-stage cluster sampling can be performed by first dividing the entire population into cities (where each city's data has the details about the side effects of the drug for all its patients), randomly selecting some of these cities, and then randomly selecting patients within the selected cities to be included in the sample.

Cluster Sampling in R:

To perform cluster sampling, we use the Elementary School Teacher Workload dataset (teachers) from the SDaA package. The dataset contains workload details, such as hours worked and preparation time, of teachers from different schools across different districts.

install.packages("SDaA")  # only needed once
library(SDaA)
library(sampling)  # provides cluster() and getdata()
library(dplyr)     # provides %>%, group_by() and sample_n()
data("teachers")
> head(teachers)
   dist school hrwork size preprmin assist
1 large     12  35.00   26      210      0
2 large     12  35.00   18       75      0
3 large     12  35.00   27      300      0
4 large     12  34.60   34       90      0
5 large     12  33.75   30      180      0
6 large     12  35.00   27      300      0
#list of all the school ids
> unique(teachers[,2])
[1] 12 13 20 21 22 36 38 41 11 30 31 32 4 23 7 15 16 28 29 6 2 18 19 33 34 1 8 9 3 24 25
#first stage: creating a cluster sample with 7 randomly selected clusters.
#The clusters are formed using the school variable, so each selected cluster holds the workload data of one school.
set.seed(123456)
cl <- cluster(teachers, clustername = c("school"), size = 7, method = "srswor")
cl_data <- getdata(teachers, cl)
> head(cl_data)
     dist hrwork size preprmin assist school ID_unit      Prob
260 sm/me  30.00    9       NA      0      1     260 0.2258065
243 large  35.00   20      225    600     18     243 0.2258065
244 large  35.00   16       90    300     18     244 0.2258065
19  large  37.50   25      180      0     20      19 0.2258065
18  large  38.75   24      240      0     20      18 0.2258065
17  large  38.35   24      120      0     20      17 0.2258065
#list of the randomly selected schools
> unique(cl_data[,6])
[1] 1 18 20 25 28 31 41
#count of workload records within each school cluster
> table(cl_data$school)
 8 12 16 21 28 34 38
 5 13 24  7 18  7 10
#second stage: randomly sample 5 workload records within each selected school cluster
cl_sam <- cl_data %>% group_by(school) %>% sample_n(size = 5)
#each of the 7 clusters now has 5 randomly selected workload records
> table(cl_sam$school)
 8 12 16 21 28 34 38
 5  5  5  5  5  5  5

Using the above R code, we have selected 7 random clusters, where each cluster contains a specific school's workload data. The selected clusters are then sampled randomly with a sample size of 5, so the final sample has 5 workload records for each of the selected schools.
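The two stages can also be written out step by step in base R. The sketch below uses the drug side effects example from earlier in this section with entirely hypothetical data: the `patients` data frame, the number of cities, the cluster sizes and the seed are assumptions made only for illustration.

```r
set.seed(2024)   # arbitrary seed, only for reproducibility

# Hypothetical population: 10 cities (clusters) with 30 patients each
patients <- data.frame(
  patient_id  = 1:300,
  city        = rep(paste0("city_", 1:10), each = 30),
  side_effect = rbinom(300, size = 1, prob = 0.2)
)

# Stage 1: randomly select 3 of the 10 city clusters.
# Keeping every patient of the chosen cities is single-stage cluster sampling.
chosen_cities <- sample(unique(patients$city), 3)
single_stage  <- patients[patients$city %in% chosen_cities, ]

# Stage 2: randomly select 5 patients within each chosen city.
# This gives a two-stage cluster sample.
rows_by_city <- split(seq_len(nrow(single_stage)), single_stage$city)
picked <- unlist(lapply(rows_by_city, function(rows) {
  rows[sample.int(length(rows), 5)]
}))
two_stage <- single_stage[picked, ]

table(two_stage$city)
```

Patients from the seven cities that were not chosen in stage 1 never enter either sample.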

Difference between Stratified and Cluster sampling:

The main difference between stratified sampling and cluster sampling is that in cluster sampling the groups (clusters) occur naturally, such as cities or districts, and only the elements of the chosen clusters are used for sampling. For example, for the workload data, 7 school clusters were chosen initially and only the elements of these clusters were used for further sampling, i.e., the workload data from just 7 schools was used, ignoring the workload details of the remaining schools.

In stratified sampling, by contrast, the groups (strata) do not exist initially but are created from a chosen characteristic, and elements from every stratum are included in the sample. For example, for the iris data above, 3 elements from each of the three species were included in the sample, and no species was ignored as a whole, as happens to the unselected clusters in cluster sampling.
