Optimum Sample Allocation in Stratified Sampling Schemes with Stratallo Package

The goal of the stratallo package is to provide implementations of efficient algorithms that solve a classical problem in survey methodology: optimum sample allocation in stratified sampling schemes. Here, optimum sample allocation is understood in the Tschuprov-Neyman sense (Neyman 1934; Tschuprov 1923). It is formulated as the determination of a vector of strata sample sizes that minimizes the variance of the \(\pi\)-estimator of the population total of a given study variable, under a constraint on the total sample size. This problem can be further complemented by adding lower or upper bounds constraints on the sample sizes in strata.

A minor modification of the classical optimum sample allocation problem leads to the minimum sample size allocation problem. This problem lies in the determination of a vector of strata sample sizes that minimizes the total sample size, under an assumed fixed level of the \(\pi\)-estimator’s variance. As in the case of the classical optimum allocation, the problem of minimum sample size allocation can be complemented by imposing upper bounds constraints on the sample sizes in strata.

stratallo provides two user functions, dopt and nopt, that solve the sample allocation problems briefly characterized above. It is assumed that the sampling designs in strata are chosen so that the variance of the \(\pi\)-estimator of the population total has the following generic form: \[ D^2_{st}(x_w,\, w \in \mathcal W) = \sum_{w \in \mathcal W}\, \frac{a_w^2}{x_w} - b, \] where \(\mathcal W= \{1, \ldots, H\}\) denotes the set of strata labels, \(H\) is the total number of strata, \((x_w)_{w \in \mathcal W}\) are the strata sample sizes, and the parameters \(b\) and \(a_w > 0,\, w \in \mathcal W\), do not depend on \((x_w)_{w \in \mathcal W}\). One stratified sampling design whose \(\pi\)-estimator variance has the above form is stratified simple random sampling without replacement. Under this design \(a_w = N_w S_w,\, w \in \mathcal W\) and \(b = \sum_{w \in \mathcal W}\, N_w S_w^2\), where \(S_w,\, w \in \mathcal W\), denote the stratum standard deviations of the study variable and \(N_w,\, w \in \mathcal W\), are the strata sizes (see e.g. Särndal et al. (1993), Result 3.7.2, p. 103).
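As a rough illustration of the variance formula above under stratified simple random sampling without replacement, the following base R sketch computes \(a_w\), \(b\) and \(D^2_{st}\) directly. The helper name variance_st and the allocation x are hypothetical and are not part of stratallo; the strata sizes and standard deviations are the ones used in the Examples section.

```r
# Hypothetical helper (not a stratallo function):
# D^2_st = sum(a_w^2 / x_w) - b.
variance_st <- function(x, a, b) {
  stopifnot(length(x) == length(a), all(x > 0), all(a > 0))
  sum(a^2 / x) - b
}

N <- c(3000, 4000, 5000, 2000) # Strata sizes.
S <- c(48, 79, 76, 17)         # Stratum standard deviations of the study variable.
a <- N * S                     # a_w = N_w * S_w.
b <- sum(N * S^2)              # b = sum of N_w * S_w^2.

x <- c(40, 60, 70, 20)         # An arbitrary allocation, for illustration only.
variance_st(x, a, b)
```

Since \(b\) does not depend on the allocation, it only shifts the variance and has no influence on which allocation is optimal for a fixed total sample size.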

Apart from dopt and nopt, stratallo provides the var_tst and var_tst_si functions that compute the value of the variance \(D^2_{st}\). The var_tst_si function is a simple wrapper around var_tst, dedicated to the case of simple random sampling without replacement in each stratum. Furthermore, the package comes with two predefined, artificial populations with 507 and 969 strata. These are stored in the pop507 and pop969 objects, respectively.

Minimization of the variance with dopt function

The dopt function solves the following two types of the allocation problem, formulated in the language of mathematical optimization.

Problem 1 (one-sided upper bounds constraints)
Given numbers \(a_w > 0,\, M_w > 0,\, w \in \mathcal W\) and \(b,\, n < \sum_{w \in \mathcal W}\, M_w\), \[\begin{align*} \underset{\mathbf x\in (0, +\infty)^{H}}{\mathrm{minimize ~\,}} & \quad f(\mathbf x) = \sum_{w \in \mathcal W} \tfrac{a_w^2}{x_w} - b \\ \mathrm{subject ~ to} & \quad \sum_{w \in \mathcal W} x_w = n \\ & \quad x_w \le M_w, \quad \forall w \in \mathcal W, \end{align*}\] where \(\mathbf x= (x_w)_{w \in \mathcal W}\) is the optimization variable.

Problem 2 (one-sided lower bounds constraints)
Given numbers \(a_w > 0,\, m_w > 0,\, w \in \mathcal W\), and \(b,\, n > \sum_{w \in \mathcal W} m_w\), \[\begin{align*} \underset{\mathbf x\in (0, +\infty)^{H}}{\mathrm{minimize ~\,}} & \quad f(\mathbf x) = \sum_{w \in \mathcal W} \tfrac{a_w^2}{x_w} - b \\ \mathrm{subject ~ to} & \quad \sum_{w \in \mathcal W} x_w = n \\ & \quad x_w \geq m_w, \quad \forall w \in \mathcal W, \end{align*}\] where \(\mathbf x= (x_w)_{w \in \mathcal W}\) is the optimization variable.

The user of dopt chooses whether the computed solution is for Problem 1 or Problem 2 through the m and M arguments of the function. For Problem 1, the user provides the upper bounds with the M argument and leaves m as NULL. Similarly, for Problem 2, the user provides the lower bounds with the m argument and leaves M as NULL. If both m and M are NULL (default), dopt returns the Tschuprov-Neyman allocation, which minimizes the variance \(D^2_{st}\) under the constraint on the total sample size \(\sum_{w \in \mathcal W} x_w = n\) and is given by \[ x_w = a_w \frac{n}{\sum_{v \in \mathcal W} a_v}, \quad w \in \mathcal W. \]

Four different algorithms are available for Problem 1: rNa (default), sga, sgaplus and coma. All of them, except sgaplus, are described in detail in Wesołowski et al. (2021). The sgaplus algorithm is defined in Wójciak (2019) as the Sequential Allocation (version 1) algorithm.
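The idea of a recursive Neyman allocation for Problem 1 can be sketched in base R as follows: compute the Tschuprov-Neyman allocation, fix every stratum that exceeds its upper bound at that bound, and repeat on the remaining strata with the reduced sample size. This is a minimal sketch under that reading of the algorithm; the function name rna_sketch is hypothetical and the code is not the package's implementation.

```r
# Illustrative sketch only; rna_sketch is a hypothetical name,
# not a stratallo function.
rna_sketch <- function(n, a, M) {
  stopifnot(length(a) == length(M), all(a > 0), all(M > 0), n < sum(M))
  x <- numeric(length(a))
  active <- rep(TRUE, length(a)) # Strata not yet fixed at their upper bound.
  repeat {
    # Tschuprov-Neyman allocation of the remaining sample among active strata.
    x[active] <- a[active] * n / sum(a[active])
    over <- active & (x > M)
    if (!any(over)) break
    # Fix violating strata at M_w and reallocate the rest.
    x[over] <- M[over]
    n <- n - sum(M[over])
    active <- active & !over
  }
  x
}

N <- c(3000, 4000, 5000, 2000)
S <- c(48, 79, 76, 17)
M <- c(100, 90, 70, 80)
rna_sketch(n = 190, a = N * S, M = M)
```

For this population only the third stratum exceeds its bound in the first pass, so the sketch stops after the second pass with that stratum fixed at 70.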

Optimization Problem 2 is solved by the LrNa algorithm, which is in principle based on rNa and is introduced in Wójciak (2022).
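By analogy with the upper-bounds case, a lower-bounds variant can be sketched in base R: strata whose Tschuprov-Neyman allocation falls below the lower bound are fixed at that bound, and the remaining sample is reallocated among the other strata. This is only an assumed reading of the LrNa idea; the function name lrna_sketch is hypothetical and the code is not taken from the package.

```r
# Illustrative sketch only; lrna_sketch is a hypothetical name,
# not a stratallo function.
lrna_sketch <- function(n, a, m) {
  stopifnot(length(a) == length(m), all(a > 0), all(m > 0), n > sum(m))
  x <- numeric(length(a))
  active <- rep(TRUE, length(a)) # Strata not yet fixed at their lower bound.
  repeat {
    # Tschuprov-Neyman allocation of the remaining sample among active strata.
    x[active] <- a[active] * n / sum(a[active])
    under <- active & (x < m)
    if (!any(under)) break
    # Fix violating strata at m_w and reallocate the rest.
    x[under] <- m[under]
    n <- n - sum(m[under])
    active <- active & !under
  }
  x
}

N <- c(3000, 4000, 5000, 2000)
S <- c(48, 79, 76, 17)
m <- c(50, 120, 1, 1)
lrna_sketch(n = 190, a = N * S, m = m)
```

Here the first two strata fall below their lower bounds in the first pass, so they are fixed at 50 and 120 and the remaining 20 units are split between the last two strata in proportion to \(a_w\).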

Minimization of the total sample size with nopt function

The nopt function solves the following minimum sample size allocation problem, formulated in the language of mathematical optimization.

Problem 3
Given numbers \(a_w > 0,\, M_w > 0,\, w \in \mathcal W\), and \(b,\, D > \sum_{w \in \mathcal W} \tfrac{a_w^2}{M_w} - b > 0\), \[\begin{align*} \underset{\mathbf x\in (0, +\infty)^{H}}{\mathrm{minimize ~\,}} & \quad n(\mathbf x) = \sum_{w \in \mathcal W} x_w \\ \mathrm{subject ~ to} & \quad \sum_{w \in \mathcal W} \tfrac{a_w^2}{x_w} - b = D \\ & \quad x_w \le M_w, \quad \forall w \in \mathcal W, \end{align*}\] where \(\mathbf x= (x_w)_{w \in \mathcal W}\) is the optimization variable.

The algorithm that solves Problem 3 is based on the LrNa and it is described in Wójciak (2022).
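When none of the upper bounds is active, Problem 3 has a closed-form solution: the Lagrange conditions give \(x_w\) proportional to \(a_w\), and the variance constraint then yields \(x_w = a_w \sum_{v \in \mathcal W} a_v / (D + b)\). The base R sketch below covers only this special case; the function name nopt_unbounded_sketch is hypothetical, and active upper bounds, which the package's algorithm handles, are not treated here.

```r
# Illustrative sketch only; nopt_unbounded_sketch is a hypothetical name.
# It ignores the upper bounds M_w, so its result is a valid solution to
# Problem 3 only if it happens to satisfy x_w <= M_w for all strata.
nopt_unbounded_sketch <- function(D, a, b) {
  stopifnot(all(a > 0), D + b > 0)
  a * sum(a) / (D + b)
}

a <- c(3000, 4000, 5000, 2000)
b <- 70000
M <- c(100, 90, 70, 80)
D <- 1e6

x <- nopt_unbounded_sketch(D, a, b)
all(x <= M)      # TRUE means no upper bound is violated by this candidate.
sum(a^2 / x) - b # Variance attained by x; should equal D up to rounding.
sum(x)           # Total sample size of the candidate.
```

For these inputs the candidate satisfies all upper bounds, and its total sample size agrees with the value reported for nopt in the Examples section.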

Installation

You can install the released version of the stratallo package from CRAN with:

install.packages("stratallo")

Examples

These are basic examples that show how to use dopt and nopt functions to solve optimal sample allocation problems for an example population with 4 strata.

library(stratallo)

Function dopt

# Define example population
N <- c(3000, 4000, 5000, 2000) # Strata sizes.
S <- c(48, 79, 76, 17) # Standard deviations of a study variable in strata.
a <- N * S
M <- c(100, 90, 70, 80) # Upper bounds constraints imposed on the sample sizes in strata.
all(M <= N)
#> [1] TRUE
n <- 190 # Total sample size.
n < sum(M)
#> [1] TRUE

# Solution to Problem 1.
opt <- dopt(n = n, a = a, M = M)
opt
#> [1] 34.979757 76.761134 70.000000  8.259109
sum(opt) # Equals to n.
#> [1] 190
all(opt <= M) # Does not violate upper bounds constraints.
#> [1] TRUE
# Variance of the pi-estimator that corresponds to a given optimal allocation.
var_tst_si(opt, N, S)
#> [1] 4035156476

m <- c(50, 120, 1, 1) # Lower bounds constraints imposed on the sample sizes in strata.
n > sum(m)
#> [1] TRUE

# Solution to Problem 2.
opt <- dopt(n = n, a = a, m = m)
opt
#> [1]  50.000000 120.000000  18.357488   1.642512
sum(opt) # Equals to n.
#> [1] 190
all(opt >= m) # Does not violate lower bounds constraints.
#> [1] TRUE
# Variance of the pi-estimator that corresponds to a given optimal allocation.
var_tst_si(opt, N, S)
#> [1] 9755319333

# Tschuprov-Neyman allocation (no inequality constraints).
opt <- dopt(n = n, a = a)
opt
#> [1] 31.304348 68.695652 82.608696  7.391304
sum(opt) # Equals to n.
#> [1] 190
# Variance of the pi-estimator that corresponds to a given optimal allocation.
var_tst_si(opt, N, S)
#> [1] 3959066000

Function nopt


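# Define example population.
a <- c(3000, 4000, 5000, 2000) # Parameters a of the variance function.
b <- 70000 # Parameter b of the variance function.
M <- c(100, 90, 70, 80) # Upper bounds constraints imposed on the sample sizes in strata.
D <- 1e6 # Variance constraint.

# Solution to Problem 3.
opt <- nopt(D, a, b, M)
sum(opt) # Minimum total sample size.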
#> [1] 183.1776

References

Neyman, J. (1934), “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection,” Journal of the Royal Statistical Society, 97, 558–606.
Särndal, C.-E., Swensson, B., and Wretman, J. (1993), Model Assisted Survey Sampling, Springer.
Tschuprov, A. A. (1923), “On the Mathematical Expectation of the Moments of Frequency Distributions in the Case of Correlated Observations,” Metron, 2, 461–493, 646–683.
Wesołowski, J., Wieczorkowski, R., and Wójciak, W. (2021), “Optimality of the Recursive Neyman Allocation,” Journal of Survey Statistics and Methodology. https://doi.org/10.1093/jssam/smab018. https://arxiv.org/abs/2105.14486.
Wójciak, W. (2019), “Optimal Allocation in Stratified Sampling Schemes,” MSc Thesis, Warsaw University of Technology. http://home.elka.pw.edu.pl/~wwojciak/msc_optimal_allocation.pdf.
Wójciak, W. (2022), “Minimum Sample Size Allocation in Stratified Sampling Under Constraints on Variance and Strata Sample Sizes.” https://doi.org/10.48550/arXiv.2204.04035.