On sample size for low proportions

Intro

When dealing with labor force statistics, a key variable for the design of a household unemployment survey is the labor force status of individuals. Governments are interested in a set of indicators that measure and track the occupation of the citizens of the country (or region). For example, you can obtain estimates of the current unemployment rate (measured monthly or quarterly); the net change between two periods and the gross flows between employment categories across periods are also of interest.

Three types of design have been developed to address the particular features of labor force studies. The first is the repeated survey, where similar measurements are made at different points in time on different people each time. The second is the panel survey, where measurements are made at different points in time on the same people each time. The third is the rotative survey, where elements are included and followed in the sample for a specific period; as they leave the sample, new elements are added.

A common rule of thumb for computing sample size holds that, since the design variable is dichotomous (it depends on employment status), its variance reaches its maximum when the success probability is 0.5, so the sample size should be computed with that value in mind. However, if public policies in a country aim to lower the unemployment rate through government interventions that positively affect the labor force, and those strategies are effective, then the success probability of the design variable changes, and this may affect the sample size of household surveys.

In this document, we focus on the sample size induced by controlling the absolute margin of error. There are other approaches for computing sample sizes, for example controlling the coefficient of variation; under that approach, however, the sample size increases substantially as the proportion gets lower. By contrast, when controlling the absolute margin of error, the variance function behind the approach is symmetric around 0.5, so the sample size needed to fulfill the quality requirements for any proportion \(P_d\) is the same as the one required for its additive complement \(1-P_d\).
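
The contrast between the two approaches can be checked numerically. The following is a minimal sketch in base R; the helper name `unit_variance` and the grid of proportions are assumptions introduced here for illustration and are not part of samplesize4surveys.

```r
# Unit variance of a dichotomous variable: P * (1 - P)
# (hypothetical helper, defined only for this illustration)
unit_variance <- function(p) p * (1 - p)

p <- c(0.05, 0.20, 0.50, 0.80, 0.95)
v <- unit_variance(p)

# Symmetry around 0.5: P and 1 - P share the same variance, hence the
# same sample size for a fixed absolute margin of error
symmetric <- isTRUE(all.equal(v, unit_variance(1 - p)))

# The squared coefficient of variation is proportional to (1 - P) / P,
# which is not symmetric and grows as P approaches zero
relative_variance <- (1 - p) / p

data.frame(p, v, relative_variance)
```

The variance column peaks at 0.5 and mirrors itself on both sides, while the relative variance column keeps growing as the proportion gets lower.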

This document provides several examples typifying scenarios found in practice. The calculations are done with the R statistical software [@R]. Using the samplesize4surveys library [@GUTIss4s], specifically the ss4p and ss4dp functions, you can compute proper sample sizes for the above scenarios. The latest version of the library can be found in the GitHub repository.

library(devtools)
install_github("psirusteam/samplesize4surveys")
library(samplesize4surveys)

Repeated Surveys

First, recall that the confidence interval for a proportion \(P_d\) is a function of the estimator \(\hat{P}_d\), the normal percentile \(z_{1-{\alpha/2}}\) for a given statistical confidence (\(1-\alpha\)), and the estimated variance of the estimator \(\hat{V}(\hat{P}_d)\). Hence, the confidence interval is given by:

\[ CI(P_d) = \hat{P}_d \pm z_{1-{\alpha/2}}\sqrt{\hat{V}(\hat{P}_d)} \]

Note that the last term in this expression is usually referred to as the absolute margin of error \(\varepsilon = z_{1-{\alpha/2}}\sqrt{\hat{V}(\hat{P}_d)}\). When dealing with a repeated survey, the sample size formula to consider is as follows:

\[n_0 \geq \frac{z_{1-\alpha/2}^2}{\varepsilon^2}S^2\]

where \[S^2 = P_d (1- P_d) \times Deff\]

Note that \(P_d\) is a prior estimate of the proportion we are interested in, and \(Deff\) is the design effect induced by the complex sampling design. As we are dealing with a finite population, we have to apply the finite population correction factor, which yields

\[n \geq \frac{n_0}{1 + \frac{n_0}{N}}\]
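
The two formulas above can be sketched in a few lines of base R. This is an illustrative re-implementation, not the source code of ss4p; the function name `n_for_proportion` and its argument names are assumptions made here. The inputs are those of the first scenario discussed in this section.

```r
# Hypothetical helper (not part of samplesize4surveys): sample size for a
# proportion under a fixed absolute margin of error
n_for_proportion <- function(N, P, deff, conf, me) {
  z  <- qnorm(1 - (1 - conf) / 2)  # normal percentile z_{1 - alpha/2}
  S2 <- P * (1 - P) * deff         # unit variance inflated by the design effect
  n0 <- z^2 * S2 / me^2            # sample size ignoring the population size
  n0 / (1 + n0 / N)                # finite population correction
}

n_first <- n_for_proportion(N = 1e6, P = 0.05, deff = 2,
                            conf = 0.95, me = 0.0025)
ceiling(n_first)  # 55169, the n.me value reported for the first scenario
```

Rounding up with `ceiling` is a choice made in this sketch so that the requirement \(n \geq n_0 / (1 + n_0/N)\) holds for an integer sample size.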

As we are estimating a proportion, we have to consider which values are suitable for the absolute margin of error \(\varepsilon\). For example:

  • First scenario: if the unemployment rate is low, say \(\hat{P}_d = 0.05\), and the margin of error is around \(\varepsilon = 0.0025\), then the confidence interval would be \(CI = 0.05 \pm 0.0025 = (0.0475, 0.0525)\), and as shown in Figure 1, the required sample size is around 55169.
ss4p(N = 1000000, P = 0.05, DEFF = 2, conf = 0.95, 
     cve = 0.03, me = 0.0025, plot = TRUE)

Figure 1: Sample size computation for the first scenario.

## $n.cve
## [1] 40512
## 
## $n.me
## [1] 55169
  • Second scenario: if the unemployment rate is high, say \(\hat{P}_d = 0.2\), and the margin of error is around \(\varepsilon = 0.01\), then the confidence interval would be \(CI = 0.2 \pm 0.01 = (0.19, 0.21)\), and as shown in Figure 2, the required sample size is 12144.
ss4p(N = 1000000, P = 0.20, DEFF = 2, conf = 0.95, 
     cve = 0.03, me = 0.01, plot = TRUE)

Figure 2: Sample size computation for the second scenario.

## $n.cve
## [1] 8811
## 
## $n.me
## [1] 12144

Note that both scenarios yield the same relative margin of error (RME), defined as

\[RME=\frac{\varepsilon}{\hat{P}_d}.\]

For the first one, we have \(RME=0.0025/0.05=5\%\), and for the second one, we have \(RME=0.01/0.2=5\%\). So, even for the same relative margin of error, the sample size must be higher if the phenomenon we are interested in has a low incidence in the finite population.

In fact, you can define an information function to check whether your sample size is enough to fulfill the quality requirements for a given proportion. This is useful because you do not know exactly what value the proportion will take. Also, if the household survey attempts to estimate other proportions (as in a multipurpose survey), you will quickly find out whether your current sample size is suitable for the whole study.

The function e4p computes the margin of error and the coefficient of variation of the estimate under a complex design. It also provides graphics to analyze the behavior of the sampling errors, for the given sample size, over the whole range of values that proportions may take. For example,

  • Third scenario: if the sample size is defined to be n=10000, and the proportion is around \(P_d = 0.2\), then the coefficient of variation will be 2.8%, and the margin of error will be 1.1%. From Figure 3 you may note that all of the estimated proportions will have a margin of error of less than 1.4%.
e4p(N = 1000000, n = 10000, P = 0.20, DEFF = 2, conf = 0.95, plot = TRUE)

Figure 3: Estimation of sampling errors for the third scenario.

## $cve
## [1] 2.814249
## 
## $Margin_of_error
## [1] 1.103166
  • Fourth scenario: if the sample size is defined to be n=40000, and the proportion is around \(P_d = 0.05\), then the coefficient of variation will be 3.0%, and the margin of error will be 0.3%. From Figure 4 you may note that all of the estimated proportions will have a margin of error of less than 0.7%.
e4p(N = 1000000, n = 40000, P = 0.05, DEFF = 2, conf = 0.95, plot = TRUE)

Figure 4: Estimation of sampling errors for the fourth scenario.

## $cve
## [1] 3.019934
## 
## $Margin_of_error
## [1] 0.2959481
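
The errors reported for the third and fourth scenarios can be approximated with base R. This is a sketch under the assumption that the variance of the estimated proportion is \(Deff \, (1 - n/N) \, P_d (1 - P_d) / n\); the function name `errors_for_proportion` is hypothetical and the code is not the source of e4p.

```r
# Hypothetical helper: coefficient of variation and margin of error (both in %)
# for a given sample size, assuming Var = deff * (1 - n/N) * P * (1 - P) / n
errors_for_proportion <- function(N, n, P, deff, conf) {
  z  <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(deff * (1 - n / N) * P * (1 - P) / n)
  c(cve = 100 * se / P, margin_of_error = 100 * z * se)
}

third  <- errors_for_proportion(N = 1e6, n = 10000, P = 0.20, deff = 2, conf = 0.95)
fourth <- errors_for_proportion(N = 1e6, n = 40000, P = 0.05, deff = 2, conf = 0.95)

# Worst-case margin of error over all proportions is reached at P = 0.5
worst_third  <- errors_for_proportion(1e6, 10000, 0.5, 2, 0.95)[["margin_of_error"]]
worst_fourth <- errors_for_proportion(1e6, 40000, 0.5, 2, 0.95)[["margin_of_error"]]

round(rbind(third, fourth), 4)
```

Under this assumption the sketch agrees with the outputs shown above, and the worst-case values stay below the 1.4% and 0.7% bounds quoted for the two scenarios.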

Also, notice that:

  1. For a given proportion \(\hat{P}_d\), the required sample size to achieve a particular margin of error is the same as for its additive complement \(1-\hat{P}_d\).

  2. As you might expect, if a sample size achieves the requirements on the coefficient of variation for a proportion \(P_d\), it will also achieve them for any proportion higher than \(P_d\).

  3. For a given proportion \(\hat{P}_d\), the required sample size to achieve a particular coefficient of variation is not the same as for its additive complement \(1-\hat{P}_d\). Thus, for a low proportion and a given sample size, the coefficient of variation will be higher than for its additive complement.

Other Surveys

Now we turn our attention to the net change in the unemployment rate between two periods, \(\Delta = |P_{1d} - P_{2d}|\). The confidence interval for this parameter \(\Delta\) is a function of the estimator \(\hat{\Delta}\), the normal percentile \(z_{1-{\alpha/2}}\) for a given statistical confidence (\(1-\alpha\)), and the estimated variance of the estimator \(\hat{V}(\hat{\Delta})\). Hence, the confidence interval is given by:

\[ CI(\Delta) = \hat{\Delta} \pm z_{1-{\alpha/2}}\sqrt{\hat{V}(\hat{\Delta})} \]

The last term in this expression is defined as the absolute margin of error \(\varepsilon = z_{1-{\alpha/2}}\sqrt{\hat{V}(\hat{\Delta})}\). This kind of parameter may be estimated using a repeated, rotative or panel survey. However, as we will see, the required sample size is smaller if you estimate net changes from a rotative or panel survey. The proper expression for the sample size is as follows:

\[n_0 \geq \frac{z_{1-\alpha/2}^2}{\varepsilon^2}S^2\]

where \[S^2 = \left[ P_{1d} (1- P_{1d}) + P_{2d} (1- P_{2d}) \right] \times (1- T \times R) \times Deff\]

Note that \(P_{1d}\) and \(P_{2d}\) are prior estimates of the proportion we are interested in (at both periods), \(Deff\) is the design effect, \(T\) is the overlap sampling ratio between periods (see note 1), and \(R\) is the correlation between periods (see note 2). As we are dealing with a finite population, we apply the finite population correction factor, which yields an expression similar to the one found in the former section.
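
The computation for a net change can be sketched in base R as follows. The sketch assumes that the unit variance of the difference is the sum of the two binomial variances, scaled by \((1 - T \times R)\) and by the design effect, which is the form that reproduces the outputs reported in this section. The function name `n_for_net_change` and the argument names `overlap` (for \(T\)) and `corr` (for \(R\)) are assumptions made here; this is not the source code of ss4dp.

```r
# Hypothetical helper: sample size for the difference of two proportions
# under a fixed absolute margin of error
n_for_net_change <- function(N, P1, P2, deff, conf, me, overlap = 0, corr = 1) {
  z  <- qnorm(1 - (1 - conf) / 2)
  # Sum of the two unit variances, reduced by the overlap-correlation term
  S2 <- (P1 * (1 - P1) + P2 * (1 - P2)) * (1 - overlap * corr) * deff
  n0 <- z^2 * S2 / me^2
  n0 / (1 + n0 / N)  # finite population correction
}

# Inputs of the fifth scenario (repeated survey, so overlap = 0)
n_fifth <- n_for_net_change(N = 1e5, P1 = 0.22, P2 = 0.20, deff = 2,
                            conf = 0.95, me = 0.001)
ceiling(n_fifth)  # 96224, the n.me value reported for the fifth scenario
```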

As we are not estimating a proportion, but a net change, we have to consider which values are suitable for the absolute margin of error \(\varepsilon\). For example:

  • Fifth scenario: if we do not expect significant changes between periods, and the unemployment rates are both high, that is \(\Delta \approx |0.22 - 0.20| = 0.02\), and the margin of error is around \(\varepsilon = 0.001\), then the confidence interval would be \(CI = 0.02 \pm 0.001 = (0.019, 0.021)\), and the required sample size is around 96224.
ss4dp(N = 100000, P1 = 0.22, P2 = 0.20, DEFF = 2, 
      conf = 0.95, cve = 0.03, T = 0, R = 1, 
      me = 0.001, plot = TRUE)

Figure 5: Sample size computation for the fifth scenario.

## $n.cve
## [1] 94852
## 
## $n.me
## [1] 96224
  • Sixth scenario: if we do not expect significant changes between periods, and the unemployment rates are both low, that is \(\Delta \approx |0.05 - 0.03| = 0.02\), and the margin of error is around \(\varepsilon = 0.001\), then the confidence interval would be \(CI = 0.02 \pm 0.001 = (0.019, 0.021)\). Note that the call below sets me = 0.002, so the reported n.me of 59536 corresponds to \(\varepsilon = 0.002\); with \(\varepsilon = 0.001\) the same formula gives a required sample size of about 85476.
ss4dp(N = 100000, P1 = 0.05, P2 = 0.03, DEFF = 2, 
      conf = 0.95, cve = 0.03, T = 0, R = 1, 
      me = 0.002, plot = TRUE)

Figure 6: Sample size computation for the sixth scenario.

## $n.cve
## [1] 80973
## 
## $n.me
## [1] 59536
  • Seventh scenario: if we expect significant changes between periods, and the unemployment rates differ, that is \(\Delta \approx |0.05 - 0.20| = 0.15\) and the margin of error is around \(\varepsilon = 0.0075\), then the confidence interval would be \(CI = 0.15 \pm 0.0075 = (0.1425, 0.1575)\), and the required sample size is around 22083.
ss4dp(N = 100000, P1 = 0.05, P2 = 0.20, DEFF = 2, 
      conf = 0.95, cve = 0.03, T = 0, R = 1, 
      me = 0.0075, plot = TRUE)

Figure 7: Sample size computation for the seventh scenario.

## $n.cve
## [1] 17009
## 
## $n.me
## [1] 22083

All of the former scenarios are stated with the same RME, defined as

\[RME=\frac{\varepsilon}{\Delta}.\]

For the fifth and sixth scenarios, we have \(RME=0.001/0.02=5\%\), and for the seventh one, we have \(RME=0.0075/0.15=5\%\). So, even for the same net change \(\Delta\), the sample size changes depending on the configuration of proportions. Of course, you may expect changes if the overlap sample portion \(T\) and the correlation \(R\) between periods change. Also, notice that:

  1. We can find different configurations of proportions at both periods \(\hat{P}_{1d}\) and \(\hat{P}_{2d}\) that induce the same net change \(\Delta\), as shown in the fifth and sixth scenarios.

  2. Contrary to what you might expect, if a sample size achieves the quality requirements for a net change \(\Delta = |\hat{P}_{1d} - \hat{P}_{2d}|\), it will not necessarily meet the quality requirements for the same nominal value of the net change under a different configuration of the proportions involved, say \(\Delta = |\hat{P}_{3d} - \hat{P}_{4d}|\).

  3. To meet quality requirements, under the same RME, you need a larger sample if you do not expect significant changes in unemployment rates between periods.

  4. If the net change is the same under two configurations of proportions, then, to meet quality requirements under the same RME, you need a larger sample if the unemployment rates are high.
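
The effect of the overlap \(T\) and the correlation \(R\) can be explored with a short base R sketch. The function `n_for_net_change` and its argument names are hypothetical, the variance is assumed to be the sum of the two binomial variances scaled by \((1 - T \times R)\), and the correlation of 0.8 is an illustrative value chosen here, not a figure from this document.

```r
# Hypothetical helper: sample size for a net change under a fixed absolute
# margin of error, assuming S2 = (P1*Q1 + P2*Q2) * (1 - overlap * corr) * deff
n_for_net_change <- function(N, P1, P2, deff, conf, me, overlap = 0, corr = 1) {
  z  <- qnorm(1 - (1 - conf) / 2)
  S2 <- (P1 * (1 - P1) + P2 * (1 - P2)) * (1 - overlap * corr) * deff
  n0 <- z^2 * S2 / me^2
  n0 / (1 + n0 / N)
}

# Repeated (T = 0), rotative (T = 0.5) and panel (T = 1) designs,
# using the proportions of the fifth scenario and an assumed R = 0.8
overlap <- c(0, 0.5, 1)
n <- ceiling(n_for_net_change(N = 1e5, P1 = 0.22, P2 = 0.20, deff = 2,
                              conf = 0.95, me = 0.001,
                              overlap = overlap, corr = 0.8))
data.frame(design = c("repeated", "rotative", "panel"), overlap, n)
```

With a positive correlation, the required sample size falls as the overlap grows, which is the reduction mentioned earlier for rotative and panel surveys.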

Discussion

Based on the results of the previous sections, under a fixed RME (5% in all of the cases), we found that when attempting to estimate proportions (such as the raw unemployment rate):

  1. If the proportion is low, we anticipate a large sample size.
  2. If the proportion is high, we expect a small sample size.

Now, when attempting to estimate net changes of proportions (such as the annual or monthly change in unemployment rates), we found that:

  1. If rates are significantly different, we expect a small sample size.
  2. If rates are similar, and the proportions are both low, we require a moderate sample size.
  3. If rates are similar, and the proportions are both large, we expect a large sample size.

It is worth stressing that the RME is not a linear measure. If the proportion is around \(P_d=1\%\), for example, an RME of 5% will yield a confidence interval of \((0.01 \pm 0.0005)\); however, if the proportion is around \(P_d = 90\%\), an RME of 5% will induce a confidence interval of \((0.90 \pm 0.045)\). So, you will need a larger sample size for the first case, because the confidence interval will be narrower.
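
The two cases can be compared numerically with base R. This is a sketch only: the population size \(N = 1000000\) and the design effect of 2 are borrowed from the earlier scenarios as assumptions, and the variable names are introduced here for illustration.

```r
P    <- c(0.01, 0.90)  # low and high proportions from the text
rme  <- 0.05           # relative margin of error of 5%
me   <- rme * P        # implied absolute margins of error
N    <- 1e6            # assumed population size
deff <- 2              # assumed design effect

z  <- qnorm(0.975)                      # 95% confidence
n0 <- z^2 * P * (1 - P) * deff / me^2   # sample size ignoring N
n  <- ceiling(n0 / (1 + n0 / N))        # finite population correction

data.frame(P, me, n)
```

The same 5% RME translates into a far tighter absolute margin for the low proportion, and the required sample size for that case is much larger than for the high proportion.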

When computing sample sizes by controlling the absolute margin of error (\(\varepsilon=1\%\) for all of the scenarios), we found that the first scenario requires a sample size of 3637, whereas the second scenario requires a sample size of 12144.

If our interest is in the sample size for net changes when controlling the absolute margin of error (\(\varepsilon=1\%\) for all of the scenarios), we found that the fifth scenario requires a sample size of 20304, the sixth scenario a sample size of 5559, and the seventh scenario a sample size of 13751. In summary: if rates are significantly different, we expect a moderate sample size; if rates are similar and the proportions are both low, we require a small sample size; if rates are similar and the proportions are both large, we expect a large sample size.
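
The figures quoted for \(\varepsilon = 1\%\) can be cross-checked with base R. The two helper functions below are hypothetical re-implementations written for this check, with the net-change variance assumed to be the sum of the two binomial variances; the population sizes and design effect are those used in the scenarios.

```r
# Hypothetical helper: fixed-margin sample size from a unit variance S2
n_from_variance <- function(N, S2, conf, me) {
  z  <- qnorm(1 - (1 - conf) / 2)
  n0 <- z^2 * S2 / me^2
  ceiling(n0 / (1 + n0 / N))  # finite population correction, rounded up
}

# Hypothetical helper: unit variance of a dichotomous variable
unit_variance <- function(p) p * (1 - p)

deff <- 2
me   <- 0.01

# Proportions: first and second scenarios (N = 1000000)
n_prop <- n_from_variance(1e6, unit_variance(c(0.05, 0.20)) * deff, 0.95, me)

# Net changes: fifth, sixth and seventh scenarios (N = 100000, T = 0)
P1 <- c(0.22, 0.05, 0.05)
P2 <- c(0.20, 0.03, 0.20)
n_change <- n_from_variance(1e5, (unit_variance(P1) + unit_variance(P2)) * deff,
                            0.95, me)

n_prop    # 3637 12144
n_change  # 20304 5559 13751
```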


  1. As stated by @Kish_2004 [section 6.1], \(T\) refers to the portion of units that are followed across periods. Thus, a panel survey yields \(T=1\), a repeated survey \(T=0\), and a rotative survey \(0<T<1\). For example, if \(T=0.5\), both samples have 50% of the units in common.

  2. Note that \(R\) typically ranges from -1 to 1 and refers to the correlation of the same variable of interest between the periods considered. It is very unusual to find cases where \(R\) is non-positive; in such cases, the sample size must increase.
