The theory
Population
Sample
Sample selection
Estimation
Sample size

Population
Let U denote the population we want to investigate. Let N be the size of this population. We assign a sequence number to each element in the population. Hence, we can represent the population by the set

     \[ U = \{1, 2, \ldots, N\} \]

The target variable represents the phenomenon we want to investigate in the survey. We denote the target variable by Y. It has a value for each element in the population. These values are denoted by

     \[ Y_1, Y_2, \ldots, Y_N \]

For example, if the target variable is the income of a person, then Y1 is the income of person 1, Y2 is the income of person 2, etc.

The objective of the survey is to learn more about certain characteristics of the population. These characteristics usually take the form of population parameters.

One important population parameter is the population mean. The population mean of target variable Y is defined as

     \[ \bar{Y} = \frac{1}{N} \sum_{k=1}^{N} Y_k = \frac{Y_1 + Y_2 + \cdots + Y_N}{N} \]

For example, if Y denotes the income of a person, the population mean is equal to the mean income in the population.

The objective of a survey can also be to estimate the population percentage of elements with a specific property. An example is the percentage of people having Internet access. To do this, a target variable is defined that can only assume two values:

  • Y = 1, if the element has the property;
  • Y = 0, if the element does not have the property.
The population percentage can now be written as

     \[ P = 100 \, \bar{Y} = \frac{100}{N} \sum_{k=1}^{N} Y_k \]

Another population parameter to be mentioned is the population variance. This quantity measures the amount of variation of the values of the target variable. The population variance is equal to

     \[ S^2 = \frac{1}{N-1} \sum_{k=1}^{N} (Y_k - \bar{Y})^2 \]

The population variance is one of the factors determining the accuracy of estimates. The population variance can also be seen as a measure of the homogeneity of the population. For example, this variance is equal to 0 if everyone in the population has the same income. Large income differences will result in a large population variance.
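
The three population parameters described above can be illustrated with a small computation. This is only a sketch: the five incomes and the Internet access indicators are made-up values, and the variance is assumed to use the divisor N - 1, the convention under which the sample variance is an unbiased estimator of it.

```python
# Hypothetical population of N = 5 monthly incomes (illustrative values only).
incomes = [1800, 2200, 2500, 3100, 4400]
N = len(incomes)

# Population mean: the sum of all values divided by N.
pop_mean = sum(incomes) / N

# Population variance, assuming the divisor N - 1.
pop_var = sum((y - pop_mean) ** 2 for y in incomes) / (N - 1)

# Indicator variable: 1 if the person has Internet access, 0 otherwise (hypothetical).
has_internet = [1, 0, 1, 1, 0]
pop_percentage = 100 * sum(has_internet) / N

print(pop_mean, pop_var, pop_percentage)
```

In a real survey these quantities are unknown, because the values are observed only for the sample elements.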


Sample
Let n denote the sample size. A sample of size n that has been selected (without replacement) from the population U of size N is denoted by the series of indicators

     \[ a_1, a_2, \ldots, a_N \]

The indicator a_k assumes the value 1 if element k has been selected, and otherwise it assumes the value 0. Since probability sampling has been applied to select the sample, the indicators are random variables. By definition, the sample size n is equal to the sum of the indicators:

     \[ n = \sum_{k=1}^{N} a_k = a_1 + a_2 + \cdots + a_N \]

The values of the target variable can be measured only for the sample elements. These values are used to estimate population parameters. The available values are denoted by

     \[ y_1, y_2, \ldots, y_n \]

Note the use of the convention to write population quantities in upper case (e.g. the population size N) and sample quantities in lower case (e.g. the sample size n). So y_1 is the value of the first element in the sample.

It is assumed here that the values of the target variable can be measured for all elements in the sample.
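
The indicator notation can be sketched in a few lines. The population size, the sample size and the seed below are hypothetical, and `random.sample` from the Python standard library stands in for whatever selection procedure is actually used.

```python
import random

N = 10  # population size (hypothetical)
n = 4   # desired sample size (hypothetical)

rng = random.Random(2024)  # fixed seed only to make the sketch repeatable

# Draw n distinct sequence numbers from 1..N (sampling without replacement).
selected = rng.sample(range(1, N + 1), n)

# Indicator a_k is 1 if element k is in the sample and 0 otherwise.
a = [1 if k in selected else 0 for k in range(1, N + 1)]

# The sample size equals the sum of the indicators.
print(selected, a, sum(a))
```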


Sample selection
It is only possible to draw reliable conclusions about the population as a whole if the sample has been selected by a random selection procedure. This guarantees that no groups in the population are systematically over- or under-represented in the sample.

A randomizer is a device to draw a probability sample in a fair and objective way. A randomizer must have the following properties:

  • The device can be used repeatedly;
  • Every time the device is activated, it produces a number in the range from 1 up to and including N, where the value of N is known;
  • Each time, all outcomes are equally probable. Knowledge of previous outcomes does not help to improve prediction of subsequent outcomes. In other words, every prediction system fails.
Random numbers can be obtained in practice in, for example, the following ways:
  • Consulting a table with random numbers (for small samples);
  • Using a calculator with a random number function (for small samples);
  • Using a computer (for large samples).
The table below contains random numbers. They are grouped in sequences of five digits.

00822 63134 04080 29373 68731   34282 41827 94880 11505 07677
79771 19758 62062 81259 11215   42167 70001 78364 74388 10001
58614 41056 09869 27746 12931   93018 56160 39534 93340 87194
71287 49101 03330 45468 52358   62658 33674 26879 17227 49102
12073 76580 28601 14410 57528   04036 28540 91001 89127 94058

Example: a sample from a membership list

Suppose a sample of 10 members has to be drawn from the membership list of a club. The club has 682 members. Therefore, you need 10 random numbers in the range from 1 up to and including 682. Choose an arbitrary starting point in the table above and follow some route through the numbers. Take groups of three subsequent digits and use each group as a three-digit number. If this number is 000 or larger than 682, ignore it and try the next number. Otherwise, it is the sequence number of the next element in the sample.

If you process the table row by row, and take the first three digits of each group of five digits, the following sequence of numbers is obtained:

     008, 631, 040, 293, 687, 342, 418, 948, 115, 076, 797, 197, 620.

The numbers 687, 948 and 797 are larger than 682, so they are not used. The selected sequence numbers are therefore:

     8, 631, 40, 293, 342, 418, 115, 76, 197 and 620.
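
The table lookup in this example can be sketched as a short routine. The function name `draw_from_table` is made up for illustration. It also skips numbers that were already selected, which does not occur in this particular example.

```python
# The random number table from the text, as groups of five digits per row.
table = [
    "00822 63134 04080 29373 68731 34282 41827 94880 11505 07677",
    "79771 19758 62062 81259 11215 42167 70001 78364 74388 10001",
    "58614 41056 09869 27746 12931 93018 56160 39534 93340 87194",
    "71287 49101 03330 45468 52358 62658 33674 26879 17227 49102",
    "12073 76580 28601 14410 57528 04036 28540 91001 89127 94058",
]


def draw_from_table(rows, population_size, sample_size):
    """Use the first three digits of each group, row by row.

    Numbers outside 1..population_size and repeated numbers are ignored.
    """
    selected = []
    for row in rows:
        for group in row.split():
            number = int(group[:3])
            if 1 <= number <= population_size and number not in selected:
                selected.append(number)
            if len(selected) == sample_size:
                return selected
    return selected  # the table ran out before the sample was complete


print(draw_from_table(table, 682, 10))
# [8, 631, 40, 293, 342, 418, 115, 76, 197, 620]
```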

Many programming languages and calculators can generate random numbers. Often they have a routine that generates random values in the interval [0, 1). The value 0 may occur, but not the value 1. This routine can be applied to generate random integers in the range from 1 to N:

  1. Draw a random value from [0, 1).
  2. Multiply the outcome by the population size N.
  3. Round the result down (truncate) to the nearest integer.
  4. Add 1 to this integer.
Suppose the computer routine produces the following values:

     0.12073 0.76580 0.28601 0.14410

Application of the algorithm above for N = 682 produces the following sequence numbers:

     83, 523, 196, 99
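
The four steps can be sketched in Python as follows. The helper name `to_sequence_number` is made up for illustration. `random.random()` is the standard library routine that returns a value in [0, 1).

```python
import math
import random


def to_sequence_number(u, population_size):
    """Map a value u in [0, 1) to an integer in 1..population_size."""
    # Multiply by N, round down, then add 1.
    return math.floor(u * population_size) + 1


# The four values from the text, with N = 682.
values = [0.12073, 0.76580, 0.28601, 0.14410]
print([to_sequence_number(u, 682) for u in values])  # [83, 523, 196, 99]

# With a built-in generator instead of fixed values:
print(to_sequence_number(random.random(), 682))
```

Because the value 1 never occurs, the result never exceeds N. Because the value 0 may occur, adding 1 in the last step is what makes element 1 selectable.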

A table with random numbers can also be used for the algorithm above. Take series of digits and treat them as the fractional parts of decimal numbers between 0 and 1. Suppose you use the first four groups of five digits. This produces the values

     0.00822 0.63134 0.04080 0.29373

Application of the algorithm (with N = 682) produces the numbers

     6, 431, 28, 201.

Sampling without replacement is preferred. This implies that an element can be selected in the sample at most once. The random selection procedures described above do not prevent an element from being selected several times in a sample. If this happens, the repeated number is ignored and a new attempt is made.
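
A minimal sketch of this "ignore and try again" rule is given below. The function name and the seed are hypothetical, and the routine assumes a generator object that offers a `random()` method returning values in [0, 1).

```python
import random


def draw_without_replacement(population_size, sample_size, rng=random):
    """Draw distinct sequence numbers from 1..population_size by rejection."""
    if sample_size > population_size:
        raise ValueError("sample size cannot exceed population size")
    selected = []
    seen = set()
    while len(selected) < sample_size:
        k = int(rng.random() * population_size) + 1
        if k in seen:
            continue  # already selected: ignore it and make a new attempt
        seen.add(k)
        selected.append(k)
    return selected


sample = draw_without_replacement(682, 10, random.Random(1))
print(sample)
```

In practice, `random.sample(range(1, N + 1), n)` from the standard library draws a sample without replacement directly. The loop above only serves to make the rejection step explicit.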


Estimation
You can measure the values of the target variable for the sample elements. These are the values that become available for estimation purposes. They are denoted by

     \[ y_1, y_2, \ldots, y_n \]

You use these sample values to estimate a population parameter. The recipe to compute such an estimate is called an estimator. Preferably, estimators have some specific properties:

  • The estimator must be unbiased. If sample selection and computation of the estimate are repeated a large number of times, the estimates should on average be equal to the true value of the population parameter. Unbiasedness guarantees that an estimator does not systematically under- or over-estimate the population parameter.
  • The estimator must be precise. This means the variation of the possible outcomes must be small.

An unbiased and precise estimator will, with high probability, produce estimates that are close to the value to be estimated.
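
Unbiasedness can be checked exactly for a very small population by listing every possible sample. The sketch below uses the sample mean as the estimator, which is introduced in the next subsection, and a made-up population of four values. It assumes that every subset of size n is equally likely, as under simple random sampling without replacement.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 12]  # hypothetical target values Y_1..Y_4
n = 2

# All possible samples of size n, each equally likely under this design.
sample_means = [mean(s) for s in combinations(population, n)]

# Unbiasedness: averaged over all possible samples, the estimator
# equals the population mean.
print(sample_means)
print(mean(sample_means), mean(population))
```

The individual sample means differ from the population mean. Only their average over all possible samples coincides with it. How far they scatter around it is what precision describes.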

Estimation of a population mean

If the sample is selected with equal probabilities and without replacement, the sample mean

     \[ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{y_1 + y_2 + \cdots + y_n}{n} \]

is an unbiased estimator of the population mean. The precision of the estimator is indicated by the variance of the estimator. In the case of simple random sampling without replacement, the variance of the sample mean is equal to

     \[ V(\bar{y}) = \left( \frac{1}{n} - \frac{1}{N} \right) S^2 \]

where

     \[ S^2 = \frac{1}{N-1} \sum_{k=1}^{N} (Y_k - \bar{Y})^2 \]

is the population variance. The estimator is precise if its variance is small. The magnitude of the variance is determined by two factors:

  • The population variance. The variance will be small if the population is homogeneous. Then the estimator will be precise.
  • The sample size. A larger sample size will lead to a more precise estimator.
To be able to say something about the precision of the estimator, you need the value of the population variance. Unfortunately, this value is usually not available. This problem can be solved by estimating the population variance using the sample data. The sample variance

     \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

is an unbiased estimator of the population variance. Therefore

     \[ v(\bar{y}) = \left( \frac{1}{n} - \frac{1}{N} \right) s^2 \]

is an unbiased estimator of the variance of the sample mean.
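
As an illustration, the sample mean and its estimated variance might be computed as follows. The function name and the five sample values are hypothetical. The sketch assumes simple random sampling without replacement, with the estimated variance taken as (1/n - 1/N) times the sample variance.

```python
import math


def estimate_mean(sample, population_size):
    """Return the sample mean and the estimated variance of the sample mean."""
    n = len(sample)
    y_bar = sum(sample) / n
    # Sample variance with divisor n - 1.
    s2 = sum((y - y_bar) ** 2 for y in sample) / (n - 1)
    # Estimated variance of the sample mean (without replacement).
    var_hat = (1 / n - 1 / population_size) * s2
    return y_bar, var_hat


# Hypothetical sample of n = 5 incomes from a population of N = 100.
y_bar, var_hat = estimate_mean([2000, 2400, 2600, 3000, 3500], 100)
print(y_bar, var_hat, math.sqrt(var_hat))
```

The term 1/N shrinks the variance when the sample covers a sizeable part of the population. For a very large population it is negligible.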

Estimation of a population percentage

To estimate a population percentage, a target variable is introduced that assumes only two values: 1 (if an element has a specific property) and 0 (if it does not have that property). The population mean of this variable is equal to the fraction of elements having the property. Multiplication by 100 results in the percentage of elements having the property. If this percentage is denoted by P, then

     \[ P = 100 \, \bar{Y} \]

First, you estimate the population mean. The sample mean is an unbiased estimator of this quantity. In this case, the sample mean is equal to the fraction of elements in the sample having the property. Multiplication by 100 gives the sample percentage. This quantity is denoted by

     \[ p = 100 \, \bar{y} \]

Since the sample mean is an unbiased estimator of the population mean, the sample percentage is an unbiased estimator of the population percentage.

The variance of the sample percentage is equal to

     \[ V(p) = \left( \frac{1}{n} - \frac{1}{N} \right) \frac{N}{N-1} \, P \, (100 - P) \]

You can estimate this variance, using the sample data, with the formula

     \[ v(p) = \left( \frac{1}{n} - \frac{1}{N} \right) \frac{n}{n-1} \, p \, (100 - p) \]

Confidence interval

Interpreting the variance of an estimator is not simple. What is a large value and what is a small value? A better indicator of the precision is the confidence interval. The first step in determining a confidence interval is to compute the standard error of the estimator. The standard error of an estimator is defined as the square root of its variance. For the sample mean, this is

     \[ S(\bar{y}) = \sqrt{V(\bar{y})} = \sqrt{\left( \frac{1}{n} - \frac{1}{N} \right) S^2} \]

The standard error is estimated in practice by replacing the population variance in the expression by the estimated population variance:

     \[ s(\bar{y}) = \sqrt{v(\bar{y})} = \sqrt{\left( \frac{1}{n} - \frac{1}{N} \right) s^2} \]

The confidence interval is characterized by a lower bound and an upper bound. These bounds are computed using the available sample data. The bounds are computed such that the probability that the interval contains the value of the population parameter is at least equal to a pre-defined (large) probability 1 - α. The quantity 1 - α is called the confidence level.

Often, α is taken equal to 0.05. Consequently, the confidence level is equal to 0.95. This can be interpreted as follows: if sample selection and the subsequent computation of the confidence interval are repeated a large number of times, the interval will contain the population parameter in, on average, 95 out of 100 cases.

So the statement that the confidence interval contains the population value will be wrong in 5% of the cases. To say it differently: you run a 5% risk of drawing a wrong conclusion.

You are free to choose the value of the confidence level. If you want to draw a more reliable conclusion from the survey data, you take a smaller value of α. For example, you could set α to 0.01. You pay a price for a higher reliability: the resulting confidence interval will be wider. So, your conclusion is less precise. This is a dilemma you always have: either a reliable conclusion with a low precision, or a less reliable conclusion with a higher precision.

It is simple to determine the bounds of the confidence interval. The midpoint of the interval is the estimate itself (e.g. the sample mean or the sample percentage). The upper bound is obtained by adding the margin of error M. The lower bound is obtained by subtracting the margin of error. The margin of error M is equal to the standard error of the estimate, multiplied by some constant. The constant is equal to 1.96 for a confidence level of 0.95.
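
The constant 1.96 is the two-sided quantile of the standard normal distribution for α = 0.05. Constants for other confidence levels can be obtained in the same way, assuming the estimator is approximately normally distributed. The sketch below uses `statistics.NormalDist` from the Python standard library. The helper name `z_constant` is made up for illustration.

```python
from statistics import NormalDist


def z_constant(alpha):
    """Two-sided standard normal quantile for confidence level 1 - alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


print(round(z_constant(0.05), 2))  # 1.96
print(round(z_constant(0.01), 2))  # 2.58
```

A smaller α gives a larger constant and therefore a wider interval, which is the trade-off between reliability and precision described above.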

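For estimating the population mean, the 95% confidence interval is equal to

     \[ \left( \bar{y} - 1.96 \, S(\bar{y}), \;\; \bar{y} + 1.96 \, S(\bar{y}) \right) \]

For estimating the population percentage, the 95% confidence interval is equal to

     \[ \left( p - 1.96 \, S(p), \;\; p + 1.96 \, S(p) \right) \]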

The standard error is unknown in practice. Therefore, this quantity is replaced by its sample-based estimate.
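
A sketch of both intervals, with the standard error replaced by its sample-based estimate, could look as follows. The function names and the data are hypothetical. The estimated variances are assumed to be (1/n - 1/N) s² for the mean and (1/n - 1/N) · n/(n - 1) · p(100 - p) for the percentage, as under simple random sampling without replacement.

```python
import math

Z_95 = 1.96  # constant for a confidence level of 0.95


def mean_interval(sample, population_size):
    """95% confidence interval for the population mean."""
    n = len(sample)
    y_bar = sum(sample) / n
    s2 = sum((y - y_bar) ** 2 for y in sample) / (n - 1)
    se = math.sqrt((1 / n - 1 / population_size) * s2)  # estimated standard error
    return y_bar - Z_95 * se, y_bar + Z_95 * se


def percentage_interval(indicators, population_size):
    """95% confidence interval for the population percentage (0/1 data)."""
    n = len(indicators)
    p = 100 * sum(indicators) / n
    se = math.sqrt((1 / n - 1 / population_size) * n / (n - 1) * p * (100 - p))
    return p - Z_95 * se, p + Z_95 * se


# Hypothetical data: 40 of 100 sampled persons have the property, N = 10,000.
low, high = percentage_interval([1] * 40 + [0] * 60, 10_000)
print(round(low, 1), round(high, 1))
```

With these made-up numbers the margin of error is about 9.6 percentage points, so a sample of 100 gives only a rough estimate.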


Sample size
An important question in the survey design phase is always how large the sample size must be. The sample size to be determined is denoted by n.

There is no simple answer to this question. There is a relation between the sample size and the precision of the estimator: a larger sample size will result in a more precise estimator. Once you have determined how precise an estimator must be, you can compute the corresponding sample size.

Sample size for estimating the population mean

Suppose you impose the condition that the margin of error of the 95% confidence interval may not exceed a specified value M. You can translate this condition into the inequality

     \[ n \geq \frac{1}{\left( \dfrac{M}{1.96 \, S} \right)^2 + \dfrac{1}{N}} \]

For large values of the population size N, the inequality can be simplified to

     \[ n \geq \left( \frac{1.96 \, S}{M} \right)^2 \]

The problem is that both expressions contain the unknown value S, the square root of the population variance. Sometimes it is possible to use an estimate from a previous survey. Or maybe there is some indication of its value from a small test survey. If no indication at all of the value of S is available, the following rules of thumb may be helpful:

  • The values of the target variable have a more or less normal distribution over an interval of length L. Hence, L will be approximately equal to 6S. You can substitute the value 0.17 × L for S.
  • The values of the target variable have a more or less uniform distribution over an interval of length L. Consequently, S will be approximately equal to 0.3 × L.
  • The values of the target variable have a more or less exponential distribution over an interval of length L. There are many small values and only a few large values. Consequently, S will be approximately equal to 0.4 × L.
  • S assumes its largest value if half of the values are concentrated at the lower end of the interval of length L, and the other half are concentrated at the upper end. In this case, S will be equal to 0.5 × L.
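
The sample size computation for a mean can be sketched as below. The function name and the numbers are hypothetical. The sketch assumes the condition 1.96 × sqrt((1/n - 1/N) S²) ≤ M under simple random sampling without replacement, and rounds the result up to the next integer.

```python
import math


def sample_size_mean(margin, s, population_size=None, z=1.96):
    """Smallest n such that z * sqrt((1/n - 1/N) * s**2) <= margin.

    If population_size is None, the large-population approximation is used.
    """
    inverse_n = (margin / (z * s)) ** 2
    if population_size is not None:
        inverse_n += 1 / population_size
    return math.ceil(1 / inverse_n)


# Hypothetical survey: values spread roughly normally over a range of L = 6000,
# so the first rule of thumb suggests S of about 0.17 * L. Desired margin M = 100.
L = 6000
s_guess = 0.17 * L
print(sample_size_mean(100, s_guess, population_size=20_000))
print(sample_size_mean(100, s_guess))
```

Because the rules of thumb give only a rough value for S, the resulting sample size should be read as an order of magnitude rather than an exact requirement.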

Sample size for estimating the population percentage

Suppose M is the margin of error of the confidence interval for the population percentage that may not be exceeded. For example, a value of M = 2 means that the difference between the sample percentage and the population percentage may not exceed 2 percentage points.

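Given the margin of error M, you can compute the minimal sample size with the expression

     \[ n \geq \frac{1}{\dfrac{N-1}{N} \left( \dfrac{M}{1.96} \right)^2 \dfrac{1}{P(100-P)} + \dfrac{1}{N}} \]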

P is unknown. This is the value to be estimated in the survey. Sometimes some indication is available from a previous version of the survey, or from some other survey. This indication can be substituted. If no information at all is available about P, the worst-case value P = 50 can be substituted. This yields a sample size that is large enough for any value of P.
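
The computation for a percentage can be sketched as follows. The function name is made up. The sketch assumes the condition 1.96 × sqrt(V(p)) ≤ M with V(p) = (1/n - 1/N) · N/(N - 1) · P(100 - P), and rounds up to the next integer. The two calls use the settings of the examples that follow.

```python
import math


def sample_size_percentage(margin, p=50, population_size=None, z=1.96):
    """Smallest n for which the margin of error of a percentage is at most `margin`.

    p defaults to the worst case of 50. If population_size is None, the
    large-population approximation is used.
    """
    inverse_n = (margin / z) ** 2 / (p * (100 - p))
    if population_size is not None:
        N = population_size
        inverse_n = (N - 1) / N * inverse_n + 1 / N
    return math.ceil(1 / inverse_n)


print(sample_size_percentage(3, 50, 40_000))  # 1040
print(sample_size_percentage(5, 50, 400))     # 197
```

In the second case almost half of the population has to be sampled, which shows why the term 1/N cannot be ignored for small populations.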

Example 1:

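Population of size N = 40,000, P = 50, M = 3 (difference not larger than 3 percentage points):

     \[ n \geq \frac{1}{\dfrac{39999}{40000} \left( \dfrac{3}{1.96} \right)^2 \dfrac{1}{50 \times 50} + \dfrac{1}{40000}} \approx 1039.4, \quad \text{so } n = 1040 \]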

Example 2:

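Population of size N = 400, P = 50, M = 5 (difference not larger than 5 percentage points):

     \[ n \geq \frac{1}{\dfrac{399}{400} \left( \dfrac{5}{1.96} \right)^2 \dfrac{1}{50 \times 50} + \dfrac{1}{400}} \approx 196.2, \quad \text{so } n = 197 \]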

Approximation 1:

If the population size N is very large, say N > 10,000, the expression for the sample size can be reduced to

     \[ n \geq \left( \frac{1.96}{M} \right)^2 P \, (100 - P) \]

Approximation 2:

If the population size N is very large, say N > 10,000, and P completely unknown, the expression of the sample size can be reduced to