Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys

In the field of stratified sampling design, this package offers an approach for the determination of the best stratification of a sampling frame, the one that ensures the minimum sample cost under the condition to satisfy precision constraints in a multivariate and multidomain case. This approach is based on the use of the genetic algorithm: each solution (i.e. a particular partition in strata of the sampling frame) is considered as an individual in a population; the fitness of all individuals is evaluated applying the Bethel-Chromy algorithm to calculate the sampling size satisfying precision constraints on the target estimates. Functions in the package allows to: (a) analyse the obtained results of the optimisation step; (b) assign the new strata labels to the sampling frame; (c) select a sample from the new frame accordingly to the best allocation. Functions for the execution of the genetic algorithm are a modified version of the functions in the 'genalg' package.


Changes in Version 1.3

o The optimization of population frame is run in parallel if different domains are considered. To this end, the parameter 'parallel' can be set to TRUE in the 'optimizeStrata' function. If not specified, n-1 of total available cores are used OR if number of domains < (n-1) cores, then number of cores equal to number of domains are used.

o A new function 'KmeansSolution' produces an initial solution using the kmeans algorithm by clustering atomic strata considering the values of the means of target variables in them. Also, if the parameter 'nstrata' is not indicated, the optimal number of clusters is determined inside each domain, and the overall solution is obtained by concatenating optimal clusters obtained in domains. By indicating this solution as a suggestion to the optimization step, this may greatly speed the convergence to the optimal solution.

o A new function 'selectSampleSystematic' has been added. It allows to select a stratified sample with the systematic method, that is a selection that begins selecting the first unit by an initial randomly chosen starting point, and proceeding in selecting other units by adding an interval that is the inverse of the sampling rate in the stratum. This selection method can be useful if associated to a particular ordering of the selection frame, where the ordering variable(s) can be considered as additional stratum variable(s).

o It is now possible to handle "anticipated variance" by introducing a model linking a proxy variable whose values are available for all units in the sampling frame, with the target variable whose values are not available. In this implementation only linear model can be chosen. When calling the 'buildStrataDF' function, a dataframe is given, containing three parameters (beta, sigma2 and gamma) for each couple target / proxy. On the basis of these parameters, means and standard deviations in sampling strata are calculated accordingly to given formulas.

Changes in Version 1.2

o The crossover function in the genetic algorithm has been modified by considering the "grouping" version of this algorithm: instead of mixing chromosomes in an indifferentiate way, groups of them in one parent (representing already aggregated strata) are attributed to the other parent when generating a child, preserving their composition. Moreover, parents are selected with a probability proportional to their fitness (in previous version selection was completely at random). In many cases this can greatly speed the convergence to an optimal solution.

o A new function 'adjustSize' has been added. It allows to adjust the sample size and related allocation in strata on the basis of an externally indicated overall sample size. The adjustment of the sample size is perfomed by increasing or decreasing it proportionally in each optimized stratum.

o A new function 'buildFrameDF' has been added. It allows to create a 'sampling frame' dataframe by indicating the dataset in which the information on all the units are contained, the identifier, the X variables, the Y variable and the variable that indicates the domains of interest.

o In function 'optimizeStrata' now the 'initialStrata' parameter is a vector, whose length is equal to the number of strata in the different domains.

o All outputs are written to an .\output subdirectory.

Changes in Version 1.1

o Function 'memoise' from the same package (now required) is applied before each evaluation in order to save processing time. This may largely increase the efficiency of the algorithm

o A 'recode' function is applied on every generated solution in order to recode a genotype of n genes with k<=n distinct alleles 1, 2, ..., k in such a way that the distinct alleles of the recoded genotype appear in the natural order 1, 2, ..., k. This avoids to consider as distinct two solutions that are equivalent but make use of a different coding

Changes in Version 1.0-4

o Bug fix for old releases

Changes in Version 1.0-3

o Modified the output to the console or to the file of results: instead of all the solutions, only the optimal value for each generation is visualised

o Now the visualisation of the trend in the optimal and mean values is optional: the plot can be avoided by setting showPlot = FALSE when calling optimizeStrata. It can be advisable when the number of iterations is very high

Changes in Version 1.0-2

o Bug fix for old releases.

Changes in Version 1.0-1

o Bug fix for old releases.

o The object returned by function "optimizeStrata" is no more a dataframe but a list: * the first element of the list is the solution vector (solution$indices) * the second element of the list is the dataframe containing aggregated strata (solution$aggr_strata)

o In all the functions that previously produced .csv files and .pdf plots in the working directory, as a default this is no more the current behaviour. To write these files, it is now necessary to set the "writeFiles" flag to TRUE

Changes in Version 1.0

o Bug fix for old releases.

o Two new functions: * "evalSolution", to evaluate the found solution in terms of expected target variables precision and bias obtainable by samples drawn from the otpimized frame; * "tuneParameters", to determine the best combination of values to assign to the parameters necessary for the execution of the genetic algorithm used for the optimization of the frame stratification.

o A demonstration on the use of the "tuneParameters" function is in the vignette "tuneParameters.pdf" contained in the \inst\doc folder.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.


1.4 by Giulio Barcaroli, 2 months ago


Report a bug at https://github.com/barcaroli/SamplingStrata/issues

Browse source code at https://github.com/cran/SamplingStrata

Authors: Giulio Barcaroli , Marco Ballin , Hanjo Odendaal , Daniela Pagliuca , Egon Willighagen , Diego Zardetto

Documentation:   PDF Manual  

Task views: Official Statistics & Survey Methodology

GPL (>= 2) license

Depends on memoise, doParallel, pbapply, formattable

Suggests knitr, rmarkdown

See at CRAN