A tool for producing synthetic versions of microdata containing confidential information so that they are safe to release to users for exploratory analysis. The key objective of generating synthetic data is to replace sensitive original values with synthetic ones while causing minimal distortion of the statistical information contained in the data set. Variables, which can be categorical or continuous, are synthesised one by one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data using parametric models or classification and regression tree (CART) models. Data are synthesised via the function syn(), which can be largely automated if default settings are used, or run with methods defined by the user. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthesised data. For a description of the implemented method see Nowok, Raab and Dibben (2016).
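The workflow described above can be sketched in a few lines of R. This is a minimal illustration, not a recommended configuration: it assumes the synthpop package is installed, the choice of four SD2011 columns is arbitrary, and the seed value is only there to make the run reproducible.

```r
# Minimal sketch; assumes the synthpop package is installed.
library(synthpop)

# Arbitrary small subset of the SD2011 survey data shipped with the package
ods <- SD2011[, c("sex", "age", "edu", "marital")]

# Synthesise with default settings; variables are processed one by one
# in column order. The seed value is an arbitrary choice for reproducibility.
s1 <- syn(ods, seed = 2016)

s1$method         # synthesising method used for each variable
head(s1$syn)      # first rows of the synthetic data set
compare(s1, ods)  # compare synthetic and original distributions
```

Supplying a method vector or a different visit sequence to syn() replaces the defaults; the comparison step is a quick check of how closely the synthetic distributions follow the original ones.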
catgroups for syn(). Variables in numtocat are converted into categorical variables with breaks determined by their distribution, and catgroups gives the target number of groups for each variable. The data with the categorical versions are then synthesised, and finally the synthesised variables in numtocat are created from bootstrap samples within the categories. This feature was developed for the second stage of "ipf" and "catall" but can be used with any method. Variables in numtocat must have a method suitable for categorical data.
numtocat parameter of syn() if you want to keep the categorical variables in the synthetic data.
stat with possible values "percents" or "counts" allows tables and plots to display counts instead of percentages in groups.
print.tables is to print if up to a 3-way table;
useNA to include or exclude NA values from tables;
print.stats to allow a choice of which statistics are printed; the default is to print only the Voas-Williamson statistic in a simpler format.
mincriterion for syn.ctree() changed to
models = TRUE, models are stored for the variables in the visit sequence and for missing values in the continuous variables.
cp parameter in syn.cart() changed to 1e-8.
print.tables parameter in utility.tab() changed to
cont.na, have their elements named.
smooth.vars parameter added to sdc() function, which allows smoothing of numeric variables in the synthesised dataset.
models = TRUE in syn(), for polyreg coefficients of the fitted model are returned.
population.inference is TRUE or FALSE (see vignette on inference). It also now includes p-values and significance stars, as shown for lm() and glm().
msel parameter of print.summary.fit.synds() prints a table of estimates rather than a listing of each fit in detail.
return.result parameter replaced by print.coef with slightly different functionality: analysis-specific utility measures are always printed, but you can choose whether or not to print model estimates.
times argument corrected (lists of numbers coerced to numbers).
models set to TRUE.
uniques.exclude for the sdc() function, which can be used to remove some variables from the identification of uniques.
drop.not.used = TRUE.
method changed to "cart".
minnumlevels changed to -1 (during synthesis numeric variables are not changed to factors regardless of the number of distinct values).
polr method increased to 1000 (maxit parameter). A message is given if the limit is reached.
maxfaclevels is not generated if method for the factor is set to "sample" or "nested".
ysepdiv in SD2011 dataset changed from
rules have been extended and include e.g. initial and closing round brackets.
contrasts attribute for factors synthesised using a parametric method.
coef.diff) and confidence interval overlap (
drop.not.used changed to FALSE.
denom are specified as named lists, e.g. rules = list(marital = "age < 18") and do not have to be specified for all variables.
funname.argname arguments, e.g. ctree.minbucket = 5; they are function-specific;
minbucket removed from arguments.
fit.synds); it replaced two separate functions.
return.plot for compare() method for class
msel for compare() method for class
synds, which allows comparison for pooled or selected data set(s). Results for multiple synthetic data sets can be plotted on the same graph.
nrow for compare() method for class
ncol determine number of plots per screen.
plot.na for compare() method for class synds is no longer required and missing data categories for numeric variables are plotted on the same plot as non-missing values.
object of lm.synds() and glm.synds() functions changed to
fit.synds gives by default combined coefficient estimates only.
fit.synds gives combined coefficient estimates and their standard errors.
synds with multiple synthetic data sets provides by default summaries that are calculated by averaging summary values for all synthetic data copies.
obs.data of compare.fit.synds() function changed to
event for variables with missing data.
maxfaclevels can be increased.
semicont parameter that allows spike(s) to be defined for semi-continuous variables so that they are synthesised separately.
cubertnorm methods for synthesis by linear regression after natural logarithm, square root or cube root transformation of a dependent variable.
seed argument for syn() function.
nlevelmax have been changed to