A tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis. The key objective of generating synthetic data is to replace sensitive original values with synthetic ones while causing minimal distortion of the statistical information contained in the data set. Variables, which can be categorical or continuous, are synthesised one by one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data using parametric or classification and regression tree (CART) models. Data are synthesised via the function syn(), which can be largely automated if default settings are used, or run with methods defined by the user. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthesised data. For a description of the implemented method see Nowok, Raab and Dibben (2016).
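The two ways of calling syn() described above can be sketched as follows. This is a minimal illustration, not a recommended workflow: the choice of variables, the 500-row subset of the SD2011 data set shipped with the package, the seed value and the particular methods are all assumptions made for the example.

```r
library(synthpop)

# Small illustrative subset of the SD2011 survey data shipped with synthpop;
# the choice of variables and rows is arbitrary.
vars <- c("sex", "age", "edu", "income")
ods  <- SD2011[1:500, vars]

# Largely automated synthesis: default method and visit sequence.
sds_default <- syn(ods, seed = 2016)

# User-defined methods, one per variable in column order: the first variable
# is sampled from its observed values, later ones are drawn conditionally on
# the variables synthesised before them.
sds_custom <- syn(ods,
                  method = c("sample", "cart", "polyreg", "normrank"),
                  seed = 2016)

# The synthetic data frame is stored in the `syn` element.
head(sds_custom$syn)
```

Fixing the seed makes a run reproducible; the synthetic values themselves will differ from the original records by design.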
BUG FIXES
NEW FEATURES
numtocat and catgroups for syn(). Variables in numtocat are converted into categorical variables with breaks determined by their distribution and catgroups gives the target number of groups for each variable. The data with the categorical versions are then synthesised and finally the synthesised variables in numtocat are created from bootstrap samples within the categories. This feature was developed for the second stage of "ipf" and "catall" but can be used with any method. Variables in numtocat must have a method suitable for categorical data.
numtocat parameter of syn() if you want to keep the categorical variables in the synthetic data.
stat with possible values "percents" or "counts" allows tables and plots to display counts instead of percentages in groups.
CHANGES
print.tables is to print if up to a 3-way table;
useNA to include or exclude NA values from tables;
print.stats to allow a choice of which statistics are printed; the default is to print only the Voas-Williamson statistic in a simpler format.
mincriterion for syn.ctree() changed to 0.9.
models = TRUE, models are stored for the variables in the visit sequence and for missing values in the continuous variables.
BUG FIXES
BUG FIXES
var.
CHANGES
BUG FIXES
NEW FEATURES
CHANGES
cp parameter in syn.cart() changed to 1e-8.
print.tables parameter in utility.tab() changed to TRUE.
cont.na, have their elements named.
BUG FIXES
NEW FEATURES
CHANGES
smooth.vars parameter added to sdc() function which allows smoothing of numeric variables in the synthesised dataset.
models = TRUE in syn(), for logreg and polyreg coefficients of the fitted model are returned.
population.inference is TRUE or FALSE (see vignette on inference). Also it now includes p-values and stars as are shown for lm() and glm().
msel parameter of print.summary.fit.synds() prints a table of estimates rather than a listing of each fit in detail.
BUG FIXES
seed for syn.strata().
CHANGES
return.result parameter replaced by print.coef with slightly different functionality: analysis-specific utility measures are always printed, but you can choose whether or not to print model estimates.
BUG FIXES
CHANGES
BUG FIXES
times argument corrected (lists of numbers coerced to numbers).
NEW FEATURES
models set to TRUE.
uniques.exclude for the sdc() function, which can be used to remove some variables from the identification of uniques.
CHANGES
drop.not.used = TRUE.
method changed to "cart".
minnumlevels changed to -1 (during synthesis numeric variables are not changed to factors regardless of the number of distinct values).
polyreg and polr method increased to 1000 (maxit parameter). Message if the limit is reached.
maxfaclevels is not generated if method for the factor is set to "sample" or "nested".
ymarr and ysepdiv in SD2011 dataset changed from yy to yyyy.
BUG FIXES
rules have been extended and include e.g. initial and closing round bracket.
filetype in write.syn()
BUG FIXES
contrasts attribute for factors synthesised using parametric method.
NEW FEATURES
coef.diff) and confidence interval overlap (ci.overlap).
CHANGES
coefplot package.
drop.not.used changed to FALSE.
CHANGES
visit.sequence.
rules, rvalues, cont.na, semicont, smoothing, event, denom are specified as named lists, e.g. rules = list(marital = "age < 18") and do not have to be specified for all variables.
funname.argname arguments, e.g. ctree.minbucket = 5; they are function-specific; minbucket removed from arguments.
synds and fit.synds); it replaced two separate functions.
return.plot for compare() method for class fit.synds.
msel for compare() method for class synds, which allows comparison for pooled or selected data set(s). Results for multiple synthetic data sets can be plotted on the same graph.
nrow for compare() method for class synds; nrow and ncol determine number of plots per screen.
plot.na for compare() method for class synds is no longer required and missing data categories for numeric variables are plotted on the same plot as non-missing values.
object of lm.synds() and glm.synds() functions changed to data.
fit.synds gives by default combined coefficient estimates only.
fit.synds gives combined coefficient estimates and their standard errors.
synds with multiple synthetic data sets provides by default summaries that are calculated by averaging summary values for all synthetic data copies.
obs.data of compare.fit.synds() function changed to data.
surv.ctree and cart.bboot changed to survctree and cartbboot.
BUG FIXES
denom and event for variables with missing data.
maxfaclevels can be increased.
NEW FEATURES
semicont parameter that allows spike(s) to be defined for semi-continuous variables in order to synthesise them separately.
lognorm, sqrtnorm and cubertnorm methods for synthesis by linear regression after natural logarithm, square root or cube root transformation of a dependent variable.
seed argument for syn() function.
CHANGES
populationInference, visitSequence, predictorMatrix, contNA, defaultMethod, printFlag and nlevelmax have been changed to population.inference, visit.sequence, predictor.matrix, cont.na, default.method, print.flag and minnumlevels respectively.
BUG FIXES
family in compare.fit.synds().