A tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis. The key objective of generating synthetic data is to replace sensitive original values with synthetic ones causing minimal distortion of the statistical information contained in the data set. Variables, which can be categorical or continuous, are synthesised one-by-one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data using parametric or classification and regression trees models. Data are synthesised via the function syn() which can be largely automated, if default settings are used, or with methods defined by the user. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthesised data. For a description of the implemented method see Nowok, Raab and Dibben (2016)

NEW FEATURES

- New methods "catall" and "ipf" that synthesise a group of variables together at the start of the synthesis.
- New parameters
`numtocat`

and`catgroups`

for syn(). Variables in`numtocat`

are converted into categorical variables with breaks determined by their distribution and`catgroups`

gives the target number of groups for each variable. The data with the categorical versions are then synthesised and finally the synthesised variables in`numtocat`

are created from bootstrap samples within the categories. This feature was developed for the second stage of "ipf" and "catall" but can be used with any method. Variables in`numtocat`

must have a method suitable for categorical data. - New function numtocat.syn() can be used to group numeric variables into
categories. It can be used before synthesis to create a new data frame that
can then be synthesised. It should be used instead of the
`numtocat`

parameter of syn() if you want to keep the categorical variables in the synthetic data. - The function nested.syn() can now be used with continuous variables nested within categories. It also allows smoothing of the non-missing data.
- New function codebook.syn() that can be used to check features of data before synthesis.
- New parameter to compare.synds()
`stat`

with possible values "percents" or "counts" allows tables and plots to display counts instead of percentages in groups.

CHANGES

- Some improvements to utility.tab():
- default for
`print.tables`

is to print if up to a 3-way table; - a new parameter
`useNA`

to include or exclude NA values from tables; - a new parameter
`print.stats`

to allow a choice of what statistics to be printed and the default is to print only the Voas-Williamson statistic with a simpler format.

- default for
- Format of printed output from utility.gen() has been changed to emphasise the ratio to the expected as the most important measure.
`mincriterion`

for syn.ctree() changed to`0.9`

.- If the syn() parameter
`models = TRUE`

, models are stored for the variables in the visit sequence and for missing values in the continous variables.

BUG FIXES

- utility.tab() and utility.gen() for synthetic data generated from syn.strata().
- Smoothing for "sample" method.

BUG FIXES

- Compatibility with the new version of ggplot2 (2.3.0).
- Check in multi.compare() for missing argument
`var`

. - Check in sdc() for data type of variables to be smoothed.
- Size of $syn when k < n and all synthesising methods without predictors.

CHANGES

- In padModel.syn() the default for newly defined synthesising methods is NOT to create dummy variables from factors before fitting models.

BUG FIXES

- Extraction of variable name from a tranformed dependant variable in glm.synds(), lm.synds() and multinom.synds().
- In syn.polyreg() numerical variables used as predictors for the multinom() function are scaled so as to have a range (0,1) to improve convergence. An extra warning is included if all the preicted values are in one category, which can result when the model fails to iterate due to sparse data.

NEW FEATURES

- New version of utility.gen() and print.utility.gen(). These now implement
the methods described in Snoke et. al. and include the following:
- appropriate method of getting the null distribution is selected based on synthesis details;
- warning messages if CART models fail to split;
- allow a seed value to be set and stored for resampling methods;
- printing Z scores from logit models using the average fit over different syntheses.

- New function multinom.synds() to fit multinomial models to synthetic data using the multinom() function from the package nnet.

CHANGES

- Default
`cp`

parameter in syn.cart() changed to 1e-8. - Default
`print.tables`

parameter in utility.tab() changed to`TRUE`

. - All syn() result components that are lists, e.g.
`cont.na`

, have their elements named.

BUG FIXES

- Call to classIntervals() in utility.tab() uses style = "fisher" as default to avoid problems for variables with a small number of unique values.
- utility.tab() computes breaks for grouping from the combined data rather than just from the observed. This avoids problems when synthetic values are outside the range of the observed ones.
- In compare.fit.synds() the quantity real.varcov needs to be multiplied by sigma^2 when fitting.function is "lm".

NEW FEATURES

- Vignette on inference from fitted models (we are grateful to Joerg Drechsler for his comments).
- New function syn.satcat allows saturated categorical models to be fitted from all possible interactions of the predictor variables.

CHANGES

- utility.tab() returns p-values for utility measures rather than standardised versions of the measures.
`smooth.vars`

parameter added to sdc() function which allows smoothing of numeric variables in the synthesised dataset.- If
`models = TRUE`

in syn(), for`logreg`

and`ployreg`

coefficients of the fitted model are returned. - Output from print.summary.fit.synds() is labelled differently according to
whether
`population.inference`

is TRUE or FALSE (see vignette on inference). Also it now includes p-values and stars as are shown for lm() and glm(). - Warning messages are given in summary.fit.synds() when inference is attempted from an invalid model where synthsis has not conditioned on any variables that have been left unsynthesised.
- The
`msel`

parameter of print.summary.fit.synds() prints a table of estimates rather than a lsiting of each fit in detail. - Several changes in compare.fit.synds() are described in detail in the new vignette on inference ('Inference from fitted models in synthpop'). These include extensions of the lack-of-fit tests and confidence interval overlap measures for different options.

BUG FIXES

- replicated.uniques() for a single variable.
- syn.cart() for logical variables without missings (thanks to bug report by Ruben Arslan).
- syn() can be used without loading the package (thanks to reported issue by Ruben Arslan)
- Missing data factor level as a stratum in syn.strata()
`seed`

for syn.strata().- print.fit.synds() for syn.strata() object.

CHANGES

- For consistency reasons tab.utility() changed to utility.tab() and utility.synds() to utility.gen().
- utility.gen() under major revision and temporarily unavailable.
- Revision of utility.tab() function: a measure of fit proposed by Voas and Williams and one proposed by Freeman and Tukey are calculated; continous variables are categorised using classIntervals() function.
- Revision of compare.fit.synds() function: new lack-of-fit measures and mean
values for existing analysis-specific utility measures (confidence interval
overlap and absolute standardized difference between coefficient);
`return.result`

parameter replaced by`print.coef`

with slightly different functionality - analysis-specific utility measures are always printed but you can choose whether to print or not model estimates.

BUG FIXES

- compare.synds() for integer variables without missing values (thanks to bug report by Joerg Drechsler).

CHANGES

- Update of the vignette.

BUG FIXES

- compare.synds() for variables with NA values in observed but not in synthetic data returns correct value (0) for NA category in synthetic data.
- Invalid
`times`

argument corrected (lists of numbers coerced to numbers).

NEW FEATURES

- Storing results of CART models when
`models`

set to TRUE. - Function syn.strata() for stratified synthesis.
- Function multi.compare() for multivariate comparison of synthesised and observed data.
- Synthesising method "nested" for a variable nested within another variable.
- Tabular utility function tab.utility() for comparing contingency tables from observed and synthesized data.
- Parameter
`uniques.exclude`

for the sdc() function, which can be used to remove some variables from the identification of uniques. - Function replicated.uniques() returns a number of unique individuals in the original data set ($no.uniques).

CHANGES

- Synthetic values of collinear variables are derived based on the one that is synthesised first and their method is set to "collinear". They do not have to be removed prior to synthesis.
- Synthesising method for constant variables is set to "constant" and the
variables are not removed from the synthesised data set when
`drop.not.used = TRUE`

. - Default synthesising
`method`

changed to "cart". - Default
`minnumlevels`

changed to -1 (during synthesis numeric variables are not changed to factors regardless of the number of distinct values). - Coefficient estimates and their confidence intervals are ploted in the same order as they are presented in a tabular form.
- No message on the seed value used (it is stored in the result object).
- Formula of the model to be fitted using glm.synds() or lm.synds() can be specified outside the function.
- Massage for sdc() on number of replicated uniques also when it is equal to zero.
- Maximum number of iterations for a multinomial model used in
`polyreg`

and`polr`

method increased to 1000 (`maxit`

parameter). Message if the limit is reached. - write.syn() saves complete synds object into a file synobject_filename.RData.
- Error on exceeding
`maxfaclevels`

in not generated if`method`

for the factor is set to "sample" or "nested". - For constant variables method is changed to "constant".
- Year format for variables
`ymarr`

and`ysepdiv`

in SD2011 dataset changed from`yy`

to`yyyy`

.

BUG FIXES

- Types and placement of special signs that are allowed in
`rules`

have been extended and include e.g. initial and closing round bracket. - compare.synds() provides output for logical variables.
- Synthesis of logical variables with missing values.
- Message about a change of method for a variable without predictors.
- Check for
`filetype`

in write.syn()

BUG FIXES

- No calling var(x) on a factor x (in checks).
- No
`contrasts`

attribute for factors synthesised using parametric method. - Misspelled vector name (nlevels) replaced with a correct one (nlevel).

NEW FEATURES

- A new function utility.synds() for distributional comparison of synthesised data with the original (observed) data using propensity scores.
- New measures for comparing model estimates based on synthesised and observed
data implemented in compare.fit.synds() function: standardized differences
in coefficient values(
`coef.diff`

) and confidence interval overlap (`ci.overlap`

).

CHANGES

- No dependency on
`coefplot`

package. - Default for
`drop.not.used`

changed to FALSE.

CHANGES

- Both variable names and their column indices can be used in
`visit.sequence`

. - Arguments
`rules`

,`rvalues`

,`cont.na`

,`semicont`

,`smoothing`

,`event`

,`denom`

are specified as named lists, e.g. rules = list(marital = "age < 18") and do not have to be specified for all variables. - Optional arguments can be passed to synthesising functions by specifying
`funname.argname`

arguments, e.g. ctree.minbucket = 5; they are function-specific;`minbucket`

removed from arguments. - Smoothing is possible for numeric variables when synthesised with the method "sample".
- compare() is a generic function with two methods (for class
`synds`

and`fit.synds`

); it replaced two separate functions. - New argument
`return.plot`

for compare() method for class`fit.synds`

. - New argument
`msel`

for compare() method for class`synds`

, which allows comparison for pooled or selected data set(s). Results for multiple synthetic data sets can be plotted on the same graph. - New argument
`nrow`

for compare() method for class`synds`

;`nrow`

and`ncol`

determine number of plots per screen. - Argument
`plot.na`

for compare() method for class`synds`

is no longer required and missing data categories for numeric variables are ploted on the same plot as non-missing values. - Argument
`object`

of lm.synds() and glm.synds() functions changed to`data`

. - print() method for class
`fit.synds`

gives by default combined coefficient estimates only. - summary() method for class
`fit.synds`

gives combined coefficient estimates and their standard errors. - summary() method for class
`synds`

with multiple synthetic data sets provides by default summaries that are calculated by averaging summary values for all synthetic data copies. - Argument
`obs.data`

of compare.fit.synds() function changed to`data`

. - Method
`surv.ctree`

and`cart.bboot`

changed to`survctree`

and`cartbboot`

.

BUG FIXES

`denom`

and`event`

for variables with missing data.`maxfaclevels`

can be increased.- Continuous variables with missing data when zero is a non-missing value.
- Synthesis of a single variable (with or without auxiliary predictors) now works.

NEW FEATURES

- Function sdc() for statistical disclosure control of the synthesised data set(s); function replicated.uniques() to determine which unique units in the synthesised data set(s) replicates unique units in the original data set.
- Function read.obs() to import original data sets form external files.
- Function write.syn() to export synthetic data sets to external files and create a text file with information about the synthesis.
- syn() has new
`semicont`

parameter that allows to define spike(s) for semi-continuous variables in order to synthesise them separately. `lognorm`

,`sqrtnorm`

and`cubertnorm`

methods for synthesis by linear regression after natural logarithm, square root or cube root transformation of a dependent variable.`seed`

argument for syn() function.

CHANGES

- Revised output of summary.fit.synds() and compare.fit.synds(); standard errors of Z scores corrected (se(Z.syn)) (thanks to Joerg Drechsler).
- Figures for compare.fit.synds() and compare.synds() functions plotted using ggplot2 functions.
- period.separated or alllowercase naming convention has been adopted and
parameter names
`populationInference`

,`visitSequence`

,`predictorMatrix`

,`contNA`

,`defaultMethod`

,`printFlag`

and`nlevelmax`

have been changed to`population.inference`

,`visit.sequence`

,`predictor.matrix`

,`cont.na`

,`default.method`

,`print.flag`

and`minnumlevels`

respectively. - Default for drop.pred.only changed to FALSE.

BUG FIXES

- Rounding procedure (thanks to bug report by Joerg Drechsler).
- Warning about extra disregarded argument
`family`

in compare.fit.synds().