Automatic Codebooks from Metadata Encoded in Dataset Attributes

Easily automate the following tasks to describe data frames: summarise the distributions and labelled missing values of variables graphically and using descriptive statistics. For surveys, compute and summarise reliabilities (internal consistencies, retest, multilevel) for psychological scales. Combine this information with metadata (such as item labels and labelled values) that is derived from R attributes. To do so, the package relies on 'rmarkdown' partials, so you can generate HTML, PDF, and Word documents. Codebooks are also available as tables (CSV, Excel, etc.) and in JSON-LD, so that search engines can find your data and index the metadata. The metadata are also available at your fingertips via RStudio Addins.



Generate markdown codebooks from the attributes of the variables in your data frame

RStudio and a few of the tidyverse packages already usefully display the information contained in the attributes of the variables in your data frame. The haven package also manages to grab variable documentation from SPSS or Stata files.
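As a rough illustration of what "metadata encoded in attributes" means, the following base R sketch attaches a variable label and value labels to a column by hand and reads them back. The attribute names label and labels follow the convention used by haven; the data frame, the item wording, and the -99 missing code are made up for this example.

```r
# A made-up survey item, coded 1-5, with -99 standing in for "no answer"
survey <- data.frame(satisfaction = c(1, 4, 5, -99, 3))

# Variable label: a human-readable description of the column
attr(survey$satisfaction, "label") <- "How satisfied are you with your job?"

# Value labels: a named vector mapping each label to its stored code
attr(survey$satisfaction, "labels") <- c(
  "not at all" = 1,
  "completely" = 5,
  "no answer" = -99
)

# Read the metadata back from the attributes
var_label <- attr(survey$satisfaction, "label")
value_labels <- attr(survey$satisfaction, "labels")

print(var_label)
print(names(value_labels)[value_labels == -99])
```

Files imported with haven or rio typically arrive with attributes like these already set, so no manual labelling is needed in that case.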

RStudio Addin

If the RStudio data viewer scrolls too slowly for your taste, or you’d like to keep the variable labels in view while working, use our RStudio Addins (ideally assigned to a keyboard shortcut) to see and search variable and value labels in the viewer pane.

Codebook generation

The codebook package takes those attributes and the data and tries to produce a good-looking codebook, i.e. a place to get an overview of the variables in a dataset. The codebook processes single items, but also “scales”, i.e. psychological questionnaires that are aggregated to extract a construct. For scales, the appropriate reliability coefficients (internal consistencies for single measurements, retest reliabilities for repeated measurements, multilevel reliability for multilevel data) are computed. For items and scales, the distributions are summarised graphically and numerically.
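To make "internal consistency" concrete, here is a minimal base R sketch of Cronbach's alpha, one common internal consistency coefficient. It is not the package's own implementation: the helper cronbach_alpha and the example items are hypothetical, and incomplete rows are simply dropped.

```r
# Hypothetical helper: Cronbach's alpha for a data frame of numeric items
cronbach_alpha <- function(items) {
  items <- stats::na.omit(items)        # listwise deletion of incomplete rows
  k <- ncol(items)                      # number of items in the scale
  item_variances <- vapply(items, stats::var, numeric(1))
  total_variance <- stats::var(rowSums(items))  # variance of the sum score
  (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
}

# Made-up responses of six people to three items of one scale
items <- data.frame(
  extra_1 = c(1, 2, 3, 4, 5, 3),
  extra_2 = c(2, 2, 3, 5, 5, 3),
  extra_3 = c(1, 3, 3, 4, 4, 2)
)

alpha <- cronbach_alpha(items)
print(alpha)
```

Retest and multilevel reliabilities need repeated or nested measurements and are not covered by this sketch.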

This package integrates tightly with formr, an online survey framework, and especially with the data frames produced and marked up by the formr R package. However, codebook is completely independent of it.


Consult the help, see the vignette for a quick example of an HTML document generated using codebook, or see below for a copy-pastable rmarkdown document to get you started.

Use as a webapp

If you don’t want to install the codebook package, you can just upload an annotated dataset in a variety of formats (R, SPSS, Stata, …) here:

Use locally


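Install the released version from CRAN by running install.packages("codebook") in R.

Or, to get the latest development version, install it from GitHub, e.g. with remotes::install_github("rubenarslan/codebook").

Then run library(codebook) followed by new_codebook_rmd() to get started; the latter creates a codebook template file in your working directory.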



To cite the package, you can cite the preprint, but to make your codebook traceable to the version of the package you used, you might also want to cite the archived package DOI.


from study metadata. doi:10.31234/


Arslan, R. C. (2018). Automatic codebooks from survey metadata.

How to use

Here’s a simple rmarkdown template that you could use to get started. The resulting codebook will be an HTML file, but you can also generate PDF or Word files by changing the output settings.

---
title: "Codebook"
output:
  html_document:
    toc: true
    toc_depth: 4
    toc_float: true
    code_folding: 'hide'
    self_contained: true
  pdf_document:
    toc: yes
    toc_depth: 4
    latex_engine: xelatex
---

```{r setup}
knitr::opts_chunk$set(
  warning = TRUE, # show warnings during codebook generation
  message = TRUE, # show messages during codebook generation
  error = TRUE, # do not interrupt codebook generation in case of errors,
                # usually makes debugging easier, and sometimes half a codebook
                # is better than none
  echo = FALSE  # don't show the R code
)
pander::panderOptions("table.split.table", Inf)
```

Here, we import data from formr

```{r}
library(formr) # provides formr_connect() and formr_results()
# `credentials` must be a list with `email` and `password`, defined beforehand
formr_connect(email = credentials$email, password = credentials$password)
codebook_data <- formr_results("s3_daily")
```

But we can also import data from e.g. an SPSS file.

```{r}
codebook_data <- rio::import("s3_daily.sav")
```

Sometimes, the metadata is not set up in such a way that codebook
can leverage it fully. These functions help fix this.

```{r codebook}
library(codebook) # load the package
# omit the following lines, if your missing values are already properly labelled
codebook_data <- detect_missing(codebook_data,
    only_labelled = TRUE, # only labelled values are autodetected as
                          # missing
    negative_values_are_missing = FALSE, # negative values are NOT missing values
    ninety_nine_problems = TRUE   # 99/999 are missing values, if they
                                  # are more than 5 MAD from the median
    )
# If you are not using formr, the codebook package needs to guess which items
# form a scale. The following line finds item aggregates with names like this:
# scale = scale_1 + scale_2R + scale_3R
# identifying these aggregates allows the codebook function to
# automatically compute reliabilities.
# However, it will not reverse items automatically.
codebook_data <- detect_scales(codebook_data)
```

Now, generating a codebook is as simple as calling codebook from a chunk in an
rmarkdown document.

```{r}
codebook(codebook_data)
```
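
The naming convention that the template's comments describe (scale = scale_1 + scale_2R + scale_3R) can be sketched in base R as follows. This is only an illustration of the convention, not the code behind detect_scales; the helper find_scale_items and the example data are hypothetical.

```r
# Hypothetical helper: for every column, look for other columns named
# <column>_<number>, optionally followed by "R" (reverse-keyed item)
find_scale_items <- function(data) {
  all_names <- names(data)
  scales <- list()
  for (candidate in all_names) {
    prefix <- paste0(candidate, "_")
    has_prefix <- startsWith(all_names, prefix)
    # What remains after the prefix must be a number plus an optional "R"
    suffix <- substring(all_names[has_prefix], nchar(prefix) + 1)
    items <- all_names[has_prefix][grepl("^[0-9]+R?$", suffix)]
    if (length(items) >= 2) scales[[candidate]] <- items
  }
  scales
}

# Made-up data: three items, their aggregate, and an unrelated variable
example_data <- data.frame(
  extra_1 = c(1, 2, 3),
  extra_2R = c(3, 2, 1),
  extra_3R = c(2, 2, 1),
  extra = c(2, 2, 5 / 3),
  age = c(31, 45, 27)
)

found <- find_scale_items(example_data)
print(found)
```

As in the template, nothing here reverses items; columns ending in R are assumed to have been reversed already.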

Code of conduct for contributing


codebook 0.8.0


  • removed three vignettes
  • calculate reliability using userfriendlyscience instead of Cronbach's Alpha and correlations
  • make it easier to generate compact codebooks
  • hide machine-readable metadata in details tags (toggle to view)
  • plot number of characters for character variables
  • update explanations in web app slightly
  • reduce survey-specific language


  • make it less likely that unique/private values are disclosed (e.g., free text)

codebook 0.7.6


  • changed vignette titles (one was duplicated)

codebook 0.7.5


  • import/export knit_print generic from knitr

codebook 0.7.4


  • Function new_codebook_rmd creates a new file in your working directory with a codebook template.
  • Function metadata can be used to set dataset-level metadata before rendering a codebook (valid attributes will carry over to JSON-LD representation)
  • Compliance with Google Dataset Search, see examples


  • removed zap_label because haven 2.0.0 has this function
  • added several functions to add JSON-LD compliant metadata and to show it in the codebook
  • removed some non-standard attributes from the JSON-LD metadata so that datasets will be indexed in Google Dataset Search
  • work with haven 2.0.0's changed class names
  • play nice with userfriendlyscience::makeScales attributes
  • improved binning and wrapping in plot_labelled
  • removed the mice dependency to reduce the number of dependencies


  • detect_missing reset variable label with the new haven version (only between and 0.7.0, never on CRAN)
  • reverse_labelled_values mislabelled values, if there were labelled missing values (numbers were correct)

codebook 0.6.3


  • Vignettes for
    • documenting the expected attribute structure, how to add metadata in R
    • importing metadata from SPSS or Stata files
    • importing metadata from Qualtrics as made available by qualtRics package
  • Importing some functions from labelled package to add metadata
  • Default method for haven::as_factor when labelled class is absent


  • Changed the scale summary, so that Likert plots and distributions are shown on the first tab. Reliability now hidden under "Reliability details".
  • removed unnecessary readr dependency.


  • summarising factors in a table
  • turning off components of the codebook without empty strings being echoed
  • allow using variable and value labels in the absence of the labelled class (as imported by rio for example)

codebook 0.6.2


  • Three RStudio Addin Shinyapps to browse variable labels and codebook.

Bug fix

  • Specify a mice dependency that doesn't break degenerate test cases.

codebook 0.5.9


  • plot_labelled now makes better plots for numeric variables
  • codebook generation has been parallelised using the future package. By calling e.g. plan(multicore(workers = 4)) before the codebook function, the computation of reliabilities and the generation of scale and item summaries will happen in parallel. For this to work with plots, you have to choose a graphics device in knitr that supports parallelisation, by calling e.g. opts_chunk$set(dev = "CairoPNG").
  • for variables that store several multiple-choice values comma-separated, we now separate the values before plotting, if the item attribute attributes(item)$item$type contains "multiple"
  • make it easier to trace which variable in a dataset cannot be summarised
  • added and documented aggregate_and_document_scale for people who don't import data via formr and want reliabilities to be calculated automatically
  • use rio to import all kinds of file formats in the webapp

Bug fixes

  • fix bugs in plot_labelled
  • fix bugs when variables are entirely missing
  • escape HTML in various labels, use safe names for anchors, figures
  • reliability functions no longer garble names
  • require skimr >= 1.0.2 and ggplot2 >= 2.0.0

codebook 0.5.8

  • don't write files into anything but tempdir

codebook 0.5.7

  • changed description and documentation

codebook 0.5.6

  • changed license to MIT

codebook 0.5.5

  • improved documentation
  • more tests

codebook 0.4.4

  • wrote some tests
  • tried to please goodpractice::gp()
  • removed some cruft
