Manipulate and Explore UK Biobank Data

A set of tools to create a UK Biobank dataset from a UKB fileset (.tab, .r, .html), visualize primary demographic data for a sample subset, query ICD diagnoses, retrieve genetic metadata, and read and write standard file formats for genetic analyses.


After downloading and decrypting your UK Biobank (UKB) data with the supplied UKB programs, you have multiple files that need to be brought together to give you a dataset to explore. The data file has column names that are edited field-codes from the UKB data showcase. ukbtools makes it easy to collapse the multiple UKB files into a single dataset for analysis, in the process giving meaningful names to the variables. The package also includes functionality to retrieve ICD diagnoses, explore a sample subset in the context of the UKB sample, and collect genetic metadata.


# Install from CRAN
install.packages("ukbtools")

# Install latest development version
devtools::install_github("kenhanscombe/ukbtools", dependencies = TRUE)

Prerequisite: Make a UKB fileset

Download§ then decrypt your data and create a "UKB fileset" (.tab, .r, .html):

ukb_unpack ukbxxxx.enc key
ukb_conv ukbxxxx.enc_ukb r
ukb_conv ukbxxxx.enc_ukb docs

ukb_unpack decrypts your downloaded ukbxxxx.enc file, outputting a ukbxxxx.enc_ukb file. ukb_conv with the r flag converts the decrypted data to a tab-delimited file and an R script ukbxxxx.r that reads the tab file. The docs flag creates an html file containing a field-code-to-description table (among others).

§ Full details of the data download and decrypt process are given in the Using UK Biobank Data documentation.
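Before building a dataset, it can help to confirm that all three files of the fileset sit together in one directory. The following is a minimal base-R sketch; `missing_fileset_files` is a hypothetical helper written for illustration and is not part of ukbtools.

```r
# Hypothetical helper (not part of ukbtools): return the paths of any
# fileset files (.tab, .r, .html) that are missing for a given stem.
missing_fileset_files <- function(stem, path = ".") {
  files <- file.path(path, paste0(stem, c(".tab", ".r", ".html")))
  files[!file.exists(files)]
}

# Usage: an empty result means the fileset is complete.
missing <- missing_fileset_files("ukbxxxx", path = ".")
if (length(missing) > 0) {
  message("Missing fileset files: ", paste(basename(missing), collapse = ", "))
}
```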

Make a UKB dataset

The function ukb_df() takes two arguments, the stem of your fileset and (optionally) the path to it, and returns a dataframe with usable column names. This will take a few minutes. The rate-limiting step is reading and parsing the code in the UKB-generated .r file, not ukb_df() itself.

my_ukb_data <- ukb_df("ukbxxxx")

You can also specify the path to your fileset if it is not in the current directory. For example, if your fileset is in a directory called data, pass the path to that directory:

my_ukb_data <- ukb_df("ukbxxxx", path = "/full/path/to/my/data")

Note: You can move the three files in your fileset after creating them with ukb_conv, but they should be kept together. ukb_df() automatically updates the read call in the R source file to point to the correct directory (the current directory by default, or a directory specified by path).
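To illustrate the kind of path update described in the note, here is a rough base-R sketch. It assumes the UKB-generated .r file contains a call of the form `read.table("<some path>/ukbxxxx.tab", ...)`. Both that assumption and the helper `update_tab_path` are illustrative; this is not the code ukb_df() actually runs.

```r
# Hypothetical helper (not ukbtools code): rewrite the quoted .tab path
# inside a read.table( call so that it points at `path`.
# Assumes `path` uses forward slashes and `stem` has no regex metacharacters.
update_tab_path <- function(r_lines, stem, path = ".") {
  new_file <- file.path(path, paste0(stem, ".tab"))
  # Anchor the match on 'read.table("' so the stem is not matched elsewhere
  pattern <- paste0('read\\.table\\("[^"]*', stem, '\\.tab"')
  sub(pattern, paste0('read.table("', new_file, '"'), r_lines)
}

# Usage on two example lines of a source file
src <- c(
  'bd <- read.table("/old/dir/ukbxxxx.tab", header=TRUE, sep="\\t")',
  'lvl.0001 <- c(0, 1)'
)
updated <- update_tab_path(src, "ukbxxxx", path = "/new/dir")
```

Anchoring the pattern on the read call rather than on the stem alone reduces the chance of matching the same word elsewhere in the source file.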

Other tools

All tools are described on the ukbtools webpage and in the package vignette "Explore UK Biobank Data":

vignette("explore-ukb-data", package = "ukbtools")

For a list of all functions:

help(package = "ukbtools")



Corrected functionality:

Fixed an earlier change that made ukb_df incorrectly convert all column types to character (caused by replacing stringr::str_interp with stringr::str_c when passing the internal column type vector to data.table::fread, without updating the argument).

Note: The correction is in the development version and will be uploaded to CRAN as soon as possible.

ukbtools 0.11.1

Test data:

Added example UKB data, ukbXXXX.r, and ukbXXXX.html to test the 'read' and 'summarise' functionality of ukb_df, ukb_df_field, and ukb_context. See the section "An example fileset" in the vignette for details.

Updated functionality:

ukb_icd_freq_by with freq.plot = TRUE plots a barplot for categorical reference variables, and plots diagnosis frequencies at the midpoint of each group for quantitative reference variables.


The ukbtools webpage has been rebuilt with pkgdown and includes the vignette under the Articles tab.

ukbtools 0.11.0

Updated functionality:

  • ukb_df: Replaced readr::read_tsv with data.table::fread for a faster read. Also includes an n_threads argument passed to data.table::fread, which may make the read faster. Column names now include the field code to ensure names are unique (UK Biobank sometimes uses the same description for more than one variable).
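As a rough illustration of why including the field code keeps names unique, the sketch below builds a column name from a description, field code, index, and array. The helper `make_ukb_name` and the exact naming pattern (`<description>_f<field>_<index>_<array>`) are assumptions made for this example, not the package's internal code.

```r
# Hypothetical helper (not ukbtools code): build a column name from a
# variable description plus field code, index, and array.
make_ukb_name <- function(description, field, index, array) {
  prefix <- tolower(description)
  prefix <- gsub("[^a-z0-9]+", "_", prefix)  # collapse non-alphanumerics
  prefix <- gsub("^_|_$", "", prefix)        # trim leading/trailing "_"
  paste0(prefix, "_f", field, "_", index, "_", array)
}

# Two variables sharing a description still get distinct names,
# because their field codes differ (the codes here are made up).
same_description <- make_ukb_name(
  c("Example description", "Example description"),
  field = c(1001L, 1002L), index = 0L, array = 0L
)
```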

Defunct functionality:

  • Added defunct message to ukb_gen_meta, ukb_gen_pcs, ukb_gen_excl, ukb_gen_rel, ukb_gen_het, ukb_gen_excl_to_na, and ukb_gen_write_plink_excl. ukb_defunct explains why these have become defunct and where to get UK Biobank genetic (meta)data.

New functionality:

  • Since the UKB changed the way they serve up genetic metadata, the following work with files described in UKB Resource 531: ukb_gen_sqc_names supplies column names for the separately downloaded sample QC file; ukb_gen_rel_count does the same as before (a count of levels of relatedness or a plot) but with separately downloaded relatedness data; ukb_gen_related_with_data returns subset of relatedness data in which both IDs have data on a phenotype of interest; ukb_gen_samples_to_remove returns a list of individuals to exclude in order to remove relatedness (one possible solution to a maximal subset problem).
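To make the "maximal subset" idea concrete, here is a small greedy sketch in base R: it repeatedly drops the individual who appears in the most remaining related pairs until no related pair is left. This is one simple heuristic shown under stated assumptions. The function `samples_to_remove_greedy` is hypothetical and is not necessarily how ukb_gen_samples_to_remove works.

```r
# Hypothetical greedy heuristic (not ukbtools code).
# id1, id2: vectors of equal length, each position describing one related pair.
samples_to_remove_greedy <- function(id1, id2) {
  id1 <- as.character(id1)
  id2 <- as.character(id2)
  removed <- character(0)
  unresolved <- rep(TRUE, length(id1))  # pairs with both members still kept
  while (any(unresolved)) {
    # Count how many unresolved pairs each individual appears in
    counts <- table(c(id1[unresolved], id2[unresolved]))
    worst <- names(counts)[which.max(counts)]
    removed <- c(removed, worst)
    # A pair is resolved once either of its members has been removed
    unresolved <- unresolved & id1 != worst & id2 != worst
  }
  removed
}

# Usage: A is related to B and C, and D is related to E
to_remove <- samples_to_remove_greedy(c("A", "A", "D"), c("B", "C", "E"))
```

A greedy choice like this is not guaranteed to keep the largest possible unrelated subset, but it is easy to reason about.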

ukbtools 0.10.1

Bug fix:

  • ukb_icd_freq_by: corrected order by levels of reference.var in the optional plot. (order in the default dataframe returned was correct.)

  • ukb_df: corrected the tab file path update in the r source file. Specifically, made the regular expression more specific (one case was reported of the regular expression matching a word elsewhere in the source file). Also, replaced utils::read.delim with readr::read_tsv for a faster read, with progress bar.

ukbtools 0.10.0

New functionality:

  • ukb_icd_freq_by returns frequency for one or more ICD diagnoses by levels of a reference variable and includes an optional plot

  • ukb_df_full_join (a thin wrapper around dplyr::full_join) recursively called on a list of UKB datasets

  • ukb_df_duplicated_names to identify duplicated names within a dataset. The variable prefix (constructed from its description), index, and array should make the column name unique. However, typos in UKB documentation that give two variables the same description and do not increment the index/array have been observed. You will want to identify these and update them appropriately. We expect the occurrence of such duplicates will be rare.
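The two dataset helpers above can be approximated in base R. The sketch below is illustrative only: `full_join_all` and `duplicated_names` are hypothetical stand-ins, and it uses base `merge(all = TRUE)` rather than dplyr::full_join. It also assumes the shared ID column is named eid.

```r
# Hypothetical stand-in for a recursive full join over a list of datasets
full_join_all <- function(dfs, by = "eid") {
  Reduce(function(x, y) merge(x, y, by = by, all = TRUE), dfs)
}

# Hypothetical stand-in for finding column names that occur more than once
duplicated_names <- function(df) {
  nm <- names(df)
  unique(nm[duplicated(nm)])
}

# Usage: join two small example datasets on eid
a <- data.frame(eid = 1:3, x = c(10, 20, 30))
b <- data.frame(eid = 2:4, y = c("b", "c", "d"))
joined <- full_join_all(list(a, b))

# Usage: a dataset in which two columns ended up with the same name
d <- data.frame(1, 2, 3)
names(d) <- c("eid", "sex_f31_0_0", "sex_f31_0_0")
dups <- duplicated_names(d)
```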

Updated functionality:

  • ukb_icd_diagnosis now takes one or more individual ids and returns a dataframe with a potential message noting ids with no diagnoses

  • ukb_icd_keyword accepts a character vector of one or more "keywords" and returns all ICD descriptions including any of the keywords
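Matching any of several keywords against a set of descriptions can be sketched with a single alternation pattern. The function `icd_keyword_sketch` and the three example descriptions are illustrative assumptions. The real ukb_icd_keyword searches the ICD code tables shipped with the package, and its signature may differ.

```r
# Hypothetical sketch (not ukbtools code): return descriptions that
# contain any of the supplied keywords.
# Keywords are treated as regular expressions.
icd_keyword_sketch <- function(descriptions, keywords, ignore.case = TRUE) {
  pattern <- paste(keywords, collapse = "|")
  descriptions[grepl(pattern, descriptions, ignore.case = ignore.case)]
}

# Usage with a tiny, made-up lookup of descriptions
descriptions <- c(
  "Non-insulin-dependent diabetes mellitus",
  "Essential (primary) hypertension",
  "Asthma"
)
hits <- icd_keyword_sketch(descriptions, c("diabetes", "asthma"))
```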

ukbtools 0.9.0

  • beta release to CRAN. Feature complete but may contain unknown bugs.



0.11.3 by Ken Hanscombe, a year ago


Authors: Ken Hanscombe [aut, cre]


GPL-2 license

Imports data.table, dplyr, purrr, readr, ggplot2, XML, magrittr, grid, tibble, tidyr, scales, stringr, foreach, parallel, doParallel

Suggests knitr, rmarkdown
