Manipulate and Explore UK Biobank Data

A set of tools to create a UK Biobank <http://www.ukbiobank.ac.uk/> dataset from a UKB fileset (.tab, .r, .html), visualize primary demographic data for a sample subset, query ICD diagnoses, retrieve genetic metadata, and read and write standard file formats for genetic analyses.



After downloading and decrypting your UK Biobank (UKB) data with the supplied UKB programs, you have multiple files that need to be brought together to give you a dataset to explore. The data file has column names that are edited field-codes from the UKB data showcase. ukbtools makes it easy to collapse the multiple UKB files into a single dataset for analysis, in the process giving meaningful names to the variables. The package also includes functionality to retrieve ICD diagnoses, explore a sample subset in the context of the UKB sample, and collect genetic metadata.

Installation

 
# Install from CRAN
install.packages("ukbtools")
 
# Install latest development version
devtools::install_github("kenhanscombe/ukbtools", dependencies = TRUE)
 

Prerequisite: Make a UKB fileset

Download§ then decrypt your data and create a "UKB fileset" (.tab, .r, .html):

ukb_unpack ukbxxxx.enc key
ukb_conv ukbxxxx.enc_ukb r
ukb_conv ukbxxxx.enc_ukb docs
 

ukb_unpack decrypts your downloaded ukbxxxx.enc file, outputting a ukbxxxx.enc_ukb file. ukb_conv with the r flag converts the decrypted data to a tab-delimited file ukbxxxx.tab and an R script ukbxxxx.r that reads the tab file. The docs flag creates an html file containing a field-code-to-description table (among others).

§ Full details of the data download and decrypt process are given in the Using UK Biobank Data documentation.

Make a UKB dataset

The function ukb_df() takes two arguments, the stem of your fileset and the path, and returns a dataframe with usable column names. This will take a few minutes. The rate-limiting step is reading and parsing the code in the UKB-generated .r file - not ukb_df per se.

 
library(ukbtools)
 
my_ukb_data <- ukb_df("ukbxxxx")
 

You can also specify the path to your fileset if it is not in the current directory. For example, if your fileset is in a directory called data, pass the path to that directory:

 
my_ukb_data <- ukb_df("ukbxxxx", path = "/full/path/to/my/data")
 

Note: You can move the three files in your fileset after creating them with ukb_conv, but they should be kept together. ukb_df() automatically updates the read call in the R source file to point to the correct directory (the current directory by default, or a directory specified by path).
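To make the note above concrete, here is a minimal base R sketch of what repointing the read call can look like. It is not the ukbtools implementation: the helper update_tab_path is hypothetical, and the form of the read line in the UKB-generated .r file is an assumption made for illustration.

```r
# Assumed (not copied from a real fileset) contents of a UKB-generated .r
# file: one read call followed by recoding statements.
r_source <- c(
  'bd <- read.table("/old/location/ukbxxxx.tab", header=TRUE, sep="\\t")',
  'lvl.0009 <- c(0,1)'
)

# Hypothetical helper: point the read call at the .tab file inside `path`.
update_tab_path <- function(lines, stem, path = ".") {
  new_tab <- file.path(path, paste0(stem, ".tab"))
  # Match only a quoted string that ends in "<stem>.tab", so that other
  # lines mentioning the stem are left alone.
  pattern <- paste0('"[^"]*', stem, '\\.tab"')
  sub(pattern, paste0('"', new_tab, '"'), lines)
}

updated <- update_tab_path(r_source, "ukbxxxx", path = "/full/path/to/my/data")
updated[1]
```

Anchoring the pattern on the quoted file name, rather than on the stem alone, is one way to avoid rewriting unrelated lines in the source file.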

Other tools

All tools are described on the ukbtools webpage and in the package vignette "Explore UK Biobank Data":

 
vignette("explore-ukb-data", package = "ukbtools")
 

For a list of all functions:

 
help(package = "ukbtools")
 

News

ukbtools 0.11.2.9000

Corrected functionality:

Fixed an earlier change that made ukb_df incorrectly convert all column types to character (caused by replacing stringr::str_interp with stringr::str_c when passing the internal column type vector to data.table::fread, without updating the argument).

Note. Correction is in development version 0.11.2.9000 - will upload to CRAN ASAP.

ukbtools 0.11.1

Test data:

Added example UKB data ukbXXXX.tab, ukbXXXX.r, ukbXXXX.html to test the 'read' and 'summarise' functionality ukb_df, ukb_df_field, and ukb_context. See the section "An example fileset" in the vignette for details.

Updated functionality:

ukb_icd_freq_by with freq.plot = TRUE plots a barplot for categorical reference variables, and plots diagnosis frequencies at the midpoint of each group for quantitative reference variables.

Webpage:

The ukbtools webpage has been rebuilt with pkgdown and includes the vignette under the Articles tab.

ukbtools 0.11.0

Updated functionality:

  • ukb_df: Replaced readr::read_tsv with data.table::fread for a faster read. Also includes an n_threads argument passed to data.table::fread, which may make the read faster. Column names now include the field code to ensure names are unique (UK Biobank sometimes uses the same description for more than one variable).

Defunct functionality:

  • Added defunct message to ukb_gen_meta, ukb_gen_pcs, ukb_gen_excl, ukb_gen_rel, ukb_gen_het, ukb_gen_excl_to_na, and ukb_gen_write_plink_excl. ukb_defunct explains why these have become defunct and where to get UK Biobank genetic (meta)data.

New functionality:

  • Since the UKB changed the way they serve up genetic metadata, the following work with files described in UKB Resource 531: ukb_gen_sqc_names supplies column names for the separately downloaded sample QC file; ukb_gen_rel_count does the same as before (a count of levels of relatedness or a plot) but with separately downloaded relatedness data; ukb_gen_related_with_data returns the subset of relatedness data in which both IDs have data on a phenotype of interest; ukb_gen_samples_to_remove returns a list of individuals to exclude in order to remove relatedness (one possible solution to a maximal subset problem).
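The idea behind keeping only related pairs in which both individuals have phenotype data can be sketched in base R as follows. This is an illustration, not the code of ukb_gen_related_with_data: the helper name related_with_data, the toy values, and the column names ID1, ID2, Kinship, eid, and bmi are all assumptions.

```r
# Toy relatedness table: one row per related pair (column names assumed).
rel <- data.frame(
  ID1 = c(1, 2, 3, 5),
  ID2 = c(2, 4, 6, 6),
  Kinship = c(0.25, 0.12, 0.06, 0.24)
)

# Toy phenotype data: individual 3 has no value for the trait.
pheno <- data.frame(eid = 1:6, bmi = c(24.1, 30.2, NA, 27.5, 22.8, 26.0))

# Hypothetical helper: keep pairs in which both IDs have a non-missing trait.
related_with_data <- function(rel, data, id_col, trait_col) {
  ids_with_data <- data[[id_col]][!is.na(data[[trait_col]])]
  keep <- rel$ID1 %in% ids_with_data & rel$ID2 %in% ids_with_data
  rel[keep, , drop = FALSE]
}

rel_bmi <- related_with_data(rel, pheno, "eid", "bmi")
rel_bmi
```

In this toy example the pair (3, 6) is dropped because individual 3 has a missing value, and the other three pairs are kept.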

ukbtools 0.10.1

Bug fix:

  • ukb_icd_freq_by: corrected the ordering by levels of reference.var in the optional plot. (The order in the dataframe returned by default was already correct.)

  • ukb_df: corrected the tab file path update in the r source file. Specifically, made the regular expression more specific (one case was reported of the regular expression matching a word elsewhere in the source file). Also replaced utils::read.delim with readr::read_tsv for a faster read, with a progress bar.

ukbtools 0.10.0

New functionality:

  • ukb_icd_freq_by returns frequency for one or more ICD diagnoses by levels of a reference variable and includes an optional plot

  • ukb_df_full_join is a thin wrapper around dplyr::full_join, called recursively on a list of UKB datasets

  • ukb_df_duplicated_names identifies duplicated names within a dataset. The variable prefix (constructed from its description), index, and array should make the column name unique. However, typos in UKB documentation that give two variables the same description but do not increment the index/array have been observed. You will want to identify these and update them appropriately. We expect such duplicates to be rare.
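Two of the ideas in this list, a full join applied across a list of datasets and a check for duplicated column names, can be sketched in base R. This is an illustration under stated assumptions, not the package code: it swaps in merge(all = TRUE) for dplyr::full_join, and the helper full_join_all, the shared id column eid, and all column names and values are made up.

```r
# Toy datasets standing in for UKB datasets that share an id column `eid`.
a  <- data.frame(eid = c(1, 2, 3), height = c(170, 165, 180))
b  <- data.frame(eid = c(2, 3, 4), weight = c(60, 85, 72))
c3 <- data.frame(eid = c(1, 4, 5), age = c(50, 61, 47))

# Hypothetical helper: apply a full (outer) join successively across a list,
# keeping every id that appears in any dataset.
full_join_all <- function(datasets, by = "eid") {
  Reduce(function(x, y) merge(x, y, by = by, all = TRUE), datasets)
}

joined <- full_join_all(list(a, b, c3))
joined

# Duplicated column names: report each name that occurs more than once.
col_names <- c("eid", "sex_0_0", "pulse_rate_0_0", "pulse_rate_0_0")
dup_names <- unique(col_names[duplicated(col_names)])
dup_names
```

Ids missing from a given dataset get NA in that dataset's columns, which is the behaviour expected of a full join.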

Updated functionality:

  • ukb_icd_diagnosis now takes one or more individual ids and returns a dataframe with a potential message noting ids with no diagnoses

  • ukb_icd_keyword accepts a character vector of one or more "keywords" and returns all ICD descriptions including any of the keywords
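Matching descriptions against any of several keywords can be sketched in base R with a single alternation pattern. This illustrates the idea only and is not the implementation of ukb_icd_keyword: the helper match_keywords is hypothetical and the descriptions are illustrative strings, not taken from the UKB coding files.

```r
# Illustrative diagnosis descriptions (made up for this sketch).
icd_descriptions <- c(
  "Essential (primary) hypertension",
  "Hypertensive heart disease",
  "Type 2 diabetes mellitus",
  "Asthma"
)

# Hypothetical helper: return descriptions containing any of the keywords,
# ignoring case. Keywords are treated as regular expressions.
match_keywords <- function(descriptions, keywords) {
  pattern <- paste(keywords, collapse = "|")
  descriptions[grepl(pattern, descriptions, ignore.case = TRUE)]
}

hits <- match_keywords(icd_descriptions, c("hypertens", "diabetes"))
hits
```

Using a word stem such as "hypertens" catches both "hypertension" and "hypertensive" in one search.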

ukbtools 0.9.0

  • beta release to CRAN. Feature complete but may contain unknown bugs.


0.11.3 by Ken Hanscombe, 5 months ago


https://kenhanscombe.github.io/ukbtools/


Browse source code at https://github.com/cran/ukbtools


Authors: Ken Hanscombe [aut, cre]




GPL-2 license


Imports data.table, dplyr, purrr, readr, ggplot2, XML, magrittr, grid, tibble, tidyr, scales, stringr, foreach, parallel, doParallel

Suggests knitr, rmarkdown

