Genome Interval Arithmetic in R

Read and manipulate genome intervals and signals. Provides functionality similar to command-line tool suites within R, enabling interactive analysis and visualization of genome-scale data.


valr provides tools to read and manipulate genome intervals and signals, similar to the BEDtools suite. valr enables analysis in the R/RStudio environment, leveraging modern R tools for a terse, expressive syntax. Compute-intensive algorithms are implemented in Rcpp/C++, and many methods take advantage of the speed and grouping capability provided by dplyr.

The latest stable version can be installed from CRAN:

install.packages('valr')

The latest development version can be installed from GitHub:

devtools::install_github('jayhesselberth/valr')

Why another tool set for interval manipulations? Based on our experience teaching genome analysis, we were motivated to develop interval arithmetic software that facilitates genome analysis in a single environment (RStudio), eliminating the need to master both command-line and exploratory analysis tools.

valr can currently be used for analysis of pre-processed data in BED and related formats. We plan to support BAM and VCF files soon via tabix indexes.

The functions in valr have similar names to their BEDtools counterparts, and so will be familiar to users coming from the BEDtools suite. Similar to pybedtools, valr has a terse syntax:

library(valr)
library(dplyr)
 
snps <- read_bed(valr_example('hg19.snps147.chr22.bed.gz'), n_fields = 6)
genes <- read_bed(valr_example('genes.hg19.chr22.bed.gz'), n_fields = 6)
 
# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# find distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)
 
nearby %>%
  select(starts_with('name'), .overlap, .dist) %>%
  filter(abs(.dist) < 5000)

Remote databases can be accessed with db_ucsc() (to access the UCSC Browser) and db_ensembl() (to access Ensembl databases).

# access the `refGene` tbl on the `hg38` assembly
ucsc <- db_ucsc('hg38')
tbl(ucsc, 'refGene')

valr includes helpful glyphs to illustrate the results of specific operations, similar to those found in the BEDtools documentation. For example, bed_glyph() can be used to illustrate the result of intersecting x and y intervals with bed_intersect():

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  'chr1', 25,     50,
  'chr1', 100,    125
)
 
y <- tibble::tribble(
  ~chrom, ~start, ~end,
  'chr1', 30,     75
)
 
bed_glyph(bed_intersect(x, y))
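
The overlap that this glyph depicts can also be worked out by hand. The following is a minimal base-R sketch, not valr's implementation: the helper overlap_bp() is hypothetical and assumes BED-style half-open coordinates, where an interval covers positions from start up to but not including end.

```r
# Hypothetical helper (not part of valr): number of positions shared by two
# half-open intervals [start_x, end_x) and [start_y, end_y).
# Returns 0 when the intervals do not overlap.
overlap_bp <- function(start_x, end_x, start_y, end_y) {
  pmax(0, pmin(end_x, end_y) - pmax(start_x, start_y))
}

# The two x intervals from the example above, compared with the single y interval
x_start <- c(25, 100)
x_end   <- c(50, 125)

shared <- overlap_bp(x_start, x_end, 30, 75)
shared
```

Only the first x interval (25 to 50) overlaps y (30 to 75), sharing the 20 positions from 30 to 50; the second x interval starts after y ends, so its overlap is 0.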

valr can be used in RMarkdown documents to generate reproducible workflows for data processing. Because valr is reasonably fast, it can be used for exploratory analysis with RMarkdown and for interactive analysis using shiny.

Function names are similar to their BEDtools counterparts, with some additions.

  • BED and related files are read with read_bed(), read_bed12(), read_bedgraph(), read_narrowpeak() and read_broadpeak().

  • Genome files containing chromosome name and size information are loaded with read_genome().

  • VCF files are loaded with read_vcf().

  • Remote databases can be accessed with db_ucsc() and db_ensembl().

  • Intervals are ordered with bed_sort().

  • Interval coordinates are adjusted with bed_slop() and bed_shift(), and new flanking intervals are created with bed_flank().

  • Nearby intervals are combined with bed_merge() and identified (but not merged) with bed_cluster().

  • Intervals not covered by a query are created with bed_complement().

  • Find overlaps between two sets of intervals with bed_intersect().

  • Apply functions to selected columns for overlapping intervals with bed_map().

  • Remove intervals based on overlaps between two files with bed_subtract().

  • Find overlapping intervals within a window with bed_window().

  • Find the closest intervals independent of overlaps with bed_closest().

  • Generate random intervals from an input genome with bed_random().

  • Shuffle the coordinates of input intervals with bed_shuffle().

  • Random sampling of input intervals is done with the sample_ function family in dplyr.

  • Calculate significance of overlaps between two sets of intervals with bed_fisher() and bed_projection().

  • Quantify relative and absolute distances between sets of intervals with bed_reldist() and bed_absdist().

  • Quantify extent of overlap between two sets of intervals with bed_jaccard().

  • Visualize the actions of valr functions with bed_glyph().

  • Constrain intervals to a genome reference with bound_intervals().

  • Subdivide intervals with bed_makewindows().

  • Convert BED12 to BED6 format with bed12_to_exons().

  • Calculate spacing between intervals with interval_spacing().
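
As a small illustration of how two of the functions listed above fit together, the sketch below sorts a toy set of intervals and then merges the overlapping ones. It assumes valr is installed and relies only on the default arguments of bed_sort() and bed_merge(); the interval values are made up for the example.

```r
library(valr)

# Toy intervals (illustrative values only), deliberately unsorted
x <- tibble::tribble(
  ~chrom, ~start, ~end,
  'chr1', 100,    200,
  'chr1', 1,      50,
  'chr1', 150,    250
)

# order intervals by chromosome and start position
sorted <- bed_sort(x)

# combine overlapping intervals into single intervals
merged <- bed_merge(sorted)
merged
```

Here the intervals 100 to 200 and 150 to 250 overlap and are expected to collapse into one interval, while 1 to 50 is left as it is.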

News

valr 0.1.1

  • test / vignette guards for Suggested RMySQL

  • fixed memory leak in absdist.cpp

  • fixed vignette entry names

valr 0.1.0

  • initial release on CRAN


0.1.2 by Jay Hesselberth, 12 days ago


http://github.com/rnabioco/valr/


Report a bug at https://github.com/rnabioco/valr/issues


Browse source code at https://github.com/cran/valr


Authors: Jay Hesselberth [aut, cre], Kent Riemondy [aut], Ryan Sheridan [ctb]




MIT + file LICENSE license


Imports dplyr, lazyeval, readr, stringr, tibble, tidyr, broom, ggplot2

Suggests knitr, rmarkdown, testthat, microbenchmark, covr, RMySQL, purrr

Linking to Rcpp, BH, dplyr

System requirements: C++11

