Last updated on 2021-02-25
by Giovanni Montana
Great advances have been made in the field of genetic analysis in recent years. The availability of millions
of single nucleotide polymorphisms (SNPs) in widely available databases, coupled with major advances in SNP genotyping
technology that reduce costs and increase throughput, is enabling a host of studies aimed at elucidating the genetic basis
of complex disease. The focus in this task view is on R packages implementing statistical methods and algorithms for the
analysis of genetic data and for related population genetics studies.
A number of R packages are already available, and many more are likely to be developed in the near future.
Please send your comments and suggestions to the task view maintainer.
implements classes and methods for representing genotype and haplotype data, and has several
functions for population genetic analysis (e.g. functions for estimation and testing of
Hardy-Weinberg and linkage disequilibria, etc.).
A few population genetics functions are also implemented in
fits models for genotypic disequilibria. Whilst
provides graphical representation of disequilibria via ternary plots (also known as de Finetti diagrams).
package provides functions for biodemographical analysis, e.g.
calculates the Fst from the conditional kinship matrix. The
adegenet package implements a number of different methods for analysing population structure using multivariate
statistics, graphics and spatial statistics.
The hierfstat package allows the estimation of hierarchical F-statistics from haploid or diploid genetic data with any numbers of levels in the hierarchy.
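To make the Hardy-Weinberg testing mentioned above concrete, here is a minimal, self-contained sketch of the classical one-degree-of-freedom chi-square test for a single biallelic marker. It is an illustration only and does not reproduce the interface of any package listed here; the function name `hwe_chi_square` and its arguments are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic marker.

    Hypothetical helper for illustration; arguments are observed genotype counts.
    Returns (frequency of allele A, chi-square statistic, p-value on 1 d.f.).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # allele A frequency from genotype counts
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)  # HWE expectations
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic marker).
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 d.f.: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return p, stat, p_value
```

For example, `hwe_chi_square(25, 50, 25)` gives a statistic of zero, because the counts match the Hardy-Weinberg proportions exactly. For small samples or rare alleles an exact test is usually preferred over this approximation.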
The Phylogenetics view has more detailed information;
the most important packages are also mentioned here.
Phylogenetic and evolution analyses can be performed via
provides Ornstein-Uhlenbeck models for phylogenetic comparative hypotheses.
estimates phylogenetic trees and networks using maximum likelihood, maximum parsimony, distance
methods and Hadamard conjugation.
There are few native packages for performing parametric or non-parametric linkage analysis
from within R itself; the calculations must be performed using external packages. However,
there are a number of ancillary R packages that facilitate interfacing with these stand-alone
programs and using the results for further analysis and presentation.
uses Identity By Descent (IBD) Non-Parametric Linkage (NPL) statistics for related pairs calculated
externally to test for genetic linkage with covariates by regression modelling.
Whilst not an official R package, one software suite in particular is worthy of mention.
is a C++ program for genome wide linkage analysis that supports R-based plug-ins via Rserve, allowing
users to utilise the rich suite of statistical functions in R for analysis.
Packages in this category develop methods for the analysis of experimental crosses
to identify markers contributing to variation in quantitative traits.
implement both likelihood-based and Bayesian methods for inbred crosses and recombinant inbred
provides several functions and a data structure for QTL mapping, including a function
for genome-wide scans.
builds on the qtl package by including functions for the modelling and summary of QTL intervals from the
full linkage map.
Packages in this category provide statistical methods to test associations between individual genetic markers
and a phenotype.
is a package for genetic data analysis of both population and family data; it contains functions for sample
size calculations, probability of familial disease aggregation, kinship calculation, and some tests for linkage
and association analyses. Among the other functions,
estimates haplotype frequencies from genotype data, and
implements Bayesian genomic control statistics for association studies. For family data,
offers an implementation of the Transmission/Disequilibrium Test (TDT) for extended marker haplotypes.
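As a rough illustration of the idea behind the Transmission/Disequilibrium Test, the sketch below computes the basic single-marker, biallelic TDT statistic from counts of transmitted and untransmitted alleles in heterozygous parents. This is the simple McNemar-type form, not the extended-haplotype version that the text refers to, and the function name `tdt` is hypothetical.

```python
import math


def tdt(transmitted, untransmitted):
    """Basic biallelic TDT (illustrative sketch, hypothetical helper).

    transmitted:   times heterozygous parents passed the allele of interest
                   to an affected child
    untransmitted: times they passed the other allele instead
    Returns (chi-square statistic on 1 d.f., p-value).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    stat = (transmitted - untransmitted) ** 2 / total
    # Upper tail of a chi-square with 1 d.f.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

With 60 transmissions against 40 non-transmissions, `tdt(60, 40)` gives a statistic of 4.0. Only heterozygous parents contribute, which is what makes the test robust to population stratification.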
Linkage Disequilibrium and haplotype mapping
A number of packages provide haplotype estimation for unrelated individuals with ambiguous haplotypes
(due to unknown linkage phase) and allow testing for associations between the estimated haplotypes and
phenotypes (including covariates) under a GLM framework.
performs likelihood inference of trait associations with haplotypes in GLMs.
implements transmission/disequilibrium tests for extended marker haplotypes.
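The haplotype estimation step described in this section can be illustrated for the simplest case of two biallelic markers, where only double heterozygotes have ambiguous phase. The following is a minimal EM sketch under that assumption; it is not the algorithm of any particular package, and the function name `em_haplotype_freqs`, the allele labels and the input layout are all hypothetical.

```python
def em_haplotype_freqs(counts, n_iter=200, tol=1e-12):
    """EM estimate of two-locus haplotype frequencies (illustrative sketch).

    counts[i][j] = number of individuals carrying i copies of allele A at
    locus 1 and j copies of allele B at locus 2 (i, j in 0..2).
    Returns a dict with frequencies of haplotypes AB, Ab, aB and ab.
    """
    n = sum(sum(row) for row in counts)
    if n == 0:
        raise ValueError("no individuals observed")
    # Haplotypes that can be counted directly (phase is unambiguous).
    base = {
        "AB": 2 * counts[2][2] + counts[2][1] + counts[1][2],
        "Ab": 2 * counts[2][0] + counts[2][1] + counts[1][0],
        "aB": 2 * counts[0][2] + counts[0][1] + counts[1][2],
        "ab": 2 * counts[0][0] + counts[0][1] + counts[1][0],
    }
    n_dh = counts[1][1]  # double heterozygotes: AB/ab or Ab/aB
    freqs = {h: 0.25 for h in base}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AB/ab.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update frequencies from expected haplotype counts.
        new = {
            "AB": (base["AB"] + n_dh * w) / (2 * n),
            "ab": (base["ab"] + n_dh * w) / (2 * n),
            "Ab": (base["Ab"] + n_dh * (1 - w)) / (2 * n),
            "aB": (base["aB"] + n_dh * (1 - w)) / (2 * n),
        }
        delta = max(abs(new[h] - freqs[h]) for h in new)
        freqs = new
        if delta < tol:
            break
    return freqs
```

The estimated frequencies always sum to one. Extending the idea to many markers requires enumerating all haplotype pairs compatible with each genotype, which is where dedicated packages become necessary.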
Genome-Wide Association Studies (GWAS)
With recent technical advances in high-throughput genotyping technologies, performing
Genome-Wide Association Studies is now feasible. A number of packages are available to facilitate
the analysis of these large data sets.
provides a GUI to the powerful PBAT software, which performs family- and
population-based studies. The software has been implemented to take advantage of parallel processing, which
vastly reduces the computational time required for GWAS.
is another package for carrying out GWAS analysis. It provides descriptive statistics of the data
(including patterns of missing data) and tests for Hardy-Weinberg equilibrium. Single-point analyses with binary
or quantitative traits are implemented via generalized linear models, and multiple SNPs can be analysed for
haplotypic associations or epistasis.
Implements classes and methods for large-scale SNP association studies.
qvalue on Bioconductor
implements False Discovery Rate; the main function
estimates the q-values from a list of p-values.
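To show what turning a list of p-values into FDR-style adjusted values involves, here is a small sketch of the Benjamini-Hochberg step-up adjustment. Note the substitution: this is not the q-value estimator itself, which additionally estimates the proportion of true null hypotheses; the Benjamini-Hochberg values correspond to the conservative case in which that proportion is taken to be 1. The function name `bh_adjust` is hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (illustrative sketch).

    Returns adjusted values in the same order as the input list.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Markers whose adjusted value falls below a chosen threshold, say 0.05, are then declared significant while controlling the expected proportion of false discoveries at that level.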
multtest on Bioconductor
also offers several non-parametric bootstrap and permutation resampling-based multiple testing procedures.
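As a sketch of what a permutation-based multiple testing procedure looks like, the code below implements a single-step maxT adjustment for a case-control comparison across several markers. This is one common procedure of this kind, chosen for illustration; the text does not say which procedures or defaults the package uses. The function name `max_t_adjusted`, the test statistic (absolute difference in mean allele count) and the input layout are assumptions.

```python
import random


def max_t_adjusted(genotypes, labels, n_perm=1000, seed=0):
    """Single-step maxT permutation-adjusted p-values (illustrative sketch).

    genotypes: one list per marker, holding an allele count per individual
    labels:    0/1 case-control status per individual (both groups non-empty)
    """

    def stat(values, labs):
        # Absolute difference in mean allele count between cases and controls.
        cases = [v for v, l in zip(values, labs) if l == 1]
        controls = [v for v, l in zip(values, labs) if l == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    rng = random.Random(seed)
    observed = [stat(g, labels) for g in genotypes]
    exceed = [0] * len(genotypes)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # break any marker-phenotype association
        max_stat = max(stat(g, perm) for g in genotypes)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add one to numerator and denominator so no p-value is exactly zero.
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

Comparing each observed statistic with the permutation distribution of the maximum over all markers controls the family-wise error rate while preserving the correlation between markers, which matters when SNPs are in linkage disequilibrium.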
Importing Sequence Data
There are utilities in the
package to import sequence data from various sources, including files of aligned sequences in mase, clustal,
phylip, fasta and msf format, which will be useful for some population genetic analyses. Users interested in
using R for sequence data and bioinformatics are also referred to the