AutoPipe provides an unsupervised, fully automated pipeline for transcriptome analysis, as well as a supervised option to identify characteristic genes from predefined subclasses. We rely on the 'pamr' <http://www.bioconductor.org/packages//2.7/bioc/html/pamr.html> clustering algorithm to cluster the data and then draw a heatmap of the clusters with the most significant and the least significant genes according to the 'pamr' algorithm. The resulting heatmaps are easy to grasp and show, for each cluster, which genes define it most strongly.
Transcriptomics is the large-scale measurement of gene expression across multiple samples. Gene expression mirrors functional aspects and carries important information about biological functions and pathway activation. Its analysis can uncover molecular functions on the one hand and improve the classification of large cohorts for better clinical understanding on the other. This tool aims to provide a standard pipeline that combines classification with functional analysis and generates a visual output integrating transcriptomic data, clinical information and Gene Set Enrichment Analysis.
The pipeline was designed to address the following aspects:
Reproducibility: Analysis needs to be easily reproduced by external researchers.
Easy-to-Use: The pipeline was designed to be user-friendly and applicable for non-expert users.
Compatible: The pipeline should work with array-based transcriptomic data as well as RNA sequencing outputs. For further clinical interpretation, external traits need to be easy to integrate and include in the analysis.
Install with devtools
install.packages("devtools")
library(devtools)
install_github("falafel19/AutoPipe")
A function for unsupervised clustering of the data
# Load data with gene ENTREZ IDs in rownames and samples in colnames
data(y)
dim(data)
# Optional: read in clinical information with samples in rownames
UnSuperClassifier(data, clinical_data = NULL, thr = 2)
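As a rough illustration of the input layout described above (Entrez gene IDs in the row names, samples in the column names, and an optional clinical table with samples in the row names), the following base-R sketch builds a small random data set. The object names `expr` and `clinical`, the clinical columns, and all values are made up for illustration and are not part of AutoPipe.

```r
set.seed(1)

# Hypothetical toy expression matrix: 5 genes (rows) x 4 samples (columns)
expr <- matrix(rnorm(20), nrow = 5, ncol = 4)
rownames(expr) <- c("7157", "1956", "672", "5728", "4609")  # Entrez gene IDs as strings
colnames(expr) <- paste0("Sample", 1:4)

# Hypothetical clinical table with one row per sample (samples in rownames)
clinical <- data.frame(
  age = c(54, 61, 47, 70),
  sex = c("f", "m", "m", "f"),
  row.names = colnames(expr)
)

# Check that the clinical rows line up with the expression columns
stopifnot(identical(rownames(clinical), colnames(expr)))
dim(expr)
```

Keeping the sample names identical and in the same order in both objects avoids mismatches when clinical traits are later shown next to the expression data.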
This function produces a heatmap plot using a supervised clustering algorithm chosen by the user. The mean silhouette width is plotted in the top right corner, and the silhouette width of each sample is shown on top. On the right side of the plot, the n highest- and lowest-scoring genes for each cluster are added, with the corresponding pathways next to them (see Details).
## load the org.Hs.eg library
library(org.Hs.eg.db)
## load data
data(rna)
me=rna
## calculate best number of clusters
res<-TopPAM(me, max_clusters = 8, TOP=1000)
me_TOP=res[]
number_of_k=res[]
## Compute the top genes of each cluster; with "TRw", samples with negative silhouette widths can be cut off
File_genes=Groups_Sup(me_TOP, me=me, number_of_k,TRw=-1)
groups_men=File_genes[]
me_x=File_genes[]
# groups_men contains the sample and cluster information; it can be adapted for a supervised analysis
o_g<-Supervised_Cluster_Heatmap(groups_men = groups_men, gene_matrix=me_x,
                                method="PAMR",show_sil=TRUE,print_genes=TRUE,
                                TOP = 1000,GSE=TRUE,plot_mean_sil=TRUE,sil_mean=res[])
# Validate with consensus clustering or tSNE
cons_clust(me_x,max_clust=8, TOP=1000)
AutoPipe.tSNE(me=me_x)
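To make the silhouette widths used above more concrete, here is a minimal sketch that picks a number of clusters by mean silhouette width with `pam()` from the 'cluster' package. It is an illustration under stated assumptions, not AutoPipe's internal implementation: the toy data and the names `x`, `ks`, `mean_sil`, `best_k` and `negative_samples` are hypothetical.

```r
library(cluster)  # provides pam(); ships with standard R installations

set.seed(42)
# Toy data: 20 samples in two well-separated groups, 50 features each.
# pam() expects samples in rows, so a genes x samples matrix would need t().
x <- rbind(
  matrix(rnorm(10 * 50, mean = 0), nrow = 10),
  matrix(rnorm(10 * 50, mean = 3), nrow = 10)
)
rownames(x) <- paste0("Sample", 1:20)

# Mean silhouette width for each candidate number of clusters
ks <- 2:5
mean_sil <- sapply(ks, function(k) pam(x, k = k)$silinfo$avg.width)
best_k <- ks[which.max(mean_sil)]

# Per-sample silhouette widths for the chosen k (rows are named by sample)
fit <- pam(x, k = best_k)
widths <- fit$silinfo$widths

# Samples with a negative width sit closer to a neighbouring cluster
negative_samples <- rownames(widths)[widths[, "sil_width"] < 0]
```

A silhouette width close to 1 means a sample fits its cluster well, while a negative value suggests it may belong to a neighbouring cluster, which is why such samples are candidates for removal before the top genes are computed.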
D. H. Heiland & K. Daka, Translational Research Group, Medical-Center Freiburg, University of Freiburg