Our package coseq is now available on Bioconductor. Check it out for co-expression analyses of RNA-seq data using Gaussian mixture models fitted after an appropriate transformation. Feedback is welcome! The related manuscript detailing the method implemented in coseq may be found in this bioRxiv preprint:
- Rau, A. and Maugis-Rabusseau, C. (2016) Transformation and model choice for RNA-seq co-expression analysis. To appear in Briefings in Bioinformatics.
Here’s the tl;dr version:
- After applying an appropriate transformation, Gaussian mixture models represent a rich, flexible, and well-characterized class of models to identify groups of co-expressed genes from RNA-seq data. In particular, they directly account for per-cluster correlation structures among samples, which are observed to be quite strong in typical RNA-seq data.
- Normalized expression profiles, rather than raw counts, are recommended for co-expression analyses of RNA-seq data. Because these data are compositional in nature, an additional transformation (e.g., arcsine or logit) is required prior to fitting a Gaussian mixture model.
- Penalized model selection criteria like the BIC or ICL can be used to select both the number of clusters present and the appropriate transformation to use; in the latter case, an additional term based on the Jacobian of the transformation is added to the criterion, yielding a corrected BIC or ICL that can be used to directly compare two transformations.
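
To make the last two points concrete, here is a minimal Python sketch of the idea. It is not the coseq implementation (coseq is an R package), and it rests on several assumptions: the helper names (`profiles`, `corrected_bic`, and so on) are hypothetical, the pseudocount of 1 is just a guard against profiles of exactly 0 or 1, the logit uses the natural log, the transformation is treated as element-wise, and the BIC convention is taken to be -2 log L + ν log n (lower is better). The sign and scaling of the correction in the manuscript may differ from this convention.

```python
import math


def profiles(counts, pseudocount=1.0):
    """Convert each gene's row of normalized counts into proportions that sum to 1.

    The pseudocount is an assumption of this sketch; it keeps every
    proportion strictly between 0 and 1 so the transformations are defined.
    """
    out = []
    for row in counts:
        shifted = [y + pseudocount for y in row]
        total = sum(shifted)
        out.append([y / total for y in shifted])
    return out


def arcsine(p):
    """Arcsine transformation of a proportion."""
    return math.asin(math.sqrt(p))


def logit(p):
    """Logit transformation of a proportion (natural log assumed here)."""
    return math.log(p / (1.0 - p))


def log_jac_arcsine(p):
    # d/dp asin(sqrt(p)) = 1 / (2 * sqrt(p * (1 - p)))
    return -math.log(2.0) - 0.5 * math.log(p * (1.0 - p))


def log_jac_logit(p):
    # d/dp log(p / (1 - p)) = 1 / (p * (1 - p))
    return -math.log(p * (1.0 - p))


def corrected_bic(bic_transformed, profs, log_jac):
    """Put a BIC computed on transformed data back on the scale of the profiles.

    By the change-of-variables formula, the log-likelihood on the profile
    scale equals the log-likelihood on the transformed scale plus the sum of
    log|g'(p)| over all entries. With BIC = -2 log L + nu log n, that sum
    enters with a factor of -2.
    """
    log_det = sum(log_jac(p) for row in profs for p in row)
    return bic_transformed - 2.0 * log_det


if __name__ == "__main__":
    # Toy normalized counts: 2 genes x 4 samples (made-up numbers).
    toy = [[10.0, 20.0, 30.0, 40.0], [5.0, 5.0, 80.0, 10.0]]
    p = profiles(toy)
    transformed = [[arcsine(v) for v in row] for row in p]
    # In practice the BIC would come from a Gaussian mixture fitted to the
    # transformed data; the value below is a placeholder, not a real result.
    placeholder_bic = 100.0
    print(corrected_bic(placeholder_bic, p, log_jac_arcsine))
    print(corrected_bic(placeholder_bic, p, log_jac_logit))
```

Because both corrected values refer to the same untransformed profiles, they can be compared directly, which is what lets the criterion choose between the arcsine and logit transformations as well as the number of clusters.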