I just came back from the talk “Statistical Methods for Analysis of Gut Microbiome Data”, given by Professor Hongzhe Li from the University of Pennsylvania.

I learned a new biological term: the microbiome, which can be viewed as an extension of the human genome.

A microbiome is the totality of microbes, their genetic elements (genomes), and environmental interactions in a particular environment. The term “microbiome” was coined by Joshua Lederberg, who argued that microorganisms inhabiting the human body should be included as part of the human genome, because of their influence on human physiology. The human body is often said to contain about 10 times more microbial cells than human cells, although more recent estimates put the ratio closer to 1:1.

There are several research methods:

Targeted amplicon sequencing

Targeted amplicon sequencing relies on having some expectations about the composition of the community that is being studied. In targeted amplicon sequencing, a phylogenetically informative marker is targeted for sequencing. Ideally, such a marker is present in all the expected organisms. It should also be conserved enough that primers can target genes from a wide range of organisms, while evolving quickly enough to allow for finer resolution at the taxonomic level. A common marker for human microbiome studies is the gene for bacterial 16S rRNA (i.e. “16S rDNA”, the sequence of DNA which encodes the ribosomal RNA molecule). Since ribosomes are present in all living organisms, using 16S rDNA allows DNA to be amplified from many more organisms than if another marker were used. The 16S rDNA gene contains both slowly evolving regions and fast evolving regions; the former can be used to design broad primers, while the latter allow for finer taxonomic distinction. However, species-level resolution is not typically possible using 16S rDNA. Primer selection is an important step, as anything that cannot be targeted by the primer will not be amplified and thus will not be detected. Different sets of primers have been shown to amplify different taxonomic groups due to sequence variation.

Targeted studies of eukaryotic and viral communities are limited and subject to the challenge of excluding host DNA from amplification and the reduced eukaryotic and viral biomass in the human microbiome.

After the amplicons are sequenced, molecular phylogenetic methods are used to infer the composition of the microbial community. This is done by clustering the amplicons into operational taxonomic units (OTUs) and inferring phylogenetic relationships between the sequences. Because the resulting datasets are large, further computational analysis is needed to identify patterns in them. Tools used to analyze the data include VAMPS, QIIME and mothur.
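To make the OTU clustering step concrete, here is a minimal sketch of greedy centroid-based clustering. It is not how VAMPS, QIIME or mothur actually implement it. I am assuming the reads are non-empty, already aligned and of equal length, and that identity is just the fraction of matching positions; the function names and the 0.97 default threshold are my own choices for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otu_clusters(reads, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold identity.

    A read that matches no existing centroid becomes the centroid of a new OTU.
    The result depends on the order of the reads (hypothetical simplification).
    """
    centroids = []  # one representative read per OTU
    clusters = []   # clusters[i] holds the reads assigned to centroids[i]
    for read in reads:
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                clusters[i].append(read)
                break
        else:
            # No centroid was close enough: start a new OTU.
            centroids.append(read)
            clusters.append([read])
    return clusters


# Toy example with short reads and a loose threshold.
toy_reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
otus = greedy_otu_clusters(toy_reads, threshold=0.8)
print([len(c) for c in otus])  # prints [2, 1]
```

The first two toy reads differ at one position out of ten, so they fall into the same OTU, while the third read starts its own. The per-sample counts of reads in each OTU are the kind of taxa count data the statistical methods in the talk work with.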

Metagenomic sequencing

Metagenomics is also used extensively for studying microbial communities. In metagenomic sequencing, DNA is recovered directly from environmental samples in an untargeted manner with the goal of obtaining an unbiased sample from all genes of all members of the community. Recent studies use shotgun Sanger sequencing or pyrosequencing to recover the sequences of the reads. The reads can then be assembled into contigs. To determine the phylogenetic identity of a sequence, it is compared to available full genome sequences using methods such as BLAST. One drawback of this approach is that many members of microbial communities do not have a representative sequenced genome.

Although metagenomics is limited by the availability of reference sequences, one significant advantage of metagenomics over targeted amplicon sequencing is that metagenomic data can elucidate the functional potential of the community DNA. Targeted gene surveys cannot do this, as they only reveal the phylogenetic relationship between the same gene from different organisms. Functional analysis is done by comparing the recovered sequences to databases of metagenomic annotations such as KEGG. The metabolic pathways that these genes are involved in can then be predicted with tools such as MG-RAST, CAMERA and IMG/M.

RNA and protein-based approaches

Metatranscriptomic studies have been performed to study the gene expression of microbial communities through methods such as the pyrosequencing of extracted RNA. Structure-based studies have also identified non-coding RNAs (ncRNAs) such as ribozymes from microbiota. Metaproteomics is a new approach that studies the proteins expressed by microbiota, giving insight into its functional potential.

He presented two statistical methods for data produced by the first approach listed above (targeted amplicon sequencing):

  1. Kernel-based regression to test the effect of microbiome composition on an outcome
  2. Sparse Dirichlet-multinomial regression for taxon-level analysis
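To get a feel for the second method, here is a minimal sketch of the Dirichlet-multinomial log-likelihood for the taxa counts of one sample. The log link between covariates and the Dirichlet parameters, alpha_k = exp(b0_k + sum_j x_j * b_kj), is my assumption about how the regression is set up, not something taken from the talk slides. All function and parameter names are hypothetical.

```python
import math


def dirichlet_multinomial_loglik(counts, alphas):
    """Log-probability of taxa counts under a Dirichlet-multinomial model.

    counts: non-negative integer counts, one per taxon.
    alphas: positive Dirichlet parameters, one per taxon.
    """
    if len(counts) != len(alphas):
        raise ValueError("counts and alphas must have the same length")
    n = sum(counts)
    a = sum(alphas)
    # Multinomial coefficient: log n! - sum_k log x_k!
    ll = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of gamma functions from integrating out the taxon proportions.
    ll += math.lgamma(a) - math.lgamma(n + a)
    ll += sum(math.lgamma(x + al) - math.lgamma(al)
              for x, al in zip(counts, alphas))
    return ll


def alphas_from_covariates(covariates, intercepts, coefs):
    """Assumed log link: alpha_k = exp(b0_k + sum_j x_j * b_kj)."""
    return [math.exp(b0 + sum(x * b for x, b in zip(covariates, row)))
            for b0, row in zip(intercepts, coefs)]


# Two taxa, one covariate set to zero, so both alphas equal exp(0) = 1.
alphas = alphas_from_covariates([0.0], [0.0, 0.0], [[0.5], [-0.5]])
loglik = dirichlet_multinomial_loglik([1, 1], alphas)
print(math.exp(loglik))  # approximately 1/3
```

With two taxa and both parameters equal to 1, the model reduces to a uniform distribution over the three possible splits of two reads, which is why the probability comes out at about 1/3. Smaller alphas put more variance on the counts than a plain multinomial would, which is the overdispersion the model is meant to capture.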

The following is the abstract of this talk:

With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output are taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. In some cases, we may have a large number of covariates and identification of the relevant covariates and their associated bacterial taxa becomes important. In this talk, I present several statistical methods for analysis of the human microbiome data, including exploratory analysis methods such as generalized UniFrac distances and graph-constrained canonical correlations and statistical models for the count data and simplex data. In particular, I present a sparse group variable selection method for Dirichlet-multinomial regression to account for overdispersion of the counts and to impose a sparse group L1 penalty to encourage both group-level and within-group sparsity. I demonstrate the application of these methods with an on-going human gut microbiome study to investigate the association between nutrient intake and microbiome composition. Finally, I present several challenging statistical and computational problems in analysis of shotgun metagenomics data.
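As a rough illustration of the sparse group L1 penalty mentioned in the abstract, the sketch below computes a penalty of the usual sparse-group-lasso form: lam1 times the sum of the group L2 norms plus lam2 times the sum of the absolute values of all coefficients. I am assuming this standard form; the exact weighting used in the talk may differ (some formulations scale each group term by the square root of the group size). The names `sparse_group_penalty`, `lam1` and `lam2` are mine.

```python
import math


def sparse_group_penalty(groups, lam1, lam2):
    """Sparse group penalty: lam1 * sum_g ||b_g||_2 + lam2 * sum_g ||b_g||_1.

    groups: list of coefficient lists, e.g. one group per covariate holding
            its coefficients across all taxa (an assumed grouping).
    lam1:   weight on the group-level (L2 norm) term, which can zero out
            whole groups.
    lam2:   weight on the within-group (L1 norm) term, which can zero out
            individual coefficients.
    """
    group_term = sum(math.sqrt(sum(b * b for b in g)) for g in groups)
    l1_term = sum(abs(b) for g in groups for b in g)
    return lam1 * group_term + lam2 * l1_term


# One active group and one group that is entirely zero.
beta_groups = [[3.0, 4.0], [0.0, 0.0]]
penalty = sparse_group_penalty(beta_groups, lam1=1.0, lam2=0.5)
print(penalty)  # 1.0 * 5.0 + 0.5 * 7.0 = 8.5
```

The all-zero group contributes nothing to either term, which is the behaviour that lets the fitted model drop a covariate entirely while still selecting individual taxa within the covariates it keeps.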