You are currently browsing the monthly archive for April 2012.

  1. LDA explained
  2. Counting the total number of…
  3. Significance Test for Kendall’s Tau-b
  4. dimension reduction in ABC [a review’s review]
  5. 9 essential LaTeX packages everyone should use
  6. Linguistic Notation Inside of R Plots! about knitr
  7. knitr Elegant, flexible and fast dynamic report generation with R
  8. knitr Performance Report-Attempt 1
  9. knitr Performance Report-Attempt 2
  10. Question: Why you need perl/python if you know R/Shell [NGS data analysis]
  11. SPAMS (SPArse Modeling Software) now with Python and R
  12. Large-scale Inference and empirical Bayes, they are related with multiple testing
  13. My setup about some softwares and editors
  14. Fancy HTML5 Slides with knitr and pandoc
  15. John talks about Random is as random does
  16. MCMC at ICMS (1)
  17. MCMC at ICMS (2)
  18. MCMC at ICMS (3)
  19. John Cook: Why and How People Use R
  20. An Introduction to 6 Machine Learning Models
  21. Machine Learning: Algorithms that Produce Clusters
  22. Dirichlet Process for dummies

I just came back from the talk “Statistical Methods for Analysis of Gut Microbiome Data”, given by Professor Hongzhe Li from the University of Pennsylvania.

I learned a new biological term: the microbiome, viewed as an extension of the human genome.

A microbiome is the totality of microbes, their genetic elements (genomes), and environmental interactions in a particular environment. The term “microbiome” was coined by Joshua Lederberg, who argued that microorganisms inhabiting the human body should be included as part of the human genome, because of their influence on human physiology. The human body contains over 10 times more microbial cells than human cells.

There are several research methods:

Targeted amplicon sequencing

Targeted amplicon sequencing relies on having some expectations about the composition of the community that is being studied. In targeted amplicon sequencing, a phylogenetically informative marker is targeted for sequencing. Such a marker should ideally be present in all the expected organisms. It should also evolve in such a way that it is conserved enough that primers can target genes from a wide range of organisms while evolving quickly enough to allow for finer resolution at the taxonomic level. A common marker for human microbiome studies is the gene for bacterial 16S rRNA (i.e. “16S rDNA”, the sequence of DNA which encodes the ribosomal RNA molecule). Since ribosomes are present in all living organisms, using 16S rDNA allows for DNA to be amplified from many more organisms than if another marker were used. The 16S rDNA gene contains both slowly evolving regions and fast evolving regions; the former can be used to design broad primers while the latter allow for finer taxonomic distinction. However, species-level resolution is not typically possible using the 16S rDNA. Primer selection is an important step, as anything that cannot be targeted by the primer will not be amplified and thus will not be detected. Different sets of primers have been shown to amplify different taxonomic groups due to sequence variation.

Targeted studies of eukaryotic and viral communities are limited and subject to the challenge of excluding host DNA from amplification and the reduced eukaryotic and viral biomass in the human microbiome.

After the amplicons are sequenced, molecular phylogenetic methods are used to infer the composition of the microbial community. This is done by clustering the amplicons into operational taxonomic units (OTUs) and inferring phylogenetic relationships between the sequences. An important point is that the scale of data is extensive, and further approaches must be taken to identify patterns from the available information. Tools used to analyze the data include VAMPS, QIIME and mothur.
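To make the OTU clustering step concrete, here is a minimal sketch of greedy, seed-based clustering. It is only an illustration under strong assumptions: the reads are taken to be pre-aligned and of equal length, identity is a plain position-by-position match rate, and the function names `identity` and `greedy_otus` are made up for this example. The 97% default is a commonly used convention, not something the talk prescribed, and tools such as QIIME and mothur use far more careful alignment and clustering procedures than this.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes pre-aligned, equal-length reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads: list[str], threshold: float = 0.97) -> list[list[str]]:
    """Put each read into the first OTU whose seed it matches closely enough."""
    otus: list[list[str]] = []  # each OTU is a list; element 0 is its seed
    for read in reads:
        for otu in otus:
            if identity(read, otu[0]) >= threshold:
                otu.append(read)
                break
        else:
            otus.append([read])  # no seed was close enough: start a new OTU
    return otus


# Toy reads (hypothetical data, far shorter than real 16S amplicons).
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",  # identical to the first read
    "ACGTACGTAT",  # one mismatch in ten positions
    "TTTTGGGGCC",  # unrelated sequence
]

loose = greedy_otus(reads, threshold=0.85)
strict = greedy_otus(reads, threshold=0.97)
print([len(otu) for otu in loose])   # OTU sizes at the loose threshold
print([len(otu) for otu in strict])  # OTU sizes at the strict threshold
```

The per-sample OTU sizes produced this way are the kind of taxa counts that the statistical methods later in this post take as input.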

Metagenomic sequencing

Metagenomics is also used extensively for studying microbial communities. In metagenomic sequencing, DNA is recovered directly from environmental samples in an untargeted manner with the goal of obtaining an unbiased sample from all genes of all members of the community. Recent studies use shotgun Sanger sequencing or pyrosequencing to recover the sequences of the reads. The reads can then be assembled into contigs. To determine the phylogenetic identity of a sequence, it is compared to available full genome sequences using methods such as BLAST. One drawback of this approach is that many members of microbial communities do not have a representative sequenced genome.

Despite the fact that metagenomics is limited by the availability of reference sequences, one significant advantage of metagenomics over targeted amplicon sequencing is that metagenomics data can elucidate the functional potential of the community DNA. Targeted gene surveys cannot do this as they only reveal the phylogenetic relationship between the same gene from different organisms. Functional analysis is done by comparing the recovered sequences to databases of metagenomic annotations such as KEGG. The metabolic pathways that these genes are involved in can then be predicted with tools such as MG-RAST, CAMERA and IMG/M.

RNA and protein-based approaches

Metatranscriptomics studies have been performed to study the gene expression of microbial communities through methods such as the pyrosequencing of extracted RNA. Structure-based studies have also identified non-coding RNAs (ncRNAs) such as ribozymes from microbiota. Metaproteomics is a new approach that studies the proteins expressed by microbiota, giving insight into its functional potential.

He discussed two statistical methods built on data from the first technology listed above (targeted amplicon sequencing):

  1. Kernel-based regression to test the effect of Microbiome composition on an outcome
  2. Sparse Dirichlet-Multinomial regression for Taxon-level analysis

The following is the abstract of this talk:

With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output are taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. In some cases, we may have a large number of covariates and identification of the relevant covariates and their associated bacterial taxa becomes important. In this talk, I present several statistical methods for analysis of the human microbiome data, including exploratory analysis methods such as generalized UniFrac distances and graph-constrained canonical correlations and statistical models for the count data and simplex data. In particular, I present a sparse group variable selection method for Dirichlet-multinomial regression to account for overdispersion of the counts and to impose a sparse group L1 penalty to encourage both group-level and within-group sparsity. I demonstrate the application of these methods with an on-going human gut microbiome study to investigate the association between nutrient intake and microbiome composition. Finally, I present several challenging statistical and computational problems in analysis of shotgun metagenomics data.
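To see what “accounting for overdispersion of the counts” means, here is a small sketch of the Dirichlet-multinomial log-probability for a single sample. It is not the speaker's regression method: the concentration parameters `alpha` are fixed numbers here, whereas a regression model would make them depend on covariates. The function names are mine, chosen only for this illustration.

```python
from math import exp, lgamma


def dm_logpmf(counts, alpha):
    """Log-probability of taxa counts under a Dirichlet-multinomial."""
    n = sum(counts)
    a = sum(alpha)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    log_norm = lgamma(a) - lgamma(n + a)
    log_terms = sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_coef + log_norm + log_terms


def variance_inflation(n, alpha):
    """Factor by which each taxon's variance exceeds the multinomial variance."""
    a = sum(alpha)
    return (n + a) / (1 + a)


# Sanity check: probabilities over all outcomes with n = 3 reads and 2 taxa.
alpha = (1.0, 1.0)
probs = [exp(dm_logpmf((k, 3 - k), alpha)) for k in range(4)]
print(probs, sum(probs))

# Small alpha (strong overdispersion) versus large alpha (close to multinomial).
print(variance_inflation(1000, (0.5, 0.5)))
print(variance_inflation(1000, (5000.0, 5000.0)))
```

As the sum of `alpha` grows, the inflation factor tends to 1 and the model approaches the ordinary multinomial; small values allow much more sample-to-sample variation than a multinomial would.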

Today there will be a talk, “Imaginary Geometry and the Gaussian Free Field”, given by Jason Miller from Microsoft Research. I googled the topic and found the following interesting material:

  1. Gaussian free fields for mathematicians
  2. Gaussian free field and conformal field theory: These expository lectures give an elementary introduction to conformal field theory in the context of probability theory and complex analysis. They consider statistical fields and define Ward functionals in terms of their Lie derivatives. Based on this approach, they explain some equations of conformal field theory and outline their relation to SLE theory.
  3. SLE and the free field: Partition functions and couplings
  4. Schramm-Loewner evolution (SLE). See the slides by Tom Alberts, the 2006 ICM slides by Oded Schramm, and the St. Flour Lecture Notes by Wendelin Werner. See also the notes on Ito’s lemma.

The meaning of the term “Biological Replicate” unfortunately often does not get adequately addressed in many publications. “Biological Replicate” can have multiple meanings, depending upon the context of the study. A general definition could be that biological replicates are when the same type of organism is grown/treated under the same conditions. For example, if one was performing a cell-based study, then different flasks containing the same type of cell (and preferably the exact same lineage and passage number) which have been grown under the same conditions could be considered biological replicates of one another. The definition becomes a bit trickier when dealing with higher-order organisms, especially humans. This may be an entire discussion in and of itself, but in this case, it is important to note that one does not have a well-defined lineage or passage number for humans. Indeed, it is basically impossible to ensure that all of your samples for one treatment or control have been exposed to the same external factors. In this case, one must do all that is possible to accurately portray and group these organisms; thus, one should group according to such traits as gender, age, and other well-established cause-effect traits (smokers, heavy drinkers, etc.).

Also, it may be helpful to outline the contrast between biological and technical replicates. Though people have varying definitions of technical replicates, perhaps the purest form of technical replicate would be when the exact same sample (after all preparatory techniques) is analyzed multiple times. The point of such a technical replicate would be to establish the variability (experimental error) of the analysis technique (mass spectrometry, LC, etc.), thus allowing one to set confidence limits for what is significant data. This is in contrast to the reasoning behind a biological replicate, which is to establish the biological variability which exists between organisms which should be identical. Knowing the inherent variability between “identical” organisms allows one to decide whether observed differences between groups of organisms exposed to different treatments is simply random or represents a “true” biological difference induced by such treatment.
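The contrast between the two kinds of replicate can be illustrated with a small simulation. Everything numeric here is an assumption made up for the illustration (the true mean, the biological spread `BIO_SD`, the instrument spread `TECH_SD`, and the Gaussian noise model); the point is only that technical replicates estimate the measurement variance, while biological replicates measured once each see biological plus measurement variance.

```python
import random
import statistics

random.seed(0)

TRUE_MEAN = 10.0  # assumed population-level abundance
BIO_SD = 2.0      # assumed spread between "identical" organisms
TECH_SD = 0.5     # assumed spread of the analysis technique


def measure(true_level: float) -> float:
    """One noisy measurement of a sample whose true level is given."""
    return random.gauss(true_level, TECH_SD)


# Technical replicates: one sample, measured many times.
one_sample = random.gauss(TRUE_MEAN, BIO_SD)
technical = [measure(one_sample) for _ in range(5000)]

# Biological replicates: many independent samples, each measured once.
biological = [measure(random.gauss(TRUE_MEAN, BIO_SD)) for _ in range(5000)]

tech_var = statistics.variance(technical)
bio_var = statistics.variance(biological)
print(f"variance across technical replicates:  {tech_var:.3f}")
print(f"variance across biological replicates: {bio_var:.3f}")
```

Under these assumptions the first number should sit near `TECH_SD ** 2` and the second near `BIO_SD ** 2 + TECH_SD ** 2`, which is why confidence limits derived from technical replicates alone are too narrow for comparing treatments.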

Biological Factor: Single biological parameter controlled by the investigator. For example, genotype, diet, environmental stimulus, age, etc.

Treatment or Treatment Level: An exact value for a biological factor; for example, stress, no-stress, young, old, drug-treated, placebo, etc.

Condition: A single combination of treatments; for example, strain1/stressed/time10, young/drug-treated, etc.

Sample: An entity which has a single condition and is measured experimentally; for example serum from a single mouse, a sample drawn from a pool of yeast, a sample of pancreatic beta cells pooled from 5 diabetic animals, the third blood sample taken from a participant in a drug study.

Biological Measurement: A value measured on a collection of samples; for example, abundance of protein x, abundance of phospho-protein y, abundance of transcript z.

Experiment: A collection of biological measurements on two or more samples.

Replicate: Two sets of measurements, either within a single experiment or in two different experiments, where measurements are made on samples in the same condition.

Technical Replicates: Replicates that share the same sample; i.e. the measurements are repeated.

Biological Replicates: Replicates where different samples are used for both replicates
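The definitions above map naturally onto a small data structure. The sketch below is one possible encoding, with a hypothetical `Measurement` class and `replicate_type` function: a condition is a tuple of treatment levels, two measurements are replicates when their conditions match, and the replicate is technical or biological depending on whether they share a sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    sample_id: str               # which physical sample was measured
    condition: tuple[str, ...]   # one treatment level per biological factor
    value: float                 # the biological measurement itself


def replicate_type(a: Measurement, b: Measurement) -> str:
    """Classify a pair of measurements using the definitions above."""
    if a.condition != b.condition:
        return "not replicates"
    if a.sample_id == b.sample_id:
        return "technical"
    return "biological"


# Hypothetical example: serum from two young, drug-treated mice.
m1 = Measurement("mouse1-serum", ("young", "drug-treated"), 4.2)
m2 = Measurement("mouse1-serum", ("young", "drug-treated"), 4.3)
m3 = Measurement("mouse2-serum", ("young", "drug-treated"), 5.1)
m4 = Measurement("mouse3-serum", ("old", "placebo"), 3.0)

print(replicate_type(m1, m2))
print(replicate_type(m1, m3))
print(replicate_type(m1, m4))
```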

Question: Technical/Biological Replicates in RNA-Seq For Two Cell Lines

I have a question about the meaning of “biological replicate” in the context of applying RNA-seq to compare two cell lines. Apologies if this is an overly naive question.

We have two human cell lines, one of which was derived from the other. Both have different phenotypes, and we want to use RNA-seq to explore the genetic underpinnings of the difference.

If we generate one cDNA library for each sample, and sequence each library on two lanes of an Illumina GA flowcell, I understand we will have “technical replicates”. In this scenario, we can expect very little difference between the two replicates in a sample. If we were to use something like DESeq to call differential expression, it would be inappropriate to treat our technical replicates as replicates in DESeq, since that would likely lead to a large list of DE calls that don’t reflect biological differences.

So, I’d like to know if it is possible within our model to have “biological replicates” with which we can use DESeq to call biologically meaningful differential expression.

So, two questions:

(1) If we grow up two sets of cells from each of our two cell lines, generate separate cDNA libraries (4 in total), and sequence them on separate lanes, would these be considered “biological replicates” in the sense that it would be appropriate to treat them as replicates within something like DESeq? I suspect not, since the fact that both replicates in a sample derive from a single cell line within a short period of time will mean that they will be very similar anyway, almost as similar as the technical replicate scenario. Perhaps we would need entirely separate cell lines to be considered biological replicates.

(2) In general, how would others address this – does it seem a better approach to go with separate cells and separate libraries, or would this entail extra effort for effectively no benefit?


Two “biological replicates” are two samples that should be identical (as far as you can or want to control) but are biologically separate (different cells, different organisms, different populations, colonies…).

You want to check the difference between cell line A and cell line B. Start by assuming they are identical. Even if they are, random fluctuation, technical issues, and slightly different environments mean you will never observe exactly the same expression for all genes. You will find differences, but you cannot tell whether they are inevitable fluctuations or the result of an actual difference.

So you want two independent populations from A and two independent populations from B, and then you compare the variability within each line (A1 vs. A2, B1 vs. B2) with the difference between the lines. The RNA levels from A1 and A2 WILL NOT be the same, because biological systems are far from deterministic. They might be very similar, but they will differ.

Because A and B would be on different plates (their environment), I would seed A1, A2, B1 and B2 on the same day in 4 distinct (but as similar as possible) dishes, grow them together under the same conditions to minimize external influence, and then collect from all 4 cultures at once, extract RNA…

Since the main cost is sequencing rather than growing cell lines, I would recommend preparing 4 independent replicates for A and for B (or any other cell lines you may be interested in) in ONE GO, and then freezing the samples or the RNA. Even better, have somebody hand you the lines labeled alfa, bravo, charlie, delta… (make sure they keep track of which is which in a safe place 😉) so that you are not biased while seeding, growing and manipulating the lines.
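The point of this exchange, that the same between-line difference looks overwhelming against technical noise and unremarkable against biological noise, can be shown with a toy calculation for a single gene. This is a plain Welch-style t statistic on made-up expression values, chosen only for illustration; DESeq itself models read counts with a negative binomial distribution and works quite differently.

```python
import statistics
from math import sqrt


def welch_t(a: list[float], b: list[float]) -> float:
    """Difference in means divided by its estimated standard error."""
    standard_error = sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hypothetical expression values; both scenarios have a mean shift of 0.2.
# Technical replicates: the same library re-sequenced, so almost no spread.
a_tech = [10.10, 10.11, 10.09, 10.10]
b_tech = [10.30, 10.31, 10.29, 10.30]

# Biological replicates: four independently grown cultures per line.
a_bio = [10.1, 9.8, 10.3, 9.9]
b_bio = [10.3, 10.0, 10.5, 10.1]

t_tech = welch_t(a_tech, b_tech)
t_bio = welch_t(a_bio, b_bio)
print(f"t using technical replicates:  {t_tech:.1f}")
print(f"t using biological replicates: {t_bio:.1f}")
```

With the technical replicates the statistic is very large in magnitude, so nearly every gene would be called differentially expressed; with the biological replicates the same shift is well within the culture-to-culture variation.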

There is a workshop for this: 

Object, functional and structured data: towards next generation kernel-based methods – ICML 2012 Workshop


Information Geometry is applying differential geometry to families of probability distributions, and so to statistical models. Information does however play two roles in it: Kullback-Leibler information, or relative entropy, features as a measure of divergence (not quite a metric, because it’s asymmetric), and Fisher information takes the role of curvature.
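The asymmetry of the Kullback-Leibler divergence is easy to check numerically. The sketch below uses two arbitrary discrete distributions that I picked for illustration, and a helper named `kl` that is not from the quoted text.

```python
from math import log


def kl(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl(p, q))  # divergence of q from p
print(kl(q, p))  # a different number: swapping the arguments matters
print(kl(p, p))  # zero when the two distributions coincide
```

Because `kl(p, q)` and `kl(q, p)` differ, the divergence cannot be a metric in the usual sense, which is the caveat made in the paragraph above.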

One very nice thing about information geometry is that it gives us very strong tools for proving results about statistical models, simply by considering them as well-behaved geometrical objects. Thus, for instance, it’s basically a tautology to say that a manifold is not changing much in the vicinity of points of low curvature, and changing greatly near points of high curvature. Stated more precisely, and then translated back into probabilistic language, this becomes the Cramer-Rao inequality, that the variance of a parameter estimator is at least the reciprocal of the Fisher information. As someone who likes differential geometry, and now is interested in statistics, I find this very pleasing.
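As a worked instance of the Cramér-Rao inequality, take n independent Bernoulli(p) observations. This example is my own addition, and it uses the usual statement of the bound for unbiased estimators.

```latex
\begin{align*}
\log f(x;p) &= x\log p + (1-x)\log(1-p), \qquad x\in\{0,1\},\\
\frac{\partial}{\partial p}\log f(x;p) &= \frac{x}{p}-\frac{1-x}{1-p} = \frac{x-p}{p(1-p)},\\
I(p) &= \mathbb{E}\!\left[\left(\frac{X-p}{p(1-p)}\right)^{2}\right]
      = \frac{p(1-p)}{p^{2}(1-p)^{2}} = \frac{1}{p(1-p)},\\
\operatorname{Var}(\hat p) &\ge \frac{1}{n\,I(p)} = \frac{p(1-p)}{n}.
\end{align*}
```

The sample mean has variance exactly p(1-p)/n, so it attains the bound. The Fisher information is smallest at p = 1/2, where the bound on the variance is largest, and it grows without limit as p approaches 0 or 1.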

As a physicist, I have always been somewhat bothered by the way statisticians seem to accept particular parametrizations of their models as obvious and natural, and build those parameterization into their procedures. In linear regression, for instance, it’s reasonably common for them to want to find models with only a few non-zero coefficients. This makes my thumbs prick, because it seems to me obvious that if I regressed on arbitrary linear combinations of my covariates, I have exactly the same information (provided the transformation is invertible), and so I’m really looking at exactly the same model — but in general I’m not going to have a small number of non-zero coefficients any more. In other words, I want to be able to do coordinate-free statistics. Since differential geometry lets me do coordinate-free physics, information geometry seems like an appealing way to do this. There are various information-geometric model selection criteria, which I want to know more about; I suspect, based purely on this disciplinary prejudice, that they will out-perform coordinate-dependent criteria.
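The complaint about coordinates can be made concrete with a two-covariate regression. In the sketch below (toy data and a hand-rolled 2x2 least-squares solver, both my own assumptions) the response depends on `x1` only, so the fit is sparse in the original covariates. After an invertible change of covariates the fitted values are unchanged, but neither coefficient is zero any more.

```python
def fit_two(c1, c2, y):
    """Least-squares coefficients for y ~ b1*c1 + b2*c2 (no intercept)."""
    s11 = sum(a * a for a in c1)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s22 = sum(b * b for b in c2)
    t1 = sum(a * v for a, v in zip(c1, y))
    t2 = sum(b * v for b, v in zip(c2, y))
    det = s11 * s22 - s12 * s12  # nonzero when the covariates are not collinear
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 1.0, 4.0, 3.0]
y = [3.0 * a for a in x1]  # depends on x1 only: sparse in (x1, x2)

# Invertible linear change of covariates: z1 = x1 + x2, z2 = x1 - x2.
z1 = [a + b for a, b in zip(x1, x2)]
z2 = [a - b for a, b in zip(x1, x2)]

beta = fit_two(x1, x2, y)
gamma = fit_two(z1, z2, y)

fit_x = [beta[0] * a + beta[1] * b for a, b in zip(x1, x2)]
fit_z = [gamma[0] * a + gamma[1] * b for a, b in zip(z1, z2)]

print("coefficients in (x1, x2):", beta)
print("coefficients in (z1, z2):", gamma)
print("same fitted values:", all(abs(u - v) < 1e-9 for u, v in zip(fit_x, fit_z)))
```

Since x1 = (z1 + z2) / 2, the single nonzero coefficient is split evenly between the two new covariates, so a criterion that counts nonzero coefficients gives different answers for what is the same model.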

[From Information Geometry]

The following is from the abstract of the tutorial given by Shun-ichi Amari (RIKEN Brain Science Institute) at Algebraic Statistics 2012.

We give fundamentals of information geometry and its applications. We often treat a family of probability distributions for understanding stochastic phenomena in the world. When such a family includes n free parameters, it is parameterized by a real vector of n dimensions. This is regarded as a manifold, where the parameters play a role of a coordinate system. A natural question arises: What is the geometrical structure to be introduced in such a manifold. Geometrical structure gives, for example, a distance measure between two distributions and a geodesic line connecting two distributions. The second question is to know how useful the geometry is for understanding properties of statistical inference and designing new algorithms for inference.

The first question is answered by the invariance principle such that the geometry should be invariant under coordinate transformations of random variables. More precisely, it should be invariant by using sufficient statistics as random variables. It is surprising that this simple criterion gives a unique geometrical structure, which consists of a Riemannian metric and a family of affine connections which define geodesics.

The unique Riemannian metric is proved to be the Fisher information matrix. The invariant affine connections are not limited to the Riemannian (Levi-Civita) connection but include the exponential and mixture connections, which are dually coupled with respect to the metric. The connections are dually flat in typical cases such as exponential and mixture families.

A dually flat manifold has a canonical divergence function, which in our case is the Kullback-Leibler divergence. This implies that the KL-divergence is induced from the geometrical flatness. Moreover, there exist two affine coordinate systems, one is the natural or canonical parameters and the other is the expectation parameters in the case of an exponential family. They are connected by the Legendre transformation. A generalized Pythagorean theorem holds with respect to the canonical divergence and the pair of dual geodesics. A generalized projection theorem is derived from it.
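For the simplest exponential family, the Bernoulli distribution, the statements in this paragraph can be checked numerically. In the sketch below the natural parameter is the log-odds θ, the expectation parameter is η = p, and `psi` and `phi` are the two convex potentials linked by the Legendre transformation. The function names are mine, and the example is an illustration rather than anything taken from the tutorial.

```python
from math import exp, log


def psi(theta: float) -> float:
    """Log-partition function of the Bernoulli family (potential in theta)."""
    return log(1.0 + exp(theta))


def phi(eta: float) -> float:
    """Legendre dual of psi: the negative entropy (potential in eta)."""
    return eta * log(eta) + (1.0 - eta) * log(1.0 - eta)


def to_theta(p: float) -> float:
    """Natural (canonical) parameter: the log-odds."""
    return log(p / (1.0 - p))


def to_eta(theta: float) -> float:
    """Expectation parameter, equal to the derivative of psi at theta."""
    return exp(theta) / (1.0 + exp(theta))


def canonical_divergence(p: float, q: float) -> float:
    """phi(eta_p) + psi(theta_q) - theta_q * eta_p."""
    return phi(p) + psi(to_theta(q)) - to_theta(q) * p


def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * log(p / q) + (1.0 - p) * log((1.0 - p) / (1.0 - q))


p, q = 0.3, 0.8
theta = to_theta(p)

# Legendre duality: psi(theta) + phi(eta) - theta * eta vanishes at matching points.
print(psi(theta) + phi(to_eta(theta)) - theta * to_eta(theta))

# The canonical divergence agrees with the Kullback-Leibler divergence.
print(canonical_divergence(p, q), kl_bernoulli(p, q))
```

The first printed value is zero up to rounding, and the last two agree up to rounding, which is the sense in which the KL divergence is the canonical divergence of this dually flat family.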

These properties are useful for elucidating and designing algorithms of statistical inference. They are used not only for evaluating the higher-order characteristics of estimation and testing, but also for elucidating machine learning, pattern recognition and computer vision. We further study the procedures of semi-parametric estimation together with estimating functions. It is also applied to the analysis of neuronal spiking data by decomposing the firing rates of neurons and their higher-order correlations orthogonally.

The dually flat structure is useful for optimization of various kinds. A manifold need not be connected with probability distributions. The invariance criterion does not work in such cases. However, a convex function plays a fundamental role in such a case, and we obtain a Riemannian metric together with two dually coupled flat affine connections, connected by the Legendre transformation. The Pythagorean theorem and projection theorem again play a fundamental role in applications of information geometry.

Dropbox is an efficient way to synchronize folders between various computers (Windows, Linux, Mac…). It is free up to 2 GB. I use it. If you want to try it and use the following link, we both get an extra 0.5 GB free…

In the morning, there was a talk given by Subhadeep Mukhopadhyay (Deep) about the “LP Comoment : Concepts and Applications—Finding Patterns in Large Data Sets”. It’s a pretty interesting talk. Two things I want to share here:

One is: “Noise has no pattern, whatever the noise is.” We are looking for patterns in data, so a mechanism that identifies the pattern correctly whatever the noise is would be a very good one. According to the speaker, at least, their proposed method achieves this, which is pretty cool.

The other is about the two cultures that Breiman (2001) made statisticians aware of:
1. Parametric modeling culture, pioneered by R.A. Fisher and Jerzy Neyman;
2. Algorithmic predictive culture, pioneered by machine learning research.
The speaker claimed that their method represents a third culture: nonparametric, quantile-based, information-theoretic modeling.

Thus based on the above two things, I am really interested in their method. The following is what I want to study:

Emanuel Parzen has written many papers on this topic, which is related to quantile theory.

【update】There is a paper about this topic written by the speaker.

  1. Social Network Analysis with R
  2. Publicly available large data sets for database research
  3. Around the blogs in 80 hours and Random Thoughts (some are about sequencing data)
  4. Change margins of a single page (latex)
  5. Bootstrap example
  6. Exciting News on Three Dimensional Manifolds
  7. Dr. Perou on Next Generation Sequencing Technology
  8. analyzing-complex-plant-genomes-with-the-newest-next-generation-dna-sequencing-techniques
  9. RNA-Seq Methods & March Twitter Roundup
  10. Introduction to Statistical Thought
  11. An R programmer looks at Julia
  12. The slides and video can help you get a flavor of the language Julia.
  13. Why and How People Use R
  14. Wang, Landau, Markov, and others…
  15. Linear mixed models in R
  16. Least Absolute Gradient Selector: Statistical Regression via Pseudo-Hard Thresholding
  17. Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
  18. C++ at Facebook
  19. Calling C++ from R
  20. C++ Renaissance
  21. Why haven’t we cured cancer yet? (Revisited): Personalized medicine versus evolution
  22. Getting ppt figures into LaTeX
  23. Latex Allergy Cured by knitr
  24. Melbourne R Users
  25. sixty two-minute r twotorials
