You are currently browsing the monthly archive for March 2012.

  1. GraphLab v2 @ Big Learning Workshop
  2. Basic Introduction to ggplot2
  3. Bayesian statistics made simple
  4. Courses in CS this spring
  5. A Numerical Tour of Signal Processing
  6. Reading List for Feb and March 2012 This is about the materials on concentration and geometric techniques used in compressed sensing.
  7. simulated annealing for Sudokus
  8. Djalil talks about A random walk on the unitary groupBrownian Motion and From seductive theory to concrete applications (which got Nuit Blanche thinking about writing this entryWhose heart doesn’t sink at the thought of Dirac being inferior to Theora ?)
  9. Lectures on Gaussian approximations with Malliavin calculus
  10. Useful R snippets
  11. Special Section: Minimax Shrinkage Estimation: A Tribute to Charles Stein
  12. Excellent Papers for 2011
  13. Creating a designer’s CV in LaTeX
  14. Is NGS the Answer? 
  15. Sequence Analysis Methods Not Just for Sequence Data
  16. DNA Variant Analysis of Complete Genomics’ Next-Generation Sequencing Data
  17. Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process
  18. Best Written Paper
  19. Online SVD/PCA resources
  20. Probabilistic Topic Models

Some important computing skills (in my opinion) that would really help an applied statistician with his/her work:

  • Knowledge of powerful statistical software : I recommend R since it is freeware and there is a huge group of people developing useful routines/packages. Of course, SAS and other languages would serve a similar purpose as well though they do not have the benefit of vast online communities (in terms of getting help and/or useful programs that have already been written and can be shared for free).
  • Fast compiled programming language : C or C++ , though FORTRAN is a reasonable option as well.
    It is harder to get things to work correctly in C or C++ versus a higher level language like R but even a poorly optimized piece of C software, for instance, is likely to be much faster than an R program.
  • Scripting languages for fast data/text manipulation and for automating the running of programs : Python appears to be the easiest to read/program in and most intuitive (to me at least). Perl is another powerful option. The ability to write scripts in Python or Perl is also a great way for you to automate any computer work such as a big simulation study where you want to run your algorithm for many different parameter values. Also, assuming you write your scripts well, every time you discover a mistake (this will happen often), you will only need to modify a small piece of code instead of going through the laborious process of editing dozens or hundreds of different files/programs etc. Organizing your work with scripts will also mean you will be less likely to make mistakes in the first place!
  • Mathematical writing with LaTeX : If you are plan to do a Ph.D. or write papers with significant mathematical content, LaTeX is definitely worth learning. It is mandatory in many departments for thesis writing but even if it is not, your mathematical writing looks much, much better (especially when compared to any Microsoft-based mathematical writing tools).
  • A flexible, intelligent text editor : emacs is my favorite though there are many other good editors. Emacs can be made to ‘understand’ R and latex easily (via ESS (emacs speaks statistics) and auctex respectively), which can be very useful. Such editors can save you enormous time and energy on a daily basis by helping you avoid errors by providing syntax highlighting, automatic parentheses matching and indentation, among other tools.

Note that my recommendations above share a common feature: they are all freeware!

Investing time into picking up these skills can pay off in a big way. The best (quickest and most fun) way to learn is by using them while working on assignments or project, not by trying to read books or tutorials; books and tutorials may be useful while you are trying to figure out how to solve the problems posed by your assignment or project.


Recently we are holding a journal club on RNA-Seq data analysis and this is a promising area to work on. Here I want to list some good papers for future reading:

  1. Julia Salzman, Hui Jiang and Wing Hung Wong (2011), Statistical Modeling of RNA-Seq Data. Statistical Science 2011, Vol. 26, No. 1, 62-83. doi: 10.1214/10-STS343. (We are done with this paper.)
  2. Turro E, Su S-Y, Goncalves A, Coin LJM, Richardson S and Lewin A (2011). Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biology. 12:R13. journal page. (RNA-seq produces sequence information that can be used for genotyping and phasing of haplotypes, thus permitting inferences to be made about the expression of each of the two parental haplotypes of a transcript in a diploid organism. )
  3. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation, Jingyi Jessica Li, Ci-Ren Jiang, James B. Brown, Haiyan Huang, and Peter J. Bickel. ( SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power in detecting more novel isoforms. )
  4. Dalpiaz, D., He, X., and Ma, P. (2012) Bias correction in RNA-Seq short-read counts using penalized regression , Statistics in Biosciences , DOI: 10.1007/s12561-012-9057-6. [Software]
  5. M. Nicolae and S. Mangul and I.I. Mandoiu and A. Zelikovsky, Estimation of alternative splicing isoform frequencies from RNA-Seq data, Algorithms for Molecular Biology 6:9, 2011, pdf preprint, publisher url, bibtex (In this paper it presents a novel expectation-maximization algorithm for inference of isoform- and
    gene-specific expression levels from RNA-Seq data.)
  6. There is a special issue for DNA-Seq, especially the paper: Statistical Issues in the Analysis of ChIP-Seq and RNA-Seq Data
  7. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks
  8. Sensitive Gene Fusion Detection Using Ambiguously Mapping RNA-Seq Read Pairs (Paired-end whole transcriptome sequencing provides evidence for fusion transcripts. However, due to the repetitiveness of the transcriptome, many reads have multiple high-quality mappings. Previous methods to find gene fusions either ignored these reads or required additional longer single reads. This can obscure up to 30% of fusions and unnecessarily discards much of the data. We present a method for using paired-end reads to find fusion transcripts without requiring unique mappings or additional single read sequencing.) Availability: A C++ and Python implementation of the method demonstrated in this paper is available at
  1. Stanford Unsupervised Feature Learning and Deep Learning Tutorial
  2. What does a compressive sensing approach bring to the table ?
  3. What is Mahalanobis distance?
  4. Large scale SVM (support vector machine)
  5. Abstractions
  6. Monkeying with Bayes’ theorem
  7. Coming to agreement on philosophy of statistics
  8. probit posterior mean
There will be a talk,  “Landscape of Random Functions in Many Dimensions via Random Matrix Theory”, next week given by  Antonio Auffinger from the University of Chicago.
Abstract: How many critical values a typical Morse function have on a high dimensional manifold? Could we say anything about the topology of its level sets? In this talk I will survey a joint work with Gerard Ben Arous and Jiri Cerny that addresses these questions in a particular but fundamental example. We investigate the landscape of a general Gaussian random smooth function on the N-dimensional sphere. These corresponds to Hamiltonians of well-known models of statistical physics, i.e spherical spin glasses. Using the classical Kac-Rice formula, this counting boils down to a problem in Random Matrix Theory. This allows us to show an interesting picture for the complexity of these random Hamiltonians, for the bottom of the energy landscape, and in particular a strong correlation between the index and the critical value. We also propose a new invariant for the possible transition between the so-called 1-step replica symmetry breaking and a Full Replica symmetry breaking scheme and show how the complexity function is related to the Parisi functional.
This topic is kind of a combination of my majors, differential geometry, probability and statistics. I am interested in this although I can imagine that it is hard.

Q: [From Mathangi Gopalakrishnan] Generating different sets of data, and within each data set, perform some calculations based on the fit of each data. The “for” loops does the job but it take days for simulations to end.

How to improve the efficiency of the coding? The R-help suggested vectorizing the operations, but how to do them for this problem.

The idea for this code is

for ( i in 1: 1000)
Generate a set of data.
For each data set, fit the dataset, perform calculations, keep them.
for(j in 1:1000)
Use the fitted values to generate 1000 sets of data,
Perform calculations and compare with the previous calculations.

A1[From Janice Brodsky]: Loops run slowly in R.  You want to take a look at the “apply” functions (lapply, sapply, etc.).  Here are some webpages with info about them:

A2[From Anne-Sophie Charest]: I also recommend the function “replicate”, which is very similar to lapply and such. For example:

replicate(40, rnorm(100)) – returns a matrix of 40 columns where each column is a different draw of 100 N(0,1) observations
Similarly, lapply(1:40, rnorm, 100) will return a list with 40 items, each with a vector of 100 N(0,1) draws.
I prefer the synthax of replicate when doing simulations like yours.

In your case you could do something like this:

replicate(1000, function1)
where function1 is the function which does all you want to do for one of the dataset

withing function1, at some point use replicate(1000, function2)
where function2 is the function which generates one dataset from the fitted values (to be passed as argument) and does the calculations you want in the second loop

If this is still too slow, you can also look into parallel computing. Some info at

A3[From Patrick Rhodes]: The suggestions listed thus far are excellent and should be used (really – you should be able to use some apply functions here).  Whether you use them or stay with your for loops, you almost certainly will have performance issues when you nest 1000 sets of data within another loop of 1000 sets of data – that’s 1 million sets of data that the computer has to track (internally it will have to keep memory pointers, allocate stacks and several other tasks).  Additionally, the amount of internal memory you have – no matter how great – will be insufficient, thus forcing your computer to use your hard drive to compensate, which slows things down tremendously.

There are a couple of routes you might experiment with:
1) *If* you choose to stay with nested loops, run garbage collection after each outer loop.  See here for more documentation:

Once you are done with an outer loop run (i.e. on that set of data, it has then run the inner loop 1000 times), remove that set of data and run garbage collection.  By doing that, it *should* free up all of the stack space and pointers and whatever other resources the computer was using to track it (hopefully).

But again, nested loops are awful – the amount of time to allocate, re-allocate, etc. is tremendous.

I would not recommend this approach, however – ever.

2) Consider using the ‘by’ function.  In this event, you will run the outer loop without any inner loop, creating the 1000 outer data sets.  Store those data sets in a vector.  Then use the by function on that structure to run your inner 1000 calculations.  Again, consider using garbage collection as you do.  Without seeing the code, it’s difficult to know how this would work exactly.

Using this method might avoid some re-allocation issues, but I’m not sure without experimenting.

Again, not recommended.

3) Use apply functions.  Honestly, however, underneath the hood these are written in C (as are the for loops) and still will have looping to manage.  But, you have the advantage of these being written by professional engineers which would have optimized their performance.  And it saves you from re-inventing the wheel.


There are almost certainly some other ways to optimize your code that we can’t see here.  Are you printing out a bunch of stuff on each run?  That will slow it down significantly.  Perhaps consider storing results to a log file that you append to on each run.  Your code may also have some optimization issues within the loops – perhaps you are performing your calculations in a way that take significant resources without realizing it.

To see how your changes would work, drop your outer loop to 10 data sets as well as your inner data sets (10).  Time how long each run takes as you optimize and you will eventually get a fast iteration.  Once you have it optimized, then go back to your original sizes.

But use the apply functions.

A4[From John Dawson]: The suggestions that have been made thus far are good, but two thoughts:

1) The superiority of the apply series over loops is not what it used to be. More recent versions of R may not exhibit order of magntiude differences in runtime between the two, depending on what kind of computations are being done.

2) If you really want to improve speed, consider outsourcing the computation to a ‘lower-level’ language such as C or C++. An excellent package for doing so in R is Rcpp. With Rcpp, you don’t have to worry about assigning memory or knowing too much underlying C++, since most of the command structure is built to be R-like. For a day’s worth of investment in learning how to use the package, you might see two orders of manitude improvement in runtime – I did in one of my projects.

Response[From Mathangi Gopalakrishnan]: Thank you all for the quick and clear responses.

Use of “replicate” solves some of my problems, and as suggested by Patrick I am in the process of optimizing my runs.

Hope to invest some time learning about Rcpp package.

Blog Stats

  • 103,521 hits

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 523 other followers

Twitter Updates