Q: [From Mathangi Gopalakrishnan] I am generating different sets of data, and within each data set I perform some calculations based on the fit of that data. The "for" loops do the job, but it takes days for the simulations to finish.

How can I improve the efficiency of the code? The R-help list suggested vectorizing the operations, but how do I do that for this problem?

The idea for this code is:

for (i in 1:1000)
{
  Generate a set of data.
  For each data set, fit the dataset, perform calculations, keep them.
  for (j in 1:1000)
  {
    Use the fitted values to generate 1000 sets of data,
    Perform calculations and compare with the previous calculations.
  }
}

A1[From Janice Brodsky]: Loops run slowly in R.  You want to take a look at the “apply” functions (lapply, sapply, etc.).  Here are some webpages with info about them:

http://www.biostat.jhsph.edu/~rpeng/biostat776/lecture4.pdf
http://lmf-ramblings.blogspot.com/2007/07/using-lapply-in-r.html
http://www.u.arizona.edu/~hirano/520_2008/R3.pdf
http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/

A2[From Anne-Sophie Charest]: I also recommend the function “replicate”, which is very similar to lapply and such. For example:

replicate(40, rnorm(100)) returns a matrix with 40 columns, where each column is a different draw of 100 N(0,1) observations.
Similarly, lapply(1:40, function(i) rnorm(100)) returns a list with 40 items, each a vector of 100 N(0,1) draws. (Note that lapply(1:40, rnorm, 100) would not do this: it would pass the index as n and 100 as the mean.)
I prefer the syntax of replicate when doing simulations like yours.

In your case you could do something like this:

replicate(1000, function1())
where function1 is the function that does everything you want to do for one dataset.

Within function1, at some point use replicate(1000, function2())
where function2 is the function that generates one dataset from the fitted values (passed as an argument) and does the calculations you want in the second loop. A sketch of this structure is given below.
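
For instance, a minimal sketch of that structure might look like the following, where generate_data(), fit_model(), simulate_from_fit(), and compare() are hypothetical placeholders for your own simulation, fitting, and comparison code:

# function2: one inner simulation from a given fit (placeholder logic)
function2 <- function(fit) {
  sim <- simulate_from_fit(fit)      # generate one dataset from the fitted values
  compare(sim, fit)                  # perform the inner calculations
}

# function1: everything done for one outer dataset
function1 <- function(n_inner = 1000) {
  dat <- generate_data()                                 # generate one outer dataset
  fit <- fit_model(dat)                                  # fit it and keep the results
  replicate(n_inner, function2(fit), simplify = FALSE)   # the inner simulations
}

results <- replicate(1000, function1(), simplify = FALSE)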

If this is still too slow, you can also look into parallel computing. Some info at http://cran.r-project.org/web/views/HighPerformanceComputing.html.
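
As one possible illustration (not from the original thread), the outer loop could be spread over several cores with the "parallel" package that ships with recent versions of R, reusing the hypothetical function1 above:

library(parallel)
# mclapply() is a parallel drop-in for lapply() on Unix-like systems;
# on Windows, use makeCluster() together with parLapply() instead.
results <- mclapply(1:1000, function(i) function1(), mc.cores = 4)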

A3[From Patrick Rhodes]: The suggestions listed thus far are excellent and should be used (really, you should be able to use some apply functions here). Whether you use them or stay with your for loops, you will almost certainly have performance issues when you nest 1000 data sets within another loop of 1000 data sets: that is one million data sets for the computer to track (internally it has to keep memory pointers, allocate stack space, and handle several other bookkeeping tasks). Additionally, no matter how much internal memory you have, it will be insufficient, forcing your computer to fall back on the hard drive to compensate, which slows things down tremendously.

There are a couple of routes you might experiment with:
1) *If* you choose to stay with nested loops, run garbage collection after each iteration of the outer loop. See here for more documentation: http://stat.ethz.ch/R-manual/R-devel/library/base/html/gc.html

Once you are done with an outer-loop iteration (i.e. the inner loop has run 1000 times on that set of data), remove that set of data and run garbage collection; a small sketch of this pattern is given at the end of this item. By doing that, you *should* free up the stack space, pointers, and whatever other resources the computer was using to track it (hopefully).

But again, nested loops are awful – the amount of time to allocate, re-allocate, etc. is tremendous.

I would not recommend this approach, however – ever.
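
For illustration only, the pattern in option 1 might look roughly like this (generate_data() and fit_model() are hypothetical placeholders, and the inner loop is omitted):

for (i in 1:1000) {
  dat <- generate_data()     # hypothetical data-generating step
  fit <- fit_model(dat)      # hypothetical model fit
  # ... inner loop over 1000 simulated datasets goes here ...
  rm(dat, fit)               # drop objects that are no longer needed
  gc()                       # ask R to reclaim the freed memory
}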

2) Consider using the 'by' function. In this case, you would run the outer loop without any inner loop, creating the 1000 outer data sets. Store those data sets in a list. Then use the by function on that structure to run your inner 1000 calculations; a rough sketch is given at the end of this item. Again, consider using garbage collection as you do. Without seeing the code, it's difficult to know exactly how this would work.

Using this method might avoid some re-allocation issues, but I’m not sure without experimenting.

Again, not recommended.
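
A rough sketch in that spirit, though using a list and lapply() rather than by() (by() is designed for splitting a single data frame by a factor); generate_data() and inner_calcs() are hypothetical placeholders:

datasets <- lapply(1:1000, function(i) generate_data())   # all 1000 outer datasets
results  <- lapply(datasets, inner_calcs)                 # inner_calcs() runs the 1000 inner calculations for one dataset
gc()                                                      # optionally reclaim memory between stages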

3) Use apply functions. Honestly, however, under the hood these are written in C (as are for loops) and there is still looping to manage. But you have the advantage that they were written by professional engineers who have optimized their performance, and it saves you from re-inventing the wheel.

=================

There are almost certainly other ways to optimize your code that we can't see here. Are you printing out a bunch of stuff on each run? That will slow it down significantly. Perhaps consider storing results in a log file that you append to on each run. Your code may also have optimization issues within the loops: perhaps you are performing your calculations in a way that takes significant resources without realizing it.
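
For example, appending progress or results to a text file rather than printing to the console might look like this (sim.log is just a hypothetical file name):

cat("iteration", i, "done\n", file = "sim.log", append = TRUE)   # inside the outer loop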

To see how your changes work, drop both your outer loop and your inner loop to 10 data sets each. Time how long each run takes as you optimize, and you will eventually get a fast iteration. Once you have it optimized, go back to your original sizes.
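
For instance, system.time() can be used to time the reduced version while experimenting (function1 here refers to the hypothetical wrapper sketched earlier):

# Time a reduced run: 10 outer datasets, 10 inner simulations each.
system.time(
  small <- replicate(10, function1(n_inner = 10), simplify = FALSE)
)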

But use the apply functions.

A4[From John Dawson]: The suggestions that have been made thus far are good, but two thoughts:

1) The superiority of the apply series over loops is not what it used to be. More recent versions of R may not exhibit order-of-magnitude differences in runtime between the two, depending on what kind of computations are being done.

2) If you really want to improve speed, consider outsourcing the computation to a 'lower-level' language such as C or C++. An excellent package for doing so in R is Rcpp. With Rcpp, you don't have to worry about allocating memory or knowing too much of the underlying C++, since most of the command structure is built to be R-like. For a day's worth of investment in learning how to use the package, you might see two orders of magnitude improvement in runtime; I did in one of my projects.
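
As a flavour of what Rcpp looks like (a minimal, generic example, not tied to the poster's problem; it requires the Rcpp package and a working C++ compiler):

library(Rcpp)
# Compile a small C++ function directly from R; cppFunction() handles the glue.
cppFunction('
double sumSquares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];
  }
  return total;
}')
sumSquares(rnorm(100))   # call it like any ordinary R function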

Response[From Mathangi Gopalakrishnan]: Thank you all for the quick and clear responses.

Use of “replicate” solves some of my problems, and as suggested by Patrick I am in the process of optimizing my runs.

I hope to invest some time learning about the Rcpp package.
