 Some R Resources for GLMs
 Statistical Data Analysis in the Search and Rescue for a Lost-Contact Flight
 The gap between data mining and predictive models
 Data Mining, machine learning and statistics.
 useR! 2014 is underway with 16 tutorials
 What is Scalable Machine Learning?
 rlist: Working with Non-Relational Data in R Using Lists
 The perfect candidate
 The Leek group guide to giving talks
 38 Seminal Articles Every Data Scientist Should Read
 Deep Learning – important resources for learning and understanding
 Twenty rules for good graphics + Ten Simple Rules for Better Figures
 Git Cookbook
 Making Your Code Citable
 biblatex for statisticians
 Do your “data janitor work” like a boss with dplyr
 Interview with Nick Chamandy, statistician at Google
 You and Your Research + video
 Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
 A Survival Guide to Starting and Finishing a PhD
 Six Rules For Wearing Suits For Beginners
 Why I Created C++
 More advice to scientists on blogging
 Software engineering practices for graduate students
 Statistics Matter
 What statistics should do about big data: problem forward not solution backward
 How signals, geometry, and topology are influencing data science
 The Bounded Gaps Between Primes Theorem has been proved
 A non-comprehensive list of awesome things other people did this year.
 Jake VanderPlas writes about the Big Data Brain Drain from academia.
 Tomorrow’s Professor Postings
 Best Practices for Scientific Computing
 Some tips for new research-oriented grad students
 3 Reasons Every Grad Student Should Learn WordPress
 How to Lie With Statistics (in the Age of Big Data)
 The Geometric View on Sparse Recovery
 The Mathematical Shape of Things to Come
 A Guide to Python Frameworks for Hadoop
 Statistics, geometry and computer science.
 How to Collaborate On GitHub
 Step by step to build my first R Hadoop System
 Open Sourcing a Python Project the Right Way
 Data Science MD July Recap: Python and R Meetup
 Recent Thoughts on Git
 10 Reasons Python Rocks for Research (And a Few Reasons it Doesn’t)
 Effective Presentations – Part 2 – Preparing Conference Presentations
 Doing Statistical Research
 How to Do Statistical Research
 Learning new skills
 How to Stand Out When Applying for An Academic Job
 Maturing from student to researcher
 False discovery rate regression (cc NSA’s PRISM)
 Job Hunting Advice, Pt. 3: Networking
 Getting Started with Git
This post is for JSM2013. I will put useful links here and I will update this post during the meeting.
 Big Data Sessions at JSM
 Nate Silver addresses assembled statisticians at this year’s JSM
 Data scientist is just a sexed up word for statistician
What I learned from this meeting (its keywords):
Big Data, Bayesian, Statistical Efficiency vs Computational Efficiency
I was in Montreal from Aug 1st to Aug 8th for JSM2013 and traveling.
(Traveling in Quebec: Olympic Stadium; Underground City; Quebec City; Montreal City; Basilique Notre-Dame; Chinatown)
(Talks at JSM2013: Jianqing Fan; Jim Berger; Nate Silver; Tony Cai; Han Liu; Two Statistical Peters)
(My Presentation at JSM2013)
The following is the list of talks I attended:
 Aug 4th
 2:05 PM Analyzing Large Data with R and MonetDB — Thomas Lumley, University of Auckland
 2:25 PM Empirical Likelihood and U-Statistics in Survival Analysis — Zhigang Zhang, Memorial Sloan-Kettering Cancer Center ; Yichuan Zhao, Georgia State University
 2:50 PM Joint Unified Confidence Region for the Parameters of Branching Processes with Immigration — Pin Ren ; Anand Vidyashankar, George Mason University
 3:05 PM Time-Varying Additive Models for Longitudinal Data — Xiaoke Zhang, University of California Davis ; Byeong U. Park, Seoul National University ; Jane-Ling Wang, UC Davis
 3:20 PM Leveraging as a Paradigm for Statistically Informed Large-Scale Computation — Michael W. Mahoney, Stanford University
 4:05 PM Joint Estimation of Multiple Dependent Gaussian Graphical Models — Yuying Xie, The University of North Carolina at Chapel Hill ; Yufeng Liu, The University of North Carolina ; William Valdar, UNC-CH Genetics
 4:30 PM Computational Strategies in Regression of Big Data — Ping Ma, University of Illinois at Urbana-Champaign
 4:55 PM Programming with Big Data in R — George Ostrouchov, Oak Ridge National Laboratory ; Wei-Chen Chen, Oak Ridge National Laboratory ; Drew Schmidt, University of Tennessee ; Pragneshkumar Patel, University of Tennessee
 5:20 PM Inference and Optimalities in Estimation of Gaussian Graphical Model — Harrison Zhou, Yale University
 Aug 5th
 99 Mon, 8/5/2013, 8:30 AM – 10:20 AM CC710a
 Introductory Overview Lecture: Twenty Years of Gibbs Sampling/MCMC — Other Special Presentation
 8:35 AM Gibbs Sampling and Markov Chain Monte Carlo: A Modeler’s Perspective — Alan E. Gelfand, Duke University
 9:25 AM The Theoretical Underpinnings of MCMC — Jeffrey S. Rosenthal, University of Toronto
 10:15 AM Floor Discussion
 166 * Mon, 8/5/2013, 10:30 AM – 12:20 PM CC520c
 Statistical Learning and Data Mining: Winners of Student Paper Competition — Topic Contributed Papers
 10:35 AM Multicategory Angle-Based Large Margin Classification — Chong Zhang, UNC-CH ; Yufeng Liu, The University of North Carolina
 10:55 AM Discrepancy Pursuit: A Nonparametric Framework for High-Dimensional Variable Selection — Li Liu, Carnegie Mellon University ; Kathryn Roeder, CMU ; Han Liu, Princeton University
 11:15 AM PenPC: A Two-Step Approach to Estimate the Skeletons of High-Dimensional Directed Acyclic Graphs — Min Jin Ha ; Wei Sun, UNC Chapel Hill ; Jichun Xie, Temple University
 11:35 AM An Underdetermined Peaceman-Rachford Splitting Algorithm with Application to Highly Nonsmooth Sparse Learning Problems — Zhaoran Wang, Princeton University ; Han Liu, Princeton University ; Xiaoming Yuan, Hong Kong Baptist University
 11:55 AM Latent Supervised Learning — Susan Wei, UNC
 12:15 PM Floor Discussion
 220 Mon, 8/5/2013, 2:00 PM – 3:50 PM CC710b
 2:05 PM Statistics Meets Computation: Efficiency Trade-Offs in High Dimensions — Martin Wainwright, UC Berkeley
 3:35 PM Floor Discussion
 267 Mon, 8/5/2013, 4:00 PM – 5:50 PM CC517ab
 4:05 PM JSM Welcomes Nate Silver — Nate Silver, FiveThirtyEight.com
 209305 Mon, 8/5/2013, 6:00 PM – 8:00 PM IMaisonneuve, JSM Student Mixer, Sponsored by Pfizer — Other Cmte/Business, ASA , Pfizer, Inc.
 268 Mon, 8/5/2013, 8:00 PM – 9:30 PM CC517ab
 8:05 PM Ars Conjectandi: 300 Years Later — Hans Rudolf Künsch, Seminar für Statistik, ETH Zurich
 Aug 6th
 280 * Tue, 8/6/2013, 8:30 AM – 10:20 AM CC510a
 Statistical Inference for Large Matrices — Invited Papers
 8:35 AM Conditional Sparsity in Large Covariance Matrix Estimation — Jianqing Fan, Princeton University ; Yuan Liao, University of Maryland ; Martina Mincheva, Princeton University
 9:05 AM Multivariate Regression with Calibration — Lie Wang, Massachusetts Institute of Technology ; Han Liu, Princeton University ; Tuo Zhao, Johns Hopkins University
 9:35 AM Principal Component Analysis for High-Dimensional Non-Gaussian Data — Fang Han, Johns Hopkins University ; Han Liu, Princeton University
 10:05 AM Floor Discussion
 325 * ! Tue, 8/6/2013, 10:30 AM – 12:20 PM CC520b
 Modern Nonparametric and High-Dimensional Statistics — Invited Papers
 10:35 AM Simple Tiered Classifiers — Peter Gavin Hall, University of Melbourne ; Jinghao Xue, University College London ; Yingcun Xia, National University of Singapore
 11:05 AM Sparse PCA: Optimal Rates and Adaptive Estimation — Tony Cai, University of Pennsylvania
 11:35 AM Statistical Inference in Compound Functional Models — Alexandre Tsybakov, CREST-ENSAE
 12:05 PM Floor Discussion
 392 Tue, 8/6/2013, 2:00 PM – 3:50 PM CC710a
 Introductory Overview Lecture: Big Data — Other Special Presentation
 2:05 PM The Relative Size of Big Data — Bin Yu, Univ of California at Berkeley
 2:55 PM Divide and Recombine (D&R) with RHIPE for Large Complex Data — William S. Cleveland, Purdue University
 3:45 PM Floor Discussion
 445 Tue, 8/6/2013, 4:00 PM – 5:50 PM CC517ab
 Deming Lecture — Invited Papers
 4:05 PM Industrial Statistics: Research vs. Practice — Vijay Nair, University of Michigan
 Aug 7th
 10:35 AM Bayesian and Frequentist Issues in Large-Scale Inference — Bradley Efron, Stanford University
 11:20 AM Criteria for Bayesian Model Choice with Application to Variable Selection — Jim Berger, Duke University ; Susie Bayarri, University of Valencia ; Anabel Forte, Universitat Jaume I ; Gonzalo Garcia-Donato, Universidad de Castilla-La Mancha
 571 Wed, 8/7/2013, 2:00 PM – 3:50 PM CC511c
 Statistical Methods for High-Dimensional Sequence Data — Invited Papers
 2:05 PM Linkage Disequilibrium in Sequencing Data: A Blessing or a Curse? — Alkes L. Price, Harvard School of Public Health
 2:25 PM Statistical Prioritization of Sequence Variants — Lisa Joanna Strug, The Hospital for Sick Children and University of Toronto ; Weili Li, The Hospital for Sick Children and University of Toronto
 2:45 PM On Some Statistical Issues in Analyzing Whole-Genome Sequencing Data — Dan Liviu Nicolae, The University of Chicago
 3:05 PM Statistical Methods for Studying Rare Variant Effects in Next-Generation Sequencing Association Studies — Xihong Lin, Harvard School of Public Health
 3:25 PM Adjustment for Population Stratification in Association Analysis of Rare Variants — Wei Pan, University of Minnesota ; Yiwei Zhang, University of Minnesota ; Binghui Liu, University of Minnesota ; Xiaotong Shen, University of Minnesota
 3:45 PM Floor Discussion
 612 Wed, 8/7/2013, 4:00 PM – 5:50 PM CC517ab
 COPSS Awards and Fisher Lecture — Invited Papers
 4:05 PM From Fisher to Big Data: Continuities and Discontinuities — Peter Bickel, University of California – Berkeley
 5:45 PM Floor Discussion
 Aug 8th
 621 Thu, 8/8/2013, 8:30 AM – 10:20 AM CC516d
 Recent Advances in Bayesian Computation — Invited Papers
 8:35 AM An Adaptive Exchange Algorithm for Sampling from Distribution with Intractable Normalizing Constants — Faming Liang, Texas A&M University
 9:00 AM Efficiency of Markov Chain Monte Carlo for Bayesian Computation — Dawn B Woodard, Cornell University
 9:25 AM Scalable Inference for Hierarchical Topic Models — John W. Paisley, University of California, Berkeley
 9:50 AM Augmented Particle Filters — Yuguo Chen, University of Illinois at Urbana-Champaign
 10:15 AM Floor Discussion
 661 * ! Thu, 8/8/2013, 10:30 AM – 12:20 PM CC710b
 Patterns and Extremes: Developments and Review of Spatial Data Analysis — Invited Papers
 10:35 AM Multivariate Max-Stable Spatial Processes — Marc G. Genton, KAUST ; Simone Padoan, Bocconi University of Milan ; Huiyan Sang, TAMU
 10:55 AM Approximate Bayesian Computing for Spatial Extremes — Robert James Erhardt, Wake Forest University ; Richard Smith, The University of North Carolina at Chapel Hill
This is from the post “Connected objects and a reconstruction theorem”:
A common theme in mathematics is to replace the study of an object with the study of some category that can be built from that object. For example, we can
 replace the study of a group $G$ with the study of its category $\mathrm{Rep}(G)$ of linear representations,
 replace the study of a ring with the study of its category of modules,
 replace the study of a topological space with the study of its category of sheaves,
and so forth. A general question to ask about this setup is whether or to what extent we can recover the original object from the category. For example, if $G$ is a finite group, then as a category, the only data that can be recovered from $\mathrm{Rep}(G)$ is the number of conjugacy classes of $G$, which is not much information about $G$. We get considerably more data if we also have the monoidal structure on $\mathrm{Rep}(G)$, which gives us the character table of $G$ (but contains a little more data than that, e.g. in the associators), but this is still not a complete invariant of $G$. It turns out that to recover $G$ we need the symmetric monoidal structure on $\mathrm{Rep}(G)$; this is a simple form of Tannaka reconstruction.
The evidence in large medical data sets is direct, but indirect as well – and there is just too much of the indirect evidence to ignore. If you want to prove that your drug of choice is good or bad your evidence is not just how it does, it is also how all the other drugs do. And that is a crucial point that doesn’t fit easily into the frequentist world, which is a world of direct evidence (very often, but not always); and it also doesn’t fit extremely well into the formal Bayesian world, because the indirect information isn’t actually the prior distribution, it is evidence of a prior distribution, and that in some sense is not as neat. Neatness counts in science. Things that people can understand and really manipulate are terribly important.
“So I have been very interested in massive data sets not because they are massive but because they seem to offer opportunities to think about statistical inferences from the ground up again.”
The Fisher–Pearson–Neyman paradigm dating from around 1900 was, he says, “like a light being switched on. But it is so beautiful and so almost airtight that it is pretty hard to improve on; and that means that it is very hard to rethink what is good or bad about statistics.
“Fisher of course had this wonderful view of how you do what I would call small-sample inference. You tend to get very smart people trying to improve on this kind of area, but you really cannot do that very well because there is a limited amount that is available to work on. But now suddenly there are these problems that have a different flavour. It really is quite different doing ten thousand estimates at once. There is evidence always lurking around the edges. It is hard to say where that evidence is, but it’s there. And if you ignore it you are just not going to do a good job.
“Another way to say it is that a Bayesian prior is an assumption of an infinite amount of past relevant experience. It is an incredibly powerful assumption, and often a very useful assumption for moving forward with complicated data analysis. But you cannot forget that you have just made up a whole bunch of data.
“So of course the trick for Bayesians is to do their ‘making up’ part without really influencing the answer too much. And that is really tricky in these higher-dimensional problems.”
From http://wwwstat.stanford.edu/~ckirby/brad/other/2010Significance.pdf
 Machine Learning, Big Data, Deep Learning, Data Mining, Statistics, Decision & Risk Analysis, Probability, Fuzzy Logic FAQ
 A Funny Thing Happened on the Way to Academia . . .
 Advice for students on the academic job market (2013 edition)
 Perspective: “Why C++ Is Not ‘Back’”
 Is Fourier analysis a special case of representation theory or an analogue?
 The Beauty of Bioconductor
 The State of Statistics in Julia
 Open Source Misfeasance
 Book review: The Signal and The Noise
 Should the Cox Proportional Hazards model get the Nobel Prize in Medicine?
 The most influential data scientists on Twitter
 Here is an interesting review of Nate Silver’s book. The interesting thing about the review is that it doesn’t criticize the statistical content, but criticizes the belief that people only use data analysis for good. This is an interesting theme we’ve seen before. Gelman also reviews the review.—–Simply Statistics
 Video : “Matrices and their singular values” (1976)
 Beyond Computation: The P vs NP Problem – Michael Sipser—This talk is arguably the very best introduction to computational complexity.
 What are some of your personal guidelines for writing good, clear code?
 How do you explain Machine learning and Data Mining to non CS people?
 Suggested New Year’s resolution: start a blog: A blog forces you to articulate your thoughts rather than having vague feelings about issues; You also get much more comfortable with writing, because you’re doing it rather than thinking about doing it; If other people read your blog you get to hear what they think too. You learn a lot that way. Set aside time for your blog every day. Keep notes for yourself on bloggy subjects (write a one-line gmail to yourself with the subject “blog ideas”).
 Tips on job market interviews
 The age of the essay
These days I have been working with computation and programming languages. I want to share something with you here.
 You cannot expect C++ to magically make your code faster. If speed is of concern, you need profiling to find the bottleneck instead of blind guessing.——Yan Zhou. So we need to learn how to profile a program in R, Matlab, C++, and Python.
 When something complicated does not work, I generally try to restart with something simpler, and make sure it works.——Dirk Eddelbuettel.
 If you’re calling your function thousands or millions of times, then it might pay to closely examine your memory allocation strategies and figure out what’s temporary.——Christian Gunning.
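The two points above about measuring before guessing and about temporary allocations can be tried out with a very small timing harness. What follows is only a sketch in plain C++: `mean_time_ms`, `sum_squares_fresh`, and `sum_squares_reuse` are hypothetical names invented for this illustration, and a wall-clock timer is a crude stand-in for a real profiler.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Average wall-clock time of one call to f, in milliseconds, over `reps` calls.
template <typename F>
double mean_time_ms(F&& f, int reps) {
    assert(reps > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) f();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / reps;
}

// Hypothetical workload: sum of i^2 for i = 0..n-1, staged through a scratch buffer.
// Version A allocates a fresh temporary vector on every call.
double sum_squares_fresh(std::size_t n) {
    std::vector<double> tmp(n);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = static_cast<double>(i) * i;
    double s = 0.0;
    for (double v : tmp) s += v;
    return s;
}

// Version B reuses a caller-owned buffer, so repeated calls stop allocating
// once the buffer's capacity is large enough.
double sum_squares_reuse(std::size_t n, std::vector<double>& tmp) {
    tmp.resize(n);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = static_cast<double>(i) * i;
    double s = 0.0;
    for (double v : tmp) s += v;
    return s;
}
```

To compare the two versions, call `mean_time_ms` on a lambda wrapping each one with the same `n` and a few thousand repetitions, and store each result in a `volatile` variable so the compiler cannot discard the work. Whether the reuse version actually wins depends on the allocator and on `n`, which is exactly why it has to be measured.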

No, your main issue is not thinking about the computation. As soon as you write something like

 arma::vec betahat = arma::inv(Inv)*arma::trans(D)*W*y;

you are in theory land which has very little relationship to practical numerical linear algebra. If you want to perform linear algebra calculations like weighted least squares you should first take a bit of time to learn about numerical linear algebra as opposed to theoretical linear algebra. They are very different disciplines. In theoretical linear algebra you write the solution to a system of linear equations as above, using the inverse of the system matrix. The first rule of numerical linear algebra is that you never calculate the inverse of a matrix, unless you only plan to do toy examples. You mentioned sizes of 4000 by 4000 which means that the method you have chosen is doing thousands of times more work than necessary (hint: how do you think that the inverse of a matrix is calculated in practice? – ans: by solving n systems of equations, which you are doing here when you could be solving only one). Dirk and I wrote about 7 different methods of solving least squares problems in our vignette on RcppEigen. None of those methods involve taking the inverse of an n by n matrix. R and Rcpp and whatever other programming technologies come along will never be a “special sauce” that takes the place of thinking about what you are trying to do in a computation.——Douglas Bates.

 // [[Rcpp::depends(RcppEigen)]]
 #include <RcppEigen.h>

 typedef Eigen::MatrixXd Mat;
 typedef Eigen::Map<Mat> MMat;
 typedef Eigen::HouseholderQR<Mat> QR;
 typedef Eigen::VectorXd Vec;
 typedef Eigen::Map<Vec> MVec;

 // [[Rcpp::export]]
 Rcpp::List wtls(const MMat X, const MVec y, const MVec sqrtwts) {
     // Scale the rows of X and y by the square-root weights, then solve the
     // least-squares problem with a Householder QR decomposition (no inverse).
     return Rcpp::List::create(Rcpp::Named("betahat") =
         QR(sqrtwts.asDiagonal()*X).solve(sqrtwts.asDiagonal()*y));
 }

 Repeatedly calling an R function is probably not the smartest thing to do in an otherwise complex and hard to decipher program.—Dirk Eddelbuettel.
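To make Bates's point about never forming the inverse concrete without needing R or Eigen, here is a minimal sketch of weighted least squares in standard C++. It is not the RcppEigen code from the quote: it swaps in a modified Gram-Schmidt QR factorization (Eigen's `HouseholderQR` is the more robust choice), and the name `wls_qr` and the column-wise matrix layout are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Weighted least squares: minimise sum_i w_i * (y_i - x_i' beta)^2.
// cols[j] is the j-th column of X; sqrtw holds the square roots of the weights.
// Arguments are taken by value because they are overwritten during the solve.
Vec wls_qr(std::vector<Vec> cols, Vec y, const Vec& sqrtw) {
    const std::size_t p = cols.size(), n = y.size();
    assert(p > 0 && n >= p && sqrtw.size() == n);
    // Scale each row of X and y by the square-root weight.
    for (std::size_t j = 0; j < p; ++j) {
        assert(cols[j].size() == n);
        for (std::size_t i = 0; i < n; ++i) cols[j][i] *= sqrtw[i];
    }
    for (std::size_t i = 0; i < n; ++i) y[i] *= sqrtw[i];
    // Modified Gram-Schmidt: cols becomes Q (orthonormal columns), R is upper triangular.
    std::vector<Vec> R(p, Vec(p, 0.0));
    for (std::size_t j = 0; j < p; ++j) {
        for (std::size_t i = 0; i < j; ++i) {
            R[i][j] = dot(cols[i], cols[j]);
            for (std::size_t k = 0; k < n; ++k) cols[j][k] -= R[i][j] * cols[i][k];
        }
        R[j][j] = std::sqrt(dot(cols[j], cols[j]));
        assert(R[j][j] > 0.0);  // X must have full column rank
        for (std::size_t k = 0; k < n; ++k) cols[j][k] /= R[j][j];
    }
    // Back substitution on R * beta = Q' y: one triangular solve, no inverse.
    Vec beta(p);
    for (std::size_t j = p; j-- > 0;) {
        double s = dot(cols[j], y);
        for (std::size_t k = j + 1; k < p; ++k) s -= R[j][k] * beta[k];
        beta[j] = s / R[j][j];
    }
    return beta;
}
```

For example, with an intercept column `{1, 1, 1}`, a covariate column `{1, 2, 3}`, and `y = {3, 5, 7}`, the fit is exact, so any positive weights should recover an intercept of 1 and a slope of 2 up to rounding.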
 Computers don’t do random things, unlike human beings. Something worked once, is very likely to work whatever times you repeat it as long as the input is the same (unless the function has side effect). So repeating it 1000 times is the same as once.——Yan Zhou

Yan Zhou: Here are a few things people usually do before asking in a mailing list (not just Rcpp list, any such lists like R-help, StackOverflow, etc).
1. I write a program, it crashes,
2. I find out the site of crash
3. I make the program simpler and simpler until it is minimal and the crash is now reproducible.
4. I still cannot figure out what is wrong with that four or five lines that crash the minimal example
5. I ask
 It does not matter how stupid your questions are. We all asked silly questions before, that is how we learn. But it matters you put in effort to ask the right question. The more effort you put into it, the more specific question you ask and more helpful answers you get.——Yan Zhou.
In my office I have two NIPS posters on the wall, from 2011 and 2012. I have never attended, and I am not a computer scientist either, but I like NIPS all the same. Now it is time for me to organize posts from others:
 NIPS ruminations I
 NIPS II: Deep Learning and the evolution of data models
 NIPS stuff…
 NIPS 2012
 NIPS 2012 Conference in Lake Tahoe, NV
 Thoughts on NIPS 2012
 The Big NIPS Post
 NIPS 2012 : day one
 NIPS 2012 : day two
 Spectral Methods for Latent Models
 NIPS 2012 Trends
And among all of the posts, there are several things I have to digest later on:
 One tutorial on Random Matrices, by Joel Tropp. People summarized it in their posts: “Basically, break random matrices down into a sum of simpler, independent random matrices, then apply concentration bounds on the sum.”—John Moeller. “The basic result is that if you love your Chernoff bounds and Bernstein inequalities for (sums of) scalars, you can get almost exactly the same results for (sums of) matrices.”—hal.
 “This year was definitely all about Deep Learning,” John Moeller said. The Geomblog mentioned that while it has been in the news recently because of the Google untrained search for YouTube cats, the methods of deep learning (basically neural nets without lots of back propagation) have been growing in popularity over a long while. We should also spend some time reading “Deep Learning and the evolution of data models”, which is related to manifold learning.
 “Another trend that’s been around for a while, but was striking to me, was the detailed study of Optimization methods.”—The Geomblog. There are at least two different workshops on optimization in machine learning (DISC and OPT), and numerous papers that very carefully examined the structure of optimizations to squeeze out empirical improvements.
 Kernel distances: An introduction to kernel distance from The Geomblog. “Scott Aaronson (at his NIPS invited talk) made this joke about how nature loves ℓ2. The kernel distance is ‘essentially’ the ℓ2 variant of EMD (which makes so many things easier). There’s been a series of papers by Sriperumbudur et al. on this topic, and in a series of works they have shown that (a) the kernel distance captures the notion of ‘distance covariance’ that has become popular in statistics as a way of testing independence of distributions. (b) as an estimator of distance between distributions, the kernel distance has more efficient estimators than (say) the EMD because its estimator can be computed in closed form instead of needing an algorithm that solves a transportation problem and (c) the kernel that optimizes the efficiency of the two-sample estimator can also be determined (the NIPS paper).”
 Spectral Methods for Latent Models: Spectral methods for latent variable models are based upon the method of moments rather than maximum likelihood.
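The closed-form claim in the kernel distance item above can be made concrete. For two finite point sets $P$ and $Q$, the squared kernel distance is $\kappa(P,P) + \kappa(Q,Q) - 2\kappa(P,Q)$, where $\kappa$ sums the kernel over all pairs of points. The sketch below is an illustration under stated assumptions, not code from any of the cited posts: it uses a Gaussian kernel with a hypothetical bandwidth `sigma`, and it averages rather than sums over pairs, so each set is treated as an empirical distribution.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
double gaussian_kernel(const Point& a, const Point& b, double sigma) {
    assert(a.size() == b.size() && sigma > 0.0);
    double sq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sq += d * d;
    }
    return std::exp(-sq / (2.0 * sigma * sigma));
}

// Average kernel value over all pairs (p, q) with p in P and q in Q.
double mean_kernel(const std::vector<Point>& P, const std::vector<Point>& Q, double sigma) {
    assert(!P.empty() && !Q.empty());
    double s = 0.0;
    for (const Point& p : P)
        for (const Point& q : Q) s += gaussian_kernel(p, q, sigma);
    return s / (static_cast<double>(P.size()) * static_cast<double>(Q.size()));
}

// Kernel distance: sqrt(k(P,P) + k(Q,Q) - 2 k(P,Q)), computed in closed form.
double kernel_distance(const std::vector<Point>& P, const std::vector<Point>& Q, double sigma) {
    const double sq = mean_kernel(P, P, sigma) + mean_kernel(Q, Q, sigma)
                      - 2.0 * mean_kernel(P, Q, sigma);
    return std::sqrt(std::max(sq, 0.0));  // clamp tiny negative rounding error
}
```

The cost is quadratic in the number of points, but there is no transportation problem to solve, which is the contrast with EMD drawn in the quote.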
Besides the papers mentioned in the above hot topics, there are some other papers from Memming’s post:
 Graphical models via generalized linear models: Eunho introduced a family of graphical models with GLM marginals and Ising model style pairwise interaction. He said the Poisson-Markov-Random-Fields version must have negative coupling, otherwise the log partition function blows up. He showed conditions for which the graph structure can be recovered with high probability in this family.
 TCA: High dimensional principal component analysis for non-Gaussian data: Using an elliptical copula model (extending the nonparanormal), the eigenvectors of the covariance of the copula variables can be estimated from Kendall’s tau statistic which is invariant to the nonlinearity of the elliptical distribution and the transformation of the marginals. This estimator achieves close to the parametric convergence rate while being a semiparametric model.
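The Kendall's tau idea in the TCA item can be illustrated for a single pair of variables. The sketch below computes the sample tau and then applies the transform $\sin(\tfrac{\pi}{2}\hat\tau)$, which is, as far as I understand it, how this line of work maps rank correlations to entries of the latent correlation matrix; treat that mapping as an assumption here rather than a statement of the paper's exact estimator. The function names are hypothetical, and ties are simply counted as zero (tau-a).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sample Kendall's tau (tau-a): (concordant - discordant) / (number of pairs).
double kendall_tau(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    assert(n >= 2 && y.size() == n);
    long long score = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            const int sx = (x[i] > x[j]) - (x[i] < x[j]);  // sign of x_i - x_j
            const int sy = (y[i] > y[j]) - (y[i] < y[j]);  // sign of y_i - y_j
            score += sx * sy;  // +1 concordant, -1 discordant, 0 for a tie
        }
    }
    const double pairs = 0.5 * static_cast<double>(n) * static_cast<double>(n - 1);
    return static_cast<double>(score) / pairs;
}

// Rank-based estimate of the latent correlation: sin(pi/2 * tau).
double latent_correlation(const std::vector<double>& x, const std::vector<double>& y) {
    const double pi = std::acos(-1.0);
    return std::sin(0.5 * pi * kendall_tau(x, y));
}
```

Because tau depends only on the ordering of the observations, applying any strictly increasing transformation to either variable leaves the estimate unchanged, which is the invariance to the marginals that the summary mentions. Repeating this for every pair of variables gives a matrix whose eigenvectors can then be extracted with any standard eigen-solver.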
Update: Make sure to check the lectures from the prominent 26th Annual NIPS Conference filmed @ Lake Tahoe 2012. Also make sure to check the NIPS 2012 Workshops, Oral sessions and Spotlight sessions which were collected for the Video Journal of Machine Learning Abstracts – Volume 3.
 Grad Student’s Guide to Good Coffee+Grad Student’s Guide to Good Tea
 Favorite Apps for Work and Life
 estimating a constant (not really)
 Reinforcement Learning in R: An Introduction to Dynamic Programming
 The Future of Machine Learning (and the End of the World?)
 10 Papers Every Programmer Should Read (At Least Twice)
 R in the Press
 On Chomsky and the Two Cultures of Statistical Learning
 Speech Recognition Breakthrough for the Spoken, Translated Word
 Frequentist vs Bayesian
 w4s – the awesomeness we’re experiencing
 Why is the Gaussian so pervasive in mathematics?
 C++ Blogs that you Regularly Follow
 An interview with Brad Efron about scientific writing. I haven’t watched the whole interview, but I do know that Efron is one of my favorite writers among statisticians.
 Slidify, another approach for making HTML5 slides directly from R. Two caveats: (1) it is still a little too hard to change the theme/feel of the slides; (2) the placement/insertion of images is still a little clunky. Google Docs has figured this out; if they integrated the best features of Slidify, LaTeX, etc. into that system, it would be great.
 Statistics is still the new hotness. Here is a Business Insider list about 5 statistics problems that will “change the way you think about the world”.
 New Yorker, especially the line, “statisticians are the new sexy vampires, only even more pasty” (via Brooke A.)
 The closed graph theorem in various categories
 Got spare time? Watch some videos about statistics
 About the first Borel-Cantelli lemma
 Yihui Xie—The Setup
 Best Practices for Scientific Computing