This page collects useful posts from others.














Machine Vision 4 Users

From David’s Twitter stream:

Gonzalo Vazquez-Vilar


Hal Daume III


Andrew Gelman

David Brady

Greg: MIT IAP ’11 radar course SAR example, imaging with coffee cans, wood, and the audio input from your laptop 
Gustavo (Greg’s student): Cleaner Ranging and Multiple Targets
Anand:  Privacy and entropy (needs improvement)
Vladimir (ISW): Albert Theuwissen Reports from EI 2011 – Part 1 and 3D Sensing Forum at ISSCC 2011

Jordan: Compressed sensing, compressed MDS, compressed clustering, and my talk tomorrow
Bob: Paper of the Day (Po’D): Performance Limits of Matching Pursuit Algorithms Edition
Terry: An introduction to measure theory
Arxiv blog: The Nuclear Camera Designed to Spot Hidden Radiation Sources 




John: Accelerated learning

ISW: Aptina Demos Wafer Level Camera Technology


Greg: Paper posted to IEEE explorer: An Ultrawideband (UWB) Switched-Antenna-Array Radar Imaging System 

Suresh: All FOCS talks are online

Arthur: Generating a quasi Poisson distribution, version 2

Franck: Polynomial Learning of Distribution Families

Tara N. Sainath wrote in the SLTC Newsletter, November 2010 on Sparse

Bob Sturm wrote about the recent Probabilistic Matching Pursuit algorithm featured here recently in :

Laurent Jacques wonders about a New class of RIP matrices? What’s a submodular function, you ask? I am glad you did; I had the same question. Here is an answer.

I mentioned Random Matrix Theory a while back. Terry Tao has some new results that he explains in Random matrices: Localization of the eigenvalues and the necessity of four moments. He makes a reference to the book An Introduction to Random Matrices by Greg Anderson, Alice Guionnet and Ofer Zeitouni. Of related interest:

Djalil Chafai:

Terry Tao

Gonzalo Vazquez Vilar

John Langford

Djalil Chafai

Sofia Dahl and Bob Sturm in

Andrew Gelman in

Alex Gittens in

Gonzalo Vazquez Vilar in

Frank Nielsen in

Andrew McGregor in

I also found the following noteworthy papers, enjoy!

  • First, I agree with Andrew, this essay by Mandelbrot is fascinating

    A maverick’s apprenticeship. The Wolf Prize for Physics. Edited by David Thouless. Singapore: World Scientific, 2004. [ PDF (154.4 KB) ]

Now on to the sites to check for any news on compressive sensing. Here is the (incomplete) list.





TheoreticalCS:(not yet working)




Fast algorithms for nonconvex compressive sensing by Rick Chartrand, LANL


Here are blogs/papers to reflect on, enjoy!

Back in December, I asked: What was the most interesting paper on Compressive Sensing you read in 2010? Here is a compilation of y’all’s answers:

One of you readers recently let me know of the February Fourier Talks, being held on February 17 and 18 at the University of Maryland’s Norbert Wiener Center for Harmonic Analysis and Applications. Thank you, anonymous reader.

Recent entries I’ll probably be re-reading include:


 “…I found this idea of CS sketchy,…”


CS: Would you like that entry Supersized?  there are dozens of articles on compressive sensing


Open Source Software for iPad and iPhone



Compressed Sensing: the L1 norm finds sparse solutions


So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing





Computer Science & Engineering

Machine Learning

Neuroscience & Biology

Finance and Econometrics

Seminars, Talks, and Conference Videos:

See for more links…



Computer Science & Engineering

Machine Learning

Neuroscience & Biology

Finance and Economics

Open Courseware Directories and Other Video Lecture Roundup Posts

Through Mark Davenport’s Twitter stream, I learned about an extensive course entitled “An Introduction to Compressive Sensing,” written by Richard Baraniuk, Mark Davenport, Marco Duarte, and Chinmay Hegde, which is now available. Wow! I’ll add it shortly to both the big picture and the Teaching Compressive Sensing page.

Ways to prove the fundamental theorem of algebra


Distinguished and Plenary Talks

The Institute for Advanced Study’s Women and Mathematics series of lectures and seminars featured the following interesting presentations this year:


Geometric Tools for Identifying Structure in Large Social and Information Networks



Elementary Applied Topology draft textbook
Introduction to category theory
Mathematical model of walking

Statistics and machine learning

Machine learning demos
On the accuracy of statistical procedures in Excel 2007
R reference card for data mining
Wisdom of statistically manipulated crowds


Videos of talks by Friedman and Macintyre


Deviance, DIC, AIC, cross-validation, etc

The pervasive twoishness of statistics; in particular, the “sampling distribution” and the “likelihood” are two different models, and that’s a good thing



  1. GraphLab v2 @ Big Learning Workshop
  2. Basic Introduction to ggplot2
  3. Bayesian statistics made simple
  4. Courses in CS this spring
  5. A Numerical Tour of Signal Processing
  6. Reading List for Feb and March 2012: materials on concentration and geometric techniques used in compressed sensing.
  7. simulated annealing for Sudokus
  8. Djalil talks about A random walk on the unitary group, Brownian Motion, and From seductive theory to concrete applications (which got Nuit Blanche thinking about writing this entry: Whose heart doesn’t sink at the thought of Dirac being inferior to Theora?)
  9. Lectures on Gaussian approximations with Malliavin calculus
  10. Useful R snippets
  11. Special Section: Minimax Shrinkage Estimation: A Tribute to Charles Stein
  12. Excellent Papers for 2011
  13. Creating a designer’s CV in LaTeX
  14. Is NGS the Answer? 
  15. Sequence Analysis Methods Not Just for Sequence Data
  16. DNA Variant Analysis of Complete Genomics’ Next-Generation Sequencing Data
  17. Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process
  18. Best Written Paper
  19. Online SVD/PCA resources
  20. Probabilistic Topic Models
  1. Social Network Analysis with R
  2. Publicly available large data sets for database research
  3. Around the blogs in 80 hours and Random Thoughts (some are about sequencing data)
  4. Change margins of a single page (latex)
  5. Bootstrap example
  6. Exciting News on Three Dimensional Manifolds
  7. Dr. Perou on Next Generation Sequencing Technology
  8. analyzing-complex-plant-genomes-with-the-newest-next-generation-dna-sequencing-techniques
  9. RNA-Seq Methods & March Twitter Roundup
  10. Introduction to Statistical Thought
  11. An R programmer looks at Julia
  12. The slides and video can help you get a flavor of the language Julia.
  13. Why and How People Use R
  14. Wang, Landau, Markov, and others…
  15. Linear mixed models in R
  16. Least Absolute Gradient Selector: Statistical Regression via Pseudo-Hard Thresholding
  17. Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
  18. C++ at Facebook
  19. Calling C++ from R
  20. C++ Renaissance
  21. Why haven’t we cured cancer yet? (Revisited): Personalized medicine versus evolution
  22. Getting ppt figures into LaTeX
  23. Latex Allergy Cured by knitr
  24. Melbourne R Users
  25. sixty two-minute r twotorials
  1. LDA explained
  2. Counting the total number of…
  3. Significance Test for Kendall’s Tau-b
  4. dimension reduction in ABC [a review’s review]
  5. 9 essential LaTeX packages everyone should use
  6. Linguistic Notation Inside of R Plots! about knitr
  7. knitr Elegant, flexible and fast dynamic report generation with R
  8. knitr Performance Report-Attempt 1
  9. knitr Performance Report-Attempt 2
  10. Question: Why you need perl/python if you know R/Shell [NGS data analysis]
  11. SPAMS (SPArse Modeling Software) now with Python and R
  12. Large-scale Inference and empirical Bayes, they are related with multiple testing
  13. My setup about some softwares and editors
  14. Fancy HTML5 Slides with knitr and pandoc
  15. John talks about Random is as random does
  16. MCMC at ICMS (1)
  17. MCMC at ICMS (2)
  18. MCMC at ICMS (3)
  19. John Cook: Why and How People Use R
  20. An Introduction to 6 Machine Learning Models
  21. Machine Learning: Algorithms that Produce Clusters
  22. Dirichlet Process for dummies
  1. A Really Nice Talk About PDE, Numerics (and Pyramids)
  2. Analysis of Boolean Functions
  3. Next-generation genome sequencers compared
  4. why noninformative priors?
  5. Data Scientists Get Ranked
  6. 90+ Two-Minute Videos on R
  7. Turing Centennial Celebration – Day 1
  8. Turing Centennial Celebration – Day 2
  9. Turing Centennial Celebration – Day 3
  10. Online resources for handling big data and parallel computing in R
  11. Source R-Script from Dropbox
  12. Excel in Statistics and Operations Research
  13. Dynamic Content with RStudio, Markdown, and Marked.
  14. Five minute guide to LaTeX
  15. Interactive reports in R with knitr and RStudio
  16. What Programming language are they using ?
  17. Generating reports for different data sets using brew and knitr
  18. Reproducible research with markdown, knitr and pandoc
  19. Getting Started with R Markdown, knitr, and Rstudio 0.96
  20. My experiences with Rcpp
Note: items 4-7 below are from Simply Statistics.
  1. A Personal Perspective on Machine Learning
  2. The differing perspectives of statistics and machine learning
  3. Kernel Methods and Support Vector Machines de-Mystified
  4. I love this article in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at the high-level analysis and summaries can be misleading; you have to look at the raw data to see the potential problems. As data become more complex, I think it’s critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K.
  5. On the other hand, this article in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea you could “figure it out” with a set of simple equations is laughable. If you check out their model this is clearly not the answer to the obesity epidemic. Just another example of why statistics is not math. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on Science vs. PR. Via Andrew J.
  6. Some cool applications of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret/visualize.
  7. Check out John C.’s really fascinating post on determining when a white-collar worker is great. Inspired by Roger’s post on knowing when someone is good at data analysis.
  8. knitR Performance Report 3 (really with knitr) and dprint
  9. Unix doesn’t follow the Unix philosophy
  10. Advice on writing research articles
  11. knitr Performance Report–Attempt 3
  12. Permutation tests in R
  13. Understanding Bayesian Statistics – By Michael-Paul Agapow
  14. knitr, Slideshows, and Dropbox
  15. Generate LaTeX tables from CSV files (Excel)
  16. The Tomato Genome
  17. Optimization
  18. Sichuan Agricultural University and LC Sciences Uncover the Epigenetics of Obesity
  19. How to Stay Current in Bioinformatics/Genomics
  20. Interactive HTML presentation with R, googleVis, knitr, pandoc and slidy
  21. The R-Podcast Episode 7: Best Practices for Workflow Management
  22. What is the point of statistics and operations research?
  23. Question: C/C++ libraries for bioinformatics?
  24. 5 Hidden Skills for Big Data Scientists
  25. Protocol – Computational Analysis of RNA-Seq


  1. An easy way to think about priors on linear regression
  2. Combining priors and downweighting in linear regression
  3. Metropolis Hastings MCMC when the proposal and target have differing support
  4. Slidify: Things are coming together fast
  5. How to Convert Sweave LaTeX to knitr R Markdown: Winter Olympic Medals Example
  6. Testing R Markdown with R Studio and posting it on
  7. Announcing The R markdown Package
  8. Announcing RPubs: A New Web Publishing Service for R
  9. Approximate Bayesian computation
  10. Load Packages Automatically in RStudio
  11. Practical advice for machine learning: bias, variance and what to do next
  12. The overview article on “Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis” associated with the invited talk at the upcoming PODS 2012 meeting is on the arXiv here.
  13. The monograph on “Randomized Algorithms for Matrices and Data” is available in NOW’s “Foundations and Trends in Machine Learning” series here, and it is also available on the arXiv here.
  14. Click here for information (including the slides and video!) on the Tutorial on “Geometric Tools for Identifying Structure in Large Social and Information Networks,” given originally at ICML10 and KDD10 and subsequently at many other places. (The slides are also linked to below.)
  15. The overview chapter on “Algorithmic and Statistical Perspectives on Large-Scale Data Analysis” is finally on the arXiv here; the book in which it will appear is in press; and a video of the associated talk is here.
  16. Recent teaching: Fall 2009: CS369M: Algorithms for Massive Data Set Analysis
  17. Confidence distributions
  18. Making a singular matrix non-singular
  19. Statistics Versus Machine Learning
  20. How to post R code on WordPress blogs
  21. Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)
  22. Pro Tips for Grad Students in Statistics/Biostatistics (Part 2)
  23. Why You Shouldn’t Conclude “No Effect” from Statistically Insignificant Slopes
  24. For those interested in knitr with Rmarkdown to beamer slides
  25. Notes from A Recent Spatial R Class I Gave
  26. Sparse Bayesian Methods for Low-Rank Matrix Estimation and Bayesian Group-Sparse Modeling and Variational Inference – implementation
  27. The Battle of the Bayes
  28. Ockham Workshop, Day 1
  29. Ockham Workshop, Day 2
  30. Ockham Workshop, Day 3
  31. Ockham’s Razor
  32. Occam
  1. Simplicity is hard to sell
  2. Self-Repairing Bayesian Inference
  3. Praxis and Ideology in Bayesian Data Analysis
  4. In-consistent Bayesian inference
  5. Big Data Generalized Linear Models with Revolution R Enterprise
  6. Quants, Models, and the Blame Game
  7. Fun with the googleVis Package for R
  8. Topological Data Analysis
  9. The Winners of the LaTeX and Graphics Contest 
  10. Is Machine Learning Losing Impact?
  11. Machine Learning Doesn’t Matter?
  12. Components of Statistical Thinking and Implications for Instruction and Assessment
  13. Xiao-Li Meng and Xianchao Xie rethink asymptotics
  14. Higgs boson and five sigma
  15. What is the Statistics Department 25 Years From Now?
  16. Statistics: Your chance for happiness (or misery)
  17. Manifolds: motivation and definition
  18. Why Emacs is important to me? : ESS and org-mode
  19. Interesting Emacs linkfest
  20. Devs Love Bacon: Everything you need to know about Machine Learning in 30 minutes or less
  21. Visualizing Galois Fields
  22. Visualizing Galois Fields (Follow-up)
  23. Statistical Reasoning on iTunes U
  24. Computing log gamma differences
  25. Where to start if you’re going to revise statistics
  26. Power laws and the generalized CLT
  27. Open problems in next-gen sequence analysis
  28. More equations, less citations?
  29. Talk: Some Introductory Remarks on Bayesian Inference


  1. Getting Started with the WordPress Competition
  2. Simple Made Easy
  3. An Education Tsunami + Will on-line courses destroy universities?
  4. Universities Reshaping Education on the Web
  5. Explanation or Prediction? An Amazing Quote from Phil Schrodt
  6. Should you apply PCA to your data?
  7. Which classifiers are fast enough for exploring medium-sized data?
  8. Quick classifiers for exploring medium-sized data (redux)
  9. Is C++ worth it?
  10. Unbiased estimators can be terrible
  11. Things You Should Never Do, Part I
  12. The Joel Test: 12 Steps to Better Code
  13. Methodologists’ Audience
  14. Bayesian Methodology in the Genetic Age
  15. Interview with Michael Hammel, author of The Artist’s Guide to GIMP
  16. Being Happy in Grad School
  17. 10 Fresh Tips for Finding Time to Blog
  18. A Quick Guide to Using Tumblr for Business
  19. Statistics Done Wrong
  20. Top N Reasons To Do A Ph.D. or Post-Doc in Bioinformatics/Computational Biology
  21. Interview(s) with Vladimir Voevodsky with an introduction on motivic homotopy along with the video and transcript.
  22. Are there examples of non-orientable manifolds in nature?
  23. Kolmogorov Complexity – A Primer
  24. Adventures at My First JSM (Joint Statistical Meetings) #JSM2012
  25. Yes, I was hacked. Hard.
  26. Does Julia have any hope of sticking in the statistical community?
  27. How Genome Sequencing is Revolutionizing Clinical Diagnostics, from the ISMB Conference
  28. Advice for an Undergraduate
  29. 4 things you should know about choosing examiners for your thesis
  30. The long tail of free online education : The author also plans to teach a class on graph partitioning, expander graphs, and random walks online in Winter 2013.
  31. Teaching the World to Search
  32. Beyond Pinterest and Instagram – ten visual social networks that should be on your radar
  33. Making Ubuntu 12.04 useable
  34. Basic Understanding of Compressed Sensing


  1. Towards Better PDF Management with the Filesystem
  2. What is life like for PhDs in computer science who go into industry?
  3. Online REPL for 17 programming languages
  4. Logistic regression vs. multiple regression: Many statisticians seem to advise the use of logistic regression over multiple regression by invoking this logic: “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” While we can understand this argument, our feeling is that, in the applied fields we toil in, that argument is not a very practical one. In fact a seasoned statistics professor we know says (in effect): “What’s the big deal? If multiple regression yields any predicted values less than 0, consider them 0. If multiple regression yields any values greater than 1, consider them 1. End of story.” We agree.
  5. Scientific Python
  6. An everyday essential: the timer+My personal productivity rules
  7. Bill Thurston, by Terence Tao; Bill Thurston, 1946-2012, by Peter Woit; Bill Thurston 1946-2012, by David Speyer.
  8. Surviving a PhD: 10 top tips that show how to survive your PhD
  9. How different PhD’s work: Differences and similarities between departments in the PhD process
  10. Countdown Begins: Countdown starts for submission of the thesis
  11. PhD Life is Wonderful: Doing a PhD at Warwick University is a wonderful experience
  12. Too Many Emails In Your Inbox: Use Outlook folders to manage your emails
  13. Introduction to REX Facility: Videos introducing the Wolfson Research Exchange and its facilities
  14. Power of Supervisors: Control, inner happiness and optimism
  15. Unorthodox Tools of a Researcher: Reflection and examples of unorthodox tools that help you during your PhD
  16. Homesickness and Culture Clashes: Homesickness of international students and cultural differences
  17. Choosing Your PhD Examiners: Tips for choosing the relevant examiners for PhD Viva
  18. Effective Research Tools: Examples of useful research tools
  19. PhD, Risks and Murphy’s Law: “Anything that can go wrong will go wrong” according to Murphy’s Law
  20. Will Data Scientists Be Replaced by Tools?
  21. Update: TeX Writer for iPad (+ LaTeX + AMS)
  22. Why physicists like models, and why biologists should
  23. The ENCODE project: lessons for scientific publication
  24. Perspectives From A Postdoc: What is a Postdoc?
  25. Chris Blattman gives advice on PhD students’ NSF applications
  26. ENCODE floods the news networks…
  27. Maybe mostly useful for me, but for other people with Tumblr blogs, here is a way to insert Latex.—From Simply Statistics
  28. Harvard Business school is getting in on the fun, calling the data scientist the sexy profession for the 21st century. Although I am a little worried that by the time it gets into a Harvard Business document, the hype may be outstripping the real promise of the discipline. Still, good news for statisticians! (via Rafa via Francesca D.’s Facebook feed).—From Simply Statistics
  29. The counterpoint is this article which suggests that data scientists might be able to be replaced by tools/software. I think this is also a bit too much hype for my tastes. Certain things will definitely be automated and we may even end up with a deterministic statistical machine or two. But there will continually be new problems to solve which require the expertise of people with data analysis skills and good intuition (link via Samara K.)—From Simply Statistics
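
The “consider them 0 / consider them 1” rule quoted in item 4 above amounts to clipping a linear model’s predictions to the interval [0, 1]. Here is a minimal sketch in Python, assuming a single predictor and made-up toy data; the function names and the numbers are hypothetical illustrations, not taken from the quoted post.

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def clipped_prediction(intercept, slope, x):
    """Linear prediction with values below 0 set to 0 and values above 1 set to 1."""
    raw = intercept + slope * x
    return min(1.0, max(0.0, raw))


# Hypothetical toy data: a 0/1 outcome that becomes more likely as x grows.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

intercept, slope = fit_simple_ols(xs, ys)

# Query points chosen so that the raw linear fit leaves [0, 1] at both ends.
new_xs = [-2, 4.5, 12]
raw_predictions = [intercept + slope * x for x in new_xs]
clipped_predictions = [clipped_prediction(intercept, slope, x) for x in new_xs]

for x, raw, clipped in zip(new_xs, raw_predictions, clipped_predictions):
    print(f"x={x}: raw={raw:.3f}, clipped={clipped:.3f}")
```

Clipping only repairs the range of the predictions; it does not change the fitted line, so whether it is good enough in practice is the judgment call the quoted passage is making.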


  1. Grad Student’s Guide to Good Coffee+Grad Student’s Guide to Good Tea
  2. Favorite Apps for Work and Life
  3. estimating a constant (not really)
  4. Reinforcement Learning in R: An Introduction to Dynamic Programming
  5. The Future of Machine Learning (and the End of the World?)
  6. 10 Papers Every Programmer Should Read (At Least Twice)
  7. R in the Press
  8. On Chomsky and the Two Cultures of Statistical Learning
  9. Speech Recognition Breakthrough for the Spoken, Translated Word
  10. Frequentist vs Bayesian
  11. w4s – the awesomeness we’re experiencing
  12. Why is the Gaussian so pervasive in mathematics?
  13. C++ Blogs that you Regularly Follow
  14. An interview with Brad Efron about scientific writing. I haven’t watched the whole interview, but I do know that Efron is one of my favorite writers among statisticians.
  15. Slidify, another approach for making HTML5 slides directly from R. (1) It is still just a little too hard to change the theme/feel of the slides. (2) The placement/insertion of images is still a little clunky; Google Docs has figured this out, and if they integrated the best features of Slidify, LaTeX, etc. into that system, it would be great.
  16. Statistics is still the new hotness. Here is a Business Insider list about 5 statistics problems that will “change the way you think about the world”.
  17. New Yorker, especially the line, “statisticians are the new sexy vampires, only even more pasty” (via Brooke A.)
  18. The closed graph theorem in various categories
  19. Got spare time? Watch some videos about statistics
  20. About the first Borel-Cantelli lemma
  21. Yihui Xie: The Setup
  22. Best Practices for Scientific Computing


  1. Machine Learning, Big Data, Deep Learning, Data Mining, Statistics, Decision & Risk Analysis, Probability, Fuzzy Logic FAQ
  2. A Funny Thing Happened on the Way to Academia . . .
  3. Advice for students on the academic job market (2013 edition)
  4. Perspective: “Why C++ Is Not ‘Back’”
  5. Is Fourier analysis a special case of representation theory or an analogue?
  6. The Beauty of Bioconductor
  7. The State of Statistics in Julia
  8. Open Source Misfeasance
  9. Book review: The Signal and The Noise
  10. Should the Cox Proportional Hazards model get the Nobel Prize in Medicine?
  11. The most influential data scientists on Twitter
  12. Here is an interesting review of Nate Silver’s book. The interesting thing about the review is that it doesn’t criticize the statistical content, but criticizes the belief that people only use data analysis for good. This is an interesting theme we’ve seen before. Gelman also reviews the review. (From Simply Statistics)
  13. Video : “Matrices and their singular values” (1976)
  14. Beyond Computation: The P vs NP Problem – Michael Sipser. This talk is arguably the very best introduction to computational complexity.
  15. What are some of your personal guidelines for writing good, clear code?
  16. How do you explain Machine learning and Data Mining to non CS people?
  17. Suggested New Year’s resolution: start a blog:  A blog forces you to articulate your thoughts rather than having vague feelings about issues; You also get much more comfortable with writing, because you’re doing it rather than thinking about doing it; If other people read your blog you get to hear what they think too. You learn a lot that way. || Set aside time for your blog every day. Keep notes for yourself on bloggy subjects (write a one-line gmail to yourself with the subject “blog ideas”).
  18. The most influential data scientists on Twitter
  19. Tips on job market interviews
  20. The age of the essay


  1. Interview with Nick Chamandy, statistician at Google
  2. You and Your Research + video
  3. Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
  4. A Survival Guide to Starting and Finishing a PhD
  5. Six Rules For Wearing Suits For Beginners
  6. Why I Created C++
  7. More advice to scientists on blogging
  8. Software engineering practices for graduate students
  9. Statistics Matter
  10. What statistics should do about big data: problem forward not solution backward
  11. How signals, geometry, and topology are influencing data science
  12. The Bounded Gaps Between Primes Theorem has been proved
  13. A non-comprehensive list of awesome things other people did this year.
  14. Jake VanderPlas writes about the Big Data Brain Drain from academia.
  15. Tomorrow’s Professor Postings
  16. Best Practices for Scientific Computing
  17. Some tips for new research-oriented grad students
  18. 3 Reasons Every Grad Student Should Learn WordPress
  19. How to Lie With Statistics (in the Age of Big Data)
  20. The Geometric View on Sparse Recovery
  21. The Mathematical Shape of Things to Come
  22. A Guide to Python Frameworks for Hadoop
  23. Statistics, geometry and computer science.
  24. How to Collaborate On GitHub
  25. Step by step to build my first R Hadoop System
  26. Open Sourcing a Python Project the Right Way
  27. Data Science MD July Recap: Python and R Meetup
  28. Recent reflections on git
  29. 10 Reasons Python Rocks for Research (And a Few Reasons it Doesn’t)
  30. Effective Presentations – Part 2 – Preparing Conference Presentations
  31. Doing Statistical Research
  32. How to Do Statistical Research
  33. Learning new skills
  34. How to Stand Out When Applying for An Academic Job
  35. Maturing from student to researcher
  36. False discovery rate regression (cc NSA’s PRISM)
  37. Job Hunting Advice, Pt. 3: Networking
  38. Getting Started with Git


  1. Some R Resources for GLMs
  2. Statistical data analysis in search and rescue after loss of contact
  3. The gap between data mining and predictive models
  4. Data Mining, machine learning and statistics.
  5. useR! 2014 is underway with 16 tutorials
  6. What is Scalable Machine Learning?
  7. rlist: working with non-relational data in R using lists
  8. The perfect candidate
  9. The Leek group guide to giving talks
  10. 38 Seminal Articles Every Data Scientist Should Read
  11. Deep Learning – important resources for learning and understanding
  12. Twenty rules for good graphics + Ten Simple Rules for Better Figures
  13. Git Cookbook
  14. Making Your Code Citable
  15. biblatex for statisticians
  16. Do your “data janitor work” like a boss with dplyr


  1. Tutorial: How to detect spurious correlations, and how to find the …
  2. Practical illustration of Map-Reduce (Hadoop-style), on real data
  3. Jackknife logistic and linear regression for clustering and predict…
  4. From the trenches: 360-degrees data science
  5. A synthetic variance designed for Hadoop and big data
  6. Fast Combinatorial Feature Selection with New Definition of Predict…
  7. A little known component that should be part of most data science a…
  8. 11 Features any database, SQL or NoSQL, should have
  9. Clustering idea for very large datasets
  10. Hidden decision trees revisited
  11. Correlation and R-Squared for Big Data
  12. Marrying computer science, statistics and domain expertize
  13. New pattern to predict stock prices, multiplies return by factor 5
  14. What Map Reduce can’t do
  15. Excel for Big Data
  16. Fast clustering algorithms for massive datasets
  17. Source code for our Big Data keyword correlation API
  18. The curse of big data
  19. How to detect a pattern? Problem and solution
  20. Interesting Data Science Application: Steganography
  21. Easily create documents from R with Rmarkdown
  22. How to publish R and ggplot2 to the web
  23. magrittr: Simplifying R code with pipes
  24. Updated dplyr Examples
  25. Video introduction to data manipulation with dplyr
  26. R and Data Science
  27. jiebaR Chinese word segmentation: the flexibility of R, the efficiency of C
  28. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
  29. 41 hours of courses given in Iceland this Summer at the Machine Learning Summer School.
  30. summary of parallel machine learning approaches
  31. big data and data science talks

################## From SimplyStats ##################

Editor’s Note: Last year I made a list off the top of my head of awesome things other people did. I loved doing it so much that I’m doing it again for 2014. Like last year, I have surely missed awesome things people have done. If you know of some, you should make your own list or add it to the comments! The rules remain the same. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Update: I missed pipes in R, now added!

  1. I’m copying everything about Jenny Bryan’s amazing Stat 545 class in my data analysis classes. It is one of my absolute favorite open online set of notes on data analysis.
  2. Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, Nicholas J. Horton wrote this awesome paper on integrating R markdown into the curriculum. I love the stuff that Mine and Nick are doing to push data analysis into undergrad stats curricula.
  3. Speaking of those folks, the undergrad guidelines for stats programs put out by the ASA do an impressive job of balancing the advantages of statistics and the excitement of modern data analysis.
  4. Somebody tell Hector Corrada Bravo to stop writing so many awesome papers. He is making us all look bad. His epiviz paper is great and you should go start using the Bioconductor package if you do genomics.
  5. Hilary Mason founded fast forward labs. I love the business model of translating cutting edge academic (and otherwise) knowledge to practice. I am really pulling for this model to work.
  6. As far as I can tell 2014 was the year that causal inference became the new hotness. One example of that is this awesome paper from the Google folks on trying to infer causality from related time series. The R package has some cool features too. I definitely am excited to see all the new innovation in this area.
  7. Hadley was Hadley.
  8. Rafa and Mike taught an awesome class on data analysis for genomics. They also created a book on Github that I think is one of the best introductions to the statistics of genomics that exists so far.
  9. Hilary Parker wrote this amazing introduction to writing R packages that took the twitterverse by storm. It is perfectly written for people who are just at the point of being able to create their own R package. I think it probably generated 100+ R packages just by being so easy to follow.
  10. Oh you’re not reading StatsChat yet? For real?
  11. FiveThirtyEight launched. Despite some early bumps they have done some really cool stuff. Loved the recent piece on the beer mile and I read every piece that Emily Oster writes. She does an amazing job of explaining pretty complicated statistical topics to a really broad audience.
  12. David Robinson’s broom package is one of my absolute favorite R packages that was built this year. One of the most annoying things about R is the variety of outputs different models give and this tidy version makes it really easy to do lots of neat stuff.
  13. Chung and Storey introduced the jackstraw which is both a very clever idea and the perfect name for a method that can be used to identify variables associated with principal components in a statistically rigorous way.
  14. I rarely dig excel-type replacements, but the simplicity of makes me love it. It does one thing and one thing really well.
  15. The hipsteR package for teaching old R dogs new tricks is one of the many cool things Karl Broman did this year. I read all of his tutorials and never cease to learn stuff. In related news, if I were 1/10th as organized as that dude I’d actually, you know, get stuff done.
  16. Whether or not I agree that they should be allowed to do unregulated human subjects research, statistics at tech companies, and in particular randomized experiments, have never been hotter. The boldest of the bunch is OKCupid, who write blog posts with titles like “We experiment on human beings!”
  17. In related news, I love the PlanOut project by the folks over at Facebook, so cool to see an open source approach to experimentation at web scale.
  18. No wonder Mike Jordan (no, not that Mike Jordan) is such a superstar. His reddit AMA raised my respect for him from already super high levels. First, it’s awesome that he did it, and second, it is amazing how well he articulates the relationship between CS and Stats.
  19. I’m trying to figure out a way to get Matthew Stephens to write more blog posts. He teased us with the Dynamic Statistical Comparisons post and then left us hanging. The people demand more Matthew.
  20. Di Cook also started a new blog in 2014. She was also part of this cool exploratory data analysis event for the UN. They have a monster program going over there at Iowa State, producing some amazing research and a bunch of students that are recognizable by one name (Yihui, Hadley, etc.).
  21. Love this paper on sure screening of graphical models out of Daniela Witten’s group at UW. It is so cool when a simple idea ends up being really well justified theoretically; it makes the world feel right.
  22. I’m sure this actually happened before 2014, but the Bioconductor folks are still the best open source data science project that exists in my opinion. My favorite development I started using in 2014 is the git-subversion bridge that lets me update my Bioc packages with pull requests.
  23. rOpenSci ran an awesome hackathon. The lineup of people they invited was great and I loved the commitment to a diverse group of junior R programmers. I really, really hope they run it again.
  24. Dirk Eddelbuettel and Carl Boettiger continue to make bigtime contributions to R. This time it is Rocker, with Docker containers for R. I think this could be a reproducibility/teaching gamechanger.
  25. Regina Nuzzo brought the p-value debate to the masses. She is also incredible at communicating pretty complicated statistical ideas to a broad audience and I’m looking forward to more stats pieces by her in the top journals.
  26. Barbara Engelhardt keeps rocking out great papers. But she is also one of the best AEs I have ever had handle a paper for me at PeerJ. Super efficient, super fair, and super demanding. People don’t get enough credit for being amazing in the peer review process and she deserves it.
  27. Ben Goldacre and Hans Rosling continue to be two of the best advocates for statistics and the statistical discipline – I’m not sure either claims the title of statistician but they do a great job anyway. This piece about Professor Rosling in Science gives some idea about the impact a statistician can have on the most current problems in public health. Meanwhile, I think Dr. Goldacre does a great job of explaining how personalized medicine is an information science in this piece on statins in the BMJ.
  28. Michael Lopez’s series of posts on graduate school in statistics should be 100% required reading for anyone considering graduate school in statistics. He really nails it.
  29. Trey Causey has an equally awesome Getting Started in Data Science post that I read about 10 times.
  30. Drop everything and go read all of Philip Guo’s posts. Especially this one about industry versus academia or this one on the practical reason to do a PhD.
  31. The top new Twitter feed of 2014 has to be @ResearchMark (incidentally I’m still mourning the disappearance of @STATSHULK).
  32. Stephanie Hicks’ blog combines recipes for delicious treats and statistics, also I thought she had a great summary of the Women in Stats (#WiS2014) conference.
  33. Emma Pierson is a Rhodes Scholar who wrote for 538, 23andMe, and a bunch of other major outlets as an undergrad. Her blog is another must read. Here is an example of her awesome work on how different communities ignored each other on Twitter during the Ferguson protests.
  34. The RStudio crowd continues to be on fire. I think they are a huge part of the reason that R is gaining momentum. It wouldn’t be possible to list all their contributions (or it would be an RStudio exclusive list) but I really like Packrat and R Markdown v2.
  35. Another huge reason for the movement with R has been the outreach and development efforts of the Revolution Analytics folks. The Revolutions blog has been a must read this year.
  36. Julian Wolfson and Joe Koopmeiners at University of Minnesota are straight up gamers. They live streamed their recruiting event this year. One way I judge good ideas is by how mad I am I didn’t think of it and this one had me seeing bright red.
  37. This is just an awesome paper comparing lots of machine learning algorithms on lots of data sets. Random forests win, and this is a nice update of one of my favorite papers of all time: Classifier technology and the illusion of progress.
  38. Pipes in R! This stuff is for real. The piping functionality created by Stefan Milton and Hadley is one of the few inventions over the last several years that immediately changed whole workflows for me.



  1. Deep Learning Master Class
  2. Advances in Variational Inference
  3. Numerical Optimization: Understanding L-BFGS
  4. An exact mapping between the Variational Renormalization Group and Deep Learning
  5. New ASA Guidelines for Undergraduate Statistics Programs
  6. Singular value decomposition (We Recommend a Singular Value Decomposition)
  7. How do you explain what a neural network is in a simple, vivid, and fun way?
  8. Academic vs. Industry Careers
  9. Hadley Wickham: Impact the world by being useful
  10. Statisticians in World War II: They also served
  11. A Brief Overview of Deep Learning
  12. Advice for applying Machine Learning
  13. Deep Learning Tutorial
  14. Gibbs Sampling in Haskell
  15. How-to go parallel in R – basics + tips


  1. hierarchical models are not Bayesian models
  2. Hey, friend, have you grabbed a red envelope yet?
  3. xgboost: a fast and effective boosting model
  4. Machine Learning for Programming
  5. Deep stuff about deep learning?
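  6. “How to Get Started with Kaggle Competitions, Quick and Dirty”, aka a shortcut to quickly becoming a data scientist who earns a lot, works little, gets chased by headhunters all day, and doubles their salary by switching jobs. This post is written for those who want to start doing Kaggle competitions but are afraid they don’t know where to begin. Originally published at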
  7. Randomized experimentation


  1. “Navigating Big Data Careers with a Statistics PhD.”
  2. Great article from Professor Radhika Nagpal (Harvard) on tenure-track life.
  3. Career advice for academics from Robert Sternberg (Cornell).
  4. Installing R on OS X + Installing R on OS X – “100% Homebrew Edition”