You are currently browsing the monthly archive for July 2012.

Today I saw a linked question on reddit: How important is Java/C++ vs just using R/Matlab for big data? I learned C++ and Matlab as an undergraduate, and I am now teaching myself R as a PhD student in a statistics department. But in this era of big data, R alone is really not enough for scientific computing, so this question is exactly what I want to know. Here I want to collect interesting materials, including posts, about programming, especially in R and C++.

First, I want to mention the top project languages on GitHub: JavaScript 20%, Ruby 14%, Python 9%, Shell 8%, Java 8%, PHP 7%, C 7%, C++ 4%, Perl 4%, Objective-C 3%, among lots of other languages including R, Julia and Matlab. Of these top 10 languages, I only know C and C++. For people like me who are still learning, here is a short description of each:

  1. JavaScript
    JavaScript is an object-oriented scripting language that runs in your web browser. It uses a simplified set of commands, is easier to code and doesn’t require compiling. It’s an important language because it is embedded in HTML and is used in millions of web pages to validate forms, create cookies, detect browsers and improve page design and formatting. Big plus: it’s easy to learn and use.
  2. Ruby and Ruby on Rails
    Ruby is a dynamic, object-oriented, open-source programming language; Ruby on Rails is an open-source Web application framework written in Ruby that closely follows the MVC (Model-View-Controller) architecture. With its focus on simplicity, productivity and letting the computer do the work, its usage has spread quickly in just a few years. Ruby is very similar to Python, but with different syntax and libraries. There’s little reason to learn both, so unless you have a specific reason to choose Ruby (e.g. it is the language your colleagues all use), I’d go with Python.

    Ruby on Rails is one of the most popular web development frameworks out there, so if you’re looking to do primarily web development you should compare Django (Python framework) and RoR first.

  3. Python
    Python is an interpreted, dynamically-typed programming language. Python programs stress code readability, so even non-programmers should be able to decipher a Python program with relative ease. This also makes the language one of the easiest to learn and write code in quickly. Python is very popular and has a strong set of libraries for everything from numerical and symbolic computing to data visualization and graphical user interfaces.
  4. Java
    Java is an object-oriented programming language developed by James Gosling and colleagues at Sun Microsystems in the early 1990s. Why you should learn it: Hailed by many developers as a “beautiful” language, it is central to the non-.Net programming experience. Learning Java is critical if you are non-Microsoft.
  5. PHP
    PHP is an open-source, server-side HTML scripting language well suited for web developers, as it can easily be embedded into standard HTML pages. You can run 100% dynamic pages or hybrid pages, e.g. 50% HTML + 50% PHP.
  6. C
    C is a standardized, general-purpose programming language. It’s one of the most pervasive languages and the basis for several others (such as C++). It’s important to learn C. Once you do, making the jump to Java or C# is fairly easy, because a lot of the syntax is common. C is a low-level, statically typed, compiled language. The main benefit of C is its speed, so it’s useful for tasks that are very computationally intensive. Because it’s compiled into an executable, it’s also easier to distribute C programs than programs written in interpreted languages like Python. The trade-off of increased speed is decreased programmer efficiency. C++ is C with some additional object-oriented features built in. It can be slower than C, but the two are pretty comparable, so it’s up to you whether these additional features are worth it.
  7. Perl
    Perl is an open-source, cross-platform, server-side interpreted programming language used extensively to process text through CGI programs. Perl’s power in processing piles of text has made it very popular and widely used to write Web server programs for a range of tasks.

This ranking only reflects projects on GitHub, so it is biased. For me, I think C/C++, R, Julia, Matlab, Java, Python and Perl are the popular languages in the statistics sphere.

  1. Advice on learning C++ from an R background
  2. Integrating C or C++ into R, where to start?
  3. R for testing, C++ for implementation?
  4. Some thoughts on Java—compared with C++
  5. A list of RSS C++ blogs
  6. Get started with C++ AMP
  7. C++11 Concurrency Series
  8. Google’s Python Class and Google’s C++ Class from Google Code University
  9. Integrating R and C++
  10. Learn Python on Codecademy
  11. Learn How to Code Without Leaving Your Browser
  12. Minimal Advice to Undergrads on Programming
  13. Learning R Via Python (or the other way around).
  14. Bloom teaches Python for Scientific Computing at Berkeley (available as a podcast).
  15. I would focus on learning three classes of languages to really understand the nature of programming and to have a decent toolkit. Everything else is basically variants on that.

    Learn a low-level language so you understand what goes on at the bare metal and so you can make hardware dance
    The obvious choice here is C, but assembly language might also be good.

    Learn a language for architecting large systems
    If you want to build large code bases, you’re going to need one of the strongly typed languages. Personally, I think Java is the best choice here; but C++, Scala and even Ada are acceptable.

    Learn a language for scripting things together quickly
    There are a few choices here: shell, Python, Perl, Lua. Any of these will do, but Python is probably the foremost. These are great for gluing existing pieces together.

    Now, if you only get three, that’s it. But I’m going to suggest two more categories.

    Learn a language that forces you to think differently about programming
    These are majorly different world perspectives. Examples here would be functional programming, like Haskell, ML, etc, but also logic programming like Prolog.

    Learn a language that lets you build web-based applications quickly
    This could be web2py or Javascript — but the ability to quickly hack together a web demo is really useful today.

Lana Yarosh shared with us 5 practices (developed through much trial and error) that helped her stay happy in grad school:

  • Pick a good conference in your field and go to it every year (including your first year, even if you have to pay for it out of pocket) — when there were times that I thought about quitting (and there were those times), a conference has always brought me the energy, the influx of new ideas, and the wonderful people that I needed to get back in gear. My two chosen conferences are CHI (Human Factors in Computing) and IDC (Interaction Design and Children).
  • Avoid “time shifting” whenever possible — time shifting is when you end up shifting something you need to do today to another day in order to do some piece of work (e.g., “I’ll sleep tomorrow,” “I’ll get in touch with my advisor some other day, today I need to focus on this paper,” etc.). In my experience, time shifting only makes me more stressed out and less productive in the long run. If you need to skip this conference deadline and try for another, then maybe that’s the thing to do.
  • Get to know the people in your program — these folks are not only great to get to know as friends, but also will likely be your colleagues in the years to come. Also, they can commiserate with anything that you’re currently facing so they’re a great source of social support.
  • Have a routine that includes all of the things that are important to you — make a list of what is important to your happiness and make sure that you get a chance to do these things. My list includes things like swimming, hanging out with friends, exploring new places, reading for fun, and yes, research. You may have to set boundaries to make sure that the important things actually make it on your schedule, but it’s totally worth it to your overall level of happiness. I once told my advisor that I would not do certain types of academic activities because it would interfere with my work/life balance. He wasn’t happy at first, but later on accepted it and even said he admired me sticking to my guns on this (but, do pick your battles).
  • Don’t be afraid to ask for help — when you’re struggling or need something, ask for it. I hate asking for help, but I basically went crazy when I tried to handle everything myself. I’ve gotten help from my advisor, my committee members, my lab mates, my roommates, my extended academic family, my biological family, people I’ve met at CoC Happy Hour, and professionals (the Counseling Center at Georgia Tech is free for students, may be the same at your school). Don’t be afraid of looking lame. Sometimes you have to decide whether you want to save your face or your ass and the choice should be clear.

At the end of the post, she mentioned that:

For me, I’m more productive when I’m happy. So, when I plan to “swim, do 8 hours of work, and have dinner with friends,” I actually get a lot more done than when I plan to “work for the next 16 hours.” And, I’m immeasurably happier.

This is really what I want!!!

Some things are worth learning ahead of time. From the post:

A real-time skill is something you need for live performance. If you’re going to speak French, you have to memorize a large number of words before you need them in conversation. Looking up every word in an English-French dictionary as needed might work in the privacy of your study, but it would be infuriatingly slow in a face-to-face conversation. Some skills that we don’t think of as being real-time become real-time when you have to use them while interacting with other people.

More subtle than real-time skills are what I’m calling bicycle skills. Suppose you own a bicycle but haven’t learned to ride it. Each day you need to go to a store half a mile away. Each day you face the decision whether to walk or learn to ride the bicycle. It takes less time to just walk to the store than to learn to ride the bicycle and ride to the store. If you always do what is fastest that day, you’ll walk every day. I’m thinking of a bicycle skill as anything that doesn’t take too long to learn, quickly repays time invested, but will never happen without deliberate effort.

 

The Hardy-Weinberg equilibrium is a principle stating that the genetic variation in a population will remain constant from one generation to the next in the absence of disturbing factors. When mating is random in a large population with no disruptive circumstances, the law predicts that both genotype and allele frequencies will remain constant because they are in equilibrium.

The Hardy-Weinberg equilibrium can be disturbed by a number of forces, including mutations, natural selection, nonrandom mating, genetic drift, and gene flow. For instance, mutations disrupt the equilibrium of allele frequencies by introducing new alleles into a population. Similarly, natural selection and nonrandom mating disrupt the Hardy-Weinberg equilibrium because they result in changes in gene frequencies. This occurs because certain alleles help or harm the reproductive success of the organisms that carry them. Another factor that can upset this equilibrium is genetic drift, which occurs when allele frequencies grow higher or lower by chance and typically takes place in small populations. Gene flow, which occurs when breeding between two populations transfers new alleles into a population, can also alter the Hardy-Weinberg equilibrium.

Because all of these disruptive forces commonly occur in nature, the Hardy-Weinberg equilibrium rarely applies in reality. Therefore, the Hardy-Weinberg equilibrium describes an idealized state, and genetic variations in nature can be measured as changes from this equilibrium state.

From Hardy-Weinberg equilibrium.

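PS: Mathematically, for two alleles A and a with frequencies p and q = 1 - p, the equilibrium genotype frequencies are P_{AA} = p^{2}, P_{Aa} = 2pq and P_{aa} = q^{2}, so we have the following: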

P_{Aa}^{2}=4P_{AA}P_{aa}.
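
To see where this identity comes from, here is a minimal Python sketch. It assumes a single locus with two alleles A and a and random mating; the function names and the starting population are my own and purely illustrative.

```python
def genotype_frequencies(p):
    """Hardy-Weinberg genotype frequencies (P_AA, P_Aa, P_aa) for allele frequency p of A."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def hw_gap(p_AA, p_Aa, p_aa):
    """P_Aa^2 - 4 * P_AA * P_aa, which is zero when the population is in equilibrium."""
    return p_Aa ** 2 - 4.0 * p_AA * p_aa


def random_mating(p_AA, p_Aa, p_aa):
    """Genotype frequencies after one generation of random mating."""
    p = p_AA + 0.5 * p_Aa  # each heterozygote carries one copy of A
    return genotype_frequencies(p)


# A population at equilibrium satisfies the identity (up to rounding).
print(hw_gap(*genotype_frequencies(0.3)))

# A made-up population with no heterozygotes is far from equilibrium ...
start = (0.5, 0.0, 0.5)
print(hw_gap(*start))

# ... but one round of random mating brings it there.
print(hw_gap(*random_mating(*start)))
```

The gap is a convenient single number for measuring how far observed genotype frequencies are from the idealized state described above.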

The following from Revolutions:

John Myles White, self-described “statistics hacker” and co-author of “Machine Learning for Hackers”, was interviewed recently by The Setup. In the interview, he describes some of his go-to R packages for data science:

Most of my work involves programming, so programming languages and their libraries are the bulk of the software I use. I primarily program in R, but, if the situation calls for it, I’ll use Matlab, Ruby or Python. …

That said, for me the specific language I use is much less important than the libraries available for that language. In R, I do most of my graphics using ggplot2, and I clean my data using plyr, reshape, lubridate and stringr. I do most of my analysis using rjags, which interfaces with JAGS, and I’ll sometimes use glmnet for regression modeling. And, of course, I use ProjectTemplate to organize all of my statistical modeling work. To do text analysis, I’ll use the tm and lda packages.

Also in JMW’s toolbox: Julia, TextMate 2, MySQL, Dropbox and a beefy MacBook. Read the full interview linked below for an insightful look at how he uses these and other tools day to day.

The Setup / Interview: John Myles White

The basics of statistical simulation

A statistical simulation often consists of the following steps:

  1. Simulate a random sample of size N from a statistical model.
  2. Compute a statistic for the sample.
  3. Repeat 1 and 2 many times and accumulate the results.
  4. Examine the union of the statistics, which approximates the sampling distribution of the statistic and tells you how the statistic varies due to sampling variation.

The key to improving the simulation is to restructure the simulation algorithm as follows:

  1. Simulate many random samples of size N from a statistical model.
  2. Compute a statistic for each sample.
  3. Examine the union of the statistics, which approximates the sampling distribution of the statistic and tells you how the statistic varies due to sampling variation.

Please refer to the two posts by @RickWicklin, 1) Eight tips to make your simulation run faster and 2) Simulation in SAS: The slow way or the BY way, for details.
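
The restructured algorithm can be sketched in a few lines of Python. This is only an illustration of the structure, under assumptions of my own: the model is a standard normal, the statistic is the sample mean, and the function name is made up. In pure Python this layout is not faster by itself; the speed-up described in the posts comes from tools that process all samples in one pass (BY-group processing in SAS, or vectorized array operations elsewhere).

```python
import random
import statistics


def simulate_sample_means(num_samples, n, seed=0):
    """Return the sample mean of each of num_samples samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    # Step 1: simulate all of the random samples up front.
    samples = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(num_samples)]
    # Step 2: compute the statistic for each sample.
    return [statistics.fmean(sample) for sample in samples]


# Step 3: examine the collection of statistics, which approximates
# the sampling distribution of the mean.
means = simulate_sample_means(num_samples=2000, n=25)
print("mean of the sample means:", statistics.fmean(means))
print("std. dev. of the sample means:", statistics.stdev(means))
print("theoretical standard error:", 1.0 / 25 ** 0.5)
```

The standard deviation of the simulated means should land close to the theoretical standard error 1/sqrt(N), which is a quick sanity check that the simulation is set up correctly.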

 

Recently, the p-value has become a hot term because of the “God particle” (the Higgs boson). The following is from the post:

The following list summarizes five criticisms of significance testing as it is commonly practiced.

  1. Andrew Gelman: In reality, null hypotheses are nearly always false. Is drug A identically effective as drug B? Certainly not. You know before doing an experiment that there must be some difference that would show up given enough data.
  2. Jim Berger: A small p-value means the data were unlikely under the null hypothesis. Maybe the data were just as unlikely under the alternative hypothesis. Comparisons of hypotheses should be conditional on the data.
  3. Stephen Ziliak and Deirdre McCloskey: Statistical significance is not the same as scientific significance. The most important question for science is the size of an effect, not whether the effect exists.
  4. William Gosset: Statistical error is only one component of real error, maybe a small component. When you actually conduct multiple experiments rather than speculate about hypothetical experiments, the variability of your data goes up.
  5. John Ioannidis: Small p-values do not mean small probability of being wrong. In one review, 74% of studies with p-value 0.05 were found to be wrong.

Related posts:
Statistically significant but incorrect
False positives for medical papers

Please also refer to the comments of the above post.
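
Point 5 in the list above (a small p-value is not a small probability of being wrong) can be illustrated with a little arithmetic. The sketch below is my own and the numbers are assumptions chosen for illustration, not figures from the review cited in the post: suppose 10% of tested hypotheses describe a real effect, studies have 50% power, and the significance threshold is 0.05.

```python
def false_discovery_rate(prior_true, power, alpha):
    """Fraction of 'significant' results that are false positives.

    prior_true: assumed share of tested hypotheses with a real effect
    power:      probability that a real effect reaches significance
    alpha:      significance threshold (false positive rate under the null)
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Hypothetical inputs, chosen only to show the mechanics.
fdr = false_discovery_rate(prior_true=0.1, power=0.5, alpha=0.05)
print(f"share of significant findings that are wrong: {fdr:.1%}")
```

Under these assumptions nearly half of the significant findings are false, even though every one of them has p below 0.05. The answer moves a lot with the assumed prior and power, which is exactly why the p-value alone cannot tell you how likely a finding is to be wrong.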

So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing

How loud is the evidence?

The following is from Quora-Kaggle. I think it’s very useful for data scientists working on a project.

Question:

To do well in a competition, clearly many aspects are important.
But, what do you think helped you (or a top competitor you know) do better than others?

At a high level, I’m curious if you approach a competition with one-size-fits-all approach or try to develop a particular insight.

More specifically, I’m interested in your thoughts on utility of the following:
– bringing more data to the party
– feature engineering
– data prep/normalization
– superior knowledge of subtleties of ML
– know-how of tackling predictive modeling problems
– proprietary tools
– picking a problem you had particular insight about
– picking less competitive competition(s)
– forming a good team
– seek specialist’s advice
– persistence
– luck

Answer:

Thanks for asking me to answer this question (I guess at least one person thinks I am a top kaggle competitor!).  Anyone please feel free to correct anything inaccurate or off base here.

This is a tough question to answer, because much like any competitive endeavor, any given Kaggle competition requires a unique blend of skills and several different factors.  In some competitions, luck plays a large part.  In others, an element that you had not considered at all will play a large part.

For example, I was first and/or second for most of the time that the Personality Prediction Competition [1] ran, but I ended up 18th, due to overfitting in the feature selection stage, something that I had never encountered before with the method I used.  A good post on some of the seemingly semi-random shifts that happen at the end of a competition can be found on the Kaggle blog [2].

Persistence, Persistence, and more Persistence

You have outlined some key factors to success.  Not all of them are applicable to all competitions, but finding the one that does apply is key.  In this, persistence is very important.  It is easy to become discouraged when you don’t get into the top 5 right away, but it is definitely worth it to keep trying.  In one competition, I think that I literally tried every single published method on a topic.

In my first ever Kaggle competition, the Photo Quality Prediction [3] competition, I ended up in 50th place, and had no idea what the top competitors had done differently from me.

I managed to learn from this experience, however, and did much better in my second competition, the Algorithmic Trading Challenge [4].

What changed the result from the Photo Quality competition to the Algorithmic Trading competition was learning and persistence.  I did not really spend much time on the former competition, and it showed in the results.

Expect to make many bad submissions that do not score well.  You should absolutely be reading as much relevant literature (and blog posts, etc), as you can while the competition is running.  As long as you learn something new that you can apply to the competition later, or you learn something from your failed submission (maybe that a particular algorithm or approach is ill-suited to the data), you are on the right track.

This persistence needs to come from within, though.  In order to make yourself willing to do this, you have to ask yourself why you are engaging in a particular competition.  Do you want to learn?  Do you want to gain opportunities by placing highly?  Do you just want to prove yourself?  The monetary reward in most Kaggle competitions is not enough to motivate a significant time investment, so unless you clearly know what you want and how to motivate yourself, it can be tough to keep trying.  Does rank matter to you?  If not, you have the luxury of learning about interesting things that may or may not impact score, but you don’t if you are trying for first place.

The Rest of the Factors

Now that I have addressed what I think is the single most important factor (persistence), I will address the rest of your question:

1.  The most important data-related factor (to me) is how you prepare the data, and what features you extract.  Algorithm selection is important, but much less so.  I haven’t really seen the use of any proprietary tools among top competitors, although a couple of first place finishers have used open-source tools that they coded/maintain.

2.  I have had poor results with external data, typically.  Unless you notice someone on the leaderboard who has a huge amount of separation from the rest of the pack (or a group that has separation), it is unlikely that anyone has found “killer” external data.  That said, you should try to use all the data you are given, and there are often innovative ways to utilize what you are given to generate larger training sets.  An example is the Benchmark Bond Competition [5], where the competition hosts released two datasets because the first one could be reverse-engineered easily.  Using both more than doubled the available training data (this did not help score, and I did not use it in the final model, but it is an illustration of the point).

3.  Initial domain-specific knowledge can be helpful (some bond pricing formulas, etc, helped me in the Benchmark Bond competition), but it is not critical, and what you need can generally be picked up by learning while you are competing.  For example, I learned NLP methods while I competed in the Hewlett Foundation ASAP Competition.  That said, you definitely need to quickly learn the relevant domain-specific elements that you don’t know, or you will not really be able to compete in most competitions.

4.  Picking a less competitive competition can definitely be useful at first.  The research competitions tend to have fewer competitors than the ones with large prizes.  Later on, I find it useful to compete in more competitive competitions because it forces you to learn more and step outside your comfort zone.

5.  Forming a good team is critical.  I have been lucky enough to work with great people on two different competitions (ASAP and Bond), and I learned a lot from them.  People tend to be split into those that almost always work alone and those that almost always team up, but it is useful to try to do both.  You can learn a lot from working in a team, but working on your own can make you learn things that you might otherwise rely on a teammate for.

6.  Luck plays a part as well.  In some competitions, .001% separates 3rd and 4th place, for example.  At that point, it’s hard to say whose approach is “better”, but only one is generally recognized as a winner.  A fact of Kaggle, I suppose.

7.  The great thing about machine learning is that you can apply similar techniques to almost any problem.  I don’t think that you need to pick problems that you have a particular insight about or particular knowledge about, because frankly, it’s more interesting to do something new and learn about it as you go along.  Even if you have a great insight on day one, others will likely think of it, but they may do so on day 20 or day 60.

8.  Don’t be afraid to get a low rank.  Sometimes you see an interesting competition, but think that you won’t be able to spend much time on it, and may not get a decent rank.  Don’t worry about this.  Nobody is going to judge you!

9.  Every winning Kaggle entry is the combination of dozens of small insights.  There is rarely one large aha moment that wins you everything.  If you do all of the above, make sure you keep learning, and keep working to iterate your solution, you will do well.

Learning is Fun?

I think that the two main elements that I stressed here are persistence and learning.  I think that these two concepts encapsulate my Kaggle experience nicely, and even if you don’t win a competition, as long as you learned something, you spent your time wisely.

References

1.  http://www.kaggle.com/c/twitter-…
2.  http://blog.kaggle.com/2012/07/0…
3.  http://www.kaggle.com/c/PhotoQua…
4.  http://www.kaggle.com/c/Algorith…
5.  http://www.kaggle.com/c/benchmar…

The following is from JSM:

Guidelines for Preparing Effective Presentations

These tips apply regardless of whether the time for the presentations is short (less than 30 minutes) or long. Complaints about poor presentations have been received for decades and continue to be received. The ASA has offered a short course on presentation for many years, and routinely sends “tips” to speakers to promote effective presentations, but these often go ignored. The tips and suggestions are here to help you. Please put them to good use. An ad hoc committee (consisting of A. Lawrence Gould, Chair; Howard Kaplan; Peter A Lachenbruch; and Katherine Monti) was formed at the April 1999 ENAR business meeting, to address this persistent and pervasive problem. Effective presentations make learning and technical advances more likely. They also enhance the perception of the presenter in the eyes of the professional community. Boring, ineffective presentations are not paid much attention and often are quickly forgotten, especially by planners of future invited sessions.

Preparation

Content organization

  • Make sure the audience walks away understanding the five things any listener to a presentation really cares about:
    a. What is the problem and why?
    b. What has been done about it?
    c. What is the presenter doing (or has done) about it?
    d. What additional value does the presenter’s approach provide?
    e. Where do we go from here?
  • Carefully budget your time, especially for short (e.g., 15 minute) presentations.
  • Allow time to describe the problem clearly enough for the audience to appreciate the value of your contribution. This usually will take more than 30 seconds.
  • Leave enough time to present your own contribution clearly. This almost never will require all of the allotted time.
  • Put your material in a context that the audience can relate to. It’s a good idea to aim your presentation to an audience of colleagues who are not familiar with your research area. Your objective is to communicate an appreciation of the importance of your work, not just to lay the results out.
  • Give references and a way to contact you so those interested in the theoretical details can follow up.

Preparing effective displays

Here are some suggestions that will make your displays more effective.

  • Keep it simple. The fact that you can include all kinds of cute decorations, artistic effects, and logos does not mean that you should. Fancy designs or color shifts can make the important material hard to read. Less is more.
  • Use at least a 24-point font so everyone in the room can read your material. Unreadable material is worse than useless – it inspires a negative attitude by the audience to your work and, ultimately, to you. NEVER use a photocopy of a standard printed page as a display – it is difficult to overstate how annoying this is to an audience.
  • Try to limit the material to eight lines per slide, and keep the number of words to a minimum. Summarize the main points – don’t include every detail of what you plan to say. Keep it simple.
  • Limit the tables to four rows/columns for readability. Sacrifice content for legibility – unreadable content is worse than useless. Many large tables can be displayed more effectively as a graph than as a table.
  • Don’t put a lot of curves on a graphical display – busy graphical displays are hard to read. Also, label your graphs clearly with BIG, READABLE TYPE.
  • Use easily read fonts. Simple fonts like Sans Serif and Arial are easier to read than fancier ones like Times Roman or Monotype Corsiva. Don’t use italic fonts.
  • Light letters (yellow or white) on a dark background (e.g., dark blue) often will be easier to read when the material is displayed on LCD projectors.
  • Use equations sparingly if at all – audience members not working in the research area can find them difficult to follow as part of a rapidly delivered presentation. Avoid derivations and concentrate on presenting what your results mean. The audience will concede the proof and those who really are interested can follow up with you, which they’re more likely to do if they understand your results.
  • Don’t fill up the slide – the peripheral material may not make it onto the display screen – especially the material on the bottom of a portrait-oriented transparency.
  • Identify the journal when you give references: Smith, Bcs96 clues the reader that the article is in a 1996 issue of Biometrics, and is much more useful than just Smith 1996.
  • Finally, and this is critical, always, always, always preview your presentation. You will look foolish if symbols and Greek letters that looked OK in a WORD document didn’t translate into anything readable in POWERPOINT – and it happens!

Timing your talk

Don’t deliver a 30-minute talk in 15 minutes. Nothing irritates an audience more than a rushed presentation. Your objective is to engage the audience and have them understand your message. Don’t flood them with more than they can absorb. Think in terms of what it would take if you were giving (or, better, listening to) the last paper in the last contributed paper session of the last day. This means:

  • Present only as much material as can reasonably fit into the time period allotted. Generally that means 1 slide per minute, or less.
  • Talk at a pace that everybody in the audience can understand. Speak slowly, clearly, and loudly, especially if your English is heavily accented.
  • PRACTICE, PRACTICE, PRACTICE. Ask a colleague to judge your presentation, delivery, clarity of language, and use of time.
  • Balance the amount of material you present with a reasonable pace of presentation. If you feel rushed when you practice, then you have too much material. Budget your time to take a minute or two less than your maximum allotment. Again, less is more.

Loose ends

  • PRACTICE, PRACTICE, PRACTICE the presentation, with care to content, delivery and use of time. (In case you missed this recommendation above)

The Presentation

  • Put on the microphone and be sure that it works before you begin.
  • Be sure everyone in the room can see your material. Make sure you do not block the screen. Move around if you must so that everyone has a chance to see everything.
  • Never apologize for your displays. More to the point, make apologies unnecessary by doing the material properly in the first place (see the recommendations above). Do not say, “I know you can’t see this, but…” The reaction of many people in the audience will be “why bother showing it, then?” (Or, even worse, “Why didn’t you take the trouble to make them legible?”)
  • Don’t apologize for incomplete results. Researchers understand that all research continues. Just present the results and let the audience judge. It is okay to say, “work is on-going”. Do not say, “I’m sorry that work is not done.” This invites the audience to tune out or wonder why you are talking at all.

When Finished

  • Thank the audience for their attention
  • Gather your materials and move off quickly to allow the next presenter to prepare
  • Stay for the entire session

The following is from the post:

  1. When giving an invited talk at a general TCS conference, do not assume that everyone in the audience is interested in the technicalities of your subject. Focus on the main message, tell the story of the ideas and why you think they are important. Give everyone something to take home.
  2. Do not assume that you do not need to introduce the setting for your work because someone else has done it before or on an earlier conference day. Not everyone will have attended the talks where the background and motivation were presented.
  3. Do not run over time.
  4. Never speak with your hands on your mouth, even if it feels good 🙂
  5. Do not let your voice drop to an inaudible level as your sentence progresses. Dare to speak slowly and loudly.
  6. Ask yourself: How many slides do I really need for a 20-minute talk? Most of us will only use a few, and those should convey the message of the talk at a suitable level of abstraction.

The advice we give others is the advice we ourselves need.

eQTL mapping regresses the expression level of each gene against each SNP in order to find regulatory elements. And eQTL uses “normal” samples, right? (By normal I mean “no disease”, like those in the 1000 Genomes Project.)
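As a rough illustration of the regression idea above, the sketch below fits a least-squares line of one gene's expression on one SNP's genotype, coded as the number of minor-allele copies (0, 1 or 2). The additive coding, the function name `eqtl_fit`, and the toy numbers are assumptions made for illustration; a real eQTL analysis would also adjust for covariates and for the very large number of gene-SNP pairs tested.

```python
def eqtl_fit(genotypes, expression):
    """Fit expression = intercept + slope * genotype by ordinary least squares.

    genotypes: minor-allele counts (0, 1 or 2), one per sample (assumed coding).
    expression: expression level of one gene in the same samples.
    """
    if len(genotypes) != len(expression):
        raise ValueError("genotypes and expression must have the same length")
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    # Sum of squares of the genotype and its cross-product with expression.
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("SNP does not vary in this sample")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    return slope, intercept


# Toy data (made up): expression rises by one unit per minor-allele copy.
slope, intercept = eqtl_fit([0, 1, 2, 0, 1, 2], [1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
print(slope, intercept)
```

A slope far from zero, relative to its standard error, is what would flag the SNP as a candidate regulatory variant for that gene.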

GWAS compares SNPs between normal (control) and disease (test) samples, trying to find variants that occur at higher frequency in the disease group.
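For a single SNP, that comparison can be sketched as a chi-square test on a 2x2 table of allele counts. This is only one simple choice of test, not necessarily what any particular GWAS uses; the function name `allelic_association` and the counts are made up for illustration.

```python
def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Return (chi-square statistic, odds ratio) for a 2x2 table of allele counts.

    Rows are disease and control samples; columns are the alternate and
    reference allele. The statistic has 1 degree of freedom.
    """
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if allele and disease status were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    if case_ref == 0 or control_alt == 0:
        odds_ratio = float("inf")
    else:
        odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi2, odds_ratio


# Hypothetical counts: the alternate allele is seen 30 times in 100 disease
# chromosomes and 10 times in 100 control chromosomes.
chi2, odds_ratio = allelic_association(30, 70, 10, 90)
print(chi2, odds_ratio)
```

In a genome-wide scan the same test would be repeated for every SNP, so the significance threshold has to be far stricter than the usual 0.05.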

linkage mapping/recombination mapping/positional cloning – rely on known markers (typically SNPs) that are close to the gene responsible for a disease or trait to segregate with that marker within a family. Works great for high-penetrance, single gene traits and diseases.

QTL mapping/interval mapping – for quantitative traits like height that are polygenic. Same as linkage mapping except the phenotype is continuous and the markers are put into a scoring scheme to measure their contribution – i.e. “marker effects” or “allelic contribution”. Big in agriculture.

GWAS/linkage disequilibrium mapping – score thousands of SNPs at once from a population of unrelated individuals. Measure association with a disease or trait with the presumption that some markers are in LD with, or actually are, causative SNPs.

So linkage mapping and QTL mapping are similar in that they rely on Mendelian inheritance to isolate loci. QTL mapping and GWAS are similar in that they typically measure association in terms of log-odds along a genetic or physical map and do not assume one gene or locus is responsible. And finally, linkage mapping and GWAS are both concerned with categorical traits and diseases.
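The log-odds mentioned above is usually reported in linkage work as a LOD score. Here is a minimal sketch, assuming the textbook two-point setting in which each meiosis can be scored as recombinant or non-recombinant (phase known). The function name `lod_score` and the counts are illustrative only.

```python
import math


def lod_score(recombinants, meioses, theta):
    """Two-point LOD score: log10 of L(theta) / L(0.5).

    theta is the assumed recombination fraction between marker and trait
    locus, between 0 (complete linkage) and 0.5 (no linkage).
    """
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinants = meioses - recombinants
    likelihood_linked = theta ** recombinants * (1 - theta) ** non_recombinants
    likelihood_unlinked = 0.5 ** meioses
    if likelihood_linked == 0:
        # A recombinant was observed, which is impossible under theta = 0.
        return float("-inf")
    return math.log10(likelihood_linked / likelihood_unlinked)


# Hypothetical family data: 2 recombinants among 10 informative meioses.
print(lod_score(2, 10, 0.2))
```

By convention a LOD score of 3 or more is taken as evidence for linkage, so a single small family like the hypothetical one here would not be enough on its own.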

Linkage studies are performed when you have pedigrees of related individuals and a phenotype (such as breast cancer) that is present in some but not all of the family members. These individuals could be humans or animals; linkage in humans is studied using existing families, so no breeding is involved. For each locus, you tabulate cases where parents and children who do or don’t show the phenotype also have the same allele. Linkage studies are the most powerful approach when studying highly penetrant phenotypes, which means that if you have the allele you have a strong probability of exhibiting the phenotype. They can identify rare alleles that are present in small numbers of families, usually due to a founder mutation. Linkage is how you find an allele such as the mutations in BRCA1 associated with breast cancer.

Association studies are used when you don’t have pedigrees; here the statistical test is a logistic regression or a related test for trends. They work when the phenotype has much lower penetrance; they are in fact more powerful than linkage analysis in those cases, provided you have enough informative cases and matched controls. Association studies are how you find common, low penetrance alleles such as the variations in FGFR2 that confer small increases in breast cancer susceptibility.

In The Old Days, neither association tests nor linkage tests were “genome-wide”; there wasn’t a technically feasible or affordable way to test the whole genome at once. Studies were often performed at various levels of resolution as the locus associated with the phenotype was refined, and often with a small number of loci chosen because of prior knowledge or hunches. Now the most common way to perform these studies in humans is to use SNP chips that measure hundreds of thousands of loci spread across the whole genome, thus the name GWAS. The reason you’re testing “the whole genome” without sequencing the whole genome of each case and control is an important point that is a separate topic; if you don’t yet know how this works, start with the concept of Linkage Disequilibrium. I haven’t encountered the term GWLS myself, but I think it’s safe to say that this is just a way to indicate that the whole genome was queried for linkage to a phenotype.

Genomic Convergence of Genome-wide Investigations for Complex Traits

###############################################################################

The following comes from Khader Shameer:

The following articles were really useful for me to understand the concepts around GWAS.

I would recommend the following reviews to understand the concept and methods. Most of these reviews refer to the major studies, and specific details can be obtained from the individual papers. But you can get an overall idea of the concept, statistical methods and expected results of a GWAS from these review articles.

How to Interpret a Genome-wide Association Study

An easy-to-read review article that starts with basic concepts and discusses future prospects of GWAS: Genome-wide association studies and beyond.

A detailed introduction to basic concepts of GWAS from the perspective of vascular disease : Genome-wide Association Studies for Atherosclerotic Vascular Disease and Its Risk Factors

Great overview of the current state of GWAS studies: Genomewide Association Studies and Assessment of the Risk of Disease

Detailed overview of statistical methods : Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application

For a bioinformatics perspective, Jason Moore et al.’s review will be a good start: Bioinformatics Challenges for Genome-Wide Association Studies

Soumya Raychaudhuri’s review provides an overview of various approaches for interpreting variants from GWAS: Mapping rare and common causal alleles for complex human diseases.

A tutorial on statistical methods for population association studies

Online Resources: I would recommend starting with the GWAS page at Wikipedia, followed by the NIH FAQ on GWAS, the NHGRI Catalog of GWAS, dbGaP, GWAS Integrator and related questions at BioStar.

##############################################################################

For introductory material, the new blog Genomes Unzipped has a couple of great posts: (From Neilfws)

  1. Simplicity is hard to sell
  2. Self-Repairing Bayesian Inference
  3. Praxis and Ideology in Bayesian Data Analysis
  4. In-consistent Bayesian inference
  5. Big Data Generalized Linear Models with Revolution R Enterprise
  6. Quants, Models, and the Blame Game
  7. Fun with the googleVis Package for R
  8. Topological Data Analysis
  9. The Winners of the LaTeX and Graphics Contest 
  10. Is Machine Learning Losing Impact?
  11. Machine Learning Doesn’t Matter?
  12. Components of Statistical Thinking and Implications for Instruction and Assessment
  13. Xiao-Li Meng and Xianchao Xie rethink asymptotics
  14. Higgs boson and five sigma
  15. What is the Statistics Department 25 Years From Now?
  16. Statistics: Your chance for happiness (or misery)
  17. Manifolds: motivation and definition
  18. Why Emacs is important to me? : ESS and org-mode
  19. Interesting Emacs linkfest
  20. Devs Love Bacon: Everything you need to know about Machine Learning in 30 minutes or less
  21. Visualizing Galois Fields
  22. Visualizing Galois Fields (Follow-up)
  23. Statistical Reasoning on iTunes U
  24. Computing log gamma differences
  25. Where to start if you’re going to revise statistics
  26. Power laws and the generalized CLT
  27. Open problems in next-gen sequence analysis
  28. More equations, less citations?
  29. Talk: Some Introductory Remarks on Bayesian Inference
