
This Friday, there is a talk on Personalized Medicine and Artificial Intelligence, given by Michael Kosorok from the Department of Biostatistics at the University of North Carolina at Chapel Hill. The following materials could be helpful for getting some idea of this area:

Today I just watched an interesting video about the indestructibility of information and the nature of black holes, a talk given by Leonard Susskind of the Stanford Institute for Theoretical Physics.

Video: Leonard Susskind on The World As Hologram

And I also looked up some materials online:

World’s Most Precise Clocks Could Reveal Universe Is a Hologram

The World as a Hologram (A paper written by L. Susskind)

UC Berkeley’s Raphael Bousso presents a friendly introduction to the ideas behind the holographic principle, which may be very important in the hunt for a theory of quantum gravity.

Q: What are Bartlett corrections?

A: Strictly speaking, a Bartlett correction is a scalar transformation applied to the likelihood ratio (LR) statistic that yields a new, improved test statistic which has a chi-squared null distribution to order O(1/n). This represents a clear improvement over the original statistic in the sense that LR is distributed as chi-squared under the null hypothesis only to order O(1).

Q: Are there extensions of Bartlett corrections?

A: Yes. Some of them arose in response to Sir David Cox’s 1988 paper, “Some aspects of conditional and asymptotic inference: a review” (Sankhya A). A particularly useful one was proposed by Gauss Cordeiro and Silvia Ferrari in a 1991 Biometrika paper. They showed how to Bartlett-correct test statistics whose asymptotic null distribution is chi-squared, with special emphasis on Rao’s score statistic.

Q: Where can I find a survey paper on Bartlett corrections?

A: There are a few around. Two particularly useful ones are:

Q: What are the alternatives to Bartlett corrections?

A: There are several alternatives. Closely related are Edgeworth expansions, named after the economist and statistician Francis Ysidro Edgeworth. There are also saddlepoint expansions. A computer-intensive alternative, the bootstrap, was proposed by Bradley Efron in his 1979 Annals of Statistics paper.

Please refer to the Bartlett Corrections Page.
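To make the correction idea concrete, here is a minimal R sketch of a simulation-based Bartlett-type correction for the LR test of H0: rate = 1 in an exponential model (the model, sample size and the use of simulation to estimate the correction factor are my own illustrative choices, not taken from the page above): the LR statistic is rescaled so that its estimated null mean matches the degrees of freedom of the reference chi-squared distribution.

# Simulation-based Bartlett-type correction for the LR test of
# H0: rate = 1 against a free rate in an exponential model.
set.seed(1)

lr_stat <- function(x) {
  n <- length(x)
  rate_hat <- 1 / mean(x)                # MLE of the rate
  2 * (n * log(rate_hat) - n + sum(x))   # 2 * (loglik(rate_hat) - loglik(1))
}

n <- 10                                  # modest sample size
B <- 20000
lr_null <- replicate(B, lr_stat(rexp(n, rate = 1)))

df <- 1
c_hat <- mean(lr_null) / df              # estimated Bartlett factor, E[LR]/df

crit <- qchisq(0.95, df)
mean(lr_null > crit)                     # empirical size of the uncorrected nominal 5% test
mean(lr_null / c_hat > crit)             # empirical size after rescaling, should sit closer to 0.05

In practice the correction factor is usually derived analytically rather than by simulation; the rescaling above is just the simplest way to see what the correction does.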

From Wikipedia, we have the following:

In statistics, resampling is any of a variety of methods for doing one of the following:

  1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
  2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
  3. Validating models by using random subsets (bootstrapping, cross validation)

Common resampling techniques include bootstrapping, jackknifing and permutation tests; a short R sketch of the first and the last follows the list below.

  • Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.
  • Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error (variance) of a statistic, when a random sample of observations is used to calculate it. The basic idea behind the jackknife variance estimator lies in systematically recomputing the statistic estimate leaving out one or more observations at a time from the sample set. From this new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.
  • Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (a training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.
  • A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.
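Here is a small R sketch of the first and the last of these (a toy example of my own, not taken from the Wikipedia article): it bootstraps the standard error of a sample median and runs a simple permutation test for a difference in means.

set.seed(42)
x <- rexp(50)                                      # toy sample for the bootstrap
g1 <- rnorm(30, mean = 0)
g2 <- rnorm(30, mean = 0.5)

# Bootstrap: resample with replacement, recompute the statistic each time
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))
sd(boot_medians)                                   # bootstrap estimate of the SE of the median

# Permutation test: shuffle the group labels, recompute the test statistic
obs <- mean(g1) - mean(g2)
pooled <- c(g1, g2)
perm <- replicate(5000, {
  idx <- sample(length(pooled), length(g1))
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))                        # two-sided permutation p-value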

These days I am struggling with sequencing data. What do the data look like? What is the essential difference from microarray data for the statistician community: discrete versus continuous? So far I am still not clear about this. Could anyone help me with it? Thanks.

The following is a collection of related materials, and maybe something else:

  1. Talk on RNA-seq Analysis, presented by Wing H. Wong, at the Joint Statistical Meetings on August 3, 2010 in Vancouver, Canada.
  2. Julia Salzman, Hui Jiang and Wing Hung Wong (2011), Statistical Modeling of RNA-Seq Data. Statistical Science 2011, Vol. 26, No. 1, 62-83. doi: 10.1214/10-STS343.
  3. Hui Jiang and Wing Hung Wong (2009), Statistical Inferences for Isoform Expression in RNA-Seq.
  4. Wenxiu Ma and Wing Hung Wong (2011), The Analysis of ChIP-Seq Data.
  5. Genotype and SNP calling from next-generation sequencing data
  6. Saran Vardhanabhuti, Mingyao Li and Hongzhe Li (2011), A Hierarchical Bayesian Model for Estimating and Inferring Differential Isoform Expression for Multi-sample RNA-Seq Data
  7. BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data, 2011, Yuan Ji, Yanxun Xu, Qiong Zhang, Kam-Wah Tsui, Yuan Yuan, Clift Norris Jr., Shoudan Liang and Han Liang
  8. A new paper by Lin Wan, Xiting Yan, Ting Chen, and Fengzhu Sun, Biostatistics, published 21 February 2012: Modeling RNA degradation for RNA-Seq with applications
  9. Video 1 and video 2 for Analysis and design of RNA sequencing experiments for identifying isoform regulation
  1. R and presentations: a basic example of knitr and beamer: combine the knitr package with the LaTeX package beamer for presentation slides, instead of the Sweave package, because knitr is basically a better Sweave.
  2. Constructing Summary Statistics for Approximate Bayesian Computation: Semi-automatic ABC: a good paper worth learning and discussing. Many modern statistical applications involve inference for complex stochastic models, where it is easy to simulate from the models but impossible to calculate likelihoods. Approximate Bayesian computation (ABC) is a method of inference for such models. It replaces calculation of the likelihood with a step that simulates artificial data for different parameter values and compares summary statistics of the simulated data to summary statistics of the observed data (a toy rejection-sampling sketch follows this list).
  3. Elegant & fast data manipulation with data.table : Extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping and list columns.
  4. Ordinal Measures of Association: I have run into these statistics twice so far; most recently in a paper by Han Liu and coauthors.
  5. Table design: Almost every research paper and thesis in statistics contains at least some tables, yet students are rarely taught how to make good tables. While the principles of good graphics are slowly becoming part of a statistical education (although not an econometrics education!), the principles of good tables are often ignored.
  6. Dirichlet distribution
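To make the ABC idea in item 2 concrete, here is a toy R rejection sampler (my own illustrative example with a normal mean, where the likelihood is of course available and is simply ignored): draw parameters from the prior, simulate data for each draw, and keep the draws whose summary statistic lands close to the observed one.

set.seed(7)
y_obs <- rnorm(100, mean = 2, sd = 1)      # "observed" data
s_obs <- mean(y_obs)                       # summary statistic

n_sim <- 50000
theta <- rnorm(n_sim, mean = 0, sd = 5)    # draws from the prior on the mean
s_sim <- sapply(theta, function(th) mean(rnorm(100, mean = th, sd = 1)))

eps <- 0.05
post <- theta[abs(s_sim - s_obs) < eps]    # ABC rejection step
c(mean(post), sd(post))                    # approximate posterior mean and sd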

IMA meeting on Group Testing Designs, Algorithms, and Applications to Biology

Amihood Amir – Bar-Ilan University
http://u.cs.biu.ac.il/~amir/

 

Length Reduction via Polynomials
February 16, 2012 11:30 am – 12:30 pm

 

Keywords of the presentation: Sparse convolutions, polynomials, finite fields, length reduction.

 

Efficient handling of sparse data is a key challenge in Computer Science. Binary convolutions, such as the Fast Fourier Transform or the Walsh Transform, are a useful tool in many applications and are efficiently solved. In the last decade, several problems required efficient solution of sparse binary convolutions.

Both randomized and deterministic algorithms were developed for efficiently computing the sparse FFT. The key operation in all these algorithms was length reduction. The sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to date was the modulo function, since it preserves the location (of the “1” bits) up to cyclic shift.

In this paper we present a new method for length reduction – polynomials. We show that this method allows a faster deterministic computation of the sparse FFT than currently known in the literature. It also enables the development of an efficient algorithm for computing the binary sparse Walsh Transform. To our knowledge, this is the first such algorithm in the literature.

(Joint work with Oren Kappah, Ely Porat, and Amir Rothschild)


Sharon Aviran – University of California, Berkeley
http://bio.math.berkeley.edu/aviran/

 

RNA Structure Characterization from High-Throughput Chemical Mapping Experiments
February 13, 2012 3:45 pm – 4:45 pm

 

Keywords of the presentation: RNA structure characterization, high-throughput sequencing, maximum likelihood estimation

 

New regulatory roles continue to emerge for both natural and engineered noncoding RNAs, many of which have specific secondary and tertiary structures essential to their function. This highlights a growing need to develop technologies that enable rapid and accurate characterization of structural features within complex RNA populations. Yet, available structure characterization techniques that are reliable are also vastly limited by technological constraints, while the accuracy of popular computational methods is generally poor. These limitations thus pose a major barrier to the comprehensive determination of structure from sequence and thereby to the development of mechanistic understanding of transcriptome dynamics. To address this need, we have recently developed a high-throughput structure characterization technique, called SHAPE-Seq, which simultaneously measures quantitative, single nucleotide-resolution, secondary and tertiary structural information for hundreds of RNA molecules of arbitrary sequence. SHAPE-Seq combines selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE) chemical mapping with multiplexed paired-end deep sequencing of primer extension products. This generates millions of sequencing reads, which are then analyzed using a fully automated data analysis pipeline. Previous bioinformatics methods, in contrast, are laborious, heuristic, and expert-based, and thus prohibit high-throughput chemical mapping.

In this talk, I will review recent developments in experimental RNA structure characterization as well as advances in sequencing technologies. I will then describe the SHAPE-Seq technique, focusing on its automated data analysis method, which relies on a novel probabilistic model of a SHAPE-Seq experiment, adjoined by a rigorous maximum likelihood estimation framework. I will demonstrate the accuracy and simplicity of our approach as well as its applicability to a general class of chemical mapping techniques and to more traditional SHAPE experiments that use capillary electrophoresis to identify and quantify primer extension products.

This is joint work with Lior Pachter, Julius Lucks, Stefanie Mortimer, Shujun Luo, Cole Trapnell, Gary Schroth, Jennifer Doudna and Adam Arkin.


Mahdi Cheraghchi – Carnegie Mellon University
http://mahdi.ch

 

Improved Constructions for Non-adaptive Threshold Group Testing
February 15, 2012 3:45 pm – 4:15 pm

 

Keywords of the presentation: Group testing, Explicit constructions

 

The basic goal in combinatorial group testing is to identify a set of up to d defective items within a large population of size n >> d using a pooling strategy. Namely, the items can be grouped together in pools, and a single measurement would reveal whether there are one or more defectives in the pool. The threshold model is a generalization of this idea where a measurement returns positive if the number of defectives in the pool exceeds a fixed upper threshold u, negative if this number is below a fixed lower threshold l ≤ u, and may be arbitrary in between.



Ferdinando Cicalese – Università di Salerno
http://www.dia.unisa.it/~cicalese/

 

Competitive Testing for Evaluating Priced Functions
February 15, 2012 2:00 pm – 3:00 pm

 

Keywords of the presentation: function evaluation, competitive analysis, Boolean functions, decision trees, adaptive testing, non-uniform costs

 

The problem of evaluating a function by sequentially testing a subset of variables whose values uniquely identify the function’s value arises in several domains of computer science. A typical example is from the field of automatic diagnosis in Artificial Intelligence. Automatic diagnosis systems employ a sequence of tests and, based on their outcomes, provide a diagnosis (e.g., a patient has cancer or not, a component of a system is failing or not). In general, it is not necessary to perform all the available tests in order to determine the correct diagnosis. Since different tests might also incur very different costs, it is reasonable to look for testing procedures that minimize the cost incurred to produce the diagnosis. Another example is query optimization, a major issue in the area of databases. Query optimization refers to the problem of defining strategies to evaluate queries in order to minimize the user response time. In a typical database query, thousands of tuples are scanned to determine which of them match the query. By carefully defining a strategy to evaluate the attributes, one can save a considerable amount of computational time. In general, it is not necessary to evaluate all attributes of a tuple in order to determine whether it matches the query or not, and a smart choice of the attributes to evaluate first can avoid very expensive and/or redundant attribute evaluations. This talk will focus on the function evaluation problem where tests have non-uniform costs and competitive analysis is employed for measuring the algorithm’s performance. For monotone Boolean functions we will present an optimal algorithm, i.e., with the best possible competitiveness. We will also discuss the case of some classes of non-monotone functions.

Annalisa De Bonis – Università di Salerno
http://www.dia.unisa.it/~debonis/

 

A Group Testing Approach to Corruption Localizing Hashing
February 16, 2012 2:00 pm – 3:00 pm

 

Keywords of the presentation: Algorithms, Cryptography, Corruption-Localizing Hashing, Group Testing, Superimposed Codes.

 

Efficient detection of integrity violations is crucial for the reliability of both data at rest and data in transit. In addition to detecting corruptions, it is often desirable to have the capability of obtaining information about the location of corrupted data blocks. While ideally one would want to always find all changes in the corrupted data, in practice this capability may be expensive, and one may be content with localizing or finding a superset of any changes. Corruption-localizing hashing is a cryptographic primitive that enhances collision-intractable hash functions, thus improving the detection property of these functions into a suitable localization property. Besides allowing detection of changes in the input data, corruption-localizing hash schemes also obtain some superset of the changes’ locations, where the accuracy of this superset with respect to the actual changes is a metric of special interest, called localization factor. In this talk we will address the problem of designing corruption-localizing hash schemes with reduced localization factor and with small time and storage complexity. We will show that this problem corresponds to a particular variant of nonadaptive group testing, and illustrate a construction technique based on superimposed codes. This group testing approach allowed us to obtain corruption-localizing hash schemes that greatly improve on previous constructions. In particular, we will present a corruption-localizing hash scheme that achieves an arbitrarily small localization factor while only incurring a sublinear time and storage complexity.

Christian Deppe – Universität Bielefeld
http://www.math.uni-bielefeld.de/~cdeppe/

 

Poster – Finding one of m defective elements
February 14, 2012 4:15 pm – 5:30 pm

 

In contrast to the classical goal of group testing, we want to find one defective element out of $D$ defective elements. We examine four different test functions. We give adaptive strategies and lower bounds for the number of tests. We treat both the case where the number of defectives is known and the case where it is only bounded.

Yaniv Erlich – Whitehead Institute for Biomedical Research
http://jura.wi.mit.edu/erlich/main.html

 

Tutorial: Cost effective sequencing of rare genetic variations
February 13, 2012 10:00 am – 11:00 am

 

In the past few years, we have experienced a paradigm shift in human genetics. Accumulating lines of evidence have highlighted the pivotal role of rare genetic variations in a wide variety of traits and diseases. Studying rare variations is a needle-in-a-haystack problem, as large cohorts have to be assayed in order to trap the variations and gain statistical power. The performance of DNA sequencing is growing exponentially, providing sufficient capacity to profile an extensive number of specimens. However, sample preparation schemes do not scale with sequencing capacity. A brute-force approach of preparing hundreds to thousands of specimens for sequencing is cumbersome and cost-prohibitive. The next challenge, therefore, is to develop a scalable technique that circumvents the bottleneck in sample preparation.

My tutorial will provide background on rare genetic variations and DNA sequencing. I will present our sample preparation strategy, called DNA Sudoku, which utilizes a combinatorial pooling/compressed sensing approach to find rare genetic variations. More importantly, I will discuss several major distinctions from the classical combinatorial setting that arise due to sequencing-specific constraints.


Dina Esposito – Whitehead Institute for Biomedical Research
http://www.wi.mit.edu/

 

Mining Rare Human Variations using Combinatorial Pooling
February 13, 2012 3:15 pm – 3:45 pm

 

Keywords of the presentation: high-throughput sequencing, combinatorial pooling, liquid-handling, compressed sensing

 

Finding rare genetic variations in large cohorts requires tedious preparation of large numbers of specimens for sequencing. We are developing a solution, called DNA Sudoku, to reduce prep time and increase the throughput of samples. By using a combinatorial pooling approach, we multiplex specimens and then barcode the pools, rather than individuals, for sequencing in a single lane on the Illumina platform.

We have developed a protocol for quantifying, calibrating, and pooling DNA samples using a liquid-handling robot, which has required a significant amount of testing in order to reduce volume variation. I will discuss our protocol and the steps we have taken to reduce CV. For accurate decoding and to reduce the possibility of specimen dropout, it is important that the DNA samples are accurately quantified and calibrated so that equal amounts can be pooled and sequenced. We can determine the number of carriers in each pool from sequencing output and reconstruct the original identity of individual specimens based on the pooling design, allowing us to identify a small number of carriers in a large cohort.


Zoltan Furedi – Hungarian Academy of Sciences (MTA)
http://www.math.uiuc.edu/~z-furedi/

 

Superimposed codes
February 14, 2012 3:45 pm – 4:15 pm

 

There are many instances in Coding Theory when codewords must be restored from partial information, such as corrupted data (error-correcting codes) or some superposition of the strings. These lead to superimposed codes, a close relative of group testing problems.

There are lots of versions and related problems, like Sidon sets, sum-free sets, union-free families, locally thin families, cover-free codes and families, etc. We discuss two cases: cancellative and union-free codes.

A family of sets Ƒ (and the corresponding code of 0-1 vectors) is called union-free if A ∪ B = C ∪ D with A, B, C, D ∈ Ƒ implies {A, B} = {C, D}. Ƒ is called t-cancellative if for all distinct t + 2 members A1, … , At and B, C ∈ Ƒ,

A1 ∪ … ∪ At ∪ B ≠ A1 ∪ … ∪ At ∪ C.

Let ct(n) be the size of the largest t-cancellative code on n elements. We significantly improve the previous upper bounds of Körner and Sinaimeri, e.g., we show c2(n) ≤ 2^(0.322 n) (for n > n0).


Anna Gal – University of Texas at Austin
http://www.cs.utexas.edu/~panni/

 

Streaming algorithms for approximating the length of the longest increasing subsequence
February 17, 2012 9:00 am – 10:00 am

 

Keywords of the presentation: Data streams, lower bounds, longest increasing subsequence

 

The data stream model allows only one-way access to the data, possibly with several passes. This model is motivated by problems related to processing very large data sets, when it is desirable to keep only a small part of the data in active memory at any given point during the computation.

I will talk about proving lower bounds on how much space (memory) is necessary to still be able to solve the given task. I will focus on the problem of approximating the length of the longest increasing subsequence, which is a measure of how well the data is sorted.

Joint work with Parikshit Gopalan.
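For readers who just want to see the quantity being approximated, here is a plain offline R computation of the LIS length (a quadratic dynamic program; it stores the whole sequence, so it has nothing to do with the small-space streaming algorithms of the talk).

lis_length <- function(x) {
  n <- length(x)
  if (n == 0) return(0)
  d <- rep(1, n)                     # d[i]: length of the longest increasing subsequence ending at i
  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {
      if (x[j] < x[i]) d[i] <- max(d[i], d[j] + 1)
    }
  }
  max(d)
}

lis_length(c(3, 1, 4, 1, 5, 9, 2, 6))   # 4, e.g. the subsequence 1, 4, 5, 9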


Anna Gilbert – University of Michigan
http://www.math.lsa.umich.edu/~annacg/

 

Tutorial: Sparse signal recovery
February 13, 2012 11:15 am – 12:15 pm

 

Keywords of the presentation: streaming algorithms, group testing, sparse signal recovery, introduction.

 

My talk will be a tutorial about sparse signal recovery but, more importantly, I will provide an overview of what the research problems are at the intersection of biological applications of group testing, streaming algorithms, sparse signal recovery, and coding theory. The talk should help set the stage for the rest of the workshop.

David Golan – Tel Aviv University

 

Weighted Pooling – Simple and Effective Techniques for Pooled High Throughput Sequencing Design
February 15, 2012 11:30 am – 12:00 pm

 

Despite the rapid decrease in sequencing costs, sequencing a large group of individuals is still prohibitively expensive. Recently, several sophisticated pooling designs were suggested that can identify carriers of rare alleles in large cohorts with a significantly smaller number of lanes, thus dramatically reducing the cost of such large scale genotyping projects. These approaches all use combinatorial pooling designs where each individual is either present in a pool or absent from it. One can then infer the number of carriers in a pool, and by combining information across pools, reconstruct the identity of the carriers.

We show that one can gain further efficiency and cost reduction by using “weighted” designs, in which different individuals donate different amounts of DNA to the pools. Intuitively, in this situation the number of mutant reads in a pool indicates not only the number of carriers, but also the identity of the carriers.

We describe and study a simple but powerful example of such weighted designs, with non-overlapping pools. We demonstrate that even this naive approach is not only easier to implement and analyze but is also competitive in terms of accuracy with combinatorial designs when identifying very rare variants, and is superior to the combinatorial designs when genotyping more common variants.

We then discuss how weighting can be further incorporated into existing designs to increase their accuracy and demonstrate the resulting improvement in reconstruction efficiency using simulations. Finally, we argue that these weighted designs have enough power to facilitate detection of common alleles, so they can be used as a cornerstone of whole-exome or even whole-genome sequencing projects.
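As a toy illustration of why weights carry identity information (this is my own simplified, noise-free example with a single non-overlapping pool, not the design studied in the talk): give individual j a DNA weight of 2^(j-1), so that the total mutant signal encodes the carrier set as a binary number.

set.seed(3)
n <- 12                                    # individuals in one pool
w <- 2^(0:(n - 1))                         # weights: powers of two
carriers <- sort(sample(n, 2))             # unknown rare carriers
signal <- sum(w[carriers])                 # noise-free weighted mutant signal

decoded <- which((signal %/% w) %% 2 == 1) # read off the binary representation
identical(decoded, carriers)               # TRUE

Real sequencing data are noisy counts, so practical weighted designs use far gentler weight ratios and a statistical reconstruction step rather than exact binary decoding.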


Stefano Lonardi – University of California, Riverside
http://www.cs.ucr.edu/~stelo/

 

Combinatorial Pooling Enables Selective Sequencing of the Barley Gene Space
February 14, 2012 3:15 pm – 3:45 pm

 

Keywords of the presentation: combinatorial pooling, DNA sequencing and assembly, genomics, next-generation sequencing,

 

We propose a new sequencing protocol that combines recent advances in combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when dealing with hundreds or thousands of DNA samples, such as genome-tiling gene-rich BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is extremely accurate (99.57% of the deconvoluted reads are assigned to the correct BAC), and the resulting BAC assemblies have very high quality (BACs are covered by contigs over about 77% of their length, on average). Experimental results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate (almost 70% of left/right pairs in paired-end reads are assigned to the same BAC, despite being processed independently) and the BAC assemblies have good quality (the average sum of all assembled contigs is about 88% of the estimated BAC length).

Joint work with D. Duma (UCR), M. Alpert (UCR), F. Cordero (U of Torino), M. Beccuti (U of Torino), P. R. Bhat (UCR and Monsanto), Y. Wu (UCR and Google), G. Ciardo (UCR), B. Alsaihati (UCR), Y. Ma (UCR), S. Wanamaker (UCR), J. Resnik (UCR), and T. J. Close (UCR).

Preprint available at http://arxiv.org/abs/1112.4438



Mikhail B Malyutov – Northeastern University
http://www.math.neu.edu/~malioutov/

 

Poster – Upgraded Separate Testing of Inputs in Compressive Sensing
February 14, 2012 4:15 pm – 5:30 pm

 

Screening experiments (SE) deal with finding a small number s of Active Inputs (AIs) out of a vast total number t of inputs in a regression model. Of special interest in the SE theory is finding the so-called maximal rate (capacity). For a set of t elements, denote the set of its s-subsets by (s, t). Introduce a noisy system with t binary inputs and one output y, and suppose that only s of the t inputs are active. Let N_f(s, t, ε) denote the minimal number of measurements needed by analysis method f to recover the AIs with error probability at most ε. The f-capacity C_f(s) = lim_{t→∞} (log t / N_f(s, t, ε)), for any ε > 0, is the ‘limit for the performance of the f-analysis’. We obtained tight capacity bounds [4], [5] by formalizing CSS as a special case of the Multi-Access Communication Channel (MAC) capacity region construction developed by R. Ahlswede in [1], and by comparing CSS’ maximal rate (capacity) with small error for two practical methods of output analysis under the optimal CSS design, motivated by applications like [2]. Recovering Active Inputs with small error probability and accurate parameter estimation are both possible with rates less than capacity and impossible with larger rates. A staggering amount of attention was recently devoted to the study of compressive sensing and related areas using sparse priors in over-parameterized linear models, which may be viewed as a special case of our models with continuous input levels. The threshold phenomenon was empirically observed in early papers [3], [6]: as the dimension of a random instance of a problem grows, there is a sharp transition from successful recovery to failure as a function of the number of observations versus the dimension and sparsity of the unknown signal. Finding this threshold is closely related to our capacity evaluation. Some threshold bounds for compressive sensing were obtained using standard information-theoretic tools, e.g. in [15].

Mikhail B Malyutov – Northeastern University
http://www.math.neu.edu/~malioutov/

 

Greedy Separate Testing Sparse Inputs
February 16, 2012 3:15 pm – 3:45 pm

 

Lower and upper bounds, and simulated performance of three analysis methods under the asymptotically optimal random design X, in terms of the capacity C(s).

Olgica Milenkovic – University of Illinois at Urbana-Champaign
http://www.ece.illinois.edu/directory/profile.asp?milenkov

 

Probabilistic and combinatorial models for quantized group testing
February 15, 2012 3:15 pm – 3:45 pm

 

Keywords of the presentation: group testing, MAC channel, quantization, graphical models

 

We consider a novel group testing framework where defectives obey a probabilistic model in which the number and set of defectives are governed by a graphical model. We furthermore assume that the defectives have importance factors that influence their strength on the test outcomes and the accuracy with which the outcomes are read. This kind of scenario arises, for example, in MAC channels with networked users using different communication powers, or in DNA pooling schemes which involve large families.

Jelani Nelson – Princeton University
http://people.seas.harvard.edu/~minilek/

 

Sparser Johnson-Lindenstrauss Transforms
February 16, 2012 10:15 am – 11:15 am

 

Keywords of the presentation: dimensionality reduction, johnson-lindenstrauss, numerical linear algebra, massive data

 

The Johnson-Lindenstrauss (JL) lemma (1984) states that any n points in d-dimensional Euclidean space can be embedded into k = O((log n)/eps^2) dimensions so that all pairwise distances are preserved up to 1+/-eps. Furthermore, this embedding can be achieved via a linear mapping. The JL lemma is a useful tool for speeding up solutions to several high-dimensional problems: closest pair, nearest neighbor, diameter, minimum spanning tree, etc. It also speeds up some clustering and string processing algorithms, reduces the amount of storage required to store a dataset, and can be used to reduce memory required for numerical linear algebra problems such as linear regression and low rank approximation.

The original proofs of the JL lemma let the linear mapping be specified by a random dense k x d matrix (e.g. i.i.d. Gaussian entries). Thus, performing an embedding requires dense matrix-vector multiplication. We give the first construction of linear mappings for JL in which only a subconstant fraction of the embedding matrix is non-zero, regardless of how eps and n are related, thus always speeding up the embedding time. Previous constructions only achieved sparse embedding matrices for 1/eps >> log n.

This is joint work with Daniel Kane (Stanford).
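Here is a small R sketch of the random-projection idea behind the lemma (a dense Gaussian map and a naive sparse variant with random signs; this is illustrative only, not the sparse construction of the talk):

set.seed(11)
n <- 200; d <- 5000; eps <- 0.2
k <- ceiling(4 * log(n) / eps^2)                          # target dimension, up to constants

X <- matrix(rnorm(n * d), nrow = n)                       # n points in R^d

A <- matrix(rnorm(k * d), nrow = k) / sqrt(k)             # dense Gaussian JL map
Y <- X %*% t(A)

q <- 0.05                                                 # keep roughly 5% of the entries
S <- matrix(rbinom(k * d, 1, q) * sample(c(-1, 1), k * d, replace = TRUE),
            nrow = k) / sqrt(k * q)                       # naive sparse map
Z <- X %*% t(S)

pairs <- t(replicate(50, sample(n, 2)))                   # random pairs of distinct points
dist_ratio <- function(emb) sqrt(rowSums((emb[pairs[, 1], ] - emb[pairs[, 2], ])^2)) /
  sqrt(rowSums((X[pairs[, 1], ] - X[pairs[, 2], ])^2))
range(dist_ratio(Y)); range(dist_ratio(Z))                # both should stay roughly within 1 +/- eps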


Natasha Przulj – Imperial College London
http://www.doc.ic.ac.uk/~natasha/

 

Network topology as a source of biological information
February 14, 2012 10:15 am – 11:15 am

 

Keywords of the presentation: biological networks, graph algorithms, network alignment

 

Sequence-based computational approaches have revolutionized biological understanding. However, they can fail to explain some biological phenomena. Since proteins aggregate to perform a function instead of acting in isolation, the connectivity of a protein interaction network (PIN) will provide additional insight into the inner workings of the cell, over and above sequences of individual proteins. We argue that sequence and network topology give insights into complementary slices of biological information, which sometimes corroborate each other, but sometimes do not. Hence, the advancement depends on the development of sophisticated graph-theoretic methods for extracting biological knowledge purely from network topology before being integrated with other types of biological data (e.g., sequence). However, dealing with large networks is non-trivial, since many graph-theoretic problems are computationally intractable, so heuristic algorithms are sought.

Analogous to sequence alignments, alignments of biological networks will likely impact biomedical understanding. We introduce a family of topology-based network alignment (NA) algorithms (that we call GRAAL algorithms) that produce by far the most complete alignments of biological networks to date: our alignment of yeast and human PINs demonstrates that even distant species share a surprising amount of PIN topology. We show that both species phylogeny and protein function can be extracted from our topological NA. Furthermore, we demonstrate that the NA quality improves with integration of additional data sources (including sequence) into the alignment algorithm: surprisingly, 77.7% of proteins in the baker’s yeast PIN participate in a connected subnetwork that is fully contained in the human PIN, suggesting broad similarities in internal cellular wiring across all life on Earth. Also, we demonstrate that topology around cancer and non-cancer genes is different and, when integrated with functional genomics data, it successfully predicts new cancer genes in melanogenesis-related pathways.


Sylvie Ricard-Blum – Université Claude-Bernard (Lyon I)
http://www.ibcp.fr/scripts/affiche_detail.php?n_id=174

 

A dynamic and quantitative protein interaction network regulating angiogenesis
February 16, 2012 3:45 pm – 4:15 pm

 

Keywords of the presentation: protein interaction networks, kinetics, affinity, protein arrays, surface plasmon resonance

 

Angiogenesis, consisting in the formation of blood vessels from preexisting ones, is of crucial importance in pathological situations such as cancer and diabetes and is a therapeutic target for these two diseases. We have developed protein and sugar arrays probed by surface plasmon resonance (SPR) imaging to identify binding partners of proteins, polysaccharides and receptors regulating angiogenesis. Interactions collected from our own experiments and from literature curation have been used to build a network comprising protein-protein and protein-polysaccharide interactions. To switch from a static to a dynamic and quantitative interaction network, we have measured kinetics and affinity of interactions by SPR to discriminate transient from stable interactions and to prioritize interactions in the network. We have also identified protein interaction sites either experimentally or by molecular modeling to discriminate interactions occurring simultaneously from those taking place sequentially. The ultimate step is to integrate, using bioinformatics tools, all these parameters in the interaction network together with inhibitors of these interactions and with gene and protein expression data available from Array Express or Gene Expression Omnibus and from the Human Protein Atlas. This dynamic network will allow us to understand how angiogenesis is regulated in a concerted fashion via several receptors and signaling pathways, to identify crucial interactions for the integrity of the network that are potential therapeutic targets, and to predict the side effects of anti-angiogenic treatments.

Atri Rudra – University at Buffalo (SUNY)
http://www.cse.buffalo.edu/~atri/

 

Tutorial: Group Testing and Coding Theory
February 14, 2012 9:00 am – 10:00 am

 

Keywords of the presentation: Code Concatenation, Coding theory, Group Testing, List Decoding, List Recovery

 

Group testing was formalized by Dorfman in his 1943 paper and was originally used in WW-II to identify soldiers with syphilis. The main insight in this application is that blood samples from different soldiers can be combined to check if at least one of the soldiers in the pool has the disease. Since then, group testing has found numerous applications in many areas such as (computational) biology, combinatorics and (theoretical) computer science.

Theory of error-correcting codes, or coding theory, was born in the works of Shannon in 1948 and Hamming in 1950. Codes are ubiquitous in our daily life and have also found numerous applications in theoretical computer science in general and computational complexity in particular.

Kautz and Singleton connected these two areas in their 1964 paper by using “code concatenation” to design good group testing schemes. All of the (asymptotically) best known explicit constructions of group testing schemes use the code concatenation paradigm. In this talk, we will focus on the “decoding” problem for group testing: i.e., given the outcomes of the tests on the pools, identify the infected soldiers. Recent applications of group testing in data stream algorithms require sub-linear time decoding, which is not guaranteed by the traditional constructions.

The talk will first survey the Kautz-Singleton construction and then will show how recent developments in list decoding of codes lead in a modular way to sub-linear time decodable group testing schemes.
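As a reminder of what “decoding” means here, the following toy R sketch sets up a random non-adaptive pooling design and applies the simplest decoder, which declares an item defective only if every pool containing it tests positive (this naive decoder runs in time linear in the number of items, so it is emphatically not the sub-linear time scheme the talk is about).

set.seed(5)
n <- 500                                         # population size
d <- 5                                           # number of defectives
t <- 150                                         # number of pools (tests)

defective <- sort(sample(n, d))
M <- matrix(rbinom(t * n, 1, 1 / d), nrow = t)   # M[i, j] = 1 if item j is placed in pool i

outcome <- as.integer(M[, defective, drop = FALSE] %*% rep(1, d) > 0)   # pool positive iff it holds a defective

# Declare defective every item that never appears in a negative pool
declared <- which(colSums(M * (1 - outcome)) == 0)
setequal(declared, defective)                    # TRUE when the design has enough tests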


Vyacheslav V. Rykov – University of Nebraska
http://www.unomaha.edu/~wwwmath/faculty/rykov/index.html

 

Superimposed Codes and Designs for Group Testing Models
February 14, 2012 11:30 am – 12:30 pm

 

Keywords of the presentation: superimposed codes, screening designs, rate of code, threshold designs and codes

 

We will discuss superimposed codes and non-adaptive group testing designs arising from the potentialities of compressed genotyping models in molecular biology. The given survey is also motivated by the 30th anniversary of our recurrent upper bound on the rate of superimposed codes published in 1982.

Alex Samorodnitsky – Hebrew University
http://www.cs.huji.ac.il/~salex/

 

Testing Boolean functions
February 14, 2012 2:00 pm – 3:00 pm

 

I will talk about property testing of Boolean functions, concentrating on two topics: testing monotonicity (an example of a combinatorial property) and testing the property of being a low-degree polynomial (an algebraic property).

Sriram Sankararaman – Harvard Medical School
http://sriramsankararaman.com/

 

Genomic Privacy and the Limits of Individual Detection in a Pool 
February 15, 2012 10:15 am – 11:15 am

 

Keywords of the presentation: Genomewide association studies, Privacy, Pooled designs, Hypothesis testing, Local Asymptotic normality

 

Statistical power to detect associations in genome-wide association studies can be enhanced by combining data across studies in meta-analysis or replication studies. Such methods require data to flow freely in the scientific community, however, and this raises privacy concerns.

Until recently, many studies pooled individuals together, making only the allele frequencies of each SNP in the pool publicly available. However, a technique that could be used to detect the presence of individual genotypes from such data prompted organizations such as the NIH to restrict public access to summary data. To again allow public access to data from association studies, we need to determine which set of SNPs can be safely exposed while preserving an acceptable level of privacy.

To answer this question, we provide an upper bound on the power achievable by any detection method as a function of factors such as the number and the allele frequencies of exposed SNPs, the number of individuals in the pool, and the false positive rate of the method. Our approach is based on casting the problem in a statistical hypothesis testing framework for which the likelihood ratio test (LR-test) attains the maximal power achievable.

Our analysis provides quantitative guidelines for researchers to make SNPs public without compromising privacy. We recommend, based on our analysis, that only common independent SNPs be exposed. The final decision regarding the exposed SNPs should be based on the analytical bound in conjunction with empirical estimates of the power of the LR test. To this end, we have implemented a tool, SecureGenome, that determines the set of SNPs that can be safely exposed for a given dataset.


Alexander Schliep – Rutgers University
http://bioinformatics.rutgers.edu

 

From screening clone libraries to detecting biological agents
February 17, 2012 10:15 am – 11:15 am

 

Keywords of the presentation: (generalized) group testing, screening clone libraries, DNA microarrays

 

Group testing has made many contributions to modern molecular biology. In the Human Genome Project, large clone libraries were created to amplify DNA. Widely used group testing schemes vastly accelerated the detection of overlaps between the individual clones in these libraries through experiments, realizing savings both in effort and materials.

Modern molecular biology also contributed to group testing. The problem of generalized group testing (in the combinatorial sense) arises naturally when one uses oligonucleotide probes to identify biological agents present in a sample. In this setting a group testing design cannot be chosen arbitrarily. The possible columns of a group testing design matrix are prescribed by the biology, namely by the hybridization reactions between target DNA and probes.


Noam Shental – Open University of Israel
http://www.openu.ac.il/home/shental/

 

Identification of rare alleles and their carriers using compressed se(que)nsing
February 13, 2012 2:00 pm – 3:00 pm

 

Keywords of the presentation: compressed sensing, group testing, genetics, rare alleles

 

Identification of rare variants by resequencing is important both for detecting novel variations and for screening individuals for known disease alleles. New technologies enable low-cost resequencing of target regions, although it is still prohibitive to test more than a few individuals. We propose a novel pooling design that enables the recovery of novel or known rare alleles and their carriers in groups of individuals. The method is based on combining next-generation sequencing technology with a Compressed Sensing (CS) approach. The approach is general, simple and efficient, allowing for simultaneous identification of multiple variants and their carriers. It reduces experimental costs, i.e., both sample preparation related costs and direct sequencing costs, by up to 70-fold, thus allowing much larger cohorts to be scanned. We demonstrate the performance of our approach over several publicly available data sets, including the 1000 Genomes Pilot 3 study. We believe our approach may significantly improve the cost effectiveness of future association studies and of screening large DNA cohorts for specific risk alleles.

We will present initial results of two projects that were initiated following publication. The first project concerns identification of de novo SNPs in genetic disorders common among Ashkenazi Jews, based on sequencing 3000 DNA samples. The second project in plant genetics involves identifying SNPs related to water and silica homeostasis in Sorghum bicolor, based on sequencing 3000 DNA samples using 1-2 Illumina lanes.

Joint work with Amnon Amir from the Weizmann Institute of Science, and Or Zuk from the Broad Institute of MIT and Harvard


Nicolas Thierry-Mieg – Centre National de la Recherche Scientifique (CNRS)
http://membres-timc.imag.fr/Nicolas.Thierry-Mieg/

 

Shifted Transversal Design Smart-pooling: increasing sensitivity, specificity and efficiency in high-throughput biology
February 15, 2012 9:00 am – 10:00 am

 

Keywords of the presentation: combinatorial group testing, smart-pooling, interactome mapping

 

Group testing, also known as smart-pooling, is a promising strategy for achieving high efficiency, sensitivity, and specificity in systems-level projects. It consists in assaying well-chosen pools of probes, such that each probe is present in several pools, hence tested several times. The goal is to construct the pools so that the positive probes can usually be directly identified from the pattern of positive pools, despite the occurrence of false positives and false negatives. While striving for this goal, two interesting mathematical or computational problems emerge: the pooling problem (how should the pools be designed?), and the decoding problem (how to interpret the outcomes?). In this talk I will discuss these questions and the solutions we have proposed: a flexible and powerful combinatorial construction for designing smart-pools (the Shifted Transversal Design, STD), and an efficient exact algorithm for interpreting results (interpool). I will then present the results of validation experiments that we have performed in the context of yeast two-hybrid interactome mapping.



David P. Woodruff – IBM Research Division
http://www.almaden.ibm.com/cs/people/dpwoodru/

 

Tutorial: The Data Stream Model
February 16, 2012 9:00 am – 10:00 am

 

The data stream model has emerged as a way of analyzing algorithmic efficiency in the presence of massive data sets. Typically the algorithm is allowed a few (usually one) passes over the input, and must use limited memory and have very fast per input item processing time. I will give a survey of algorithms and lower bounds in this area, with an emphasis on problems such as norm estimation, numerical linear algebra, empirical entropy, l_p-sampling, and heavy hitters. Time-permitting I’ll also discuss the extension to functional monitoring, in which there are multiple sites each with a data stream and a central coordinator should continuously maintain a statistic over the union of the data streams.

Or Zuk – Broad Institute
http://www.weizmann.ac.il/home/fezuk/

 

Reconstruction of bacterial communities using sparse representation
February 17, 2012 11:30 am – 12:30 pm

 

Keywords of the presentation: metagenomics, sparse representation, compressed sequencing

 

Determining the identities and frequencies of species present in a sample is a central problem in metagenomics, with scientific, environmental and clinical implications.

A popular approach to the problem is sequencing the ribosomal 16S RNA gene in the sample using universal primers, and using variation in the gene’s sequence between different species to identify the species present in the sample. We present a novel framework for community reconstruction based on sparse representation: while millions of microorganisms are present on earth, with known 16S sequences stored in a database, only a small minority (typically a few hundred) are likely to be present in any given sample.

We discuss the statistical framework, the algorithms used, and the results in terms of accuracy and species resolution.
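A stripped-down R sketch of this kind of estimation (entirely my own toy version: a random matrix standing in for a database of reference 16S profiles, a sparse frequency vector, and a non-negative least-squares fit via box-constrained optim; the actual method in the talk works with real 16S sequences and dedicated sparse-recovery machinery):

set.seed(9)
n_species <- 200                              # species in the reference database
n_features <- 120                             # length of each species' summarized profile

A <- matrix(runif(n_features * n_species), nrow = n_features)
A <- sweep(A, 2, colSums(A), "/")             # each column: one species' normalized profile

f_true <- numeric(n_species)                  # sparse true frequency vector
f_true[sample(n_species, 4)] <- runif(4)
f_true <- f_true / sum(f_true)

y <- as.vector(A %*% f_true) + rnorm(n_features, sd = 1e-3)   # observed mixed profile

obj  <- function(f) sum((y - A %*% f)^2)
grad <- function(f) as.vector(-2 * t(A) %*% (y - A %*% f))
fit  <- optim(rep(1 / n_species, n_species), obj, grad,
              method = "L-BFGS-B", lower = 0, control = list(maxit = 500))
f_hat <- fit$par / sum(fit$par)

order(f_hat, decreasing = TRUE)[1:4]          # largest estimated frequencies...
which(f_true > 0)                             # ...should largely coincide with the true species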

There is a Bayesian Cake Club, where we can find a list of papers on Bayesian statistics:

The reading list given by Professor Xi’an on the topic of ABC (Approximate Bayesian Computation) convergence:
