
This Friday, there is a talk on Personalized Medicine and Artificial Intelligence, given by Michael Kosorok from the Department of Biostatistics at the University of North Carolina at Chapel Hill. The following materials could be helpful for getting some idea of this area:

Today I just watched an interesting video about the indestructibility of information and the nature of black holes, a talk given by Leonard Susskind of the Stanford Institute for Theoretical Physics.

Video: Leonard Susskind on The World As Hologram

And I also looked up some materials online:

World’s Most Precise Clocks Could Reveal Universe Is a Hologram

The World as a Hologram (A paper written by L. Susskind)

UC Berkeley’s Raphael Bousso presents a friendly introduction to the ideas behind the holographic principle, which may be very important in the hunt for a theory of quantum gravity.

Q: What are Bartlett corrections?

A: Strictly speaking, a Bartlett correction is a scalar transformation applied to the likelihood ratio (LR) statistic that yields a new, improved test statistic which has a chi-squared null distribution to order O(1/n). This represents a clear improvement over the original statistic in the sense that LR is distributed as chi-squared under the null hypothesis only to order O(1).

Q: Are there extensions of Bartlett corrections?

A: Yes. Some of them arose in response to Sir David Cox’s 1988 paper, “Some aspects of conditional and asymptotic inference: a review” (Sankhya A). A particularly useful one was proposed by Gauss Cordeiro and Silvia Ferrari in a 1991 Biometrika paper. They showed how to Bartlett-correct test statistics whose asymptotic null distribution is chi-squared, with special emphasis on Rao’s score statistic.

Q: Where can I find a survey paper on Bartlett corrections?

A: There are a few around. Two particularly useful ones are:

Q: What are the alternatives to Bartlett corrections?

A: There are several alternatives. Closely related are Edgeworth expansions, named after the economist and statistician Francis Ysidro Edgeworth. There are also saddlepoint expansions. A computer-intensive alternative, the bootstrap, was proposed by Bradley Efron in his 1979 Annals of Statistics paper.

Please refer to the Bartlett Corrections Page.
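To make the correction idea concrete, here is a minimal R sketch of a simulation-based Bartlett-type correction for the LR test of H0: rate = 1 in an exponential model (the model, sample size and the use of simulation to estimate the correction factor are my own illustrative choices, not taken from the page above): the LR statistic is rescaled so that its estimated null mean matches the degrees of freedom of the reference chi-squared distribution.

# Simulation-based Bartlett-type correction for the LR test of
# H0: rate = 1 against a free rate in an exponential model.
set.seed(1)

lr_stat <- function(x) {
  n <- length(x)
  rate_hat <- 1 / mean(x)                # MLE of the rate
  2 * (n * log(rate_hat) - n + sum(x))   # 2 * (loglik(rate_hat) - loglik(1))
}

n <- 10                                  # modest sample size
B <- 20000
lr_null <- replicate(B, lr_stat(rexp(n, rate = 1)))

df <- 1
c_hat <- mean(lr_null) / df              # estimated Bartlett factor, E[LR]/df

crit <- qchisq(0.95, df)
mean(lr_null > crit)                     # empirical size of the uncorrected nominal 5% test
mean(lr_null / c_hat > crit)             # empirical size after rescaling, should sit closer to 0.05

In practice the correction factor is usually derived analytically rather than by simulation; the rescaling above is just the simplest way to see what the correction does.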

From Wikipedia, we have the following:

In statistics, resampling is any of a variety of methods for doing one of the following:

  1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
  2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
  3. Validating models by using random subsets (bootstrapping, cross validation)

Common resampling techniques include bootstrapping, jackknifing and permutation tests; a short R sketch of the first and the last follows the list below.

  • Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.
  • Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error (variance) of a statistic, when a random sample of observations is used to calculate it. The basic idea behind the jackknife variance estimator lies in systematically recomputing the statistic estimate leaving out one or more observations at a time from the sample set. From this new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.
  • Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (a training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.
  • A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.
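Here is a small R sketch of the first and the last of these (a toy example of my own, not taken from the Wikipedia article): it bootstraps the standard error of a sample median and runs a simple permutation test for a difference in means.

set.seed(42)
x <- rexp(50)                                      # toy sample for the bootstrap
g1 <- rnorm(30, mean = 0)
g2 <- rnorm(30, mean = 0.5)

# Bootstrap: resample with replacement, recompute the statistic each time
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))
sd(boot_medians)                                   # bootstrap estimate of the SE of the median

# Permutation test: shuffle the group labels, recompute the test statistic
obs <- mean(g1) - mean(g2)
pooled <- c(g1, g2)
perm <- replicate(5000, {
  idx <- sample(length(pooled), length(g1))
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))                        # two-sided permutation p-value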

These days I am struggling with sequencing data. What do the data look like? What is the essential difference from microarray data for the statistician community: discrete versus continuous? So far I am still not clear about this. Could anyone help me with it? Thanks.

The following is a collection of related materials, and maybe something else:

  1. Talk on RNA-seq Analysis, presented by Wing H. Wong, at the Joint Statistical Meetings on August 3, 2010 in Vancouver, Canada.
  2. Julia Salzman, Hui Jiang and Wing Hung Wong (2011), Statistical Modeling of RNA-Seq Data. Statistical Science 2011, Vol. 26, No. 1, 62-83. doi: 10.1214/10-STS343.
  3. Hui Jiang and Wing Hung Wong (2009), Statistical Inferences for Isoform Expression in RNA-Seq.
  4. Wenxiu Ma and Wing Hung Wong (2011), The Analysis of ChIP-Seq Data.
  5. Genotype and SNP calling from next-generation sequencing data
  6. Saran Vardhanabhuti, Mingyao Li and Hongzhe Li (2011), A Hierarchical Bayesian Model for Estimating and Inferring Differential Isoform Expression for Multi-sample RNA-Seq Data
  7. BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data, 2011, Yuan Ji, Yanxun Xu, Qiong Zhang, Kam-Wah Tsui, Yuan Yuan, Clift Norris Jr., Shoudan Liang and Han Liang
  8. A new paper by Lin Wan, Xiting Yan, Ting Chen, and Fengzhu Sun, Biostatistics, published 21 February 2012: Modeling RNA degradation for RNA-Seq with applications
  9. Video 1 and video 2 for Analysis and design of RNA sequencing experiments for identifying isoform regulation
  1. R and presentations: a basic example of knitr and beamer: combine the knitr package with the LaTeX package beamer for presentation slides, instead of the Sweave package, because knitr is basically a better Sweave.
  2. Constructing Summary Statistics for Approximate Bayesian Computation: Semi-automatic ABC: a good paper worth learning and discussing. Many modern statistical applications involve inference for complex stochastic models, where it is easy to simulate from the models but impossible to calculate likelihoods. Approximate Bayesian computation (ABC) is a method of inference for such models. It replaces calculation of the likelihood with a step that simulates artificial data for different parameter values and compares summary statistics of the simulated data to summary statistics of the observed data (a toy rejection-sampling sketch follows this list).
  3. Elegant & fast data manipulation with data.table : Extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping and list columns.
  4. Ordinal Measures of Association: I have run into these statistics twice so far; most recently in a paper by Han Liu and coauthors.
  5. Table design: Almost every research paper and thesis in statistics contains at least some tables, yet students are rarely taught how to make good tables. While the principles of good graphics are slowly becoming part of a statistical education (although not an econometrics education!), the principles of good tables are often ignored.
  6. Dirichlet distribution
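To make the ABC idea in item 2 concrete, here is a toy R rejection sampler (my own illustrative example with a normal mean, where the likelihood is of course available and is simply ignored): draw parameters from the prior, simulate data for each draw, and keep the draws whose summary statistic lands close to the observed one.

set.seed(7)
y_obs <- rnorm(100, mean = 2, sd = 1)      # "observed" data
s_obs <- mean(y_obs)                       # summary statistic

n_sim <- 50000
theta <- rnorm(n_sim, mean = 0, sd = 5)    # draws from the prior on the mean
s_sim <- sapply(theta, function(th) mean(rnorm(100, mean = th, sd = 1)))

eps <- 0.05
post <- theta[abs(s_sim - s_obs) < eps]    # ABC rejection step
c(mean(post), sd(post))                    # approximate posterior mean and sd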

IMA meeting on Group Testing Designs, Algorithms, and Applications to Biology

Amihood Amir – Bar-Ilan University
http://u.cs.biu.ac.il/~amir/

 

Length Reduction via Polynomials
February 16, 2012 11:30 am – 12:30 pm

 

Keywords of the presentation: Sparse convolutions, polynomials, finite fields, length reduction.

 

Efficient handling of sparse data is a key challenge in Computer Science. Binary convolutions, such as the Fast Fourier Transform or the Walsh Transform, are a useful tool in many applications and are efficiently solved. In the last decade, several problems required efficient solution of sparse binary convolutions.

Both randomized and deterministic algorithms were developed for efficiently computing the sparse FFT. The key operation in all these algorithms was length reduction. The sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to date was the modulo function, since it preserves the location (of the “1” bits) up to cyclic shift.

In this paper we present a new method for length reduction – polynomials. We show that this method allows a faster deterministic computation of the sparse FFT than currently known in the literature. It also enables the development of an efficient algorithm for computing the binary sparse Walsh Transform. To our knowledge, this is the first such algorithm in the literature.

(Joint work with Oren Kappah, Ely Porat, and Amir Rothschild)


Sharon Aviran – University of California, Berkeley
http://bio.math.berkeley.edu/aviran/

 

RNA Structure Characterization from High-Throughput Chemical Mapping Experiments
February 13, 2012 3:45 pm – 4:45 pm

 

Keywords of the presentation: RNA structure characterization, high-throughput sequencing, maximum likelihood estimation

 

New regulatory roles continue to emerge for both natural and engineered noncoding RNAs, many of which have specific secondary and tertiary structures essential to their function. This highlights a growing need to develop technologies that enable rapid and accurate characterization of structural features within complex RNA populations. Yet, available structure characterization techniques that are reliable are also vastly limited by technological constraints, while the accuracy of popular computational methods is generally poor. These limitations thus pose a major barrier to the comprehensive determination of structure from sequence and thereby to the development of mechanistic understanding of transcriptome dynamics. To address this need, we have recently developed a high-throughput structure characterization technique, called SHAPE-Seq, which simultaneously measures quantitative, single nucleotide-resolution, secondary and tertiary structural information for hundreds of RNA molecules of arbitrary sequence. SHAPE-Seq combines selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE) chemical mapping with multiplexed paired-end deep sequencing of primer extension products. This generates millions of sequencing reads, which are then analyzed using a fully automated data analysis pipeline. Previous bioinformatics methods, in contrast, are laborious, heuristic, and expert-based, and thus prohibit high-throughput chemical mapping.

In this talk, I will review recent developments in experimental RNA structure characterization as well as advances in sequencing technologies. I will then describe the SHAPE-Seq technique, focusing on its automated data analysis method, which relies on a novel probabilistic model of a SHAPE-Seq experiment, adjoined by a rigorous maximum likelihood estimation framework. I will demonstrate the accuracy and simplicity of our approach as well as its applicability to a general class of chemical mapping techniques and to more traditional SHAPE experiments that use capillary electrophoresis to identify and quantify primer extension products.

This is joint work with Lior Pachter, Julius Lucks, Stefanie Mortimer, Shujun Luo, Cole Trapnell, Gary Schroth, Jennifer Doudna and Adam Arkin.


Mahdi Cheraghchi – Carnegie Mellon University
http://mahdi.ch

 

Improved Constructions for Non-adaptive Threshold Group Testing
February 15, 2012 3:45 pm – 4:15 pm

 

Keywords of the presentation: Group testing, Explicit constructions

 

The basic goal in combinatorial group testing is to identify a set of up to d defective items within a large population of size n >> d using a pooling strategy. Namely, the items can be grouped together in pools, and a single measurement would reveal whether there are one or more defectives in the pool. The threshold model is a generalization of this idea where a measurement returns positive if the number of defectives in the pool exceeds a fixed upper threshold u, negative if this number is below a fixed lower threshold l ≤ u, and may be arbitrary in between.



Ferdinando Cicalese – Università di Salerno
http://www.dia.unisa.it/~cicalese/

 

Competitive Testing for Evaluating Priced Functions
February 15, 2012 2:00 pm – 3:00 pm

 

Keywords of the presentation: function evaluation, competitive analysis, Boolean functions, decision trees, adaptive testing, non-uniform costs

 

The problem of evaluating a function by sequentially testing a subset of variables whose values uniquely identify the function’s value arises in several domains of computer science. A typical example is from the field of automatic diagnosis in Artificial Intelligence. Automatic diagnosis systems employ a sequence of tests and, based on their outcomes, provide a diagnosis (e.g., a patient has cancer or not, a component of a system is failing or not). In general, it is not necessary to perform all the available tests in order to determine the correct diagnosis. Since different tests might also incur very different costs, it is reasonable to look for testing procedures that minimize the cost incurred to produce the diagnosis. Another example is query optimization, a major issue in the area of databases. Query optimization refers to the problem of defining strategies to evaluate queries in order to minimize the user response time. In a typical database query, thousands of tuples are scanned to determine which of them match the query. By carefully defining a strategy to evaluate the attributes, one can save a considerable amount of computational time. In general, it is not necessary to evaluate all attributes of a tuple in order to determine whether it matches the query or not, and a smart choice of the attributes to evaluate first can avoid very expensive and/or redundant attribute evaluations. This talk will focus on the function evaluation problem where tests have non-uniform costs and competitive analysis is employed for measuring the algorithm’s performance. For monotone Boolean functions we will present an optimal algorithm, i.e., with the best possible competitiveness. We will also discuss the case of some classes of non-monotone functions.

Annalisa De Bonis – Università di Salerno
http://www.dia.unisa.it/~debonis/

 

A Group Testing Approach to Corruption Localizing Hashing
February 16, 2012 2:00 pm – 3:00 pm

 

Keywords of the presentation: Algorithms, Cryptography, Corruption-Localizing Hashing, Group Testing, Superimposed Codes.

 

Efficient detection of integrity violations is crucial for the reliability of both data at rest and data in transit. In addition to detecting corruptions, it is often desirable to have the capability of obtaining information about the location of corrupted data blocks. While ideally one would want to always find all changes in the corrupted data, in practice this capability may be expensive, and one may be content with localizing or finding a superset of any changes. Corruption-localizing hashing is a cryptographic primitive that enhances collision-intractable hash functions, thus improving the detection property of these functions into a suitable localization property. Besides allowing detection of changes in the input data, corruption-localizing hash schemes also obtain some superset of the changes’ locations, where the accuracy of this superset with respect to the actual changes is a metric of special interest, called localization factor. In this talk we will address the problem of designing corruption-localizing hash schemes with reduced localization factor and with small time and storage complexity. We will show that this problem corresponds to a particular variant of nonadaptive group testing, and illustrate a construction technique based on superimposed codes. This group testing approach allowed us to obtain corruption-localizing hash schemes that greatly improve on previous constructions. In particular, we will present a corruption-localizing hash scheme that achieves an arbitrarily small localization factor while only incurring a sublinear time and storage complexity.

Christian Deppe – Universität Bielefeld
http://www.math.uni-bielefeld.de/~cdeppe/

 

Poster – Finding one of m defective elements
February 14, 2012 4:15 pm – 5:30 pm

 

In contrast to the classical goal of group testing, we want to find one defective element out of $D$ defective elements. We examine four different test functions. We give adaptive strategies and lower bounds for the number of tests. We treat both the case where the number of defectives is known and the case where it is only bounded.

Yaniv Erlich – Whitehead Institute for Biomedical Research
http://jura.wi.mit.edu/erlich/main.html

 

Tutorial: Cost effective sequencing of rare genetic variations
February 13, 2012 10:00 am – 11:00 am

 

In the past few years, we have experienced a paradigm shift in human genetics. Accumulating lines of evidence have highlighted the pivotal role of rare genetic variations in a wide variety of traits and diseases. Studying rare variations is a needle-in-a-haystack problem, as large cohorts have to be assayed in order to trap the variations and gain statistical power. The performance of DNA sequencing is growing exponentially, providing sufficient capacity to profile an extensive number of specimens. However, sample preparation schemes do not scale with sequencing capacity. A brute-force approach of preparing hundreds to thousands of specimens for sequencing is cumbersome and cost-prohibitive. The next challenge, therefore, is to develop a scalable technique that circumvents the bottleneck in sample preparation.

My tutorial will provide background on rare genetic variations and DNA sequencing. I will present our sample preparation strategy, called DNA Sudoku, which utilizes a combinatorial pooling/compressed sensing approach to find rare genetic variations. More importantly, I will discuss several major distinctions from the classical combinatorial setting that arise due to sequencing-specific constraints.


Dina Esposito – Whitehead Institute for Biomedical Research
http://www.wi.mit.edu/

 

Mining Rare Human Variations using Combinatorial Pooling
February 13, 2012 3:15 pm – 3:45 pm

 

Keywords of the presentation: high-throughput sequencing, combinatorial pooling, liquid-handling, compressed sensing

 

Finding rare genetic variations in large cohorts requires tedious preparation of large numbers of specimens for sequencing. We are developing a solution, called DNA Sudoku, to reduce prep time and increase the throughput of samples. By using a combinatorial pooling approach, we multiplex specimens and then barcode the pools, rather than individuals, for sequencing in a single lane on the Illumina platform.

We have developed a protocol for quantifying, calibrating, and pooling DNA samples using a liquid-handling robot, which has required a significant amount of testing in order to reduce volume variation. I will discuss our protocol and the steps we have taken to reduce CV. For accurate decoding and to reduce the possibility of specimen dropout, it is important that the DNA samples are accurately quantified and calibrated so that equal amounts can be pooled and sequenced. We can determine the number of carriers in each pool from sequencing output and reconstruct the original identity of individual specimens based on the pooling design, allowing us to identify a small number of carriers in a large cohort.


Zoltan Furedi – Hungarian Academy of Sciences (MTA)
http://www.math.uiuc.edu/~z-furedi/

 

Superimposed codes
February 14, 2012 3:45 pm – 4:15 pm

 

There are many instances in Coding Theory when codewords must be restored from partial information, such as corrupted data (error-correcting codes) or some superposition of the strings. These lead to superimposed codes, a close relative of group testing problems.

There are lots of versions and related problems, like Sidon sets, sum-free sets, union-free families, locally thin families, cover-free codes and families, etc. We discuss two cases: cancellative and union-free codes.

A family of sets Ƒ (and the corresponding code of 0-1 vectors) is called union-free if A ∪ B = C ∪ D with A, B, C, D ∈ Ƒ implies {A, B} = {C, D}. Ƒ is called t-cancellative if for all distinct t + 2 members A1, … , At and B, C ∈ Ƒ,

A1 ∪ … ∪ At ∪ B ≠ A1 ∪ … ∪ At ∪ C.

Let ct(n) be the size of the largest t-cancellative code on n elements. We significantly improve the previous upper bounds of Körner and Sinaimeri, e.g., we show c2(n) ≤ 2^(0.322 n) (for n > n0).


Anna Gal – University of Texas at Austin
http://www.cs.utexas.edu/~panni/

 

Streaming algorithms for approximating the length of the longest increasing subsequence
February 17, 2012 9:00 am – 10:00 am

 

Keywords of the presentation: Data streams, lower bounds, longest increasing subsequence

 

The data stream model allows only one-way access to the data, possibly with several passes. This model is motivated by problems related to processing very large data sets, when it is desirable to keep only a small part of the data in active memory at any given point during the computation.

I will talk about proving lower bounds on how much space (memory) is necessary to still be able to solve the given task. I will focus on the problem of approximating the length of the longest increasing subsequence, which is a measure of how well the data is sorted.

Joint work with Parikshit Gopalan.
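For readers who just want to see the quantity being approximated, here is a plain offline R computation of the LIS length (a quadratic dynamic program; it stores the whole sequence, so it has nothing to do with the small-space streaming algorithms of the talk).

lis_length <- function(x) {
  n <- length(x)
  if (n == 0) return(0)
  d <- rep(1, n)                     # d[i]: length of the longest increasing subsequence ending at i
  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {
      if (x[j] < x[i]) d[i] <- max(d[i], d[j] + 1)
    }
  }
  max(d)
}

lis_length(c(3, 1, 4, 1, 5, 9, 2, 6))   # 4, e.g. the subsequence 1, 4, 5, 9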


Anna Gilbert – University of Michigan
http://www.math.lsa.umich.edu/~annacg/

 

Tutorial: Sparse signal recovery
February 13, 2012 11:15 am – 12:15 pm

 

Keywords of the presentation: streaming algorithms, group testing, sparse signal recovery, introduction.

 

My talk will be a tutorial about sparse signal recovery but, more importantly, I will provide an overview of what the research problems are at the intersection of biological applications of group testing, streaming algorithms, sparse signal recovery, and coding theory. The talk should help set the stage for the rest of the workshop.

David Golan – Tel Aviv University

 

Weighted Pooling – Simple and Effective Techniques for Pooled High Throughput Sequencing Design
February 15, 2012 11:30 am – 12:00 pm

 

Despite the rapid decrease in sequencing costs, sequencing a large group of individuals is still prohibitively expensive. Recently, several sophisticated pooling designs were suggested that can identify carriers of rare alleles in large cohorts with a significantly smaller number of lanes, thus dramatically reducing the cost of such large scale genotyping projects. These approaches all use combinatorial pooling designs where each individual is either present in a pool or absent from it. One can then infer the number of carriers in a pool, and by combining information across pools, reconstruct the identity of the carriers.

We show that one can gain further efficiency and cost reduction by using “weighted” designs, in which different individuals donate different amounts of DNA to the pools. Intuitively, in this situation the number of mutant reads in a pool indicates not only the number of carriers, but also the identity of the carriers.

We describe and study a simple but powerful example of such weighted designs, with non-overlapping pools. We demonstrate that even this naive approach is not only easier to implement and analyze but is also competitive in terms of accuracy with combinatorial designs when identifying very rare variants, and is superior to the combinatorial designs when genotyping more common variants.

We then discuss how weighting can be further incorporated into existing designs to increase their accuracy and demonstrate the resulting improvement in reconstruction efficiency using simulations. Finally, we argue that these weighted designs have enough power to facilitate detection of common alleles, so they can be used as a cornerstone of whole-exome or even whole-genome sequencing projects.
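As a toy illustration of why weights carry identity information (this is my own simplified, noise-free example with a single non-overlapping pool, not the design studied in the talk): give individual j a DNA weight of 2^(j-1), so that the total mutant signal encodes the carrier set as a binary number.

set.seed(3)
n <- 12                                    # individuals in one pool
w <- 2^(0:(n - 1))                         # weights: powers of two
carriers <- sort(sample(n, 2))             # unknown rare carriers
signal <- sum(w[carriers])                 # noise-free weighted mutant signal

decoded <- which((signal %/% w) %% 2 == 1) # read off the binary representation
identical(decoded, carriers)               # TRUE

Real sequencing data are noisy counts, so practical weighted designs use far gentler weight ratios and a statistical reconstruction step rather than exact binary decoding.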


Stefano Lonardi – University of California, Riverside
http://www.cs.ucr.edu/~stelo/

 

Combinatorial Pooling Enables Selective Sequencing of the Barley Gene Space
February 14, 2012 3:15 pm – 3:45 pm

 

Keywords of the presentation: combinatorial pooling, DNA sequencing and assembly, genomics, next-generation sequencing,

 

We propose a new sequencing protocol that combines recent advances in combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when dealing with hundreds or thousands of DNA samples, such as genome-tiling gene-rich BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is extremely accurate (99.57% of the deconvoluted reads are assigned to the correct BAC), and the resulting BAC assemblies have very high quality (BACs are covered by contigs over about 77% of their length, on average). Experimental results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate (almost 70% of left/right pairs in paired-end reads are assigned to the same BAC, despite being processed independently) and the BAC assemblies have good quality (the average sum of all assembled contigs is about 88% of the estimated BAC length).

Joint work with D. Duma (UCR), M. Alpert (UCR), F. Cordero (U of Torino), M. Beccuti (U of Torino), P. R. Bhat (UCR and Monsanto), Y. Wu (UCR and Google), G. Ciardo (UCR), B. Alsaihati (UCR), Y. Ma (UCR), S. Wanamaker (UCR), J. Resnik (UCR), and T. J. Close (UCR).

Preprint available at http://arxiv.org/abs/1112.4438



Mikhail B Malyutov – Northeastern University
http://www.math.neu.edu/~malioutov/

 

Poster – Upgraded Separate Testing of Inputs in Compressive Sensing
February 14, 2012 4:15 pm – 5:30 pm

 

Screening experiments (SE) deal with finding a small number s of Active Inputs (AIs) out of a vast total number t of inputs in a regression model. Of special interest in the SE theory is finding the so-called maximal rate (capacity). For a set of t elements, denote the set of its s-subsets by (s, t). Introduce a noisy system with t binary inputs and one output y, and suppose that only s of the t inputs are active. Let N_f(s, t, ε) denote the minimal number of measurements needed by analysis method f to recover the AIs with error probability at most ε. The f-capacity C_f(s) = lim_{t→∞} (log t / N_f(s, t, ε)), for any ε > 0, is the ‘limit for the performance of the f-analysis’. We obtained tight capacity bounds [4], [5] by formalizing CSS as a special case of the Multi-Access Communication Channel (MAC) capacity region construction developed by R. Ahlswede in [1], and by comparing CSS’ maximal rate (capacity) with small error for two practical methods of output analysis under the optimal CSS design, motivated by applications like [2]. Recovering Active Inputs with small error probability and accurate parameter estimation are both possible with rates less than capacity and impossible with larger rates. A staggering amount of attention was recently devoted to the study of compressive sensing and related areas using sparse priors in over-parameterized linear models, which may be viewed as a special case of our models with continuous input levels. The threshold phenomenon was empirically observed in early papers [3], [6]: as the dimension of a random instance of a problem grows, there is a sharp transition from successful recovery to failure as a function of the number of observations versus the dimension and sparsity of the unknown signal. Finding this threshold is closely related to our capacity evaluation. Some threshold bounds for compressive sensing were obtained using standard information-theoretic tools, e.g. in [15].

Mikhail B Malyutov – Northeastern University
http://www.math.neu.edu/~malioutov/

 

Greedy Separate Testing Sparse Inputs
February 16, 2012 3:15 pm – 3:45 pm

 

Lower and upper bounds, and simulated performance of three analysis methods under the asymptotically optimal random design X, in terms of the capacity C(s).

Olgica Milenkovic – University of Illinois at Urbana-Champaign
http://www.ece.illinois.edu/directory/profile.asp?milenkov

 

Probabilistic and combinatorial models for quantized group testing
February 15, 2012 3:15 pm – 3:45 pm

 

Keywords of the presentation: group testing, MAC channel, quantization, graphical models

 

We consider a novel group testing framework where defectives obey a probabilistic model in which the number and set of defectives are governed by a graphical model. We furthermore assume that the defectives have importance factors that influence their strength on the test outcomes and the accuracy with which the outcomes are read. This kind of scenario arises, for example, in MAC channels with networked users using different communication powers, or in DNA pooling schemes which involve large families.

Jelani Nelson – Princeton University
http://people.seas.harvard.edu/~minilek/

 

Sparser Johnson-Lindenstrauss Transforms
February 16, 2012 10:15 am – 11:15 am

 

Keywords of the presentation: dimensionality reduction, johnson-lindenstrauss, numerical linear algebra, massive data

 

The Johnson-Lindenstrauss (JL) lemma (1984) states that any n points in d-dimensional Euclidean space can be embedded into k = O((log n)/eps^2) dimensions so that all pairwise distances are preserved up to 1+/-eps. Furthermore, this embedding can be achieved via a linear mapping. The JL lemma is a useful tool for speeding up solutions to several high-dimensional problems: closest pair, nearest neighbor, diameter, minimum spanning tree, etc. It also speeds up some clustering and string processing algorithms, reduces the amount of storage required to store a dataset, and can be used to reduce memory required for numerical linear algebra problems such as linear regression and low rank approximation.

The original proofs of the JL lemma let the linear mapping be specified by a random dense k x d matrix (e.g. i.i.d. Gaussian entries). Thus, performing an embedding requires dense matrix-vector multiplication. We give the first construction of linear mappings for JL in which only a subconstant fraction of the embedding matrix is non-zero, regardless of how eps and n are related, thus always speeding up the embedding time. Previous constructions only achieved sparse embedding matrices for 1/eps >> log n.

This is joint work with Daniel Kane (Stanford).
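Here is a small R sketch of the random-projection idea behind the lemma (a dense Gaussian map and a naive sparse variant with random signs; this is illustrative only, not the sparse construction of the talk):

set.seed(11)
n <- 200; d <- 5000; eps <- 0.2
k <- ceiling(4 * log(n) / eps^2)                          # target dimension, up to constants

X <- matrix(rnorm(n * d), nrow = n)                       # n points in R^d

A <- matrix(rnorm(k * d), nrow = k) / sqrt(k)             # dense Gaussian JL map
Y <- X %*% t(A)

q <- 0.05                                                 # keep roughly 5% of the entries
S <- matrix(rbinom(k * d, 1, q) * sample(c(-1, 1), k * d, replace = TRUE),
            nrow = k) / sqrt(k * q)                       # naive sparse map
Z <- X %*% t(S)

pairs <- t(replicate(50, sample(n, 2)))                   # random pairs of distinct points
dist_ratio <- function(emb) sqrt(rowSums((emb[pairs[, 1], ] - emb[pairs[, 2], ])^2)) /
  sqrt(rowSums((X[pairs[, 1], ] - X[pairs[, 2], ])^2))
range(dist_ratio(Y)); range(dist_ratio(Z))                # both should stay roughly within 1 +/- eps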


Natasha Przulj – Imperial College London
http://www.doc.ic.ac.uk/~natasha/

 

Network topology as a source of biological information
February 14, 2012 10:15 am – 11:15 am

 

Keywords of the presentation: biological networks, graph algorithms, network alignment

 

Sequence-based computational approaches have revolutionized biological understanding. However, they can fail to explain some biological phenomena. Since proteins aggregate to perform a function instead of acting in isolation, the connectivity of a protein interaction network (PIN) will provide additional insight into the inner workings of the cell, over and above sequences of individual proteins. We argue that sequence and network topology give insights into complementary slices of biological information, which sometimes corroborate each other, but sometimes do not. Hence, the advancement depends on the development of sophisticated graph-theoretic methods for extracting biological knowledge purely from network topology before being integrated with other types of biological data (e.g., sequence). However, dealing with large networks is non-trivial, since many graph-theoretic problems are computationally intractable, so heuristic algorithms are sought.

Analogous to sequence alignments, alignments of biological networks will likely impact biomedical understanding. We introduce a family of topology-based network alignment (NA) algorithms (that we call GRAAL algorithms) that produce by far the most complete alignments of biological networks to date: our alignment of yeast and human PINs demonstrates that even distant species share a surprising amount of PIN topology. We show that both species phylogeny and protein function can be extracted from our topological NA. Furthermore, we demonstrate that the NA quality improves with integration of additional data sources (including sequence) into the alignment algorithm: surprisingly, 77.7% of proteins in the baker’s yeast PIN participate in a connected subnetwork that is fully contained in the human PIN, suggesting broad similarities in internal cellular wiring across all life on Earth. Also, we demonstrate that topology around cancer and non-cancer genes is different and, when integrated with functional genomics data, it successfully predicts new cancer genes in melanogenesis-related pathways.


Sylvie Ricard-Blum – Université Claude-Bernard (Lyon I)
http://www.ibcp.fr/scripts/affiche_detail.php?n_id=174

 

A dynamic and quantitative protein interaction network regulating angiogenesis
February 16, 2012 3:45 pm – 4:15 pm

 

Keywords of the presentation: protein interaction networks, kinetics, affinity, protein arrays, surface plasmon resonance

 

Angiogenesis, consisting in the formation of blood vessels from preexisting ones, is of crucial importance in pathological situations such as cancer and diabetes and is a therapeutic target for these two diseases. We have developed protein and sugar arrays probed by surface plasmon resonance (SPR) imaging to identify binding partners of proteins, polysaccharides and receptors regulating angiogenesis. Interactions collected from our own experiments and from literature curation have been used to build a network comprising protein-protein and protein-polysaccharide interactions. To switch from a static to a dynamic and quantitative interaction network, we have measured kinetics and affinity of interactions by SPR to discriminate transient from stable interactions and to prioritize interactions in the network. We have also identified protein interaction sites either experimentally or by molecular modeling to discriminate interactions occurring simultaneously from those taking place sequentially. The ultimate step is to integrate, using bioinformatics tools, all these parameters in the interaction network together with inhibitors of these interactions and with gene and protein expression data available from Array Express or Gene Expression Omnibus and from the Human Protein Atlas. This dynamic network will allow us to understand how angiogenesis is regulated in a concerted fashion via several receptors and signaling pathways, to identify crucial interactions for the integrity of the network that are potential therapeutic targets, and to predict the side effects of anti-angiogenic treatments.

Atri Rudra – University at Buffalo (SUNY)
http://www.cse.buffalo.edu/~atri/

 

Tutorial: Group Testing and Coding Theory
February 14, 2012 9:00 am – 10:00 am

 

Keywords of the presentation: Code Concatenation, Coding theory, Group Testing, List Decoding, List Recovery

 

Group testing was formalized by Dorfman in his 1943 paper and was originally used in WW-II to identify soldiers with syphilis. The main insight in this application is that blood samples from different soldiers can be combined to check if at least one of the soldiers in the pool has the disease. Since then, group testing has found numerous applications in many areas such as (computational) biology, combinatorics and (theoretical) computer science.

Theory of error-correcting codes, or coding theory, was born in the works of Shannon in 1948 and Hamming in 1950. Codes are ubiquitous in our daily life and have also found numerous applications in theoretical computer science in general and computational complexity in particular.

Kautz and Singleton connected these two areas in their 1964 paper by using “code concatenation” to design good group testing schemes. All of the (asymptotically) best known explicit constructions of group testing schemes use the code concatenation paradigm. In this talk, we will focus on the “decoding” problem for group testing: i.e., given the outcomes of the tests on the pools, identify the infected soldiers. Recent applications of group testing in data stream algorithms require sub-linear time decoding, which is not guaranteed by the traditional constructions.

The talk will first survey the Kautz-Singleton construction and then will show how recent developments in list decoding of codes lead in a modular way to sub-linear time decodable group testing schemes.
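As a reminder of what “decoding” means here, the following toy R sketch sets up a random non-adaptive pooling design and applies the simplest decoder, which declares an item defective only if every pool containing it tests positive (this naive decoder runs in time linear in the number of items, so it is emphatically not the sub-linear time scheme the talk is about).

set.seed(5)
n <- 500                                         # population size
d <- 5                                           # number of defectives
t <- 150                                         # number of pools (tests)

defective <- sort(sample(n, d))
M <- matrix(rbinom(t * n, 1, 1 / d), nrow = t)   # M[i, j] = 1 if item j is placed in pool i

outcome <- as.integer(M[, defective, drop = FALSE] %*% rep(1, d) > 0)   # pool positive iff it holds a defective

# Declare defective every item that never appears in a negative pool
declared <- which(colSums(M * (1 - outcome)) == 0)
setequal(declared, defective)                    # TRUE when the design has enough tests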


Vyacheslav V. Rykov – University of Nebraska
http://www.unomaha.edu/~wwwmath/faculty/rykov/index.html

 

Superimposed Codes and Designs for Group Testing Models
February 14, 2012 11:30 am – 12:30 pm

 

Keywords of the presentation: superimposed codes, screening designs, rate of code, threshold designs and codes

 

We will discuss superimposed codes and non-adaptive group testing designs arising from the potentialities of compressed genotyping models in molecular biology. The given survey is also motivated by the 30th anniversary of our recurrent upper bound on the rate of superimposed codes published in 1982.

Alex Samorodnitsky – Hebrew University
http://www.cs.huji.ac.il/~salex/

 

Testing Boolean functions
February 14, 2012 2:00 pm – 3:00 pm

 

I will talk about property testing of Boolean functions, concentrating on two topics: testing monotonicity (an example of a combinatorial property) and testing the property of being a low-degree polynomial (an algebraic property).

Sriram Sankararaman – Harvard Medical School
http://sriramsankararaman.com/

 

Genomic Privacy and the Limits of Individual Detection in a Pool 
February 15, 2012 10:15 am – 11:15 am

 

Keywords of the presentation: Genomewide association studies, Privacy, Pooled designs, Hypothesis testing, Local Asymptotic normality

 

Statistical power to detect associations in genome-wide association studies can be enhanced by combining data across studies in meta-analysis or replication studies. Such methods require data to flow freely in the scientific community, however, and this raises privacy concerns.

Until recently, many studies pooled individuals together, making only the allele frequencies of each SNP in the pool publicly available. However, a technique that could be used to detect the presence of individual genotypes from such data prompted organizations such as the NIH to restrict public access to summary data. To again allow public access to data from association studies, we need to determine which set of SNPs can be safely exposed while preserving an acceptable level of privacy.

To answer this question, we provide an upper bound on the power achievable by any detection method as a function of factors such as the number and the allele frequencies of exposed SNPs, the number of individuals in the pool, and the false positive rate of the method. Our approach is based on casting the problem in a statistical hypothesis testing framework for which the likelihood ratio test (LR-test) attains the maximal power achievable.

Our analysis provides quantitative guidelines for researchers to make SNPs public without compromising privacy. We recommend, based on our analysis, that only common independent SNPs be exposed. The final decision regarding the exposed SNPs should be based on the analytical bound in conjunction with empirical estimates of the power of the LR test. To this end, we have implemented a tool, SecureGenome, that determines the set of SNPs that can be safely exposed for a given dataset.


Alexander Schliep – Rutgers University
http://bioinformatics.rutgers.edu

 

From screening clone libraries to detecting biological agents
February 17, 2012 10:15 am – 11:15 am

 

Keywords of the presentation: (generalized) group testing, screening clone libraries, DNA microarrays

 

Group testing has made many contributions to modern molecular biology. In the Human Genome Project, large clone libraries were created to amplify DNA. Widely used group testing schemes vastly accelerated the detection of overlaps between the individual clones in these libraries through experiments, realizing savings both in effort and materials.

Modern molecular biology also contributed to group testing. The problem of generalized group testing (in the combinatorial sense) arises naturally when one uses oligonucleotide probes to identify biological agents present in a sample. In this setting a group testing design cannot be chosen arbitrarily. The possible columns of a group testing design matrix are prescribed by the biology, namely by the hybridization reactions between target DNA and probes.


Noam Shental – Open University of Israel
http://www.openu.ac.il/home/shental/

 

Identification of rare alleles and their carriers using compressed se(que)nsing
February 13, 2012 2:00 pm – 3:00 pm

 

Keywords of the presentation: compressed sensing, group testing, genetics, rare alleles

 

Identification of rare variants by resequencing is important both for detecting novel variations and for screening individuals for known disease alleles. New technologies enable low-cost resequencing of target regions, although it is still prohibitive to test more than a few individuals. We propose a novel pooling design that enables the recovery of novel or known rare alleles and their carriers in groups of individuals. The method is based on combining next-generation sequencing technology with a Compressed Sensing (CS) approach. The approach is general, simple and efficient, allowing for simultaneous identification of multiple variants and their carriers. It reduces experimental costs, i.e., both sample preparation related costs and direct sequencing costs, by up to 70-fold, thus allowing much larger cohorts to be scanned. We demonstrate the performance of our approach over several publicly available data sets, including the 1000 Genomes Pilot 3 study. We believe our approach may significantly improve the cost effectiveness of future association studies and of screening large DNA cohorts for specific risk alleles.

We will present initial results of two projects that were initiated following publication. The first project concerns identification of de novo SNPs in genetic disorders common among Ashkenazi Jews, based on sequencing 3000 DNA samples. The second project in plant genetics involves identifying SNPs related to water and silica homeostasis in Sorghum bicolor, based on sequencing 3000 DNA samples using 1-2 Illumina lanes.

Joint work with Amnon Amir from the Weizmann Institute of Science, and Or Zuk from the Broad Institute of MIT and Harvard


Nicolas Thierry-Mieg – Centre National de la Recherche Scientifique (CNRS)
http://membres-timc.imag.fr/Nicolas.Thierry-Mieg/

 

Shifted Transversal Design Smart-pooling: increasing sensitivity, specificity and efficiency in high-throughput biology
February 15, 2012 9:00 am – 10:00 am

 

Keywords of the presentation: combinatorial group testing, smart-pooling, interactome mapping

 

Group testing, also known as smart-pooling, is a promising strategy for achieving high efficiency, sensitivity, and specificity in systems-level projects. It consists in assaying well-chosen pools of probes, such that each probe is present in several pools, hence tested several times. The goal is to construct the pools so that the positive probes can usually be directly identified from the pattern of positive pools, despite the occurrence of false positives and false negatives. While striving for this goal, two interesting mathematical or computational problems emerge: the pooling problem (how should the pools be designed?), and the decoding problem (how to interpret the outcomes?). In this talk I will discuss these questions and the solutions we have proposed: a flexible and powerful combinatorial construction for designing smart-pools (the Shifted Transversal Design, STD), and an efficient exact algorithm for interpreting results (interpool). I will then present the results of validation experiments that we have performed in the context of yeast two-hybrid interactome mapping.



David P. Woodruff – IBM Research Division
http://www.almaden.ibm.com/cs/people/dpwoodru/

 

Tutorial: The Data Stream Model
February 16, 2012 9:00 am – 10:00 am

 

The data stream model has emerged as a way of analyzing algorithmic efficiency in the presence of massive data sets. Typically the algorithm is allowed a few (usually one) passes over the input, and must use limited memory and have very fast per input item processing time. I will give a survey of algorithms and lower bounds in this area, with an emphasis on problems such as norm estimation, numerical linear algebra, empirical entropy, l_p-sampling, and heavy hitters. Time-permitting I’ll also discuss the extension to functional monitoring, in which there are multiple sites each with a data stream and a central coordinator should continuously maintain a statistic over the union of the data streams.

Or Zuk – Broad Institute
http://www.weizmann.ac.il/home/fezuk/

 

Reconstruction of bacterial communities using sparse representation
February 17, 2012 11:30 am – 12:30 pm

 

Keywords of the presentation: metagenomics, sparse representation, compressed sequencing

 

Determining the identities and frequencies of species present in a sample is a central problem in metagenomics, with scientific, environmental and clinical implications.

A popular approach to the problem is sequencing the ribosomal 16S RNA gene in the sample using universal primers, and using variation in the gene’s sequence between different species to identify the species present in the sample. We present a novel framework for community reconstruction based on sparse representation: while millions of microorganisms are present on earth, with known 16S sequences stored in a database, only a small minority (typically a few hundred) are likely to be present in any given sample.

We discuss the statistical framework, the algorithms used, and the results in terms of accuracy and species resolution.
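A stripped-down R sketch of this kind of estimation (entirely my own toy version: a random matrix standing in for a database of reference 16S profiles, a sparse frequency vector, and a non-negative least-squares fit via box-constrained optim; the actual method in the talk works with real 16S sequences and dedicated sparse-recovery machinery):

set.seed(9)
n_species <- 200                              # species in the reference database
n_features <- 120                             # length of each species' summarized profile

A <- matrix(runif(n_features * n_species), nrow = n_features)
A <- sweep(A, 2, colSums(A), "/")             # each column: one species' normalized profile

f_true <- numeric(n_species)                  # sparse true frequency vector
f_true[sample(n_species, 4)] <- runif(4)
f_true <- f_true / sum(f_true)

y <- as.vector(A %*% f_true) + rnorm(n_features, sd = 1e-3)   # observed mixed profile

obj  <- function(f) sum((y - A %*% f)^2)
grad <- function(f) as.vector(-2 * t(A) %*% (y - A %*% f))
fit  <- optim(rep(1 / n_species, n_species), obj, grad,
              method = "L-BFGS-B", lower = 0, control = list(maxit = 500))
f_hat <- fit$par / sum(fit$par)

order(f_hat, decreasing = TRUE)[1:4]          # largest estimated frequencies...
which(f_true > 0)                             # ...should largely coincide with the true species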

There is a Bayesian Cake Club, where we can find a list of papers on Bayesian statistics:

The reading list given by Professor Xi’an on the topic of ABC (Approximate Bayesian Computation) convergence:
