### IMA meeting on Group Testing Designs, Algorithms, and Applications to Biology

**Length Reduction via Polynomials**

February 16, 2012 11:30 am – 12:30 pm

*Keywords of the presentation*: Sparse convolutions, polynomials, finite fields, length reduction.

Both randomized and deterministic algorithms were developed for efficiently computing the sparse FFT. The key operation in all these algorithms was length reduction: the sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to date was the modulo function, since it preserves the locations of the "1" bits up to cyclic shift.

In this paper we present a new method for length reduction – polynomials. We show that this method allows a faster deterministic computation of the sparse FFT than currently known in the literature. It also enables the development of an efficient algorithm for computing the binary sparse Walsh Transform. To our knowledge, this is the first such algorithm in the literature.

(Joint work with Oren Kappah, Ely Porat, and Amir Rothschild)
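The modulo-based length reduction described above can be sketched in a few lines: reducing indices modulo a small prime-like length q turns linear convolution of long sparse vectors into cyclic convolution of short dense vectors. A minimal Python illustration (not the authors' algorithm; q and the inputs are arbitrary):

```python
import numpy as np

def reduce_mod(indices, values, q):
    """Map a sparse vector (nonzero positions `indices` with `values`)
    of large length into a dense vector of small length q via i -> i mod q."""
    v = np.zeros(q)
    for i, x in zip(indices, values):
        v[i % q] += x
    return v

def cyclic_convolve(a, b):
    """Circular convolution of two equal-length vectors via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))
```

Because (i + j) mod q = ((i mod q) + (j mod q)) mod q, the reduction of the true convolution equals the cyclic convolution of the two reductions, which is what makes the small vectors useful.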

**RNA Structure Characterization from High-Throughput Chemical Mapping Experiments**

February 13, 2012 3:45 pm – 4:45 pm

*Keywords of the presentation*: RNA structure characterization, high-throughput sequencing, maximum likelihood estimation

In this talk, I will review recent developments in experimental RNA structure characterization as well as advances in sequencing technologies. I will then describe the SHAPE-Seq technique, focusing on its automated data analysis method, which relies on a novel probabilistic model of a SHAPE-Seq experiment, adjoined by a rigorous maximum likelihood estimation framework. I will demonstrate the accuracy and simplicity of our approach as well as its applicability to a general class of chemical mapping techniques and to more traditional SHAPE experiments that use capillary electrophoresis to identify and quantify primer extension products.

This is joint work with Lior Pachter, Julius Lucks, Stefanie Mortimer, Shujun Luo, Cole Trapnell, Gary Schroth, Jennifer Doudna and Adam Arkin.

**Improved Constructions for Non-adaptive Threshold Group Testing**

February 15, 2012 3:45 pm – 4:15 pm

*Keywords of the presentation*: Group testing, Explicit constructions

**Competitive Testing for Evaluating Priced Functions**

February 15, 2012 2:00 pm – 3:00 pm

*Keywords of the presentation*: function evaluation, competitive analysis, Boolean functions, decision trees, adaptive testing, non-uniform costs

**A Group Testing Approach to Corruption Localizing Hashing**

February 16, 2012 2:00 pm – 3:00 pm

*Keywords of the presentation*: Algorithms, Cryptography, Corruption-Localizing Hashing, Group Testing, Superimposed Codes.

**Poster – Finding one of m defective elements**

February 14, 2012 4:15 pm – 5:30 pm

**Tutorial: Cost effective sequencing of rare genetic variations**

February 13, 2012 10:00 am – 11:00 am

My tutorial will provide background on rare genetic variations and DNA sequencing. I will present our sample prep strategy, called DNA Sudoku, which utilizes a combinatorial pooling/compressed sensing approach to find rare genetic variations. More importantly, I will discuss several major distinctions from the classical combinatorial setting that arise due to sequencing-specific constraints.

**Mining Rare Human Variations using Combinatorial Pooling**

February 13, 2012 3:15 pm – 3:45 pm

*Keywords of the presentation*: high-throughput sequencing, combinatorial pooling, liquid-handling, compressed sensing

We have developed a protocol for quantifying, calibrating, and pooling DNA samples using a liquid-handling robot, which has required a significant amount of testing in order to reduce volume variation. I will discuss our protocol and the steps we have taken to reduce the CV (coefficient of variation). For accurate decoding and to reduce the possibility of specimen dropout, it is important that the DNA samples are accurately quantified and calibrated so that equal amounts can be pooled and sequenced. We can determine the number of carriers in each pool from sequencing output and reconstruct the original identity of individual specimens based on the pooling design, allowing us to identify a small number of carriers in a large cohort.

**Superimposed codes**

February 14, 2012 3:45 pm – 4:15 pm

There are many versions and related problems, like Sidon sets, sum-free sets, union-free families, locally thin families, cover-free codes and families, etc. We discuss two cases: cancellative and union-free codes.

A family of sets *Ƒ* (and the corresponding code of 0-1 vectors) is called *union-free* if *A ∪ B = C ∪ D* and *A, B, C, D ∈ Ƒ* imply {*A, B*} = {*C, D*}. *Ƒ* is called *t*-cancellative if for all distinct *t* + 2 members *A_{1}, … , A_{t}* and *B, C ∈ Ƒ*:

*A_{1} ∪ … ∪ A_{t} ∪ B ≠ A_{1} ∪ … ∪ A_{t} ∪ C.*

Let *c_{t}(n)* be the size of the largest *t*-cancellative code on *n* elements. We significantly improve the previous upper bounds of Körner and Sinaimeri; e.g., we show *c_{2}(n)* ≤ 2^{0.322n} (for *n > n_{0}*).
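As a sanity check on the definitions, the union-free property can be verified by brute force for small families: the family is union-free exactly when every pairwise union determines its unordered pair. An illustrative sketch (not part of the paper):

```python
from itertools import combinations_with_replacement

def is_union_free(family):
    """Brute-force check that every union A | B of members of `family`
    determines the unordered pair {A, B}, i.e. the family is union-free."""
    fam = [frozenset(s) for s in family]
    seen = {}  # union -> the unordered pair that produced it
    for A, B in combinations_with_replacement(fam, 2):
        u = A | B
        key = frozenset({A, B})
        if u in seen and seen[u] != key:
            return False  # two different pairs share a union
        seen[u] = key
    return True
```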

**Streaming algorithms for approximating the length of the longest increasing subsequence**

February 17, 2012 9:00 am – 10:00 am

*Keywords of the presentation*: Data streams, lower bounds, longest increasing subsequence

I will talk about proving lower bounds on how much space (memory) is necessary to still be able to solve the given task. I will focus on the problem of approximating the length of the longest increasing subsequence, which is a measure of how well the data is sorted.

Joint work with Parikshit Gopalan.
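For context, the exact offline algorithm computes the LIS length in O(n log n) time, but its memory can grow linearly with the stream length, which is what motivates space lower bounds in the streaming setting. A standard patience-sorting sketch (not the talk's streaming algorithm):

```python
from bisect import bisect_left

def lis_length(stream):
    """Exact length of the longest strictly increasing subsequence.
    tails[k] holds the smallest possible tail value of an increasing
    subsequence of length k + 1 seen so far."""
    tails = []
    for x in stream:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # x extends the longest subsequence
        else:
            tails[i] = x      # x gives a smaller tail for length i + 1
    return len(tails)
```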

**Tutorial: Sparse signal recovery**

February 13, 2012 11:15 am – 12:15 pm

*Keywords of the presentation*: streaming algorithms, group testing, sparse signal recovery, introduction.

**Weighted Pooling – Simple and Effective Techniques for Pooled High Throughput Sequencing Design**

February 15, 2012 11:30 am – 12:00 pm

We show that one can gain further efficiency and cost reduction by using “weighted” designs, in which different individuals donate different amounts of DNA to the pools. Intuitively, in this situation the number of mutant reads in a pool indicates not only the number of carriers but also their identity.

We describe and study a simple but powerful example of such weighted designs, with non-overlapping pools. We demonstrate that even this naive approach is not only easier to implement and analyze but is also competitive in terms of accuracy with combinatorial designs when identifying very rare variants, and is superior to the combinatorial designs when genotyping more common variants.

We then discuss how weighting can be further incorporated into existing designs to increase their accuracy and demonstrate the resulting improvement in reconstruction efficiency using simulations. Finally, we argue that these weighted designs have enough power to facilitate detection of common alleles, so they can be used as a cornerstone of whole-exome or even whole-genome sequencing projects.
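The intuition that read counts can reveal carrier identity can be made concrete in an idealized toy case: if the DNA amounts in a single pool follow powers of two, the mutant-read total encodes the carrier set in binary. A hypothetical, noiseless sketch (not the paper's actual design):

```python
def decode_weighted_pool(mutant_units, weights):
    """Recover the carrier set from one pool in which individual i
    contributed weights[i] units of DNA, assuming noiseless counts and
    superincreasing (e.g. power-of-two) weights: greedily subtract the
    largest weights that fit."""
    carriers = []
    for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
        if mutant_units >= weights[i]:
            mutant_units -= weights[i]
            carriers.append(i)
    return sorted(carriers)
```

Real designs must cope with sequencing noise and a bounded dynamic range of DNA amounts, which is why the paper studies weighting in combination with pooling designs rather than this naive encoding.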

**Combinatorial Pooling Enables Selective Sequencing of the Barley Gene Space**

February 14, 2012 3:15 pm – 3:45 pm

*Keywords of the presentation*: combinatorial pooling, DNA sequencing and assembly, genomics, next-generation sequencing,

Joint work with D. Duma (UCR), M. Alpert (UCR), F. Cordero (U of Torino), M. Beccuti (U of Torino), P. R. Bhat (UCR and Monsanto), Y. Wu (UCR and Google), G. Ciardo (UCR), B. Alsaihati (UCR), Y. Ma (UCR), S. Wanamaker (UCR), J. Resnik (UCR), and T. J. Close (UCR).

Preprint available at http://arxiv.org/abs/1112.4438

**Poster – Upgraded Separate Testing of Inputs in Compressive Sensing**

February 14, 2012 4:15 pm – 5:30 pm

*X* in terms of capacity *C(s)*.

**Olgica Milenkovic**– University of Illinois at Urbana-Champaign

http://www.ece.illinois.edu/directory/profile.asp?milenkov

**Probabilistic and combinatorial models for quantized group testing**

February 15, 2012 3:15 pm – 3:45 pm

*Keywords of the presentation*: group testing, MAC channel, quantization, graphical models

**Sparser Johnson-Lindenstrauss Transforms**

February 16, 2012 10:15 am – 11:15 am

*Keywords of the presentation*: dimensionality reduction, johnson-lindenstrauss, numerical linear algebra, massive data

The original proofs of the JL lemma let the linear mapping be specified by a random dense k x d matrix (e.g. i.i.d. Gaussian entries). Thus, performing an embedding requires dense matrix-vector multiplication. We give the first construction of linear mappings for JL in which only a subconstant fraction of the embedding matrix is non-zero, regardless of how eps and n are related, thus always speeding up the embedding time. Previous constructions only achieved sparse embedding matrices for 1/eps >> log n.

This is joint work with Daniel Kane (Stanford).
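The general shape of such a sparse embedding matrix — s nonzero entries of ±1/√s per column, placed in random rows — can be sketched as follows (an illustration with arbitrarily chosen parameters, not the paper's exact distribution):

```python
import numpy as np

def sparse_jl_matrix(k, d, s, rng):
    """k x d embedding matrix with exactly s nonzeros per column,
    each equal to +/- 1/sqrt(s) in uniformly chosen distinct rows."""
    A = np.zeros((k, d))
    for j in range(d):
        rows = rng.choice(k, size=s, replace=False)
        signs = rng.choice([-1.0, 1.0], size=s)
        A[rows, j] = signs / np.sqrt(s)
    return A
```

With s nonzeros per column, embedding a vector x costs time proportional to s times the number of nonzeros of x, rather than k·d for a dense matrix.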

**Network topology as a source of biological information**

February 14, 2012 10:15 am – 11:15 am

*Keywords of the presentation*: biological networks, graph algorithms, network alignment

Analogous to sequence alignments, alignments of biological networks will likely impact biomedical understanding. We introduce a family of topology-based network alignment (NA) algorithms (which we call GRAAL algorithms) that produce by far the most complete alignments of biological networks to date: our alignment of yeast and human PINs demonstrates that even distant species share a surprising amount of PIN topology. We show that both species phylogeny and protein function can be extracted from our topological NA. Furthermore, we demonstrate that the NA quality improves with the integration of additional data sources (including sequence) into the alignment algorithm: surprisingly, 77.7% of proteins in the baker’s yeast PIN participate in a connected subnetwork that is fully contained in the human PIN, suggesting broad similarities in internal cellular wiring across all life on Earth. Also, we demonstrate that the topology around cancer and non-cancer genes is different and, when integrated with functional genomics data, it successfully predicts new cancer genes in melanogenesis-related pathways.

**Sylvie Ricard-Blum**– Université Claude-Bernard (Lyon I)

http://www.ibcp.fr/scripts/affiche_detail.php?n_id=174

**A dynamic and quantitative protein interaction network regulating angiogenesis**

February 16, 2012 3:45 pm – 4:15 pm

*Keywords of the presentation*: protein interaction networks, kinetics, affinity, protein arrays, surface plasmon resonance

**Tutorial: Group Testing and Coding Theory**

February 14, 2012 9:00 am – 10:00 am

*Keywords of the presentation*: Code Concatenation, Coding theory, Group Testing, List Decoding, List Recovery

Theory of error-correcting codes, or coding theory, was born in the works of Shannon in 1948 and Hamming in 1950. Codes are ubiquitous in our daily life and have also found numerous applications in theoretical computer science in general and computational complexity in particular.

Kautz and Singleton connected these two areas in their 1964 paper by using “code concatenation” to design good group testing schemes. All of the (asymptotically) best known explicit constructions of group testing schemes use the code concatenation paradigm. In this talk, we will focus on the “decoding” problem for group testing: i.e., given the outcomes of the tests on the pools, identify the infected soldiers. Recent applications of group testing in data stream algorithms require sub-linear time decoding, which is not guaranteed by the traditional constructions.

The talk will first survey the Kautz-Singleton construction and then show how recent developments in list decoding of codes lead in a modular way to sub-linear time decodable group testing schemes.
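The decoding problem has a simple baseline: with a non-adaptive design matrix, any item that appears in a negative pool can be ruled out (often called COMP decoding). A minimal sketch — quadratic time, unlike the sub-linear schemes the talk builds toward:

```python
import numpy as np

def comp_decode(design, outcomes):
    """COMP decoding for non-adaptive group testing: declare item j
    defective iff every pool containing j tested positive.
    design: t x n 0/1 matrix (pools x items); outcomes: length-t 0/1 list."""
    t, n = design.shape
    candidates = set(range(n))
    for i in range(t):
        if not outcomes[i]:  # a negative pool clears all of its items
            candidates -= {j for j in range(n) if design[i, j]}
    return sorted(candidates)
```

If the design matrix is d-disjunct, COMP recovers any set of up to d defectives exactly; otherwise it may report false positives.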

**Vyacheslav V. Rykov**– University of Nebraska

http://www.unomaha.edu/~wwwmath/faculty/rykov/index.html

**Superimposed Codes and Designs for Group Testing Models**

February 14, 2012 11:30 am – 12:30 pm

*Keywords of the presentation*: superimposed codes, screening designs, rate of code, threshold designs and codes

**Testing Boolean functions**

February 14, 2012 2:00 pm – 3:00 pm

**Genomic Privacy and the Limits of Individual Detection in a Pool**

February 15, 2012 10:15 am – 11:15 am

*Keywords of the presentation*: Genomewide association studies, Privacy, Pooled designs, Hypothesis testing, Local Asymptotic normality

Until recently, many studies pooled individuals together, making only the allele frequencies of each SNP in the pool publicly available. However, a technique that could be used to detect the presence of individual genotypes from such data prompted organizations such as the NIH to restrict public access to summary data. To again allow public access to data from association studies, we need to determine which set of SNPs can be safely exposed while preserving an acceptable level of privacy.

To answer this question, we provide an upper bound on the power achievable by any detection method as a function of factors such as the number and the allele frequencies of exposed SNPs, the number of individuals in the pool, and the false positive rate of the method. Our approach is based on casting the problem in a statistical hypothesis testing framework for which the likelihood ratio test (LR-test) attains the maximal power achievable.

Our analysis provides quantitative guidelines for researchers to make SNPs public without compromising privacy. We recommend, based on our analysis, that only common independent SNPs be exposed. The final decision regarding the exposed SNPs should be based on the analytical bound in conjunction with empirical estimates of the power of the LR test. To this end, we have implemented a tool, SecureGenome, that determines the set of SNPs that can be safely exposed for a given dataset.
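The flavor of such a detection test can be illustrated with the simple distance-based statistic of Homer et al., which the LR test refines: compare how close an individual's genotype is to the pool's allele frequencies versus the reference population's. An illustrative sketch (not SecureGenome's LR test):

```python
import numpy as np

def detection_statistic(genotype, pool_freq, pop_freq):
    """Homer-style membership statistic. Positive values suggest the
    individual's genotype is closer to the pool's allele frequencies
    than to the population's, i.e. evidence of membership in the pool.
    genotype: per-SNP allele counts in {0, 1, 2}."""
    y = np.asarray(genotype, dtype=float) / 2.0
    return float(np.sum(np.abs(y - np.asarray(pop_freq))
                        - np.abs(y - np.asarray(pool_freq))))
```

The power of any such test grows with the number of exposed SNPs and shrinks with pool size, which is exactly the trade-off the analytical bound quantifies.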

**From screening clone libraries to detecting biological agents**

February 17, 2012 10:15 am – 11:15 am

*Keywords of the presentation*: (generalized) group testing, screening clone libraries, DNA microarrays

Modern molecular biology has also contributed to group testing. The problem of generalized group testing (in the combinatorial sense) arises naturally when one uses oligonucleotide probes to identify biological agents present in a sample. In this setting a group testing design cannot be chosen arbitrarily: the possible columns of a group testing design matrix are prescribed by the biology, namely by the hybridization reactions between target DNA and probes.

**Identification of rare alleles and their carriers using compressed se(que)nsing**

February 13, 2012 2:00 pm – 3:00 pm

*Keywords of the presentation*: compressed sensing, group testing, genetics, rare alleles

We will present initial results of two projects that were initiated following publication. The first project concerns identification of de novo SNPs in genetic disorders common among Ashkenazi Jews, based on sequencing 3000 DNA samples. The second project, in plant genetics, involves identifying SNPs related to water and silica homeostasis in Sorghum bicolor, based on sequencing 3000 DNA samples using 1-2 Illumina lanes.

Joint work with Amnon Amir from the Weizmann Institute of Science, and Or Zuk from the Broad Institute of MIT and Harvard

**Nicolas Thierry-Mieg**– Centre National de la Recherche Scientifique (CNRS)

http://membres-timc.imag.fr/Nicolas.Thierry-Mieg/

**Shifted Transversal Design Smart-pooling: increasing sensitivity, specificity and efficiency in high-throughput biology**

February 15, 2012 9:00 am – 10:00 am

*Keywords of the presentation*: combinatorial group testing, smart-pooling, interactome mapping

**Tutorial: The Data Stream Model**

February 16, 2012 9:00 am – 10:00 am

**Reconstruction of bacterial communities using sparse representation**

February 17, 2012 11:30 am – 12:30 pm

*Keywords of the presentation*: metagenomics, sparse representation, compressed sequencing

A popular approach to the problem is sequencing the ribosomal 16S RNA gene in the sample using universal primers, and using variation in the gene’s sequence between different species to identify the species present in the sample. We present a novel framework for community reconstruction based on sparse representation: while millions of microorganisms are present on Earth, with known 16S sequences stored in a database, only a small minority (typically a few hundred) are likely to be present in any given sample.

We discuss the statistical framework, algorithms used and results in terms of accuracy and species resolution.
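The sparse-representation step can be illustrated with a generic greedy solver: given a matrix whose columns are reference 16S signatures and an observed mixed profile, pick the few species that best explain the observation. A textbook orthogonal matching pursuit sketch (not the authors' algorithm; the signatures below are hypothetical):

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily select up to k columns of A
    (reference species signatures) that best explain the mixture y, and
    return the sparse abundance vector. Assumes k >= 1."""
    residual = y.astype(float).copy()
    support = []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))  # best-matching column
        if j not in support:
            support.append(j)
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs       # explain what remains
    x = np.zeros(A.shape[1])
    x[support] = coeffs
    return x
```

Real 16S signatures are highly correlated between related species, which is why the statistical framework and species-resolution analysis in the talk matter beyond this idealized sketch.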
