Category Archive

You are currently browsing the category archive for the ‘Bio-Glossary’ category.

【Bio-Glossary】Hardy-Weinberg equilibrium

July 23, 2012 in Bio-Glossary

The Hardy-Weinberg equilibrium is a principle stating that the genetic variation in a population will remain constant from one generation to the next in the absence of disturbing factors. When mating is random in a large population with no disruptive circumstances, the law predicts that both genotype and allele frequencies will remain constant because they are in equilibrium.

The Hardy-Weinberg equilibrium can be disturbed by a number of forces, including mutations, natural selection, nonrandom mating, genetic drift, and gene flow. For instance, mutations disrupt the equilibrium of allele frequencies by introducing new alleles into a population. Similarly, natural selection and nonrandom mating disrupt the Hardy-Weinberg equilibrium because they result in changes in gene frequencies. This occurs because certain alleles help or harm the reproductive success of the organisms that carry them. Another factor that can upset this equilibrium is genetic drift, which occurs when allele frequencies grow higher or lower by chance and typically takes place in small populations. Gene flow, which occurs when breeding between two populations transfers new alleles into a population, can also alter the Hardy-Weinberg equilibrium.

Because all of these disruptive forces commonly occur in nature, the Hardy-Weinberg equilibrium rarely applies in reality. Therefore, the Hardy-Weinberg equilibrium describes an idealized state, and genetic variations in nature can be measured as changes from this equilibrium state.

From Hardy-Weinberg equilibrium.

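PS: Mathematically, write $p$ and $q=1-p$ for the frequencies of alleles $A$ and $a$. Under random mating the genotype frequencies are $P_{AA}=p^{2}$, $P_{Aa}=2pq$ and $P_{aa}=q^{2}$, so we have the following:

$P_{Aa}^{2}=4P_{AA}P_{aa}$.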
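The identity $P_{Aa}^{2}=4P_{AA}P_{aa}$ can be checked numerically. The following is a minimal sketch, assuming the standard Hardy-Weinberg genotype proportions ($p^2$, $2pq$, $q^2$) for one locus with two alleles; the function names `hardy_weinberg` and `hwe_gap` are made up for illustration.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies when allele A has frequency p (assumes random mating)."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def hwe_gap(p_AA, p_Aa, p_aa):
    """Return P_Aa^2 - 4 * P_AA * P_aa, which is zero when the identity holds."""
    return p_Aa ** 2 - 4.0 * p_AA * p_aa


# Genotype frequencies generated under the equilibrium satisfy the identity.
freqs = hardy_weinberg(0.3)
gap = hwe_gap(freqs["AA"], freqs["Aa"], freqs["aa"])

# A population with no heterozygotes at all is far from equilibrium.
gap_no_het = hwe_gap(0.5, 0.0, 0.5)
```

A gap far from zero in observed genotype frequencies suggests that at least one of the disturbing forces described above may be at work, although a proper analysis would use a statistical test rather than the raw gap.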

【Bio-Glossary】Microbiome

April 26, 2012 in Bio-Glossary, Biostatistics, Statistics

I just came back from the talk “Statistical Methods for Analysis of Gut Microbiome Data”, given by Professor Hongzhe Li from the University of Pennsylvania.

I learned a new biological term: the microbiome, viewed as an extension of the human genome.

A microbiome is the totality of microbes, their genetic elements (genomes), and environmental interactions in a particular environment. The term “microbiome” was coined by Joshua Lederberg, who argued that microorganisms inhabiting the human body should be included as part of the human genome, because of their influence on human physiology. The human body contains over 10 times more microbial cells than human cells.

There are several research methods:

Targeted amplicon sequencing

Targeted amplicon sequencing relies on having some expectations about the composition of the community being studied. In targeted amplicon sequencing, a phylogenetically informative marker is targeted for sequencing. Such a marker should ideally be present in all the expected organisms. It should also be conserved enough that primers can target genes from a wide range of organisms, while evolving quickly enough to allow finer resolution at the taxonomic level. A common marker for human microbiome studies is the gene for bacterial 16S rRNA (i.e. “16S rDNA”, the sequence of DNA which encodes the ribosomal RNA molecule). Since ribosomes are present in all living organisms, using 16S rDNA allows DNA to be amplified from many more organisms than if another marker were used. The 16S rDNA gene contains both slowly evolving regions and fast evolving regions; the former can be used to design broad primers while the latter allow for finer taxonomic distinction. However, species-level resolution is not typically possible using 16S rDNA. Primer selection is an important step, as anything that cannot be targeted by the primer will not be amplified and thus will not be detected. Different sets of primers have been shown to amplify different taxonomic groups due to sequence variation.

Targeted studies of eukaryotic and viral communities are limited and subject to the challenge of excluding host DNA from amplification and the reduced eukaryotic and viral biomass in the human microbiome.

After the amplicons are sequenced, molecular phylogenetic methods are used to infer the composition of the microbial community. This is done by clustering the amplicons into operational taxonomic units (OTUs) and inferring phylogenetic relationships between the sequences. An important point is that the scale of data is extensive, and further approaches must be taken to identify patterns from the available information. Tools used to analyze the data include VAMPS, QIIME and mothur.
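To make the clustering step concrete, here is a toy sketch of greedy OTU clustering. It assumes the reads are already aligned and of equal length, and it uses a 97% identity threshold as the default, which is a common convention and not something stated in the talk. The real tools named above (VAMPS, QIIME, mothur) use far more elaborate procedures; the function names here are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy example expects pre-aligned, equal-length reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Assign each read to the first OTU whose seed it matches at >= threshold."""
    seeds = []   # one representative sequence per OTU
    counts = []  # number of reads assigned to each OTU
    for read in reads:
        for i, seed in enumerate(seeds):
            if identity(read, seed) >= threshold:
                counts[i] += 1
                break
        else:
            # No existing OTU is close enough, so this read starts a new one.
            seeds.append(read)
            counts.append(1)
    return seeds, counts


# Made-up 10-base reads; a looser threshold is used because they are so short.
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
seeds, counts = greedy_otus(reads, threshold=0.9)
```

The per-OTU counts produced this way are the kind of taxa count table that the statistical methods later in this post take as input.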

Metagenomic sequencing

Metagenomics is also used extensively for studying microbial communities. In metagenomic sequencing, DNA is recovered directly from environmental samples in an untargeted manner with the goal of obtaining an unbiased sample from all genes of all members of the community. Recent studies use shotgun Sanger sequencing or pyrosequencing to recover the sequences of the reads. The reads can then be assembled into contigs. To determine the phylogenetic identity of a sequence, it is compared to available full genome sequences using methods such as BLAST. One drawback of this approach is that many members of microbial communities do not have a representative sequenced genome.

Despite the fact that metagenomics is limited by the availability of reference sequences, one significant advantage of metagenomics over targeted amplicon sequencing is that metagenomics data can elucidate the functional potential of the community DNA. Targeted gene surveys cannot do this as they only reveal the phylogenetic relationship between the same gene from different organisms. Functional analysis is done by comparing the recovered sequences to databases of metagenomic annotations such as KEGG. The metabolic pathways that these genes are involved in can then be predicted with tools such as MG-RAST, CAMERA and IMG/M.

RNA and protein-based approaches

Metatranscriptomic studies have been performed to study the gene expression of microbial communities through methods such as the pyrosequencing of extracted RNA. Structure-based studies have also identified non-coding RNAs (ncRNAs) such as ribozymes from microbiota. Metaproteomics is a new approach that studies the proteins expressed by microbiota, giving insight into its functional potential.

The speaker presented two statistical methods based on the first technology listed above (targeted amplicon sequencing):

Kernel-based regression to test the effect of microbiome composition on an outcome
Sparse Dirichlet-multinomial regression for taxon-level analysis

The following is the abstract of this talk:

With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output are taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. In some cases, we may have a large number of covariates and identification of the relevant covariates and their associated bacterial taxa becomes important. In this talk, I present several statistical methods for analysis of the human microbiome data, including exploratory analysis methods such as generalized UniFrac distances and graph-constrained canonical correlations and statistical models for the count data and simplex data. In particular, I present a sparse group variable selection method for Dirichlet-multinomial regression to account for overdispersion of the counts and to impose a sparse group L1 penalty to encourage both group-level and within-group sparsity. I demonstrate the application of these methods with an on-going human gut microbiome study to investigate the association between nutrient intake and microbiome composition. Finally, I present several challenging statistical and computational problems in analysis of shotgun metagenomics data.
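To see why a Dirichlet-multinomial model accounts for overdispersion of the counts, it helps to write down its probability mass function and variance. The sketch below is only the textbook distribution with a fixed parameter vector `alpha`; it is not the sparse group penalized regression from the talk, and the function names are illustrative.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of a vector of taxa counts under a Dirichlet-multinomial."""
    n = sum(counts)
    a = sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        out += math.lgamma(x + al) - math.lgamma(x + 1) - math.lgamma(al)
    return out


def dm_variance(n, alpha, k):
    """Variance of the count of taxon k among n reads."""
    a = sum(alpha)
    p = alpha[k] / a
    # Multinomial variance n*p*(1-p) times the inflation factor (n + a) / (1 + a).
    return n * p * (1.0 - p) * (n + a) / (1.0 + a)


alpha = [2.0, 3.0, 5.0]  # made-up parameters for three taxa
single_read_prob = math.exp(dm_log_pmf([1, 0, 0], alpha))
multinomial_var = 10 * 0.2 * 0.8
overdispersed_var = dm_variance(10, alpha, 0)
```

With these made-up parameters the variance of the first taxon's count is larger than the plain multinomial variance with the same mean proportions, which is the extra spread the model is meant to absorb.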

【Bio-Glossary】Biological Replicates vs Technical Replicates

April 24, 2012 in Bio-Glossary, Biostatistics

The meaning of the term “Biological Replicate” unfortunately often does not get adequately addressed in many publications. “Biological Replicate” can have multiple meanings, depending upon the context of the study. A general definition could be that biological replicates are when the same type of organism is grown/treated under the same conditions. For example, if one were performing a cell-based study, then different flasks containing the same type of cell (and preferably the exact same lineage and passage number) which have been grown under the same conditions could be considered biological replicates of one another. The definition becomes a bit trickier when dealing with higher-order organisms, especially humans. This may be an entire discussion in and of itself, but in this case, it is important to note that one does not have a well-defined lineage or passage number for humans. Indeed, it is basically impossible to ensure that all of your samples for one treatment or control have been exposed to the same external factors. In this case, one must do all that is possible to accurately portray and group these organisms; thus, one should group according to such traits as gender, age, and other well-established cause-effect traits (smokers, heavy drinkers, etc.).

Also, it may be helpful to outline the contrast between biological and technical replicates. Though people have varying definitions of technical replicates, perhaps the purest form of technical replicate would be when the exact same sample (after all preparatory techniques) is analyzed multiple times. The point of such a technical replicate would be to establish the variability (experimental error) of the analysis technique (mass spectrometry, LC, etc.), thus allowing one to set confidence limits for what is significant data. This is in contrast to the reasoning behind a biological replicate, which is to establish the biological variability which exists between organisms which should be identical. Knowing the inherent variability between “identical” organisms allows one to decide whether observed differences between groups of organisms exposed to different treatments are simply random or represent a “true” biological difference induced by such treatment.
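The two kinds of variability can be separated in a small nested dataset. The sketch below assumes each biological replicate (here, a hypothetical flask) was measured several times; the numbers and the function name `replicate_variances` are made up for illustration.

```python
from statistics import mean, variance


def replicate_variances(measurements):
    """measurements maps a sample id to its repeated (technical) measurements."""
    # Technical variability: average spread of repeated measurements of one sample.
    technical = mean(variance(reps) for reps in measurements.values())
    # Spread of the per-sample means across biological replicates. This still
    # contains a share of technical noise (technical / replicates per sample).
    sample_means = [mean(reps) for reps in measurements.values()]
    between_samples = variance(sample_means)
    return technical, between_samples


# Three flasks of the same cell line, each measured three times (made-up values).
data = {
    "flask1": [10.0, 10.2, 9.8],
    "flask2": [12.0, 12.2, 11.8],
    "flask3": [11.0, 11.2, 10.8],
}
technical, between_samples = replicate_variances(data)
```

In this invented example the spread between flasks is much larger than the spread between repeated measurements, so a difference between treatment groups would have to be judged against the between-flask variability, not the instrument error.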

Biological Factor: Single biological parameter controlled by the investigator. For example, genotype, diet, environmental stimulus, age, etc.

Treatment or Treatment Level: An exact value for a biological factor; for example, stress, no-stress, young, old, drug-treated, placebo, etc.

Condition: A single combination of treatments; for example, strain1/stressed/time10, young/drug-treated, etc.

Sample: An entity which has a single condition and is measured experimentally; for example, serum from a single mouse, a sample drawn from a pool of yeast, a sample of pancreatic beta cells pooled from 5 diabetic animals, or the third blood sample taken from a participant in a drug study.

Biological Measurement: A value measured on a collection of samples; for example, abundance of protein x, abundance of phospho-protein y, abundance of transcript z.

Experiment: A collection of biological measurements on two or more samples.

Replicate: Two sets of measurements, either within a single experiment or in two different experiments, where measurements are made on samples in the same condition.

Technical Replicates: Replicates that share the same sample; i.e. the measurements are repeated.

Biological Replicates: Replicates that use different samples; i.e. each set of measurements comes from its own sample.

Question: Technical/Biological Replicates in RNA-Seq For Two Cell Lines

I have a question about the meaning of “biological replicate” in the context of applying RNA-seq to compare two cell lines. Apologies if this is an overly naive question.

We have two human cell lines, one of which was derived from the other. Both have different phenotypes, and we want to use RNA-seq to explore the genetic underpinnings of the difference.

If we generate one cDNA library for each sample, and sequence each library on two lanes of an Illumina GA flowcell, I understand we will have “technical replicates”. In this scenario, we can expect very little difference between the two replicates in a sample. If we were to use something like DESeq to call differential expression, it would be inappropriate to treat our technical replicates as replicates in DESeq, since that would likely lead to a large list of DE calls that don’t reflect biological differences.

So, I’d like to know if it is possible within our model to have “biological replicates” with which we can use DESeq to call biologically meaningful differential expression.

So, two questions:

(1) If we grow up two sets of cells from each of our two cell lines, generate separate cDNA libraries (4 in total), and sequence them on separate lanes, would these be considered “biological replicates” in the sense that it would be appropriate to treat them as replicates within something like DESeq? I suspect not, since both replicates in a sample derive from a single cell line within a short period of time, which means that they will be very similar anyway, almost as similar as in the technical replicate scenario. Perhaps we would need entirely separate cell lines to be considered biological replicates.

(2) In general, how would others address this – does it seem a better approach to go with separate cells and separate libraries, or would this entail extra effort for effectively no benefit?

Answer:

Two “biological replicates” are two samples that should be identical (as much as you can or want to control) but are biologically separated (different cells, different organisms, different populations, colonies…).

You want to check the difference between cell line A and cell line B. Let’s start by assuming they are identical. Even if they are, random fluctuation, technical issues, and slightly different environments mean you will never observe exactly the same expression for all genes. You find differences but can’t conclude whether they are inevitable fluctuation or the result of an actual difference.

So, you want to have two independent populations from A and two independent populations from B, and then see how the variability WITHIN each line (A1 vs. A2, B1 vs. B2) compares to the difference BETWEEN A and B. The RNA levels from A1 and A2 WILL NOT be the same because… because biological systems are far from being deterministic. They might be very similar, but different.

Because A and B would be on different plates (their environment), I would seed A1 and A2, B1 and B2 the same day on 4 distinct (but as similar as possible) dishes, grow them together in the same conditions to minimize external influence, and then collect from all 4 dishes at once, extract RNA…

Since the cost is not in growing cell lines but in sequencing, I would recommend doing 4 independent replicates for A and for B (or any other cell lines you may be interested in) in ONE GO, and then freezing the samples or the RNA. Even better, have somebody hand you the lines labeled alfa, bravo, charlie, delta… (make sure they keep track of what they are in a safe place 😉 ) so that you are not biased while seeding, growing and manipulating the lines!
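The blinding suggestion in the answer can be sketched in a few lines. This is only an illustration with a hypothetical helper `blind_labels`: a colleague would run it, hand over the coded dishes, and keep the returned key until the analysis is done.

```python
import random


def blind_labels(samples, codes, seed=None):
    """Randomly pair each true sample name with a neutral code name."""
    if len(codes) < len(samples):
        raise ValueError("need at least one code per sample")
    rng = random.Random(seed)
    shuffled = list(codes)
    rng.shuffle(shuffled)
    # The key maps code -> true identity and stays with the third party.
    return dict(zip(shuffled, samples))


key = blind_labels(
    ["A1", "A2", "B1", "B2"],
    ["alfa", "bravo", "charlie", "delta"],
    seed=1,
)
```

Passing a seed only makes the assignment reproducible for record keeping; leaving it out gives a fresh random assignment each time.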

【Bio-Glossary】haplotype

November 22, 2011 in Bio-Glossary, Biostatistics

A haplotype is a group of genes within an organism that was inherited together from a single parent. The word “haplotype” is derived from the word “haploid,” which describes cells with only one set of chromosomes, and from the word “genotype,” which refers to the genetic makeup of an organism. A haplotype can describe a pair of genes inherited together from one parent on one chromosome, or it can describe all of the genes on a chromosome that were inherited together from a single parent. This group of genes was inherited together because of genetic linkage, or the phenomenon by which genes that are close to each other on the same chromosome are often inherited together. In addition, the term “haplotype” can also refer to the inheritance of a cluster of single nucleotide polymorphisms (SNPs), which are variations at single positions in the DNA sequence among individuals.

By examining haplotypes, scientists can identify patterns of genetic variation that are associated with health and disease states. For instance, if a haplotype is associated with a certain disease, then scientists can examine stretches of DNA near the SNP cluster to try to identify the gene or genes responsible for causing the disease.

Over the course of many generations, segments of the ancestral chromosomes in an interbreeding population are shuffled through repeated recombination events. Some of the segments of the ancestral chromosomes occur as regions of DNA sequences that are shared by multiple individuals (Figure 1). These segments are regions of chromosomes that have not been broken up by recombination, and they are separated by places where recombination has occurred. These segments are the haplotypes that enable geneticists to search for genes involved in diseases and other medically important traits.
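As a small illustration of shared segments, the sketch below counts how often each haplotype occurs over a window of SNP positions. It assumes the chromosomes are already phased and encoded as strings with one allele per SNP; the data and the function name `haplotype_counts` are invented for the example.

```python
from collections import Counter


def haplotype_counts(chromosomes, start, end):
    """Count distinct allele strings over SNP positions start (inclusive) to end (exclusive)."""
    return Counter(chrom[start:end] for chrom in chromosomes)


# Four phased chromosomes, five SNPs each (made-up alleles).
phased = ["ACGTA", "ACGTT", "ACCTA", "ACGTA"]
counts = haplotype_counts(phased, 0, 3)
```

A haplotype carried by many chromosomes in the window, such as "ACG" here, corresponds to a segment that recombination has not yet broken up in those individuals.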

