Biology in the Computational World: Bioinformatics

Showing posts with label Bioinformatics. Show all posts

2014-01-08

Single, complete, and average linkage method

The distance between two objects is defined to be the smallest distance possible between them. If both objects are clusters, the distance between the two closest members are used. This calculation is done by equation (8). Single linkage often produces a very skewed hierarchy (called the chaining problem) and is therefore not very useful for summarizing data. However, outlying objects are easily identified by this method, as they will be the last to be merged.

Complete Linkage

This method is much like the single linkage, but instead of using the minimum of the distances, we use the maximum. Complete linkage tends to be less desirable when there is a considerable amount of noise present in the data. Not surprisingly, complete linkage tends to produce very compact clusters.

Average Linkage

This method takes the mean between all the objects in cluster i to all the objects in cluster j. There are several different ways of defining the average distance. In literature some of these are referred to as WPGMA (weighted pair group method with arithmetic mean), UPGMA (unweighted pair group method with arithmetic mean), UPGMC (unweighted pair group method centroid) and WPGMC (weighted pair group method centroid).

Cited from molmine

2013-12-24

K-mer distance

K-mer based distance estimation

K-mer based distance estimation is an alternative to estimating evolutionary distance based on multiple alignments. At a high level, the distance between two sequences is defined by first collecting the set of k-mers (subsequences of length k) occuring in the two sequences. From these two sets, the evolutionary distance between the two organisms is now defined by measuring how different the two sets are. The more the two sets look alike, the smaller is the evolutionary distance. The main motivation for estimating evolutionary distance based on k-mers, is that it is computationally much faster than first constructing a multiple alignment. Experiments show that phylogenetic tree reconstruction using k-mer based distances can produce results comparable to the slower multiple alignment based methods [Blaisdell, 1989].

All of the k-mer based distance measures completely ignores the ordering of the k-mers inside the input sequences. Hence, if the selected k value (the length of the sequences) is too small, very distantly related organisms may be assigned a small evolutionary distance (in the extreme case where k is

, two organisms will be treated as being identical if the frequency of each nucleotide/amino-acid is the same in the two corresponding sequences). In the other extreme, the k-mers should have a length (k) that is somewhat below the average distance between mismatches if the input sequences were aligned (in the extreme case of k=the length of the sequences, two organisms have a maximum distance if they are not identical). Thus the selected k value should not be too large and not too small. A general rule of thumb is to only use k-mer based distance estimation for organisms that are not too distantly related.

Formal definition of distance. In the following, we give a more formal definition of the three supported distance measures: Euclidian-squared, Mahalanobis and Fractional common k-mer count. For all three, we first associate a point

to every input sequence

. Each point

has one coordinate for every possible length k sequence (e.g. if

represent nucleotide sequences, then

has

coordinates). The coordinate corresponding to a length k sequence

has the value: ``number of times

occurs as a subsequence in

''. Now for two sequences

and

, their evolutionary distance is defined as follows:

Euclidian squared: For this measure, the distance is simply defined as the (squared Euclidian) distance between the two points and , i.e.

$\displaystyle \textrm{dist}(s_1,s_2) = \sum_i (p(s_1)_i - p(s_2)_i)^2.$
Mahalanobis: This measure is essentially a fine-tuned version of the Euclidian squared distance measure. Here all the counts are ``normalized'' by dividing with the standard deviation $\sigma_j$ of the count for the k-mer. The revised formula thus becomes:

$\displaystyle \textrm{dist}(s_1,s_2) = \sum_i(p(s_1)_i/\sigma_i - p(s_2)_i/\sigma_i)^2.$

Here the standard deviations can be computed directly from a set of equilibrium frequencies for the different bases, see [Gentleman and Mullin, 1989].
Fractional common k-mer count: For the last measure, the distance is computed based on the minimum count of every k-mer in the two sequences, thus if two sequences are very different, the minimums will all be small. The formula is as follows:

$\displaystyle \textrm{dist}(s_1,s_2) = \log(0.1 + \sum_i(\min(p(s_1)_i,p(s_2)_i)/(\min(n,m)-k+1))).$

Here is the length of and is the length of . This method has been described in [Edgar, 2004].

In experiments performed in [Höhl et al., 2007], the Mahalanobis distance measure seemed to be the best performing of the three supported measures.

Cited from CLCBio

2013-12-13

Inparalog is..

An inparalog pair for a species set S is a pair of genes (a, b), such that a was duplicated from b after a lowest speciation for S.

For example, in figure 1c., there are no inparalogs but there are one inparalogous pair in mouse in figure 1d.

Homologs, orthologs, paralogs

cited www.bio.davidson.edu

2012-01-17

A docking method using SVM

A method focused on protein kinases and kinase inhibitors.

Predicting a small molecule-kinase interaction map: A machine learning approach

Fabian Buchwald, Lothar Richter and Stefan Kramer^*

Cited from Journal of Cheminformatics

2011-11-21

Trie: effective algorithm for searching dictionary

Trie is so call "prefix tree".

In addition, prefix is like the followings.

For example,

String = "abcdef"

1st prefix substring is "abcdef"
2nd prefix substring is "abcde"
3rd prefix substring is "abcd"
....

....

So, all substring of prefix was represented to a tree.

Trie is made from the concept of prefix tree and all nodes of trie don't have any character but all node can have one index.

This example is for S = {"to", "tea", "ted", "ten", "A", "i", "in", "inn"} and the index of all elements of S = {7, 3, 4, 12, 15, 11, 5, 9}

The searching time of trie is very fast, and by the deep first search (DFS), the time complexity is O(m), in case of a string whose length is m.

Cited from Wikipedia

Trie is very available to search dictionary words. Because the sort of dictionary is very similar with trie.

2011-11-17

The age of a single cell assembly

The age of a single cell assembly has become. Every assembler will be reconstructed to focus on the single cell assembly.

Picking up the pieces:

Picking up the pieces

Nature Methods 8, 896 (2011).
doi:10.1038/nmeth.1753

Author: Michael Eisenstein

An improved algorithm for genomic analysis allows scientists to build remarkably accurate and complete genomic sequences from single bacterial cells.

Annotation of the domestic dog genome sequence: finding the missing genes.

The sequencing the domestic dog genome with RNA-Seq technology!!

RNA-Seq.. Now, for genetic research, I don't know that RNA-Seq has to be implemented necessarily with NGS sequencing.

But, the money and time cost to analyze the data are the problem.

Annotation of the domestic dog genome sequence: finding the missing genes.:

Annotation of the domestic dog genome sequence: finding the missing genes.

Mamm Genome. 2011 Nov 11;

Authors: Derrien T, Vaysse A, André C, Hitte C

Abstract

There are over 350 genetically distinct breeds of domestic dog that present considerable variation in morphology, physiology, and disease susceptibility. The genome sequence of the domestic dog was assembled and released in 2005, providing an estimated 20,000 protein-coding genes that are a great asset to the scientific community that uses the dog system as a genetic biomedical model and for comparative and evolutionary studies. Although the canine gene set had been predicted using a combination of ab initio methods, homology studies, motif analysis, and similarity-based programs, it still requires a deep annotation of noncoding genes, alternative splicing, pseudogenes, regulatory regions, and gain and loss events. Such analyses could benefit from new sequencing technologies (RNA-Seq) to better exploit the advantages of the canine genetic system in tracking disease genes. Here, we review the catalog of canine protein-coding genes and the search for missing genes, and we propose rationales for an accurate identification of noncoding genes though next-generation sequencing.

PMID: 22076420 [PubMed - as supplied by publisher]

2011-11-14

Spectra alignment

After k times shift, the algorithm to find common elements between two sets.

D(k) is the maximum common elements between two sets after k times shift.

[Example]

We assume that one shift is that each element, from 6^th element to n^th element, will subtract 5.

Set A = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Set B = {10, 30, 30, 40, 50, 55, 65, 75, 85, 95}
Set C = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}

1. D(1) between Set A and Set B is 10.

{10, 20, 30, 40, 50} is common in Set B and a subset {60, 70, 80, 90, 100} of Set A is in common with Set B because of one shift, -5_(6-n)

2. D(1) between Set A and Set C is 6.

{10, 30, 50} is common in Set C and after one shift, {55, 75, 95} will be also in common in Set C.

de Bruijn graph theory

Recently, de Bruijn graph theory has widely used in the filed of genome assembly.

Expecially, Velvet assembler is based on this graph theory.

The goal of De Bruijn graph is to make superstring through overlapping k - 1 suffix sequence.

So this concept can well match with assembly of short reads of a genome.

de Bruijn graph sample (cited from Wikipedia)

How can de Bruijn graph make a superstring?

Because de Bruijn graph is bidirectional graph, about some input information, it can make superstring as output through traveling every edges or node.

de Bruijn graph can be represented as one of between Hamiltonian cycle or Eulerian cycle.

2011-11-10

De novo assembly of single-cell bacterial genomes

In case of uncultured bacterial genome, NGS sequencers don't archive sequencing yet.

However, the multiplie displacement amplification (MDA, Go to Wikipedia and See this paper) method can be available.

MDA method is useful at the single cell but it might encounter amplification bias and chimeras.

Amplification bias is caused from the irregular coverage of amplified sequences and chimeras occur due to low sequence coverage. Most of standard assemblers use uniform coverage, so they can produce lots of erroneous contigs in case of a single cell assembly.

This paper introduces the specialized software tool for assembling single cell of uncultured bacteria.

Efficient de novo assembly of single-cell bacterial genomes from short-read data sets:

Efficient de novo assembly of single-cell bacterial genomes from short-read data sets

Nature Biotechnology 29, 915 (2011).
doi:10.1038/nbt.1966

Authors: Hamidreza Chitsaz, Joyclyn L Yee-Greenbaum, Glenn Tesler, Mary-Jane Lombardo, Christopher L Dupont, Jonathan H Badger, Mark Novotny, Douglas B Rusch, Louise J Fraser, Niall A Gormley, Ole Schulz-Trieglaff, Geoffrey P Smith, Dirk J Evers, Pavel A Pevzner & Roger S Lasken

2011-11-08

'AC' tag of Swissprot dat file

I had known that 'AC' is unique like 'ID' in swissprot.

But, it was not true.

Moreover, there were multiple 'AC's in one 'ID'.

Oooops!!

Biology in the Computational World