Showing posts with label Computer Science.

2014-01-08

Single, complete, and average linkage method

Single Linkage

The distance between two objects is defined to be the smallest distance possible between them. If both objects are clusters, the distance between the two closest members is used, i.e. $\textrm{dist}(A,B) = \min_{a \in A,\, b \in B} d(a,b)$. Single linkage often produces a very skewed hierarchy (called the chaining problem) and is therefore not very useful for summarizing data. However, outlying objects are easily identified by this method, as they will be the last to be merged.
  


   

Complete Linkage

This method is much like single linkage, but instead of using the minimum of the distances, we use the maximum. Complete linkage tends to be less desirable when there is a considerable amount of noise present in the data. Not surprisingly, complete linkage tends to produce very compact clusters.

  

Average Linkage

This method takes the mean of the distances between all the objects in cluster i and all the objects in cluster j. There are several different ways of defining the average distance. In the literature, some of these are referred to as WPGMA (weighted pair group method with arithmetic mean), UPGMA (unweighted pair group method with arithmetic mean), UPGMC (unweighted pair group method centroid) and WPGMC (weighted pair group method centroid).
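To get a concrete feel for how the three linkage methods differ, here is a minimal sketch using SciPy's hierarchical clustering; the toy points and the choice of two clusters are made-up example data, not taken from any particular source.

# Minimal sketch: single, complete, and average linkage on toy 2-D data.
# Requires numpy and scipy; data and cluster count are arbitrary examples.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
dists = pdist(points)                      # condensed pairwise distance matrix

for method in ("single", "complete", "average"):
    Z = linkage(dists, method=method)      # hierarchical merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into two clusters
    print(method, labels)

On such well-separated toy data all three methods give the same grouping; the differences described above only show up on noisier or more elongated data.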


  

2013-12-24

K-mer distance

K-mer based distance estimation

K-mer based distance estimation is an alternative to estimating evolutionary distance based on multiple alignments. At a high level, the distance between two sequences is defined by first collecting the set of k-mers (subsequences of length k) occurring in the two sequences. From these two sets, the evolutionary distance between the two organisms is then defined by measuring how different the two sets are. The more the two sets look alike, the smaller the evolutionary distance. The main motivation for estimating evolutionary distance based on k-mers is that it is computationally much faster than first constructing a multiple alignment. Experiments show that phylogenetic tree reconstruction using k-mer based distances can produce results comparable to the slower multiple alignment based methods [Blaisdell, 1989].

All of the k-mer based distance measures completely ignore the ordering of the k-mers inside the input sequences. Hence, if the selected k value (the length of the k-mers) is too small, very distantly related organisms may be assigned a small evolutionary distance (in the extreme case where k is 1, two organisms will be treated as identical if the frequency of each nucleotide/amino acid is the same in the two corresponding sequences). At the other extreme, the k-mers should have a length (k) that is somewhat below the average distance between mismatches if the input sequences were aligned (in the extreme case where k equals the length of the sequences, two organisms have the maximum distance if they are not identical). Thus the selected k value should not be too large and not too small. A general rule of thumb is to only use k-mer based distance estimation for organisms that are not too distantly related.

Formal definition of distance. In the following, we give a more formal definition of the three supported distance measures: Euclidean squared, Mahalanobis and fractional common k-mer count (a small Python sketch of two of these measures is given at the end of this post). For all three, we first associate a point $p(s)$ with every input sequence $s$. Each point $p(s)$ has one coordinate for every possible length-k sequence (e.g. if $s$ represents nucleotide sequences, then $p(s)$ has $4^k$ coordinates). The coordinate corresponding to a length-k sequence $x$ has the value "number of times $x$ occurs as a subsequence in $s$". Now for two sequences $s_1$ and $s_2$, their evolutionary distance is defined as follows:
  • Euclidean squared: For this measure, the distance is simply defined as the squared Euclidean distance between the two points $p(s_1)$ and $p(s_2)$, i.e.
    $\displaystyle \textrm{dist}(s_1,s_2) = \sum_i (p(s_1)_i - p(s_2)_i)^2.$

  • Mahalanobis: This measure is essentially a fine-tuned version of the Euclidean squared distance measure. Here all the counts $p(s_j)_i$ are "normalized" by dividing by the standard deviation $\sigma_i$ of the count for the i-th k-mer. The revised formula thus becomes:
    $\displaystyle \textrm{dist}(s_1,s_2) = \sum_i(p(s_1)_i/\sigma_i - p(s_2)_i/\sigma_i)^2.$

    Here the standard deviations can be computed directly from a set of equilibrium frequencies for the different bases, see [Gentleman and Mullin, 1989].
  • Fractional common k-mer count: For the last measure, the distance is computed based on the minimum count of every k-mer in the two sequences; thus if two sequences are very different, the minima will all be small. The formula is as follows:
    $\displaystyle \textrm{dist}(s_1,s_2) = \log\Big(0.1 + \sum_i \min(p(s_1)_i,p(s_2)_i)/(\min(n,m)-k+1)\Big).$

    Here $n$ is the length of $s_1$ and $m$ is the length of $s_2$. This method has been described in [Edgar, 2004].

In experiments performed in [Höhl et al., 2007], the Mahalanobis distance measure seemed to be the best performing of the three supported measures.


Cited from CLCBio
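To make the definitions above concrete, here is a small Python sketch of two of the measures (Euclidean squared and fractional common k-mer count). The sequences, the k value, and the function names are made-up examples for illustration, not part of any particular tool.

# Sketch of two k-mer distance measures; an illustration, not an official implementation.
import math
from collections import Counter

def kmer_counts(seq, k):
    # p(s): count of every length-k subsequence (k-mer) in seq
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def euclidean_squared(s1, s2, k):
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum((c1[x] - c2[x]) ** 2 for x in set(c1) | set(c2))

def fractional_common(s1, s2, k):
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    n, m = len(s1), len(s2)
    shared = sum(min(c1[x], c2[x]) for x in set(c1) & set(c2))
    return math.log(0.1 + shared / (min(n, m) - k + 1))

print(euclidean_squared("ACGTACGT", "ACGTTGCA", 3))
print(fractional_common("ACGTACGT", "ACGTTGCA", 3))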

2012-03-22

Manifold learning algorithm

A manifold learning algorithm tries to find a more accurate distance measure or a low-dimensional parameterization of high-dimensional data. A small scikit-learn sketch follows the list of algorithms below.

* Algorithms

  • Isomap
  • Locally linear embedding (LLE)
  • Hessian LLE
  • Local tangent space alignment (LTSA)
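Here is a minimal sketch (assuming scikit-learn is installed) that runs the four algorithms listed above on the classic swiss-roll toy data set; the neighbor counts and sample size are arbitrary example values.

# Minimal manifold learning sketch: unroll a 3-D swiss roll into 2-D coordinates.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)  # 3-D points on a 2-D manifold

embeddings = {
    "Isomap":      Isomap(n_neighbors=10, n_components=2),
    "LLE":         LocallyLinearEmbedding(n_neighbors=10, n_components=2, method="standard"),
    "Hessian LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=2, method="hessian"),
    "LTSA":        LocallyLinearEmbedding(n_neighbors=10, n_components=2, method="ltsa"),
}

for name, model in embeddings.items():
    Y = model.fit_transform(X)   # 2-D coordinates recovered from the 3-D data
    print(name, Y.shape)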

I have to study!! study!! study!!

* References
1. markov's blog
2. Wikipedia page for Nonlinear dimensionality reduction
3. Z. Zhang, J. Wang, and H. Zha, "Adaptive Manifold Learning", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2.

2012-02-04

Dijkstra's algorithm

Dijkstra's algorithm is a typical greedy algorithm.

Dijkstra's algorithm run time (cited from Wikipedia)
Therefore, Dijkstra's algorithm has the typical weakness of a greedy algorithm: the path it calculates is not always the shortest path (for instance, it can go wrong when the graph contains negative edge weights).

But the number of possible paths in a graph can grow explosively with its size, so an algorithm that finds the shortest path this efficiently is really attractive.
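For reference, here is a minimal Python sketch of Dijkstra's algorithm using a binary heap; the graph and its edge weights below are made-up example data.

# Dijkstra's algorithm with a binary heap (heapq); weights must be non-negative.
import heapq

def dijkstra(graph, source):
    # graph: dict mapping node -> list of (neighbor, weight)
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already found a shorter path to u
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

example = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)], "C": [("D", 1)], "D": []}
print(dijkstra(example, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}

Note the non-negative weight requirement in the comment: with negative edge weights the greedy choice mentioned above can indeed go wrong.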

2012-02-01

Gale-Shapley algorithm for stable matching problem

The Gale-Shapley (G-S) algorithm solves the stable matching (or stable marriage) problem.

Stable matching is a story about two groups, men and women.

Each man likes more than one woman, but not all equally: he has his own order of preference.

Likewise, each woman likes more than one man, but not all equally.

If you want to match them up so that both sides are satisfied (no man and woman would both prefer each other over their assigned partners), you can use the G-S algorithm.

Proposer (like men); Acceptor (like women)
Cited from Wikipedia

The G-S algorithm is really simple.
// Cited from Wikipedia, http://en.wikipedia.org/wiki/Stable_marriage_problem

function stableMatching {
    Initialize all m ∈ M and w ∈ W to free
    while ∃ free man m who still has a woman w to propose to {
       w = m's highest ranked such woman to whom he has not yet proposed
       if w is free
         (m, w) become engaged
       else some pair (m', w) already exists
         if w prefers m to m'
           (m, w) become engaged
           m' becomes free
         else
           (m', w) remain engaged
    }
}
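The pseudocode above translates almost directly into a runnable sketch. Here is one possible Python version; the preference lists at the bottom are made-up example data.

# Gale-Shapley: men propose, women accept or trade up.
def stable_matching(men_prefs, women_prefs):
    # men_prefs / women_prefs: dict mapping each person to a ranked preference list
    free_men = list(men_prefs)                 # men who are not yet engaged
    next_proposal = {m: 0 for m in men_prefs}  # index of the next woman each man proposes to
    engaged_to = {}                            # woman -> man she is currently engaged to
    rank = {w: {m: i for i, m in enumerate(prefs)} for w, prefs in women_prefs.items()}

    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_proposal[m]]     # highest-ranked woman not yet proposed to
        next_proposal[m] += 1
        if w not in engaged_to:
            engaged_to[w] = m                  # w is free: (m, w) become engaged
        elif rank[w][m] < rank[w][engaged_to[w]]:
            free_men.append(engaged_to[w])     # w prefers m: m' becomes free
            engaged_to[w] = m
        else:
            free_men.append(m)                 # w rejects m: he stays free
    return {m: w for w, m in engaged_to.items()}

men = {"a": ["x", "y"], "b": ["y", "x"]}
women = {"x": ["a", "b"], "y": ["b", "a"]}
print(stable_matching(men, women))  # {'a': 'x', 'b': 'y'}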

2012-01-19

Vertex cover

A vertex cover of a graph is a set of vertices such that each edge of the graph is incident to at least one vertex of the set.

In particular, minimum vertex cover is a classical problem in graph theory because it is NP-hard.

So, many algorithms for solving the minimum vertex cover problem are approximation algorithms.

Vertex cover
Minimum vertex cover
A minimum vertex cover is the smallest set of vertices that is a vertex cover.
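As a small illustration of such an approximation algorithm, here is a sketch of the classical 2-approximation: repeatedly pick an edge that is not yet covered and take both of its endpoints. The edge list is made-up example data; the resulting cover is guaranteed to be at most twice the minimum size.

# 2-approximation for minimum vertex cover via greedy edge picking.
def approx_vertex_cover(edges):
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))   # take both endpoints of an uncovered edge
    return cover

edges = [(1, 2), (2, 3), (3, 4), (4, 1), (2, 4)]
print(approx_vertex_cover(edges))  # a vertex cover of size at most 2x the optimum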

2012-01-06

Approximation algorithm


In computer science and operations research, approximation algorithms are algorithms used to find approximate solutions to optimization problems.


Approximation algorithms are often associated with NP-hard problems; since it is unlikely that there can ever be efficient polynomial-time exact algorithms solving NP-hard problems, one settles for polynomial-time sub-optimal solutions. 


Unlike heuristics, which usually only find reasonably good solutions reasonably fast, one wants provable solution quality and provable run time bounds. Ideally, the approximation is optimal up to a small constant factor (for instance within 5% of the optimal solution). 


Approximation algorithms are increasingly being used for problems where exact polynomial-time algorithms are known but are too expensive due to the input size.

2011-11-21

Trie: an effective data structure for searching a dictionary

A trie is the so-called "prefix tree".

Here, the prefixes of a string are like the following.

For example,

String = "abcdef"

1st prefix substring is "abcdef"
2nd prefix substring is "abcde"
3rd prefix substring is "abcd"
....

....

So, all of the prefix substrings can be represented as a tree.

A trie is built from this prefix-tree concept: the nodes of a trie do not hold any character (the characters label the edges), but every node can hold one index.

This example is for S = {"to", "tea", "ted", "ten", "A", "i", "in", "inn"}, where the indices of the elements of S are {7, 3, 4, 12, 15, 11, 5, 9}.

Searching a trie is very fast: looking up a string of length m takes O(m) time, because the search simply follows one edge per character down from the root.

Cited from Wikipedia

A trie is very well suited to searching dictionary words, because the sorted structure of a dictionary is very similar to a trie.
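Here is a minimal Python sketch of a trie matching the description above: the characters label the edges (dictionary keys) and a node only stores an optional index. It uses the example set from this post.

# Minimal trie: insert(word, index) and search(word) -> index or None.
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode (edge labels)
        self.index = None    # value stored at this node, if any

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, index):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.index = index

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node.index

trie = Trie()
for word, idx in zip(["to", "tea", "ted", "ten", "A", "i", "in", "inn"],
                     [7, 3, 4, 12, 15, 11, 5, 9]):
    trie.insert(word, idx)
print(trie.search("ten"))   # 12
print(trie.search("te"))    # None (no index stored at that node)

search() follows exactly one edge per character, which is where the O(m) lookup time comes from.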


2011-11-14

Spectra alignment




This is an algorithm to find the common elements between two sets after k shifts.


D(k) is the maximum number of common elements between the two sets after k shifts.


[Example]

We assume that one shift subtracts 5 from each element, from the 6th element to the nth element.

Set A = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Set B = {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}
Set C = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}

1. D(1) between Set A and Set B is 10.

{10, 20, 30, 40, 50} is common with Set B, and the subset {60, 70, 80, 90, 100} of Set A becomes common with Set B after one shift that subtracts 5 from the 6th through nth elements.

2. D(1) between Set A and Set C is 6.

{10, 30, 50} is common with Set C, and after one shift, {55, 75, 95} will also be common with Set C.
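Here is a tiny brute-force sketch that reproduces the example above, where one shift subtracts 5 from every element of Set A from some position onward. It is only meant to illustrate D(1); the data and function name are from this example, not from the full spectral alignment dynamic program.

# Brute-force D(1): try every suffix of A as the shifted block and count matches in B.
def best_single_shift(a, b):
    best = len(set(a) & set(b))               # zero shifts
    for start in range(len(a)):               # candidate start of the shifted suffix
        shifted = a[:start] + [x - 5 for x in a[start:]]
        best = max(best, len(set(shifted) & set(b)))
    return best

A = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
B = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]
C = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]
print(best_single_shift(A, B))  # 10
print(best_single_shift(A, C))  # 6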


de Bruijn graph theory

Recently, de Bruijn graph theory has been widely used in the field of genome assembly.

In particular, the Velvet assembler is based on this graph theory.

The goal of a de Bruijn graph is to build a superstring by overlapping sequences on their length-(k-1) suffixes and prefixes.

So this concept matches well with the assembly of short reads of a genome.

de Bruijn graph sample (cited from Wikipedia)


How can a de Bruijn graph make a superstring?

Because the de Bruijn graph is a directed graph, given some input sequences it can produce a superstring as output by traversing every edge or every node.

A de Bruijn graph traversal can be formulated as either a Hamiltonian cycle (visiting every node) or an Eulerian cycle (visiting every edge).
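Here is a minimal sketch of the idea: nodes are (k-1)-mers, each k-mer contributes a directed edge, and an Eulerian path over the edges spells out a superstring containing all the k-mers. The read and the k value below are made-up example data, not from any real assembler.

# Toy de Bruijn assembly: build the graph from k-mers, then walk an Eulerian path.
from collections import defaultdict

def de_bruijn(kmers):
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])   # edge: prefix (k-1)-mer -> suffix (k-1)-mer
    return graph

def eulerian_path(graph, start):
    # Hierholzer's algorithm; assumes an Eulerian path starting at `start` exists.
    graph = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph.get(u):
            stack.append(graph[u].pop())    # follow an unused edge
        else:
            path.append(stack.pop())        # dead end: add node to the path
    return path[::-1]

read, k = "ACGTACGGACG", 4
kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
graph = de_bruijn(kmers)
path = eulerian_path(graph, kmers[0][:-1])
superstring = path[0] + "".join(node[-1] for node in path[1:])
print(superstring)  # a superstring containing every k-mer of the read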

2011-11-08

'AC' tag of Swissprot dat file

I had thought that 'AC' was unique like 'ID' in Swissprot.

But, it was not true.

Moreover, there were multiple 'AC's in one 'ID'.

Oooops!!