Generating (DNA and
Protein) Motifs from Profile Matrixes
Paul FODOR , Joy DUTTA , Steven
SKIENA
Stony Brook University,
Stony Brook, NY 11720
A DNA profile matrix gives probability of each base appearing in each position, so a length-n profile matrix scores the likelihood of each of 4n sequences (for DNA sequences the alphabet size is 4). Similar with DNA profiles are the protein profiles but the alphabet size in this case varies with the number of different amino acids (usually the alphabet size m is 20, the alphabet being the symbols of the 20 amino acids represented in the genetic code – the alphabet size can vary since in nature were discovered until now more than 500 amino acids). Given a profile matrix and appropriate data structure, we want to generate these sequences in order of decreasing weight. By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements. The simplest case is online X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).
A similar thing can be done for larger profiles (n-dimension matrix), but it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal). When properly viewed, the algorithm becomes clean and simple.
We created an efficient implementation (C++) of the proposed algorithm with a nice and usable interface for naive users - website on Internet (in Java Server Pages and Servlets):
Our program includes different execution options, like: give first k-mers, give all k-mers over a given probability, give the k-th number motif. We also explored different directions of the problem like: what about motifs with don’t care terms and ways to incorporate motifs from motif databases in our program.
We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources - see chapter with experiments for sources):
Certain patterns and protein sequences are conserved by evolution, meaning that they likely have important biological functions and thus, is important to be able to identify possible patterns (called motifs) from a set of related sequences, and be able to test whether a given string contains a given motif.
For example:
Highly conserved doesn’t mean completely conserved and such sequence matches may not stand out in any pair wise alignment (these regions can mutate on non important bases).
There are two standard ways to summarize a pattern across multiple sequences:
· A profile is a matrix of probabilities, where the rows represent possible bases and the columns represent consecutive sequence positions (see Table 1.1) (note: a perfect profile has one base with probability 1.0 in each column, while a perfectly useless profile has entries of equal probability).
|
|
s1 |
s2 |
s3 |
|
A |
|||
|
C |
|||
|
G |
|||
|
T |
|||
|
Motif |
A |
A |
G |
Table 1: Example of profile and motif
· Consensus sequence is a concise representation of the highest probability characters of the profile, as a string or regular expression. Note 1: One can think of consensus as an “ancestor” motif, from which mutated motifs emerged. Note 2: The distance between a real motif and the consensus sequence is generally less than that for two real motifs.
The Motif Finding Problem is a defined as follows: given a random sample of sequences (DNA or amino acids) find the pattern that is implanted in each of the individual arrays, namely, the motif. In order to solve the problem, one has to record the frequencies of patterns in the nucleotide sequences and previous knowledge of established motifs reduces the complexity of the problem.
Limitations: we don’t have the complete dictionary of motifs, the “genetic” language does not have a standard “grammar”, only a small fraction of nucleotide sequences encode for motifs, the size of data is enormous and human intelligence is focused on creating good algorithms for the problem rather than on making inferences and guesses on the possible motifs in the sequence.
The Motif Testing Problems defined as follows: test whether a given string contains a given motif. When a single contiguous string represents a motif, exact text search methods suffice to see if a given sequence contains it. When a profile does not contain indels, the probability that a given sequence contains the motif can be easily determined by multiplying the probabilities of each character in each position of each window. If one has a probabilistic model of how typical DNA or amino acids sequences are generated, one can easily compute the probability that a given profile score is significant. When profiles or motifs have gaps, the optimal alignment can be computed using a dynamic programming-type algorithm.
Table 2: Example of probability table
The probability for sequence “ACG” is: P(ACG) =
(0.9)(0.25)(0.60) = 0.0135
By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements:
Plog(ACG) = log(0.9) + log(0.25) + log(0.60) =
log(0.0135)
Getting the most probable string (consensus sequence) is trivial: greedily selecting the character with highest probability for each position, the cost being O(nm). Finding the second most probable string requires finding the character αk ≠ αi with the lowest penalty. The penalty for changing sj = αi to sj = αk is equal to pi,j – pi,k.
It is seen that continuing in this fashion will not always get you the next most probable string - there's a need to be doing some sort of “backtracking”. For example giving the first 10 motifs for the example above (Table 3), one can easily observe that between positions 3 and 4 or 7 and 8 we have a backtrack:
|
AAG |
0.1404 |
|
|
ACG |
0.1350 |
|
|
AGG |
0.1350 |
|
|
ATG |
0.1296 |
|
|
AAC |
0.0936 |
|
|
ACC |
0.0900 |
|
AGC |
0.0900 |
|
|
ATC |
0.0864 |
|
|
CAG |
0.0156 |
|
|
CGG |
0.0150 |
Table 3: Example of “backtracking” in line 4
The problem which we wanted to solve in this project was to find a technique to quickly generate the first k most probable strings from a profile matrix. A brute force method would require to construct all mn different possible motifs, pick the first k possible and sort them – this solution requires exponential time: O(mn) for generating the motifs, computing the probability of each one and selecting the first k and O(k logk) to sort the k-mers.
We looked for a different geometric solution which does not require exponential time and depends on the value of k since usually we just want to calculate a small number of kmers.
The algorithm is simple and is illustrated below:
Above we considered that each dimension has the same importance but our algorithm considers that different positions have different importance by using a geometric approach of the distance from the point to a query line (n-dimensional plane). This algorithm for n-dimensions is using the formula for the distance from a point to a n-dimensional plane:
· given the point with coordinates(x01, x02, …, x0n) and the plane A1 x1 + A2 x2 + … + An xn + C = 0

ALGORITHM in
pseudocod – this algorithm is implemented fast in C++ (see appendices for
complete listing or website: http://www.lmc.cs.sunysb.edu/~pfodor/motif)
Input:
·
the alphabet size m
·
the string size n
(n-dimensions)
·
the matrix P(n, m)
·
output k most probable strings
·
profile matrix m x n
1.
Sort the columns of the matrix by probability and remember the mappings with
the original matrix (keep in each element <probability, index i in
alphabet>).
We
use the following data structure: composite-point <point1, point2,
…, pointn>.
3.
We create a priority queue candidate of composite points in which the priority
of the composite point is given by the distance from the point to the
n-dimensional plane (query line).
4.
En-queue the point with the highest probability on all n dimensions.
5.
Start iteration until we found k points – we de-queued n point from the
priority queue (do iteration k times):
de-queue the highest priority composite point and insert n neigbours (calculate the distance for en-queue).
To understand the algorithm completely consider the simplest case X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).
This case can be visualized (length n = 2 and an alphabet of size |Σ| = m) in Figure 1 and the execution of the algorithm can be seen in Figure 2.

Figure 1: Geometric
interpretation in
2


Create matrix – original view Step 1: find 1st max – choose max in all dimensions
- next candidates: (p’m,2, p’m-1,1), (p’m-1,2, p’m,1)


Step 2: find 2nd max Step 3: find 3rd max
- next candidates: (p’m,2, p’m-2,1), (p’m-1,2, p’m-1,1) …. (p’m-1,2, p’m,1)
Figure 2: Algorithm execution for 2 dimensions
For three dimensions case the reader can see Figure 3:

Figure 2: Geometric interpretation in
3
Considering the time this algorithm is much faster than the brute force approach since the preprocessing cost was O(nm logm) and to calculate each next kmer is O(lognm) per element (we insert the new candidates in the priority queue – we used STL C++ priority queue). Now is obvious that for small k’s this is much further than the brute force which required exponential time O(mn).
Next it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal), but this case is very improbable since we always want the first small number of k’s motifs and we won’t reach near the diagonal.
Our program includes different execution options, like:
For computing only the k-th motif we tried to come with a faster algorithm (by binary search on n-dinemsions) but we realized that this is not possible – starting point was the Zone Theorem (which is like a binary search method on n dimensions) which says that we can “march" through the zone of a query line q in O(mn). In addition, |Zone|=O(mn-1) implies S(m, n) = O(mn-1), but all this does not say anything about how to choose the position of the query line only how many points are above the query line after you already choose a query line.
We also explored different directions of the problem like:
This project opens many future directions but one of the most important is to implement a fast motif finding program. This was over our time possibilities but is a good and not very complex future direction. It that takes as input: a gene sequence and a set of profile matrices (with either counts or thresholds) and output: where on the sequence does the profile adequately or best match.
Algorithm:
· Build suffix tree of the sequence.
· For each profile, do an exact search in the suffix tree with the best k-mers for it in decreasing order until it matches (print the motifs and positions :-)
We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources - see chapter with experiments for sources):
We also tried to see for typical motifs in motif databases how many patterns are expected at a given probability threshold, or what is the expected probability of the i-th motif. We wanted to find a faster algorithm for what if we just want to know all motifs with probability over a threshold, or even just how many there are (binary search).
A DNA profile matrix gives probability of each base appearing in each position, so a length-n profile matrix scores the likelihood of each of 4n sequences (for DNA sequences the alphabet size is 4). Similar with DNA profiles are the protein profiles but the alphabet size in this case varies with the number of different amino acids (usually the alphabet size m is 20, the alphabet being the symbols of the 20 amino acids represented in the genetic code – the alphabet size can vary since in nature were discovered until now more than 500 amino acids – see http://en.wikipedia.org/wiki/Amino_acids).
Given a profile matrix and appropriate data structure, we want to generate these sequences in order of decreasing weight. By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements.
The simplest case is online X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).
A similar thing can be done for larger profiles (n-dimension matrix), but it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal). When properly viewed, the algorithm becomes clean and simple.
We created an efficient implementation (C++) of the proposed algorithm with a nice and usable interface for naive users - website on Internet (in Java Server Pages and Servlets):
Our program includes different execution options, like: give first k-mers, give all k-mers over a given probability, give the k-th number motif. We also explored different directions of the problem like: what about motifs with don’t care terms and ways to incorporate motifs from motif databases in our program.
We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources - see chapter with experiments for sources):
We also tried to see for typical motifs in motif databases how many patterns are expected at a given probability threshold, or what is the expected probability of the i-th motif. We wanted to find a faster algorithm for what if we just want to know all motifs with probability over a threshold, or even just how many there are (binary search).
We also tried to implement a fast motif finding program that takes as input: a gene sequence and a set of profile matrices (with either counts or thresholds) and output: where on the sequence does the profile adequately or best match – we didn’t have time to implement this algorithm since this was over our time possibilities – the algorithm is described in the future work section.
algorithm:
Build suffix tree (call an existing program)
for each profile, do an exact search in the suffix tree with the
best k-mers for it in decreasing order until it matches.
A.1. kthmp-geometric.cpp
// Program to find the kth most probable
sequence of length n where albhabet size is m - geometric approach
#ifdef __GNUC__
#define int64 long long
#else // MSVC, say
#pragma warning(disable: 4786)
#endif
#include <iostream>
#include <fstream>
#include <vector>
#include <utility>
#include <algorithm>
#include <map>
#include <queue>
#include <cmath>
using namespace std;
///////////////////////////////////////////////////////////////////////////
class MatrixElem
{
public:
float
prob;
int
alphabet;
int
sortedIndex;
MatrixElem(float p=0.0, int a=0): prob(p), alphabet(a) { sortedIndex =
-1; }
friend
ostream& operator<<(ostream& out, const MatrixElem& me) {
out
<< "<" << me.sortedIndex << ","
<< me.prob << "," << me.alphabet <<
">";
return out;
}
bool
operator==(const MatrixElem& me) {
if(prob == me.prob && alphabet == me.alphabet &&
sortedIndex == me.sortedIndex) return true;
return false;
}
static bool comparator(const MatrixElem&
me1, const MatrixElem& me2) {
if(me1.prob > me2.prob) return true;
return false;
}
};
///////////////////////////////////////////////////////////////////////////
typedef vector<int> VI;
typedef vector<float> VF;
typedef vector<MatrixElem> VE;
typedef vector<vector<MatrixElem>
> VVE;
///////////////////////////////////////////////////////////////////////////
class CompositePoint
{
private:
CompositePoint() {}
float
computeSumLogProbs(const VE&);
float
computeDistFromNDimPlane(const VF&);
public:
float
distFromNDimPlane;
float
sumLogProbs; // sum of log of probabilities of the elems
VE
matElems; // must have n elements
CompositePoint(VE elems, const VF& coeffs) {
matElems = elems;
sumLogProbs
= computeSumLogProbs(elems);
distFromNDimPlane = computeDistFromNDimPlane(coeffs);
}
class
comparator {
public:
bool
operator()(const CompositePoint& cp1, const CompositePoint& cp2) {
if(cp1.distFromNDimPlane < cp2.distFromNDimPlane) return true;
return false;
}
};
};
float CompositePoint::computeSumLogProbs(const
VE& elems)
{
float
sum = 0.0;
for(int i=0; i<elems.size(); i++) {
sum
+= log(elems[i].prob);
}
return
sum;
}
float CompositePoint::computeDistFromNDimPlane(const
VF& coeffs)
{
float
coeff_sq_sum=0.0, prob_sum=0.0;
int n
= coeffs.size();
for(int i=0; i<n; i++) {
prob_sum += coeffs[i] * matElems[i].prob;
coeff_sq_sum += coeffs[i] * coeffs[i];
}
return
prob_sum / sqrt(coeff_sq_sum);
}
///////////////////////////////////////////////////////////////////////////
ostream& operator<<(ostream&
out, const VE& ve)
{
int i;
for(i=0; i<ve.size(); i++) {
out
<< ve[i] << " ";
}
return
out;
}
ostream& operator<<(ostream&
out, const VVE& vve)
{
int i;
for(i=0; i<vve.size(); i++) {
out
<< vve[i] << endl;
}
return
out;
}
ostream& operator<<(ostream&
out, const CompositePoint& cp)
{
out
<< "CP[(" << cp.matElems << "), "
<< cp.distFromNDimPlane << "]";
return
out;
}
///////////////////////////////////////////////////////////////////////////
// a simple mapping to the alphabet symbol
from the representing integer
string getAlphabet(int index)
{
switch(index) {
case
0: return "A";
case
1: return "C";
case
2: return "T";
case
3: return "G";
default: return "X";
}
}
///////////////////////////////////////////////////////////////////////////
string getSequence(const CompositePoint&
cp)
{