Generating (DNA and Protein) Motifs from Profile Matrixes

Paul FODOR , Joy DUTTA , Steven SKIENA

Stony Brook University, Stony Brook, NY 11720

 

Abstract

A DNA profile matrix gives probability of each base appearing in each position, so a length-n profile matrix scores the likelihood of each of 4n sequences (for DNA sequences the alphabet size is 4). Similar with DNA profiles are the protein profiles but the alphabet size in this case varies with the number of different amino acids (usually the alphabet size m is 20, the alphabet being the symbols of the 20 amino acids represented in the genetic code – the alphabet size can vary since in nature were discovered until now more than 500 amino acids). Given a profile matrix and appropriate data structure, we want to generate these sequences in order of decreasing weight. By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements. The simplest case is online X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).

A similar thing can be done for larger profiles (n-dimension matrix), but it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal). When properly viewed, the algorithm becomes clean and simple.

We created an efficient implementation (C++) of the proposed algorithm with a nice and usable interface for naive users  - website on Internet (in Java Server Pages and Servlets):

Our program includes different execution options, like: give first k-mers, give all k-mers over a given probability, give the k-th number motif. We also explored different directions of the problem like: what about motifs with don’t care terms and ways to incorporate motifs from motif databases in our program.

We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources  - see chapter with experiments for sources):

 

1.    Introduction: About Profiles and Motifs

Certain patterns and protein sequences are conserved by evolution, meaning that they likely have important biological functions and thus, is important to be able to identify possible patterns (called motifs) from a set of related sequences, and be able to test whether a given string contains a given motif.

For example:

Highly conserved doesn’t mean completely conserved and such sequence matches may not stand out in any pair wise alignment (these regions can mutate on non important bases).

There are two standard ways to summarize a pattern across multiple sequences:

·        A profile is a matrix of probabilities, where the rows represent possible bases and the columns represent consecutive sequence positions (see Table 1.1) (note: a perfect profile has one base with probability 1.0 in each column, while a perfectly useless profile has entries of equal probability).

 

s1

s2

s3

A

0.90

0.26

0.00

C

0.10

0.25

0.40

G

0.00

0.25

0.60

T

0.00

0.24

0.00

Motif

A

A

G

Table 1: Example of profile and motif

·        Consensus sequence is a concise representation of the highest probability characters of the profile, as a string or regular expression. Note 1: One can think of consensus as an “ancestor” motif, from which mutated motifs emerged. Note 2: The distance between a real motif and the consensus sequence is generally less than that for two real motifs.

The Motif Finding Problem is a defined as follows: given a random sample of sequences (DNA or amino acids) find the pattern that is implanted in each of the individual arrays, namely, the motif. In order to solve the problem, one has to record the frequencies of patterns in the nucleotide sequences and previous knowledge of established motifs reduces the complexity of the problem.

Limitations: we don’t have the complete dictionary of motifs, the “genetic” language does not have a standard “grammar”, only a small fraction of nucleotide sequences encode for motifs, the size of data is enormous and human intelligence is focused on creating good algorithms for the problem rather than on making inferences and guesses on the possible motifs in the sequence.

The Motif Testing Problems defined as follows: test whether a given string contains a given motif. When a single contiguous string represents a motif, exact text search methods suffice to see if a given sequence contains it. When a profile does not contain indels, the probability that a given sequence contains the motif can be easily determined by multiplying the probabilities of each character in each position of each window. If one has a probabilistic model of how typical DNA or amino acids sequences are generated, one can easily compute the probability that a given profile score is significant. When profiles or motifs have gaps, the optimal alignment can be computed using a dynamic programming-type algorithm.


2.    Description of the Specific Problem and Possible Algorithms

Given a profile matrix and appropriate data structure, we wanted to generate the motif sequences in order of decreasing weight.

Given an alphabet S = {α1, α2, …, αi, … αm} and m x n probability matrix P = [pi,j] with the properties:

·        0 ≤ pi,j ≤ 1

·        Si=1,m pi,j =1

Using the matrix and the alphabet we can construct a string S = s1, s2, …, sj, …, sn, where for each j=1,n sj = αi where i is in [1,m] with probability pi,j.

Example: S = {A, C, G, T} and P is given by the Table 2.

 

S

 

s1

S2

S3

α1 = A

0.90

0.26

0.00

α 2 = C

0.10

0.25

0.40

Α3 = G

0.00

0.25

0.60

Α4 = T

0.00

0.24

0.00

Table 2: Example of probability table

The probability for sequence “ACG” is: P(ACG) = (0.9)(0.25)(0.60) = 0.0135

By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements:

Plog(ACG) = log(0.9) + log(0.25) + log(0.60) = log(0.0135)

Getting the most probable string (consensus sequence) is trivial: greedily selecting the character with highest probability for each position, the cost being O(nm). Finding the second most probable string requires finding the character αk ≠ αi with the lowest penalty. The penalty for changing sj = αi to sj = αk is equal to pi,j – pi,k.

It is seen that continuing in this fashion will not always get you the next most probable string - there's a need to be doing some sort of “backtracking”. For example giving the first 10 motifs for the example above (Table 3), one can easily observe that between positions 3 and 4 or 7 and 8 we have a backtrack:

Position

Motif

Probability

1

AAG

0.1404

2

ACG

0.1350

3

AGG

0.1350

4

ATG

0.1296

5

AAC

0.0936

6

ACC

0.0900 


 

7

AGC

0.0900  

8

ATC

0.0864

9

CAG

0.0156

10

CGG

0.0150

Table 3: Example of “backtracking” in line 4

The problem which we wanted to solve in this project was to find a technique to quickly generate the first k most probable strings from a profile matrix. A brute force method would require to construct all mn different possible motifs, pick the first k possible and sort them – this solution requires exponential time: O(mn) for generating the motifs, computing the probability of each one and selecting the first k and O(k logk) to sort the k-mers.

We looked for a different geometric solution which does not require exponential time and depends on the value of k since usually we just want to calculate a small number of kmers.

The algorithm is simple and is illustrated below:

Above we considered that each dimension has the same importance but our algorithm considers that different positions have different importance by using a geometric approach of the distance from the point to a query line (n-dimensional plane). This algorithm for n-dimensions is using the formula for the distance from a point to a n-dimensional plane:

·        given the point with coordinates(x01, x02, …, x0n) and the plane A1 x1 + A2 x2 + … + An xn + C = 0

           

ALGORITHM in pseudocod – this algorithm is implemented fast in C++ (see appendices for complete listing or website: http://www.lmc.cs.sunysb.edu/~pfodor/motif)

 

Input:

·        the alphabet size m

·        the string size n (n-dimensions)

·        the matrix P(n, m)

·        output k most probable strings

·        profile matrix m x n

 

1. Sort the columns of the matrix by probability and remember the mappings with the original matrix (keep in each element <probability, index i in alphabet>).

We use the following data structure: composite-point <point1, point2, …, pointn>.

3. We create a priority queue candidate of composite points in which the priority of the composite point is given by the distance from the point to the n-dimensional plane (query line).

4. En-queue the point with the highest probability on all n dimensions.

5. Start iteration until we found k points – we de-queued n point from the priority queue (do iteration k times):

de-queue the highest priority composite point and insert n neigbours (calculate the distance for en-queue).

To understand the algorithm completely consider the simplest case X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).

This case can be visualized (length n = 2 and an alphabet of size |Σ| = m) in Figure 1 and the execution of the algorithm can be seen in Figure 2.

Figure 1: Geometric interpretation in 2

Create matrix – original view                          Step 1: find 1st max – choose max in all dimensions

                                                                                                - next candidates: (p’m,2, p’m-1,1), (p’m-1,2, p’m,1)

Step 2: find 2nd max                                                      Step 3: find 3rd max

- next candidates: (p’m,2, p’m-2,1), (p’m-1,2, p’m-1,1) …. (p’m-1,2, p’m,1)

Figure 2: Algorithm execution for 2 dimensions

For three dimensions case the reader can see Figure 3:

Figure 2: Geometric interpretation in 3

Considering the time this algorithm is much faster than the brute force approach since the preprocessing cost was O(nm logm) and to calculate each next kmer is O(lognm) per element (we insert the new candidates in the priority queue – we used STL C++ priority queue). Now is obvious that for small k’s this is much further than the brute force which required exponential time O(mn).

Next it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal), but this case is very improbable since we always want the first small number of k’s motifs and we won’t reach near the diagonal.

Our program includes different execution options, like:

For computing only the k-th motif we tried to come with a faster algorithm (by binary search on n-dinemsions) but we realized that this is not possible – starting point was the Zone Theorem (which is like a binary search method on n dimensions) which says that we can “march" through the zone of a query line q in O(mn). In addition, |Zone|=O(mn-1)  implies S(m, n) = O(mn-1), but all this does not say anything about how to choose the position of the query line only how many points are above the query line after you already choose a query line.

We also explored different directions of the problem like:

This project opens many future directions but one of the most important is to implement a fast motif finding program. This was over our time possibilities but is a good and not very complex future direction. It that takes as input: a gene sequence and a set of profile matrices (with either counts or thresholds) and output: where on the sequence does the profile adequately or best match.

 

Algorithm:

·        Build suffix tree of the sequence.

·        For each profile, do an exact search in the suffix tree with the best k-mers for it in decreasing order until it matches (print the motifs and positions :-)


3.    Performance Evaluation and Experiments

 

We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources  - see chapter with experiments for sources):

 

We also tried to see for typical motifs in motif databases how many patterns are expected at a given probability threshold, or what is the expected probability of the i-th motif. We wanted to find a faster algorithm for what if we just want to know all motifs with probability over a threshold, or even just how many there are (binary search).

 

4.    Conclusion

A DNA profile matrix gives probability of each base appearing in each position, so a length-n profile matrix scores the likelihood of each of 4n sequences (for DNA sequences the alphabet size is 4). Similar with DNA profiles are the protein profiles but the alphabet size in this case varies with the number of different amino acids (usually the alphabet size m is 20, the alphabet being the symbols of the 20 amino acids represented in the genetic code – the alphabet size can vary since in nature were discovered until now more than 500 amino acids – see http://en.wikipedia.org/wiki/Amino_acids).

Given a profile matrix and appropriate data structure, we want to generate these sequences in order of decreasing weight. By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements.

The simplest case is online X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).

A similar thing can be done for larger profiles (n-dimension matrix), but it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal). When properly viewed, the algorithm becomes clean and simple.

We created an efficient implementation (C++) of the proposed algorithm with a nice and usable interface for naive users  - website on Internet (in Java Server Pages and Servlets):

Our program includes different execution options, like: give first k-mers, give all k-mers over a given probability, give the k-th number motif. We also explored different directions of the problem like: what about motifs with don’t care terms and ways to incorporate motifs from motif databases in our program.

We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources  - see chapter with experiments for sources):

 

We also tried to see for typical motifs in motif databases how many patterns are expected at a given probability threshold, or what is the expected probability of the i-th motif. We wanted to find a faster algorithm for what if we just want to know all motifs with probability over a threshold, or even just how many there are (binary search).

 

We also tried to implement a fast motif finding program that takes as input: a gene sequence and a set of profile matrices (with either counts or thresholds) and output: where on the sequence does the profile adequately or best match – we didn’t have time to implement this algorithm since this was over our time possibilities – the algorithm is described in the future work section.

 

algorithm:

        Build suffix tree (call an existing program)

        for each profile, do an exact search in the suffix tree with the

                best k-mers for it in decreasing order until it matches.

Appendix A – Source Code

 

A.1. kthmp-geometric.cpp

 


// Program to find the kth most probable sequence of length n where albhabet size is m - geometric approach

 

#ifdef __GNUC__  

#define int64 long long  

#else // MSVC, say

#pragma warning(disable: 4786)  

#endif

 

#include <iostream>

#include <fstream>

#include <vector>

#include <utility>

#include <algorithm>

#include <map>

#include <queue>

#include <cmath>

 

using namespace std;

 

///////////////////////////////////////////////////////////////////////////

 

class MatrixElem

{

public:

  float prob;

  int alphabet;

  int sortedIndex;

 

  MatrixElem(float p=0.0, int a=0): prob(p), alphabet(a) { sortedIndex = -1; }

  friend ostream& operator<<(ostream& out, const MatrixElem& me) {

    out << "<" << me.sortedIndex << "," << me.prob << "," << me.alphabet << ">";

    return out;

  }

  bool operator==(const MatrixElem& me) {

    if(prob == me.prob && alphabet == me.alphabet && sortedIndex == me.sortedIndex) return true;

    return false;

  }

  static bool comparator(const MatrixElem& me1, const MatrixElem& me2) {

    if(me1.prob > me2.prob) return true;

    return false;

  }

};

 

///////////////////////////////////////////////////////////////////////////

 

typedef vector<int> VI;

typedef vector<float> VF;

typedef vector<MatrixElem> VE;

typedef vector<vector<MatrixElem> > VVE;

 

///////////////////////////////////////////////////////////////////////////

 

class CompositePoint

{

private:

  CompositePoint() {}

  float computeSumLogProbs(const VE&);

  float computeDistFromNDimPlane(const VF&);

 

public:

  float distFromNDimPlane;

  float sumLogProbs; // sum of log of probabilities of the elems

  VE matElems; // must have n elements

  CompositePoint(VE elems, const VF& coeffs) {

    matElems = elems;

    sumLogProbs = computeSumLogProbs(elems);

    distFromNDimPlane = computeDistFromNDimPlane(coeffs);

  }

 

  class comparator {

  public:

    bool operator()(const CompositePoint& cp1, const CompositePoint& cp2) {

      if(cp1.distFromNDimPlane < cp2.distFromNDimPlane) return true;

      return false;

    }

  };

};

 

float CompositePoint::computeSumLogProbs(const VE& elems)

{

  float sum = 0.0;

  for(int i=0; i<elems.size(); i++) {

    sum += log(elems[i].prob);

  }

  return sum;

}

 

float CompositePoint::computeDistFromNDimPlane(const VF& coeffs)

{

  float coeff_sq_sum=0.0, prob_sum=0.0;

  int n = coeffs.size();

 

  for(int i=0; i<n; i++) {

     prob_sum += coeffs[i] * matElems[i].prob;

     coeff_sq_sum += coeffs[i] * coeffs[i];

  }

 

  return prob_sum / sqrt(coeff_sq_sum);

}

 

///////////////////////////////////////////////////////////////////////////

 

ostream& operator<<(ostream& out, const VE& ve)

{

  int i;

  for(i=0; i<ve.size(); i++) {

    out << ve[i] << " ";

  }

  return out;

}

 

ostream& operator<<(ostream& out, const VVE& vve)

{

  int i;

  for(i=0; i<vve.size(); i++) {

    out << vve[i] << endl;

  }

  return out;

}

 

ostream& operator<<(ostream& out, const CompositePoint& cp)

{

  out << "CP[(" << cp.matElems << "), " << cp.distFromNDimPlane << "]";

  return out;

}

 

///////////////////////////////////////////////////////////////////////////

 

// a simple mapping to the alphabet symbol from the representing integer

string getAlphabet(int index)

{

  switch(index) {

    case 0: return "A";

    case 1: return "C";

    case 2: return "T";

    case 3: return "G";

    default: return "X";

  }

}

 

///////////////////////////////////////////////////////////////////////////

 

string getSequence(const CompositePoint& cp)

{