Generating (DNA and Protein) Motifs from Profile Matrixes

Paul FODOR , Joy DUTTA , Steven SKIENA

Stony Brook University, Stony Brook, NY 11720

# Abstract

A DNA profile matrix gives probability of each base appearing in each position, so a length-n profile matrix scores the likelihood of each of 4n sequences (for DNA sequences the alphabet size is 4). Similar with DNA profiles are the protein profiles but the alphabet size in this case varies with the number of different amino acids (usually the alphabet size m is 20, the alphabet being the symbols of the 20 amino acids represented in the genetic code – the alphabet size can vary since in nature were discovered until now more than 500 amino acids). Given a profile matrix and appropriate data structure, we want to generate these sequences in order of decreasing weight. By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements. The simplest case is online X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).

A similar thing can be done for larger profiles (n-dimension matrix), but it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal). When properly viewed, the algorithm becomes clean and simple.

We created an efficient implementation (C++) of the proposed algorithm with a nice and usable interface for naive users  - website on Internet (in Java Server Pages and Servlets):

Our program includes different execution options, like: give first k-mers, give all k-mers over a given probability, give the k-th number motif. We also explored different directions of the problem like: what about motifs with don’t care terms and ways to incorporate motifs from motif databases in our program.

We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources  - see chapter with experiments for sources):

• comparing the times of our algorithm with the approach by computing the probability of all possible nm motifs and sorting them - for what values of k and counts is our algorithm much faster than generating all patterns and sorting,
• find how does the space depend upon k and for what values of k it becomes impossible.

# 1.    Introduction: About Profiles and Motifs

Certain patterns and protein sequences are conserved by evolution, meaning that they likely have important biological functions and thus, is important to be able to identify possible patterns (called motifs) from a set of related sequences, and be able to test whether a given string contains a given motif.

For example:

• In DNA sequences, promoter binding sites are typically marked by a pattern of 5-10 nucleotides shortly upstream of each co-regulated gene (every gene contains a regulatory region RR typically stretching 100-1000 bp upstream of the transcriptional start site - located within the RR are the Transcription Factor Binding Sites (TFBS) specific for a given transcription factor). One can think of this as that nucleotides in motifs encode for a message in the “genetic” language;
• In protein sequences, the highly conserved regions among similar proteins likely define the most important folds or binding sites.

Highly conserved doesn’t mean completely conserved and such sequence matches may not stand out in any pair wise alignment (these regions can mutate on non important bases).

There are two standard ways to summarize a pattern across multiple sequences:

·        A profile is a matrix of probabilities, where the rows represent possible bases and the columns represent consecutive sequence positions (see Table 1.1) (note: a perfect profile has one base with probability 1.0 in each column, while a perfectly useless profile has entries of equal probability).

 s1 s2 s3 A 0.90 0.26 0.00 C 0.10 0.25 0.40 G 0.00 0.25 0.60 T 0.00 0.24 0.00 Motif A A G

Table 1: Example of profile and motif

·        Consensus sequence is a concise representation of the highest probability characters of the profile, as a string or regular expression. Note 1: One can think of consensus as an “ancestor” motif, from which mutated motifs emerged. Note 2: The distance between a real motif and the consensus sequence is generally less than that for two real motifs.

The Motif Finding Problem is a defined as follows: given a random sample of sequences (DNA or amino acids) find the pattern that is implanted in each of the individual arrays, namely, the motif. In order to solve the problem, one has to record the frequencies of patterns in the nucleotide sequences and previous knowledge of established motifs reduces the complexity of the problem.

Limitations: we don’t have the complete dictionary of motifs, the “genetic” language does not have a standard “grammar”, only a small fraction of nucleotide sequences encode for motifs, the size of data is enormous and human intelligence is focused on creating good algorithms for the problem rather than on making inferences and guesses on the possible motifs in the sequence.

The Motif Testing Problems defined as follows: test whether a given string contains a given motif. When a single contiguous string represents a motif, exact text search methods suffice to see if a given sequence contains it. When a profile does not contain indels, the probability that a given sequence contains the motif can be easily determined by multiplying the probabilities of each character in each position of each window. If one has a probabilistic model of how typical DNA or amino acids sequences are generated, one can easily compute the probability that a given profile score is significant. When profiles or motifs have gaps, the optimal alignment can be computed using a dynamic programming-type algorithm.

# 2.    Description of the Specific Problem and Possible Algorithms

Given a profile matrix and appropriate data structure, we wanted to generate the motif sequences in order of decreasing weight.

Given an alphabet S = {α1, α2, …, αi, … αm} and m x n probability matrix P = [pi,j] with the properties:

·        0 ≤ pi,j ≤ 1

·        Si=1,m pi,j =1

Using the matrix and the alphabet we can construct a string S = s1, s2, …, sj, …, sn, where for each j=1,n sj = αi where i is in [1,m] with probability pi,j.

Example: S = {A, C, G, T} and P is given by the Table 2.

 S s1 S2 S3 α1 = A 0.90 0.26 0.00 α 2 = C 0.10 0.25 0.40 Α3 = G 0.00 0.25 0.60 Α4 = T 0.00 0.24 0.00

Table 2: Example of probability table

The probability for sequence “ACG” is: P(ACG) = (0.9)(0.25)(0.60) = 0.0135

By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements:

Plog(ACG) = log(0.9) + log(0.25) + log(0.60) = log(0.0135)

Getting the most probable string (consensus sequence) is trivial: greedily selecting the character with highest probability for each position, the cost being O(nm). Finding the second most probable string requires finding the character αk ≠ αi with the lowest penalty. The penalty for changing sj = αi to sj = αk is equal to pi,j – pi,k.

It is seen that continuing in this fashion will not always get you the next most probable string - there's a need to be doing some sort of “backtracking”. For example giving the first 10 motifs for the example above (Table 3), one can easily observe that between positions 3 and 4 or 7 and 8 we have a backtrack:

 Position Motif Probability 1 AAG 0.1404 2 ACG 0.1350 3 AGG 0.1350 4 ATG 0.1296 5 AAC 0.0936 6 ACC 0.0900

 7 AGC 0.09 8 ATC 0.0864 9 CAG 0.0156 10 CGG 0.015

Table 3: Example of “backtracking” in line 4

The problem which we wanted to solve in this project was to find a technique to quickly generate the first k most probable strings from a profile matrix. A brute force method would require to construct all mn different possible motifs, pick the first k possible and sort them – this solution requires exponential time: O(mn) for generating the motifs, computing the probability of each one and selecting the first k and O(k logk) to sort the k-mers.

We looked for a different geometric solution which does not require exponential time and depends on the value of k since usually we just want to calculate a small number of kmers.

The algorithm is simple and is illustrated below:

• The geometric solution starts by sorting each column of the matrix independently by probability – for the matrix P(m, n) the preprocessing time is O(nm logm). The new matrix in which we keep the original positions but the logarithms of probabilities are sorted is P’.
• The main iteration is based on a priority queue in which we initially push the most probable string (consensus sequence). In each iteration, we remove the min remaining sum of logarithm of probabilities and enqueue the lower neighbors of the min: successor sums of probabilities – using the positions in the matrix P’: (d1-1, d2, d3, …, di, …, dn), (d1, d2-1, d3, …, di, …, dn),  (d1, d2, d3-1, …, di, …, dn), …, (d1, d2, d3, …, di-1, …, dn), …, and (d1, d2, d3, …, di, …, dn-1). We continue the iteration until we find the desired kmers.

Above we considered that each dimension has the same importance but our algorithm considers that different positions have different importance by using a geometric approach of the distance from the point to a query line (n-dimensional plane). This algorithm for n-dimensions is using the formula for the distance from a point to a n-dimensional plane:

·        given the point with coordinates(x01, x02, …, x0n) and the plane A1 x1 + A2 x2 + … + An xn + C = 0

ALGORITHM in pseudocod – this algorithm is implemented fast in C++ (see appendices for complete listing or website: http://www.lmc.cs.sunysb.edu/~pfodor/motif)

Input:

·        the alphabet size m

·        the string size n (n-dimensions)

·        the matrix P(n, m)

·        output k most probable strings

·        profile matrix m x n

1. Sort the columns of the matrix by probability and remember the mappings with the original matrix (keep in each element <probability, index i in alphabet>).

We use the following data structure: composite-point <point1, point2, …, pointn>.

3. We create a priority queue candidate of composite points in which the priority of the composite point is given by the distance from the point to the n-dimensional plane (query line).

4. En-queue the point with the highest probability on all n dimensions.

5. Start iteration until we found k points – we de-queued n point from the priority queue (do iteration k times):

de-queue the highest priority composite point and insert n neigbours (calculate the distance for en-queue).

To understand the algorithm completely consider the simplest case X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).

This case can be visualized (length n = 2 and an alphabet of size |Σ| = m) in Figure 1 and the execution of the algorithm can be seen in Figure 2.

Figure 1: Geometric interpretation in 2

Create matrix – original view                          Step 1: find 1st max – choose max in all dimensions

- next candidates: (p’m,2, p’m-1,1), (p’m-1,2, p’m,1)

Step 2: find 2nd max                                                      Step 3: find 3rd max

- next candidates: (p’m,2, p’m-2,1), (p’m-1,2, p’m-1,1) …. (p’m-1,2, p’m,1)

Figure 2: Algorithm execution for 2 dimensions

For three dimensions case the reader can see Figure 3:

Figure 2: Geometric interpretation in 3

Considering the time this algorithm is much faster than the brute force approach since the preprocessing cost was O(nm logm) and to calculate each next kmer is O(lognm) per element (we insert the new candidates in the priority queue – we used STL C++ priority queue). Now is obvious that for small k’s this is much further than the brute force which required exponential time O(mn).

Next it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal), but this case is very improbable since we always want the first small number of k’s motifs and we won’t reach near the diagonal.

• We created an efficient implementation (C++) of the proposed algorithm with a nice and usable interface for naive users  - website on Internet (in Java Server Pages and Servlets).

Our program includes different execution options, like:

• give first k-mers,
• give all k-mers over a given probability – compute the motifs until we reach the required probability,
• give the k-th number motif – like the first case but give only the k-th motif.

For computing only the k-th motif we tried to come with a faster algorithm (by binary search on n-dinemsions) but we realized that this is not possible – starting point was the Zone Theorem (which is like a binary search method on n dimensions) which says that we can “march" through the zone of a query line q in O(mn). In addition, |Zone|=O(mn-1)  implies S(m, n) = O(mn-1), but all this does not say anything about how to choose the position of the query line only how many points are above the query line after you already choose a query line.

We also explored different directions of the problem like:

• what about motifs with don’t care terms – solution: simple by choosing different coefficients for the query line,
• ways to incorporate motifs from motif databases in our program – the only way we found was to check in the database the motif but only after we computed the first kmers. If we already know the motif we can apply the Zone Theorem and find the position k of this motif in the profile. Does this help when there are lots of motifs (hundreds and thousands) in the database? Does this guarantee that we find the good motifs? Is faster to compute the first k-mers and to check their existence in the database.

This project opens many future directions but one of the most important is to implement a fast motif finding program. This was over our time possibilities but is a good and not very complex future direction. It that takes as input: a gene sequence and a set of profile matrices (with either counts or thresholds) and output: where on the sequence does the profile adequately or best match.

Algorithm:

·        Build suffix tree of the sequence.

·        For each profile, do an exact search in the suffix tree with the best k-mers for it in decreasing order until it matches (print the motifs and positions :-)

# 3.    Performance Evaluation and Experiments

We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources  - see chapter with experiments for sources):

• comparing the times of our algorithm with the approach by computing the probability of all possible nm motifs and sorting them - for what values of k and counts is our algorithm much faster than generating all patterns and sorting,
• find how does the space depend upon k and for what values of k it becomes impossible.

We also tried to see for typical motifs in motif databases how many patterns are expected at a given probability threshold, or what is the expected probability of the i-th motif. We wanted to find a faster algorithm for what if we just want to know all motifs with probability over a threshold, or even just how many there are (binary search).

# 4.    Conclusion

A DNA profile matrix gives probability of each base appearing in each position, so a length-n profile matrix scores the likelihood of each of 4n sequences (for DNA sequences the alphabet size is 4). Similar with DNA profiles are the protein profiles but the alphabet size in this case varies with the number of different amino acids (usually the alphabet size m is 20, the alphabet being the symbols of the 20 amino acids represented in the genetic code – the alphabet size can vary since in nature were discovered until now more than 500 amino acids – see http://en.wikipedia.org/wiki/Amino_acids).

Given a profile matrix and appropriate data structure, we want to generate these sequences in order of decreasing weight. By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements.

The simplest case is online X+Y sorting, which corresponds to a profile of length 2 (2-dimension matrix) on an m-letter alphabet. This can be solved in O(log2 m) time per element (if the probabilities are already sorted). We use a priority queue (from STL C++) where each element consists of the (i, j) index pair keyed on the Xi + Yj. In each iteration, we remove the min remaining sum, and enqueue the two successor sums (i+1, j) and (i, j+1).

A similar thing can be done for larger profiles (n-dimension matrix), but it seems the set of active profiles (those on heap) can approach mn-1, where m is the alphabet size and n is the length of the profile (close to the diagonal). When properly viewed, the algorithm becomes clean and simple.

We created an efficient implementation (C++) of the proposed algorithm with a nice and usable interface for naive users  - website on Internet (in Java Server Pages and Servlets):

Our program includes different execution options, like: give first k-mers, give all k-mers over a given probability, give the k-th number motif. We also explored different directions of the problem like: what about motifs with don’t care terms and ways to incorporate motifs from motif databases in our program.

We realized multiple experiments concerning its performance that cover multiple directions (we took input DNA profiles from multiple sources  - see chapter with experiments for sources):

• comparing the times of our algorithm with the approach by computing the probability of all possible nm motifs and sorting them - for what values of k and counts is our algorithm much faster than generating all patterns and sorting,
• find how does the space depend upon k and for what values of k it becomes impossible.

We also tried to see for typical motifs in motif databases how many patterns are expected at a given probability threshold, or what is the expected probability of the i-th motif. We wanted to find a faster algorithm for what if we just want to know all motifs with probability over a threshold, or even just how many there are (binary search).

We also tried to implement a fast motif finding program that takes as input: a gene sequence and a set of profile matrices (with either counts or thresholds) and output: where on the sequence does the profile adequately or best match – we didn’t have time to implement this algorithm since this was over our time possibilities – the algorithm is described in the future work section.

algorithm:

Build suffix tree (call an existing program)

for each profile, do an exact search in the suffix tree with the

best k-mers for it in decreasing order until it matches.

# Appendix A – Source Code

A.1. kthmp-geometric.cpp

// Program to find the kth most probable sequence of length n where albhabet size is m - geometric approach

#ifdef __GNUC__

#define int64 long long

#else // MSVC, say

#pragma warning(disable: 4786)

#endif

#include <iostream>

#include <fstream>

#include <vector>

#include <utility>

#include <algorithm>

#include <map>

#include <queue>

#include <cmath>

using namespace std;

///////////////////////////////////////////////////////////////////////////

class MatrixElem

{

public:

float prob;

int alphabet;

int sortedIndex;

MatrixElem(float p=0.0, int a=0): prob(p), alphabet(a) { sortedIndex = -1; }

friend ostream& operator<<(ostream& out, const MatrixElem& me) {

out << "<" << me.sortedIndex << "," << me.prob << "," << me.alphabet << ">";

return out;

}

bool operator==(const MatrixElem& me) {

if(prob == me.prob && alphabet == me.alphabet && sortedIndex == me.sortedIndex) return true;

return false;

}

static bool comparator(const MatrixElem& me1, const MatrixElem& me2) {

if(me1.prob > me2.prob) return true;

return false;

}

};

///////////////////////////////////////////////////////////////////////////

typedef vector<int> VI;

typedef vector<float> VF;

typedef vector<MatrixElem> VE;

typedef vector<vector<MatrixElem> > VVE;

///////////////////////////////////////////////////////////////////////////

class CompositePoint

{

private:

CompositePoint() {}

float computeSumLogProbs(const VE&);

float computeDistFromNDimPlane(const VF&);

public:

float distFromNDimPlane;

float sumLogProbs; // sum of log of probabilities of the elems

VE matElems; // must have n elements

CompositePoint(VE elems, const VF& coeffs) {

matElems = elems;

sumLogProbs = computeSumLogProbs(elems);

distFromNDimPlane = computeDistFromNDimPlane(coeffs);

}

class comparator {

public:

bool operator()(const CompositePoint& cp1, const CompositePoint& cp2) {

if(cp1.distFromNDimPlane < cp2.distFromNDimPlane) return true;

return false;

}

};

};

float CompositePoint::computeSumLogProbs(const VE& elems)

{

float sum = 0.0;

for(int i=0; i<elems.size(); i++) {

sum += log(elems[i].prob);

}

return sum;

}

float CompositePoint::computeDistFromNDimPlane(const VF& coeffs)

{

float coeff_sq_sum=0.0, prob_sum=0.0;

int n = coeffs.size();

for(int i=0; i<n; i++) {

prob_sum += coeffs[i] * matElems[i].prob;

coeff_sq_sum += coeffs[i] * coeffs[i];

}

return prob_sum / sqrt(coeff_sq_sum);

}

///////////////////////////////////////////////////////////////////////////

ostream& operator<<(ostream& out, const VE& ve)

{

int i;

for(i=0; i<ve.size(); i++) {

out << ve[i] << " ";

}

return out;

}

ostream& operator<<(ostream& out, const VVE& vve)

{

int i;

for(i=0; i<vve.size(); i++) {

out << vve[i] << endl;

}

return out;

}

ostream& operator<<(ostream& out, const CompositePoint& cp)

{

out << "CP[(" << cp.matElems << "), " << cp.distFromNDimPlane << "]";

return out;

}

///////////////////////////////////////////////////////////////////////////

// a simple mapping to the alphabet symbol from the representing integer

string getAlphabet(int index)

{

switch(index) {

case 0: return "A";

case 1: return "C";

case 2: return "T";

case 3: return "G";

default: return "X";

}

}

///////////////////////////////////////////////////////////////////////////

string getSequence(const CompositePoint& cp)

{

string ret = "";

for(int i=0; i<cp.matElems.size(); i++) {

ret += getAlphabet(cp.matElems[i].alphabet);

}

return ret;

}

///////////////////////////////////////////////////////////////////////////

void readInput(const char *fname, int& m, int& n, VF& coeffs, int& k, VVE& profMatRowMajor)

{

ifstream is(fname);

if(!is.is_open()) {

cout << "ERROR: input file could not be opened !" << endl;

exit(1);

}

cout << "############ start of input ############" << endl;

cout << "Alphabet size(m): ";

is >> m; cout << m << endl;

cout << "Sequence length(n): ";

is >> n; cout << n << endl;

cout << "n coefficients of the n-dimensional line: " << endl;

int i, j;

float temp;

for(i=0; i<n; i++) {

is >> temp; // read the coeff

cout << temp << " ";

coeffs.push_back(temp);

}

cout << endl;

cout << "k: ";

is >> k; cout << k << endl;

cout << "Profile matrix of probabilities (m by n): " << endl;

VE onerow;

for(i=0; i<m; i++) { // i is also the row/alphabet index

onerow.clear();

for(j=0; j<n; j++) {

is >> temp; // read the probablity

cout << temp << " ";

MatrixElem me(temp, i);

onerow.push_back(me);

}

cout << endl;

profMatRowMajor.push_back(onerow);

}

cout << "############ end of input ############\n" << endl;

}

///////////////////////////////////////////////////////////////////////////

void convertProfileMatrix(VVE& profMatRowMajor, VVE& profMatColMajor)

{

int i, j;

int m = profMatRowMajor.size(); // alphabet size

int n = profMatRowMajor[0].size(); // sequence size

VE onecol;

for(j=0; j<n; j++) {

onecol.clear();

for(i=0; i<m; i++) {

onecol.push_back(profMatRowMajor[i][j]);

}

profMatColMajor.push_back(onecol);

}

}

///////////////////////////////////////////////////////////////////////////

void sortIndividualColumns(VVE& profMatColMajor)

{

int i, j;

for(i=0; i<profMatColMajor.size(); i++) {

VE& onecol = profMatColMajor[i];

sort(onecol.begin(), onecol.end(), MatrixElem::comparator);

for(j=0; j<onecol.size(); j++) {

onecol[j].sortedIndex = j;

}

}

}

///////////////////////////////////////////////////////////////////////////

vector<CompositePoint> getNeighbors(const CompositePoint& cp, const VVE& profMatColMajor, const VF& coeffs)

{

vector<CompositePoint> neighbors;

int i, ii;

cout << "invoked getNeighbors: " << cp << endl;

// remember: element_i of CP is in the row_i of col major matrix

// and element_i.sortedIndex is the col of the element_i in the col major matrix

int n = profMatColMajor.size();

int m = profMatColMajor[0].size();

for(i=0; i<n; i++) {

VE temp;

MatrixElem me = cp.matElems[i];

if(me.sortedIndex >= m-1) continue; // end of the col reached

// else we can have a neighbor (+1) due to that particular column

for(ii=0; ii<n; ii++) {

if(me == cp.matElems[ii]) temp.push_back(profMatColMajor[ii][me.sortedIndex+1]);

else temp.push_back(cp.matElems[ii]);

}

CompositePoint oneNeighbor(temp, coeffs);

neighbors.push_back(oneNeighbor);

cout << "oneNeighbor:" << oneNeighbor << endl;

}

return neighbors;

}

///////////////////////////////////////////////////////////////////////////

int main()

{

int m, n, k;

VF coeffs; // size = n

// we input row major, then convert to col major for ease of computations

VVE profMatRowMajor, profMatColMajor;

readInput("/home/stufs1/pfodor/cse549/inp.txt", m, n, coeffs, k, profMatRowMajor);

convertProfileMatrix(profMatRowMajor, profMatColMajor);

cout << "original row major input matrix:" << endl;

cout << profMatRowMajor << endl;

cout << "converted column major matrix:" << endl;

cout << profMatColMajor << endl;

// sort the profile matrix columns individually

sortIndividualColumns(profMatColMajor);

cout << "sorted column major matrix:" << endl;

cout << profMatColMajor << endl;

// make the priority queue central to this algorithm

priority_queue<CompositePoint, vector<CompositePoint>, CompositePoint::comparator> pq;

// now make the first composite point and put it in pq

// first composite point is the row of points in profMatColMajor

VE elemsForFirstCP;

int i, j;

for(j=0; j<n; j++) elemsForFirstCP.push_back(profMatColMajor[j][0]);

CompositePoint firstCP(elemsForFirstCP, coeffs);

cout << "first composite point\n" << firstCP << endl;

pq.push(firstCP);

// make a vector to hold the topmost points

vector<CompositePoint> topmostPoints;

// run the main algo

for(i=0; i<k-1; i++) {

CompositePoint top = pq.top();

topmostPoints.push_back(top);

pq.pop();

cout << "i: " << i << ", pq.size: " << pq.size() << endl;

vector<CompositePoint> neighbors = getNeighbors(top, profMatColMajor, coeffs);

for(j=0; j<neighbors.size(); j++) pq.push(neighbors[j]);

cout << "i: " << i << ", k: " << k << ", pq.size: " << pq.size() << endl;

}

topmostPoints.push_back(pq.top());

cout << "\n\n\nthe topmost points are shown in the following table: " << endl;

cout << "the format is: <k>, <composite point>, <sequence>, <sum of log of probs>" << endl;

for(i=0; i<topmostPoints.size(); i++) {

CompositePoint cp = topmostPoints[i];

cout << "k=" << i+1 << "\t" << cp << "\t" << getSequence(cp).c_str() << "\t" << cp.sumLogProbs << endl;

}

// the kth most probable sequence is given by the top element now

//cout << "kth (k=" << k << ") most probabale point:\n" << pq.top() << endl;

//cout << "the sequence: " << getSequence(pq.top()).c_str() << endl;

return 0;

}

A.2. kthmp-brute.cpp

// Program to find the kth most probable sequence of length n where albhabet size is m

// IMP: this is the brute force version where we sort all possible sequences based on their

// probability.

#ifdef __GNUC__

#define int64 long long

#else // MSVC, say

#pragma warning(disable: 4786)

#endif

#include <iostream>

#include <fstream>

#include <vector>

#include <utility>

#include <algorithm>

#include <map>

#include <queue>

#include <cmath>

using namespace std;

///////////////////////////////////////////////////////////////////////////

class MatrixElem

{

public:

float prob;

int alphabet;

int sortedIndex;

MatrixElem(float p=0.0, int a=0): prob(p), alphabet(a) { sortedIndex = -1; }

friend ostream& operator<<(ostream& out, const MatrixElem& me) {

out << "<" << me.sortedIndex << "," << me.prob << "," << me.alphabet << ">";

return out;

}

bool operator==(const MatrixElem& me) {

if(prob == me.prob && alphabet == me.alphabet && sortedIndex == me.sortedIndex) return true;

return false;

}

static bool comparator(const MatrixElem& me1, const MatrixElem& me2) {

if(me1.prob > me2.prob) return true;

return false;

}

};

///////////////////////////////////////////////////////////////////////////

typedef vector<int> VI;

typedef vector<float> VF;

typedef vector<MatrixElem> VE;

typedef vector<vector<MatrixElem> > VVE;

///////////////////////////////////////////////////////////////////////////

class CompositePoint

{

private:

CompositePoint() {}

float computeSumLogProbs(const VE&);

float computeDistFromNDimPlane(const VF&);

public:

float distFromNDimPlane;

float sumLogProbs; // sum of log of probabilities of the elems

VE matElems; // must have n elements

CompositePoint(VE elems, const VF& coeffs) {

matElems = elems;

sumLogProbs = computeSumLogProbs(elems);

distFromNDimPlane = computeDistFromNDimPlane(coeffs);

}

class comparator {

public:

bool operator()(const CompositePoint& cp1, const CompositePoint& cp2) {

//if(cp1.distFromNDimPlane < cp2.distFromNDimPlane) return true;

if(cp1.sumLogProbs < cp2.sumLogProbs) return true;

return false;

}

};

};

float CompositePoint::computeSumLogProbs(const VE& elems)

{

float sum = 0.0;

for(int i=0; i<elems.size(); i++) {

sum += log(elems[i].prob);

}

return sum;

}

float CompositePoint::computeDistFromNDimPlane(const VF& coeffs)

{

float coeff_sq_sum=0.0, prob_sum=0.0;

int n = coeffs.size();

for(int i=0; i<n; i++) {

prob_sum += coeffs[i] * matElems[i].prob;

coeff_sq_sum += coeffs[i] * coeffs[i];

}

return prob_sum / sqrt(coeff_sq_sum);

}

///////////////////////////////////////////////////////////////////////////

ostream& operator<<(ostream& out, const VE& ve)

{

int i;

for(i=0; i<ve.size(); i++) {

out << ve[i] << " ";

}

return out;

}

ostream& operator<<(ostream& out, const VVE& vve)

{

int i;

for(i=0; i<vve.size(); i++) {

out << vve[i] << endl;

}

return out;

}

ostream& operator<<(ostream& out, const CompositePoint& cp)

{

out << "CP[(" << cp.matElems << "), " << cp.distFromNDimPlane << "]";

return out;

}

///////////////////////////////////////////////////////////////////////////

// a simple mapping to the alphabet symbol from the representing integer

string getAlphabet(int index)

{

switch(index) {

case 0: return "A";

case 1: return "C";

case 2: return "T";

case 3: return "G";

default: return "X";

}

}

///////////////////////////////////////////////////////////////////////////

string getSequence(const CompositePoint& cp)

{

string ret = "";

for(int i=0; i<cp.matElems.size(); i++) {

ret += getAlphabet(cp.matElems[i].alphabet);

}

return ret;

}

///////////////////////////////////////////////////////////////////////////

void readInput(const char *fname, int& m, int& n, VF& coeffs, int& k, VVE& profMatRowMajor)

{

ifstream is(fname);

if(!is.is_open()) {

cout << "ERROR: input file could not be opened !" << endl;

exit(1);

}

cout << "############ start of input ############" << endl;

cout << "Alphabet size(m): ";

is >> m; cout << m << endl;

cout << "Sequence length(n): ";

is >> n; cout << n << endl;

cout << "n coefficients of the n-dimensional line: " << endl;

int i, j;

float temp;

for(i=0; i<n; i++) {

is >> temp; // read the coeff

cout << temp << " ";

coeffs.push_back(temp);

}

cout << endl;

cout << "k: ";

is >> k; cout << k << endl;

cout << "Profile matrix of probabilities (m by n): " << endl;

VE onerow;

for(i=0; i<m; i++) { // i is also the row/alphabet index

onerow.clear();

for(j=0; j<n; j++) {

is >> temp; // read the probablity

cout << temp << " ";

MatrixElem me(temp, i);

onerow.push_back(me);

}

cout << endl;

profMatRowMajor.push_back(onerow);

}

cout << "############ end of input ############\n" << endl;

}

///////////////////////////////////////////////////////////////////////////

void convertProfileMatrix(VVE& profMatRowMajor, VVE& profMatColMajor)

{

int i, j;

int m = profMatRowMajor.size(); // alphabet size

int n = profMatRowMajor[0].size(); // sequence size

VE onecol;

for(j=0; j<n; j++) {

onecol.clear();

for(i=0; i<m; i++) {

onecol.push_back(profMatRowMajor[i][j]);

}

profMatColMajor.push_back(onecol);

}

}

///////////////////////////////////////////////////////////////////////////

void sortIndividualColumns(VVE& profMatColMajor)

{

int i, j;

for(i=0; i<profMatColMajor.size(); i++) {

VE& onecol = profMatColMajor[i];

sort(onecol.begin(), onecol.end(), MatrixElem::comparator);

for(j=0; j<onecol.size(); j++) {

onecol[j].sortedIndex = j;

}

}

}

///////////////////////////////////////////////////////////////////////////

vector<CompositePoint> getNeighbors(const CompositePoint& cp, const VVE& profMatColMajor, const VF& coeffs)

{

vector<CompositePoint> neighbors;

int i, ii;

cout << "invoked getNeighbors: " << cp << endl;

// remember: element_i of CP is in the row_i of col major matrix

// and element_i.sortedIndex is the col of the element_i in the col major matrix

int n = profMatColMajor.size();

int m = profMatColMajor[0].size();

for(i=0; i<n; i++) {

VE temp;

MatrixElem me = cp.matElems[i];

if(me.sortedIndex >= m-1) continue; // end of the col reached

// else we can have a neighbor (+1) due to that particular column

for(ii=0; ii<n; ii++) {

if(me == cp.matElems[ii]) temp.push_back(profMatColMajor[ii][me.sortedIndex+1]);

else temp.push_back(cp.matElems[ii]);

}

CompositePoint oneNeighbor(temp, coeffs);

neighbors.push_back(oneNeighbor);

cout << "oneNeighbor:" << oneNeighbor << endl;

}

return neighbors;

}

///////////////////////////////////////////////////////////////////////////

int main()

{

int m, n, k;

VF coeffs; // size = n

// we input row major, then convert to col major for ease of computations

VVE profMatRowMajor, profMatColMajor;

readInput("/home/stufs1/pfodor/cse549/inp.txt", m, n, coeffs, k, profMatRowMajor);

int i, j;

// now make a list of all possible sequences

VVE allPossibleSeqList;

for(i=0; i<m; i++) {

VE oneSeq;

oneSeq.push_back(profMatRowMajor[i][0]);

allPossibleSeqList.push_back(oneSeq);

}

for(j=1; j<n; j++) {

// copy the seq list twice

int sz = allPossibleSeqList.size();

for(i=0; i<sz; i++) {

allPossibleSeqList.push_back(allPossibleSeqList[i]);

}

for(i=0; i<sz*2; i++) {

allPossibleSeqList.push_back(allPossibleSeqList[i]);

}

for(i=0; i<sz*4; i++) {

allPossibleSeqList[i].push_back(profMatRowMajor[i/sz][j]);

//cout << allPossibleSeqList[i] << endl;

}

//cout << "--------------\n";

}

// now make a pq and put all possible seq in it

priority_queue<CompositePoint, vector<CompositePoint>, CompositePoint::comparator> pq;

for(i=0; i<allPossibleSeqList.size(); i++) {

CompositePoint cp(allPossibleSeqList[i], coeffs); // 2nd arg is just placeholder

pq.push(cp);

}

for(i=0; i<k; i++) {

CompositePoint cp = pq.top();

pq.pop();

cout << "k=" << i+1 << "\t" << cp << "\t" << getSequence(cp).c_str() << "\t" << cp.sumLogProbs << endl;

}

return 0;

}

A.3. index.jsp

<%@page contentType="text/html"%>

<%@page pageEncoding="UTF-8"%>

<html>

<body>

<%-- <jsp:useBean id="beanInstanceName" scope="session" class="beanPackage.BeanClassName" /> --%>

<%-- <jsp:getProperty name="beanInstanceName"  property="propertyName" /> --%>

<jsp:include page="inputform.jsp"/>

</body>

</html>

A.4. inputform.jsp

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<title>computational biology project</title>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<!-- <link href="cse549.css" rel="stylesheet" type="text/css"> -->

<body bgcolor="#FFFFFF">

<jsp:useBean id="ibean" scope="session" class="cse549.InputBean" />

<%--<jsp:setProperty name="ibean"  property="m_prot" value="" />--%>

<font face="helvetica">

<h1>Generating Motifs in a Profile </h1>

<% if(!ibean.isValid()) { %>

<font color="red"><h4>This form has the following errors. Please fix them before the algo can be run.</h4>

<%= ibean.getErrorList() %>

</font>

<% } %>

<form name="form1" method="post" action="http://allnet.all.cs.sunysb.edu:12345/motif/AlgoServlet"> <!--  -->

<table width="100%" border="1">

<tr>

<td>Alphabet size: </td>

<td>

<input type="radio" name="m" value="dna" <% if(ibean.getM().equals("dna")) { %> checked <% } %>>

DNA motifs (alphabet size 4, default)

<br>

<input type="radio" name="m" value="protein" <% if(ibean.getM().equals("protein")) { %> checked <% } %>>

Protein motifs (including other kinds of motifs) - alphabet size:

<input name="m_prot" type="text" size="4" maxlength="4" value="<jsp:getProperty name='ibean'  property='m_prot' />">

(can be any number of letters)

<table width="75%" border="1">

<tr>

<td>Alphabet Letters (if not default): </td>

<td><textarea name="alpha_prot" cols="40" rows="2" disabled>now disabled</textarea> </td>

</tr>

<tr>

<td><font color="blue">Example alphabet (a[1]..a[m]):

</font></td>

<td>A C T G</td>

</tr>

</table>

</td>

</tr>

<tr>

<td>Motif size: </td>

<td>n :

<input name="n" type="text" size="4" maxlength="4" value="<jsp:getProperty name='ibean'  property='n' />">

</td>

</tr>

<tr>

<td>Query line coeffs (n-dim line): </td>

<td><textarea name="coeffs" cols="40" rows="2"><jsp:getProperty name='ibean'  property='coeffs' /></textarea>

<br>

<font color="blue">Example n-dimensinal coeficients (c[1]..c[n]): </font>1 1 1

</td>

</tr>

<tr>

<td>Profile Matrix: </td>

<td>

<p>

<input type="radio" name="matrix_type" value="prob" <% if(ibean.getMatrix_type().equals("prob")) { %> checked <% } %>>

Probability matrix

<br>

&nbsp; &nbsp; &nbsp; &nbsp;

<input type="radio" name="matrix_type_prob" value="abs" <% if(ibean.getMatrix_type_prob().equals("abs")) { %> checked <% } %>>

Absolute values (values in [0, 1) that will be multiplied

to calculate the probability of a sequence)

<br>

&nbsp; &nbsp; &nbsp; &nbsp;

<input type="radio" name="matrix_type_prob" value="log" disabled>

Logarithm values (to be added to calculate the logarithm of the probability

of the sequence)

<br>

<input type="radio" name="matrix_type" value="freq" disabled>

Frequency matrix

</p>

<textarea name="profile_matrix" cols="80" rows="6"><jsp:getProperty name='ibean'  property='profile_matrix' /></textarea>

</td>

</tr>

<tr>

<td>&nbsp;

</td>

<td>

<font color="blue">Example profile matrix - absolute probability values: </font>

<table width="75%" border="1">

<tr>

<td>&nbsp;

</td>

<td><div align="center"><strong>s1</strong></div></td>

<td><div align="center"><strong>s2</strong></div></td>

<td><div align="center"><strong>s3</strong></div></td>

</tr>

<tr>

<td><div align="center"><strong>A</strong></div></td>

<td><div align="center"> 0.90 </div></td>

<td><div align="center"> 0.26 </div></td>

<td><div align="center"> 0.00 </div></td>

</tr>

<tr>

<td><div align="center"><strong>C</strong></div></td>

<td><div align="center"> 0.10 </div></td>

<td><div align="center"> 0.25 </div></td>

<td><div align="center"> 0.40 </div></td>

</tr>

<tr>

<td><div align="center"><strong>T</strong></div></td>

<td><div align="center"> 0.00 </div> </td>

<td><div align="center"> 0.25 </div> </td>

<td><div align="center"> 0.60 </div> </td>

</tr>

<tr>

<td><div align="center"> <strong>G</strong></div></td>

<td><div align="center"> 0.00 </div> </td>

<td><div align="center"> 0.24 </div> </td>

<td><div align="center"> 0.00 </div> </td>

</tr>

</table>

</td>

</tr>

<tr>

<td>Output: </td>

<td>

<input type="radio" name="want" value="1" <% if(ibean.getWant().equals("1")) { %> checked <% } %> disabled>

Give all k-mers above a given probability:

<input name="want_1_prob" type="text" size="4" maxlength="4" value="<jsp:getProperty name='ibean' property='want_1_prob' />">

<br>

<input type="radio" name="want" value="2" <% if(ibean.getWant().equals("2")) { %> checked <% } %>>

Give first k-mers, where k:

<input name="want_2_k" type="text" size="4" maxlength="4" value="<jsp:getProperty name='ibean' property='want_2_k' />">

<br>

<input type="radio" name="want" value="3" <% if(ibean.getWant().equals("3")) { %> checked <% } %>>

Do selection - What is the k-th motif? k:

<input name="want_3_k" type="text" size="4" maxlength="4" value="<jsp:getProperty name='ibean' property='want_3_k' />">

</td>

</tr>

</table>

<br>

<div align="center">

<input type="submit">

</div>

</form>

</font>

</body>

</html>

A.5. AlgoServlet.java

/*

* AlgoServlet.java

*

* Created on December 6, 2004, 8:24 PM

*/

package cse549;

import java.io.*;

import java.net.*;

import java.util.*;

import javax.servlet.*;

import javax.servlet.http.*;

/**

*

* @author  Joy

* @version

*/

public class AlgoServlet extends HttpServlet {

/** Initializes the servlet.

*/

public void init(ServletConfig config) throws ServletException {

super.init(config);

}

/** Destroys the servlet.

*/

public void destroy() {

}

/** Processes requests for both HTTP <code>GET</code> and <code>POST</code> methods.

* @param request servlet request

* @param response servlet response

*/

protected void processRequest(HttpServletRequest request, HttpServletResponse response)

throws ServletException, IOException {

response.setContentType("text/plain");

PrintWriter out = response.getWriter();

HttpSession sess = request.getSession();

InputBean ibean = (InputBean) sess.getAttribute("ibean");

ibean.setM(request.getParameter("m"));

ibean.setM_prot(request.getParameter("m_prot"));

ibean.setN(request.getParameter("n"));

ibean.setCoeffs(request.getParameter("coeffs"));

ibean.setMatrix_type(request.getParameter("matrix_type"));

ibean.setMatrix_type_prob(request.getParameter("matrix_type_prob"));

ibean.setProfile_matrix(request.getParameter("profile_matrix"));

ibean.setWant(request.getParameter("want"));

ibean.setWant_1_prob(request.getParameter("want_1_prob"));

ibean.setWant_2_k(request.getParameter("want_2_k"));

ibean.setWant_3_k(request.getParameter("want_3_k"));

// check for errors

ibean.validateInput();

if(!ibean.isValid()) {

RequestDispatcher rd = request.getRequestDispatcher("index.jsp");

rd.forward(request, response);

}

out.println("::::::::::: form data ::::::::::::");

for(Enumeration e=request.getParameterNames(); e.hasMoreElements(); ) {

String param = (String) e.nextElement();

String value = request.getParameter(param);

out.println(param + " = " + value);

}

out.println("::::::::::::::::::::::::::::::::::");

// TODO: write input text file, call pgm with input file which

// will write output to an output file, then send the output back

Runtime rt = Runtime.getRuntime();

try {

PrintWriter pw = new PrintWriter(new FileWriter("c:\\_cse549\\motif\\inp.txt"));

pw.println(request.getParameter("m").equals("dna")? "4": request.getParameter("m_prot"));

pw.println(request.getParameter("n"));

pw.println(request.getParameter("coeffs"));

if(request.getParameter("want").equals("2")) {

pw.println(request.getParameter("want_2_k"));

} else {

pw.println(request.getParameter("want_3_k"));

}

pw.println(request.getParameter("profile_matrix"));

pw.close();

Process algoProc = rt.exec("c:\\_cse549\\motif\\proj.exe");

String outputline;

while((outputline=br.readLine()) != null) {

out.println(outputline);

}

algoProc.waitFor();

br.close();

} catch(Exception e) { out.println(e); }

out.close();

}

/** Handles the HTTP <code>GET</code> method.

* @param request servlet request

* @param response servlet response

*/

protected void doGet(HttpServletRequest request, HttpServletResponse response)

throws ServletException, IOException {

processRequest(request, response);

}

/** Handles the HTTP <code>POST</code> method.

* @param request servlet request

* @param response servlet response

*/

protected void doPost(HttpServletRequest request, HttpServletResponse response)

throws ServletException, IOException {

processRequest(request, response);

}

/** Returns a short description of the servlet.

*/

public String getServletInfo() {

return "Short description";

}

}

# Appendix C – Sample Input

File 100values_0010.txt

Alphabet size: 4 (DNA)

Number of letters: 100

Query plan coefficients: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

K-mers: 10

Profile matrix: 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8 5 4 3 5 3 5 7 5 6 8

1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4 1 2 3 4 5 7 8 7 5 4

1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2

For all the other input files please see website http://www.lmc.cs.sunysb.edu/~pfodor/motif

Some of the sample inputs from MEME are attached to the report in the next pages.

Note: MEME motifs are represented by position-specific probability matrices that specify the probability of each possible letter appearing at each possible position in an occurrence of the motif. In order to make it easier to see which letters are most likely in each of the columns of the motif, the simplified motif shows the letter probabilities multiplied by 10 rounded to the nearest integer ("a" means 10). Zeros are replaced by ":" (the colon) for readability.

# Appendix D – Sample Output

############ start of input ############

Alphabet size(m): 4

Sequence length(n): 3

n coefficients of the n-dimensional line:

1 1 1

k: 14

Profile matrix of probabilities (m by n):

0.9 0.26 0

0.1 0.25 0.4

0 0.25 0.6

0 0.24 0

############ end of input ############

original row major input matrix:

<-1,0.9,0> <-1,0.26,0> <-1,0,0>

<-1,0.1,1> <-1,0.25,1> <-1,0.4,1>

<-1,0,2> <-1,0.25,2> <-1,0.6,2>

<-1,0,3> <-1,0.24,3> <-1,0,3>

converted column major matrix:

<-1,0.9,0> <-1,0.1,1> <-1,0,2> <-1,0,3>

<-1,0.26,0> <-1,0.25,1> <-1,0.25,2> <-1,0.24,3>

<-1,0,0> <-1,0.4,1> <-1,0.6,2> <-1,0,3>

sorted column major matrix:

<0,0.9,0> <1,0.1,1> <2,0,2> <3,0,3>

<0,0.26,0> <1,0.25,1> <2,0.25,2> <3,0.24,3>

<0,0.6,2> <1,0.4,1> <2,0,0> <3,0,3>

first composite point

CP[(<0,0.9,0> <0,0.26,0> <0,0.6,2> ), 1.01614]

i: 0, pq.size: 0

invoked getNeighbors: CP[(<0,0.9,0> <0,0.26,0> <0,0.6,2> ), 1.01614]

oneNeighbor:CP[(<1,0.1,1> <0,0.26,0> <0,0.6,2> ), 0.554256]

oneNeighbor:CP[(<0,0.9,0> <1,0.25,1> <0,0.6,2> ), 1.01036]

oneNeighbor:CP[(<0,0.9,0> <0,0.26,0> <1,0.4,1> ), 0.900666]

i: 0, k: 14, pq.size: 3

i: 1, pq.size: 2

invoked getNeighbors: CP[(<0,0.9,0> <1,0.25,1> <0,0.6,2> ), 1.01036]

oneNeighbor:CP[(<1,0.1,1> <1,0.25,1> <0,0.6,2> ), 0.548483]

oneNeighbor:CP[(<0,0.9,0> <2,0.25,2> <0,0.6,2> ), 1.01036]

oneNeighbor:CP[(<0,0.9,0> <1,0.25,1> <1,0.4,1> ), 0.894893]

i: 1, k: 14, pq.size: 5

i: 2, pq.size: 4

invoked getNeighbors: CP[(<0,0.9,0> <2,0.25,2> <0,0.6,2> ), 1.01036]

oneNeighbor:CP[(<1,0.1,1> <2,0.25,2> <0,0.6,2> ), 0.548483]

oneNeighbor:CP[(<0,0.9,0> <3,0.24,3> <0,0.6,2> ), 1.00459]

oneNeighbor:CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]

i: 2, k: 14, pq.size: 7

i: 3, pq.size: 6

invoked getNeighbors: CP[(<0,0.9,0> <3,0.24,3> <0,0.6,2> ), 1.00459]

oneNeighbor:CP[(<1,0.1,1> <3,0.24,3> <0,0.6,2> ), 0.542709]

oneNeighbor:CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]

i: 3, k: 14, pq.size: 8

i: 4, pq.size: 7

invoked getNeighbors: CP[(<0,0.9,0> <0,0.26,0> <1,0.4,1> ), 0.900666]

oneNeighbor:CP[(<1,0.1,1> <0,0.26,0> <1,0.4,1> ), 0.438786]

oneNeighbor:CP[(<0,0.9,0> <1,0.25,1> <1,0.4,1> ), 0.894893]

oneNeighbor:CP[(<0,0.9,0> <0,0.26,0> <2,0,0> ), 0.669726]

i: 4, k: 14, pq.size: 10

i: 5, pq.size: 9

invoked getNeighbors: CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]

oneNeighbor:CP[(<1,0.1,1> <2,0.25,2> <1,0.4,1> ), 0.433013]

oneNeighbor:CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]

oneNeighbor:CP[(<0,0.9,0> <2,0.25,2> <2,0,0> ), 0.663953]

i: 5, k: 14, pq.size: 12

i: 6, pq.size: 11

invoked getNeighbors: CP[(<0,0.9,0> <1,0.25,1> <1,0.4,1> ), 0.894893]

oneNeighbor:CP[(<1,0.1,1> <1,0.25,1> <1,0.4,1> ), 0.433013]

oneNeighbor:CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]

oneNeighbor:CP[(<0,0.9,0> <1,0.25,1> <2,0,0> ), 0.663953]

i: 6, k: 14, pq.size: 14

i: 7, pq.size: 13

invoked getNeighbors: CP[(<0,0.9,0> <1,0.25,1> <1,0.4,1> ), 0.894893]

oneNeighbor:CP[(<1,0.1,1> <1,0.25,1> <1,0.4,1> ), 0.433013]

oneNeighbor:CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]

oneNeighbor:CP[(<0,0.9,0> <1,0.25,1> <2,0,0> ), 0.663953]

i: 7, k: 14, pq.size: 16

i: 8, pq.size: 15

invoked getNeighbors: CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]

oneNeighbor:CP[(<1,0.1,1> <2,0.25,2> <1,0.4,1> ), 0.433013]

oneNeighbor:CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]

oneNeighbor:CP[(<0,0.9,0> <2,0.25,2> <2,0,0> ), 0.663953]

i: 8, k: 14, pq.size: 18

i: 9, pq.size: 17

invoked getNeighbors: CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]

oneNeighbor:CP[(<1,0.1,1> <2,0.25,2> <1,0.4,1> ), 0.433013]

oneNeighbor:CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]

oneNeighbor:CP[(<0,0.9,0> <2,0.25,2> <2,0,0> ), 0.663953]

i: 9, k: 14, pq.size: 20

i: 10, pq.size: 19

invoked getNeighbors: CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]

oneNeighbor:CP[(<1,0.1,1> <3,0.24,3> <1,0.4,1> ), 0.427239]

oneNeighbor:CP[(<0,0.9,0> <3,0.24,3> <2,0,0> ), 0.658179]

i: 10, k: 14, pq.size: 21

i: 11, pq.size: 20

invoked getNeighbors: CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]

oneNeighbor:CP[(<1,0.1,1> <3,0.24,3> <1,0.4,1> ), 0.427239]

oneNeighbor:CP[(<0,0.9,0> <3,0.24,3> <2,0,0> ), 0.658179]

i: 11, k: 14, pq.size: 22

i: 12, pq.size: 21

invoked getNeighbors: CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]

oneNeighbor:CP[(<1,0.1,1> <3,0.24,3> <1,0.4,1> ), 0.427239]

oneNeighbor:CP[(<0,0.9,0> <3,0.24,3> <2,0,0> ), 0.658179]

i: 12, k: 14, pq.size: 23

the topmost points are shown in the following table:

the format is: <k>, <composite point>, <sequence>, <sum of log of probs>

k=1      CP[(<0,0.9,0> <0,0.26,0> <0,0.6,2> ), 1.01614]        AAT    -1.96326

k=2      CP[(<0,0.9,0> <1,0.25,1> <0,0.6,2> ), 1.01036]        ACT    -2.00248

k=3      CP[(<0,0.9,0> <2,0.25,2> <0,0.6,2> ), 1.01036]        ATT     -2.00248

k=4      CP[(<0,0.9,0> <3,0.24,3> <0,0.6,2> ), 1.00459]        AGT    -2.0433

k=5      CP[(<0,0.9,0> <0,0.26,0> <1,0.4,1> ), 0.900666]      AAC    -2.36872

k=6      CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]      ATC    -2.40795

k=7      CP[(<0,0.9,0> <1,0.25,1> <1,0.4,1> ), 0.894893]      ACC    -2.40795

k=8      CP[(<0,0.9,0> <1,0.25,1> <1,0.4,1> ), 0.894893]      ACC    -2.40795

k=9      CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]      ATC    -2.40795

k=10    CP[(<0,0.9,0> <2,0.25,2> <1,0.4,1> ), 0.894893]      ATC    -2.40795

k=11    CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]      AGC    -2.44877

k=12    CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]      AGC    -2.44877

k=13    CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]      AGC    -2.44877

k=14    CP[(<0,0.9,0> <3,0.24,3> <1,0.4,1> ), 0.889119]      AGC    -2.44877

# References

1. http://www.cs.sunysb.edu/~skiena/549/lectures/msa/msa.html
2. An introduction to Bioinformatics Algorithms, Neil C. Jones, Pavel A. Pevzner - slides on motif searching: http://www.bioalgorithms.info/presentations/Ch04_Motifs.pdf
3. http://www.sbc.su.se/~arne/kurser/swell/profiles.html
4. www.csupomona.edu/~drlivesay/Chm417/chm417_lect7.ppt
5. http://compbio.uchsc.edu/hunter/bioi7711/lecture4.ppt

DNA and Protein Sequence Analysis: A Practical Approach - MJ Bishop and C.J. Rawlings

page 160

protein profile:

rounded log odds weight matrix based on the frequences - 14 aminoacids long

Developing Bioinformatis`cs Computer Skills

DNA and protein sequence analysis : a practical approach / edited by M.J. Bishop and C.J. Rawlings, 1997

Developing bioinformatics computer skills / Cynthia Gibas and Per Jambeck, 2001

page 207 -figure 8-5

Bioinformatics : sequence and genome analysis / David W. Mount, 2001

page 162 fig 4.11

page 167 fig 4.12

page 176 table 4.2

page 182 fig 4.15

http://www.psc.edu/general/software/packages/profiless/profiless.html