Identifying DNA and protein patterns with statistically significant alignments of multiple sequences
Autors: Gerald Z. Hertz and Gary D. Stormo
This paper talks about what is expected to occur in DNA sequences.
It shows an example where you can expect a protein can bind two DNA sequences related. This pattern is identified by a pattern containing shared by a group of related sequences, and are also identified during the process of alignment of the sequences in order to maximize a sequence.
The objective of the model is to summarize the alignment for the set of sequences can be described in a more concise. The simplest method is the consensus sequence, which contains the most highly concervadas letters in each position of the alignment. However, some letters may occur more frequently than others.
The interest of this study is focused on the sequences that are related by common function. Therefore, we use a matrix model Which does not include phylogenetic information.
One of the simplest types is the alignment matrix, which lists the number of occurrences of each letter in each position of alignment, for example Figure 1a. Another type of matrix, is the matrix of weights, which uses the weights used to mark a test sequence to measure how close the word sequence matches the pattern described by the matrix. (Fig. 1b)
Arrays can also describe more complex patterns containing gaps. However, in this work only interested in sequences that can be aligned without incersiones and deletions.
A good alignment is assumed to be one Whose alignment matrix is rarely expected to Occur by chance. A standard statistic for scoring the relative likelihood of an alignment matrix is the log-likelihood ratio. We compare alignments using a variant of the log-likelihood ratio we call information content and determine alignments from functionally related, unaligned sequences using a greedy algorithm.
In this paper, we present an efficient method for calculating the P value of an information-content score. In our case, the P value is the probability of obtaining an information content greater than or equal to the observed value, given the number of sequences in the alignment and its width. This method combines numerical calculations with a technique from large-deviation statistics.
Information content of an alignment matrix
In our comparison of alignment matrices, we assume that the letters of a sequence are independent and identically distributed. Thus, the a priori probability of a sequence of letters is the product of the a priori probability of the individual letters. The a priori probability of the individual letters might be the overall frequency of the letters within all sequences of an organism.
Given the assumption that the distribution of letters is independent and identically distributed, the probability of an alignment matrix is determined by the multinomial distribution:
where i refers to the rows of the matrix (e.g. the bases A, C, G, T for a DNA alignment), j refers to the columns of the matrix (i.e. the positions of the letters within the alignment pattern), A is the total number of letters in the sequence alphabet (four for DNA and 20 for protein), L is the total number of columns in the matrix (six in Figure 1), pi is the a priori probability of letter i, ni, j is the occurrence of letter i at position j, and N is the total number of sequences in the alignment (four in Figure 1).
The most commonly used measures for scoring the divergence from the a priori probabilities of a set of letters are the χ2 statistic and the log-likeli-hood ratio. In our work, we use statistics based on the log-li-
kelihood ratio rather than the more ad hoc χ2 statistic. The standard log-likelihood ratio statistic is
When motivated by information theory, this formula is called the Kullback–Leibler information (Kullback and Leibler, 1951) or relative entropy. When derived from large-deviation principles, it is called the large-deviation rate function (Bucklew, 1990).
The P value of an information content
We ultimately wish to calculate the P value of the statistic, i.e. the probability of observing an alignment having the observed information content or greater, given the width of the alignment and the number of sequences in the alignment. Thus, the null model for the alignment matrix is that the distribution of letters in each alignment column is an independent multinomial distribution [formula (1)].
Under the above assumptions, when the information content is small and the number of sequences is large, 2NIseq tends to a χ2 distribution with L(A – 1) degrees of freedom since –NIseq is a log-likelihood ratio (discussed in many introductory statistics books). Unfortunately, our conditions generally involve very large scores and frequently few sequences; thus, the χ2 distribution tends to give poor probability estimates.
However, we are able to obtain very accurate estimates of the P value using a technique from large-deviation statistics.
A large-deviation technique for approximating P value
Applying the large-deviation technique to multiple sequence alignments
To apply the techniques described above, we need to be able to calculate efficiently the moment-generating function M(θ) and its first two derivatives M′(θ) and M′′(θ) for the statistic of interest. In this subsection, we describe these calculations for the statistic NIseq, where N is the total number of sequences in the alignment and Iseq is the information content statistic defined in formula (2).
Since we restrict ourselves to simple alignment models in which each column is independent, the moment-generating function only needs to be calculated for a single column since the overall moment-generating function is only dependent on Mc , the moment-generating function for an individual column, and L, the width of the alignment:
By the definition in equation (3), the moment-generating function for NIseq for an individual column is:
where ni is the occurrence of letter i. The outer summation is taken over all combinations of the ni summing to N. The total number of such combinations is (N + A – 1)!/N!/(A – 1)!, i.e. O(N A – 1).
A brute force calculation for M(θ) involving all these combinations is not too bad for a DNA alignment
where A = 4, but is unacceptable for proteins where there are 20 letters in the amino acid alphabet. Therefore, we use the following dynamic programming algorithm, whose complexity is only O[(A – 2)N 2] in time and O(N) in space.
Approximating the P value numerically
In this section, we describe an alternative method for approximating the P values of NIseq. This method creates a table of P values for the statistic after it has been transformed into integer values. The statistic NIseq is transformed into an integer value I′ after multiplying NIseq by some factor α:
in which the ‘int’ function rounds a real number to its closest integer. α is chosen so that the maximum ( I Ȁmax ) and minimum ( I Ȁmin ) values of I′ differ by some desired amount. The greater the difference between I Ȁmax and I Ȁmin , the more accurate the estimation of the P value.
Algorithms for determining the alignment having the optimum information content
Our ultimate goal is to apply our models and statistics for sequence alignments to identify optimal alignments and determine consensus patterns describing functional relationships.
The goal of all these algorithms is to determine a sequence alignment that maximizes a log-likelihood statistic. In this section, we describe the current version of our greedy algorithm.
Here, we describe two related alignment algorithms that are order independent. The first algorithm, requires the user to specify the width of the pattern being sought based on previous biochemical knowledge. The second algorithm determines the width of the alignment, but requires that the user adjust a bias that is subtracted from the information content so that the average score is negative.
Fig. 2. An example of the algorithm for finding sequence alignments of a fixed width, assuming each sequence contributes exactly once to the final alignment. Alignments of width 4 are being sought from the three single-stranded DNA sequences listed at the top. Each base has an a priori probability of 25%.
The user specifies the width of the alignment
The user first designates the maximum number of alignments that can be saved (e.g. 100 or 1000). Typically, less alignments are ultimately saved because some will be identical. Besides the width, there are various constraints the user can impose.
The user specifies a standard deviation bias
A similar algorithm in which the user does not explicitly specify the width of the alignment. A property of the information content is that it is always non-negative. However, for this local alignment algorithm to work, we need the score to be negative on average so that an interesting alignment can appear as a region of positive information.
|
| Fig. 3. The ratio of the P value calculated by the large-deviation (LD) method to the P value determined numerically (NUM) for various values | of NIseq. The horizontal axis is the numerically determined P value, which is assumed to be accurate. The vertical axis is the ratio. |
Results
The accuracy of the P value approximations
The large-deviation method for determining the P value was compared to the numerical method. As expected, because the estimate of Pγ () is based on the Central Limit Theorem, the accuracy of the large-deviation method increases as the width of the alignment increases.
Information content and related statistics have proven their usefulness for identifying and analyzing sequence alignments (Schneider et al., 1986; Berg and von Hippel, 1987; Stormo and Hartzell, 1989; Hertz et al., 1990; Lawrence and Reilly, 1990; Lawrence et al., 1993).
However, a major deficiency has been the lack of an accurate measure of the P value of a particular information content. In this paper, we use large-deviation statistics and an efficient algorithm for determining the moment-generating function to estimate the P value of an information content accurately. The large-deviation approach is also applicable to the scores of weight matrices.
My Contribution
My contribution to this work, is that I think you could apply a bit more testing and check if it really is feasible to apply to sequences that have a wide or a long too big.
To understand a little more about the analyzes that are performed below left a direct link where they can find the full paper, plus graphs containing more tests.
PAPER: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences
References
Hertz, G. and Stormo, G. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. February 22, p.15.