Lecture notes 19 Jan 2001 Per Kraulis
Making an alignment by hand is possible, but tedious. In some cases, when one has a lot of information about the proteins, such as active site residues, secondary structure, 3D structure, mutations, etc, it may still be necessary to make a manual alignment (or at least edit an alignment) to fit all the data. The available automatic methods may not be able to produce a good enough alignment in such cases.
Of course, we would like to have a completely automatic method to perform sequence alignment. The method of choice is based on so-called dynamic programming, which is a general algorithm for solving certain optimization problems. The word "programming" does not mean that it has to be a computer program; this is just mathematical jargon for using a fixed set of rules to arrive at a solution.
For any automatic method to work, we need to be explicit about the assumptions that should go into it. We therefore need to have an explicit scheme for the gap penalties and for the substitution matrix (see the previous page). The chosen gap penalties and substitution matrices are often collectively called the scoring scheme.
There are clearly many different possible scoring schemes. One may also complicate things further by allowing position-specific scores: If one knows from other sources (3D structure) that a gap should absolutely not be allowed in a certain part of a sequence, then the gap-open penalty could be set to a very high value in that part.
Given a scoring scheme, how does an alignment algorithm work? Let us use the classical Needleman-Wunsch-Sellers algorithm to demonstrate how a dynamic-programming algorithm can work. Please note that there are other variants of dynamic programming in sequence analysis.
The Needleman-Wunsch-Sellers algorithm sets up a matrix where each sequence is placed along the sides of the matrix. Each element in the matrix represents the two residues of the sequences being aligned at that position. To calculate the score in every position (i, j) one looks at the alignment that has already been made up to that point, and finds the best way to continue. Having gone through the entire matrix in this way, one can go back and trace which way through the matrix gives the best alignment.
Let us use the following gap-penalty function, where k is the length of the gap, c_{open} the gap-open penalty constant, and c_{length} the gap-length penalty constant:
W(k) = c_{open} + c_{length} * k
The formulat describing the Needleman-Wunsch-Sellers method is recursive, and for the position (i, j) is as follows, where D is value of element (i, j) in the matrix and subst is the substitution matrix:
D_{i, j} = max { | D_{i - 1, j - 1} + subst(A_{i}, B_{j}) |
D_{i - 1, j - k} + W(k) (where k = 1, ..., j - 1) | |
D_{i - k, j - 1} + W(k) (where k = 1, ..., i - 1) |
After one has applied this to the matrix, one finds the optimal alignment by tracing backwards from the diagonal element backwards to the previous highest value, and so on.
I have found a good tutorial describing dynamic programming for sequence alignment of the Needleman-Wunsch variant. The tutorial was written by Eric C. Rouchka (Washington University in St. Louis), and walks through an example in detail.