The solution to the Manhattan problem described here only returned us the maximum score, but did not give instructions on how exactly we should walk through Manhattan to achieve this score. To lay out the path (and to use the solution in bioinformatics to align aminoacid sequences) we need an additional step: keep the history of our walk so that we could "backtrack" from the last intersection all the way back to the start. Essentially, it is simple: when we arrive to a new node, we always come from somewhere. This is the node we need to remember.
We should also consider that, unlike Manhattan graph, where we only had streets going "west" or "south", we now have "diagonal" streets if we want to adapt the problem to align aminoacid sequences. Consider the image below:
Alignment graph matches
The red diagonal arrows show the places where one sequence matches the other. We can give these edges a higher "score" so the path that goes through most of such edges is the highest scored. In this case we have three ways to reach any node s[i,j]: we can come from s[i - 1,j] by moving down, from s[i,j - 1] by moving right, or from s[i - 1, j - 1] by moving diagonally. In the simplest case, we'll calculate anything other than a match as a 0, and a match as 1. Then the score for any s[i,j] node will be calculated using the following formula:
Global alignment formula
In more complex scenarios, we may use scoring matrices, where each combination of two aminoacids is given a certain score, depending on how biologically reasonable is this combination.
Therefore, the following algorithm is used to calculate the score matrix s, and at the same time fill the backtracking matrix backtrack.
AlignmentMatrix(v, w) for i ← 0 to |v| si, 0 ← 0 for j ← 0 to |w| s0, j ← 0 for i ← 1 to |v| for j ← 1 to |w| si, j = max{si-1, j, si,j-1, si-1, j-1 + 1 (if vi = wj)} backtracki, j = ↓ if si, j = si-1, j ; → if si, j = si, j-1 ; ↘ if si, j = si-1, j-1 + 1 return s|v|, |w| , backtrack
After both the scoring matrices are filled, we will use the following algorithm to backtrack from the s[i,j] all the way back to the starting node. Along the way we will output two string. In places where there is a match between the strings, a symbol will be placed, and in other places we will fill in a dash. Following is the backtracking algorithm. We are filling two strings, v and w. If we came to the point i,j by a diagonal edge, we have a match, and output the same symbol into both strings. If we came by a vertical or horisontal edge, one of the strings receives a symbol, and the other a "mismatch" - a "-".
OUTPUTALIGNMENT(backtrack, v, w, i, j) if i = 0 or j = 0 return if backtracki, j = ↓ output vi = "-" outpuv wi OUTPUTLCS(backtrack, v, i - 1, j) else if backtracki, j = → output vi output wi = "-" OUTPUTLCS(backtrack, v, i, j - 1) else OUTPUTLCS(backtrack, v, i - 1, j - 1) output vi output wi
This is the C# implementation of the algorithms.
public static void GlobalAlignment(string s1, string s2) { int s1l = s1.Length; int s2l = s2.Length; int[,] matrix = new int[s1l + 1, s2l + 1]; int[,] backtrack = new int[s1l + 1, s2l + 1]; for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) matrix[i, j] = 0; for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) backtrack[i, j] = 0; for (int i = 0; i <= s1l; i++) matrix[i, 0] = 0; for (int j = 0; j <= s2l; j++) matrix[0, j] = 0; for (int i = 1; i <= s1l; i++) { for (int j = 1; j <= s2l; j++) { matrix[i, j] = Math.Max(Math.Max(matrix[i - 1, j], matrix[i, j - 1]), matrix[i - 1, j - 1] + 1); if (matrix[i, j] == matrix[i - 1, j]) backtrack[i, j] = -1; else if (matrix[i, j] == matrix[i, j - 1]) backtrack[i, j] = -2; else if (matrix[i, j] == matrix[i - 1, j - 1] + 1) backtrack[i, j] = 1; } } OUTPUTLCSALIGN(backtrack, s2, s1, s1l, s2l); string aligned1 = ReverseString(Result); string aligned2 = ReverseString(Result2); } public static void OUTPUTLCSALIGN(int[,] backtrack, string v, string w, int i, int j) { if (i == 0 || j == 0) return; if (backtrack[i, j] == -1) { Result = Result + "-"; Result2 = Result2 + w[i - 1]; OUTPUTLCSALIGN(backtrack, v, w, i - 1, j); } else if (backtrack[i, j] == -2) { Result = Result + v[j - 1]; Result2 = Result2 + "-"; OUTPUTLCSALIGN(backtrack, v, w, i, j - 1); } else { Result = Result + v[j - 1]; Result2 = Result2 + w[i - 1]; OUTPUTLCSALIGN(backtrack, v, w, i - 1, j - 1); } } public static string ReverseString(string input) { char[] charArray = input.ToCharArray(); Array.Reverse(charArray); return new string(charArray); }
A slightly modified version of the function takes into account a penalty for an insertion / deletion (-5 in this case) and uses a BLOSUM62 scoring matrix
public static void GlobalAlignment(string s1, string s2) { int indelPenalty = -5; int s1l = s1.Length; int s2l = s2.Length; int[,] matrix = new int[s1l + 1, s2l + 1]; int[,] backtrack = new int[s1l + 1, s2l + 1]; for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) matrix[i, j] = 0; for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) backtrack[i, j] = 0; for (int i = 0; i <= s1l; i++) matrix[i, 0] = indelPenalty * i; for (int j = 0; j <= s2l; j++) matrix[0, j] = indelPenalty * j; for (int i = 1; i <= s1l; i++) { for (int j = 1; j <= s2l; j++) { //deletion int delCoef = matrix[i - 1, j] + indelPenalty; //insertion int insCoef = matrix[i, j - 1] + indelPenalty; //match / mismatch int mCoef = matrix[i - 1, j - 1] + BlosumCoeff(s1[i - 1], s2[j - 1]); matrix[i, j] = Math.Max(Math.Max(delCoef, insCoef), mCoef); if (matrix[i, j] == delCoef) backtrack[i, j] = -1; // else if (matrix[i, j] == insCoef) backtrack[i, j] = -2; // else if (matrix[i, j] == mCoef) backtrack[i, j] = 1; } } OUTPUTLCSALIGN(backtrack, s2, s1, s1l, s2l); string aligned1 = ReverseString(Result); string aligned2 = ReverseString(Result2); }
Given the following two strings
PLEASANTLY
MEANLY
The following alignment will be returned
PLEASANTLY
-MEA--N-LY