Transforming into a Bioinformatician

Friday, March 14, 2014

Global Alignment Problem

The solution to the Manhattan problem described here only returned us the maximum score, but did not give instructions on how exactly we should walk through Manhattan to achieve this score. To lay out the path (and to use the solution in bioinformatics to align aminoacid sequences) we need an additional step: keep the history of our walk so that we could "backtrack" from the last intersection all the way back to the start. Essentially, it is simple: when we arrive to a new node, we always come from somewhere. This is the node we need to remember.

We should also consider that, unlike Manhattan graph, where we only had streets going "west" or "south", we now have "diagonal" streets if we want to adapt the problem to align aminoacid sequences. Consider the image below:

Alignment graph matches

The red diagonal arrows show the places where one sequence matches the other. We can give these edges a higher "score" so the path that goes through most of such edges is the highest scored. In this case we have three ways to reach any node s[i,j]: we can come from s[i - 1,j] by moving down, from s[i,j - 1] by moving right, or from s[i - 1, j - 1] by moving diagonally. In the simplest case, we'll calculate anything other than a match as a 0, and a match as 1. Then the score for any s[i,j] node will be calculated using the following formula:

Global alignment formula

In more complex scenarios, we may use scoring matrices, where each combination of two aminoacids is given a certain score, depending on how biologically reasonable is this combination.

Therefore, the following algorithm is used to calculate the score matrix s, and at the same time fill the backtracking matrix backtrack.

AlignmentMatrix(v, w)
 for i ← 0 to |v|
  si, 0 ← 0
 for j ← 0 to |w| 
  s0, j ← 0
 for i ← 1 to |v|
  for j ← 1 to |w|
   si, j = max{si-1, j, si,j-1, si-1, j-1 + 1 (if vi = wj)}
   backtracki, j = ↓ if si, j = si-1, j ; → if si, j = si, j-1 ; ↘ if si, j = si-1, j-1 + 1
 return s|v|, |w| , backtrack

After both the scoring matrices are filled, we will use the following algorithm to backtrack from the s[i,j] all the way back to the starting node. Along the way we will output two string. In places where there is a match between the strings, a symbol will be placed, and in other places we will fill in a dash. Following is the backtracking algorithm. We are filling two strings, v and w. If we came to the point i,j by a diagonal edge, we have a match, and output the same symbol into both strings. If we came by a vertical or horisontal edge, one of the strings receives a symbol, and the other a "mismatch" - a "-".

OUTPUTALIGNMENT(backtrack, v, w, i, j)
 if i = 0 or j = 0
  return
 if backtracki, j = ↓
  output vi = "-"
  outpuv wi
  OUTPUTLCS(backtrack, v, i - 1, j)
 else if backtracki, j = →
  output vi
  output wi = "-"
  OUTPUTLCS(backtrack, v, i, j - 1)
 else
  OUTPUTLCS(backtrack, v, i - 1, j - 1)
  output vi
  output wi

This is the C# implementation of the algorithms.

public static void GlobalAlignment(string s1, string s2)
{
 int s1l = s1.Length;
 int s2l = s2.Length;

 int[,] matrix = new int[s1l + 1, s2l + 1];
 int[,] backtrack = new int[s1l + 1, s2l + 1];

 for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) matrix[i, j] = 0;
 for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) backtrack[i, j] = 0;

 for (int i = 0; i <= s1l; i++) matrix[i, 0] = 0;
 for (int j = 0; j <= s2l; j++) matrix[0, j] = 0;

 for (int i = 1; i <= s1l; i++)
 {
  for (int j = 1; j <= s2l; j++)
  {
   matrix[i, j] = Math.Max(Math.Max(matrix[i - 1, j], matrix[i, j - 1]), matrix[i - 1, j - 1] + 1);

   if (matrix[i, j] == matrix[i - 1, j]) backtrack[i, j] = -1;
   else if (matrix[i, j] == matrix[i, j - 1]) backtrack[i, j] = -2;
   else if (matrix[i, j] == matrix[i - 1, j - 1] + 1) backtrack[i, j] = 1;
  }
 }

 OUTPUTLCSALIGN(backtrack, s2, s1, s1l, s2l);

 string aligned1 = ReverseString(Result);
 string aligned2 = ReverseString(Result2);
}

public static void OUTPUTLCSALIGN(int[,] backtrack, string v, string w, int i, int j)
{
 if (i == 0 || j == 0)
  return;
 if (backtrack[i, j] == -1)
 {
  Result = Result + "-";
  Result2 = Result2 + w[i - 1];
  OUTPUTLCSALIGN(backtrack, v, w, i - 1, j);
 }
 else if (backtrack[i, j] == -2)
 {
  Result = Result + v[j - 1];
  Result2 = Result2 + "-";
  OUTPUTLCSALIGN(backtrack, v, w, i, j - 1);
 }
 else
 {
  Result = Result + v[j - 1];
  Result2 = Result2 + w[i - 1];
  OUTPUTLCSALIGN(backtrack, v, w, i - 1, j - 1);
 }
}

public static string ReverseString(string input)
{
 char[] charArray = input.ToCharArray();
 Array.Reverse(charArray);
 return new string(charArray);
}

A slightly modified version of the function takes into account a penalty for an insertion / deletion (-5 in this case) and uses a BLOSUM62 scoring matrix

public static void GlobalAlignment(string s1, string s2)
{
 int indelPenalty = -5;

 int s1l = s1.Length;
 int s2l = s2.Length;

 int[,] matrix = new int[s1l + 1, s2l + 1];
 int[,] backtrack = new int[s1l + 1, s2l + 1];

 for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) matrix[i, j] = 0;
 for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) backtrack[i, j] = 0;

 for (int i = 0; i <= s1l; i++) matrix[i, 0] = indelPenalty * i;
 for (int j = 0; j <= s2l; j++) matrix[0, j] = indelPenalty * j;

 for (int i = 1; i <= s1l; i++)
 {
  for (int j = 1; j <= s2l; j++)
  {
   //deletion
   int delCoef = matrix[i - 1, j] + indelPenalty;
   //insertion
   int insCoef = matrix[i, j - 1] + indelPenalty;
   //match / mismatch
   int mCoef = matrix[i - 1, j - 1] + BlosumCoeff(s1[i - 1], s2[j - 1]);

   matrix[i, j] = Math.Max(Math.Max(delCoef, insCoef), mCoef);

   if (matrix[i, j] == delCoef) backtrack[i, j] = -1; //
   else if (matrix[i, j] == insCoef) backtrack[i, j] = -2; //
   else if (matrix[i, j] == mCoef) backtrack[i, j] = 1;
  }
 }

 OUTPUTLCSALIGN(backtrack, s2, s1, s1l, s2l);

 string aligned1 = ReverseString(Result);
 string aligned2 = ReverseString(Result2);
}

Given the following two strings

PLEASANTLY
MEANLY

The following alignment will be returned

PLEASANTLY
-MEA--N-LY

by Evgeny. Also posted on my website

Tuesday, March 11, 2014

Manhattan Tourist Problem

An introductory exercise to aligning amino acid sequences is the Manhattan tourist problem. Suppose a tourist starts his route in the top left corner of the map and wants to visit as many attractions on the way as possible, and finish in the down right corner. There is a restriction however - he can only move either to the right or down.

A Simplified Map of Manhattan

To describe the problem in numbers and data structures, the map is converted to the directed grid shown below. A tourist can move along the edges of the grid and the score of 1 is added if he passes an attraction, which are assigned to the bold edges. Of all possible paths we are interested in the one with the highest score.

Map of Manhattan as a Graph

The solution to this problem is equivalent to a more general problem. This time every "street", or the edge of the graph, is assigned a score. The goal is to find a path which gains the highest score.

Generic Manhattan Problem

The problem can be brute forced, of course, by calculating the score for all possible paths. This is not practical for the larger graphs of course. A better approach is to calculate the highest possible score at every intersection, starting from the top left corner. Here is how the logic goes: If the tourist goes only towards the left in the grid, or only down, there are no choices for him. So it is easy to calculate the highest scores for the top and left row of intersections. Now we have these scores calculated:

Generic Manhattan Problem - First Row and Column Solved

Now when that's done, we can proceed to the next row. How can the tourist get to intersection (1, 1)? He can either go to (0, 1) first, and gain 3 + 0 = 3 score, or go to (1, 0) and gain 1 + 3 = 4 score. Therefore, the maximum score he can gain at (1, 1) is 4.

Generic Manhattan Problem - Cell (1,1) Solved

The generic formula to use is quite intuitive: To reach an intersection, you have to arrive from one of two possible previous locations, and traverse a corresponding edge. You choose the path where the sum of the score of a previous location, plus the score of the edge, is higher.

Generic Manhattan Formula

We continue calculations along the column all the way down. After that, we can move on to the next row.

Generic Manhattan Problem - Second Column Solved

Following this logic, it is easy to compute the maximum score for each intersection and find out the maximum score that can be gained while traversing Manhattan.

Generic Manhattan Problem - Fully Solved

Along the way the algorithm also becomes clear: the maximum score at any intersection is the maximum of the following two values:

MANHATTANTOURIST(n, m, down, right)
 s0, 0 ← 0
 for i ← 1 to n
  si, 0 ← si-1, 0 + downi, 0
 for j ← 1 to m
  s0, j ← s0, j−1 + right0, j
 for i ← 1 to n
  for j ← 1 to m
   si, j ← max{si - 1, j + downi, j, si, j - 1 + righti, j}
 return sn, m

The following C# function implements the algorithm and returns the highest possible score, assuming we input all the scores for the edges towards the right and all the edges pointing down in the form of two-dimension arrays.

public static int ManhattanProblem(int[,] RightMatrix, int[,] DownMatrix)
{
 int n = RightMatrix.GetLength(0) + 1;
 int m = DownMatrix.GetLength(1) + 1;
 int[,] ManhattanMatrix = new int[n, m];

 ManhattanMatrix[0, 0] = 0;

 for (int i = 1; i <= n; i++)
 {
  ManhattanMatrix[i, 0] = ManhattanMatrix[i - 1, 0] + DownMatrix[i - 1, 0];
 }

 for (int j = 1; j <= m; j++)
 {
  ManhattanMatrix[0, j] = ManhattanMatrix[0, j - 1] + RightMatrix[0, j - 1];
 }

 for (int i = 1; i <= n; i++)
 {
  for (int j = 1; j <= m; j++)
  {
   ManhattanMatrix[i, j] =
    Math.Max(ManhattanMatrix[i - 1, j] + DownMatrix[i - 1, j], 
    ManhattanMatrix[i, j - 1] + RightMatrix[i, j - 1]);
  }
 }

 return ManhattanMatrix.Cast<int>().Max();
}

Understanding this problem turns out to be important to understanding sequence alignment in bioinformatics.

by Evgeny. Also posted on my website

Saturday, July 13, 2013

Stochastic and deterministic modelling.

Stochastic and deterministic modelling.

1. Purposes of stochastic modelling

Deterministic modelling assumes the systems to be continuous and evolve deterministically. The behaviour of the system can be described using ODEs, which are then solved. However, such models ignore the phenomena that occur due to the fact that each system consists of a finite number of discrete particles, such as random fluctuations. For systems with very small particle numbers the deterministic models are not even appropriate because the concentrations are not continuous.

Stochastic modelling takes into account the fact that each system is composed of a finite and countable number of particles and considers the number of those particles similar to the way the deterministic system considers concentrations.

2. Drawbacks of stochastic modelling

2.1. Limits on particle numbers

Considering the fact that the number of particles in the system is very large, computational modelling of a stochastic method is very demanding and developing an algorithm is a complex task.

2.2. Lack of analysis methods

Stochastic modelling does not have such rigorously developed analysis methods as metabolic control analysis for deterministic modelling.

3. Drawbacks of deterministic modelling

3.1. Systems with small particle numbers

Stochastic methods consider random fluctuations which lead to significant change in system behaviour when the number of particles is small. Species are allowed to become extinct. In deterministic models the fluctuations are not accounted for and species concentrations never fall to zero. Therefore, in linear processes, the deterministic model behaviour will only be determined by difference in concentrations. The stochastic models can behave differently. This remains true even if stochastic systems have the same marginal distribution of system states.

3.2. Bi-Stable systems

Under deterministic simulation the system which is bi-stable will converge to the same stable steady state if the initial concentrations remain the same. Under stochastic simulation the system will converge to one of the two stable states, and it can not be predicted to which one. The probability of the system converging to each state, however, can be calculated.

4. Difference between the deterministic solution and the mean of stochastic solutions

It should be noted that if we repeat the stochastic simulation many times and calculate the mean, we will not end up with the same solution as the deterministic. This is only true for linear systems, but the solutions for nonlinear systems can be totally different.

5. Conclusion

Stochastic modelling should definitely be chosen when the particle numbers are in range where the concept of continuous concentration is no longer applicable or when the stochastic phenomena are themselves the object of research. The limit on the application of stochastic model is generally enforced at a certain particle number where computation becomes not feasible.

References

Pahle J, Biochemical simulations: stochastic, approximate stochastic and hybrid approaches, Briefings in Bioinformatics 2009, 10(1), pp 53-64

by Evgeny. Also posted on my website

Monday, June 10, 2013

Project ROSALIND: Finding a shortest superstring

For a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring. This may be useful, for example, if we have a large number of pieces of a DNA and want to figure out how the full DNA could look like.

For example, for the following strings

ATTAGACCTG
CCTGCCGGAA
AGACCTGCCG
GCCGGAATAC

The shortest superstring will be

ATTAGACCTGCCGGAATAC

The following code is a naive approach to solve the problem. The logic is as follows: Sort the list of strings by length. Take the longest one and call it a superstring. Next, iterate through the list to find the string that has the longest intersection with the superstring. Remove that string from the list and attach to the superstring. Continue until the list is empty, the resulting superstring should be the shortest possible.

public static string ShortestSuperstring(List<string> input)
{
 input = input.OrderByDescending(x => x.Length).ToList();

 string superstring = input[0];
 input.RemoveAt(0);
 int counter = input.Count;
 for (int i = 0; i < counter; i++)
 {
  List<IntBoolString> items = new List<IntBoolString>();

  for (int j = 0; j < input.Count; j++)
  {
   items.Add(GetIntersection(superstring, input[j]));
  }

  IntBoolString chosen = items.OrderByDescending(x => x.intValue).First();

  superstring = CombineIntoSuper(superstring, chosen);
  input.Remove(chosen.stringValue);
 }

 return superstring;
}

private static IntBoolString GetIntersection(string super, string candidate)
{
 IntBoolString result = new IntBoolString();
 result.stringValue = candidate;

 int i = 0;

 while (candidate.Length > i)
 {
  int testlen = candidate.Length - i;
  string leftcan = candidate.Substring(0, testlen);
  string rightcan = candidate.Substring(i, testlen);
  string leftsuper = super.Substring(0, testlen);
  string rightsuper = super.Substring(super.Length - testlen, testlen);

  if (leftcan == rightsuper || rightcan == leftsuper)
  {
   result.boolValue = (leftcan == rightsuper) ? true : false;
   result.intValue = testlen;
   return result;
  }

  i++;
 }

 return result;
}

private static string CombineIntoSuper(string superstring, IntBoolString chosen)
{
 string toAppend = string.Empty;
 int lenToAppend = chosen.stringValue.Length - chosen.intValue;

 toAppend = (chosen.boolValue == true) ?
  chosen.stringValue.Substring(chosen.stringValue.Length - lenToAppend, lenToAppend) :
  chosen.stringValue.Substring(0, lenToAppend);

 superstring = (chosen.boolValue == true) ?
  superstring + toAppend :
  toAppend + superstring;

 return superstring;
}

public struct IntBoolString
{
 public string stringValue;
 public int intValue;
 public bool boolValue;
}

by Evgeny. Also posted on my website

Saturday, June 1, 2013

Some string manipulations for future use.

1. Using an array of characters, return all possible permutations of this array (without repetitions).

public static List<string> StringPermutations(char[] list)
{
 List<string> result = new List<string>();
 int x=list.Length-1;
 go(list,0,x, result);
 return result;
}

private static void go (char[] list, int k, int m, List<string> result)
{
 int i;
 if (k == m)
 {
  result.Add(new string(list));
 }
 else
 for (i = k; i <= m; i++)
 {
  swap (ref list[k],ref list[i]);
  go (list, k+1, m, result);
  swap (ref list[k],ref list[i]);
 }
}

private static void swap(ref char a, ref char b)
{
 if (a == b) return;
 a ^= b;
 b ^= a;
 a ^= b;
}

Sample usage

List<string> permutations = Helper.StringPermutations(new char[] {'D', 'N', 'A'});

Sample output

DNA
DAN
NDA
NAD
AND
ADN

2. Using an array of characters ("alphabet"), return all possible words generated from this alphabet of a specified length

public static IEnumerable<String> GetWordsWithRepetition(Int32 length, char[] alphabet)
{
 if (length <= 0)
  yield break;

 for(int i = 0; i < alphabet.Length; i++) // (Char c = 'A'; c <= 'Z'; c++)
 {
  char c = alphabet[i];
  if (length > 1)
  {
   foreach (String restWord in GetWordsWithRepetition(length - 1, alphabet))
    yield return c + restWord;
  }
  else
   yield return "" + c;
 }
}

3. Further can be used to get full "dictionary" with all possible words up to a specified length

public static string ALPHABET = "D N A";

public static List<string> Dictionary(int length)
{
 char[] alphabet = Helper.AlphabetFromString(ALPHABET);

 List<string> final = new List<string>();

 for (int i = 1; i <= length; i++)
 {
  List<string> result = Helper.GetWordsWithRepetition(i, alphabet).ToList();
  final.AddRange(result);
 }
 return final;
}

public static char[] AlphabetFromString(string input)
{
 string[] split = input.Split(' ');
 char[] alphabet = new char[split.Count()];
 for (int i = 0; i < alphabet.Length; i++)
 {
  alphabet[i] = split[i][0];
 }
 return alphabet;
}

4. Further can be used to sort the words of the dictionary according to the alphabet provided using a comparer

public static int WordComparer(string one, string two)
{
 char[] alphabet = AlphabetFromString(ALPHABET);

 int len = Math.Min(one.Length, two.Length);
 for (int i = 0; i < len; i++)
 { 
  int posOne = Array.IndexOf(alphabet, one[i]);
  int posTwo = Array.IndexOf(alphabet, two[i]);
  if (posOne == posTwo)
  {
   continue;
  }
  else if(posTwo > posOne)
  {
   return -1;
  }
  return 1;
 }
 return two.Length > one.Length ? -1 : 1;
}

Sample usage

List<string> final = Dictionary(3).Sort(WordComparer);

Sample output

D
DD
DDD
DDN
DDA
DN
DND
DNN
DNA
DA
DAD
DAN
DAA
N
ND
NDD
NDN
NDA
NN
NND
NNN
NNA
NA
NAD
NAN
NAA
A
AD
ADD
ADN
ADA
AN
AND
ANN
ANA
AA
AAD
AAN
AAA

by Evgeny. Also posted on my website

Tuesday, May 14, 2013

Metabolic Control Analysis and Enzyme Kinetics

1. Drawbacks of rate-limiting step concept

At a steady state, the flux through each pathway in a biochemical network is a function of the individual enzyme kinetic properties. The activities of the enzyme affect the concentration of its reactants and products and influence the flux through pathways. Metabolic control analysis (MCA) provides a mathematical framework to study the distribution of metabolic fluxes and concentrations among the pathways that comprise the model. It replaces the principle of the rate-limiting step, which proved to be ineffective in practice. The control of the system as a whole is much more distributed than it was appreciated, making rate-limited step not very useful.

2. Purpose of MCA

The purpose of the MCA is to identify the steps which have the strongest effect on the levels of metabolites and fluxes. Its basis is the overall steady state flux with respect to the individual enzyme activities.

3. MCA coefficients

The challenge in analysing a metabolic network is determination of flux control coefficients (FCC). The FCC is a measure of how the flux changes in response to small perturbations in the activity or concentration of the enzyme. The value of the FCC is a measure of how important a particular enzyme is in the determination of the steady state flux. Another set of variables are elasticity coefficients. They quantify the influence of the pool levels on the individual pathway reactions.

4. MCA theorems

MCA uses two theorems. First is the summation theorem, which states that the sum of all FCC related to a particular pathway equal to 1. A more important theorem is the connectivity theorem; as it provides understanding of the way enzyme kinetics affect the values of FCC. It states that the sum of the products of the FCC of all steps that are affected by X and their elasticity coefficients towards X, is zero

5. Estimating FCC

There are several ways of estimating FCC, which can be roughly divided into experimental estimation and modelling.

5.1 Experimental estimation

Changes can be introduced into enzyme activities and changes in flux measured.
Elasticity coefficients can be calculated if the kinetics of each step of the pathway are known, then FCC can be calculated from elasticity coefficients
In-vitro titration of enzyme activities

5.2 Estimation through modelling

From their definition by small change in reaction rate and calculation of the resulting change in flux or concentration
From matrix methods that use summation and connectivity theorems. The first approach is based on two matrices, one containing elasticity coefficients and another containing FCC. This approach works but is hard to implement in software. Alternative approach, developed by Reder, requires only knowledge of stoichiometry matrix and elasticity coefficients. This method is best for software calculation of FCC from elasticity coefficients.

by Evgeny. Also posted on my website

Tuesday, April 30, 2013

Project ROSALIND: Finding a Protein Motif

The following piece of code is an attempt to solve the "Finding a Protein Motif" puzzle from the Project Rosalind.

The input is a list of UniProt Protein Database access IDs. For each ID, the code reads the protein aminoacid sequence from the url in the form of http://www.uniprot.org/uniprot/uniprot_id.fasta. Then, for each protein, it searches for the N-glycosylation motif (a motif is a significant amino acid pattern), which is written as N{P}[ST]{P}. In this format, [X] means any aminoacid, and {X} means any amino acid except X.

The code properly handles overlaps, i.e. in the NMSNSSS string there are two overlapping substrings that satisfy the motif: NMSN and NSSS. The overlaps are not handled properly by the Regex.Matches method (some matches are missed), so some additional string manipulations were required.

The url http://prosite.expasy.org/scanprosite/ can be used to verify the output.

List<string> proteins = new List<string>();

string line;
using (StreamReader reader = new StreamReader("input.txt"))
{
 while ((line = reader.ReadLine()) != null)
 {
  proteins.Add(line);
 }
}

WebClient client = new WebClient();
Dictionary<string, string> proteinsDict = new Dictionary<string, string>();
foreach (string id in proteins)
{
 Stream stream = client.OpenRead("http://www.uniprot.org/uniprot/" + id + ".fasta");

 if (stream != null)
  using (StreamReader reader = new StreamReader(stream))
  {
   string protein = string.Empty;
   while ((line = reader.ReadLine()) != null)
   {
    if (!line.StartsWith(">"))
    {
     protein += line;
    }
   }

   proteinsDict.Add(id, protein);
  }
}

const string pattern = @"N[^P][ST][^P]";

using (StreamWriter writer = new StreamWriter("output.txt"))
{
 foreach (KeyValuePair<string, string> kvp in proteinsDict)
 {
  string val = kvp.Value;
  List<int> matches = new List<int>();
  int removed = 0;
  bool done = false;
  while (done == false)
  {
   Match match = Regex.Match(val, pattern);
   if(match.Success)
   {
    int index = val.IndexOf(match.Value);
    matches.Add(index + removed + 1);
    removed += index + 1;
    val = val.Substring(index + 1, val.Length - (index + 1));
   }
   else
   {
    done = true;
   }
  }

  if(matches.Count > 0)
  {
   string indices = string.Empty;
   writer.WriteLine(kvp.Key);
   indices = matches.Aggregate(indices, (current, index) => current + index + " ");
   writer.WriteLine(indices);
  }
 }
}

References

Finding a Protein Motif
My Profile at Project ROSALIND
by Evgeny. Also posted on my website

Transforming into a Bioinformatician

Friday, March 14, 2014

Global Alignment Problem

Tuesday, March 11, 2014

Manhattan Tourist Problem

Saturday, July 13, 2013

Stochastic and deterministic modelling.

Monday, June 10, 2013

Project ROSALIND: Finding a shortest superstring

Saturday, June 1, 2013

Some string manipulations for future use.

Tuesday, May 14, 2013

Metabolic Control Analysis and Enzyme Kinetics

Tuesday, April 30, 2013

Project ROSALIND: Finding a Protein Motif

Followers

Blog Archive