Transforming into a Bioinformatician: 2013

Saturday, July 13, 2013

Stochastic and deterministic modelling.

1. Purposes of stochastic modelling

Deterministic modelling assumes that the system is continuous and evolves deterministically. The behaviour of the system can be described with ordinary differential equations (ODEs), which are then solved. However, such models ignore phenomena, such as random fluctuations, that arise because every system consists of a finite number of discrete particles. For systems with very small particle numbers, deterministic models are not even appropriate, because the concentrations are not continuous.

Stochastic modelling takes into account the fact that each system is composed of a finite, countable number of particles, and it tracks the numbers of those particles in much the same way as a deterministic model tracks concentrations.
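To make the contrast concrete, here is a minimal sketch for a single decay reaction A -> (nothing) with rate constant k. The class and method names are hypothetical and chosen only for illustration; the stochastic part assumes Gillespie's direct method, which is one common way of simulating such a system, not the only one.

```csharp
using System;

public static class DecayDemo
{
    // Deterministic model: dN/dt = -k*N, solved analytically as N0 * exp(-k*t).
    // The result is a continuous quantity.
    public static double Deterministic(double n0, double k, double t)
    {
        return n0 * Math.Exp(-k * t);
    }

    // Stochastic model (assumed here: Gillespie's direct method).
    // Returns the integer number of particles left at time tEnd.
    public static int Stochastic(int n0, double k, double tEnd, Random rng)
    {
        int n = n0;
        double t = 0.0;
        while (n > 0)
        {
            // Total rate at which the next decay event fires.
            double propensity = k * n;
            // The waiting time to the next event is exponentially distributed.
            double wait = -Math.Log(1.0 - rng.NextDouble()) / propensity;
            if (t + wait > tEnd) break;
            t += wait;
            n--; // one discrete particle disappears
        }
        return n;
    }
}
```

The deterministic function returns a fractional value for any input, while the stochastic one returns a whole number that differs from run to run and can reach exactly zero.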

2. Drawbacks of stochastic modelling

2.1. Limits on particle numbers

When the number of particles in the system is very large, stochastic simulation becomes computationally very demanding, and developing an efficient algorithm is a complex task.

2.2. Lack of analysis methods

Stochastic modelling lacks analysis methods as rigorously developed as those available for deterministic modelling, such as metabolic control analysis.

3. Drawbacks of deterministic modelling

3.1. Systems with small particle numbers

Stochastic methods account for random fluctuations, which lead to significant changes in system behaviour when the number of particles is small. Species are allowed to become extinct. In deterministic models the fluctuations are not accounted for, and species concentrations never fall to zero. Therefore, in linear processes, the behaviour of the deterministic model is determined only by the difference in concentrations. Stochastic models can behave differently. This remains true even if the stochastic systems have the same marginal distribution of system states.

3.2. Bi-Stable systems

Under deterministic simulation, a bi-stable system always converges to the same stable steady state as long as the initial concentrations remain the same. Under stochastic simulation, the system converges to one of the two stable states, and it cannot be predicted which one. The probability of the system converging to each state, however, can be calculated.

4. Difference between the deterministic solution and the mean of stochastic solutions

It should be noted that if we repeat the stochastic simulation many times and calculate the mean, we will not necessarily end up with the deterministic solution. The two coincide only for linear systems; for nonlinear systems the solutions can be totally different.
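A short sketch of why linearity matters, using two toy reactions that are not taken from the reference (the symbols N, k and c are assumptions for illustration). For the linear decay A -> (nothing), the equation for the mean particle number has the same form as the deterministic one. For the nonlinear reaction 2A -> (nothing), the variance enters the equation for the mean, so the mean no longer follows the deterministic law.

```latex
% Linear case: A -> (nothing), propensity k N.
% Deterministic counterpart: dn/dt = -k n (same form).
\frac{d\,\mathbb{E}[N]}{dt} = -k\,\mathbb{E}[N]

% Nonlinear case: 2A -> (nothing), propensity c N (N-1) / 2,
% each event removes two particles.
% Deterministic counterpart: dn/dt = -c n^2 (different form).
\frac{d\,\mathbb{E}[N]}{dt}
  = -c\,\mathbb{E}[N(N-1)]
  = -c\left(\mathbb{E}[N]^2 + \mathrm{Var}(N) - \mathbb{E}[N]\right)
```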

5. Conclusion

Stochastic modelling should definitely be chosen when the particle numbers are in a range where the concept of continuous concentration is no longer applicable, or when the stochastic phenomena are themselves the object of research. The practical limit on applying a stochastic model is generally the particle number at which computation becomes infeasible.

References

Pahle J, Biochemical simulations: stochastic, approximate stochastic and hybrid approaches, Briefings in Bioinformatics 2009, 10(1), pp 53-64

by Evgeny. Also posted on my website

Monday, June 10, 2013

Project ROSALIND: Finding a shortest superstring

For a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring. This may be useful, for example, if we have a large number of pieces of a DNA molecule and want to figure out what the full DNA could look like.

For example, for the following strings

ATTAGACCTG
CCTGCCGGAA
AGACCTGCCG
GCCGGAATAC

The shortest superstring will be

ATTAGACCTGCCGGAATAC

The following code is a naive approach to solving the problem. The logic is as follows: sort the list of strings by length, take the longest one and call it the superstring. Next, iterate through the list to find the string that has the longest overlap with either end of the superstring. Remove that string from the list and attach it to the superstring. Continue until the list is empty. For inputs like the example above, the resulting superstring should be the shortest possible, but a greedy approach like this does not guarantee it in general.

public static string ShortestSuperstring(List<string> input)
{
 input = input.OrderByDescending(x => x.Length).ToList();

 string superstring = input[0];
 input.RemoveAt(0);
 int counter = input.Count;
 for (int i = 0; i < counter; i++)
 {
  List<IntBoolString> items = new List<IntBoolString>();

  for (int j = 0; j < input.Count; j++)
  {
   items.Add(GetIntersection(superstring, input[j]));
  }

  IntBoolString chosen = items.OrderByDescending(x => x.intValue).First();

  superstring = CombineIntoSuper(superstring, chosen);
  input.Remove(chosen.stringValue);
 }

 return superstring;
}

private static IntBoolString GetIntersection(string super, string candidate)
{
 IntBoolString result = new IntBoolString();
 result.stringValue = candidate;

 int i = 0;

 while (candidate.Length > i)
 {
  int testlen = candidate.Length - i;
  string leftcan = candidate.Substring(0, testlen);
  string rightcan = candidate.Substring(i, testlen);
  string leftsuper = super.Substring(0, testlen);
  string rightsuper = super.Substring(super.Length - testlen, testlen);

  if (leftcan == rightsuper || rightcan == leftsuper)
  {
   result.boolValue = (leftcan == rightsuper) ? true : false;
   result.intValue = testlen;
   return result;
  }

  i++;
 }

 return result;
}

private static string CombineIntoSuper(string superstring, IntBoolString chosen)
{
 string toAppend = string.Empty;
 int lenToAppend = chosen.stringValue.Length - chosen.intValue;

 toAppend = (chosen.boolValue == true) ?
  chosen.stringValue.Substring(chosen.stringValue.Length - lenToAppend, lenToAppend) :
  chosen.stringValue.Substring(0, lenToAppend);

 superstring = (chosen.boolValue == true) ?
  superstring + toAppend :
  toAppend + superstring;

 return superstring;
}

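// Result of comparing a candidate string with the current superstring.
public struct IntBoolString
{
 public string stringValue; // the candidate string
 public int intValue;       // length of the overlap found (0 if none)
 public bool boolValue;     // true: candidate extends the superstring on the right; false: on the left
}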
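The overlap test at the heart of the approach can also be isolated into a small helper. This is only a sketch: OverlapDemo, Overlap and Merge are hypothetical names, and unlike the code above it only checks one direction (suffix of the left string against prefix of the right string).

```csharp
using System;

public static class OverlapDemo
{
    // Length of the longest suffix of 'left' that is also a prefix of 'right'.
    public static int Overlap(string left, string right)
    {
        int max = Math.Min(left.Length, right.Length);
        for (int len = max; len > 0; len--)
        {
            // Compare the last 'len' characters of left with the first 'len' of right.
            if (string.CompareOrdinal(left, left.Length - len, right, 0, len) == 0)
            {
                return len;
            }
        }
        return 0;
    }

    // Appends 'right' to 'left', writing the shared part only once.
    public static string Merge(string left, string right)
    {
        return left + right.Substring(Overlap(left, right));
    }
}
```

For the example strings, OverlapDemo.Overlap("ATTAGACCTG", "AGACCTGCCG") is 7, and merging the two gives ATTAGACCTGCCG.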

Saturday, June 1, 2013

Some string manipulations for future use.

1. Using an array of characters, return all possible permutations of this array (without repetitions).

public static List<string> StringPermutations(char[] list)
{
 List<string> result = new List<string>();
 int x=list.Length-1;
 go(list,0,x, result);
 return result;
}

private static void go (char[] list, int k, int m, List<string> result)
{
 int i;
 if (k == m)
 {
  result.Add(new string(list));
 }
 else
 for (i = k; i <= m; i++)
 {
  swap (ref list[k],ref list[i]);
  go (list, k+1, m, result);
  swap (ref list[k],ref list[i]);
 }
}

private static void swap(ref char a, ref char b)
{
 // Plain swap through a temporary; safe even when a and b refer to the same element.
 char temp = a;
 a = b;
 b = temp;
}

Sample usage

List<string> permutations = Helper.StringPermutations(new char[] {'D', 'N', 'A'});

Sample output

DNA
DAN
NDA
NAD
AND
ADN

2. Using an array of characters ("alphabet"), return all possible words of a specified length generated from this alphabet.

public static IEnumerable<String> GetWordsWithRepetition(Int32 length, char[] alphabet)
{
 if (length <= 0)
  yield break;

 for (int i = 0; i < alphabet.Length; i++)
 {
  char c = alphabet[i];
  if (length > 1)
  {
   foreach (String restWord in GetWordsWithRepetition(length - 1, alphabet))
    yield return c + restWord;
  }
  else
   yield return "" + c;
 }
}

3. The function above can then be used to build a full "dictionary" of all possible words up to a specified length.

public static string ALPHABET = "D N A";

public static List<string> Dictionary(int length)
{
 char[] alphabet = Helper.AlphabetFromString(ALPHABET);

 List<string> final = new List<string>();

 for (int i = 1; i <= length; i++)
 {
  List<string> result = Helper.GetWordsWithRepetition(i, alphabet).ToList();
  final.AddRange(result);
 }
 return final;
}

public static char[] AlphabetFromString(string input)
{
 string[] split = input.Split(' ');
 char[] alphabet = new char[split.Count()];
 for (int i = 0; i < alphabet.Length; i++)
 {
  alphabet[i] = split[i][0];
 }
 return alphabet;
}

4. The dictionary can then be sorted according to the order of the provided alphabet using a comparer.

public static int WordComparer(string one, string two)
{
 char[] alphabet = AlphabetFromString(ALPHABET);

 int len = Math.Min(one.Length, two.Length);
 for (int i = 0; i < len; i++)
 { 
  int posOne = Array.IndexOf(alphabet, one[i]);
  int posTwo = Array.IndexOf(alphabet, two[i]);
  if (posOne == posTwo)
  {
   continue;
  }
  else if(posTwo > posOne)
  {
   return -1;
  }
  return 1;
 }
 // All shared positions are equal: the shorter word sorts first, and equal words compare as 0.
 return one.Length.CompareTo(two.Length);
}

Sample usage

List<string> final = Dictionary(3);
final.Sort(WordComparer); // List<T>.Sort sorts in place and returns void

Sample output

D
DD
DDD
DDN
DDA
DN
DND
DNN
DNA
DA
DAD
DAN
DAA
N
ND
NDD
NDN
NDA
NN
NND
NNN
NNA
NA
NAD
NAN
NAA
A
AD
ADD
ADN
ADA
AN
AND
ANN
ANA
AA
AAD
AAN
AAA

Tuesday, May 14, 2013

Metabolic Control Analysis and Enzyme Kinetics

1. Drawbacks of rate-limiting step concept

At steady state, the flux through each pathway in a biochemical network is a function of the kinetic properties of the individual enzymes. The activity of each enzyme affects the concentrations of its reactants and products and influences the flux through the pathways. Metabolic control analysis (MCA) provides a mathematical framework for studying the distribution of metabolic fluxes and concentrations among the pathways that comprise the model. It replaces the principle of the rate-limiting step, which proved to be ineffective in practice. The control of the system as a whole is much more distributed than was previously appreciated, which makes the rate-limiting step concept not very useful.

2. Purpose of MCA

The purpose of MCA is to identify the steps that have the strongest effect on the levels of metabolites and fluxes. Its basis is the sensitivity of the overall steady-state flux to the individual enzyme activities.

3. MCA coefficients

The challenge in analysing a metabolic network is the determination of the flux control coefficients (FCCs). An FCC is a measure of how the flux changes in response to small perturbations in the activity or concentration of an enzyme. Its value indicates how important a particular enzyme is in determining the steady-state flux. Another set of variables is the elasticity coefficients, which quantify the influence of the metabolite pool levels on the individual pathway reactions.

4. MCA theorems

MCA uses two theorems. The first is the summation theorem, which states that the sum of all FCCs related to a particular flux is equal to 1. A more important theorem is the connectivity theorem, as it provides an understanding of the way enzyme kinetics affect the values of the FCCs. It states that the sum of the products of the FCCs of all steps affected by a metabolite X and their elasticity coefficients towards X is zero.
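In symbols, the coefficients and the two theorems are commonly written as follows. The notation is an assumption introduced here for illustration: J is the steady-state flux, v_i the rate of step i, and X a metabolite concentration.

```latex
% Flux control coefficient of step i and elasticity of step i towards X
C^{J}_{i} = \frac{\partial \ln J}{\partial \ln v_i},
\qquad
\varepsilon^{i}_{X} = \frac{\partial \ln v_i}{\partial \ln X}

% Summation theorem
\sum_{i} C^{J}_{i} = 1

% Connectivity theorem (sum over all steps i affected by X)
\sum_{i} C^{J}_{i}\,\varepsilon^{i}_{X} = 0
```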

5. Estimating FCC

There are several ways of estimating FCC, which can be roughly divided into experimental estimation and modelling.

5.1 Experimental estimation

Changes can be introduced into enzyme activities and the resulting changes in flux measured.
Elasticity coefficients can be calculated if the kinetics of each step of the pathway are known; FCCs can then be calculated from the elasticity coefficients.
In-vitro titration of enzyme activities.

5.2 Estimation through modelling

From their definition, by making a small change in a reaction rate and calculating the resulting change in flux or concentration.
From matrix methods that use the summation and connectivity theorems. The first approach is based on two matrices, one containing the elasticity coefficients and the other containing the FCCs. This approach works but is hard to implement in software. An alternative approach, developed by Reder, requires only knowledge of the stoichiometry matrix and the elasticity coefficients. This method is best suited to software calculation of FCCs from elasticity coefficients.
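The first option, estimation from the definition, can be sketched on a toy two-step pathway S -> X -> P. Everything in this example is an assumption made for illustration: the rate laws v1 = e1 * (s - x) and v2 = e2 * x, the names FccDemo, Flux and ControlCoefficients, and the step size h.

```csharp
using System;

public static class FccDemo
{
    // Steady-state flux of the toy pathway S -> X -> P with
    // v1 = e1 * (s - x) and v2 = e2 * x. Setting v1 = v2 gives x = e1 * s / (e1 + e2).
    public static double Flux(double e1, double e2, double s)
    {
        double x = e1 * s / (e1 + e2);
        return e2 * x;
    }

    // Flux control coefficients estimated from their definition:
    // scale one enzyme activity by (1 + h) and divide the relative change in flux by h.
    public static double[] ControlCoefficients(double e1, double e2, double s, double h = 1e-6)
    {
        double j = Flux(e1, e2, s);
        double c1 = (Flux(e1 * (1 + h), e2, s) - j) / (j * h);
        double c2 = (Flux(e1, e2 * (1 + h), s) - j) / (j * h);
        return new[] { c1, c2 };
    }
}
```

With e1 = 1, e2 = 3 and s = 2, the two estimates come out close to 0.75 and 0.25, and their sum is close to 1, as the summation theorem requires.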

Tuesday, April 30, 2013

Project ROSALIND: Finding a Protein Motif

The following piece of code is an attempt to solve the "Finding a Protein Motif" puzzle from Project ROSALIND.

The input is a list of UniProt Protein Database access IDs. For each ID, the code reads the protein's amino acid sequence from a URL of the form http://www.uniprot.org/uniprot/uniprot_id.fasta. Then, for each protein, it searches for the N-glycosylation motif (a motif is a significant amino acid pattern), which is written as N{P}[ST]{P}. In this format, [XY] means either X or Y, and {X} means any amino acid except X.

The code properly handles overlaps: in the string NMSNSSS, for example, there are two overlapping substrings that satisfy the motif, NMSN and NSSS. Regex.Matches does not report overlapping matches, because each search resumes after the end of the previous match, so some additional string manipulation was required.

The URL http://prosite.expasy.org/scanprosite/ can be used to verify the output.

List<string> proteins = new List<string>();

string line;
using (StreamReader reader = new StreamReader("input.txt"))
{
 while ((line = reader.ReadLine()) != null)
 {
  proteins.Add(line);
 }
}

WebClient client = new WebClient();
Dictionary<string, string> proteinsDict = new Dictionary<string, string>();
foreach (string id in proteins)
{
 Stream stream = client.OpenRead("http://www.uniprot.org/uniprot/" + id + ".fasta");

 if (stream != null)
  using (StreamReader reader = new StreamReader(stream))
  {
   string protein = string.Empty;
   while ((line = reader.ReadLine()) != null)
   {
    if (!line.StartsWith(">"))
    {
     protein += line;
    }
   }

   proteinsDict.Add(id, protein);
  }
}

const string pattern = @"N[^P][ST][^P]";

using (StreamWriter writer = new StreamWriter("output.txt"))
{
 foreach (KeyValuePair<string, string> kvp in proteinsDict)
 {
  string val = kvp.Value;
  List<int> matches = new List<int>();
  int removed = 0;
  bool done = false;
  while (done == false)
  {
   Match match = Regex.Match(val, pattern);
   if(match.Success)
   {
    int index = match.Index; // position of the match within the remaining string
    matches.Add(index + removed + 1);
    removed += index + 1;
    val = val.Substring(index + 1, val.Length - (index + 1));
   }
   else
   {
    done = true;
   }
  }

  if(matches.Count > 0)
  {
   string indices = string.Empty;
   writer.WriteLine(kvp.Key);
   indices = matches.Aggregate(indices, (current, index) => current + index + " ");
   writer.WriteLine(indices);
  }
 }
}
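As an alternative to trimming the string after every hit, overlapping occurrences can also be collected with a zero-width lookahead. This is a sketch of a different technique, not the approach used above; MotifDemo and MotifPositions are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MotifDemo
{
    // Returns the 1-based start positions of every occurrence of the
    // N-glycosylation motif, including overlapping occurrences.
    public static List<int> MotifPositions(string protein)
    {
        // The lookahead (?=...) matches without consuming characters,
        // so the search resumes at the very next position.
        Regex lookahead = new Regex("(?=N[^P][ST][^P])");
        List<int> positions = new List<int>();
        foreach (Match match in lookahead.Matches(protein))
        {
            positions.Add(match.Index + 1);
        }
        return positions;
    }
}
```

For the string NMSNSSS this returns positions 1 and 4, which correspond to NMSN and NSSS.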

References

Finding a Protein Motif
My Profile at Project ROSALIND

Friday, April 5, 2013

Project ROSALIND: Rabbits and Recurrence Relations

I came across Project ROSALIND, which is described as learning bioinformatics through problem solving. It is intriguing and well designed, so I started by solving some of the introductory problems.

The first interesting problem was a modified Fibonacci sequence. Actually, I did not know that the Fibonacci sequence originated as a model of rabbit reproduction. The model assumes that rabbits reach reproductive age after one month and that every mature pair of rabbits produces a pair of newborn rabbits each month. The modified problem, however, has every mature pair produce k pairs of newborn rabbits each month. The task is to calculate the total number of rabbit pairs after n months, assuming we start with one pair of newborn rabbits.

While the problem could be solved by recursion, the cost of calculation would be high: for every successive month, the program would recalculate the full solution for each previous month. A better approach is dynamic programming (which, in essence, is just remembering and reusing already calculated values). Here is the modified solution in C#.

/// <summary>
/// Modified Fibonacci problem: each rabbit pair matures in 1 month and produces "pairs" of newborn rabbit pairs each month
/// </summary>
/// <param name="pairs">Number of newborn rabbit pairs produced by a mature pair each month</param>
/// <param name="to">Number of months</param>
/// <returns>Total number of rabbit pairs after "to" months</returns>
static Int64 Fibonacci(int pairs, int to)
{
 if (to == 0)
 {
  return 0;
 }

 Int64 mature = 0;
 Int64 young = 1;

 Int64 next_mature;
 Int64 next_young;
 Int64 result = 0;
 for (int i = 0; i < to; i++)
 {
  result = mature + young;

  next_mature = mature + young;
  next_young = mature * pairs;

  mature = next_mature;
  young = next_young;
 }
 return result;
}

Note: the result grows fast! When trying to use the default Int32 (32 bit, or up to ~2 billion) and calculate the result for 4 pairs and 32 months, the value overflowed at around month 23.
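If even Int64 might not be wide enough, System.Numerics.BigInteger avoids overflow altogether at the cost of speed. Below is a minimal sketch of the same recurrence; RabbitDemo and RabbitPairs are hypothetical names used only for illustration.

```csharp
using System.Numerics;

public static class RabbitDemo
{
    // Total number of rabbit pairs after 'months' months, when every mature
    // pair produces 'litter' newborn pairs each month.
    public static BigInteger RabbitPairs(int months, int litter)
    {
        if (months <= 0) return BigInteger.Zero;

        BigInteger mature = BigInteger.Zero;
        BigInteger young = BigInteger.One;
        for (int month = 1; month < months; month++)
        {
            BigInteger born = mature * litter; // offspring of the current mature pairs
            mature += young;                   // last month's young pairs mature
            young = born;
        }
        return mature + young;
    }
}
```

For 5 months and 3 pairs per litter this gives 19, the same value as the Int64 version.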

The next problem was another variation on the rabbit simulation. In this case, the rabbits are mortal and die after k months. My solution was to keep a counter for the rabbits of each age at each step. I keep the counters in a dictionary, where the key is the age of a rabbit pair and the value is the number of rabbit pairs of that age at that step.

/// <summary>
/// Mortal Rabbits Fibonacci sequence variation
/// </summary>
/// <param name="months">How many months does the simulation run for</param>
/// <param name="lifespan">Rabbit lifespan</param>
/// <returns>A count of rabbit pairs alive at the end</returns>
static UInt64 MortalRabbits(int months, int lifespan)
{
 Dictionary<int, UInt64> dRabbits = GetEmptyDictionary(lifespan);
 dRabbits[0]++;

 for (int i = 0; i < months - 1; i++)
 {
  Dictionary<int, UInt64> newRabbits = GetEmptyDictionary(lifespan);
  foreach (KeyValuePair<int, UInt64> pair in dRabbits)
  {
   int age = pair.Key;

   if (age == 0)
   {
    newRabbits[1] = newRabbits[1] + dRabbits[age];
   }
   else if (age > 0 && age < lifespan - 1)
   {
    newRabbits[age + 1] = newRabbits[age + 1] + dRabbits[age];
    newRabbits[0] = newRabbits[0] + dRabbits[age];
   }
   else if (age == lifespan - 1)
   {
    newRabbits[0] = newRabbits[0] + dRabbits[age];
   }
  }
  dRabbits = newRabbits;
 }

 UInt64 count = 0;
 foreach (KeyValuePair<int, UInt64> pair in dRabbits)
 {
  count = count + pair.Value;
 }

 return count;
}

/// <summary>
/// Creates a dictionary where keys are integers from 0 to lifespan - 1, and all values are zeros
/// </summary>
/// <param name="lifespan"></param>
/// <returns>An empty dictionary</returns>
static Dictionary<int, UInt64> GetEmptyDictionary(int lifespan)
{
 Dictionary<int, UInt64> dRabbits = new Dictionary<int, UInt64>();

 for (int i = 0; i < lifespan; i++)
 {
  dRabbits.Add(i, 0);
 }
 return dRabbits;
}
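Because the keys are simply the consecutive ages 0 to lifespan - 1, the same bookkeeping can also be done with a plain array. This is a sketch of that variant under the same rules as above (pairs of age 1 and older reproduce, pairs that have reached the last age die); MortalDemo is a hypothetical name.

```csharp
public static class MortalDemo
{
    public static ulong MortalRabbits(int months, int lifespan)
    {
        // ages[a] holds the number of rabbit pairs of age a (in months).
        ulong[] ages = new ulong[lifespan];
        ages[0] = 1;

        for (int month = 1; month < months; month++)
        {
            // Every pair of age 1 or older produces one newborn pair.
            ulong newborn = 0;
            for (int age = 1; age < lifespan; age++)
            {
                newborn += ages[age];
            }

            // Everyone gets one month older; the oldest group drops out.
            for (int age = lifespan - 1; age > 0; age--)
            {
                ages[age] = ages[age - 1];
            }
            ages[0] = newborn;
        }

        ulong total = 0;
        foreach (ulong count in ages)
        {
            total += count;
        }
        return total;
    }
}
```

For 6 months and a lifespan of 3 months this gives 4 pairs.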

References

Project ROSALIND
Modified Fibonacci Problem
Mortal Fibonacci Rabbits
Fibonacci Series
