Transforming into a Bioinformatician

Saturday, July 13, 2013

Stochastic and deterministic modelling.

1. Purposes of stochastic modelling

Deterministic modelling assumes that a system is continuous and evolves deterministically. Its behaviour is described by ordinary differential equations (ODEs), which are then solved. However, such models ignore phenomena, such as random fluctuations, that arise because every system consists of a finite number of discrete particles. For systems with very small particle numbers, deterministic models are not appropriate at all, because concentrations can no longer be treated as continuous.

Stochastic modelling takes into account that each system is composed of a finite, countable number of particles, and tracks particle numbers in the same way that a deterministic model tracks concentrations.

2. Drawbacks of stochastic modelling

2.1. Limits on particle numbers

When the number of particles in the system is very large, stochastic simulation becomes computationally very demanding, and developing an efficient algorithm is a complex task.

2.2. Lack of analysis methods

Stochastic modelling lacks analysis methods as rigorously developed as those available for deterministic modelling, such as metabolic control analysis.

3. Drawbacks of deterministic modelling

3.1. Systems with small particle numbers

Stochastic methods account for random fluctuations, which can significantly change system behaviour when the number of particles is small. In particular, a species can become extinct. Deterministic models do not account for fluctuations, and species concentrations never fall to zero. Therefore, in linear processes, the behaviour of a deterministic model is determined only by differences in concentrations, whereas stochastic models can behave differently. This remains true even if the stochastic systems have the same marginal distribution of system states.
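To make the extinction point concrete, here is a minimal sketch for the single decay reaction A -> (nothing). It uses Gillespie's direct method, which is my choice for illustration and is not named in the text above; the class and method names are hypothetical. The stochastic run always reaches zero particles in finite time, while the deterministic curve n0 * exp(-k t) stays above zero.

```csharp
using System;

public static class DecayDemo
{
    // Gillespie direct method for the single reaction A -> (nothing), rate constant k.
    // Returns the simulated time at which the last particle disappears.
    public static double StochasticExtinctionTime(int n0, double k, Random rng)
    {
        int n = n0;
        double t = 0.0;
        while (n > 0)
        {
            double propensity = k * n;
            // Exponentially distributed waiting time; 1 - NextDouble() lies in (0, 1], so Log is finite.
            t += -Math.Log(1.0 - rng.NextDouble()) / propensity;
            n--; // each reaction event removes exactly one particle
        }
        return t;
    }

    // Deterministic counterpart: solution of dn/dt = -k * n.
    public static double DeterministicAmount(int n0, double k, double t)
    {
        return n0 * Math.Exp(-k * t);
    }
}
```

Calling DecayDemo.StochasticExtinctionTime(10, 1.0, new Random(1)) gives a finite extinction time, whereas DecayDemo.DeterministicAmount evaluated at that time is still positive.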

3.2. Bistable systems

Under deterministic simulation, a bistable system always converges to the same stable steady state as long as the initial concentrations are the same. Under stochastic simulation, the system converges to one of the two stable states, and it cannot be predicted which one. The probability of the system converging to each state, however, can be calculated.

4. Difference between the deterministic solution and the mean of stochastic solutions

It should be noted that repeating the stochastic simulation many times and averaging the results does not, in general, reproduce the deterministic solution. The two coincide only for linear systems; for nonlinear systems they can be totally different.
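A short worked example may help here. The two reactions and the symbols k, c and n below are my own illustration, not taken from the reference. For linear decay the equation for the mean has the same form as the deterministic rate law, while for dimerisation an extra variance term appears.

```latex
% Linear decay A -> 0, propensity k n:
\frac{d\langle n\rangle}{dt} = -k\,\langle n\rangle
\quad\text{(same form as } \frac{dn}{dt} = -k\,n \text{)}

% Dimerisation A + A -> B, propensity c\,n(n-1)/2, each event removes two A:
\frac{d\langle n\rangle}{dt} = -c\,\langle n(n-1)\rangle
  = -c\left(\langle n\rangle^{2} - \langle n\rangle + \operatorname{Var}(n)\right)
\quad\text{(versus } \frac{dn}{dt} = -c\,n^{2} \text{)}
```

The difference between the two dimerisation equations is the term -c (Var(n) - <n>), which is why averaging stochastic runs of a nonlinear system need not reproduce the deterministic curve.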

5. Conclusion

Stochastic modelling should be chosen when particle numbers are in a range where the concept of a continuous concentration no longer applies, or when the stochastic phenomena themselves are the object of research. In practice, its application is limited by the particle number at which computation becomes infeasible.

References

Pahle J, Biochemical simulations: stochastic, approximate stochastic and hybrid approaches, Briefings in Bioinformatics 2009, 10(1), pp 53-64

by Evgeny. Also posted on my website

Monday, June 10, 2013

Project ROSALIND: Finding a shortest superstring

For a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring. This may be useful, for example, if we have a large number of fragments of a DNA molecule and want to figure out what the full DNA could look like.

For example, for the following strings

ATTAGACCTG
CCTGCCGGAA
AGACCTGCCG
GCCGGAATAC

The shortest superstring will be

ATTAGACCTGCCGGAATAC

The following code is a naive approach to solving the problem. The logic is as follows: sort the list of strings by length. Take the longest one and call it the superstring. Next, iterate through the list to find the string that has the longest overlap with either end of the superstring. Remove that string from the list and attach it to the superstring. Continue until the list is empty. This is a greedy heuristic: the resulting superstring is short, and it is the shortest one for the example above, but the approach is not guaranteed to find the shortest possible superstring for every input.

public static string ShortestSuperstring(List<string> input)
{
 input = input.OrderByDescending(x => x.Length).ToList();

 string superstring = input[0];
 input.RemoveAt(0);
 int counter = input.Count;
 for (int i = 0; i < counter; i++)
 {
  List<IntBoolString> items = new List<IntBoolString>();

  for (int j = 0; j < input.Count; j++)
  {
   items.Add(GetIntersection(superstring, input[j]));
  }

  IntBoolString chosen = items.OrderByDescending(x => x.intValue).First();

  superstring = CombineIntoSuper(superstring, chosen);
  input.Remove(chosen.stringValue);
 }

 return superstring;
}

private static IntBoolString GetIntersection(string super, string candidate)
{
 IntBoolString result = new IntBoolString();
 result.stringValue = candidate;

 int i = 0;

 while (candidate.Length > i)
 {
  int testlen = candidate.Length - i;
  string leftcan = candidate.Substring(0, testlen);
  string rightcan = candidate.Substring(i, testlen);
  string leftsuper = super.Substring(0, testlen);
  string rightsuper = super.Substring(super.Length - testlen, testlen);

  if (leftcan == rightsuper || rightcan == leftsuper)
  {
   result.boolValue = (leftcan == rightsuper) ? true : false;
   result.intValue = testlen;
   return result;
  }

  i++;
 }

 return result;
}

private static string CombineIntoSuper(string superstring, IntBoolString chosen)
{
 string toAppend = string.Empty;
 int lenToAppend = chosen.stringValue.Length - chosen.intValue;

 toAppend = (chosen.boolValue == true) ?
  chosen.stringValue.Substring(chosen.stringValue.Length - lenToAppend, lenToAppend) :
  chosen.stringValue.Substring(0, lenToAppend);

 superstring = (chosen.boolValue == true) ?
  superstring + toAppend :
  toAppend + superstring;

 return superstring;
}

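public struct IntBoolString
{
 public string stringValue; // the candidate string
 public int intValue;       // length of its overlap with the superstring (0 if none)
 public bool boolValue;     // true: candidate extends the superstring on the right; false: on the left
}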
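The "longest intersection" used above is simply the longest suffix of one string that equals a prefix of the other. As a small self-contained sketch (the class and method names are hypothetical and not part of the code above), it can be computed like this:

```csharp
using System;

public static class OverlapDemo
{
    // Length of the longest suffix of `left` that is also a prefix of `right`.
    public static int SuffixPrefixOverlap(string left, string right)
    {
        int max = Math.Min(left.Length, right.Length);
        for (int len = max; len > 0; len--)
        {
            // Ordinal comparison of the last `len` chars of left with the first `len` chars of right.
            if (string.CompareOrdinal(left, left.Length - len, right, 0, len) == 0)
                return len;
        }
        return 0;
    }

    // Joins two strings, writing the shared overlap only once.
    public static string Merge(string left, string right)
    {
        return left + right.Substring(SuffixPrefixOverlap(left, right));
    }
}
```

For the first and third sample strings, OverlapDemo.SuffixPrefixOverlap("ATTAGACCTG", "AGACCTGCCG") is 7 (the shared part is AGACCTG), and OverlapDemo.Merge returns ATTAGACCTGCCG.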

Saturday, June 1, 2013

Some string manipulations for future use.

1. Using an array of characters, return all possible permutations of this array (without repetitions).

public static List<string> StringPermutations(char[] list)
{
 List<string> result = new List<string>();
 int x=list.Length-1;
 go(list,0,x, result);
 return result;
}

private static void go (char[] list, int k, int m, List<string> result)
{
 int i;
 if (k == m)
 {
  result.Add(new string(list));
 }
 else
 for (i = k; i <= m; i++)
 {
  swap (ref list[k],ref list[i]);
  go (list, k+1, m, result);
  swap (ref list[k],ref list[i]);
 }
}

private static void swap(ref char a, ref char b)
{
 // Plain temporary-variable swap; also safe when a and b refer to the same element.
 char temp = a;
 a = b;
 b = temp;
}

Sample usage

List<string> permutations = Helper.StringPermutations(new char[] {'D', 'N', 'A'});

Sample output

DNA
DAN
NDA
NAD
AND
ADN

2. Using an array of characters ("alphabet"), return all possible words of a specified length generated from this alphabet.

public static IEnumerable<String> GetWordsWithRepetition(Int32 length, char[] alphabet)
{
 if (length <= 0)
  yield break;

 for(int i = 0; i < alphabet.Length; i++) // (Char c = 'A'; c <= 'Z'; c++)
 {
  char c = alphabet[i];
  if (length > 1)
  {
   foreach (String restWord in GetWordsWithRepetition(length - 1, alphabet))
    yield return c + restWord;
  }
  else
   yield return "" + c;
 }
}

3. This can further be used to build a full "dictionary" of all possible words up to a specified length.

public static string ALPHABET = "D N A";

public static List<string> Dictionary(int length)
{
 char[] alphabet = Helper.AlphabetFromString(ALPHABET);

 List<string> final = new List<string>();

 for (int i = 1; i <= length; i++)
 {
  List<string> result = Helper.GetWordsWithRepetition(i, alphabet).ToList();
  final.AddRange(result);
 }
 return final;
}

public static char[] AlphabetFromString(string input)
{
 string[] split = input.Split(' ');
 char[] alphabet = new char[split.Count()];
 for (int i = 0; i < alphabet.Length; i++)
 {
  alphabet[i] = split[i][0];
 }
 return alphabet;
}

4. The dictionary words can then be sorted according to the order of the provided alphabet by using a comparer.

public static int WordComparer(string one, string two)
{
 char[] alphabet = AlphabetFromString(ALPHABET);

 int len = Math.Min(one.Length, two.Length);
 for (int i = 0; i < len; i++)
 { 
  int posOne = Array.IndexOf(alphabet, one[i]);
  int posTwo = Array.IndexOf(alphabet, two[i]);
  if (posOne == posTwo)
  {
   continue;
  }
  else if(posTwo > posOne)
  {
   return -1;
  }
  return 1;
 }
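 // All shared positions match: the shorter word sorts first, and identical words compare as equal (0).
 return one.Length.CompareTo(two.Length);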
}

Sample usage

List<string> final = Dictionary(3);
final.Sort(WordComparer); // List<T>.Sort sorts in place and returns void

Sample output

D
DD
DDD
DDN
DDA
DN
DND
DNN
DNA
DA
DAD
DAN
DAA
N
ND
NDD
NDN
NDA
NN
NND
NNN
NNA
NA
NAD
NAN
NAA
A
AD
ADD
ADN
ADA
AN
AND
ANN
ANA
AA
AAD
AAN
AAA
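As a quick sanity check on the sample output, the number of words can be counted directly. The symbols a (alphabet size) and L (maximum word length) are introduced here only for this check.

```latex
\text{words of length } i:\; a^{i}
\qquad
\text{dictionary size: } \sum_{i=1}^{L} a^{i} = \frac{a\,(a^{L}-1)}{a-1}

a = 3,\; L = 3:\quad 3 + 9 + 27 = 39
```

The sample output above indeed contains 39 words.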

Tuesday, May 14, 2013

Metabolic Control Analysis and Enzyme Kinetics

1. Drawbacks of rate-limiting step concept

At steady state, the flux through each pathway in a biochemical network is a function of the kinetic properties of the individual enzymes. The activity of an enzyme affects the concentrations of its reactants and products and influences the flux through pathways. Metabolic control analysis (MCA) provides a mathematical framework for studying how control over metabolic fluxes and concentrations is distributed among the pathways that comprise the model. It replaces the principle of the rate-limiting step, which proved to be ineffective in practice. Control of the system as a whole is much more distributed than was previously appreciated, which makes the rate-limiting step concept not very useful.

2. Purpose of MCA

The purpose of MCA is to identify the steps that have the strongest effect on metabolite levels and fluxes. It is based on how the overall steady-state flux responds to changes in the individual enzyme activities.

3. MCA coefficients

The challenge in analysing a metabolic network is the determination of flux control coefficients (FCC). An FCC measures how the flux changes in response to a small perturbation in the activity or concentration of an enzyme. Its value therefore indicates how important a particular enzyme is in determining the steady-state flux. Another set of variables are the elasticity coefficients, which quantify the influence of the metabolite pool levels on the individual pathway reactions.

4. MCA theorems

MCA uses two theorems. The first is the summation theorem, which states that the sum of all FCC related to a particular pathway equals 1. The second, and more important, is the connectivity theorem, which explains how enzyme kinetics affect the values of the FCC. It states that the sum of the products of the FCC of all steps that are affected by X and their elasticity coefficients towards X is zero.
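In the usual notation, which is assumed here rather than taken from the text above (J is the steady-state flux, v_i the rate of step i, X a metabolite concentration), the coefficients and the two theorems can be written as:

```latex
% Flux control coefficient and elasticity coefficient
C^{J}_{i} = \frac{\partial J}{\partial v_i}\,\frac{v_i}{J} = \frac{\partial \ln J}{\partial \ln v_i},
\qquad
\varepsilon^{i}_{X} = \frac{\partial v_i}{\partial X}\,\frac{X}{v_i}

% Summation theorem
\sum_{i} C^{J}_{i} = 1

% Connectivity theorem
\sum_{i} C^{J}_{i}\,\varepsilon^{i}_{X} = 0
```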

5. Estimating FCC

There are several ways of estimating FCC, which can be roughly divided into experimental estimation and modelling.

5.1 Experimental estimation

Enzyme activities can be perturbed and the resulting changes in flux measured.
If the kinetics of each step of the pathway are known, elasticity coefficients can be calculated, and the FCC can then be derived from them.
Enzyme activities can be titrated in vitro.

5.2 Estimation through modelling

Directly from the definition: apply a small change to a reaction rate and calculate the resulting change in flux or concentration.
Using matrix methods based on the summation and connectivity theorems. The first approach uses two matrices, one containing elasticity coefficients and the other containing FCC. It works, but is hard to implement in software. An alternative approach, developed by Reder, requires only the stoichiometry matrix and the elasticity coefficients, which makes it the better choice for calculating FCC in software.
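The first route, estimating FCC directly from their definition, can be sketched numerically. The pathway below (X0 -> S -> X1 with v1 = e1 * (x0 - s) and v2 = e2 * s) and all names are hypothetical and chosen only because the steady state has a simple closed form; this is not a general MCA implementation.

```csharp
using System;

public static class FccDemo
{
    // Steady-state flux of the toy pathway X0 -> S -> X1.
    // Setting v1 = v2 gives s = e1 * x0 / (e1 + e2), and the flux is J = e2 * s.
    public static double SteadyStateFlux(double e1, double e2, double x0)
    {
        double s = e1 * x0 / (e1 + e2);
        return e2 * s;
    }

    // FCC from the definition: relative change in flux divided by the
    // small relative change applied to one enzyme activity.
    public static double[] FluxControlCoefficients(double e1, double e2, double x0, double rel = 1e-6)
    {
        double j = SteadyStateFlux(e1, e2, x0);
        double j1 = SteadyStateFlux(e1 * (1 + rel), e2, x0);
        double j2 = SteadyStateFlux(e1, e2 * (1 + rel), x0);
        return new[] { (j1 - j) / j / rel, (j2 - j) / j / rel };
    }
}
```

For e1 = 1, e2 = 3 and x0 = 10 the coefficients come out close to 0.75 and 0.25, and their sum is close to 1, as the summation theorem requires. In this toy pathway the slower enzyme holds most, but not all, of the control.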

Tuesday, April 30, 2013

Project ROSALIND: Finding a Protein Motif

The following piece of code is an attempt to solve the "Finding a Protein Motif" puzzle from Project ROSALIND.

The input is a list of UniProt Protein Database access IDs. For each ID, the code reads the protein amino acid sequence from a URL of the form http://www.uniprot.org/uniprot/uniprot_id.fasta. Then, for each protein, it searches for the N-glycosylation motif (a motif is a significant amino acid pattern), which is written as N{P}[ST]{P}. In this format, [XY] means either X or Y, and {X} means any amino acid except X.

The code properly handles overlaps. For example, in the string NMSNSSS there are two overlapping substrings that satisfy the motif: NMSN and NSSS. The Regex.Matches method returns only non-overlapping matches, so some matches would be missed, and some additional string manipulation was required.

The url http://prosite.expasy.org/scanprosite/ can be used to verify the output.
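An alternative way to catch overlapping matches, shown here only as a sketch and not as the approach used in the code below, is to wrap the pattern in a zero-width lookahead. Each match then consumes no characters, so Regex.Matches reports every starting position. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MotifFinder
{
    // Returns the 1-based start positions of every occurrence of the
    // N-glycosylation motif N{P}[ST]{P}, including overlapping ones.
    public static List<int> FindMotifPositions(string protein)
    {
        var positions = new List<int>();
        // (?=...) is a lookahead: it checks for the motif without consuming it.
        foreach (Match m in Regex.Matches(protein, "(?=N[^P][ST][^P])"))
        {
            positions.Add(m.Index + 1);
        }
        return positions;
    }
}
```

For the string NMSNSSS this returns positions 1 and 4, corresponding to NMSN and NSSS.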

List<string> proteins = new List<string>();

string line;
using (StreamReader reader = new StreamReader("input.txt"))
{
 while ((line = reader.ReadLine()) != null)
 {
  proteins.Add(line);
 }
}

WebClient client = new WebClient();
Dictionary<string, string> proteinsDict = new Dictionary<string, string>();
foreach (string id in proteins)
{
 Stream stream = client.OpenRead("http://www.uniprot.org/uniprot/" + id + ".fasta");

 if (stream != null)
  using (StreamReader reader = new StreamReader(stream))
  {
   string protein = string.Empty;
   while ((line = reader.ReadLine()) != null)
   {
    if (!line.StartsWith(">"))
    {
     protein += line;
    }
   }

   proteinsDict.Add(id, protein);
  }
}

const string pattern = @"N[^P][ST][^P]";

using (StreamWriter writer = new StreamWriter("output.txt"))
{
 foreach (KeyValuePair<string, string> kvp in proteinsDict)
 {
  string val = kvp.Value;
  List<int> matches = new List<int>();
  int removed = 0;
  bool done = false;
  while (done == false)
  {
   Match match = Regex.Match(val, pattern);
   if(match.Success)
   {
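    // Position of the leftmost match in the remaining text; the text is then
    // cut just after this position so that overlapping matches are found too.
    int index = match.Index;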
    matches.Add(index + removed + 1);
    removed += index + 1;
    val = val.Substring(index + 1, val.Length - (index + 1));
   }
   else
   {
    done = true;
   }
  }

  if(matches.Count > 0)
  {
   string indices = string.Empty;
   writer.WriteLine(kvp.Key);
   indices = matches.Aggregate(indices, (current, index) => current + index + " ");
   writer.WriteLine(indices);
  }
 }
}

References

Finding a Protein Motif
My Profile at Project ROSALIND
Friday, April 5, 2013

Project ROSALIND: Rabbits and Recurrence Relations

I came across Project ROSALIND, which is described as learning bioinformatics through problem solving. It is intriguing and well designed, so I started by solving some of the introductory problems.

The first interesting problem was a modified Fibonacci sequence. I did not know that the Fibonacci sequence originated as a model of rabbit reproduction. The model assumes that rabbits reach reproductive age after one month, and that every mature pair of rabbits produces a pair of newborn rabbits each month. In the modified problem, every mature pair produces k pairs of newborn rabbits each month. The task is to calculate the total number of rabbit pairs after n months, starting with one pair of newborn rabbits.
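Written as a recurrence (the symbol F_n for the number of pairs in month n is introduced here for convenience), the modified rule reads:

```latex
F_1 = F_2 = 1, \qquad F_n = F_{n-1} + k\,F_{n-2} \quad (n \ge 3)

% Example with k = 3:
F_1, \dots, F_5 = 1,\; 1,\; 4,\; 7,\; 19
```

Every pair alive in month n-1 is still alive in month n, and every pair that existed in month n-2 is mature by then and contributes k newborn pairs.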

While the problem could be solved by recursion, the cost of calculation would be high. Every successive month the program would re-calculate the full solution for each previous month. A better approach is dynamic programming (which, in essence, is just remembering and reusing the already calculated values). Here is the modified solution in C#.

/// <summary>
/// Modified Fibonacci problem: each rabbit pair matures in 1 month and produces "pairs" of newborn rabbit pairs each month
/// </summary>
/// <param name="pairs">Number of newborn rabbit pairs produced by a mature pair each month</param>
/// <param name="to">Number of months</param>
/// <returns>Total number of rabbit pairs after "to" months</returns>
static Int64 Fibonacci(int pairs, int to)
{
 if (to == 0)
 {
  return 0;
 }

 Int64 mature = 0;
 Int64 young = 1;

 Int64 next_mature;
 Int64 next_young;
 Int64 result = 0;
 for (int i = 0; i < to; i++)
 {
  result = mature + young;

  next_mature = mature + young;
  next_young = mature * pairs;

  mature = next_mature;
  young = next_young;
 }
 return result;
}

Note: the result grows fast! When trying to use the default Int32 (32 bit, or up to ~2 billion) and calculate the result for 4 pairs and 32 months, the value overflowed at around month 23.

The next problem was another variation on the rabbit simulation. In this case, the rabbits are mortal and die after k months. My solution was to have a counter for rabbits of each age at each step. I keep the counters in the dictionary, where the key is the age of a rabbit pair and the value is the number of rabbit pairs of that age on that step.

/// <summary>
/// Mortal Rabbits Fibonacci sequence variation
/// </summary>
/// <param name="months">How many months does the simulation run for</param>
/// <param name="lifespan">Rabbit lifespan</param>
/// <returns>A count of rabbit pairs alive at the end</returns>
static UInt64 MortalRabbits(int months, int lifespan)
{
 Dictionary<int, UInt64> dRabbits = GetEmptyDictionary(lifespan);
 dRabbits[0]++;

 for (int i = 0; i < months - 1; i++)
 {
  Dictionary<int, UInt64> newRabbits = GetEmptyDictionary(lifespan);
  foreach (KeyValuePair<int, UInt64> pair in dRabbits)
  {
   int age = pair.Key;

   if (age == 0)
   {
    newRabbits[1] = newRabbits[1] + dRabbits[age];
   }
   else if (age > 0 && age < lifespan - 1)
   {
    newRabbits[age + 1] = newRabbits[age + 1] + dRabbits[age];
    newRabbits[0] = newRabbits[0] + dRabbits[age];
   }
   else if (age == lifespan - 1)
   {
    newRabbits[0] = newRabbits[0] + dRabbits[age];
   }
  }
  dRabbits = newRabbits;
 }

 UInt64 count = 0;
 foreach (KeyValuePair<int, UInt64> pair in dRabbits)
 {
  count = count + pair.Value;
 }

 return count;
}

/// <summary>
/// Creates a dictionary where keys are integers from 0 to lifespan - 1, and all values are zeros
/// </summary>
/// <param name="lifespan"></param>
/// <returns>An empty dictionary</returns>
static Dictionary<int, UInt64> GetEmptyDictionary(int lifespan)
{
 Dictionary<int, UInt64> dRabbits = new Dictionary<int, UInt64>();

 for (int i = 0; i < lifespan; i++)
 {
  dRabbits.Add(i, 0);
 }
 return dRabbits;
}

References

Project ROSALIND
Modified Fibonacci Problem
Mortal Fibonacci Rabbits
Fibonacci Series
Wednesday, December 19, 2012

Stoichiometry matrix

1. Definition

The stoichiometry matrix (SM) is a systematic arrangement of the stoichiometric information from the reactions comprising the model. In a system with m species and n reactions, the dimensions of the matrix are m × n. Chemical species are represented by rows and reactions by columns. The elements of the matrix are the corresponding stoichiometric coefficients. The selection of the system boundaries defines the complexity of the SM. When the concentration of a species is considered fixed, its row is removed from the matrix.

The set of equations represented in the matrix together expresses the dynamics of the metabolite concentrations as

dS/dt = N*v,

where N is the matrix, v is the vector of fluxes and S is the vector of metabolite concentrations.
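As a small illustration of dS/dt = N*v, consider a hypothetical three-reaction chain: R1 produces A, R2 converts A to B, and R3 removes B. The network, the flux values and all names below are assumptions made for this sketch.

```csharp
using System;

public static class StoichiometryDemo
{
    // Rows: species A, B. Columns: R1 (-> A), R2 (A -> B), R3 (B ->).
    public static readonly double[,] N =
    {
        { 1, -1,  0 },
        { 0,  1, -1 }
    };

    // Computes dS/dt = N * v for a given flux vector v.
    public static double[] Derivative(double[,] n, double[] v)
    {
        int rows = n.GetLength(0);
        int cols = n.GetLength(1);
        var dS = new double[rows];
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                dS[i] += n[i, j] * v[j];
            }
        }
        return dS;
    }
}
```

With equal fluxes v = (2, 2, 2) both derivatives are zero, so the concentrations do not change. With v = (3, 2, 1) both A and B accumulate at a rate of 1.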

2. Applications

At steady state, the concentration of each species is constant over time, so N*v = 0. Using the SM, it is therefore possible to enumerate all possible steady-state flux solutions of a given network.

Personally, I like the fact that the SM is a crossroads of mathematics and biology, equally making sense for a person with a background in biology or information technology or mathematics.

2.1. Network reconstruction

The whole table of reactions encoded in the genome may be represented as an SM. If the genes that encode enzymes, and the reactions that each enzyme carries out, are listed, the resulting table can be converted into the SM.

2.2. Mass conservation analysis

The SM contains all information about the reaction network, and therefore all the data necessary to analyse mass conservation. Conservation relations appear in the SM as rows that are linear combinations of other rows. Removing all such rows yields the reduced matrix, which is used by software packages such as COPASI.
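One way to see this in code is to count the linearly dependent rows: the number of conservation relations equals the number of species minus the rank of the SM. The sketch below computes the rank by Gaussian elimination; it is an illustration with hypothetical names, not the procedure used by COPASI.

```csharp
using System;

public static class ConservationDemo
{
    // Rank of a matrix by Gaussian elimination with partial pivoting.
    public static int Rank(double[,] matrix, double eps = 1e-9)
    {
        var a = (double[,])matrix.Clone();
        int rows = a.GetLength(0), cols = a.GetLength(1), rank = 0;
        for (int col = 0; col < cols && rank < rows; col++)
        {
            // Choose the row with the largest entry in this column as pivot.
            int pivot = rank;
            for (int r = rank + 1; r < rows; r++)
                if (Math.Abs(a[r, col]) > Math.Abs(a[pivot, col])) pivot = r;
            if (Math.Abs(a[pivot, col]) < eps) continue; // no pivot in this column

            for (int c = 0; c < cols; c++)
            {
                double tmp = a[rank, c]; a[rank, c] = a[pivot, c]; a[pivot, c] = tmp;
            }
            // Eliminate the column below the pivot row.
            for (int r = rank + 1; r < rows; r++)
            {
                double f = a[r, col] / a[rank, col];
                for (int c = col; c < cols; c++) a[r, c] -= f * a[rank, c];
            }
            rank++;
        }
        return rank;
    }

    // Number of conservation relations = number of species (rows) - rank.
    public static int ConservationRelationCount(double[,] n)
    {
        return n.GetLength(0) - Rank(n);
    }
}
```

For a closed interconversion of A and B (rows A and B, columns A -> B and B -> A), the matrix { { -1, 1 }, { 1, -1 } } has rank 1, so there is one conservation relation: the total A + B stays constant.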

2.3. Stoichiometric modelling

In stoichiometric modelling, there are three major approaches: metabolic flux analysis (MFA), flux balance analysis (FBA) and metabolic pathway analysis (MPA). All three work by defining a high-dimensional solution space of possible metabolic flux distributions, based on the SM, which specifies the conservation relationships of the system. The difference between the three approaches lies in how metabolic flux distributions are selected from the solution space.

MFA is a traditional approach which relies on extensive experimental data and computes a metabolic flux vector for a particular condition. Experimental data is used to simplify the SM.

FBA identifies only one optimal solution while alternative optimal solutions may exist. It very much depends on the validity of the model.

MPA, unlike the other two methods, can identify all metabolic flux vectors in a network. A finite set of solutions is achieved by additional constraints on the flux space.

References

Smolke C, The Metabolic Pathway Engineering Handbook: Fundamentals, CRC Press, 2009

Trinh T, Wlaschin A, Srienc F, Elementary Mode Analysis: A Useful Metabolic Pathway Analysis Tool for Characterizing Cellular Metabolism, Appl Microbiol Biotechnol, 2009, 81(5), pp 813-826

Wang X, Chen J, Quinn P, Reprogramming Microbial Metabolic Pathways, Springer, 2012
