Tuesday, April 30, 2013

Project ROSALIND: Finding a Protein Motif

The following piece of code is an attempt to solve the "Finding a Protein Motif" puzzle from the Project Rosalind.

The input is a list of UniProt Protein Database access IDs. For each ID, the code reads the protein aminoacid sequence from the url in the form of http://www.uniprot.org/uniprot/uniprot_id.fasta. Then, for each protein, it searches for the N-glycosylation motif (a motif is a significant amino acid pattern), which is written as N{P}[ST]{P}. In this format, [X] means any aminoacid, and {X} means any amino acid except X.

The code properly handles overlaps, i.e. in the NMSNSSS string there are two overlapping substrings that satisfy the motif: NMSN and NSSS. The overlaps are not handled properly by the Regex.Matches method (some matches are missed), so some additional string manipulations were required.

The url http://prosite.expasy.org/scanprosite/ can be used to verify the output.

List<string> proteins = new List<string>();

string line;
using (StreamReader reader = new StreamReader("input.txt"))
{
 while ((line = reader.ReadLine()) != null)
 {
  proteins.Add(line);
 }
}

WebClient client = new WebClient();
Dictionary<string, string> proteinsDict = new Dictionary<string, string>();
foreach (string id in proteins)
{
 Stream stream = client.OpenRead("http://www.uniprot.org/uniprot/" + id + ".fasta");

 if (stream != null)
  using (StreamReader reader = new StreamReader(stream))
  {
   string protein = string.Empty;
   while ((line = reader.ReadLine()) != null)
   {
    if (!line.StartsWith(">"))
    {
     protein += line;
    }
   }

   proteinsDict.Add(id, protein);
  }
}

const string pattern = @"N[^P][ST][^P]";

using (StreamWriter writer = new StreamWriter("output.txt"))
{
 foreach (KeyValuePair<string, string> kvp in proteinsDict)
 {
  string val = kvp.Value;
  List<int> matches = new List<int>();
  int removed = 0;
  bool done = false;
  while (done == false)
  {
   Match match = Regex.Match(val, pattern);
   if(match.Success)
   {
    int index = val.IndexOf(match.Value);
    matches.Add(index + removed + 1);
    removed += index + 1;
    val = val.Substring(index + 1, val.Length - (index + 1));
   }
   else
   {
    done = true;
   }
  }

  if(matches.Count > 0)
  {
   string indices = string.Empty;
   writer.WriteLine(kvp.Key);
   indices = matches.Aggregate(indices, (current, index) => current + index + " ");
   writer.WriteLine(indices);
  }
 }
}

References

Finding a Protein Motif
My Profile at Project ROSALIND
by . Also posted on my website

Friday, April 5, 2013

Project ROSALIND: Rabbits and Recurrence Relations

I came across the project ROSALIND which is described as learning bioinformatics through problem solving. It is intriguing and well-designed, so I started with solving some introductory ones.

The first interesting problem was modified Fibonacchi sequence. Actually, I did not know that the background of the Fibonacci sequence was modelling of rabbit reproduction. It assumed that rabbits reach reproductive age after one month, and that every mature pair of rabbits produced a pair of newborn rabbits each month. A modified problem, however, suggested that every mature pair of rabbits produced k pairs of newborn rabbits each month. The task is to calculate a total number of rabbit pairs after n months, assuming we have one pair of newborn rabbits at the start.

While the problem could be solved by recursion, the cost of calculation would be high. Every successive month the program would re-calculate the full solution for each previous month. A better approach is dynamic programming (which, in essence, is just remembering and reusing the already calculated values). Here is the modified solution in C#.

/// <summary>
/// Modified Fibonacchi problem: each rabbit pair matures in 1 month and produces "pairs" of newborn rabbit pairs each month
/// </summary>
/// <param name="pairs">Number of newborn rabbit pairs produced by a mature pair each month</param>
/// <param name="to">Number of months</param>
/// <returns>Total number of rabbit pairs after "to" months</returns>
static Int64 Fibonacci(int pairs, int to)
{
 if (to == 0)
 {
  return 0;
 }

 Int64 mature = 0;
 Int64 young = 1;

 Int64 next_mature;
 Int64 next_young;
 Int64 result = 0;
 for (int i = 0; i < to; i++)
 {
  result = mature + young;

  next_mature = mature + young;
  next_young = mature * pairs;

  mature = next_mature;
  young = next_young;
 }
 return result;
}

Note: the result grows fast! When trying to use the default Int32 (32 bit, or up to ~2 billion) and calculate the result for 4 pairs and 32 months, the value overflowed at around month 23.

The next problem was another variation on the rabbit simulation. In this case, the rabbits are mortal and die after k months. My solution was to have a counter for rabbits of each age at each step. I keep the counters in the dictionary, where the key is the age of a rabbit pair and the value is the number of rabbit pairs of that age on that step.

/// <summary>
/// Mortal Rabbits Fibonacci sequence variation
/// </summary>
/// <param name="months">How many months does the simulation run for</param>
/// <param name="lifespan">Rabbit lifespan</param>
/// <returns>A count of rabbit pairs alive at the end</returns>
static UInt64 MortalRabbits(int months, int lifespan)
{
 Dictionary<int, UInt64> dRabbits = GetEmptyDictionary(lifespan);
 dRabbits[0]++;

 for (int i = 0; i < months - 1; i++)
 {
  Dictionary<int, UInt64> newRabbits = GetEmptyDictionary(lifespan);
  foreach (KeyValuePair<int, UInt64> pair in dRabbits)
  {
   int age = pair.Key;

   if (age == 0)
   {
    newRabbits[1] = newRabbits[1] + dRabbits[age];
   }
   else if (age > 0 && age < lifespan - 1)
   {
    newRabbits[age + 1] = newRabbits[age + 1] + dRabbits[age];
    newRabbits[0] = newRabbits[0] + dRabbits[age];
   }
   else if (age == lifespan - 1)
   {
    newRabbits[0] = newRabbits[0] + dRabbits[age];
   }
  }
  dRabbits = newRabbits;
 }

 UInt64 count = 0;
 foreach (KeyValuePair<int, UInt64> pair in dRabbits)
 {
  count = count + pair.Value;
 }

 return count;
}

/// <summary>
/// Creates an dictionary where keys are integers from 0 to lifespan - 1, and all values are zeros
/// </summary>
/// <param name="lifespan"></param>
/// <returns>An empty dictionary</returns>
static Dictionary<int, UInt64> GetEmptyDictionary(int lifespan)
{
 Dictionary<int, UInt64> dRabbits = new Dictionary<int, UInt64>();

 for (int i = 0; i < lifespan; i++)
 {
  dRabbits.Add(i, 0);
 }
 return dRabbits;
}

References

Project ROSALIND
Modified Fibonacci Problem
Mortal Fibonacci Rabbits
Fibonacci Series
by . Also posted on my website