Wednesday, December 19, 2012

Stoichiometry matrix

1. Definition

The stoichiometry matrix (SM) is a systematic arrangement of the stoichiometric information from the reactions comprising the model. In a system with m species and n reactions the dimensions of the matrix are m×n. Chemical species are represented by rows and reactions by columns, and the elements of the matrix are the corresponding stoichiometric coefficients. The selection of the system boundaries defines the complexity of the SM: when the concentration of a species is considered fixed, the corresponding row is removed from the matrix.

The set of equations represented in the matrix together expresses the dynamics of the metabolite concentrations as

dS/dt = N*v,

where N is the matrix, v is the vector of fluxes and S is the vector of metabolite concentrations.
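As a small illustration (not part of the original post), the matrix and the rate equation above can be written out in R for a hypothetical two-step pathway; the species and reaction names are invented for the example.

# Hypothetical pathway:  v1: -> A,   v2: A -> B,   v3: B ->
# Rows are species, columns are reactions, entries are stoichiometric coefficients.
N <- rbind(A = c(1, -1,  0),
           B = c(0,  1, -1))

v <- c(v1 = 2, v2 = 2, v3 = 1)   # an arbitrary flux vector

dSdt <- N %*% v                  # dS/dt = N * v
dSdt                             # A: 0, B: 1 (B accumulates because v2 > v3)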

2. Applications

Many applications of the SM assume a steady state, in which the concentration of each species is constant over time, so that N*v = 0. Using the SM it is then possible to enumerate all possible steady state flux solutions of a given network.
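A minimal sketch of this idea (again not from the original post): at steady state N*v = 0, so the admissible flux vectors lie in the null space of N, which can be computed, for example, via the singular value decomposition. The branched network below is invented for illustration.

# Invented branched network: v1: -> A, v2: A -> B, v3: A -> C, v4: B ->, v5: C ->
N <- rbind(A = c(1, -1, -1,  0,  0),
           B = c(0,  1,  0, -1,  0),
           C = c(0,  0,  1,  0, -1))
s <- svd(N, nv = ncol(N))                   # singular value decomposition of N
r <- sum(s$d > 1e-10)                       # rank of N
K <- s$v[, (r + 1):ncol(N), drop = FALSE]   # basis of the null space (two columns here)
round(N %*% K, 10)                          # all zeros: each column is a steady state flux mode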

Personally, I like the fact that the SM is a crossroads of mathematics and biology, equally making sense for a person with a background in biology or information technology or mathematics.

2.1. Network reconstruction

The whole set of reactions encoded in the genome may be represented as an SM. If the genes that encode enzymes and the reactions that each enzyme carries out are listed, the resulting table can be converted into the SM.

2.2. Mass conservation analysis

The SM contains all information about the reaction network, and therefore all the data necessary to analyse mass conservation. Conservation relations appear in the SM as rows that are linear combinations of other rows. Removing all such linearly dependent rows gives the reduced matrix, which is used by software packages such as COPASI.
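A small sketch (not from the original post) of how such relations can be found numerically: a conservation relation is a vector g with g*N = 0, i.e. an element of the left null space of N. The enzyme-substrate cycle below is invented for the example.

# Invented enzyme cycle: v1: E + S -> ES, v2: ES -> E + S, v3: ES -> E + P
N <- rbind(E  = c(-1,  1,  1),
           S  = c(-1,  1,  0),
           ES = c( 1, -1, -1),
           P  = c( 0,  0,  1))
m <- nrow(N)
r <- qr(N)$rank                             # rank of N (2 here)
m - r                                       # number of conservation relations (2 here)
s <- svd(t(N), nv = m)
cons <- t(s$v[, (r + 1):m, drop = FALSE])   # rows g satisfy g %*% N = 0
round(cons %*% N, 10)                       # all zeros; the rows span the same space as E+ES and S+ES+P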

2.3. Stoichiometric modelling

In stoichiometric modelling there are three major approaches: metabolic flux analysis (MFA), flux balance analysis (FBA) and metabolic pathway analysis (MPA). All three work by defining a high-dimensional solution space of possible metabolic flux distributions based on the SM, which specifies the conservation relationships of the system. The difference between the three approaches lies in how metabolic flux distributions are selected from this solution space.

MFA is a traditional approach which relies on extensive experimental data and computes a metabolic flux vector for a particular condition. Experimental data is used to simplify the SM.

FBA selects a flux distribution that optimises an objective function, such as biomass or product formation, subject to the steady state constraint. It identifies only one optimal solution while alternative optimal solutions may exist, and its results depend strongly on the validity of the model and of the chosen objective.
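A minimal FBA-style sketch (not from the original post), assuming the lpSolve package: maximise one flux subject to N*v = 0 and a simple capacity bound, for an invented three-reaction network.

# Toy network: v1: -> A, v2: A -> B, v3: B ->.  Maximise v3 subject to
# N v = 0 (steady state), v1 <= 10 (limited uptake) and v >= 0.
library(lpSolve)
N <- rbind(A = c(1, -1,  0),
           B = c(0,  1, -1))
obj       <- c(0, 0, 1)                # objective: maximise v3
const.mat <- rbind(N, c(1, 0, 0))      # steady state rows plus the uptake constraint
const.dir <- c("=", "=", "<=")
const.rhs <- c(0, 0, 10)
sol <- lp("max", obj, const.mat, const.dir, const.rhs)
sol$solution                           # optimal flux distribution: 10, 10, 10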

MPA, unlike the other two methods, can identify all metabolic flux vectors in a network. A finite set of solutions is achieved by additional constraints on the flux space.

References

Smolke C, The Metabolic Pathway Engineering Handbook: Fundamentals, CRC Press, 2009

Trinh T, Wlaschin A, Srienc F, Elementary Mode Analysis: A Useful Metabolic Pathway Analysis Tool for Characterizing Cellular Metabolism, Appl Microbiol Biotechnol, 2009, 81(5), pp 813-826

Wang X, Chen J, Quinn P, Reprogramming Microbial Metabolic Pathways, Springer, 2012


Tuesday, November 27, 2012

COPASI and CellDesigner: Comparison

COPASI and CellDesigner are software packages for modelling and simulation of biochemical networks.

1. Common features

Models in both packages are based on compartments, species and reactions. Both packages have an area where the model tree is displayed and an area for entering and editing model parameters.

2. Differences

2.1 CellDesigner

Networks are drawn based on the process diagram notation and are stored using SBML. CellDesigner supports a rich set of graphical elements for various compartments, species and reactions. There are predefined shapes for certain species, such as ‘truncated protein’ or ‘simple molecule’, and for certain reactions, such as ‘transport’ or ‘dissociation’. Elements are added to the model by selecting them from a toolbar. Visual representation of the model is the strong advantage of CellDesigner. Figure 1 shows a model of MAPK-42 opened with CellDesigner. CellDesigner allows kinetic laws to be entered by hand, but it does not have predefined rate laws, so they have to be entered manually for each reaction.

Figure 1 - CellDesigner with MAPK-42 model

2.2 COPASI

COPASI is less rich visually, but has advanced functionality for model simulation and analysis. Species are added to the model simply by typing their attributes into a new line in the species list. Reactions can also be added directly by typing the chemical equation into a cell in the reactions table. Unlike CellDesigner, COPASI has a number of predefined kinetic laws, can switch between stochastic and deterministic simulation methods and supports a number of methodologies for model analysis – computation of steady states, the stoichiometric matrix, parameter estimation and optimisation. An important feature is the mathematical view, where the model is described as a set of differential equations. Figure 2 shows the same MAPK-42 model opened with COPASI, with the differential equations displayed.

Figure 2 - COPASI with MAPK-42 model

3. SBML support

Both packages can read and write SBML, a file format for exchanging systems biology models. CellDesigner stores its visual layout information in SBML annotations, but also has an option of exporting ‘pure’ SBML in case the model needs to be transferred to a software package that does not support these annotations. COPASI converts its native model structure to SBML on export. Notes and annotations are preserved during import and re-export as long as the objects that contain them are not deleted from the model.

4. Conclusion

While the strongest advantage of CellDesigner is the support for model building and editing via rich graphical representation, COPASI appears to be more suitable for simulation and analysis.


Monday, October 29, 2012

Calculation of Phase Space Models with COPASI

COPASI is an open source modelling and simulation package for biochemical networks. It is capable of simulation based on ODEs and stochastic kinetics, and models can include discrete events.

A phase space diagram is a type of plot used to describe models that have complex dynamics. The axes represent the variables of the system, and each point in this space represents a state of the system. A trajectory in the phase space is a line of points that describes the evolution of the state of the system; time is not explicitly represented in such plots, even though it is implicit.

If the model has two variables, plotting them against each other is an obvious task. What if the model has three variables, like a model of calcium oscillations? At this time, COPASI can only display 2D plots. One solution is to display 2D projections of the phase space. For a model with 3 variables there are 3 possible projections: a vs b, a vs c and b vs c.
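As an aside (not part of the original post), the same kind of 2D projections can be produced directly in R with the deSolve package; the Lorenz system below simply stands in for any model with three variables.

# Three 2D projections of the phase space of a three-variable ODE model.
library(deSolve)
lorenz <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dX <- sigma * (Y - X)
    dY <- X * (rho - Z) - Y
    dZ <- X * Y - beta * Z
    list(c(dX, dY, dZ))
  })
}
parms <- c(sigma = 10, rho = 28, beta = 8/3)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 50, by = 0.01)
out <- ode(y = state, times = times, func = lorenz, parms = parms)
par(mfrow = c(1, 3))
plot(out[, "X"], out[, "Y"], type = "l", xlab = "X", ylab = "Y")
plot(out[, "X"], out[, "Z"], type = "l", xlab = "X", ylab = "Z")
plot(out[, "Y"], out[, "Z"], type = "l", xlab = "Y", ylab = "Z")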

To create a phase space diagram, first download a model; in this case it is described in SBML. COPASI has an option for importing SBML: File -> Import SBML. COPASI may generate some warnings. In the left panel one can see the 3 species and 8 reactions included in the model.

To define the plot, locate Output Specifications -> Plots in the left panel. In the Plots panel on the right, click New, and a plot with a default name is created.

Double-click the plot in the panel. Now you can give it a meaningful name, and then click New curve. This is where you select the axes of the plot. In this diagram, transient concentrations of species will be plotted. Select a for the X axis and b for the Y axis and click OK.

Now a 2D curve appears in the right panel. Parameters can be edited to make them more intuitively understandable. Then, click Commit and select Tasks -> Time Course in the left panel. In the Time Course panel, click Run.

The result is the phase space diagram of the transient concentrations of species a vs b. Other plots are created in a similar way.

References

COPASI website
Kummer et al., Model of calcium oscillations

Thursday, September 13, 2012

Markov Chain Monte Carlo Bayesian Inference

1. Bayesian Inference

Bayes' rule is a mathematical rule that describes how the estimate of a probability changes as new evidence is collected. It relates the odds of event A1 to event A2 before and after conditioning on event B (1).
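Equation (1) is not reproduced in this text version; in the odds form described above, Bayes' rule reads

\[ \frac{P(A_1 \mid B)}{P(A_2 \mid B)} \;=\; \frac{P(B \mid A_1)}{P(B \mid A_2)} \cdot \frac{P(A_1)}{P(A_2)} \]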

As an example, assume that there is a bag containing black and white balls, but the proportion of the colours is unknown. As more and more balls are drawn from the bag, the probability of the next ball being white (or black) can be estimated with higher precision. In this case, A1 is the event that a white ball is drawn, A2 that a black ball is drawn, and B is the evidence from the balls already drawn, such as “of 10 balls, 4 were black and 6 were white”.

The frequentist approach to Bayes’ rule is to consider the probabilities of A1 and A2 as fixed (though unknown) quantities, while in the Bayesian approach they are treated as random variables. The Bayesian approach also uses a prior probability, or simply the prior, which expresses the uncertainty before the data is taken into account. This makes it possible to introduce subjective knowledge of the problem into the model in the form of prior information. Among the drawbacks of the approach is the need to evaluate multi-dimensional integrals numerically, which becomes harder as the number of parameters in the system increases. The introduction of conjugate priors is one attempt to address this: a conjugate prior is a prior distribution chosen so that the prior and the posterior belong to the same family of distributions. Markov chains are useful here because, under certain conditions, a memoryless process can be constructed that has the same limiting distribution as the random variable the calculation is trying to simulate.

2. Markov Chains

A Markov chain is a process which can be in a number of states. The process starts in one of the states and moves successively from one to another; each move is called a step. At each step the process has some probability of moving to another state, and this probability does not depend on any of the previous states of the process. That is why such a process is called “memoryless”.

Example: in a weather model, a day can be either nice, rainy or snowy. If the day is nice, the next day will be either rainy or snowy, each with 50% probability. If the day is rainy or snowy, it will stay the same with 50% probability, or will change to one of the other two possibilities with 25% probability each. This is a Markov chain because the process determines its next state based on the current state only, without any prior information. The probabilities can be presented in the transition matrix (2).
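The matrix itself is not reproduced in this text version; from the description above, with the states ordered rainy (R), nice (N), snowy (S), it is

\[ P = \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{pmatrix} \]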

The first row presents the probabilities of the next day’s weather if today is rainy (R): ½ that it will remain rainy, ¼ that it will be nice and ¼ that it will be snowy. The sum of probabilities across the row must be equal to 1.

3. Monte Carlo Methods

Markov chain Monte Carlo (MCMC) methods are computational algorithms which draw random samples from a process. The aim of these methods is to estimate something that is too difficult or time consuming to find deterministically. Using the example above, assume that the probabilities in the matrix are not known. An MCMC simulation could generate a thousand rainy days and count the number of snowy, rainy and nice days that follow. The same could be repeated for snowy and nice days.
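A small R sketch (not in the original post) of this idea: simulate the chain for many steps and estimate the transition probabilities from the observed transition frequencies.

# Simulate the weather chain and estimate the transition matrix from the data.
set.seed(1)
states <- c("R", "N", "S")
P <- matrix(c(1/2, 1/4, 1/4,
              1/2, 0,   1/2,
              1/4, 1/4, 1/2),
            nrow = 3, byrow = TRUE, dimnames = list(states, states))
n <- 10000
chain <- character(n)
chain[1] <- "R"
for (t in 2:n) chain[t] <- sample(states, 1, prob = P[chain[t - 1], ])
round(prop.table(table(chain[-n], chain[-1]), margin = 1), 3)   # estimated matrix, close to P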

The original Monte Carlo approach was developed to use random number generation to compute integrals. If a complex integral can be decomposed into a function f(x) and a probability density function p(x), then the integral can be expressed as an expectation of f(x) over the density p(x) (3)

Now, if we draw a large number of random variables x1, ..., xn from the density p(x), then we end up with (4), which is referred to as Monte Carlo integration.
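Equations (3) and (4) are not reproduced here; written out, the decomposition and the resulting Monte Carlo estimate are

\[ \int f(x)\, p(x)\, dx \;=\; E_{p(x)}\!\left[ f(x) \right] \;\approx\; \frac{1}{n} \sum_{i=1}^{n} f(x_i) \]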

3.1. Markov Chain Convergence

It can be shown that if P is the transition matrix of a Markov chain and the element pij is the probability of transitioning from state i to state j, then the element pij(n) of the matrix Pn gives the probability that a chain starting in state si will be in state sj after n steps. Using the transition matrix (2), the result of multiplying the matrix by itself six times is shown in (5). Such a Markov chain is called regular (not all Markov chains are regular), and the long-range predictions for this chain are independent of the starting state.

Continuing the example, assume that the initial probabilities for the weather conditions on the starting day of the chain are equal. This will be called a probability vector, which in the example will be as shown in (6).

It can be shown that the probability of the chain being in state si after n steps is the ith entry of the vector u·Pn shown in (7), where u is the initial probability vector.

If, as in the example, the chain is regular, it will converge to a stationary distribution, in which the probability of the chain being in state i is independent of the initial distribution. In the weather example, the stationary distribution is shown in (8).
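A small R check (not in the original post) of this convergence: raising the transition matrix to a high power, or applying the equal-probability starting vector to it, gives rows close to the stationary distribution (0.4, 0.2, 0.4).

# Powers of the weather transition matrix converge to the stationary distribution.
states <- c("R", "N", "S")
P <- matrix(c(1/2, 1/4, 1/4,
              1/2, 0,   1/2,
              1/4, 1/4, 1/2),
            nrow = 3, byrow = TRUE, dimnames = list(states, states))
Pn <- diag(3)
for (i in 1:20) Pn <- Pn %*% P   # P^20
round(Pn, 4)                     # every row is approximately (0.4, 0.2, 0.4)
u <- c(1/3, 1/3, 1/3)            # equal initial probabilities, as in (6)
round(u %*% Pn, 4)               # approximately the stationary distribution in (8)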

3.2. Ergodicity and Reversibility

If a process that starts from state i is certain to eventually re-enter state i, such a state is called recurrent. Additionally, a recurrent state is called nonnull if the expected number of steps before the process re-enters state i is finite. A process is called aperiodic if the greatest common divisor of the possible return times to any state is 1; for example, if the chain can re-enter a state only after 2, 4, 6 steps and so on, the chain is not aperiodic. A process is called irreducible if any state can be reached from any other state in a finite number of moves. A chain that is recurrent nonnull, aperiodic and irreducible is called ergodic, and the ergodic theorem states that a unique limiting distribution exists for such a chain and that this distribution is independent of the initial distribution.

A Markov chain is reversible if the detailed balance condition (9), ui·pij = uj·pji, is satisfied for all i, j, where u is the stationary distribution and pij are the elements of the transition matrix.

Ergodicity and reversibility are the two principles behind the Metropolis-Hastings algorithms.

3.3. Metropolis-Hastings algorithms

One of the problems in applying Monte Carlo integration is obtaining samples from some complex probability distribution; mathematical physicists first attacked this problem when trying to integrate complex functions by random sampling. Metropolis showed how to construct the desired transition matrix and Hastings later generalised the algorithm. The essential idea of the algorithm is to construct a reversible Markov chain whose stationary distribution is the target distribution π. It starts by generating a candidate j from a proposal transition matrix Q, which proposes a move from the current state i to the candidate j. The proposal j is then accepted with some probability αij; if it is rejected, the chain retains the current value. The summary of the algorithm is as follows:

  • Start with an arbitrary X0. Then, at each iteration t = 1, ..., N:
  • Sample a candidate j ~ qij, where Q = {qij} is the proposal transition matrix.
  • Generate u – a uniformly distributed random number between 0 and 1.
  • Compute the acceptance probability αij = min(1, πj·qji / (πi·qij)) (9.1); if u <= αij, set Xt+1 = j, else set Xt+1 = i.

Such an algorithm randomly attempts to move around the sample space, sometimes accepting the transitions and sometimes remaining in the same state. The acceptance ratio indicates how probable the proposed sample is with respect to the current sample. If the proposed state is more probable than the current one, the transition is always accepted; if it is less probable, it will occasionally be accepted but more often rejected. Intuitively, this is why the algorithm works: it tends to stay away from low-density regions and mostly remains in high-density regions.

A key issue in the successful implementation of the algorithm is the number of steps until the chain approaches stationarity, which is called the burn-in period. Typically the first 1,000 to 5,000 samples are discarded, and a test is then used to assess whether stationarity has been reached.
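A minimal random-walk Metropolis sketch in R (not from the original post), targeting a standard normal density; because the Gaussian proposal is symmetric, qij = qji and the acceptance probability reduces to min(1, π(j)/π(i)).

# Random-walk Metropolis targeting the standard normal distribution.
set.seed(1)
target <- function(x) dnorm(x)      # an un-normalised density would work equally well
n    <- 10000
x    <- numeric(n)
x[1] <- 0                           # arbitrary X0
for (t in 1:(n - 1)) {
  proposal <- rnorm(1, mean = x[t], sd = 1)        # sample j ~ q_ij
  alpha    <- min(1, target(proposal) / target(x[t]))
  u        <- runif(1)                             # uniform(0, 1)
  x[t + 1] <- if (u <= alpha) proposal else x[t]   # accept, or stay at the current state
}
burnin <- 1000                      # discard the burn-in period
hist(x[-(1:burnin)], breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE, col = "blue")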

3.4. Gibbs Sampling

Gibbs sampling is a special case of Metropolis-Hastings sampling in which the proposed value is always accepted. The key is that only full conditional distributions are sampled, i.e. distributions in which all random variables but one are held at their current values. The benefit is that such conditional distributions are much easier to simulate than the complex joint distribution. Instead of generating a single n-dimensional vector in one pass, the n random variables are simulated sequentially from the n conditionals.
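A minimal Gibbs sampling sketch in R (not from the original post) for a bivariate normal distribution with correlation rho, where each full conditional is itself a univariate normal and can be sampled directly.

# Gibbs sampler for a standard bivariate normal with correlation rho.
set.seed(1)
rho <- 0.8
n   <- 5000
x <- y <- numeric(n)
for (t in 2:n) {
  x[t] <- rnorm(1, mean = rho * y[t - 1], sd = sqrt(1 - rho^2))   # draw x | y
  y[t] <- rnorm(1, mean = rho * x[t],     sd = sqrt(1 - rho^2))   # draw y | x
}
cor(x[-(1:500)], y[-(1:500)])   # close to rho after discarding the burn-in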

A Gibbs stopper is an approach to convergence based on comparing the Gibbs sampler and the target distribution. Weights are plotted as a function of the sampler iteration number. As the sampler approaches stationarity, the distribution of weights is expected to spike.

4. Example

4.1. Initial Data

Table 1 contains the field goal shooting record of LeBron James for the seasons 2003/4 to 2009/10. The goal of the simulation was to use OpenBUGS to carry out an MCMC Bayesian simulation and produce a posterior distribution for the probability of the player’s shooting success.

Table 1 – LeBron James Goal Shooting Record

4.2. The Model

The model for the success probabilities of LeBron James is given by (10)

where k = 2, ..., 7, corresponding to the seasons from 2004/5 to 2009/10, and π1 is estimated separately as the success probability for the 2003/4 season. Rk is the performance of season k relative to the previous season. A beta prior is placed on the success probability for the first season and gamma priors on the relative performances Rk.

Likelihood for this model is given by (11).
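Equations (10) and (11) are not reproduced in this text version; consistent with the OpenBUGS code below, the model and the binomial likelihood can be written as

\[ \pi_k = \pi_{k-1} R_k, \quad k = 2, \ldots, 7 \]

\[ L(\pi \mid y) \;=\; \prod_{k=1}^{7} \binom{N_k}{y_k} \, \pi_k^{\,y_k} (1 - \pi_k)^{\,N_k - y_k} \]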

The likelihood is written in OpenBUGS as

for (k in 1:YEARS){ y[k]  ~ dbin( pi[k], N[k] ) }

while the model for success probabilities is expressed as follows:

for (k in 2:YEARS){ pi[k]<-pi[k-1]*R[k] }

The data for the model specifies the values of the successes and total attempts (yi and Ni) for each season. Initial values for the model must be specified for π1 and Rk as follows:

list(pi=c(0.5, NA, NA, NA, NA, NA, NA), R=c(NA, 1,1,1,1,1,1))

No initial values were specified for π2 ... π7 since they are deterministic and for R1 since it is constant. Actual initial data used for the model is attached as separate initX.txt files, where X is the number of the chain.

4.3. One Chain Calculation

The initial calculation was run with 1000 samples burn-in and 2000 sample points, for one chain. The posterior summaries were presented in Figure 1

Figure 1 – Posterior summaries for one chain calculation

From these results, the expected success rate for LeBron James was 42.24% for the 2003/4 season, increased by a factor of about 1.1 for the 2004/5 season, then stayed roughly at that level for the next two seasons (posterior means R3 = 1.036 and R4 = 0.9863), and then increased steadily over the next three seasons.

Since Rk indicates the relative performance of a season in comparison to the previous one, it can be reported that the player steadily increased his performance over the duration of the seven seasons, with one especially significant increase of 10.5% early in his career.

Further, posterior distributions of the relative performance vector R were analysed by using the Inference -> Compare function and producing the boxplot, which is presented in Figure 2.

Figure 2 – The boxplot of R

The limits of boxes represent the posterior quartiles and the middle bar – the posterior mean. The endings of the whisker lines are 2.5% and 97.5% posterior percentiles.

Posterior density can be estimated by plotting the success rate density for a season. Figure 3 represents the posterior density for each season (πi)

Figure 3 – Posterior density for each season

Checking convergence is not a straightforward task. While it is usually easy to say when the convergence definitely was not reached, it is difficult to conclusively say if the chain has converged. Convergence can be checked by using the trace option from Inference -> Samples menu. Figure 4 represents the history plots for πi.

Figure 4 – History Plots for πi, one chain

It is clear that π7 has converged, while the result for the other seasons is not so obvious. A multiple chain calculation with a higher number of iterations may be helpful. Another reason to use a higher number of iterations is to reduce the Monte Carlo error when it is too high, since it decreases as the number of iterations grows. In fact, the value of the Monte Carlo error can be used to decide when to stop the simulation.

4.4. Multiple chain calculation

The main difference between the single and multiple chain calculations is the fact that a separate set of initial values is required for each chain. The following initial values were used for the three chains:

list(pi=c(0.5, NA, NA, NA, NA, NA, NA), R=c(NA, 1,1,1,1,1,1))
list(pi=c(0.9, NA, NA, NA, NA, NA, NA), R=c(NA, 1,1,1,1,1,1))
list(pi=c(0.1, NA, NA, NA, NA, NA, NA), R=c(NA, 1,1,1,1,1,1))

The posterior statistics for this calculation were presented in Figure 5.

Figure 5 – Posterior summaries for multiple chain calculation

The trace plots now contain one line for each chain. The history plots for πi for this calculation were presented in Figure 6.

Figure 6 – History Plots for πi, multiple chains

References

Ntzoufras I, Bayesian Modeling Using WinBUGS, 2009, Wiley, England

Reich B, Hodges J, Carlin B, Reich A, A Spatial Analysis of Basketball Shot Chart Data, 2005, NSCU Statistics

Walsh B, Markov Chain Monte Carlo and Gibbs Sampling, 2004, MIT Lecture Notes

Mathematics for Metabolic Modelling - Course Materials 2012, University of Manchester, UK


Saturday, August 4, 2012

Statistical Inference

1. Introduction

Inferential statistics give the basis to draw conclusions (or inferences) from data and to evaluate the confidence that these conclusions are true. The need for inference is dictated by the fact that it is usually impossible to measure the parameter of interest in the whole population. The practical alternative is to select a sample from the population and make the measurements; a conclusion may then be derived about the population as a whole. However, due to random chance, a sample may not represent the population well (for example, most people in the sample may happen to have higher than average resistance to the disease of interest), so the probability of such a coincidence has to be estimated as well.

The inferential statistics are generally considered a separate topic from descriptive statistics, which is mostly about the presentation of factual data.

2. Categories of Statistical Inference

2.1 Estimation

Population parameters can be estimated by measuring the parameters of a sample and then calculating, for example, the mean and variance. These values cannot tell what exactly the population parameters are: when a different sample is chosen, the values will differ slightly. Therefore, it is only possible to say that there is a certain confidence that the population parameters lie within some range. Generally, the accepted confidence level for scientific studies is 95%.

2.2 Hypothesis testing

Hypothesis testing is used when there is a need to decide whether some population values can be ruled out based on the result of sampling. A hypothesis is a statement about the population, and hypothesis testing determines whether there is enough evidence in the data to decide that it is false. What is tested is called a null hypothesis, for example that a drug has no effect on the recovery from a disease. The probability of obtaining data at least as extreme as that observed, assuming the null hypothesis is true, is then calculated; this is called the p-value. The smaller the p-value, the stronger the evidence against the null hypothesis.

2.3 Decision theory

Decision theory attempts to combine the sample information with the other relevant aspects of the problem in order to make the best decision. Relevant information may include the consequences of a decision and prior information that is available from sources other than the sample data. For example, a drug company may consider the decision about marketing a new painkiller. The ultimate decision will be whether to make the drug or not, including estimation of the potential price, demand and profits.

3. Frequentist vs Bayesians

The basic definition of probability divides the statistics discipline into two distinct camps: frequentists and bayesians. A simple statement, such as "the probability that the coin lands heads is 1/2" can be viewed in very different ways. The frequentist will consider the following: if the experiment is repeated many times, the ratio of heads and tails will approach 1:1. So the probability is the long-run expected frequency. The bayesian will consider probability to be a measure of one's belief in the outcome.

In other words, the frequentist does not know which way the world is and, since the data is finite, cannot tell exactly which way it is; so he lists the possibilities and sees which ones are ruled out by the data. The Bayesian also does not know which way the world is; he collects the finite data and compares how probable the different states of the world are given this data.

In some contexts frequentists and Bayesians can coexist, but in others they are deeply in conflict. For example, to assign a numerical probability to the existence of life on Mars, the frequentist approach has to be dropped, because from the frequentist point of view probability is something that is determined in a lengthy series of trials.

This essay is written in terms of the frequentist approach.

4. Likelihood and Maximum Likelihood

Likelihood and probability are used as synonyms in everyday language, but from a statistical point of view they are very different concepts. For example, a question about probability could be “what is the probability of getting two heads in a row?” There is a parameter (P(heads)) and we estimate the probability of a sample given this parameter. A question about likelihood goes in the reverse direction: is the coin fair? To what extent does the sample support the hypothesis that P(heads) = 0.5? We start with the observation and make an inference about the parameter. In this example, after observing two heads in a row the maximum likelihood estimate of P(heads) is 1, since nothing in the data yet supports the “fair” hypothesis, while the probability of two heads in a row under a fair coin is 0.5 x 0.5 = 0.25.

The likelihood function is defined as (1)

L(p | x1, ..., xn) = f(x1; p) x ... x f(xn; p),

and is the joint probability of observing the data, viewed as a function of the parameter. The maximum likelihood estimate of a parameter is the value that maximises the probability of observing the data given a specific model for the data.

Continuing the example with coin tossing, the model that describes the probability of observing heads in a coin flip is called the Bernoulli distribution. It has the form (2)

f(xi; p) = p^xi * (1 - p)^(1 - xi),

where p is the probability of heads and xi (0 or 1) is the outcome of the ith coin flip.

If the likelihood function is applied to the Bernoulli distribution, the likelihood function becomes (3)

L(p) = p^(Σxi) * (1 - p)^(n - Σxi).

To simplify the equation, the logarithm of both sides can be taken, which does not change the value of p for which the function is maximised (4):

ln L(p) = Σxi * ln(p) + (n - Σxi) * ln(1 - p).

Figure 1 shows the plots of the likelihood as a function of the probability p. As expected, the higher the proportion of one of the two outcomes, the more likely it is that the coin is not fair.

Figure 1 - Plots of likelihood as function of probability
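A small R illustration (not part of the original post) of such a plot, assuming a hypothetical sample of 7 heads in 10 flips; the likelihood peaks at the observed proportion of heads.

# Likelihood of the Bernoulli model as a function of p for 7 heads in 10 flips.
heads <- 7; n <- 10
p <- seq(0.01, 0.99, by = 0.01)
loglik <- heads * log(p) + (n - heads) * log(1 - p)   # the log-likelihood from (4)
plot(p, exp(loglik), type = "l", xlab = "p", ylab = "likelihood")
abline(v = heads / n, col = "blue")                   # maximum at p = 0.7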

The maximum likelihood estimate described above can also be determined analytically by taking the derivative of the likelihood function with respect to p and finding where the slope is zero. The maximum is found at (5)

p̂ = Σxi / n,

which is just the proportion of heads observed in the experiment.

5. Regression

The simplest biological models are based on the concept of correlation. Correlation is observed when a change in one variable is observed together with the change in one or more other variables. For example, when the amount of calories consumed daily changes, other variables may change such as body weight or body fat percentage. Correlation is a measure of the relationship between these variables. It is important to keep in mind that a change in one of the variables does not necessarily cause the change in another. Both changes may, for example, be caused by some other, yet unknown, variable. For example, the length of the foot and the length of the palm are likely to correlate - people with large feet generally have large palms. But that does not mean that having large feet is caused by large palms, or the other way around. It is more likely that the overall body height is what causes the change in both variables.

Regression is a term that includes a number of different techniques that analyse the variables that are correlated and attempt to explore the relationship between those variables and the strength of correlation. The results may be used in developing models for prediction and forecasting.

For the regression analysis to be valid, some assumptions have to be satisfied:

  • the sample is representative of the whole population
  • the errors are random variables, independent of each other
  • the errors are normally distributed
  • the errors are not correlated
  • the predictors, if there are more than one, are linearly independent

In the following discussion of linear and non-linear regression, the data gathered by Kuzmic et al. (1996) is used. The data was gathered while studying aspartic proteinase catalysed hydrolysis of a fluorogenic peptide. The data is presented in Table 1.

[S], μM    V0
0.66       0.0611
1          0.0870
1.5        0.1148
3          0.1407
4          0.1685
6          0.1796

Table 1. Experimental data

5.1 Linear Regression

The simplest form of regression is linear regression. It assumes that the correlation between variables can be described by the equation y = ax + b, where x and y are the values of the variables; in this case the relationship between the variables can be plotted as a straight line. If the prediction about the linear nature of the correlation is correct, and if the line is chosen well, the plotted values of the variables will lie close to the "ideal" regression line. The problems here are, first, to define the "closeness" of a data point to the predicted line and, second, to formulate a way to choose the line that is closest to all plotted values, because there is an infinite number of possible lines and visual estimation is not nearly precise enough.

The “closeness” is defined as the square of the vertical distance from the point to the regression line. The signed distance cannot be used because, for any slope, it is possible to find a line for which the positive and negative distances from the points cancel each other out, so it does not give a unique solution. If the squared distance is used, however, the best line will pass midway between two points, because for a fixed total distance the value of a² + b² is smallest when a = b.

Now that the total squared error, which is the sum of the squared distances from all points, is defined, it is possible to decide which of two candidate lines gives the better fit to the data. The final task is to derive a method that finds the line with the best fit.

The least squares estimates of slope and intercept are obtained as follows:

slope = Sxy / Sxx

and

intercept = ȳ − slope·x̄

where Sxy = Σ(xi − x̄)(yi − ȳ) is the sum of the products of the x and y deviations, and Sxx = Σ(xi − x̄)² is the sum of the squared x deviations. The resulting values of a (slope) and b (intercept) are called the least squares estimates of slope and intercept.

As an example, it is possible to apply linear regression to the data from Table 1. This makes little practical sense, since it is known beforehand that the relationship is not linear, and is done only for illustration.
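A small check (not part of the original post): the slope and intercept used in the script below can be computed directly from Sxy and Sxx and compared with R's built-in lm() fit.

# Least squares estimates from the deviation sums, using the data from Table 1.
conc  <- c(0.66, 1, 1.5, 3, 4, 6)
rate1 <- c(0.0611, 0.0870, 0.1148, 0.1407, 0.1685, 0.1796)
Sxy <- sum((conc - mean(conc)) * (rate1 - mean(rate1)))
Sxx <- sum((conc - mean(conc))^2)
slope     <- Sxy / Sxx
intercept <- mean(rate1) - slope * mean(conc)
c(intercept = intercept, slope = slope)   # approximately 0.0683 and 0.0212
coef(lm(rate1 ~ conc))                    # the same values from lm()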

The data and the regression line can be plotted using the following R script

conc<-c(0.66,1,1.5,3,4,6)
rate1<-c(0.0611,0.0870,0.1148,0.1407,0.1685,0.1796)
plot(conc,rate1,lwd=2,col="red",xlab="[S], micromoles",ylab="V0")
par(new=TRUE)
abline(a=0.0683,b=0.0212,col="blue")
text(3, 0.12, "regression line", col="blue")

The results were presented in Figure 2.

Figure 2 - Linear regression against data from Table 1.

A null hypothesis can be examined which states that V0 and [S] have no correlation. To test this, Excel regression analysis was applied to determine the values of the standard error and the p-values. The result was presented in Figure 2.

Using the Excel function TINV(5%, 4), where 4 is the number of degrees of freedom and 5% corresponds to 95% confidence, the critical value of the t statistic is 2.776. This is less than the observed t value of 5.428, which allows the null hypothesis to be rejected and the conclusion that V0 correlates with [S]. This does not mean, of course, that linear regression describes the relationship in the best way – there may exist a non-linear function that gives a better fit.

Another parameter describing the linear correlation is the correlation coefficient. It is a measure of the strength of the relationship between two variables and is defined as

r = Sxy / √(Sxx·Syy)

where Syy = Σ(yi − ȳ)² is the sum of the squared y deviations. The value of r is a number between -1 and 1, and the closer r is to 1 (or -1), the stronger the correlation, meaning that the values lie closer to the regression line. For the sample data the value r = 0.9426 indicates that the correlation is very strong and that linear regression produces a line that fits the data well. The lesson here is that it is easy to mistake a non-linear relationship for a linear one, especially if there is no prior information about the nature of the relationship and the values are only measured over a limited interval.

5.2 Non-Linear Regression

Linear regression is a fairly simple technique and gives good results when a change in one variable has a constant influence on another. However, in biological and many other processes this is not always the case. As a simple example, as the length of a body increases, the mass increases too; but a species whose body length doubles will not simply double its mass – the increase in mass is likely to be greater than a factor of two. To determine the nature of the relationship between variables when linear dependence does not apply, non-linear regression methods are used. Generally, there is no analytical solution for the non-linear least squares problem.

In some cases, a non-linear function can be transformed into a linear form. Consider the non-linear model of enzyme kinetics, the Michaelis-Menten equation (11):

V0 = Vmax·[S] / (Km + [S])

Taking the reciprocal of both sides transforms the model as follows:

1/V0 = (Km/Vmax)·(1/[S]) + 1/Vmax

Considering that Km and Vmax are constants, the equation is now in the linear form y = ax + b, where y = 1/V0, x = 1/[S], a = Km/Vmax and b = 1/Vmax. The graphical representation of this function is known as the Lineweaver-Burk plot and is used for the estimation of the Km and Vmax parameters. It is, however, very error prone, because the y axis is the reciprocal of the reaction rate and small measurement errors are amplified.

When an analytical solution for the non-linear function does not exist, numerical methods are used. The algorithms behind these methods generally require an initial estimate for each parameter of interest, which should be reasonable to make sure the result is correct. In R, the nls function is used to determine nonlinear least-squares estimates of the parameters of a nonlinear model. To apply the function, it is reasonable to first use the Lineweaver-Burk plot to estimate the parameters and then use the results as starting values for nls.

The reciprocal values were produced and the Lineweaver-Burk plot was constructed using the data from Table 1. Next, the least squares regression was applied to the plot of reciprocals to establish the values of Km and Vmax. This was achieved by the following R script

concLB<-1/conc
rate1LB<-1/rate1
fit<-lm(rate1LB~concLB)
fit

The script produced the following output.

Call:
lm(formula = rate1LB ~ concLB)

Coefficients:
(Intercept)       concLB  
      4.051        7.853  

The results were plotted by the following R script and are presented in Figure 3.

plot(concLB,rate1LB,ylim=range(c(rate1LB)),lwd=2,col="red",xlab="1/[S], 1/micromoles",ylab="1/V0")
par(new=TRUE)
abline(a=4.051,b=7.853,col="blue")
text(1, 10, "regression line", col="blue")

Figure 3-Using Lineweaver-Burk plot to estimate Michaelis-Menten parameters

The output was used to calculate the estimates for Km and Vmax: the fitted intercept equals 1/Vmax and the fitted slope equals Km/Vmax.

1/Vmax = 4.051, so Vmax = 1/4.051 = 0.2469
Km/Vmax = 7.853, so Km = 7.853/4.051 ≈ 1.9384

Now that the reasonable estimates are calculated, it is possible to use the non-linear regression to fit Michaelis-Menten function. This is achieved by the following R script.

df<-data.frame(conc, rate1)   
nlsfit <- nls(rate1~Vm*conc/(K+conc),data=df, start=list(K=1.9384, Vm=0.2469))
summary(nlsfit)

The script provides the following output

Formula: rate1 ~ Vm * conc/(K + conc)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
K    1.6842     0.2141   7.868  0.00141 ** 
Vm   0.2315     0.0112  20.664 3.24e-05 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 0.005917 on 4 degrees of freedom

Number of iterations to convergence: 3 
Achieved convergence tolerance: 2.895e-06

Alternatively, R contains a number of ‘self-starting’ models, which can be used without initial estimates for the parameters. The Michaelis-Menten model is represented by the SSmicmen function and can be used as in the following R script

model<-nls(rate1~SSmicmen(conc,a,b))
summary(model)

The script provides the following output

Formula: rate1 ~ SSmicmen(conc, a, b)

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
a   0.2315     0.0112  20.664 3.24e-05 ***
b   1.6842     0.2141   7.868  0.00141 ** 
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 0.005917 on 4 degrees of freedom

Number of iterations to convergence: 0 
Achieved convergence tolerance: 2.637e-06

In this case, the results were identical for the nls function and the SSmicmen function, indicating that both converged to the same least-squares solution and that the initial estimates used in the nls call were reasonable. Lastly, the parameters identified by nonlinear regression can be used to plot the resulting function and compare it with the experimental data.

The following R script will generate 100 points on the curve predicted by the values of Km and Vmax that were identified by the nonlinear least squares method and plot them against the experimental data.

plot(conc,rate1,lwd=2,col="red",xlab="[S], micromoles",ylab="V0")
par(new=TRUE)
Vm <- 0.2315
K <- 1.6842
x <- seq(0, 6, length=100)
y <- (Vm*x)/(K+x)
lines(x, y, lty="dotted", col="blue")
text(3, 0.13, "regression line", col="blue")

The output is presented in Figure 4.

Figure 4-Fitted curve plotted against experimental data.

To measure how well the fitted curve describes the data, a number of tests may be used. One of them is the F-test, which tests the null hypothesis that a predictor (in this case, [S]) has no value in predicting V0. For a regression with p predictors and n observations, F is defined as (13)

F = [ Σ(ŷi − ȳ)² / p ] / [ Σ(yi − ŷi)² / (n − p − 1) ]

where ŷi is the ith predicted value, yi is the ith observed value, and ȳ is the mean of the observed values. Specialised tables exist to obtain the critical values of F for a given sample size and the required confidence level. When the F-test is applied to two models, the test estimates the probability that they have the same variance.

Another test is the Akaike information criterion (AIC). It does not test a null hypothesis and, therefore, does not tell how well the model fits the data; instead, it estimates how much information is lost when the data is represented by the model. In practice it is used to select, from several candidate models, the one that has the highest probability of being correct. If there is more than one model for which the AIC value is the same or very close, the best model may be a weighted sum of those models. AIC is defined as (14)

AIC = 2k − 2·ln(L)

where k is the number of parameters in the model and L is the maximised value of the likelihood function for the model.
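As a quick illustration (not in the original post, and assuming the nlsfit, conc and rate1 objects created in the scripts above), R's built-in AIC() function can be used to compare the Michaelis-Menten fit with a straight-line fit of the same data:

AIC(nlsfit)              # AIC of the Michaelis-Menten (nls) fit
AIC(lm(rate1 ~ conc))    # AIC of the straight-line fit; the lower value is preferred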

References:

Aitken M, Broadhurst B, Hladky S, Mathematics for Biological Scientists 2010, Garland Science, NY and London
Crawley, M, The R Book, Wiley, 2007
Kuzmic P, Peranteau A, Garcia-Echeverria C, Rich D, Mechanical Effects on the Kinetics of the HIV Proteinase Deactivation, Biochemical and Biophysical Research Communications 221, 313-317 (1996)
Ott R, Longnecker M, An Introduction to Statistical Methods and Data Analysis, Brooks/Cole, 2010
Trosset M, An Introduction to Statistical Inference and its Applications With R, Chapman and Hall, 2009
Mathematics for Metabolic Modelling - Course Materials 2012, University of Manchester, UK
