Models of DNA evolution

From Wikipedia, the free encyclopedia

A number of different Markov models of DNA sequence evolution have been proposed. This is because evolutionary processes vary between genomes and between different regions of a genome; for example, different evolutionary processes apply to coding and noncoding regions. These models differ mostly in the parametrization of the rate matrix and in the modeling of rate variation.

DNA Evolution as a Continuous Time Markov Chain

Continuous Time Markov Chains

Continuous-time Markov chains have the usual transition matrices, which are additionally parameterized by time, $t$. Specifically, if $E_1,\ldots,E_4$ are the states, then the transition matrix is

$P(t) = \big(p_{ij}(t)\big)$

where each entry $p_{ij}(t)$ is the probability that state $E_i$ will change to state $E_j$ in time $t$.

Example: We would like to model the substitution process in DNA sequences (e.g. under the Jukes-Cantor or Kimura models) in continuous time. The corresponding transition matrices look like:

$P(t) = \begin{pmatrix} p_{AA}(t) & p_{AG}(t) & p_{AC}(t) & p_{AT}(t) \\ p_{GA}(t) & p_{GG}(t) & p_{GC}(t) & p_{GT}(t) \\ p_{CA}(t) & p_{CG}(t) & p_{CC}(t) & p_{CT}(t) \\ p_{TA}(t) & p_{TG}(t) & p_{TC}(t) & p_{TT}(t) \end{pmatrix}$

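where the top-left and bottom-right $2\times 2$ blocks correspond to transition probabilities (purine to purine or pyrimidine to pyrimidine, with the diagonal entries giving the probability of no net change) and the top-right and bottom-left $2\times 2$ blocks correspond to transversion probabilities.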

Assumption: If at some time $t_0$ the Markov chain is in state $E_i$, then the probability that at time $t_0+t$ it will be in state $E_j$ depends only upon $i$, $j$ and $t$. This allows us to write that probability as $p_{ij}(t)$.

Theorem: Continuous-time transition matrices satisfy:

$P(t+\tau) = P(t)P(\tau)$
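The identity above (commonly known as the Chapman-Kolmogorov equation) can be checked numerically. The following is a minimal sketch, assuming the Jukes-Cantor model with a hypothetical rate `alpha` from each base to each other base, for which the transition probabilities have the closed form $p_{ii}(t) = \tfrac14 + \tfrac34 e^{-4\alpha t}$ and $p_{ij}(t) = \tfrac14 - \tfrac14 e^{-4\alpha t}$ for $i \neq j$. The function names and the values of `alpha`, `t` and `tau` are illustrative choices, not part of the text above.

```python
import math

def jc69_matrix(t, alpha=0.1):
    """Jukes-Cantor transition matrix P(t); alpha is the rate to each other base."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay   # probability of ending in the starting base
    diff = 0.25 - 0.25 * decay   # probability of ending in one specific other base
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

t, tau = 0.7, 1.9                       # arbitrary illustrative times
lhs = jc69_matrix(t + tau)              # P(t + tau)
rhs = matmul(jc69_matrix(t), jc69_matrix(tau))  # P(t) P(tau)
chapman_kolmogorov_gap = max_abs_diff(lhs, rhs)
row_sums = [sum(row) for row in lhs]    # each row is a probability distribution
```

Here `chapman_kolmogorov_gap` should be zero up to floating-point rounding, and each entry of `row_sums` should be 1.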

Deriving the Dynamics of Substitution

Consider a DNA sequence of fixed length $m$ evolving in time by base replacement. Assume that the processes followed by the $m$ sites are Markovian, independent, identically distributed, and constant in time. For a fixed site, let

$\mathbf{P}(t) = (p_A(t),\ p_G(t),\ p_C(t),\ p_T(t))^T$

be the column vector of probabilities of states $A$, $G$, $C$, and $T$ at time $t$. Let

$\mathcal{E} = \{A,\ G, \ C, \ T\}$

be the state-space. For two distinct $x, y \in \mathcal{E}$, let $\mu_{xy}$ be the transition rate from state $x$ to state $y$. Similarly, for any $x$, let:

$\mu_x = \sum_{y\neq x}\mu_{xy}$

The change in the probability $p_A(t)$ over a small increment of time $\Delta t$ is given by:

$p_A(t+\Delta t) = p_A(t) - p_A(t)\mu_A\Delta t + \sum_{x\neq A}p_x(t)\mu_{xA}\Delta t$

In other words (in frequentist language), the frequency of $A$'s at time $t + \Delta t$ is equal to the frequency at time $t$ minus the frequency of the lost $A$'s plus the frequency of the newly created $A$'s.

Similarly for the probabilities $p_G(t)$, $p_C(t)$, and $p_T(t)$. We can write these compactly as:

$\mathbf{P}(t+\Delta t) = \mathbf{P}(t) + Q\mathbf{P}(t)\Delta t$

where,

$Q = \begin{pmatrix} -\mu_A & \mu_{GA} & \mu_{CA} & \mu_{TA} \\ \mu_{AG} & -\mu_G & \mu_{CG} & \mu_{TG} \\ \mu_{AC} & \mu_{GC} & -\mu_C & \mu_{TC} \\ \mu_{AT} & \mu_{GT} & \mu_{CT} & -\mu_T \end{pmatrix}$

or, rearranging and taking the limit $\Delta t \to 0$:

$\mathbf{P}'(t) = Q\mathbf{P}(t)$

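where $Q$ is the rate matrix. Note that, by the definition of $\mu_x$, each column of $Q$ as written here sums to zero (in the transposed convention, where $\mathbf{P}(t)$ is a row vector, it is the rows of the rate matrix that sum to zero).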
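The derivation can be illustrated by iterating the discrete update $\mathbf{P}(t+\Delta t) = \mathbf{P}(t) + Q\mathbf{P}(t)\Delta t$ directly. The sketch below is one possible illustration, not a production integrator: the rates in `mu` are hypothetical values chosen only for the example, and the helper names `rate_matrix` and `euler_step` are assumptions. The matrix is laid out exactly as $Q$ above, so that entry (row $y$, column $x$) holds $\mu_{xy}$.

```python
BASES = ["A", "G", "C", "T"]

# Hypothetical rates mu[(x, y)] from state x to state y (illustrative values only).
mu = {(x, y): 0.1 for x in BASES for y in BASES if x != y}
mu[("A", "G")] = mu[("G", "A")] = 0.3
mu[("C", "T")] = mu[("T", "C")] = 0.3

def rate_matrix(mu):
    """Build Q for the column-vector convention: Q[y][x] = mu_xy, Q[x][x] = -mu_x."""
    n = len(BASES)
    q = [[0.0] * n for _ in range(n)]
    for j, x in enumerate(BASES):
        for i, y in enumerate(BASES):
            if x != y:
                q[i][j] = mu[(x, y)]
        q[j][j] = -sum(mu[(x, y)] for y in BASES if y != x)
    return q

def euler_step(q, p, dt):
    """One step of P(t + dt) = P(t) + Q P(t) dt."""
    n = len(p)
    return [p[i] + dt * sum(q[i][j] * p[j] for j in range(n)) for i in range(n)]

Q = rate_matrix(mu)
# Each column sums to zero, so total probability is conserved by every step.
column_sums = [sum(Q[i][j] for i in range(4)) for j in range(4)]

p = [1.0, 0.0, 0.0, 0.0]  # start in state A with certainty
for _ in range(1000):     # integrate up to t = 10 with dt = 0.01
    p = euler_step(Q, p, dt=0.01)
```

Because the columns of `Q` sum to zero, the entries of `p` continue to sum to 1 after every step, up to rounding. A smaller `dt` brings the iteration closer to the solution of $\mathbf{P}'(t) = Q\mathbf{P}(t)$.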

Ergodicity

If all the transition rates $\mu_{xy}$ are positive, i.e. if all states $x, y \in \mathcal{E}$ communicate, then the Markov chain has a stationary distribution $\mathbf{\Pi} = \{\pi_x,\ x \in \mathcal{E}\}$, where each $\pi_x$ is the proportion of time spent in state $x$ after the Markov chain has run for infinite time, and this probability does not depend upon the initial state of the process. Such a Markov chain is called ergodic. In DNA evolution, under the assumption of a common process for each site, the stationary frequencies $\pi_A, \pi_G, \pi_C, \pi_T$ correspond to equilibrium base compositions.

Definition: A Markov process is stationary if its current distribution is the stationary distribution, i.e. $\mathbf{P}(t) = \Pi$. Thus, by using the differential equation above,

$\frac{d\Pi}{dt} = Q\Pi = 0$
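Both properties, independence from the initial state and $Q\Pi = 0$, can be observed numerically. The sketch below assumes a set of hypothetical, strictly positive rates `mu` (illustrative values only) and approximates the long-run distribution by repeatedly applying a small time step. The helper names are assumptions made for the example, and the matrix uses the column-vector layout of $Q$ given earlier.

```python
BASES = "AGCT"

# Hypothetical, strictly positive rates mu[x][y] from base x to base y.
mu = {
    "A": {"G": 0.30, "C": 0.10, "T": 0.05},
    "G": {"A": 0.20, "C": 0.10, "T": 0.10},
    "C": {"A": 0.05, "G": 0.10, "T": 0.40},
    "T": {"A": 0.10, "G": 0.05, "C": 0.25},
}

def rate_matrix(mu):
    """Column-vector convention: Q[y][x] = mu_xy off the diagonal, Q[x][x] = -mu_x."""
    q = [[0.0] * 4 for _ in range(4)]
    for j, x in enumerate(BASES):
        for i, y in enumerate(BASES):
            q[i][j] = mu[x][y] if x != y else -sum(mu[x].values())
    return q

def apply(q, p):
    """Matrix-vector product Q P."""
    return [sum(q[i][j] * p[j] for j in range(4)) for i in range(4)]

def long_run(q, p, dt=0.01, steps=20000):
    """Iterate P <- P + Q P dt; a fixed point of this update satisfies Q P = 0."""
    for _ in range(steps):
        qp = apply(q, p)
        p = [p[i] + dt * qp[i] for i in range(4)]
    return p

Q = rate_matrix(mu)
# Start from each of the four bases in turn and run to (approximate) equilibrium.
limits = [long_run(Q, [1.0 if i == k else 0.0 for i in range(4)]) for k in range(4)]
pi = limits[0]
residual = max(abs(v) for v in apply(Q, pi))                             # size of Q Pi
spread = max(abs(lim[i] - pi[i]) for lim in limits for i in range(4))    # start dependence
```

With these rates, `residual` and `spread` should both be negligibly small: the four starting states lead to the same limiting vector, and that vector is annihilated by `Q`.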

Time Reversibility

Definition: A stationary Markov process is time reversible if (in the steady state) the amount of change from state $x$ to $y$ is equal to the amount of change from $y$ to $x$ (although the two states may occur with different frequencies). This means that:

$\pi_x\mu_{xy} = \pi_y\mu_{yx}$

Not all stationary processes are reversible; however, almost all DNA evolution models assume time reversibility, which is considered to be a reasonable assumption.

Under the time reversibility assumption, let $s_{xy} = \mu_{xy}/\pi_y$. Dividing both sides of the reversibility condition by $\pi_x\pi_y$ then gives:

$s_{xy} = s_{yx}$

Definition: The symmetric term $s_{xy}$ is called the exchangeability between states $x$ and $y$. Equivalently, $\mu_{xy} = s_{xy}\pi_y$: the rate of change from state $x$ to state $y$ is the exchangeability of the pair scaled by the stationary frequency of the target state $y$.

Corollary: The 12 off-diagonal entries of the rate matrix $Q$ (note that the off-diagonal entries determine the diagonal entries, since the columns of $Q$ as written above sum to zero) can be completely determined by 9 numbers: 6 exchangeability terms and 3 stationary frequencies $\pi_x$ (only 3 of the 4 are free, since the stationary frequencies sum to 1).
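The parameter count in the corollary can be made concrete with a short sketch that builds all 12 off-diagonal rates from 9 numbers via $\mu_{xy} = s_{xy}\pi_y$. The frequencies and exchangeabilities below are hypothetical values chosen only for illustration, and the variable names are assumptions. The matrix again follows the column-vector layout of $Q$ used above.

```python
from itertools import combinations

BASES = "AGCT"

# Three free stationary frequencies; the fourth follows because they sum to 1.
pi = {"A": 0.3, "G": 0.2, "C": 0.2}
pi["T"] = 1.0 - sum(pi.values())

# Six hypothetical exchangeabilities, one per unordered pair of bases.
s = dict(zip(combinations(BASES, 2), [1.0, 0.4, 0.3, 0.5, 0.2, 1.2]))

def exchangeability(x, y):
    """Symmetric lookup: s_xy = s_yx."""
    return s[(x, y)] if (x, y) in s else s[(y, x)]

# Off-diagonal rates mu_xy = s_xy * pi_y (12 numbers from 9 parameters).
mu = {(x, y): exchangeability(x, y) * pi[y]
      for x in BASES for y in BASES if x != y}

# Rate matrix in the column-vector convention: Q[y][x] = mu_xy, Q[x][x] = -mu_x.
Q = [[mu[(x, y)] if x != y else -sum(mu[(x, z)] for z in BASES if z != x)
      for x in BASES] for y in BASES]

# Reversibility condition pi_x mu_xy = pi_y mu_yx, checked over every ordered pair.
balance_gap = max(abs(pi[x] * mu[(x, y)] - pi[y] * mu[(y, x)]) for (x, y) in mu)
# Stationarity Q Pi = 0, checked row by row.
stationarity_gap = max(abs(sum(Q[i][j] * pi[b] for j, b in enumerate(BASES)))
                       for i in range(4))
```

Both `balance_gap` and `stationarity_gap` should vanish up to rounding. This reflects the fact that a rate matrix built this way is reversible by construction and has the chosen frequencies as its stationary distribution.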

References

Jukes, T.H. and C.R. Cantor. (1969) Evolution of Protein Molecules, pp. 21-132. Academic Press, New York.
Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111-120.
Hasegawa, M., H. Kishino, and T. Yano. (1985) Dating of human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22, 160-174.
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17, 368-376.