COALLIKE -- average priors for coalescent likelihoods

version 3.5c

CONTENTS:

(c) Copyright 1993 by the Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

DESCRIPTION

This program reads a tree file which contains rooted trees with branch lengths (for example ones produced by
DNAMLK) and computes, for a number of different values of the parameter 4Nu, the average of the prior probabilities of those trees, and prints this out as the estimated likelihood curve for that parameter. The program is quite fast and can handle hundreds of trees without much delay.

If you have never heard of any of the theory below, and are not dealing with sequence samples from populations of a single species, then you probably do not need to use this program, so please do not be surprised if you are lost in what follows.

The rationale for doing this at all is contained in a forthcoming (December of 1992) issue of Genetical Research in a paper by me (Felsenstein, 1992c). The assumption is that we have a random sample of nucleotide sequences evolved in a single random-mating population under a model of neutral mutation. We want to compute the likelihood curve for the compund parameter 4Nu (I would write the neutral mutation rate as mu instead of u but have no Greek letters available). The expression for the likelihood is given by (Felsenstein, 1988b)

  L(4Nu) = L(4N, u) =  Sum  Probability (G given N) Probability (X given G, u)
                    (over all
                    genealogies G)
where X is the observed sequence data and where the sum extends over all possible genealogies (not just all topologies but all combinations of branch lengths as well). If we rescale the time of the genealogies to be given, not in generations, but in units of 1/u generations, the expression changes to
   L(4Nu)  =  Sum  Probability (G given 4Nu) Probability (X given G)
           (over all
           genealogies G)
which is a function of the compound parameter 4Nu (often called by the Greek letter theta).

The prior probability which is the first factor inside the summation is well approximated by the probability of the genealogical tree G (rescaled in time units of 1/u) under the "coalescent" approximation of Kingman (1982a, b). It is a simple product of exponential densities: if we number the interior nodes of G back from the present as n, n-1, n-2, ..., 2 where n is the number of sequences in our sample, and let u(k) be the interval between the time of node k+1 and that of node k then the density function of the prior is

                                 n! (n-1)!        Sum  k(k-1)u(k)
  Probability (G given 4Nu)  =  ----------  exp ( --------------- )
                                      n-1             4Nu
                                 (4Nu)
where the summation is over n, n-1, ... , 2.

The real difficulty in these calculations is the summation over all possible genealogies G. This a huge job, but in the 1992c paper I present an argument that it can be done approximately by Monte Carlo Integration. The boostrap is used to create data sets X* (X star) and each of these has the genealogy estimated by Maximum Likelihood, to get an estimate G* (of course this is very slow if we have lots of bootstrap samples X*). I presented in that paper a rationale that the resulting G*'s are distributed approximately proportionately to the second factor in the summation two formulas above, namely the Probability(X given G), which is the likelihood of the tree under the usual maximum likelihood method for tree estimation. I also showed that if we can sample G's from that distribution, we can approximate the whole sum (up to a constant that is not important to know) by the average of the prior Probability(G given 4Nu).

This forms the basis for the estimation of the likelihood L(4Nu). Take the data, bootstrap sample it using SEQBOOT, estimate trees from each sample using DNAMLK (it is important that we estimate under a molecular clock so that DNAMLK and not DNAML should be used). Then the tree file that is produced by DNAMLK should be fed into COALLIKE and the estimated likelihood curve will come out on the output file.

The program is far faster than the slow step in all this, the estimation of the G's by DNAMLK. In the future we will provide (probably not in the PHYLIP package, however) an alternative method, which is the sampling of trees by the Hasting-Metropolis sampler, which is considerably faster. It too ends up with a tree file and involves use of COALLIKE. The references for that method are also given in my 1992c paper. Another paper that bears on these methods is my 1992a paper which explains why some of the alternative, and much faster, methods of estimating 4Nu are expected to be inefficient.

INPUT AND OUTPUT

The program starts by displaying a menu which looks like this:

Coalescent likelihoods from sampled trees, version 3.5c

Settings for this run:
  N   Minimum value of theta               0.0001000
  X   Maximum value of theta               100.00000
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1  Print indications of progress of run  No

Are these settings correct? (type Y or the letter for one to change)
The program allows you to control whether it will print to the screen a summary of how many trees have been processed so far (option 1), which you may want to turn off if you are going to run the program in background (however this will usually not be necessary). It also gives you control over the range of theta ( = 4Nu) that is used for the calculation, initially set to between 0.0001 and 100. The values considered are spaced a factor of 2 or 2.5 apart: 0.0001, 0.0002, 0.0005, 0.001. 0.002, etc.

The program reads the treefile, computes the prior probability for each tree for all of the values of 4Nu, and sums these over all trees. The result is a table printed on the output file (see the example below). The likelihoods are presented as log-likelihoods. The program detects the end of its input by detecting end-of-file, so you should be careful to ensure that there are not any extra blank lines or spare end-of-line characters (carriage returnsor line-feeds, for example) at the end of the tree file.

The program does not make heavy use of system resouces so that it should not be limited in what it can analyze, even on a small computer.

TEST TREE FILE INPUT


(((J:0.00109,E:0.00109):0.01138,((C:0.00000,H:0.00000):0.00344,((B:0.00000,
F:0.00000):0.00000,((I:0.00000,G:0.00000):0.00000,
D:0.00000):0.00000):0.00344):0.00903):0.00000,A:0.01247);
(((H:0.00000,C:0.00000):0.00101,((B:0.00000,(I:0.00000,
G:0.00000):0.00000):0.00000,(F:0.00000,D:0.00000):0.00000):0.00101):0.01181,
((E:0.00000,J:0.00000):0.00000,A:0.00002):0.01280);
(((((C:0.00000,H:0.00000):0.00000,F:0.00000):0.00000,D:0.00000):0.00000,
(G:0.00000,(I:0.00000,B:0.00000):0.00000):0.00000):0.00514,((E:0.00005,
J:0.00005):0.00006,A:0.00010):0.00503);
(((((C:0.00000,H:0.00000):0.00000,F:0.00000):0.00000,D:0.00000):0.00000,
(G:0.00000,(I:0.00000,B:0.00000):0.00000):0.00000):0.00322,((E:0.00016,
J:0.00016):0.00019,A:0.00035):0.00287);
(((H:0.00000,C:0.00000):0.00101,((D:0.00000,G:0.00000):0.00000,(F:0.00000,
(I:0.00000,B:0.00000):0.00000):0.00000):0.00101):0.00363,((A:0.00006,
J:0.00006):0.00007,E:0.00013):0.00452);
(((B:0.00000,F:0.00000):0.00000,(G:0.00000,(I:0.00000,
D:0.00000):0.00000):0.00000):0.00316,((C:0.00102,H:0.00102):0.00111,((E:0.00000,
J:0.00000):0.00001,A:0.00002):0.00211):0.00104);
(((H:0.00000,C:0.00000):0.00101,((B:0.00000,G:0.00000):0.00000,(F:0.00000,
(I:0.00000,D:0.00000):0.00000):0.00000):0.00101):0.00564,((E:0.00003,
J:0.00003):0.00003,A:0.00006):0.00659);
(((((C:0.00000,H:0.00000):0.00000,F:0.00000):0.00000,D:0.00000):0.00000,
(G:0.00000,(I:0.00000,B:0.00000):0.00000):0.00000):0.00613,((E:0.00003,
J:0.00003):0.00004,A:0.00007):0.00607);
(((H:0.00000,C:0.00000):0.00101,((D:0.00000,G:0.00000):0.00000,(F:0.00000,
(I:0.00000,B:0.00000):0.00000):0.00000):0.00101):0.00265,((A:0.00010,
J:0.00010):0.00012,E:0.00022):0.00344);
(((H:0.00000,C:0.00000):0.00202,((D:0.00000,G:0.00000):0.00000,(F:0.00000,
(I:0.00000,B:0.00000):0.00000):0.00000):0.00202):0.00412,((A:0.00003,
J:0.00003):0.00004,E:0.00007):0.00606);

CONTENTS OF OUTPUT FILE


    Read    10 trees of   10  tips each

      theta        Ln(Likelihood)
      -----        --------------
     0.00010            -7.40952
     0.00020            30.35217
     0.00050            48.51775
     0.00100            51.19906
     0.00200            49.72183
     0.00500            44.77105
     0.01000            39.82787
     0.02000            34.31189
     0.05000            26.54843
     0.10000            20.49795
     0.20000            14.36581
     0.50000             6.18881
     1.00000            -0.02509
     2.00000            -6.25096
     5.00000           -14.49002
    10.00000           -20.72581
    20.00000           -26.96286
    50.00000           -35.20872
   100.00000           -41.44679

Back to the main PHYLIP page
Back to the SEQNET home page
Maintained 15 Jul 1996 -- by Martin Hilbers(e-mail:M.P.Hilbers@dl.ac.uk)