If you have never heard of any of the theory below, and are not dealing with sequence samples from populations of a single species, then you probably do not need to use this program, so please do not be surprised if you are lost in what follows.
The rationale for doing this at all is contained in a forthcoming (December of 1992) issue of Genetical Research in a paper by me (Felsenstein, 1992c). The assumption is that we have a random sample of nucleotide sequences evolved in a single random-mating population under a model of neutral mutation. We want to compute the likelihood curve for the compund parameter 4Nu (I would write the neutral mutation rate as mu instead of u but have no Greek letters available). The expression for the likelihood is given by (Felsenstein, 1988b)
L(4Nu) = L(4N, u) = Sum Probability (G given N) Probability (X given G, u) (over all genealogies G)where X is the observed sequence data and where the sum extends over all possible genealogies (not just all topologies but all combinations of branch lengths as well). If we rescale the time of the genealogies to be given, not in generations, but in units of 1/u generations, the expression changes to
L(4Nu) = Sum Probability (G given 4Nu) Probability (X given G) (over all genealogies G)which is a function of the compound parameter 4Nu (often called by the Greek letter theta).
The prior probability which is the first factor inside the summation is well approximated by the probability of the genealogical tree G (rescaled in time units of 1/u) under the "coalescent" approximation of Kingman (1982a, b). It is a simple product of exponential densities: if we number the interior nodes of G back from the present as n, n-1, n-2, ..., 2 where n is the number of sequences in our sample, and let u(k) be the interval between the time of node k+1 and that of node k then the density function of the prior is
n! (n-1)! Sum k(k-1)u(k) Probability (G given 4Nu) = ---------- exp ( --------------- ) n-1 4Nu (4Nu)where the summation is over n, n-1, ... , 2.
The real difficulty in these calculations is the summation over all possible genealogies G. This a huge job, but in the 1992c paper I present an argument that it can be done approximately by Monte Carlo Integration. The boostrap is used to create data sets X* (X star) and each of these has the genealogy estimated by Maximum Likelihood, to get an estimate G* (of course this is very slow if we have lots of bootstrap samples X*). I presented in that paper a rationale that the resulting G*'s are distributed approximately proportionately to the second factor in the summation two formulas above, namely the Probability(X given G), which is the likelihood of the tree under the usual maximum likelihood method for tree estimation. I also showed that if we can sample G's from that distribution, we can approximate the whole sum (up to a constant that is not important to know) by the average of the prior Probability(G given 4Nu).
This forms the basis for the estimation of the likelihood L(4Nu). Take the data, bootstrap sample it using SEQBOOT, estimate trees from each sample using DNAMLK (it is important that we estimate under a molecular clock so that DNAMLK and not DNAML should be used). Then the tree file that is produced by DNAMLK should be fed into COALLIKE and the estimated likelihood curve will come out on the output file.
The program is far faster than the slow step in all this, the estimation of the G's by DNAMLK. In the future we will provide (probably not in the PHYLIP package, however) an alternative method, which is the sampling of trees by the Hasting-Metropolis sampler, which is considerably faster. It too ends up with a tree file and involves use of COALLIKE. The references for that method are also given in my 1992c paper. Another paper that bears on these methods is my 1992a paper which explains why some of the alternative, and much faster, methods of estimating 4Nu are expected to be inefficient.
The program starts by displaying a menu which looks like this:
The program reads the treefile, computes the prior probability for each
tree for all of the values of 4Nu, and sums these over all trees. The result
is a table printed on the output file (see the example below). The likelihoods
are presented as log-likelihoods. The program detects the end of its input by
detecting end-of-file, so you should be careful to ensure that there are not
any extra blank lines or spare end-of-line characters (carriage returnsor
line-feeds, for example) at the end of the tree file.
The program does not make heavy use of system resouces so that it should
not be limited in what it can analyze, even on a small computer.
INPUT AND OUTPUT
Coalescent likelihoods from sampled trees, version 3.5c
Settings for this run:
N Minimum value of theta 0.0001000
X Maximum value of theta 100.00000
0 Terminal type (IBM PC, VT52, ANSI)? ANSI
1 Print indications of progress of run No
Are these settings correct? (type Y or the letter for one to change)
The program allows you to control whether it will print to the screen a summary
of how many trees have been processed so far (option 1), which you may want to
turn off if you are going to run the program in background (however this will
usually not be necessary). It also gives you control over the range of theta (
= 4Nu) that is used for the calculation, initially set to between 0.0001 and
100. The values considered are spaced a factor of 2 or 2.5 apart: 0.0001,
0.0002, 0.0005, 0.001. 0.002, etc.
TEST TREE FILE INPUT
(((J:0.00109,E:0.00109):0.01138,((C:0.00000,H:0.00000):0.00344,((B:0.00000,
F:0.00000):0.00000,((I:0.00000,G:0.00000):0.00000,
D:0.00000):0.00000):0.00344):0.00903):0.00000,A:0.01247);
(((H:0.00000,C:0.00000):0.00101,((B:0.00000,(I:0.00000,
G:0.00000):0.00000):0.00000,(F:0.00000,D:0.00000):0.00000):0.00101):0.01181,
((E:0.00000,J:0.00000):0.00000,A:0.00002):0.01280);
(((((C:0.00000,H:0.00000):0.00000,F:0.00000):0.00000,D:0.00000):0.00000,
(G:0.00000,(I:0.00000,B:0.00000):0.00000):0.00000):0.00514,((E:0.00005,
J:0.00005):0.00006,A:0.00010):0.00503);
(((((C:0.00000,H:0.00000):0.00000,F:0.00000):0.00000,D:0.00000):0.00000,
(G:0.00000,(I:0.00000,B:0.00000):0.00000):0.00000):0.00322,((E:0.00016,
J:0.00016):0.00019,A:0.00035):0.00287);
(((H:0.00000,C:0.00000):0.00101,((D:0.00000,G:0.00000):0.00000,(F:0.00000,
(I:0.00000,B:0.00000):0.00000):0.00000):0.00101):0.00363,((A:0.00006,
J:0.00006):0.00007,E:0.00013):0.00452);
(((B:0.00000,F:0.00000):0.00000,(G:0.00000,(I:0.00000,
D:0.00000):0.00000):0.00000):0.00316,((C:0.00102,H:0.00102):0.00111,((E:0.00000,
J:0.00000):0.00001,A:0.00002):0.00211):0.00104);
(((H:0.00000,C:0.00000):0.00101,((B:0.00000,G:0.00000):0.00000,(F:0.00000,
(I:0.00000,D:0.00000):0.00000):0.00000):0.00101):0.00564,((E:0.00003,
J:0.00003):0.00003,A:0.00006):0.00659);
(((((C:0.00000,H:0.00000):0.00000,F:0.00000):0.00000,D:0.00000):0.00000,
(G:0.00000,(I:0.00000,B:0.00000):0.00000):0.00000):0.00613,((E:0.00003,
J:0.00003):0.00004,A:0.00007):0.00607);
(((H:0.00000,C:0.00000):0.00101,((D:0.00000,G:0.00000):0.00000,(F:0.00000,
(I:0.00000,B:0.00000):0.00000):0.00000):0.00101):0.00265,((A:0.00010,
J:0.00010):0.00012,E:0.00022):0.00344);
(((H:0.00000,C:0.00000):0.00202,((D:0.00000,G:0.00000):0.00000,(F:0.00000,
(I:0.00000,B:0.00000):0.00000):0.00000):0.00202):0.00412,((A:0.00003,
J:0.00003):0.00004,E:0.00007):0.00606);
CONTENTS OF OUTPUT FILE
Read 10 trees of 10 tips each
theta Ln(Likelihood)
----- --------------
0.00010 -7.40952
0.00020 30.35217
0.00050 48.51775
0.00100 51.19906
0.00200 49.72183
0.00500 44.77105
0.01000 39.82787
0.02000 34.31189
0.05000 26.54843
0.10000 20.49795
0.20000 14.36581
0.50000 6.18881
1.00000 -0.02509
2.00000 -6.25096
5.00000 -14.49002
10.00000 -20.72581
20.00000 -26.96286
50.00000 -35.20872
100.00000 -41.44679
Back to the main PHYLIP page
Back to the SEQNET home page
Maintained 15 Jul 1996 -- by Martin Hilbers(e-mail:M.P.Hilbers@dl.ac.uk)