All of the programs except FACTOR, DNADIST, GENDIST, DNAINVAR, SEQBOOT, CONTRAST, RETREE, COALLIKE and the plotting and consensus tree programs act to construct an estimate of a phylogeny. MOVE, DOLMOVE, and DNAMOVE let you construct it yourself by hand. All of the rest but NEIGHBOR, the "...PENNY" programs and CLIQUE make use of a common approach involving additions and rearrangements. They are trying to minimize or maximize some quantity over the space of all possible evolutionary trees. Each program contains a part that, given the topology of the tree, evaluates the quantity that is being minimized or maximized. The straightforward approach would be to evaluate all possible tree topologies one after another and pick the one which, according to the criterion being used, is best. This would not be possible for more than a small number of species, since the number of possible tree topologies is enormous. A review of the literature on the counting of evolutionary trees will be found one of my papers (Felsenstein, 1978a).
Since we cannot search all topologies, these programs are not guaranteed to always find the best tree, although they seem to do quite well in practice. The strategy they employ is as follows: the species are taken in the order in which they appear in the input file. The first two (in some programs the first three) are taken and a tree constructed containing only those. There is only one possible topology for this tree. Then the next species is taken, and we consider where it might be added to the tree. If the initial tree is (say) a rooted tree with two species and we want the resulting three-species tree to be a bifurcating tree, there are only three places where we could add the third species. Each of these is tried, and each time the resulting tree is evaluated according to the criterion. The best one is chosen to be the basis for further operations. Now we consider adding the fourth species, again at each of the five possible places that would result in a bifurcating tree. Again, the best of these is accepted.
This strategy of adding species and making local rearrangements will look
at about (n-1) times (2n-3) different topologies, though if rearrangements are
frequently successful the number may be larger. I have been describing the
strategy when rooted trees are being considered. For unrooted trees there is a
precisely similar strategy, though the first tree constructed may be a three-
species tree and the rearrangements may not start until after the addition of
the fifth species.
Though we are not guaranteed to have found the best tree topology, we are
guaranteed that no nearby topology (i. e. none accessible by a single local
rearrangement) is better. In this sense we have reached a local optimum of our
criterion. Note that the whole process is dependent on the order in which the
species are present in the input file. We can try to find a different and
better solution by reordering the species in the input file and running the
program again (or, more easily, by using the J option). If none of these
attempts finds a better solution, then we have some indication that we may have
found the best topology, though we can never be certain of this.
Note also that a new topology is never accepted unless it is better than
the previous one, so that the rearrangement process can never fall into an
endless loop. This is also the way ties in our criterion are resolved, namely
by sticking with the tree found first. However, the tree construction programs
other than CLIQUE, CONTML, FITCH, and DNAML do keep a record of all trees found
that are tied with the best one found. This gives you some immediate idea of
which parts of the tree can be altered without affecting the quality of the
result.
The programs doing global optimization print out a dot "." after each
group is removed and re-added to the tree, to give the user some sign that the
rearrangements are proceeding. A new line of dots is started whenever a new
round of global rearrangements is started following an improvement in the tree.
On the line before the dots are printed there is printed a bar of the form
"!--------------!" to show how many dots to expect. The dots will not be
printed out at a uniform rate, but the later dots, which represent removal of
larger groups from the tree and trying them consequently in fewer places, will
print out more quickly. With some compilers each row of dots is not printed
out until it is complete.
It should be noted that PENNY, DOLPENNY, DNAPENNY and CLIQUE use a more
sophisticated strategy of "depth-first search" with a "branch and bound" search
method that guarantees that all of the best trees will be found. In the case
of PENNY, DOLPENNY and DNAPENNY there can be a considerable sacrifice of
computer time if the number of species is greater than about ten: it is a
matter for you to consider whether it is worth it for you to guarantee finding
all the most parsimonious trees, and that depends on how much free computer
time you have! CLIQUE finds all largest cliques, and does so without undue
burning of computer time.
In practice, it is advisable to use the Jumble option to evaluate many
different orderings of the input species. When the programs which have global
branch-swapping as default (such as DNAPARS) are used or when the G option is
employed in other programs IT IS ADVISABLE TO USE THE JUMBLE OPTION AND SPECIFY
THAT IT BE DONE MANY TIMES (AS MANY AS TEN) to use different orderings of the
input species). When the G (Global rearrangement) option is not being used I
have also found it useful to do multiple Jumbles.
People who want a magic "black box" program whose results they do not have
to question (or think about) often are upset that these programs give results
that are dependent on the order in which the species are entered in the data.
To me this property is an advantage, for it permits you to try different
searches for better trees, simply by varying the input order of species. If
you do not use the multiple Jumble option, but do multiple individual runs
instead, you can easily decide which to pay most attention to -- the one or
ones that are best according to the criterion employed (for example, with
parsimony, the one out of the runs that results in the tree with the fewest
changes).
In practice, in a single run, it usually seems best to put species that
are likely to be sources of confusion in the topology last, as by the time they
are added the arrangement of the earlier species will have stabilized into a
good configuration, and then the last few species will by fitted into that
topology. There will be less chance this way of a poor initial topology that
would affect all subsequent parts of the search. However, a variety of
arrangements of the input order of species should be tried, as can be done if
the J option is used, and no species should be kept in a fixed place in the
order of input. Note that the results of the "...PENNY" programs and CLIQUE
are not sensitive to the input order of species, and NEIGHBOR is only slightly
sensistive to it, so that multiple Jumbling is not possible with those
programs. Note also that with global search, which is standard in many
programs and in others is an option, each group (including each individual
species) will be removed and re-added in all possible positions, so that a
species causing confusion will have more chance of moving to a new location
than it would without global rearrangement.
Probably the most important thing to keep in mind while running any of the
parsimony or compatibility programs is not to overinterpret the result. Many
users treat the set of most parsimonious trees as if it were a confidence
interval. If a group appears in all of the most parsimonious trees then they
treat it as well established. Unfortunately THE CONFIDENCE INTERVAL ON
PHYLOGENIES APPEARS TO BE MUCH LARGER THAN THE SET OF ALL MOST PARSIMONIOUS
TREES
(Felsenstein, 1985b). Likewise, variation of result among different
methods will not be a good indicator of the size of the confidence interval.
Consider a simple data set in which, out of 100 binary characters, 51 recommend
the rooted tree ((A,B),C) and 49 the tree (A,(B,C)). Many different methods
will all give the same result on such a data set: they will estimate the tree
as ((A,B),C). Nevertheless it is clear that the 51:49 margin by which this
tree is favored is not significantly different from 50:50. So CONSISTENCY
AMONG DIFFERENT METHODS IS A POOR GUIDE TO STATISTICAL SIGNIFICANCE.
Local Rearrangements
The process continues in this manner, with one important exception. After
each species is added, and before the next is added, a number of rearrangements
of the tree are tried, in an effort to improve it. The algorithms move through
the tree, making all possible local rearrangements of the tree. A local
rearrangement involves an internal segment of the tree in the following manner.
Each internal segment of the tree is of this form (where T1, T2, and T3 are
subtrees -- parts of the tree that can contain further forks and tips):
T1 T2 T3
\ / /
\ / /
\ / /
\/ /
* /
* /
* /
* /
*
!
!
the segment we are discussing being indicated by the asterisks. A local
rearrangement consists of switching the subtrees T1 and T3 or T2 and T3, so as
to obtain one of the following:
T3 T2 T1 T1 T3 T2
\ / / \ / /
\ / / \ / /
\ / / \ / /
\ / / \ / /
\ / \ /
\ / \ /
\ / \ /
\ / \ /
! !
! !
! !
Each time a local rearrangement is successful in finding a better tree, the new
arrangement is accepted. The phase of local rearrangements does not end until
the program can traverse the entire tree, attempting local rearrangements,
without finding any that improve the tree.
Global Rearrangements
A feature of most of the programs, such as PROTPARS, DNAPARS, DNACOMP,
DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX, and DOLLOP, is "global"
optimization of the tree. In four of these (CONTML, FITCH, DNAML and DNAMLK)
this is an option, 'G'. In the others it automatically applies. When it is
present there is an additional stage to the search for the best tree. Each
possible subtree is removed from the tree from the tree and added back in all
possible places. This process continues until all subtrees can be removed and
added again without any improvement in the tree. The purpose of this extra
rearrangement is to make it less likely that one or more a species gets "stuck"
in a suboptimal region of the space of all possible trees. The use of global
optimization results in approximately a tripling (3x) of the run-time, which is
why I have left it as an option in some of the slower programs.
Multiple Jumbles
As just mentioned, for most of these programs the search depends on the
order in which the species are entered into the tree. Using the J (Jumble)
option you can supply a random number seed which will allow the program to put
the species in in a random order. A new feature (with version 3.5) is to allow
this to be done multiple times. If you tell the program to do it 10 times, it
will go through the tree-building process 10 times, each with a different
random order of adding species. It will keep a record of the trees tied for
best over the whole process. In other words, it does not just record the best
trees from each of the 10 runs, but records the best ones overall. Of course
this is slow, taking 10 times longer than a single run. But it does give us a
much greater chance of finding all of the most parsimonious trees. In the
terminology of
Maddison (1991) it can find different "islands" of trees. The
present algorithms do not guarantee us to find all trees in a given "island"
from a single run, so multiple runs also help explore those "islands" that are
found.
STRATEGY FOR FINDING THE BEST TREE
A WARNING ON INTERPRETING RESULTS
Back to the main PHYLIP page
Back to the SEQNET home page
Maintained 15 Jul 1996 -- by Martin Hilbers(e-mail:M.P.Hilbers@dl.ac.uk)