NEW FEATURES IN RECENT VERSIONS
Version 3.5 has many new features. They include:
- The programs now exist in C as well as in Pascal. In the future we will
support only the C versions, and as of now will not make any more improvements
in the Pascal version. It will cease to be distributed with the next release
of PHYLIP. A Makefile has been included in the distribution to simplify the
problems of compiling the package. The existence of a C compiler on most
workstations means that we have ceased to directly distribute executables for
workstations, as people can easily create them themselves by following our
instructions.
- All programs now have had the upper limits on the numbers of species and
numbers of sites (or characters) removed. They instead use the "malloc" and
"free" functions of C to try to allocate as much memory as they need. If they
fail to find it they will complain, and you will have to look for a bigger
machine, or install more memory, or remove other jobs that are competing for
the memory. We no longer have to guess how large a computer you have and where
you want to put the tradeoff between species and sites.
- The program SEQBOOT has now fully superseded the former programs DNABOOT,
BOOT, and DOLBOOT, which have been withdrawn. SEQBOOT also now can carry out
block-bootstrapping (Kuensch, 1989), which attempts to correct for correlation
of nearby sites.
- The DNA likelihood programs DNAML and DNAMLK now have a revised
Categories option that allows them to cope with rate variation from site to
site. Instead of the user specifying in advance the rate category of each
site, they need only specify how many categories there are, what their rates
are, what their relative probabilities are, and how long, on average, the
patches of sites sharing a single rate are along the molecule. The program then
computes the likelihood allowing for all of these, and adding up over all
possibilities of rate patterns, without being dependent on assuming that it has
inferred rates at individual sites correctly. This should go far to address
the criticism that maximum likelihood assumes constancy of rate at all sites.
- A new program PROTDIST has been added to compute distance matrices from
protein sequences, using several different methods. This will allow protein
sequence data to be analyzed by distance matrix methods as well as parsimony
methods.
- A new program, RETREE, has been added to allow users easily and
interactively to reroot trees, flip branches around, change or remove branch
lengths, change species names, and so on.
- A new program, COALLIKE, has been added to compute likelihoods for the
parameter combination 4Nu, the product of neutral mutation rate and effective
population size, when we use bootstrapping and DNAMLK to implement the
Bootstrap Monte Carlo method of inferring likelihoods for population parameters
from population samples of molecular sequences. This is described in a
forthcoming paper (Felsenstein, 1992).
- Programs that estimate a tree with branch lengths can now all read in a user
tree that has branch lengths and be told to use these rather than re-estimating
them (this was already possible for DNAML and DNAMLK). In addition, the
programs that estimate an unrooted tree (DNAML, FITCH, RESTML and CONTML) can
read in a tree with branch lengths on some branches and not on others, and be
told to hold the lengths they read in constant while iterating the rest. Thus
you can, for example, specify that a certain branch must have length zero.
- DRAWTREE and DRAWGRAM can now write out a PICT file that can be read by the
MacDraw drawing program. They can also write out the file format for the X-
windows drawing program XFIG, and the input format for the freely-distributed
ray tracing program RAYSHADE (for trees seen in 3 dimensions floating above a
landscape). In addition they allow fonts to be specified for species names
when a Postscript printer is being used, and they can also make an X-windows
X-bitmap file. DRAWTREE has a new option that allows the program to (slowly)
calculate node positions so as to make them avoid each other better. Both
programs now, when plotting on raster devices such as dot-matrix printers, use
round pens to make the lines smoother, and are faster at drawing the lines.
- DNADIST now computes its distances much more quickly. It also can compute
the Nei and Jin (1991) distance that allows for rate variation among sites.
- The programs that estimate trees by adding species sequentially to a tree
(PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, FITCH, KITSCH, MIX, and
DOLLOP) now allow the user to specify that multiple tries will be made with
different input orders of species (using the Jumble option) with only the trees
tied for best overall being reported. The trees found will be those that are
tied for best among all of those found by all these runs, not the trees found
as best by each run. This improves the chances of finding the best tree.
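The Jumble strategy in the last item above can be sketched in C roughly as
follows. This is only an illustration, not code from the package: the names
jumble_order and keep_overall_best are hypothetical, each run is assumed to
report a single integer score for which lower is better (as with a count of
parsimony steps), and rand() stands in for whatever random number generator
the programs actually use.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: Fisher-Yates shuffle of the species input order,
   giving one "jumbled" order. rand() is used only to keep the sketch short. */
void jumble_order(int *order, size_t n)
{
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;      /* 0 <= j < i */
        int tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}

/* Hypothetical helper: score[r] is the best score found by run r
   (assumption: lower is better). Stores in kept[] the indices of the runs
   tied for best over ALL runs and returns how many were stored.
   kept[] must have room for nruns entries. */
size_t keep_overall_best(const long *score, size_t nruns, size_t *kept)
{
    size_t nkept = 0;
    long best = 0;

    for (size_t r = 0; r < nruns; r++) {
        if (nkept == 0 || score[r] < best) {
            best = score[r];                /* new overall best: start over */
            nkept = 0;
        }
        if (score[r] == best)
            kept[nkept++] = r;              /* tied with the overall best */
    }
    return nkept;
}
```

A driver would call jumble_order before each run, build a tree from that
order, and pass the collected scores to keep_overall_best. The sketch keeps
one score per run and does not check whether two runs found the same tree; a
real program would also have to handle several tied trees within one run.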
Version 3.4 also had many new features. They included:
- All programs were given interactive menus which allow the user to see and
alter option settings. The programs read from a file INFILE and write to a
file OUTFILE, as well as to a treefile TREEFILE. The result should be much
easier for novice users to deal with. Most of the options which once were set
by altering the input file can now be selected using the menu. Only options
that require separate information for each character or site, such as Weights,
Ancestors, Factors, and the Categories option continued to require that
information be entered into the input file (although user-defined trees are put
there also).
- The molecular sequence programs now allowed either interleaved or sequential
sequence input (i.e. sequences entered in "aligned" form or by having all of one
sequence followed by all of another). The choice is made using the interactive
menu.
- Three new programs were added: NEIGHBOR carried out Saitou and Nei's
neighbor-joining method for distance matrix data, which is much faster than
FITCH and KITSCH and should be able to handle much larger data sets. It also
carried out the UPGMA clustering method. SEQBOOT allowed the user to bootstrap
nucleotide sequence data sets, protein sequence data sets, or discrete-
characters data sets and write out to a file the multiple data sets that
result. CONTRAST accepted a continuous-characters data set and a series of
user trees, and wrote out the series of contrasts for each character that are
independent under a Brownian motion model of character evolution, as well as
regressions, correlations, and covariances between them.
- All of the programs that inferred trees now accepted multiple data sets.
This allowed us to use SEQBOOT together with this feature to analyze
bootstrapped data sets and find different trees for the different bootstrap
replicates. Their variation could be summarized by the consensus tree program
CONSENSE. Thus almost everything in this package could now be bootstrapped.
- A serious error was fixed in version 3.31. It made the DNA likelihood
programs and DNADIST give incorrect results when the Categories option was used
and there was more than one category of rates. Runs of these programs made
with the Categories option before that version should be rerun.
- Almost all programs now printed out trees in the "phenogram" form so that
they grew left-to-right, rather than in the triangular diagram used before.
- The tree-plotting programs DRAWGRAM and DRAWTREE now supported the Hewlett-
Packard Laserjet printers and also could produce output files compatible with
the PC-Paint drawing program. The code for placement of interior nodes in
DRAWGRAM was corrected, and preview of trees using Tektronix graphics was made
easier by having it clear the screen more often.
- The DNA likelihood program DNAML now ran about 60% faster.
- The restriction sites likelihood program RESTML now allowed for the data
arising from digests with multiple enzymes.
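The resampling that SEQBOOT performs, as described in the list above, can be
pictured with the following C sketch. It is an illustration under stated
assumptions, not the program's source: the function name bootstrap_weights is
hypothetical, a replicate is represented as a count of how often each site was
drawn, blocks that run past the last site are assumed to wrap around to the
first, and the final block is cut short so that exactly as many sites are
drawn as the data set contains. A block size of 1 gives the ordinary
bootstrap; larger block sizes correspond to the block-bootstrapping mentioned
for version 3.5.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns an allocated array w[0..nsites-1] in which
   w[s] is the number of times site s was drawn in one bootstrap replicate,
   or NULL if the arguments are invalid or memory cannot be found.
   The caller releases the array with free(). */
int *bootstrap_weights(size_t nsites, size_t blocksize)
{
    if (nsites == 0 || blocksize == 0)
        return NULL;

    int *w = calloc(nsites, sizeof *w);     /* zero-filled counts */
    if (w == NULL) {
        fprintf(stderr, "not enough memory for %lu sites\n",
                (unsigned long)nsites);
        return NULL;
    }

    size_t drawn = 0;
    while (drawn < nsites) {
        /* pick the first site of a block at random (rand() for brevity) */
        size_t start = (size_t)rand() % nsites;
        for (size_t k = 0; k < blocksize && drawn < nsites; k++) {
            w[(start + k) % nsites]++;      /* assumption: wrap at the end */
            drawn++;
        }
    }
    return w;
}
```

Calling bootstrap_weights(nsites, 1) once per replicate yields one resampled
data set each time; the counts always sum to nsites, so every replicate is the
same size as the original data.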
COMING ATTRACTIONS, FUTURE PLANS
There are some obvious deficiencies in this version. Some of these holes
will be filled in the next few releases (3.6, 3.7, etc.). They include:
- A program to align molecular sequences on a predefined User Tree may
ultimately be included. This will allow alignment and phylogeny reconstruction
to proceed iteratively by successive runs of two programs, one aligning on a
tree and the other finding a better tree based on that alignment. In the
shorter run a simple two-sequence alignment program may be included.
- An interactive "likelihood explorer" for DNA sequences will be written.
This will allow, either with or without the assumption of a molecular clock,
trees to be varied interactively so that the user can get a much better feel
for the shape of the likelihood surface. Likelihood will be able to be plotted
against branch lengths for any branch.
- The DNAML and DNAMLK programs will reinstate the previous Categories
option, where the user specified categories of rates of evolution for each
site, while also retaining the present one, which infers them. The hope is to
allow for variation in rate in 1st, 2nd and 3rd positions in a coding sequence
(these being identified by the user) while also allowing for autocorrelated
rates of evolution in adjacent codons.
- If possible we will find some way of correcting for purine/pyrimidine
richness variations among species, within the framework of the maximum
likelihood programs. That the maximum likelihood programs do not allow for
base composition variation is their major limitation at the moment.
- Inclusion of some kind of protein sequence maximum likelihood program is an
obvious need (right now we have Adachi and Hasegawa's program in the
Unsupported Division).
- The Categories option of DNAML and DNAMLK will be generalized to allow for
rates at sites to gradually change as one moves along the tree, in an attempt
to implement Fitch and Markowitz's (1970) notion of "covarions".
- Obviously we need to start thinking about a more visual X windows interface,
but only if that can be used on most systems.
- Program PENNY and its relatives will be improved so as to run faster and
find all most parsimonious trees more quickly.
- A more sophisticated compatibility program should be included, if I can
find one.
- An "evolutionary clock" version of CONTML will be done, and the same may
also be done for RESTML.
- We hope gradually to generalize the tree structures in the programs to
infer multifurcating trees as well as bifurcating ones.
- We hope to economize on the size of the source code, and enforce some
standardization of it, by putting frequently used routines in a library from
which they can be linked into various programs. This will enforce a rather
complete standardization of our code.
- We may decide to gradually move our code to an object-oriented language,
most likely C++. One could describe the language that version 3.4 was written
in as "Pascal", version 3.5 as "Pascal written in C", version 3.6 as "C written
in C", and maybe version 3.7 as "C++ written in C" and then 3.8 as "C++ written
in C++". At least that scenario is one possibility.
Much of the future development of the package will be in the DNA
likelihood programs and the distance matrix programs. This is for several
reasons. First, I am more interested in those problems. Second, collection of
molecular data is increasing rapidly, and those programs have the most promise
for future development for those data.
Maintained 15 Jul 1996 -- by Martin Hilbers (e-mail: M.P.Hilbers@dl.ac.uk)