NEW FEATURES IN RECENT VERSIONS

Version 3.5 has many new features. They include:

  1. The programs now exist in C as well as in Pascal. In the future we will support only the C versions, and as of now we will not make any more improvements in the Pascal version. It will cease to be distributed with the next release of PHYLIP. A Makefile has been included in the distribution to simplify compiling the package. Because a C compiler exists on most workstations, we have ceased to directly distribute executables for workstations, as people can easily create them themselves by following our instructions.
  2. All programs now have had the upper limits on the numbers of species and numbers of sites (or characters) removed. They instead use the "malloc" and "free" functions of C to try to allocate as much memory as they need. If they fail to find it they will complain, and you will have to look for a bigger machine, or install more memory, or remove other jobs that are competing for the memory. We no longer have to guess how large a computer you have and where you want to put the tradeoff between species and sites.
  3. The program SEQBOOT has now fully superseded the former programs DNABOOT, BOOT, and DOLBOOT, which have been withdrawn. SEQBOOT also now can carry out block-bootstrapping (Kuensch, 1989), which attempts to correct for correlation of nearby sites.
  4. The DNA likelihood programs DNAML and DNAMLK now have a revised Categories option that allows them to cope with rate variation from site to site. Instead of specifying in advance the rate category of each site, the user need only specify how many categories there are, what their rates are, what their relative probabilities are, and how long, on average, the patches of sites sharing a single rate are along the molecule. The program then computes the likelihood allowing for all of these, summing over all possible rate patterns, without depending on the assumption that it has inferred rates at individual sites correctly. This should go far to address the criticism that maximum likelihood assumes constancy of rate at all sites.
  5. A new program PROTDIST has been added to compute distance matrices from protein sequences, using several different methods. This will allow protein sequence data to be analyzed by distance matrix methods as well as parsimony methods.
  6. A new program, RETREE, has been added to allow users easily and interactively to reroot trees, flip branches around, change or remove branch lengths, change species names, and so on.
  7. A new program, COALLIKE, has been added to compute likelihoods for the parameter combination 4Nu, the product of neutral mutation rate and effective population size, when we use bootstrapping and DNAMLK to implement the Bootstrap Monte Carlo method of inferring likelihoods for population parameters from population samples of molecular sequences. This is described in a forthcoming paper (Felsenstein, 1992).
  8. Programs that estimate a tree with branch lengths can now all read in a user tree that has branch lengths and be told to use these rather than re-estimating them (this was already possible for DNAML and DNAMLK). In addition, the ones that estimate an unrooted tree (DNAML, FITCH, RESTML and CONTML) can read in a tree with branch lengths on some branches and not on others, and be told to hold the ones read in constant while iterating the rest. Thus you can, for example, specify that a certain branch must have length zero.
  9. DRAWTREE and DRAWGRAM can now write out a PICT file that can be read by the MacDraw drawing program. They can also write out the file format for the X-windows drawing program XFIG, and the input format for the freely-distributed ray tracing program RAYSHADE (for trees seen in 3 dimensions floating above a landscape). In addition they allow fonts to be specified for species names when a Postscript printer is being used, and they can also make an X-windows X-bitmap file. DRAWTREE has a new option that allows the program to (slowly) calculate node positions so as to make them avoid each other better. Both programs now, when plotting on raster devices such as dot-matrix printers, use round pens to make the lines smoother, and are faster at drawing the lines.
  10. DNADIST now computes its distances much more quickly. It also can compute the Nei and Jin (1991) distance that allows for rate variation among sites.
  11. The programs that estimate trees by adding species sequentially to a tree (PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, FITCH, KITSCH, MIX, and DOLLOP) now allow the user to specify that multiple tries will be made with different input orders of species (using the Jumble option) with only the trees tied for best overall being reported. The trees found will be those that are tied for best among all of those found by all these runs, not the trees found as best by each run. This improves the chances of finding the best tree.
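
The run-time allocation described in item 2 above can be illustrated with a minimal sketch in C. This is not code from the PHYLIP source: the function names alloc_matrix and free_matrix, and the choice of a species-by-sites array of characters, are assumptions made only for illustration. The sketch shows the general pattern of asking "malloc" for exactly as much memory as the data set needs and complaining if it cannot be found.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from PHYLIP): allocate a species-by-sites
   character matrix whose dimensions are known only at run time.
   Returns NULL, after printing a complaint, if memory cannot be found. */
char **alloc_matrix(long nspecies, long nsites)
{
    char **m;
    long i, j;

    if (nspecies <= 0 || nsites <= 0)
        return NULL;

    /* one pointer per species */
    m = (char **)malloc((size_t)nspecies * sizeof(char *));
    if (m == NULL) {
        fprintf(stderr, "ERROR: not enough memory for %ld species\n", nspecies);
        return NULL;
    }

    /* one row of sites per species */
    for (i = 0; i < nspecies; i++) {
        m[i] = (char *)malloc((size_t)nsites);
        if (m[i] == NULL) {
            fprintf(stderr, "ERROR: not enough memory for %ld sites\n", nsites);
            for (j = 0; j < i; j++)   /* release the rows already obtained */
                free(m[j]);
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Hypothetical helper: release a matrix obtained from alloc_matrix. */
void free_matrix(char **m, long nspecies)
{
    long i;

    if (m == NULL)
        return;
    for (i = 0; i < nspecies; i++)
        free(m[i]);
    free(m);
}
```

A caller would read the numbers of species and sites from the input file, call alloc_matrix with them, and stop with a message if NULL comes back. No compiled-in upper limit is involved, so the tradeoff between species and sites is settled by the memory actually available.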

Version 3.4 also had many new features. They included:

  1. All programs were given interactive menus which allow the user to see and alter option settings. The programs read from a file INFILE and write to a file OUTFILE, as well as to a treefile TREEFILE. The result should be much easier for novice users to deal with. Most of the options which once were set by altering the input file can now be selected using the menu. Only options that require separate information for each character or site, such as the Weights, Ancestors, Factors, and Categories options, continued to require that information be entered into the input file (although user-defined trees are put there also).
  2. The molecular sequence programs now allowed either interleaved or sequential sequence input (i.e. sequences entered in "aligned" form or by having all of one sequence followed by all of another). The choice is made using the interactive menu.
  3. Three new programs were added: NEIGHBOR carried out Saitou and Nei's neighbor-joining method for distance matrix data, which is much faster than FITCH and KITSCH and should be able to handle much larger data sets. It also carried out the UPGMA clustering method. SEQBOOT allowed the user to bootstrap nucleotide sequence data sets, protein sequence data sets, or discrete-characters data sets and write out to a file the multiple data sets that result. CONTRAST accepted a continuous-characters data set and a series of user trees, and wrote out the series of contrasts for each character that are independent under a Brownian motion model of character evolution, as well as regressions, correlations, and covariances between them.
  4. All of the programs that inferred trees now accepted multiple data sets. This allowed us to use SEQBOOT together with this feature to analyze bootstrapped data sets and find different trees for the different bootstrap replicates. Their variation could be summarized by the consensus tree program CONSENSE. Thus almost everything in this package could now be bootstrapped.
  5. A serious error was fixed in version 3.31. It had made the DNA likelihood programs and DNADIST give incorrect results when the Categories option was used and there was more than one category of rates. Runs made with these programs using Categories before that version should be rerun.
  6. Almost all programs now printed out trees in the "phenogram" form so that they grew left-to-right, rather than in the triangular diagram used before.
  7. The tree-plotting programs DRAWGRAM and DRAWTREE now supported the Hewlett-Packard Laserjet printers and also could produce output files compatible with the PC-Paint drawing program. The code for placement of interior nodes in DRAWGRAM was corrected, and preview of trees using Tektronix graphics was made easier by having it clear the screen more often.
  8. The DNA likelihood program DNAML now ran about 60% faster.
  9. The restriction sites likelihood program RESTML now allowed for the data arising from digests with multiple enzymes.
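
The bootstrap resampling that SEQBOOT performs (item 3 above) amounts to drawing sites with replacement from the original data set. A minimal sketch in C follows. It is not SEQBOOT's code: the function name bootstrap_weights and the idea of returning per-site counts, rather than writing out a resampled data set, are assumptions made for illustration. The standard library's rand() is used for brevity, and taking it modulo the number of sites introduces a slight bias that a careful implementation would avoid.

```c
#include <stdlib.h>

/* Hypothetical sketch (not from SEQBOOT): draw one bootstrap replicate
   by sampling nsites site indices with replacement. On return, weight[k]
   holds the number of times original site k was drawn, so the weights
   always sum to nsites. The caller seeds the generator with srand(). */
void bootstrap_weights(long nsites, long *weight)
{
    long i;

    for (i = 0; i < nsites; i++)
        weight[i] = 0;
    for (i = 0; i < nsites; i++)
        weight[rand() % nsites]++;   /* pick a site at random, count it */
}
```

Calling this once per replicate gives a set of site weights for each bootstrapped data set. Sites with weight zero are left out of that replicate, and sites with weight two or more are counted repeatedly.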

COMING ATTRACTIONS, FUTURE PLANS

There are some obvious deficiencies in this version. Some of these holes will be filled in the next few releases (3.6, 3.7, etc.). They include:

  1. A program to align molecular sequences on a predefined User Tree may ultimately be included. This will allow alignment and phylogeny reconstruction to proceed iteratively by successive runs of two programs, one aligning on a tree and the other finding a better tree based on that alignment. In the shorter run a simple two-sequence alignment program may be included.
  2. An interactive "likelihood explorer" for DNA sequences will be written. This will allow, either with or without the assumption of a molecular clock, trees to be varied interactively so that the user can get a much better feel for the shape of the likelihood surface. Likelihood will be able to be plotted against branch lengths for any branch.
  3. The DNAML and DNAMLK programs will reinstate the previous Categories option, where the user specified categories of rates of evolution for each site, but also retaining the present one, that infers them. The hope is to allow for variation in rate in 1st, 2nd and 3rd positions in a coding sequence (these being identified by the user) while also allowing for autocorrelated rates of evolution in adjacent codons.
  4. If possible we will find some way of correcting for purine/pyrimidine richness variations among species, within the framework of the maximum likelihood programs. That the maximum likelihood programs do not allow for base composition variation is their major limitation at the moment.
  5. Inclusion of some kind of protein sequence maximum likelihood program is an obvious need (right now we have Adachi and Hasegawa's program in the Unsupported Division).
  6. The Categories option of DNAML and DNAMLK will be generalized to allow for rates at sites to gradually change as one moves along the tree, in an attempt to implement Fitch and Markowitz's (1970) notion of "covarions".
  7. Obviously we need to start thinking about a more visual X windows interface, but only if that can be used on most systems.
  8. Program PENNY and its relatives will be improved so as to run faster and find all most parsimonious trees more quickly.
  9. A more sophisticated compatibility program should be included, if I can find one.
  10. An "evolutionary clock" version of CONTML will be done, and the same may also be done for RESTML.
  11. We hope gradually to generalize the tree structures in the programs to infer multifurcating trees as well as bifurcating ones.
  12. We hope to economize on the size of the source code by putting frequently used routines in a library from which they can be linked into various programs. This will also enforce a rather complete standardization of our code.
  13. We may decide to gradually move our code to an object-oriented language, most likely C++. One could describe the language that version 3.4 was written in as "Pascal", version 3.5 as "Pascal written in C", version 3.6 as "C written in C", and maybe version 3.7 as "C++ written in C" and then 3.8 as "C++ written in C++". At least that scenario is one possibility.

Much of the future development of the package will be in the DNA likelihood programs and the distance matrix programs. This is for several reasons. First, I am more interested in those problems. Second, collection of molecular data is increasing rapidly, and those programs have the most promise for future development for those data.


Maintained 15 Jul 1996 -- by Martin Hilbers(e-mail:M.P.Hilbers@dl.ac.uk)