Relative speed of the different programs
C compilers differ in efficiency of the code they generate, and some deal
with some features of the language better than with others. Thus a program
which is unusually fast on one computer may be unusually slow on another.
Nevertheless, as a rough guide to relative execution speeds, I have tested the
programs on three data sets, each of which has 10 species and 20 characters.
The first is an imaginary one in which all characters are compatible ("The
Willi Hennig Memorial Data Set," as J. S. Farris once called it). The
second is the binary recoded form of the fossil horses data set of Camin
and Sokal (1965). The third data set is completely random: 10 species and
20 characters, with a 50% chance that each character state is 0 or 1 (or
A or C). The data sets range from a completely compatible one in which
there is no homoplasy (parallelism or convergence), through the horses
data set, which requires 29 steps where the minimum possible number would
be 20, to the random data set, which requires 49 steps. We can thus see
how this increasing messiness of the data affects running times.
Here are the nucleotide sequence versions of the three data sets:
   10   20
A         CACACACAAAAAAAAAAACA
B         CACACAACAAAAAAAAAACA
C         CACAACAAAAAAAAAAAACA
D         CAACAAAACAAAAAAAAACA
E         CAACAAAAACAAAAAAAACA
F         ACAAAAAAAACACACAAAAC
G         ACAAAAAAAACACAACAAAC
H         ACAAAAAAAACAACAAAAAC
I         ACAAAAAAAAACAAAACAAC
J         ACAAAAAAAAACAAAAACAC

   10   20
MesohippusAAAAAAAAAAAAAAAAAAAA
HypohippusAAACCCCCCCAAAAAAAAAC
ArchaeohipCAAAAAAAAAAAAAAAACAC
ParahippusCAAACAACAACAAAAAAAAC
MerychippuCCAACCACCACCCCACACCC
M. secunduCCAACCACCACCCACACCCC
Nannipus  CCAACCACAACCCCACACCC
NeohippariCCAACCCCCCCCCCACACCC
Calippus  CCAACCACAACCCACACCCC
PliohippusCCCACCCCCCCCCACACCCC

   10   20
A         CACACAACCAAACAAACCAC
B         AAACCACACACACAAACCCA
C         ACAAAACCAAACCACCCACA
D         AAAAACACAACACACCAAAC
E         AAACAACCACACACAACCAA
F         CCCAAACACCCCCAAAAAAC
G         ACACCCCCACACCCACCAAC
H         AAAACAACAACCACCCCACC
I         ACACAACAACACAAACAACC
J         CCAAAAACACCCAACCCAAC

Here are the timings of many of the version 3.5 programs on these three
data sets as run after being compiled by Microsoft Quick C on a 16 MHz
80386SX computer under PCDOS 5.0. An 80387 math co-processor was present
and was used by the compiled code.
              Hennigian Data    Horses Data    Random Data
PROTPARS           82.83           86.23          148.03
DNAPARS             5.98            5.66           11.54
DNAPENNY           46.03           23.51         5305.97
DNACOMP             7.14            6.43           11.86
DNAINVAR            0.61            0.66            0.61
DNAML            1928.99         2069.32         2611.48
DNAMLK           2247.12         6094.81         4993.00
DNADIST             3.57            4.50            5.38
RESTML           6818.34        13422.15        28418.34
FITCH              35.92           48.61           38.17
KITSCH             12.42           12.36           13.18
NEIGHBOR            2.20            2.14            2.903
CONTML             56.85           57.56           59.15
GENDIST             1.00            1.00            1.00
MIX                13.62           14.60           25.92
PENNY               8.41           21.31         3851.1
DOLLOP             26.69           26.86           46.30
DOLPENNY           12.25           56.57        23934.22
CLIQUE              0.77            0.71            0.77
FACTOR              0.39            0.44            0.44

In all cases the programs were run under the default options, except as
specified here. The data sets used for the discrete characters programs
have 0's and 1's instead of A's and C's. For CONTML the 0's and 1's were
made into 0.0's and 1.0's and considered as 20 2-allele loci. For the
distance programs 10 x 10 distance matrices were computed from the three
data sets. It does not make much sense to benchmark MOVE, DOLMOVE, or
DNAMOVE, although when there are many characters and many species the
response time after each alteration of the tree should be proportional to
the product of the number of species and the number of characters. For
DNAML and DNAMLK the frequencies of the four bases were set to be equal
rather than determined empirically as is the default. For RESTML the
number of enzymes was set to 1.
Several patterns will be apparent from this. The algorithms (MIX, DOLLOP,
CONTML, FITCH, KITSCH, PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, and
RESTML) that use the above-described addition strategy have run times that
do not depend strongly on the messiness of the data. The only exception is
that if a data set such as the Random data requires an extra round of
global rearrangements, it takes longer. The programs differ greatly in run
time: the likelihood programs RESTML, DNAML, and CONTML are quite a bit
slower than the others. The protein sequence parsimony program, which has
to do a considerable amount of bookkeeping to keep track of which amino
acids can mutate to each other, is also relatively slow.
Another class of algorithms includes PENNY, DOLPENNY, DNAPENNY and CLIQUE.
These are branch-and-bound methods: in principle they should have
execution times that rise exponentially with the number of species and/or
characters, and they might be much more sensitive to messy data. This is
apparent with PENNY, DOLPENNY, and DNAPENNY, which go from being
reasonably fast with clean data to very slow with messy data. DOLPENNY is
particularly slow on messy data -- this is because its algorithm cannot
make use of some of the lower-bound calculations that are possible with
DNAPENNY and PENNY. CLIQUE is very fast on all data sets. Although in
theory it should bog down if the number of cliques in the data is very
large, that does not happen with random data, which in fact has few
cliques, and small ones at that. Apparently the "worst-case" data sets are
much rarer for CLIQUE than for the other branch-and-bound methods.
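The pruning idea can be illustrated with a toy sketch (a generic
branch-and-bound illustration of my own, not code from any of these
programs). A partial solution is abandoned as soon as its cost plus an
optimistic lower bound on the remaining cost already matches or exceeds
the best complete solution found so far; the weaker the bound, the more of
the search tree must be examined:

```python
def branch_and_bound(groups):
    """Toy branch and bound: pick one cost from each group so that the
    total is minimal.  A partial choice is pruned as soon as its cost plus
    an optimistic lower bound on the rest (the sum of each remaining
    group's minimum) already matches or exceeds the best complete total."""
    # Precompute the lower bound on the cost still to come after level i.
    remaining = [0.0] * (len(groups) + 1)
    for i in range(len(groups) - 1, -1, -1):
        remaining[i] = remaining[i + 1] + min(groups[i])

    best = float("inf")
    expanded = 0          # nodes actually visited by the search

    def search(level, cost):
        nonlocal best, expanded
        expanded += 1
        if cost + remaining[level] >= best:
            return        # the bound test: this whole subtree is hopeless
        if level == len(groups):
            best = cost
            return
        for c in sorted(groups[level]):
            search(level + 1, cost + c)

    search(0, 0.0)
    return best, expanded

best, expanded = branch_and_bound([[3, 1], [4, 1], [5, 9]])
# best == 7; only 7 of the 15 possible search nodes are expanded
```

Weakening the bound (for example, replacing `remaining` with zeros) makes
the same search expand far more nodes -- the toy analogue of why a program
that must forgo some lower-bound calculations, as DOLPENNY must, is so
much slower on messy data.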
NEIGHBOR is quite fast compared to FITCH and KITSCH, and should make it possible to run much larger cases, although the results are expected to be a bit rougher than with those programs.
An exception to the scaling rules for numbers of species and characters
will be some of the DNA programs, which use an aliasing device to save
execution time. In these programs execution time will not necessarily
increase in proportion to the number of sites, as sites that show the same
pattern of nucleotides will be detected as identical and the calculations
for them will be done only once. This is particularly likely to happen
with few species and many sites, or with data sets that have small amounts
of evolutionary divergence.
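The aliasing device can be sketched in a few lines (my own illustration of
the general idea, not PHYLIP's actual bookkeeping): collapse the
alignment's sites into distinct patterns with weights, do the per-pattern
work once, and multiply by the weight:

```python
from collections import Counter

def collapse_patterns(alignment):
    """Collapse alignment columns into unique site patterns with counts.

    alignment: list of equal-length sequences (rows = species).
    Returns (patterns, weights): per-pattern work is then done once and
    multiplied by the pattern's weight instead of once per site.
    """
    columns = zip(*alignment)          # one tuple of states per site
    counts = Counter(columns)
    patterns = list(counts.keys())
    weights = [counts[p] for p in patterns]
    return patterns, weights

# Two species, 6 sites, but only 3 distinct site patterns:
pats, wts = collapse_patterns(["AACCGA", "AACCTA"])
# len(pats) == 3, and the weights sum to the 6 original sites
```

With few species and little divergence, only a small fraction of the
possible patterns ever occurs, so the number of distinct patterns -- and
hence the run time -- can be far below the number of sites.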
For programs FITCH and KITSCH the distance matrix is square, so that when
we double the number of species we also double the number of "characters";
running times will therefore go up as the fourth power of the number of
species rather than the third power. Thus a 20-species case with FITCH is
expected to run sixteen times more slowly than a 10-species case.
For programs like PENNY and CLIQUE the run times will rise faster than the
cube of the number of species (in fact, they can rise faster than any power
since these algorithms are not guaranteed to work in polynomial time). In
practice, PENNY will frequently bog down above 11 species, while CLIQUE easily
deals with larger numbers.
For NEIGHBOR the speed should vary only as the square of the number of
species, so a case twice as large will take only four times as long. This will
make it an attractive alternative to FITCH and KITSCH for large data sets.
If you are unsure of how long a program will take, try it first on a few
species, then work your way up until you get a feel for the speed and for
what size problems you can afford to run.
Execution time is not the most important criterion for a program,
particularly as computer time gets much cheaper than your time or a
programmer's time. With workstations on which background jobs can be run all
night, execution speed is not overwhelmingly relevant. Some of us have been
conditioned by an earlier era of computing to consider execution speed
paramount. But ease of use, ease of adaptation to your computer system, and
ease of modification are much more important in practice, and in these respects
I think these programs are adequate. Only if you are engaged in 1960's style
mainframe computing is minimization of execution time paramount.
Nevertheless it would have been nice to have made the programs faster.
The present speeds are a compromise between speed and effectiveness: by making
them slower and trying more rearrangements in the trees, or by enumerating all
possible trees, I could have made the programs more likely to find the best
tree. By trying fewer rearrangements I could have sped them up, but at the
cost of finding worse trees. I could also have sped them up by writing
critical sections in assembly language, but this would have sacrificed ease of
distribution to new computer systems. There are also some options included in
these programs that make it harder to adopt some of the economies of
bookkeeping that make other programs faster. However, to some extent I have
simply made the decision not to spend time trying to speed up program
bookkeeping when there were new likelihood and statistical methods to be
developed.
The tables below use two separate standards, not because of the different
languages but because different versions of the package are being
compared. Thus the "Time" is the ratio of the Total to that for the 386SX
for the appropriate standard: the Time for the Macintosh Classic running
DNAPARS 3.4 under Think Pascal 3 is compared to the Time for the 386SX
running DNAPARS 3.4 under Turbo Pascal 6.0, while the Time for the
Macintosh Classic running version 3.5 under Think C is compared to the
Time for the 386SX running version 3.5 under Quick C. The Speed is the
reciprocal of the Time.
Note that parallel machines like the Sequent are not really as slow as
indicated by the data here, as these runs did nothing to take advantage of
their parallelism.
For a picture of speeds on a more numerically intensive program, a second
table below gives benchmarks using DNAML, with the 16 MHz 386SX with math
co-processor active as the standard. Numbers are total run times (total
user time in the case of Unix) over all three data sets.
Speed with different numbers of species
How will the speed depend on the number of species and the number of
characters? For the sequential-addition algorithms, the speed should be
proportional to the cube of the number of species, and to the number of
characters. Thus a case that has, instead of 10 species and 20 characters, 20
species and 50 characters would take 2 x 2 x 2 x 2.5 = 20 times as long. This
implies that cases with more than 20 species will be slow, and cases with more
than 40 species VERY slow. This places a premium on working on small
subproblems rather than just dumping a whole large data set into the programs.
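These rules of thumb can be turned into a rough calculator (my own sketch;
the exponents are those given in this documentation, and real timings will
of course vary with machine and compiler):

```python
def extrapolate(t_small, n_small, c_small, n_big, c_big,
                species_power, chars_power=1):
    """Rough run-time extrapolation from a small trial run.

    t_small: measured time on (n_small species, c_small characters).
    species_power: 3 for the sequential-addition programs,
                   4 for FITCH/KITSCH (square distance matrix),
                   2 for NEIGHBOR (with chars_power=0, since it works
                   from a distance matrix, not characters).
    """
    return (t_small
            * (n_big / n_small) ** species_power
            * (c_big / c_small) ** chars_power)

# The worked example from the text: going from 10 species/20 characters
# to 20 species/50 characters with a sequential-addition program should
# take about 2 x 2 x 2 x 2.5 = 20 times as long:
factor = extrapolate(1.0, 10, 20, 20, 50, species_power=3)
# factor == 20.0
```

The same function gives the sixteen-fold FITCH slowdown for a doubling of
species (species_power=4) and the four-fold NEIGHBOR slowdown
(species_power=2, chars_power=0).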
Relative speed of different machines
It is interesting to compare different machines using DNAPARS as the
standard task. One can rate a machine on the DNAPARS benchmark by summing the
times for all three of the data sets. Here are relative total timings over all
three data sets (done with various versions of DNAPARS) for some machines,
taking Microsoft Quick C running under PCDOS on a 16 MHz 80386SX clone as the
standard. Pascal benchmarks from version 3.4 of the program are also included
-- they are compared only with each other and their times are in parentheses.
Machine OS Compiler Total Time Speed
------- -- -------- ----- ---- -----
Toshiba T1100+ PCDOS Turbo Pascal 3.01A (269) 7.912 0.126
Apple Mac Plus MacOS Lightspeed Pascal 2 (175.84) 5.172 0.193
Toshiba T1100+ PCDOS Turbo Pascal 5.0 (162) 4.765 0.210
Macintosh Classic MacOS Think Pascal 3 (160) 4.706 0.212
Macintosh Classic MacOS Think C 43.0 3.58 0.279
IBM PS2/60 PCDOS Turbo Pascal 5.0 (58.76) 1.728 0.579
80286 (12 Mhz) PCDOS Turbo Pascal 5.0 (47.09) 1.385 0.722
Apple Mac IIcx MacOS Think Pascal 3 (42) 1.235 0.810
Apple Mac SE/30 MacOS Think Pascal 3 (42) 1.235 0.810
Apple Mac IIcx MacOS Lightspeed Pascal 2 (39.84) 1.172 0.853
Apple Mac IIcx MacOS Lightspeed Pascal 2# (39.69) 1.167 0.857
Zenith Z386 (16MHz) PCDOS Turbo Pascal 5.0 (38.27) 1.155 0.866
Macintosh SE/30 MacOS Think C 13.6 1.132 0.883
80386SX (16 MHz) PCDOS Turbo Pascal 6.0 (34) 1.0 1.0
80386SX (16 MHz) PCDOS Microsoft Quick C 12.01 1.0 1.0
Sequent S-81 DYNIX Silicon Valley Pascal (13.0) 0.382 2.615
VAX 11/785 Unix Berkeley Pascal (11.9) 0.35 2.857
80486-33 PCDOS Turbo Pascal 6.0 (11.46) 0.337 2.967
Sun 3/60 SunOS Sun C 3.93 0.327 3.056
NeXT Cube (68030) Mach Gnu C 2.608 0.217 4.605
Sequent S-81 DYNIX Sequent Symmetry C 2.604 0.217 4.612
VAXstation 3500 Unix Berkeley Pascal (7.3) 0.215 4.658
Sequent S-81 DYNIX Berkeley Pascal (5.6) 0.1647 6.07
Unisys 7000/40 Unix Berkeley Pascal (5.24) 0.1541 6.49
VAX 8600 VMS DEC VAX Pascal (3.96) 0.1165 8.59
Sun SPARC IPX SunOS Gnu C version 2.1 1.28 0.1066 9.383
VAX 6000-530 VMS DEC C 0.858 0.0714 13.998
VAXstation 4000 VMS DEC C 0.809 0.0674 14.845
IBM RS/6000 540 AIX XLP Pascal (2.276) 0.0669 14.94
NeXTstation(040/25) Mach Gnu C 0.75 0.0624 16.013
Sun SPARC IPX SunOS Sun C 0.68 0.0566 17.662
486DX (33 MHz) Linux Gnu C # 0.63 0.0525 19.063
Sun SPARCstation-1+ Unix Sun Pascal (1.7) 0.05 20.00
DECstation 5000/200 Unix DEC Ultrix C 0.45 0.0375 26.69
Sun SPARC 1+ SunOS Sun C 0.40 0.0333 30.025
DECstation 3100 Unix DEC Ultrix RISC Pascal (0.77) 0.0226 44.16
IBM 3090-300E AIX Metaware High C 0.27 0.0225 44.48
DECstation 5000/125 Unix DEC Ultrix RISC C 0.267 0.0222 44.98
DECstation 5000/200 Unix DEC Ultrix RISC C 0.256 0.0222 44.98
Sun SPARC 4/50 SunOS Sun C 0.249 0.02073 48.23
DEC 3000/400 AXP Unix DEC C 0.224 0.01865 53.62
DECstation 5000/240 Unix DEC Ultrix RISC C 0.1889 0.01573 63.58
SGI Iris R4000 Unix SGI C 0.184 0.01532 65.27
IBM 3090-300E VM Pascal VS (0.464) 0.0136 73.28
DECstation 5000/200 Unix DEC Ultrix RISC Pascal (0.39) 0.0114 87.18
The Toshiba T1100+ should be exactly as fast as an 8 MHz PC clone. For a
couple of the machines I am not sure that this benchmark is representative of
timings on non-numerical programs in PHYLIP. This is particularly the case for
the DEC 3000/400 AXP (the DEC "Alpha") which is probably quite a bit faster
than indicated here. The numerical programs benchmark below gives it a fairer
test. The IBM RS/6000 is probably up to ten times faster than shown here: it
may have been ill-served by its Pascal compiler.
Operating
Machine System Compiler Seconds Time Speed
------- --------- -------- ------- ---- -----
386SX 16 MHz PCDOS Turbo Pascal 6 (7826) 1.0 1.0
386SX 16 MHz PCDOS Quick C 6549.79 1.0 1.0
Compudyne 486DX/33 Linux Gnu C 1599.9 0.2441 4.096
SUN Sparcstation 1+ SunOS Sun C 1402.8 0.2142 4.669
Everex STEP 386/20 PCDOS Turbo Pascal 5.5 (1440.8) 0.1841 5.432
486DX/33 PCDOS Turbo C++ 1107.2 0.1690 5.916
Compudyne 486DX/33 PCDOS Waterloo C/386 1045.78 0.1597 6.263
Sun SPARCstation IPX SunOS Gnu C 960.2 0.1466 6.821
NeXTstation(68040/25) Mach Gnu C 916.6 0.1399 7.146
486DX/33 PCDOS Waterloo C/386 861.0 0.1314 7.607
Sun SPARCstation IPX SunOS Sun C 787.7 0.1203 8.315
486DX/33 PCDOS Gnu C 650.9 0.0994 10.063
VAX 6000-530 VMS DEC C 637.0 0.0973 10.282
DECstation 5000/200 Unix DEC Ultrix RISC C 423.3 0.0646 15.473
IBM 3090-300E AIX Metaware High C 201.8 0.0308 32.46
Convex C240/1024 Unix C 101.6 0.01551 64.47
DEC 3000/400 AXP Unix DEC C 98.29 0.01501 66.64
You are invited to send me figures for your machine for inclusion in
future tables. Use the data sets above and compute the total times for DNAPARS
and for DNAML for the three data sets (setting the frequencies of the four
bases to 0.25 each for the DNAML runs). Be sure to tell me the name and
version of your compiler, and the version of PHYLIP you tested.
Published Benchmarks
Some of you may have seen the "benchmark" published by Luckow and Pimentel
(1985). PHYLIP's WAGNER (an immediate ancestor of MIX) did not do well in it,
either in terms of the quality of result or execution speed. I do not believe
that this was a fair benchmark. WAGNER was run only with one order of input
species, not ten as recommended here. Had it been, perhaps the shortest tree
would have been found more often. No credit was given to PHYLIP in that
article for its free distribution, availability on microcomputers, availability
in source code form, or portability to new computers. Pimentel's laboratory
commissioned the development of a competing package, PHYSYS, which is a
commercial product, and that involvement was not stated in the article.
The benchmarks by Fink (1986) are fairer, although that article gives some impressions which do not apply to the present version. In particular, I have since added to many of the programs the ability to save multiple equally-parsimonious trees, and have changed the outputs so that reconstruction of states in the hypothetical ancestral nodes is much easier, thus answering Fink's major criticisms. I have since eliminated the Metropolis annealing method algorithms which he criticized. I disagree with Fink's view of PHYLIP that one should "be wary of published results from an analysis using it", as I do not think that a tree slightly longer than the most parsimonious one should be rejected out of hand. Nor do I agree that "it is really too slow to use as a teaching tool", as in teaching one uses small data sets and speed is not of the essence. Rather, simplicity of user interface is paramount, and there PHYLIP does very well; so does its ability to run on a variety of computers, in which respect PHYLIP is also superior. In fact, it is widely used as a teaching tool.
Nevertheless MIX is undoubtedly not as fast or as sophisticated as PAUP or Hennig86. The present version of PHYLIP is closer to its competitors in quality of result than was the version Fink reviewed.
Platnick's (1987) benchmarks concentrated, as did the other benchmarkers (all of them members of the same school of systematists), on parsimony as the only phylogeny criterion worthy of attention. He concluded that PHYLIP could be used effectively, especially if up to ten different input orders of species were used. Again, as with the other benchmarks, no credit was given for diversity of methods, portability, price, or availability of source code.
Platnick's second benchmark paper (1989) concentrates on Hennig86 and PAUP, and concludes that PHYLIP has not kept up with those programs in its features. Again, the review is entirely concerned with parsimony, and only the barest mention is made of ... (you can complete this sentence).
Sanderson's (1990) benchmark paper breaks with the method of the others by specifying 36 features of the packages rated and giving separate ratings in each. Like the other benchmark papers it concentrates almost exclusively on parsimony as applied to morphological characters, but does at least give some credit where credit is due.
My own, obviously biased, feeling is that there is a discrepancy between the benchmarkers' projections of how satisfied users of PHYLIP will be, and how satisfied they actually are. And that this discrepancy is in PHYLIP's favor.
Here are some comments about PHYLIP. Explanatory material in square
brackets is my own:
ENDORSEMENTS

From the pages of Cladistics:
"Under no circumstances can we recommend PHYLIP/WAG [their name for the
Wagner parsimony option of MIX]."
Luckow, M. and R. A. Pimentel (1985)
"PHYLIP has not proven very effective in implementing parsimony (Luckow
and Pimentel, 1985)."
J. Carpenter (1987a)

"... PHYLIP. This is the computer program where every newsletter
concerning it is mostly bug-catching, some of which have been put there by
previous corrections. As Platnick (1987) documents, through dint of much
labor useful results may be attained with this program, but I would
suggest an easier way: FORMAT b:"
J. Carpenter (1987b)

"PHYLIP is bug-infested and both less effective and orders of magnitude
slower than other programs ...."
"T. N. Nayenizgani" [J. S. Farris] (1990)

"Hennig86 [by J. S. Farris] provides such substantial improvements over
previously available programs (for both mainframes and microcomputers)
that it should now become the tool of choice for practising systematists."
N. Platnick (1989)

"The availability, within PHYLIP, of distance, compatibility, maximum
likelihood, and generalized 'invariants' algorithms (Cavender and
Felsenstein, 1987) sets it apart from other packages .... One of the
strengths of PHYLIP is its documentation ...."
Michael J. Sanderson (1990)
(Sanderson also criticizes PHYLIP for slowness and inflexibility of its parsimony algorithms, and compliments other packages on their strengths).
"This package of programs has gradually become a basic necessity to anyone
working seriously on various aspects of phylogenetic inference .... The
package includes more programs than any other known phylogeny package. But
it is not just a collection of cladistic and related programs. The package
has great value added to the whole, and for this it is unique and of
extreme importance .... its various strengths are in the great array of
methods provided ...."
Bernard R. Baum (1989)