When you run most of these programs, a menu will appear offering you choices of the various options available for that program. The data that the program reads should be in an input file called (in most cases) "infile". If there is no such file the programs will ask you for the name of the input file. Below we describe the input file format, and then the menu.
Input File Format
I have tried to adhere to a rather stereotyped input and output format.
For the parsimony, compatibility and maximum likelihood programs, excluding the
distance matrix methods, the simplest version of the input file looks something
like this:
6 13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC
The first line of the input file contains the number of species and the
number of characters, in free format, separated by blanks (not by
commas). The information for each species follows, starting with a
ten-character species name (which can include punctuation marks and blanks),
and continuing with the characters for that species. In the
discrete-character, DNA and protein sequence programs the characters are each a
single letter or digit, sometimes separated by blanks. In
the continuous-characters programs they are real numbers with decimal points,
separated by blanks:
Latimeria 2.03 3.457 100.2 0.0 -3.7
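As a rough illustration of these rules, here is a minimal Python sketch of how one line of such a file could be taken apart. It is not how the programs themselves read their input; the function names (parse_header, split_species_line, discrete_characters, continuous_characters) are invented for this example.

```python
def parse_header(line):
    """Return (number of species, number of characters) from the first line."""
    fields = line.split()            # free format: any run of blanks separates
    return int(fields[0]), int(fields[1])


def split_species_line(line, name_width=10):
    """Split a data line into the fixed-width name and the rest of the line."""
    name = line[:name_width]         # may contain blanks and punctuation
    return name, line[name_width:]


def discrete_characters(text):
    """One character per letter or digit; blanks between them are ignored."""
    return [c for c in text if not c.isspace()]


def continuous_characters(text):
    """Real numbers separated by blanks."""
    return [float(token) for token in text.split()]


nspecies, nchars = parse_header("     6   13")
name, rest = split_species_line("B. virginiTAATGTTCGT TGT")
chars = discrete_characters(rest)
values = continuous_characters(
    split_species_line("Latimeria 2.03 3.457 100.2 0.0 -3.7")[1])
```

Because the name is cut off by position rather than at the first blank, a name such as "B. virgini" keeps its internal blank and punctuation.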
The conventions about continuing the data beyond one line per species are
different between the molecular sequence programs and the others. The
molecular sequence programs can take the data in "aligned" or "interleaved"
format, with some lines giving the first part of each of the sequences, then
lines giving the next part of each, and so on. Thus the sequences might look
like this:
    6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC
Note that in these sequences we have a blank every ten sites to make them easier to read: any such blanks are allowed. The blank line which separates the two groups of lines (the ones containing sites 1-20 and the ones containing sites 21-39) may or may not be present, but if it is, it should be a line of zero length and not contain any extra blank characters (this is because of a limitation of the current versions of the programs). It is important that the number of sites in each group be the same for all species (i.e., it will not be possible to run the programs successfully if the first species line contains 20 bases, but the first line for the second species contains 21 bases).
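The interleaved layout can be reassembled into one sequence per species along the following lines. This is only a sketch in Python, not the programs' own reading code; read_interleaved is an invented name, the example is cut down to two of the species above, and the sketch skips any all-blank separator line, which is more forgiving than the programs are.

```python
def read_interleaved(lines, name_width=10):
    """Return {name: sequence} from interleaved lines (header line first)."""
    nspecies, nsites = (int(x) for x in lines[0].split()[:2])
    body = [ln for ln in lines[1:] if ln.strip()]    # drop separator lines
    names, parts = [], []
    for i, line in enumerate(body):
        if i < nspecies:                  # first group carries the names
            names.append(line[:name_width])
            parts.append([])
            data = line[name_width:]
        else:                             # later groups hold data only
            data = line
        parts[i % nspecies].append("".join(data.split()))
    alignment = {}
    for name, pieces in zip(names, parts):
        seq = "".join(pieces)
        if len(seq) != nsites:
            raise ValueError(
                f"{name!r}: expected {nsites} sites, got {len(seq)}")
        alignment[name] = seq
    return alignment


example = [
    "    2   39",
    "Archaeopt CGATGCTTAC CGCCGATGCT",
    "HesperorniCGTTACTCGT TGTCGTTACT",
    "",
    "TACCGCCGAT GCTTACCGC",
    "CGTTGTCGTT ACTCGTTGT",
]
alignment = read_interleaved(example)
```

The line position within each group (i % nspecies) is what ties a nameless continuation line back to its species, which is why every group must list the species in the same order.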
Alternatively, an option can be selected to take the data in "sequential" format, with all of the data for the first species, then all of the characters for the next species, and so on. This is also the way that the discrete characters programs and the gene frequencies and quantitative characters programs want to read the data. They do not allow the "interleaved" format.
In the sequential format, the character data can run on to a new line at any time (except in a species name or in the case of continuous character and distance matrix programs where you cannot go to a new line in the middle of a real number). Thus it is legal to have:
Archaeopt 001100
1101
or even:
Archaeopt 
0011001101
though note that the FULL ten characters of the species name MUST then be present: in the above case there must be a blank after the "t". In all cases it is possible to put internal blanks between any of the character values, so that
Archaeopt 0011001101 0111011100
is allowed.
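A reader for the sequential format might look roughly like the following Python sketch. It covers only single-character data, and read_sequential is an invented name, not part of the package. It keeps consuming lines for a species until the declared number of characters has been seen, and it insists on the full ten-character name.

```python
def read_sequential(lines, name_width=10):
    """Return [(name, characters)] from sequential-format lines."""
    nspecies, nchars = (int(x) for x in lines[0].split()[:2])
    records, pos = [], 1
    for _ in range(nspecies):
        line = lines[pos]
        pos += 1
        if len(line) < name_width:
            raise ValueError(
                f"species name must fill all {name_width} characters: {line!r}")
        name = line[:name_width]
        chars = "".join(line[name_width:].split())   # internal blanks ignored
        while len(chars) < nchars:                   # data may run on
            chars += "".join(lines[pos].split())
            pos += 1
        if len(chars) != nchars:
            raise ValueError(
                f"{name!r}: expected {nchars} characters, got {len(chars)}")
        records.append((name, chars))
    return records


split_data = read_sequential(["1 10", "Archaeopt 001100", "1101"])
name_alone = read_sequential(["1 10", "Archaeopt ", "0011001101"])
with_blank = read_sequential(["1 20", "Archaeopt 0011001101 0111011100"])
```

All three calls correspond to the layouts shown above; the second one succeeds only because the name line ends with the blank after the "t".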
If you make an error in the input file, the programs will often detect that they have been fed an illegal character or illegal numerical value and issue an error message such as "BAD CHARACTER STATE:", often printing out the bad value, and sometimes the number of the species and character in which it occurred. The program will then stop shortly after. One of the things which can lead to a bad value is the omission of something earlier in the file, or the insertion of something superfluous, either of which causes the reading of the file to get out of synchronization. The program then starts reading things it didn't expect, and concludes that they are in error. So if you see this error message, you may also want to look for an earlier problem that may have led to it.
The other major variation on the input data format is the options information. Many options are selected using the menu, but a few are selected by including extra information in the input file. Some options are described below.
Note the "Terminal type" entry, which you will find on all menus. It
allows you to specify which type of terminal your screen is. The options are
an IBM PC screen, an ANSI standard terminal (such as a DEC VT100), a DEC VT52-
compatible terminal, such as a Zenith Z29, or no terminal type. Choosing "0"
toggles among these four options in cyclical order, changing each time the "0"
option is chosen. If one of them is right for your terminal the screen will be
cleared before the menu is displayed. If none works the "none" option should
probably be chosen. Keep in mind that VT52-compatible terminals can freeze up
if they receive the screen-clearing commands for the ANSI standard terminal!
If this is a problem it may be helpful to recompile the program, setting the
constants near its beginning so that the program starts up with the VT52 option
set.
The other numbered options control which information the program will
display on your screen or on the output files. The option to "Print
indications of progress of run" will show information such as the names of the
species as they are successively added to the tree, and the progress of global
rearrangements. You will usually want to see these as reassurance that the
program is running and to help you estimate how long it will take. But if you
are running the program "in background" as can be done on multitasking and
multiuser systems such as Unix, and do not have the program running in its own
window, you may want to turn this option off so that it does not disturb your
use of the computer while the program is running.
The exact contents of the output file vary from program to program and
also depend on which menu options you have selected. For many programs, if you
select all possible output information, the output will consist of (1) the name
of the program and its version number, (2) the input information printed out,
(3) a series of phylogenies, some with associated information indicating how
much change there was in each character or on each part of the tree.
For programs estimating branch lengths, these are given in the trees in
the tree file as real numbers following a colon, and placed immediately after
the group descended from that branch.
These representations of trees are a subset of the standard adopted on
June 24, 1986 at the annual meetings of the Society for the Study of Evolution
at a meeting (the final session in a local lobster restaurant) of an informal
committee consisting of Wayne Maddison (MacClade), David Swofford (PAUP), F.
James Rohlf (NTSYS-PC), Chris Meacham (COMPROB and plotting programs), James
Archie (character coding program), William H.E. Day, and me. This standard is
a generalization of PHYLIP's format, itself based on a well-known
representation of trees in terms of parenthesis patterns which has been around
for almost a century. The standard is now employed by most phylogeny computer
programs but unfortunately has yet to be described in a formal published
description.
The Options Menu
The menu is straightforward. It typically looks like this (this one is
for DNAPARS):
DNA parsimony algorithm, version 3.5c
Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species 1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes
Are these settings correct? (type Y or the letter for one to change)
If you want to accept the default settings (they are shown in the above case)
you can simply type "Y" followed by a carriage-return (Enter) character. If
you want to change any of the options, you should type the letter shown to the
left of its entry in the menu. For example, to set a threshold type "T".
Lower-case letters will also work. For many of the options the program will
ask for supplementary information, such as the value of the threshold.
The Output File
Most of the programs write their output onto a file called (usually)
"outfile", and a representation of the trees found onto a file called
"treefile". A typical rooted tree, as printed in the output file, looks like
this:
                                       +-------------------Gibbon
          +----------------------------2
          !                            !      +------------------Orang
          !                            +------4
          !                                   !  +---------Gorilla
    +-----3                                   +--6
    !     !                                      !    +---------Chimp
    !     !                                      +----5
  --1     !                                           +-----Human
    !     !
    !     +-----------------------------------------------Mouse
    !
    +------------------------------------------------Bovine
The interpretation of the tree is fairly straightforward: it "grows" from left
to right. The numbers at the forks are arbitrary and are used (if present)
merely to identify the forks. In some of the programs asterisks ("*") are used
instead of numbers. For many of the programs the tree produced is unrooted.
It is printed out in nearly the same form, but with a warning message:
remember: this is an unrooted tree!
The warning message ("remember: ...") indicates that this is an unrooted tree
(mathematicians still call this a tree, though some systematists unfortunately
use the term "network". This conflicts with standard mathematical usage, which
reserves the name "network" for a completely different kind of graph). The
root of this tree could be anywhere, say on the line leading immediately to
Mouse. As an exercise, see if you can tell whether the following tree is or is
not a different one from the above:
               +-----------------------------------------------Mouse
               !
     +---------4                                   +------------------Orang
     !         !                            +------3
     !         !                            !      !       +---------Chimp
  ---6         +----------------------------1      !  +----2
     !                                      !      +--5    +-----Human
     !                                      !         !
     !                                      !         +---------Gorilla
     !                                      !
     !                                      +-------------------Gibbon
     !
     +-------------------------------------------Bovine
remember: this is an unrooted tree!
(it is NOT different). It is IMPORTANT also to realize that the lengths of the
segments of the printed tree may not be significant: some may actually
represent branches of zero length, in the sense that there is no evidence that
the branches are nonzero in length. Some of the diagrams of trees attempt to
print branches approximately proportional to estimated branch lengths, while in
others the lengths are purely conventional and are presented just to make the
topology visible. You will have to look closely at the documentation that
accompanies each program to see what it presents and what is known about the
lengths of the branches on the tree. The above tree attempts to represent
branch lengths approximately in the diagram. But even in those cases, some of
the smaller branches are likely to be artificially lengthened to make the tree
topology clearer. Here is what a tree from DNAPARS looks like, when no attempt
is made to make the lengths of branches in the diagram proportional to
estimated branch lengths:
                 +--Human
              +--5
           +--4  +--Chimp
           !  !
        +--3  +-----Gorilla
        !  !
     +--2  +--------Orang
     !  !
  +--1  +-----------Gibbon
  !  !
--6  +--------------Mouse
  !
  +-----------------Bovine
remember: this is an unrooted tree!
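The exercise of deciding whether two printed trees are the same unrooted tree can also be checked mechanically. One common approach, sketched here in Python, is to list the partitions of the species induced by the interior branches and compare those sets, since they do not depend on where the root is drawn. The nested tuples below were transcribed by hand from the diagrams above, and the function names are invented for this sketch.

```python
def leaves(tree):
    """Set of species names at or below this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Nontrivial two-way partitions of the species implied by the tree."""
    everyone = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = everyone - side
        if len(side) > 1 and len(other) > 1:    # skip root and tip branches
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


apes = ("Orang", ("Gorilla", ("Chimp", "Human")))
first = ((("Gibbon", apes), "Mouse"), "Bovine")
second = (("Mouse", (("Orang", (("Chimp", "Human"), "Gorilla")), "Gibbon")),
          "Bovine")
rerooted = (("Mouse", "Bovine"), (apes, "Gibbon"))
swapped = ((("Gibbon", ("Gorilla", ("Orang", ("Chimp", "Human")))), "Mouse"),
           "Bovine")

same_as_printed = splits(first) == splits(second)
same_when_rerooted = splits(first) == splits(rerooted)
differs_when_swapped = splits(first) != splits(swapped)
```

Here "rerooted" places the root between the Mouse-Bovine pair and the rest, and "swapped" exchanges Orang and Gorilla, which does change the tree.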
Some of the parsimony programs in the package can print out a table of the
number of steps that different characters (or sites) require on the tree. This
table may not be obvious at first. A typical example looks like this:
steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1
The numbers across the top and down the side indicate which site is being
referred to. Thus site 23 is column "3" of row "20" and has 1 step in this
case. Row "0" has no entry in column "0" because there is no site 0.
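The lookup rule amounts to splitting the site number into its tens and its units. A small Python sketch, with the table above typed in by hand and with invented names (STEPS, table_position, steps_at), makes this concrete:

```python
# Steps table transcribed from the example; None marks the empty cell
# for the nonexistent site 0.
STEPS = {
    0:  [None, 2, 2, 2, 2, 1, 1, 2, 2, 1],
    10: [1, 2, 3, 1, 1, 1, 1, 1, 1, 2],
    20: [1, 2, 2, 1, 2, 2, 1, 1, 1, 2],
    30: [1, 2, 1, 1, 1, 2, 1, 3, 1, 1],
    40: [1],
}


def table_position(site):
    """Return the (row label, column label) under which a site is printed."""
    tens, units = divmod(site, 10)
    return tens * 10, units


def steps_at(site):
    """Number of steps recorded for a site in the transcribed table."""
    row, column = table_position(site)
    return STEPS[row][column]


row, column = table_position(23)
steps_for_site_23 = steps_at(23)
```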
The Tree File
In output from most programs, a representation of the tree is also written
into the tree file (usually named "treefile"). The tree is specified by
nested pairs of parentheses enclosing names, separated by commas. If there
are any blanks in the names, these must be replaced by the underscore character
"_". Trailing blanks in the name may be omitted. The pattern of the
parentheses indicates the pattern of the tree by having each pair of
parentheses enclose all the members of a monophyletic group. The tree file for
the above tree would have its first line look like this:
((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon));
In the above tree the first fork separates the lineage leading to Mouse and
Bovine from the lineage leading to the rest. Within the latter group there is
a fork separating Gibbon from the rest, and so on. The entire tree is enclosed
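in an outermost pair of parentheses. The tree ends with a semicolon. In some
programs such as DNAML, FITCH, and CONTML, the tree will be completely unrooted
and specified by a bottommost fork with a three-way split, with three
"monophyletic" groups separated by two commas: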
(A,(B,(C,D)),(E,F));
The three "monophyletic" groups here are A, (B,C,D), and (E,F). The single
three-way split corresponds to one of the interior nodes of the unrooted tree
(it can be any interior node). The remaining forks are encountered as you move
out from that first node, and each then appears as a two-way split. You should
check the documentation files for the particular programs you are using to see
which of these forms you can expect the user tree to be in. Note that many
of the programs that estimate an unrooted tree produce trees in the treefile in
rooted form! This is done for reasons of arbitrary internal bookkeeping. The
placement of the root is arbitrary. A typical tree with branch lengths, each
given as a real number following a colon, looks like this:
((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);
Note that the tree may continue to a new line at any time except in the middle
of a name or the middle of a branch length, although in trees written to the
tree file this will only be done after a comma.
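To make the parenthesis notation concrete, here is a minimal Python sketch of a reader for trees of this kind. It handles only the subset described here (names, nested groups, optional branch lengths after a colon, a final semicolon), and the names parse_tree and tip_names are invented for the example; it is not the code the programs use.

```python
def parse_tree(text):
    """Parse a tree-file string into nested (name, length, children) tuples."""
    s = "".join(text.split())        # line breaks and blanks carry no meaning
    if not s.endswith(";"):
        raise ValueError("tree must end with a semicolon")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1                 # step over "("
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        name = s[start:pos].replace("_", " ") or None   # "_" stands for a blank
        length = None
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return name, length, children

    root = node()
    if s[pos] != ";":
        raise ValueError(f"unexpected {s[pos]!r} at position {pos}")
    return root


def tip_names(node):
    """Species names at the tips, in the order they appear in the file."""
    name, _, children = node
    if not children:
        return [name]
    return [tip for child in children for tip in tip_names(child)]


carnivores = parse_tree("""((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);""")
unrooted = parse_tree("(A,(B,(C,D)),(E,F));")
```

In the result, a rooted tree shows up as a root with two children and a completely unrooted one as a root with three, matching the two forms described above.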