THE OPTIONS AND HOW TO INVOKE THEM

Most of the programs allow various options that alter the amount of information the program is provided or what it is to do with the information. Most options are selected in the menu. However, a few are specified in the input file, or require part of their specification to be in the input file.

Options Information in the Input File

In such cases, the program is notified that an option has been invoked by the presence of one or more letters after the last number on the first line of the input file. These letters may or may not be separated from each other by blanks, though it is usually necessary to separate them from the number by a blank. They can be in any order. Thus to invoke options A and W, the input file starts with the line:
   12   20 WA
or:
   12   20 A W
The options are described individually in the other documents of this package. For the options that require information to be in the input file, additional information must be provided. For all but one of these, this information is provided by placing a line after the first line of the file, but before the beginning of the species data. The first character of that line should match the option letter. These auxiliary information lines can be in any order. Thus if options A and W are both invoked, both of the following formats (and two others as well) are legal:
   12   20 AW                            12   20  A W
A         0001111000                  Weights   00112221A0
Weights   00112221A0                  A         0001111000
(then the species information)        (then the species information)
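The first-line and auxiliary-line conventions just described can be sketched in a few lines of Python. This is only an illustration, not code from the package: the function names are hypothetical, the 10-character name field is an assumption read off the examples in this document, and each option is assumed to have exactly one auxiliary line.

```python
def parse_header(line):
    """Split the first line of an input file into its numbers and option letters."""
    tokens = line.split()
    numbers = []
    while tokens and tokens[0].isdigit():
        numbers.append(int(tokens.pop(0)))
    # Everything after the last number is option letters, with or without blanks.
    options = set("".join(tokens).upper())
    return numbers, options


def split_auxiliary(lines, options, name_length=10):
    """Key each auxiliary line by its first character; the order is free.

    name_length=10 is an assumption taken from the examples in this document.
    """
    aux = {}
    for line in lines:
        letter = line[:1].upper()
        if letter not in options:
            raise ValueError("line does not match an invoked option: " + line)
        aux[letter] = line[name_length:].strip()
    return aux
```

With this sketch, the lines "12   20 WA" and "12   20 A W" both yield the numbers 12 and 20 and the option set {A, W}, and the two auxiliary lines are recognized whichever comes first.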
One of the options requires special discussion. Many of the programs have in their menu the option U, which signals that one or more user-defined trees is to be provided for evaluation. This "user tree" is supplied in the input file (not the tree file), but AFTER the species data, rather than before it. It does not require any indication to be placed in the first line of the input file, as do the options that place information before the species data. After the data, there is a line containing the number of user-defined trees being defined. Each user-defined tree starts on a new line. It is in the same form as the trees in the tree files mentioned above, namely the New Hampshire standard. Here is an example with one user-defined tree:
    6   13
Archaeopt 0011001110000
Hesperorni0001101101101
Baluchithe1111011011101
B. virgini1111011101101
Brontosaur0110100111011
B.subtilis0000000011010
1
((B.subtilis,Baluchithe),((Brontosaur,B._virgini),
(Hesperorni,Archaeopt)));
In using the user tree option, check the pattern of parentheses carefully. The programs do not always detect whether the tree makes sense, and if it does not there will probably be a crash (hopefully, but not inevitably, with an error message indicating the nature of the problem).
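Since the programs may not catch a malformed tree, it can help to check the parentheses before running them. The following Python sketch is a hypothetical helper (it is not part of the package) that only counts parentheses and looks for the closing semicolon. It does not verify species names or any other aspect of the tree.

```python
def check_user_tree(text):
    """Return a list of problems found in one nested-parenthesis tree."""
    problems = []
    # A tree may continue over several lines; join them and drop whitespace.
    tree = "".join(text.split())
    if not tree.endswith(";"):
        problems.append("tree does not end with a semicolon")
    depth = 0
    for position, char in enumerate(tree):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched ')' at character %d" % (position + 1))
                break
    if depth > 0:
        problems.append("%d unclosed '('" % depth)
    return problems
```

An empty list means that the parentheses balance and the tree ends with a semicolon; the example tree above passes this check.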

Common Options in the Menu

Seven options from the menu, the U (User tree), G (Global), J (Jumble), O (Outgroup), T (Threshold), M (multiple data sets), and the tree output options, are used so widely that it is best to discuss them in this document.

The U (User tree) option

This option toggles between the default setting, which allows the program to search for the best tree, and the User tree setting, which reads a tree or trees ("user trees") from the input file and evaluates them. The user trees must follow the other information in the data set, and be preceded by a line specifying the number of user trees that are to be evaluated. Each user tree then is given in standard form, each starting on a new line. The form that the user trees must take is described in some detail below, under the description of the program output of tree files. In some cases a program may require that the trees fed in be rooted trees, even though the program cannot infer the placement of the root. In those cases you can place the root anywhere. Program RETREE can be used to convert between rooted and unrooted trees.

The G (Global) option

In the programs which construct trees (except for NEIGHBOR, the "...PENNY" programs and CLIQUE, and of course the "...MOVE" programs where you construct the trees yourself), after all species have been added to the tree a rearrangement phase ensues. In most of these programs the rearrangements are automatically global, which in this case means that subtrees will be removed from the tree and put back on in all possible ways so as to have a better chance of finding a better tree. Since this can be time-consuming (it roughly triples the time taken for a run) it is left as an option in some of the programs, specifically CONTML, FITCH, and DNAML. In these programs the G menu option toggles between the default of local rearrangement and global rearrangement. The rearrangements are explained more below.

The J (Jumble) option

In most of the tree construction programs (except for the "...PENNY" programs and CLIQUE), the exact details of the search of different trees depend on the order of input of species. In these programs the J option enables you to tell the program to use a random number generator to choose the input order of species. This option is toggled on and off by selecting option J in the menu. The program will then prompt you for a "seed" for the random number generator. The seed should be an integer between 1 and 32767, and should be of the form 4n+1, which means that it must give a remainder of 1 when divided by 4. This can be judged by looking at the last two digits of the number. Each different seed leads to a different sequence of addition of species. By simply changing the random number seed and re-running the programs one can look for other, and better, trees. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.

The Jumble option also causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run).
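The recommendation for the seed (an integer between 1 and 32767 of the form 4n+1) can be checked with a short Python sketch. The function names are hypothetical. Note also that this test is stricter than the one the text says the programs apply, which only refuse a seed that is not odd.

```python
MAX_SEED = 32767


def is_recommended_seed(seed):
    """True if the seed is in range and leaves a remainder of 1 when divided by 4."""
    return 1 <= seed <= MAX_SEED and seed % 4 == 1


def last_two_digits_rule(seed):
    """The same remainder test, using only the last two digits of the seed.

    This works because 100 is itself divisible by 4.
    """
    return (seed % 100) % 4 == 1
```

For example, the seed 4321 ends in 21, and 21 = 4 x 5 + 1, so it is of the recommended form; 4323 ends in 23, which leaves a remainder of 3, so it is not.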

The O (Outgroup) option

This specifies which species is to be used to root the tree by having it become the outgroup. This option is toggled on and off by choosing O in the menu. When it is on, the program will then prompt for the number of the outgroup (the species being taken in the numerical order that they occur in the input file). Responding by typing "6" and then a carriage-return (Enter) character indicates that the sixth species in the data is the outgroup. Outgroup-rooting will not be attempted if the data have already established a root for the tree from some other consideration, and may not be if it is a user-defined tree, despite your invoking the option. Thus programs such as DOLLOP that produce only rooted trees do not allow the Outgroup option. It is also not available in KITSCH, DNAMLK, or CLIQUE. When it is used, the tree as printed out is still listed as being an unrooted tree, though the outgroup is connected to the bottommost node so that it is easy to visually convert the tree into rooted form.

The T (Threshold) option

This sets a threshold such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps. The default is a threshold so high that it will never be surpassed. The T menu option toggles on and off asking the user to supply a threshold. The use of thresholds to obtain methods intermediate between parsimony and compatibility methods is described in my 1981b paper. When the T option is in force, the program will prompt for the numerical threshold value. This will be a positive real number greater than 1. In programs MIX, MOVE, PENNY, PROTPARS, DNAPARS, DNAMOVE, and DNAPENNY, do not use threshold values less than or equal to 1.0, as they have no meaning and lead to a tree which depends only on considerations such as the input order of species and not at all on the character state data! In programs DOLLOP, DOLMOVE, and DOLPENNY the threshold should never be 0.0 or less, for the same reason. The T option is an important and underutilized one: it is, for example, the only way in this package (except for program DNACOMP) to do a compatibility analysis when there are missing data. It is a method of de-weighting characters that evolve rapidly. I wish more people were aware of its properties.
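The effect of the threshold on the count of steps can be illustrated with a small Python sketch. It is a reading of the rule stated above, not the code the programs use, and the function name and the list of per-character step counts are hypothetical.

```python
def thresholded_steps(steps_per_character, threshold):
    """Total steps, counting any character above the threshold as the threshold."""
    return sum(min(steps, threshold) for steps in steps_per_character)
```

With step counts of 1, 2, 5 and 3 and a threshold of 2.5, the total is 1 + 2 + 2.5 + 2.5 = 8 rather than 11, so the two characters that change most often contribute less than their actual number of steps. An infinitely high threshold, as in the default, gives back the ordinary total.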

The M (Multiple data sets) option

In menu programs there is an M menu option which allows one to toggle on the multiple data sets option. The program will ask you how many data sets it should expect. The data sets have the same format as the first data set. Here is a (very small) input file with two five-species data sets:
     5    6
Alpha     CCACCA
Beta      CCAAAA
Gamma     CAACCA
Delta     AACAAC
Epsilon   AACCCA
     5    6
Alpha     CACACA
Beta      CCAACC
Gamma     CAACAC
Delta     GCCTGG
Epsilon   TGCAAT
The main use of this option will be to allow all of the methods in these programs to be bootstrapped. Using the program SEQBOOT one can take any DNA, protein, restriction sites, or binary character data set and make multiple data sets by bootstrapping. Trees can be produced for all of these using the M option. They will be written on the tree output file if that option is left in force. Then the program CONSENSE can be used with that tree file as its input file. The result is a majority rule consensus tree which can be used to make confidence intervals. The present version of the package allows, with the use of SEQBOOT and CONSENSE and the M option, bootstrapping of many of the methods in the package.
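The layout of a file with multiple data sets, as in the two-data-set example above, can be read with a sketch like the following. It is an illustration under stated assumptions, not the package's own input routine: the function name is hypothetical, names are assumed to occupy the first 10 characters of a line, each species is assumed to fit on one line, and no option letters or auxiliary lines are present.

```python
def read_data_sets(lines, n_sets, name_length=10):
    """Read n_sets consecutive data sets, each starting with its own count line."""
    data_sets = []
    position = 0
    for _ in range(n_sets):
        n_species, n_chars = (int(tok) for tok in lines[position].split()[:2])
        position += 1
        species = {}
        for line in lines[position:position + n_species]:
            name = line[:name_length].strip()
            states = line[name_length:].replace(" ", "")
            if len(states) != n_chars:
                raise ValueError("wrong number of characters for " + name)
            species[name] = states
        position += n_species
        data_sets.append(species)
    return data_sets
```

Each element of the returned list is a dictionary from species name to character string, one dictionary per data set, in the order in which the data sets appear in the file.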

The option to write out the trees into a tree file

This specifies that you want the program to write out the tree not only on its usual output, but also onto a file in nested-parenthesis notation (as described above). This option is sufficiently useful that it is turned on by default in all programs that allow it. You can optionally turn it off if you wish, by typing the appropriate number from the menu (it varies from program to program). This option is useful for creating tree files that can be read directly into the plotting programs and the consensus tree program, and that can be incorporated into the input file to specify user-defined trees in many of the other programs.

The (0) terminal type option

The program will default to one particular assumption about your terminal (except in the case of Macintoshes, the default will be an ANSI compatible terminal). You can alternatively select it to be either an IBM PC, a DEC VT52, or nothing. This affects the ability of the programs to clear the screen when they display their menus, and the graphics characters used to display trees in the programs DNAMOVE, MOVE, DOLMOVE, and RETREE. If you are running a PCDOS system and have the ANSI.SYS driver installed in your CONFIG.SYS file, you may find that the screen clears correctly even with the default setting of ANSI.

Common Options Requiring Information in the Input File

There are a number of options (Ancestor, Factors, Categories and Weights) that are specified in the input file. Some of them must also be selected in the menu. Of these, the Ancestor and Factors options are specific to the Discrete Characters programs and are described in their group document. The Categories option is specific to some of the molecular sequence programs and is described in their group document. The Weights option is used throughout the package and is best introduced here.

The Weights Option

This allows us to specify weights on the individual characters. Weights are invoked by placing a W on the first line of the file. The weights are then specified by a line or lines which start with W and then have enough characters or blanks to complete the full length of a species name. Then they have a single character (0-9 or A-Z) for each character. Thus they look like the data for a species:
Weights   0001111001112
or:
W         1110000ZZZZZ1
The weights cause a character to be counted as if it were n characters, where n is the weight. The values 0-9 give weights 0 through 9, and the values A-Z give weights 10 through 35. By use of the weights we can give overwhelming weight to some characters, and drop others from the analysis. In the molecular sequence programs only two values of the weights, 0 or 1, are allowed.
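The coding of the weights (0-9 for weights 0 through 9, A-Z for weights 10 through 35) can be decoded as in the following Python sketch. The function names are hypothetical, and the 10-character name field is an assumption taken from the examples in this document.

```python
import string

# The position of each symbol in this string is its weight: 0-9, then A-Z for 10-35.
WEIGHT_SYMBOLS = string.digits + string.ascii_uppercase


def decode_weights(line, name_length=10):
    """Turn one weights line into a list of integer weights."""
    symbols = line[name_length:].strip()
    # str.index raises ValueError for any symbol outside 0-9 and A-Z.
    return [WEIGHT_SYMBOLS.index(symbol) for symbol in symbols]


def weighted_steps(steps_per_character, weights):
    """Count each character as if it were n characters, n being its weight."""
    return sum(w * s for w, s in zip(weights, steps_per_character))
```

Applied to the second example line above, the seven leading characters decode to 1, 1, 1, 0, 0, 0, 0, each Z decodes to 35, and the final character to 1.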

Weights can be used to analyze different subsets of characters (by weighting the rest as zero). Alternatively, in the discrete characters programs they can be used to force a certain group to appear on the phylogeny (in effect confining consideration to only phylogenies containing that group). This is done by adding an imaginary character that has 1's for the members of the group, and 0's for all the other species. That imaginary character is then given the highest weight possible: the result will be that any phylogeny that does not contain that group will be penalized by such a heavy amount that it will not (except in the most unusual circumstances) be considered. Of course, the new character brings extra steps to the tree, but the number of these can be calculated in advance and subtracted out of the total when reporting the results. This use of weights is an important one, and one sadly ignored by many users who could profit from it. In the case of molecular sequences we cannot use weights this way, so that to force a given group to appear we have to add a large extra segment of sites to the molecule, with (say) A's for that group and C's for every other species.
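The device of forcing a group by adding an imaginary, heavily weighted character can be sketched as follows for the discrete characters case. The function name and the species names in the usage note are hypothetical, and the sketch only builds the extra column and its weight; the number of characters on the first line of the input file would also have to be increased by one.

```python
def add_group_character(species_data, group, weights, top_weight="Z"):
    """Append a character that is 1 for members of the group and 0 otherwise.

    species_data maps species names to strings of character states, and
    weights is the string of weight symbols for the existing characters.
    """
    members = set(group)
    unknown = members - set(species_data)
    if unknown:
        raise ValueError("species not in the data: " + ", ".join(sorted(unknown)))
    new_data = {
        name: states + ("1" if name in members else "0")
        for name, states in species_data.items()
    }
    # Z is the highest weight symbol (35); the new character receives it.
    return new_data, weights + top_weight
```

For instance, with three species Alpha, Beta and Gamma and the group (Alpha, Beta), the new last character is 1, 1 and 0 respectively, and the weights string gains a final Z.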


Maintained 15 Jul 1996 -- by Martin Hilbers(e-mail:M.P.Hilbers@dl.ac.uk)