The input and output format for RESTML is described in its document files. In general its input format is similar to those described here, except that the one-letter codes for restriction sites is specific to that program and is described in that document file. Since the input formats for the eight DNA sequence and two protein sequence programs apply to more than one program, they are described here. Their input formats are standard, making use of the IUPAC standards.
The sequences can continue over multiple lines; when this is done the
sequences must be either in "interleaved" format, similar to the output of
alignment programs, or "sequential" format. These are described in the main
document file. In sequential format all of one sequence is given, possibly on
multiple lines, before the next starts. In interleaved format the first part
of the file should contain the first part of each of the sequences, then
possibly a line containing nothing but a carriage-return character, then the
second part of each sequence, and so on. Only the first parts of the sequences
should be preceded by names. Here is a hypothetical example of interleaved
format:
In interleaved format the present versions of the programs may sometimes
have difficulties with the blank lines between groups of lines, and if so you
might want to retype those lines, making sure that they have only a carriage-
return and no blank characters on them, or you may perhaps have to eliminate
them. The symptoms of this problem are that the programs complain that the
sequences are not properly aligned, and you can find no other cause for this
complaint.
The input format for the DNA sequence programs is standard: the data have
A's, G's, C's and T's (or U's). The first line of the input file contains the
number of species and the number of sites. As with the other programs, options
information may follow this. Following this, each species starts on a new
line. The first 10 characters of that line are the species name. There then
follows the base sequence of that species, each character being one of the
letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period
was also previously allowed but it is no longer allowed, because it sometimes
is used in different senses in other programs). Blanks will be ignored, and so
will numerical digits. This allows GENBANK and EMBL sequence entries to be
read with minimum editing.
These characters can be either upper or lower case. The algorithms
convert all input characters to upper case (which is how they are treated).
The characters constitute the IUPAC (IUB) nucleic acid code plus some slight
extensions. They enable input of nucleic acid sequences taking full account of
any ambiguities in the sequence.
The input for the protein sequence programs is fairly standard. The first
line contains the number of species and the number of amino acid positions
(counting any stop codons that you want to include). These are followed on the
same line by the options. The only options which need information in the input
file are U (User Tree) and W (Weights). They are as described in the
main documentation file. If the W (Weights) option is used there must be a W in the
first line of the input file.
Next come the species data. Each sequence starts on a new line, has a
ten-character species name that must be blank-filled to be of that length,
followed immediately by the species data in the one-letter code. The sequences
must either be in the "interleaved" or "sequential" formats. The I option
selects between them. The sequences can have internal blanks in the sequence
but there must be no extra blanks at the end of the terminated line. Note that
a blank is not a valid symbol for a deletion.
The protein sequences are given by the one-letter code used by the late
Margaret Dayhoff's group in the Atlas of Protein Sequences, and consistent with
the IUB standard abbreviations. In the present version it is:
Here are the same one-letter codes tabulated the other way 'round:
The programs allow options chosen from their menus. Many of these are as
described in the main documentation file, particularly the options J, O, U, T,
W, and Y. (Although T has a different meaning in the programs DNAML and
DNADIST than in the others).
The U option indicates that user-defined trees are provided at the end of
the input file. This happens in the usual way, except that for PROTPARS,
DNAPARS, DNACOMP, and DNAMLK, the trees must be strictly bifurcating,
containing only two-way splits, e. g.: ((A,B),(C,(D,E)));. For DNAML and
RESTML it must have a trifurcation at its base, e. g.: ((A,B),C,(D,E));. The
root of the tree may in those cases be placed arbitrarily, since the trees
needed are actually unrooted, though they look different when printed out. The
program RETREE should enable you to reroot the trees without having to hand-
edit or retype them. For DNAMOVE the U option is not available (although there
is an equivalent feature which uses rooted user trees).
A feature of the nucleotide sequence programs other than DNAMOVE is that
they save time and computer memory space by recognizing sites at which the
pattern of bases is the same, and doing their computation only once. Thus if
we have only four species but a large number of sites, there are (ignoring
ambiguous bases) only about 256 different patterns of nucleotides
(4 x 4 x 4 x 4) that can occur. The programs automatically count how many
occurrences there are of each and then only needs to do as much computation as
would be needed with 256 sites, even though the number of sites is actually
much larger. If there are ambiguities (such as Y or R nucleotides), these are
also handled correctly, and do not cause trouble. The programs store the full
sequences but reserve other space for bookkeeping only for the distinct
patterns. This saves space. Thus the programs will run very effectively with
few species and many sites. On larger numbers of species, if rates of
evolution are small, many of the sites will be invariant (such as having all
A's) and thus will mostly have one of four patterns. The programs will in this
way automatically avoid doing duplicate computations for such sites.
INTERLEAVED AND SEQUENTIAL FORMATS
5 42
Turkey AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp AAACCCTTGC CGTTACGCTT
Gorilla AAACCCTTGC CGGTACGCTT
GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
while in sequential format the same sequences would be:
5 42
Turkey AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA
Note, of course, that a portion of a sequence like this:
300 AAGCGTGAAC GTTGTACTAA TRCAG
is perfectly legal, assuming that the species name has gone before, and is
filled out to full length by blanks. The above digits and blanks will be
ignored, the sequence being taken as starting at the first base symbol (in this
case an A). This should enable you to use output from many multiple-sequence
alignment programs with only minimal editing.
INPUT FOR THE DNA SEQUENCE PROGRAMS
Symbol Meaning
------ -------
A Adenine
G Guanine
C Cytosine
T Thymine
U Uracil
Y pYrimidine (C or T)
R puRine (A or G)
W "Weak" (A or T)
S "Strong" (C or G)
K "Keto" (T or G)
M "aMino" (C or A)
B not A (C or G or T)
D not C (A or G or T)
H not G (A or C or T)
V not T (A or C or G)
X,N,? unknown (A or C or G or T)
O deletion
- deletion
INPUT FOR THE PROTEIN SEQUENCE PROGRAMS
Symbol Stands for
------ ----------
A ala
B asx
C cys
D asp
E glu
F phe
G gly
H his
I ileu
J (not used)
K lys
L leu
M met
N asn
O (not used)
P pro
Q gln
R arg
S ser
T thr
U (not used)
V val
W trp
X unknown amino acid
Y tyr
Z glx
* nonsense (stop)
? unknown amino acid or deletion
- deletion
where "nonsense", and "unknown" mean respectively a nonsense (chain
termination) codon and an amino acid whose identity has not been determined.
The state "asx" means "either asn or asp", and the state "glx" means "either
gln or glu" and the state "deletion" means that alignment studies indicate a
deletion has happened in the ancestry of this position, so that it is no longer
present. Note that if two polypeptide chains are being used that are of
different length owing to one terminating before the other, they can be coded
as (say)
HIINMA*????
HIPNMGVWABT
since after the stop codon we do not definitely know that there has been a
deletion, and do not know what amino acid would have been there. If DNA
studies tell us that there is DNA sequence in that region, then we could use
"X" rather than "?". Note that "X" means an unknown amino acid, but definitely
an amino acid, while "?" could mean either that or a deletion. Otherwise one
will usually want to use "?" after a stop codon, if one does not know what
amino acid is there. If the DNA sequence has been observed there, one probably
ought to resist putting in the amino acids that this DNA would code for, and
one should use "X" instead, because under the assumptions implicit in this
either the parsimony or the distance methods, changes to any noncoding sequence
are much easier than changes in a coding region that change the amino acid
Amino acid One-letter code
---------- ---------------
ala A
arg R
asn N
asp D
asx B
cys C
gln Q
glu E
gly G
glx Z
his H
ileu I
leu L
lys K
met M
phe F
pro P
ser S
thr T
trp W
tyr Y
val V
deletion -
nonsense (stop) *
unknown amino acid X
unknown (incl. deletion) ?
THE OPTIONS
Back to the main PHYLIP page
Back to the SEQNET home page
Maintained 15 Jul 1996 -- by Martin Hilbers(e-mail:M.P.Hilbers@dl.ac.uk)