searched by clicking on the sequence names. You can then enter the string to
search for by selecting the SEARCH FOR STRING option. If the string is found in
any of the sequences selected, the sequence name and column number are printed
below the sequence display.
In PROFILE ALIGNMENT MODE, the two profiles can be merged (normally done after
alignment) by selecting ADD PROFILE 2 TO PROFILE 1. The sequences currently
displayed as Profile 2 will be appended to Profile 1.
The REMOVE ALL GAPS option will remove all gaps from the sequences currently
selected.
WARNING: This option removes ALL gaps, not only those introduced by ClustalX,
but also those that were read from the input alignment file. Any secondary
structure information associated with the alignment will NOT be automatically
realigned.
The REMOVE GAP-ONLY COLUMNS option will remove those positions in the alignment which
contain gaps in all sequences. This can occur as a result of removing divergent
sequences from an alignment, or if an alignment has been realigned.
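As an illustration of what removing gap-only columns means, here is a minimal Python sketch. It is not ClustalX's own code: the function name, the list-of-strings representation of the alignment, and the choice of "-" and "." as gap characters are all assumptions made for the example.

```python
def remove_gap_only_columns(alignment, gap_chars="-."):
    """Drop every column that holds a gap character in all sequences.

    `alignment` is a list of equal-length strings (hypothetical
    representation; ClustalX stores alignments differently).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column if at least one sequence has a residue there.
    keep = [
        col for col in range(length)
        if any(seq[col] not in gap_chars for seq in alignment)
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


aligned = ["AC--GT", "A---GT", "AC--G-"]
print(remove_gap_only_columns(aligned))  # ['ACGT', 'A-GT', 'ACG-']
```

Columns 3 and 4 contain gaps in every sequence and are dropped; the gap in the second sequence at column 2 is kept because the other sequences have a residue there.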
>>HELP M <<
Multiple Alignments
Make sure MULTIPLE ALIGNMENT MODE is selected, using the switch directly above
the sequence display area. Then, use the ALIGNMENT menu to do multiple
alignments.
Multiple alignments are carried out in 3 stages:
1) all sequences are compared to each other (pairwise alignments);
2) a dendrogram (like a phylogenetic tree) is constructed, describing the
approximate groupings of the sequences by similarity (stored in a file);
3) the final multiple alignment is carried out, using the dendrogram as a guide.
The 3 stages are carried out automatically by the DO COMPLETE ALIGNMENT option.
You can skip the first stages (pairwise alignments; guide tree) by using an old
guide tree file (DO ALIGNMENT FROM GUIDE TREE); or you can just produce the
guide tree with no final multiple alignment (PRODUCE GUIDE TREE ONLY).
REALIGN SELECTED SEQUENCES is used to realign badly aligned sequences in the
alignment. Sequences can be selected by clicking on the sequence names - see
Editing Alignments for more details. The unselected sequences are then 'fixed'
and a profile is made including only the unselected sequences. Each of the
selected sequences in turn is then realigned to this profile. The realigned
sequences will be displayed as a group at the end of the alignment.
REALIGN SELECTED SEQUENCE RANGE is used to realign a small region of the
alignment. A residue range can be selected by clicking on the sequence display
area. A multiple alignment is then performed, following the 3 stages described
above, but only using the selected residue range. Finally the new alignment of
the range is pasted back into the full sequence alignment.
By default, gap penalties are used at each end of the subrange in order to
penalise terminal gaps. If the REALIGN SEGMENT END GAP PENALTIES option is
switched off, gaps can be introduced at the ends of the residue range at no
cost.
ALIGNMENT PARAMETERS displays a sub-menu with the following options:
RESET NEW GAPS BEFORE ALIGNMENT will remove any new gaps introduced into the
sequences during multiple alignment if you wish to change the parameters and
try again. This only takes effect just before you do a second multiple
alignment. You can make phylogenetic trees after alignment whether or not this
is ON. If you turn this OFF, the new gaps are kept even if you do a second
multiple alignment. This allows you to iterate the alignment gradually.
Sometimes, the alignment is improved by a second or third pass.
RESET ALL GAPS BEFORE ALIGNMENT will remove all gaps in the sequences including
gaps which were read in from the sequence input file. This only takes effect
just before you do a second multiple alignment. You can make phylogenetic
trees after alignment whether or not this is ON. If you turn this OFF, all
gaps are kept even if you do a second multiple alignment. This allows you to
iterate the alignment gradually. Sometimes, the alignment is improved by a
second or third pass.
PAIRWISE ALIGNMENT PARAMETERS control the speed/sensitivity of the initial
alignments.
MULTIPLE ALIGNMENT PARAMETERS control the gaps in the final multiple
alignments.
PROTEIN GAP PARAMETERS displays a temporary window which allows you to set
various parameters only used in the alignment of protein sequences.
(SECONDARY STRUCTURE PARAMETERS, for use with the Profile Alignment Mode only,
allows you to set various parameters only used with gap penalty masks.)
SAVE LOG FILE will write the alignment calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.
<H4>
OUTPUT FORMAT OPTIONS
</H4>
You can choose from 6 different alignment formats (CLUSTAL, GCG, NBRF/PIR,
PHYLIP, GDE and NEXUS). You can choose more than one (or all 6 if you wish).
CLUSTAL format output is a self-explanatory alignment format. It shows the
sequences aligned in blocks. It can be read in again at a later date to (for
example) calculate a phylogenetic tree or add in new sequences by profile
alignment.
GCG output can be used by any of the GCG programs that can work on multiple
alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
.msf format files (multiple sequence file); new in version 7 of GCG.
NEXUS format is used by several phylogeny programs, including PAUP and
MacClade.
PHYLIP format output can be used for input to the PHYLIP package of Joe
Felsenstein. This is a very widely used package for doing every imaginable
form of phylogenetic analysis (MUCH more than the modest introduction
offered by this program).
NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
characters "-" are used to indicate the positions of gaps in the multiple
alignment. These files can be re-used as input in any part of clustal that
allows sequences (or alignments or profiles) to be read in.
GDE: this format is used by the GDE package of Steven Smith and is understood
by SEQLAB in GCG 9 or later.
GDE OUTPUT CASE: sequences in GDE format may be written in either upper or
lower case.
CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the
alignment lines in clustalw format.
OUTPUT ORDER is used to control the order of the sequences in the output
alignments. By default, it uses the order in which the sequences were aligned
(from the guide tree/dendrogram), thus automatically grouping closely related
sequences. It can be switched to be the same as the original input order.
PARAMETER OUTPUT: This option will save all your parameter settings in a
parameter file (suffix .par) during alignment. The file can be subsequently
used to rerun ClustalW using the same parameters.
<H3>
ALIGNMENT PARAMETERS
</H3>
--------------------
<STRONG>
PAIRWISE ALIGNMENT PARAMETERS
</STRONG>
A distance is calculated between every pair of sequences and these are used to
construct the phylogenetic tree which guides the final multiple alignment. The
scores are calculated from separate pairwise alignments. These can be
calculated using 2 methods: dynamic programming (slow but accurate) or by the
method of Wilbur and Lipman (extremely fast but approximate).
You can choose between the 2 alignment methods using the PAIRWISE ALIGNMENTS
option. The slow/accurate method is fast enough for short sequences but will be
VERY SLOW for many (e.g. >100) long (e.g. >1000 residue) sequences.
<STRONG>
SLOW-ACCURATE alignment parameters:
</STRONG>
These parameters do not have any effect on the speed of the alignments. They
are used to give initial alignments which are then rescored to give percent
identity scores. These % scores are the ones which are displayed on the
screen. The scores are converted to distances for the trees.
Gap Open Penalty: the penalty for opening a gap in the alignment.
Gap Extension Penalty: the penalty for extending a gap by 1 residue.
Protein Weight Matrix: the scoring table which describes the similarity of
each amino acid to each other.
Load protein matrix: allows you to read in a comparison table from a file.
DNA weight matrix: the scores assigned to matches and mismatches (including
IUB ambiguity codes).
Load DNA matrix: allows you to read in a comparison table from a file.
See the Multiple alignment parameters, MATRIX option below for details of the
matrix input format.
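The rescoring step described above (pairwise alignment, then percent identity, then distance) can be sketched as follows. This is only an illustration: the rule of skipping positions where either sequence has a gap, and the conversion distance = 1 - identity/100, are assumptions for the example, and the function names are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent of identical residues over the aligned, ungapped positions.

    Assumption: positions with a gap in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(pid):
    """Convert a percent identity score to a distance between 0 and 1."""
    return 1.0 - pid / 100.0


pid = percent_identity("ACGT-A", "ACCTTA")
print(pid)  # 80.0
print(identity_to_distance(pid))
```

Here five ungapped positions are compared and four are identical, so the pair scores 80% identity and a correspondingly small distance.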
<STRONG>
FAST-APPROXIMATE alignment parameters:
</STRONG>
These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters. 2 techniques are used to make
these alignments very fast: 1) only exactly matching fragments (k-tuples) are
considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
are used.
GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
little effect on the speed or sensitivity except for extreme values.
K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. >1000 residues) you may wish to increase the
default.
TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated. Only the best ones (with most matches) are used
in the alignment. This parameter specifies how many. Decrease for speed;
increase for sensitivity.
WINDOW SIZE: This is the number of diagonals around each of the 'best'
diagonals that will be used. Decrease for speed; increase for sensitivity.
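To make the K-TUPLE SIZE, TOP DIAGONALS and WINDOW SIZE parameters more concrete, the sketch below counts exact k-tuple matches on each diagonal of an imaginary dot-matrix plot and then picks the best diagonals plus a window around each. It illustrates the general idea only and is not the Wilbur and Lipman implementation used by the program; the function names and the diagonal numbering (position in sequence 1 minus position in sequence 2) are assumptions.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_counts(seq1, seq2, k=2):
    """Count exact k-tuple matches on each diagonal (i - j)."""
    positions = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        positions[seq1[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(seq2) - k + 1):
        for i in positions.get(seq2[j:j + k], ()):
            counts[i - j] += 1
    return counts


def select_diagonals(counts, top_diagonals=5, window=5):
    """Return the best diagonals and the set of diagonals within the window."""
    best = [diag for diag, _ in counts.most_common(top_diagonals)]
    selected = set()
    for diag in best:
        selected.update(range(diag - window, diag + window + 1))
    return best, selected


counts = ktuple_diagonal_counts("ACGTACGT", "TTACGTAC", k=2)
print(dict(counts))
print(select_diagonals(counts, top_diagonals=1, window=1))
```

A larger k gives fewer, more specific matches (faster, less sensitive), while larger `top_diagonals` and `window` values widen the region of the plot that is considered (slower, more sensitive).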
<STRONG>
MULTIPLE ALIGNMENT PARAMETERS
</STRONG>
These parameters control the final multiple alignment. This is the core of the
program and the details are complicated. To fully understand the use of the
parameters and the scoring system, you will have to refer to the documentation.
Each step in the final multiple alignment consists of aligning two alignments
or sequences. This is done progressively, following the branching order in the
GUIDE TREE. The basic parameters to control this are two gap penalties and the
scores for various identical/non-identical residues.
The GAP OPENING and EXTENSION PENALTIES can be set here. These control the
cost of opening up every new gap and the cost of every item in a gap.
Increasing the gap opening penalty will make gaps less frequent. Increasing
the gap extension penalty will make gaps shorter. Terminal gaps are not
penalised.
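The interaction of the two penalties can be illustrated with a small sketch. It assumes the simplest reading of the text above: a gap costs the opening penalty once plus the extension penalty for every position in the gap, and terminal gaps cost nothing. The function name and the numeric values are hypothetical (they are not the program defaults), and any further adjustments the program may make to the penalties are ignored.

```python
def gap_cost(length, gap_open, gap_extend, terminal=False):
    """Cost of one gap of `length` positions under a simple affine model.

    Assumption: cost = gap_open + gap_extend * length, and terminal
    gaps are free.
    """
    if length <= 0 or terminal:
        return 0.0
    return gap_open + gap_extend * length


print(gap_cost(3, gap_open=10.0, gap_extend=0.5))                 # 11.5
print(gap_cost(3, gap_open=10.0, gap_extend=0.5, terminal=True))  # 0.0
# One gap of 4 positions is cheaper than two gaps of 2 positions,
# which is why a high opening penalty makes gaps less frequent.
print(gap_cost(4, 10.0, 0.5), 2 * gap_cost(2, 10.0, 0.5))         # 12.0 22.0
```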
The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most distantly
related sequences until after the most closely related sequences have been
aligned. The setting shows the percent identity level required to delay the
addition of a sequence; sequences that are less identical than this level to
any other sequences will be aligned later.
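A rough sketch of how such a cutoff could be applied is shown below. It assumes one particular reading of the rule: a sequence is delayed when its best percent identity to every other sequence falls below the chosen level. The function name and the dictionary keyed by sequence pairs are hypothetical, and the scores are made-up illustrative values.

```python
def sequences_to_delay(names, pid, threshold):
    """Return the sequences whose best identity to any other is below threshold.

    `pid` maps frozenset({name_a, name_b}) to a percent identity score
    (hypothetical data structure for this example).
    """
    delayed = []
    for a in names:
        best = max(
            (pid[frozenset((a, b))] for b in names if b != a),
            default=100.0,  # a lone sequence is never delayed
        )
        if best < threshold:
            delayed.append(a)
    return delayed


names = ["s1", "s2", "s3"]
pid = {
    frozenset(("s1", "s2")): 85.0,
    frozenset(("s1", "s3")): 30.0,
    frozenset(("s2", "s3")): 25.0,
}
print(sequences_to_delay(names, pid, threshold=40.0))  # ['s3']
```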
The TRANSITION WEIGHT gives transitions (A<-->G or C<-->T i.e. purine-purine or
pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero
means that the transitions are scored as mismatches, while a weight of 1 gives
the transitions the match score. For distantly related DNA sequences, the
weight should be near to zero; for closely related sequences it can be useful
to assign a higher score. The default is set to 0.5.
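The effect of the transition weight can be sketched as a simple scoring function. The two endpoints come from the description above (weight 0 scores a transition as a mismatch, weight 1 as a match); the linear interpolation in between, the match and mismatch values of 1.0 and 0.0, and the function names are assumptions for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for purine-purine or pyrimidine-pyrimidine substitutions."""
    pair = {a, b}
    return a != b and (pair <= PURINES or pair <= PYRIMIDINES)


def dna_score(a, b, transition_weight=0.5, match=1.0, mismatch=0.0):
    """Score a pair of bases, giving transitions a weighted match score."""
    if a == b:
        return match
    if is_transition(a, b):
        # Assumed linear interpolation between mismatch and match score.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch


print(dna_score("A", "G"))                         # 0.5 (transition)
print(dna_score("A", "C"))                         # 0.0 (transversion)
print(dna_score("C", "T", transition_weight=1.0))  # 1.0
```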
The PROTEIN WEIGHT MATRIX option allows you to choose a series of weight
matrices. For protein alignments, you use a weight matrix to determine the
similarity of non-identical amino acids. For example, Tyr aligned with Phe is
usually judged to be 'better' than Tyr aligned with Pro.
There are three 'in-built' series of weight matrices offered. Each consists of
several matrices which work differently at different evolutionary distances. To
see the exact details, read the documentation. Crudely, we store several
matrices in memory, spanning the full range of amino acid distance (from almost
identical sequences to highly divergent ones). For very similar sequences, it
is best to use a strict weight matrix which only gives a high score to
identities and the most favoured conservative substitutions. For more divergent
sequences, it is appropriate to use "softer" matrices which give a high score
to many other frequent substitutions.
1) BLOSUM (Henikoff). These matrices appear to be the best available for
carrying out database similarity (homology) searches. The matrices currently
used are: Blosum 80, 62, 45 and 30. BLOSUM was the default in earlier Clustal X
versions.
2) PAM (Dayhoff). These have been extremely widely used since the late '70s. We
currently use the PAM 20, 60, 120, 350 matrices.
3) GONNET. These matrices were derived using almost the same procedure as the
Dayhoff one (above) but are much more up to date and are based on a far larger
data set. They appear to be more sensitive than the Dayhoff series. We
currently use the GONNET 80, 120, 160, 250 and 350 matrices. This series is the
default for Clustal X version 1.8.
We also supply an identity matrix which gives a score of 10 to two identical
amino acids and a score of zero otherwise. This matrix is not very useful.
Load protein matrix: allows you to read in a comparison matrix from a file.
This can be either a single matrix or a series of matrices (see below for
format).
The DNA WEIGHT MATRIX option allows you to select a single matrix (not a series)
used for aligning nucleic acid sequences. Two hard-coded matrices are available:
1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
of nucleic acid sequences. X's and N's are treated as matches to any IUB
ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
2) CLUSTALW(1.6). A previous system used by ClustalW, in which matches score
1.0 and mismatches score 0. All matches for IUB symbols also score 0.
Load DNA matrix: allows you to read in a nucleic acid comparison matrix from a
file (just one matrix, not a series).
SINGLE MATRIX INPUT FORMAT