fastDNAml 1.2

Gary J. Olsen, Department of Microbiology
University of Illinois, Urbana, IL
gary@phylo.life.uiuc.edu

Ross Overbeek, Mathematics and Computer Science
Argonne National Laboratory, Argonne, IL
overbeek@mcs.anl.gov


Citing fastDNAml

If you publish work using fastDNAml, please cite the following publications:

   Olsen, G. J., Matsuda, H., Hagstrom, R., and Overbeek, R. 1994.
   fastDNAml: A tool for construction of phylogenetic trees of DNA
   sequences using maximum likelihood. Comput. Appl. Biosci. 10: 41-48.

   Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum
   likelihood approach. J. Mol. Evol. 17: 368-376.


What is fastDNAml

fastDNAml is a program derived from Joseph Felsenstein's version 3.3 DNAML
(part of his PHYLIP package). Users should consult the documentation for
DNAML before using this program.

fastDNAml is an attempt to solve the same problem as DNAML, but to do so
faster and using less memory, so that larger trees and/or more bootstrap
replicates become tractable. Much of fastDNAml is merely a recoding of the
PHYLIP 3.3 DNAML program from PASCAL to C.

DNAML includes the following notice:

   version 3.3. (c) Copyright 1986, 1990 by the University of Washington and
   Joseph Felsenstein. Written by Joseph Felsenstein. Permission is granted
   to copy and use this program provided no fee is charged for it and
   provided that this copyright notice is not removed.


Why is fastDNAml faster?

Some recomputation of values has been eliminated (Joe Felsenstein has done
much of this in version 3.4 DNAML).

The optimization of branch lengths has been accelerated by changing from an
EM method to Newton's method (Joe Felsenstein has done much of this in
version 3.4 DNAML).

The strategy for simultaneously optimizing all of the branches on the tree
has been modified to spend less time getting an individual branch right
before improving the other branches.


Other new features in fastDNAml

fastDNAml includes a checkpoint feature to regularly save its progress
toward finding a large tree.
If the program is interrupted, a minor change to the input file and adding
the R (restart) option permits the work to be resumed from the last
checkpoint.

The new R (restart) option can also be used for more rapid addition of new
sequences to a previously computed tree (when new sequences are added to the
alignment, it is best if the relative alignment of the previous sequences is
not altered).

The G (global) option has been generalized to permit crossing any number of
branches during tree rearrangements. In addition, it is possible to modify
the extent of rearrangement explored during the sequential addition phase of
tree building.

The G U (global and user tree) option combination instructs the program to
find the best of the user trees, and then look for rearrangements that are
better still.

The number of available rate categories has been raised from 9 to 35.

The weighting mask accepts values from 0 through 35.

The new B (bootstrap) option causes generation of a bootstrap sample, drawn
from the input data.

The program includes "P4" code for distributing the problem over multiple
processors (either within one machine, or across multiple machines).


Do DNAML and fastDNAml give the same answer?

Generally yes, though there are some reservations:

One or the other might find a better tree due to minor changes in the way
trees are searched. When sequence addition is replicated with different
values of the jumble random number seed, the two programs have about the same
probability of finding the best tree, but any given seed might give different
trees.

The likelihoods and branch lengths sometimes differ very slightly due to
different criteria for stopping the optimization process.

Little has been done to check the confidence limits on branch lengths. There
seem to be some instances in which the programs disagree, and we think that
fastDNAml is correct.
However, do not take the "significantly greater than zero" too seriously.

If you are concerned, you can supply a tree inferred by fastDNAml as a user
tree to DNAML and let it (1) reoptimize branch lengths, (2) tell you the
confidence limits and (3) tell you the tree likelihood.


Changes and new features in version 1.2

The program can now calculate the likelihood of extremely large user trees.
The largest tree we have tested had 3200 taxa. Generally, you will run out
of computer memory before you exceed an intrinsic limitation. (With this,
it is possible to compare trees found by whatever your favorite methods are
under the likelihood criterion.)

The computation has been changed to permit ease of implementing new models
of evolution and analysis of amino acid sequences (though these have not yet
been done). This has slowed down the program 5-10%.


Changes and new features in version 1.1

The quickadd option is now the default. This has the ugly effect of reversing
the meaning of putting a Q on the option line. (Sorry about this and the
next note, but in the long run it is the better behavior.)

Use of empirical base frequencies is now the default. This reverses the
meaning of the F option, making the default behavior more like that of PHYLIP.

The tree output file is now generated by default and should be more
compatible with the files written and read by the PHYLIP programs. In
particular, the comments with information about the tree, its likelihood,
etc. are removed, and there are no quotation marks around names unless there
are unusual characters within the name. (There are two things to be very
careful about in names: there is no completely consistent way to handle both
blanks and underscores in names without quotation marks, and when a name is
spaced in from the margin in the input file, there are leading blank spaces
in the name, which can be very hard to make compatible with some programs.)

The program now maintains a list of the several best trees, not just the
(single) best.
In particular, when evaluating user-supplied trees, the program tries to save
information about all of the trees and provides a Hasegawa and Kishino type
test of whether each tree is significantly worse than the best one. Note, the
current version of the program prints the report in the order of tree
likelihood, NOT in the order the trees are supplied to the program. The best
way (at present) to figure out which tree is which is to look at the
likelihoods. This is the same test used in PHYLIP, but I had removed access
in version 1.0 of fastDNAml due to differences in how the programs handle
multiple trees. The difference is that fastDNAml can maintain nearly optimal
trees all the time, so you can get a list of the N best trees found by using
the new K option (below).

The program should accept rooted trees (strictly bifurcating), as well as
unrooted trees (with a trifurcation at the deepest level). This is not fully
tested, but it seems to work.


Features in the works

Test subtree exchanges (as well as moving a single subtree) in the search for
better trees.

Allowing the program to optimize any user-defined subset of branches when
user lengths are supplied.


Input and Options

Basics

The input to fastDNAml is similar to that used by DNAML (and the other PHYLIP
programs). The user should consult the PHYLIP documentation for a basic
description of the format.

This version of fastDNAml expects to get its input from stdin (standard
input) and writes its output to stdout (standard output). (There are compile
time options to modify this, for those who care to get into such things.)

On a UNIX or DOS system, it is a simple matter to redirect input from a file
and output to a file:

   fastDNAml < infile > outfile

On a VMS system it is only slightly more difficult. Immediately before
running the program, one includes two commands that define the input and
output files:

   $ Define/User Sys$Input infile
   $ Define/User Sys$Output outfile
   $ Run fastDNAml

The default input data format is Interleaved (see I option).
To help get data from a GenBank or similar format, the interleaved option can
be switched off with the I option. Numbers in the sequence data (i.e.,
sequence position numbers) will be ignored, so they need not be stripped out.

(Note that the program also writes a file called checkpoint.PID. See the R
option below for more description.)


1 -- Print Data

By default, fastDNAml does not echo the sequence data to the output file.
Option 1 reverses this.


3 -- Do Not Print Tree

By default, fastDNAml prints the final tree to the output file. Option 3
reverses this.


4 -- Do Not Write Tree to File (***** Changed in version 1.1 *****)

By default, fastDNAml versions 1.1 and 1.2 write a machine readable (Newick
format) copy of the final tree to an output file. Option 4 reverses this.
The tree output file will be called treefile.PID (where PID is the process ID
under which fastDNAml is running). Look at the Y option below for more
information on alternative tree formats.


B -- Bootstrap

Generates a bootstrap sample of the input data. Requires auxiliary data line
of the form:

   B random_number_seed

Example:

   5 114 B
   B 137
   Sequence1 ACACGGTGTCGTATCATGCTGCAGGATGCTAGACTGCGTCANATGTTCGTACTAACTGTG
   ...

If the W option is used, only positions that have nonzero weights are used in
computing the bootstrap sample. Warning: For a given random number seed, the
sample will always be the same.

PHYLIP DNAML does not include a bootstrap option. (Use the SEQBOOT program.)


C -- Categories

Requires auxiliary data of the form:

   C number_of_categories list_of_category_rates

The maximum number of categories is 35. This line is followed by a list of
the rates for each site:

   Categories list_of_categories [per site, one or more lines]

Category "numbers" are ordered: 1, 2, 3, ..., 9, A, B, ..., Y, Z.
Category zero (undefined rate) is permitted at sites with a zero in a
user-supplied weighting mask.

Example:

   5 114 C
   C 12 0.0625 0.125 0.25 0.5 1 2 4 8 16 32 64 128
   Categories 5111136343678975AAA8949995566778888889AAAAAA9239898629AAAAA9
              633792246624457364222574877188898132984963499AA9899975
   Sequence1 ACACGGTGTCGTATCATGCTGCAGGATGCTAGACTGCGTCANATGTTCGTACTAACTGTG
   ...

PHYLIP DNAML is limited to categories 1 through 9. Also, in PHYLIP version
3.3, the categories data came after all the other auxiliary data, but before
the user-supplied base frequencies and sequence data. If you make the C line
your last auxiliary data line, the programs will behave the same.


F -- Empirical Frequencies (***** Changed in version 1.1 *****)

By default (starting with version 1.1), the program uses base frequencies
derived from the sequence data (called empirical base frequencies). Therefore
the input file should normally NOT include a base frequencies line preceding
the data. If you want to include your own base frequency data, it is now
necessary to use the F option, and add a line to the input file that supplies
the frequency data.

The F option instructs the program to use user-supplied base frequencies
instead of those derived from the sequence data. The input file must then
include a base frequencies line IMMEDIATELY preceding the data:

   5 114 F
   0.25 0.30 0.20 0.25
   Sequence1 ACACGGTGTCGTATCATGCTGCAGGATGCTAGACTGCGTCANATGTTCGTACTAACTGTG
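
As a rough illustration of what "base frequencies derived from the sequence
data" can mean, the Python sketch below simply counts the unambiguous bases
in a set of sequences and normalizes the counts. It is not taken from the
fastDNAml source: the function name empirical_base_frequencies is
hypothetical, ambiguity codes such as N are skipped rather than apportioned,
U is counted as T, and the A, C, G, T output order is an assumption that
should be checked against the PHYLIP documentation before building an F line.

```python
from collections import Counter


def empirical_base_frequencies(sequences):
    """Return the fraction of A, C, G and T among unambiguous bases.

    Hypothetical helper for illustration only; fastDNAml's own
    calculation may treat ambiguity codes differently.
    """
    counts = Counter()
    for seq in sequences:
        for ch in seq.upper():
            if ch == "U":          # assumption: treat RNA U as T
                ch = "T"
            if ch in "ACGT":       # skip N, gaps and other ambiguity codes
                counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: counts[base] / total for base in "ACGT"}


# Two short made-up sequences, not data from this document.
freqs = empirical_base_frequencies(["ACGT", "AANT"])
# Print four values in A, C, G, T order (assumed order, see above).
print(" ".join(f"{freqs[b]:.2f}" for b in "ACGT"))
```

Values computed this way could be compared with the frequencies the program
reports, or rounded and supplied on a user frequencies line when the F option
is in use.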