.\" $Id: training-scripts.1,v 1.15 2006/08/11 22:35:11 stolcke Exp $.TH training-scripts 1 "$Date: 2006/08/11 22:35:11 $" "SRILM Tools".SH NAMEtraining-scripts, compute-oov-rate, continuous-ngram-count, get-gt-counts, make-abs-discount, make-batch-counts, make-big-lm, make-diacritic-map, make-google-ngrams, make-gt-discounts, make-kn-counts, make-kn-discounts, merge-batch-counts, replace-words-with-classes, reverse-ngram-counts, split-tagged-ngrams, reverse-text, uniform-classes, vp2text \- miscellaneous conveniences for language model training.SH SYNOPSIS.B get-gt-counts.BI max= K.BI out= name.RI [ counts ...].br.B make-abs-discount.I gtcounts.br.B make-gt-discounts.BI min= min.BI max= max.I gtcounts.br.B make-kn-counts.BI order= N.BI max_per_file= M.BI output= file[.B no_max_order=1].br.B make-kn-discounts.BI min= min.I gtcounts.br.B make-batch-counts.I file-list.RI [ batch-size.RI [ filter.RI [ count-dir.RI [ options ...]]]].br.B merge-batch-counts.I count-dir.RI [ file-list |\c.IR start-iter ].br.B make-google-ngrams[.BI dir= DIR] [.BI per_file= N] [.B gzip=0].RI [ counts-file ...].br.B continuous-ngram-count[.BI order= N].RI [ textfile ...].br.B reverse-ngram-counts.RI [ counts-file ...].br.B reverse-text.RI [ textfile ...].br.B split-tagged-ngrams[.BI separator= S].RI [ counts-file ...].br.B make-big-lm.B \-name.I name.B \-read.I counts[.B \-trust-totals.BR \-max-per-file " M"].B \-lm.I new-model.RI [ options ...].br.B replace-words-with-classes.BI classes= classes[\c.BI outfile= counts.BR normalize= 0|1.BI addone= K.B have_counts=1.B partial=1].RI [ textfile ...].br.B uniform-classes.I classes .BI > new-classes.br.B make-diacritic-map.I vocab.br.B vp2text.RI [ textfile ...].br.B compute-oov-rate.I vocab.RI [ counts ...].SH DESCRIPTIONThese scripts perform convenience tasks associated with the training oflanguage models.They complement and extend the basic N-gram model estimator in.BR ngram-count (1)..PPSince these tools are implemented as scripts they don't automaticallyinput or output compressed data files correctly, unlike the mainSRILM tools.However, since most scripts work with data from standard input orto standard output (by leaving out the file argument, or specifying it as ``-'') it is easy to combine them with .BR gunzip (1)or.BR gzip (1)on the command line..PPAlso note that many of the scripts take their options with the .BR gawk (1)syntax.IB option = valueinstead of the more common.BI - option.IR value ..PP.B get-gt-countscomputes the counts-of-counts statistics needed in Good-Turing smoothing.The frequencies of counts up to.I K are computed (default is 10).The results are stored in a series of files with root.IR name ,.BR \fIname\fP.gt1counts ,.BR \fIname\fP.gt2counts ,\&..., .BR \fIname\fP.gt\fIN\fPcounts .It is assumed that the input counts have been properly merged, i.e.,that there are no duplicated N-grams..PP.B make-gt-discountstakes one of the output files of.B get-gt-countsand computes the corresponding Good-Turing discounting factors.The output can then be passed to.BR ngram-count (1)via the .BI \-gt noptions to control the smoothing during model estimation.Precomputing the GT discounting in this fashion has the advantage that theGT statistics are not affected by restricting N-grams to a limited vocabulary.Also, .BR get-gt-counts / make-gt-discountscan process arbitrarily large count files, since they do not need toread the counts into memory (unlike.BR ngram-count )..PP.B make-abs-discountcomputes the absolute discounting constant needed for the.B ngram-count.BI 
.PP
.B make-abs-discount
computes the absolute discounting constant needed for the
.B ngram-count
.BI \-cdiscount n
options.
Input is one of the files produced by
.BR get-gt-counts .
.PP
.B make-kn-discounts
computes the discounting constants used by the modified Kneser-Ney
smoothing method.
Input is one of the files produced by
.BR get-gt-counts .
.PP
.B make-batch-counts
performs the first stage in the construction of very large N-gram count files.
.I file-list
is a list of input text files.
Lines starting with a `#' character are ignored.
These files will be grouped into batches of size
.I batch-size
(default 10)
that are then processed in one run of
.B ngram-count
each.
For maximum performance,
.I batch-size
should be as large as possible without triggering paging.
Optionally, a
.I filter
script or program can be given to condition the input texts.
The N-gram count files are left in directory
.I count-dir
(``counts'' by default), where they can be found by a subsequent
run of
.BR merge-batch-counts .
All following
.I options
are passed to
.BR ngram-count ,
e.g., to control N-gram order, vocabulary, etc.
(no options triggering model estimation should be included).
.PP
.B merge-batch-counts
completes the construction of large count files by merging the batched
counts left in
.I count-dir
until a single count file is produced.
Optionally, a
.I file-list
of count files to combine can be specified; otherwise, all count files
in
.I count-dir
from a prior run of
.B make-batch-counts
will be merged.
A number given as the second argument restarts the merging process at
iteration
.IR start-iter .
This is convenient if merging fails to complete for some reason
(e.g., for temporary lack of disk space).
.PP
.B make-google-ngrams
takes a sorted count file as input and creates an indexed directory
structure, in a format developed by Google to store very large N-gram
collections.
The resulting directory can then be used with the
.BR ngram-count (1)
.B \-read-google
option.
Optional arguments specify the output directory
.I dir
and the size
.I N
of individual N-gram files
(default is 10 million N-grams per file).
The
.B gzip=0
option writes plain, as opposed to compressed, files.
.PP
.B continuous-ngram-count
generates N-grams that span line breaks (which are usually taken to
be sentence boundaries).
To count N-grams across line breaks, use
.br
continuous-ngram-count \fItextfile\fP | ngram-count -read -
.br
The argument
.I N
controls the order of the N-grams counted (default 3), and
should match the argument of the
.B ngram-count
.B \-order
option.
.PP
.B reverse-ngram-counts
reverses the word order of N-grams in a counts file or stream.
For example, to recompute lower-order counts from higher-order ones,
but with the summation done over preceding words (rather than following
words, as in
.BR ngram-count (1)),
use
.br
reverse-ngram-counts \fIcount-file\fP | \\
.br
ngram-count -read - -recompute -write - | \\
.br
reverse-ngram-counts > \fInew-counts\fP
.PP
.B reverse-text
reverses the word order in text files, line by line.
Start- and end-of-sentence tags, if present, will be preserved.
This reversal is appropriate for preprocessing training data
for LMs that are meant to be used with the
.B ngram
.B \-reverse
option.
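.PP
For example, a right-to-left trigram model might be trained and then
evaluated with reversed processing (the file names here are
illustrative):
.br
reverse-text \fItrain.txt\fP | ngram-count -text - -order 3 -lm reverse.lm
.br
ngram -lm reverse.lm -reverse -ppl \fItest.txt\fP
.br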
.PP
.B split-tagged-ngrams
expands N-gram counts of word/tag pairs into mixed N-grams of words and tags.
The optional
.BI separator= S
argument allows the delimiting character, which defaults to ``/'',
to be modified.
.PP
.B make-big-lm
constructs large N-gram models in a more memory-efficient way than
.B ngram-count
by itself.
It does so by precomputing the Good-Turing or Kneser-Ney smoothing parameters
from the full set of counts, and then instructing
.B ngram-count
to store only a subset of the counts in memory, namely, those of
N-grams to be retained in the model.
The
.I name
parameter is used to name various auxiliary files.
.I counts
contains the raw N-gram counts; it may be (and usually is) a compressed file.
Unlike with
.BR ngram-count ,
the
.B \-read
option can be repeated to concatenate multiple count files, but the arguments
must be regular files; reading from stdin is not supported.
If Good-Turing smoothing is used and the file contains complete lower-order
counts corresponding to the
sums of higher-order counts, then the
.B \-trust-totals
option may be given for efficiency.
All other
.I options
are passed to
.B ngram-count
(only options affecting model estimation should be given).
Smoothing methods other than Good-Turing and modified Kneser-Ney are not
supported by
.BR make-big-lm .
Kneser-Ney smoothing also requires enough disk space to compute and store the
modified lower-order counts used by the KN method.
This is done using the
.B merge-batch-counts
command; the
.B \-max-per-file
option controls how many counts are stored per batch, and should be
chosen so that the batches fit in real memory.
.PP
.B make-kn-counts
computes the modified lower-order counts used by the KN smoothing method.
It is invoked as a helper script by
.BR make-big-lm .
.PP
.B replace-words-with-classes
replaces expansions of word classes with the corresponding class labels.
.I classes
specifies class expansions in
.BR classes-format (5).
Ambiguities are resolved in favor of the longest matching word strings.
Ties are broken in favor of the expansion listed first in
.IR classes .
Optionally, the file
.I counts
will receive the expansion counts resulting from the replacements.
.B normalize=0
or
.B 1
indicates whether the counts should be normalized to probabilities
(default is 1).
The
.B addone
option may be used to smooth the expansion probabilities by adding
.I K
to each count (default 1).
The option
.B have_counts=1
indicates that the input consists of N-gram counts and that replacement
should be performed on them.
Note that this will not merge counts that have been mapped to identical
N-grams, since such merging is done automatically when
.BR ngram-count (1)
reads count data.
The option
.B partial=1
prevents multi-word class expansions from being replaced when more than
one space character occurs between the words.
.PP
.B uniform-classes
takes a file in
.BR classes-format (5)
and adds uniform probabilities to expansions that don't have an
explicitly stated probability.
.PP
.B make-diacritic-map
constructs a map file that pairs an ASCII-fied version of the words in
.I vocab
with all the occurring non-ASCII word forms.
Such a map file can then be used with
.BR disambig (1)
and a language model
to reconstruct the non-ASCII word forms with diacritics from ASCII
text.
.PP
.B vp2text
is a reimplementation of the filter used in the DARPA Hub-3 and
Hub-4 CSR evaluations to convert ``verbalized punctuation'' texts to
language model training data.
.PP
.B compute-oov-rate
determines the out-of-vocabulary rate of a corpus from its unigram
.I counts
and a target vocabulary list in
.IR vocab .
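.PP
For example, the OOV rate of a test set relative to a fixed vocabulary
might be computed as follows (the file names here are illustrative):
.br
ngram-count -text \fItest.txt\fP -order 1 -write test.1counts
.br
compute-oov-rate \fIvocab\fP test.1counts
.br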
.SH "SEE ALSO"
ngram-count(1), ngram(1), classes-format(5), disambig(1), select-vocab(1).
.SH BUGS
Some of the tools could be generalized and/or made more robust to
misuse.
.SH AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.
.br
Copyright 1995-2006 SRI International