<!-- $Id: ngram-count.1,v 1.33 2006/09/04 09:13:10 stolcke Exp $ -->
<HTML>
<HEAD><TITLE>ngram-count</TITLE></HEAD>
<BODY>
<H1>ngram-count</H1>
<H2> NAME </H2>
ngram-count - count N-grams and estimate language models
<H2> SYNOPSIS </H2>
<B>ngram-count</B> [ <B>-help</B> ] <I>option</I> ...
<H2> DESCRIPTION </H2>
<B>ngram-count</B> generates and manipulates N-gram counts, and estimates N-gram language models from them.
The program first builds an internal N-gram count set, either by reading counts from a file or by scanning text input.
The resulting counts can then be written back to a file or used to build an N-gram language model in ARPA <A HREF="ngram-format.html">ngram-format(5)</A>.
Each of these actions is triggered by the corresponding options described below.
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
<DL>
<DT><B>-help</B>
<DD>Print an option summary.
<DT><B>-version</B>
<DD>Print version information.
<DT><B>-order</B> <I>n</I>
<DD>Set the maximal order (length) of N-grams to count.
This also determines the order of the estimated LM, if any.
The default order is 3.
<DT><B>-vocab</B> <I>file</I>
<DD>Read a vocabulary from <I>file</I>.
Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added to the vocabulary.
<DT><B>-vocab-aliases</B> <I>file</I>
<DD>Read vocabulary alias definitions from <I>file</I>, consisting of lines of the form
<BR>
&nbsp;&nbsp;&nbsp;<I>alias</I> <I>word</I>
<BR>
This causes all tokens <I>alias</I> to be mapped to <I>word</I>.
<DT><B>-write-vocab</B> <I>file</I>
<DD>Write the vocabulary built in the counting process to <I>file</I>.
<DT><B>-tagged</B>
<DD>Interpret text and N-grams as consisting of word/tag pairs.
<DT><B>-tolower</B>
<DD>Map all vocabulary to lowercase.
<DT><B>-memuse</B>
<DD>Print memory usage statistics.
</DL>
<H3> Counting Options </H3>
<DL>
<DT><B>-text</B> <I>textfile</I>
<DD>Generate N-gram counts from <I>textfile</I>, which should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
<DT><B>-read</B> <I>countsfile</I>
<DD>Read N-gram counts from a file.
ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Thus several count files can be merged by using <A HREF="cat.html">cat(1)</A> and feeding the result to <B>ngram-count -read -</B> (but see <A HREF="ngram-merge.html">ngram-merge(1)</A> for merging counts that exceed available memory).
Counts collected by <B>-text</B> and <B>-read</B> are additive as well.
Binary count files (see below) are also recognized.
<DT><B>-read-google</B> <I>dir</I>
<DD>Read N-gram counts from an indexed directory structure rooted in <I>dir</I>, in a format developed by Google to store very large N-gram collections.
The corresponding directory structure can be created with the script <B>make-google-ngrams</B> described in <A HREF="training-scripts.html">training-scripts(1)</A>.
<DT><B>-write</B> <I>file</I>
<DD>Write total counts to <I>file</I>.
<DT><B>-write-binary</B> <I>file</I>
<DD>Write total counts to <I>file</I> in binary format.
Binary count files cannot be compressed and are typically larger than compressed ASCII count files.
However, they can be loaded faster, especially when the <B>-limit-vocab</B> option is used.
<DT><B>-write-order</B> <I>n</I>
<DD>Order of counts to write.
The default is 0, which stands for N-grams of all lengths.
<DT><B>-write</B><I>n</I> <I>file</I>
<DD>where <I>n</I> is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Write only counts of the indicated order to <I>file</I>.
This is convenient for generating counts of different orders separately in a single pass.
<DT><B>-sort</B>
<DD>Output counts in lexicographic order, as required for <A HREF="ngram-merge.html">ngram-merge(1)</A>.
<DT><B>-recompute</B>
<DD>Regenerate lower-order counts by summing the highest-order counts for each N-gram prefix.
<DT><B>-limit-vocab</B>
<DD>Discard N-gram counts on reading that do not pertain to the words specified in the vocabulary.
The default is that words used in the count files are automatically added to the vocabulary.
</DL>
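<P>
As a brief illustration of the counting options, the commands below sketch a typical counting workflow; the file names (<I>corpus.txt</I>, <I>part1.counts</I>, etc.) are hypothetical, and only options documented above are used.
<PRE>
# Count trigrams in a text corpus; write the counts sorted
# lexicographically so they could later be merged with ngram-merge(1).
ngram-count -order 3 -text corpus.txt -sort -write corpus.counts

# Merge two existing ASCII count files by feeding them to -read on stdin;
# repeated N-grams have their counts added.
cat part1.counts part2.counts | ngram-count -order 3 -read - -write merged.counts
</PRE>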
<H3> LM Options </H3>
<DL>
<DT><B>-lm</B> <I>lmfile</I>
<DD>Estimate a backoff N-gram model from the total counts, and write it to <I>lmfile</I> in <A HREF="ngram-format.html">ngram-format(5)</A>.
<DT><B>-nonevents</B> <I>file</I>
<DD>Read a list of words from <I>file</I> that are to be considered non-events, i.e., that can only occur in the context of an N-gram.
Such words are given zero probability mass in model estimation.
<DT><B>-float-counts</B>
<DD>Enable manipulation of fractional counts.
Only certain discounting methods support non-integer counts.
<DT><B>-skip</B>
<DD>Estimate a ``skip'' N-gram model, which predicts a word by an interpolation of the immediate context and the context one word prior.
This also triggers N-gram counts to be generated that are one word longer than the indicated order.
The following four options control the EM estimation algorithm used for skip-N-grams.
<DT><B>-init-lm</B> <I>lmfile</I>
<DD>Load an LM to initialize the parameters of the skip-N-gram.
<DT><B>-skip-init</B> <I>value</I>
<DD>The initial skip probability for all words.
<DT><B>-em-iters</B> <I>n</I>
<DD>The maximum number of EM iterations.
<DT><B>-em-delta</B> <I>d</I>
<DD>The convergence criterion for EM: if the relative change in log likelihood falls below the given value, iteration stops.
<DT><B>-count-lm</B>
<DD>Estimate a count-based interpolated LM using Jelinek-Mercer smoothing (Chen &amp; Goodman, 1998).
Several of the options for skip-N-gram LMs (above) apply.
An initial count-LM in the format described in <A HREF="ngram.html">ngram(1)</A> needs to be specified using <B>-init-lm</B>.
The options <B>-em-iters</B> and <B>-em-delta</B> control termination of the EM algorithm.
Note that the N-gram counts used for the maximum-likelihood estimates come from the <B>-init-lm</B> model.
The counts specified with <B>-read</B> or <B>-text</B> are used only to estimate the smoothing (interpolation) weights.
<DT><B>-unk</B>
<DD>Build an ``open vocabulary'' LM, i.e., one that contains the unknown-word token as a regular word.
The default is to remove the unknown word.
<DT><B>-map-unk</B> <I>word</I>
<DD>Map out-of-vocabulary words to <I>word</I>, rather than the default <B>&lt;unk&gt;</B> tag.
<DT><B>-trust-totals</B>
<DD>Force the lower-order counts to be used as total counts in estimating N-gram probabilities.
Usually these totals are recomputed from the higher-order counts.
<DT><B>-prune</B> <I>threshold</I>
<DD>Prune N-gram probabilities if their removal causes the (training set) perplexity of the model to increase by less than <I>threshold</I> relative.
<DT><B>-minprune</B> <I>n</I>
<DD>Only prune N-grams of length at least <I>n</I>.
The default (and minimum allowed value) is 2, i.e., only unigrams are excluded from pruning.
</DL>
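<P>
As a sketch of model estimation with the options above (file names are again hypothetical, and any option not shown keeps its default), a backoff LM can be built either directly from text or from previously written counts:
<PRE>
# Estimate a trigram backoff LM from text, restricting the vocabulary to
# wordlist.txt and keeping &lt;unk&gt; as a regular word (-unk); the model is
# written in ARPA ngram-format(5).
ngram-count -order 3 -text corpus.txt -vocab wordlist.txt -unk -lm corpus.lm

# Estimate from previously written counts, pruning N-grams whose removal
# increases training-set perplexity by less than 1e-8 relative.
ngram-count -order 3 -read merged.counts -lm pruned.lm -prune 1e-8
</PRE>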
</BODY>
</HTML>