% hlmfund.tex
%%w_2, \ldots w_{i-1}$, given an existing probability estimate function $\hat{P}$:
%\begin{equation}
%\hat{\beta}(w_1, \ldots, w_{i-1}) = 1\, - \,\sum_{w_i: c(w_1, \ldots, w_i)>0}
%   \hat{P}(w_i | w_1, \ldots, w_{i-1})
%\end{equation}
%where $c(w_1, \ldots, w_i)$ is the occurrence count of the sequence
%$w_1, \ldots, w_i$ stored in the language model.
The probability mass $\hat{\beta}$ is then distributed amongst all the
unseen $w_i$ and the language model probability estimate becomes:
% where $c(w_1,$ $\ldots,$ $w_i)$ $=$ $0$:
%\begin{equation}
%\hat{P}(w_i | w_1, \ldots, w_{i-1}) = \alpha(w_1, \ldots, w_{i-1}) \,.\, \hat{P}(w_i | w_2, \ldots, w_{i-1})
%\end{equation}
%given:
%\begin{equation}
%\alpha(w_1, \ldots, w_{i-1}) = \frac{\hat{\beta}(w_1, \ldots, w_{i-1})}
%   {\sum_{w_i: c(w_1, \ldots, w_i)=0}\hat{P}(w_i | w_2, \ldots, w_{i-1})}
%\end{equation}
%
%Combining this back off scheme for unseen events with a discounting
%method, the resulting language model probability estimate becomes:
\begin{equation}
\hat{P}(w_i \;|\; w_{i-n+1},\ldots,w_{i-1}) =
\qquad{}\qquad{}\qquad{}\qquad{}\qquad{}\qquad{}\qquad{}
\qquad{}\qquad{}\qquad{}\qquad{}\qquad{}\qquad{}\qquad{}\nonumber
\end{equation}
\vspace{-0.5cm}
\begin{equation}
\left\{ \begin{array}{l@{\quad:\quad}l}
\alpha(w_{i-n+1}, \ldots, w_{i-1}) \,.\, \hat{P}(w_i | w_{i-n+2}, \ldots, w_{i-1}) &
    c(w_{i-n+1}, \ldots, w_i) = 0 \\
d_{c(w_{i-n+1}, \ldots, w_i)} \,.\,
    \frac{c(w_{i-n+1}, \ldots, w_i)}{c(w_{i-n+1},\ldots, w_{i-1})} &
    1 \le c(w_{i-n+1}, \ldots, w_i) \le k \\
\frac{c(w_{i-n+1}, \ldots, w_i)}{c(w_{i-n+1},\ldots, w_{i-1})} &
    c(w_{i-n+1}, \ldots, w_i) > k \\
\end{array} \right.
\end{equation}
where $c(.)$ is the count of an event and:
\begin{equation}
\alpha(w_{i-n+1}, \ldots, w_{i-1}) = \frac{\hat{\beta}(w_{i-n+1},\ldots, w_{i-1})}
    {\sum_{w_i: c(w_{i-n+1}, \ldots, w_i)=0}\hat{P}(w_i | w_{i-n+2}, \ldots, w_{i-1})}
\end{equation}
A back off scheme such as this can be implemented efficiently because
all the back off weights $\alpha$ can be computed once and then stored
as part of the language model, and through its recursive nature it is
straightforward to incorporate within a language model. Through the
use of pruning methods, contexts which occur `too infrequently' are not
stored in the model, so in practice the test $c(w_1,\ldots,w_{i})=0$ is
implemented as a check on whether or not the context is stored in the
model.

%\subsubsection{Interpolation}\label{interp_bit}
%Instead of backing off to a shorter-context estimate only when a
%longer context is not stored within the language model, a model can
%combine multiple contexts in every estimate it makes by means of
%linear interpolation:
%\begin{equation}
%\hat{P}(w_i \;|\; w_1,\ldots,w_{i-1}) = \lambda_1 \frac{c(w_i)}{A} \;+\;
%   \sum_{m=1}^{i-1} \lambda_{i-m+1}
%   \frac{c(w_m,\ldots,w_i)}{c(w_m,\ldots,w_{i-1})}
%\end{equation}
%where $\lambda_m$ is the weight of each component of the language
%model such that $\sum_m \lambda_m = 1$. In this example $\lambda_1$ and
%$\lambda_2$ are the weight of the unigram and bigram models
%respectively. The weights $\lambda_m$ are chosen so as to maximise the
%probability of a held-out test set which was not originally used for
%training each of the component models.
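To make the run-time behaviour of this scheme concrete, the fragment
below is a minimal sketch of a backed-off probability lookup. It is
purely illustrative rather than the \HTK\ implementation, and it assumes
that the already-discounted estimates and the back off weights $\alpha$
have been precomputed and stored in the hypothetical tables
\texttt{prob} and \texttt{alpha}:
\begin{verbatim}
# Minimal sketch of a backed-off probability lookup (illustrative only,
# not the HTK code).  `prob' maps (history, word) tuples to stored,
# already-discounted estimates and `alpha' maps stored histories to
# their precomputed back off weights -- both are hypothetical tables.

def backed_off_prob(word, history, prob, alpha):
    """Return P(word | history), backing off to shorter contexts."""
    if not history:
        # Unigram level: assume every vocabulary word is stored.
        return prob[((), word)]
    if (history, word) in prob:
        # The n-gram is stored: use its (discounted) estimate directly.
        return prob[(history, word)]
    # Unseen n-gram: scale the shorter-context estimate by alpha(history);
    # a history that is not stored at all behaves as if alpha were 1.
    return alpha.get(history, 1.0) * backed_off_prob(word, history[1:],
                                                     prob, alpha)
\end{verbatim}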
\subsubsection{Cut-offs}

With a back off scheme, low count events can be discarded -- {\it
cut-off} -- from the model and more frequently observed shorter-context
estimates can be used instead. An additional advantage of discarding
low occurrence events is that the model size can be substantially
reduced, since in general as $a$ decreases so the number of events
$c_a$ increases -- in fact the Good-Turing discounting scheme depends
upon this relationship.

\mysect{Perplexity}{HLMperplexity}

A measure of language model performance based on average probability
can be developed within the field of information theory
[Shannon 1948]\footnote{C.E. Shannon, \textbf{``A Mathematical Theory of
Communication''}; \textit{The Bell System Technical Journal}, 1948,
vol.~27, pp.~379--423, 623--656. Available online at
\texttt{http://galaxy.ucsd.edu/new/external/shannon.pdf}}.
A speaker emitting language can be considered to be a discrete
information source which is generating a sequence of words $w_1, w_2,
\ldots, w_m$ from a vocabulary set, $\mathbb{W}$. The probability of a
symbol $w_i$ is dependent upon the previous symbols $w_1, \ldots,
w_{i-1}$. The information source's inherent per-word entropy $H$
represents the amount of non-redundant information provided by each
new word on average, defined in bits as:
\begin{equation}
H = - \lim_{m \to \infty} \frac{1}{m} \sum_{w_1, w_2, \ldots, w_m}
\left( P(w_1, w_2, \ldots, w_m)\;\log_2 P(w_1, w_2, \ldots, w_m) \right)
\end{equation}
This summation is over all possible sequences of words, but if the
source is ergodic then the summation over all possible word sequences
can be discarded and the equation becomes equivalent to:
\begin{equation}
H = - \lim_{m \to \infty} \frac{1}{m} \log_2 P(w_1, w_2, \ldots, w_m)
\end{equation}
It is reasonable to assume ergodicity on the basis that we use
language successfully without having been privy to all words that have
ever been spoken or written, and we can disambiguate words on the
basis of only the recent parts of a conversation or piece of text.
Having assumed this ergodic property, it follows that given a large
enough value of $m$, $H$ can be approximated with:
\begin{equation}\label{2-3}
\hat{H} = - \frac{1}{m} \log_2 P(w_1, w_2, \ldots, w_m)
\end{equation}
This last estimate is feasible to evaluate, thus providing the basis
for a metric suitable for assessing the performance of a language
model.

Considering a language model as an information source, it follows that
a language model which took advantage of all possible features of
language to predict words would also achieve a per-word entropy of
$H$. It therefore makes sense to use a measure related to entropy to
assess the actual performance of a language model. Perplexity, $PP$,
is one such measure that is in standard use, defined such that:
\begin{equation}\label{2-4}
PP = 2^{\hat{H}}
\end{equation}
Substituting equation \ref{2-3} into equation \ref{2-4} and
rearranging obtains:
\begin{equation}
PP = \hat{P}(w_1, w_2, \ldots, w_m)^{-\frac{1}{m}}
\label{pp_sum_eqn}
\end{equation}
where $\hat{P}(w_1, w_2, \ldots, w_m)$ is the probability estimate assigned to
the word sequence $(w_1,$ $w_2,$ $\ldots, w_m)$ by a language model.

Perplexity can be thought of as measuring, on average, how many
different equally probable words can follow any given word. Lower
perplexities represent better language models, although this simply
means that they `model language better', rather than necessarily work
better in speech recognition systems -- perplexity is only loosely
correlated with performance in a speech recognition system since it
has no ability to note the relevance of acoustically similar or
dissimilar words.

In order to calculate perplexity both a language model and some test
text are required, so a meaningful comparison between two language
models on the basis of perplexity requires the same test text and word
vocabulary set to have been used in both cases. The size of the
vocabulary can easily be seen to be relevant because as its
cardinality is reduced so the number of possible words given any
history must monotonically decrease; therefore the probability
estimates must on average increase and so the perplexity will
decrease.
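To illustrate how equations \ref{2-3} and \ref{pp_sum_eqn} are applied
in practice, the short sketch below accumulates $\log_2$ probabilities
over a test text and converts the average into a perplexity. The
callable \texttt{lm\_prob}, which stands for an arbitrary trained
language model, is hypothetical:
\begin{verbatim}
import math

# Illustrative test-set perplexity computation (not an HTK tool):
# accumulate log2 P(w_i | history) over the test words, negate and
# average to estimate H-hat, then take PP = 2 ** H-hat.

def perplexity(words, lm_prob, order=3):
    log2_sum = 0.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - order + 1):i])  # last n-1 words
        log2_sum += math.log2(lm_prob(w, history))
    h_hat = -log2_sum / len(words)
    return 2.0 ** h_hat
\end{verbatim}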
\mysect{Overview of $n$-Gram Construction Process}{mkngoview}

This section describes the overall process of building an $n$-gram
language model using the \HTK\ tools. As noted in the introduction,
it is a three-stage process. Firstly, the training text is scanned
and the $n$-gram counts are stored in a set of \textit{gram} files.
Secondly, and optionally, the counts in the gram files are modified to
perform vocabulary and class mapping. Finally, the resulting gram
files are used to build the LM. This separation into stages adds some
complexity to the overall process, but it makes it much more efficient
to handle very large quantities of data since the gram files only need
to be constructed once but can then be augmented, processed and used
for constructing LMs many times.

The overall process involved in building an $n$-gram language model
using the \HTK\ tools is illustrated in Figure~\ref{f:WordLM}. The
procedure begins with some training text, which first of all should be
conditioned into a suitable format by performing operations such as
converting numbers to a citation form, expanding common abbreviations
and so on. The precise format of the training text depends on your
requirements, however, and can vary enormously -- therefore
conditioning tools are not supplied with \HTK.\footnote{In fact a very
simple text conditioning Perl script is included in
{\tt LMTutorial/extras/LCond.pl} for demonstration purposes only -- it
converts text to uppercase (so that words are considered equivalent
irrespective of case) and reads the input punctuation in order to tag
sentences, stripping most other punctuation. See the script for more
details.}

Given some input text, the tool \htool{LGPrep} scans the input word
sequence and counts $n$-grams.\footnote{\htool{LGPrep} can also
perform text modification using supplied rules.} These $n$-gram counts
are stored in a buffer which fills as each new $n$-gram is
encountered. When this buffer becomes full, the $n$-grams within it
are sorted and stored in a \textit{gram} file.

All words (and symbols generally) are represented within \HTK\ by a
unique integer id. The mapping from words to ids is recorded in a
word map. On startup, \htool{LGPrep} loads in an existing word map,
then each new word encountered in the input text is allocated a new id
and added to the map. On completion, \htool{LGPrep} outputs the new
updated word map. If more text is input, this process is repeated and
hence the word map will expand as more and more data is processed.

Although each of the gram files output by \htool{LGPrep} is sorted,
the range of $n$-grams within individual files will overlap. To build
a language model, all $n$-gram counts must be input in sort order so
that words with equivalent histories can be grouped. To accommodate
this, all \HTK\ language modelling tools can read multiple gram files
and sort them on-the-fly. This can be inefficient, however, and it is
therefore useful to first copy a newly generated set of gram files
using the \HLM\ tool \htool{LGCopy}. This yields a set of gram files
which are \textit{sequenced}, i.e.\ the ranges of $n$-grams within each
gram file do not overlap and can therefore be read in a single stream.
Furthermore, the sequenced files will take less disc space since the
counts for identical $n$-grams in different files will have been
merged.

\centrefig{WordLM}{65}{The main stages in building an $n$-gram language
model}
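The fragment below gives a highly simplified, purely conceptual picture
of this count--sort--merge idea. It is not how \htool{LGPrep} and
\htool{LGCopy} are implemented -- the in-memory lists simply stand in
for gram files -- but it shows why sorted, sequenced count files can be
combined in a single pass:
\begin{verbatim}
import heapq
from collections import Counter
from itertools import groupby

# Conceptual illustration only (not the HTK tools): n-grams are counted
# into fixed-size sorted "shards" standing in for gram files, which can
# later be merged in one sorted stream so that counts for identical
# n-grams are combined.

def count_into_shards(words, n=3, max_buffer=100000):
    shards, buffer = [], Counter()
    for i in range(len(words) - n + 1):
        buffer[tuple(words[i:i + n])] += 1
        if len(buffer) >= max_buffer:        # buffer full: flush a shard
            shards.append(sorted(buffer.items()))
            buffer = Counter()
    if buffer:
        shards.append(sorted(buffer.items()))
    return shards

def merge_shards(shards):
    merged = heapq.merge(*shards)            # one pass over sorted shards
    for gram, entries in groupby(merged, key=lambda item: item[0]):
        yield gram, sum(count for _, count in entries)
\end{verbatim}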
The set of (possibly sequenced) gram files and their associated word
map provide the raw data for building an $n$-gram LM. The next stage
in the construction process is to define the vocabulary of the LM and
convert all $n$-grams which contain OOV (out of vocabulary) words so
that each OOV word is replaced by a single symbol representing the
\textit{unknown} class. For example, the $n$-gram \texttt{AN OLEAGINOUS
AFFAIR} would be converted to \texttt{AN !!UNK AFFAIR} if the word
``oleaginous'' was not in the selected vocabulary and \texttt{!!UNK}
is the name chosen for the unknown class.

This assignment of OOV words to a class of unknown words is a specific
example of a more general mechanism. In \HTK, any word can be
associated with a named class by listing it in a \textit{class map}
file. Classes can be defined either by listing the class members or
by listing all non-members. For defining the unknown class the latter
is used, so a plain text list of all in-vocabulary words is supplied
and all other words are mapped to the OOV class. The tool
\htool{LGCopy} can use a class map to make a copy of a set of gram
files in which all words listed in the class map are replaced by the
class name, and also output a word map which contains only the
required vocabulary words and their ids plus any classes and their
ids.

As shown in Figure~\ref{f:WordLM}, the LM itself is built using the
tool \htool{LBuild}. This takes as input the gram files and the word
map and generates the required LM. The language model can be built in
steps (first a unigram, then a bigram, then a trigram, etc.) or in a
single pass if required.

\mysect{Class-Based Language Models}{cllmoview}

\centrefig{ClassLM}{110}{The main stages in building a class-based
language model}

As described in section \ref{s:HLMclassngram}, a class-based language
model consists of two separate components. The first is an $n$-gram
which models the sequence of classes (i.e.\ $p(c_i \,|\, c_{i-n+1},
\ldots, c_{i-1})$) and the second is a class map with associated word
counts or probabilities within classes, allowing the word-given-class
probability $p(w_k|c_k)$ to be evaluated.
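A word probability is then obtained by combining the two components,
$\hat{P}(w_i \,|\, \mbox{history}) = p(w_i \,|\, c_i)\,
p(c_i \,|\, c_{i-n+1}, \ldots, c_{i-1})$. The sketch below shows this
combination, with the class map and the two probability components
represented by hypothetical lookups; it is not the \HTK\
implementation:
\begin{verbatim}
# Minimal sketch of a class-based estimate (not the HTK implementation):
# each word belongs to exactly one class, `class_of' is the class map,
# `p_word_given_class' holds p(w | c) and `class_ngram_prob' is any
# class-sequence n-gram model -- all three names are hypothetical.

def class_lm_prob(word, word_history, class_of, p_word_given_class,
                  class_ngram_prob):
    c = class_of[word]
    class_history = tuple(class_of[w] for w in word_history)
    return (p_word_given_class[(word, c)]
            * class_ngram_prob(c, class_history))
\end{verbatim}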
These two component files may then either be linked into a single
composite file, or a third `linking' file can be created to point to
the two separate files -- both of these operations can be performed
using the \htool{LLink} tool.

Given a set of word classes defined in a class map file and a set of
word level gram files, building a class-based model with the \HTK\
tools requires only a few simple modifications to the basic procedure
described above for building a word $n$-gram:
\begin{itemize}
\item \htool{Cluster} is used with the word map and word level gram
  files derived from the source text to construct a class map which
  defines which class each word is in. The same tool is then used to
  generate the word-classes component file referred to above. Note
  that \htool{Cluster} can also be used to generate this file from an
  existing or manually-generated class map.
\item \htool{LGCopy} is used with the class map to convert the word
  level gram files derived from the source text into class gram files.
  \htool{LBuild} can then be used directly with the class level gram
  files to build the class sequence $n$-gram language model referred
  to above.
\item \htool{LLink} is then run to create either a language model
  script pointing to the two separate language model files or a single
  composite file. The resulting language model is then ready for use.
\end{itemize}
The main steps of this procedure are illustrated in
Figure~\ref{f:ClassLM}.

The next chapter provides a more thorough introduction to the tools as
well as a tutorial to work through, explaining how to use them in
practice.