%% !HVER!hlmfund [SJY 05/04/97]
%% Updated (and about 90% rewritten) - Gareth Moore 16/01/02 - 27/03/02
%
\mychap{Fundamentals of language modelling}{hlmfund}

The \HTK\ language modelling tools are designed for constructing and
testing statistical $n$-gram language models. This chapter introduces
language modelling and provides an overview of the supplied tools. It
is strongly recommended that you read this chapter and then work
through the tutorial in the following chapter -- this will provide you
with everything you need to know to get started building language
models.

\sidepic{HLMoperation}{80}{
An $n$-gram is a sequence of $n$ symbols (e.g.\ words, syntactic
categories, etc.)\ and an $n$-gram language model (LM) is used to
predict each symbol in the sequence given its $n-1$ predecessors. It
is built on the assumption that the probability of a specific $n$-gram
occurring in some unknown test text can be estimated from the
frequency of its occurrence in some given training text. Thus, as
illustrated by the picture above, $n$-gram construction is a
three-stage process. Firstly, the training text is scanned and its
$n$-grams are counted and stored in a database of \textit{gram} files.
In the second stage some words may be mapped to an out-of-vocabulary
class or some other class mapping may be applied, and then in the
final stage the counts in the resulting gram files are used to compute
$n$-gram probabilities which are stored in the \textit{language model}
file. Lastly, the \textit{goodness} of a language model can be
estimated by using it to compute a measure called \textit{perplexity}
on a previously unseen test set. In general, the better a language
model, the lower its test-set perplexity.
}

Although the basic principle of an $n$-gram LM is very simple, in
practice there are usually many more potential $n$-grams than can ever
be collected in a training text in sufficient numbers to yield robust
frequency estimates. Furthermore, for any real application such as
speech recognition, the use of an essentially static and finite
training text makes it difficult to generate a single LM which is
well-matched to varying test material. For example, an LM trained on
newspaper text would be a good predictor for dictating news reports,
but the same LM would be a poor predictor for personal letters or a
spoken interface to a flight reservation system. A final difficulty is
that the \textit{vocabulary} of an $n$-gram LM is finite and fixed at
construction time. Thus, if the LM is word-based, it can only predict
words within its vocabulary, and furthermore new words cannot be added
without rebuilding the LM.

The following four sections provide a thorough introduction to the
theory of $n$-gram language models. It is recommended that you read
this section because it will provide you with at least a basic
understanding of what many of the tools and their parameters actually
do -- you can safely skip the equations if you choose because the text
explains all the most important parts in plain English. The final
section of this chapter then introduces the tools provided to
implement the various aspects of $n$-gram language modelling that have
been described.
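To make the perplexity measure mentioned above concrete, the short
Python sketch below computes test-set perplexity for any model that
can supply the conditional probability of a word given its history.
It is purely illustrative and not part of \HTK; the \texttt{prob}
argument is a hypothetical stand-in for a language model.
\begin{verbatim}
import math

def perplexity(test_words, prob, n=3):
    """Perplexity of a list of test words under a hypothetical model.

    `prob(word, history)` must return the model's (non-zero) estimate
    of P(word | history), where `history` is the preceding n-1 words.
    """
    log_sum = 0.0
    for i, w in enumerate(test_words):
        history = tuple(test_words[max(0, i - n + 1):i])
        log_sum += math.log(prob(w, history))
    # Perplexity is the exponential of the average negative log probability.
    return math.exp(-log_sum / len(test_words))
\end{verbatim}
A lower value means that, on average, the test text is less surprising
to the model.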
\mysect{{\it n}-gram language models}{ngramLMs}

Language models estimate the probability of a word sequence, $\hat
P(w_1, w_2, \ldots, w_m)$ -- that is, they evaluate $P(w_i)$ as
defined in equation \ref{e:3} in chapter
\ref{c:fundaments}.\footnote{The theory components of this chapter --
these first four sections -- are condensed from portions of
{\textbf{``Adaptive Statistical Class-based Language Modelling''},
G.L. Moore; \textit{Ph.D.\ thesis, Cambridge University} 2001}}

The probability $\hat P(w_1, w_2, \ldots, w_m)$ can be decomposed as a
product of conditional probabilities:
\begin{equation}
\hat P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} \hat P(w_i \;|\; w_1,
\ldots, w_{i-1})
\label{cond_prob_model}
\end{equation}

\mysubsect{Word {\it n}-gram models}{wordngrams}

Equation \ref{cond_prob_model} presents an opportunity for
approximating $\hat{P}(\cal{W})$ by limiting the context:
\begin{equation}
\hat P(w_1, w_2, \ldots, w_m) \simeq \prod_{i=1}^{m} \hat P(w_i \;|\;
w_{i-n+1}, \ldots, w_{i-1})
\label{ngram_model}
\end{equation}
for some $n \geqslant 1$. If language is assumed to be ergodic -- that
is, it has the property that the probability of any state can be
estimated from a large enough history independent of the starting
conditions\footnote{See section 5 of [Shannon 1948] for a more formal
definition of ergodicity.} -- then for sufficiently high $n$ equation
\ref{ngram_model} is exact. Due to reasons of data sparsity, however,
values of $n$ in the range of 1 to 4 inclusive are typically used, and
there are also practicalities of storage space for these estimates to
consider. Models using contiguous but limited context in this way are
usually referred to as $n$-gram language models, and the conditional
context component of the probability (``$w_{i-n+1}, \ldots, w_{i-1}$''
in equation \ref{ngram_model}) is referred to as the {\it history}.

Estimates of probabilities in $n$-gram models are commonly based on
maximum likelihood estimates -- that is, by counting events in context
on some given training text:
\begin{equation}
\hat P(w_i \;|\; w_{i-n+1}, \ldots, w_{i-1}) =
\frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}
\label{ngramcountdiv}
\end{equation}
where $C(.)$ is the count of a given word sequence in the training
text. Refinements to this maximum likelihood estimate are described
later in this chapter.

The choice of $n$ has a significant effect on the number of potential
parameters that the model can have, which is maximally bounded by
$|\mathbb{W}|^n$, where $\mathbb{W}$ is the set of words in the
language model, also known as the {\it vocabulary}. A 4-gram model
with a typically-sized 65,000 word vocabulary can therefore
potentially have $65,000\,^4 \simeq 1.8\times10^{19}$ parameters. In
practice, however, only a small subset of the possible parameter
combinations represent likely word sequences, so the storage
requirement is far less than this theoretical maximum -- of the order
of $10^{11}$ times less in fact.\footnote{Based on the analysis of 170
million words of newspaper and broadcast news text.} Even given this
significant reduction in coverage and a very large training
text\footnote{A couple of hundred million words, for example.} there
are still many plausible word sequences which will not be encountered
in the training text, or will not be found a statistically significant
number of times. It would not be sensible to assign all unseen
sequences zero probability, so methods of coping with low and zero
occurrence word tuples have been developed. This is discussed later in
section \ref{robust_estimation}.
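Before any such refinements, equation \ref{ngramcountdiv} amounts to
little more than counting and dividing. The following Python fragment
is purely illustrative (the \HTK\ tools store these counts in gram
files and perform the estimation themselves), but shows the maximum
likelihood estimate in action on a toy training text.
\begin{verbatim}
from collections import Counter

def ngram_counts(words, n):
    """Count every n-gram and every (n-1)-gram history in the training words."""
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    hists = Counter()
    for g, c in grams.items():
        hists[g[:-1]] += c        # C(history) = sum over w of C(history, w)
    return grams, hists

def mle_prob(w, history, grams, hists):
    """Maximum likelihood estimate C(history, w) / C(history)."""
    h = tuple(history)
    return grams[h + (w,)] / hists[h] if hists[h] else 0.0

# Example: trigram estimate from a toy training text
train = "the cat sat on the mat the cat ate".split()
grams, hists = ngram_counts(train, 3)
print(mle_prob("sat", ("the", "cat"), grams, hists))   # 0.5
\end{verbatim}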
It is not only the storage space that must be considered, however --
it is also necessary to be able to attach a reasonable degree of
confidence to the derived estimates. Suitably large quantities of
example training text are therefore also required to ensure
statistical significance. Increasing the amount of training text not
only gives greater confidence in model estimates, however, but also
demands more storage space and longer analysis periods when estimating
model parameters, which may place feasibility limits on how much data
can be used in constructing the final model or how thoroughly it can
be analysed. At the other end of the scale, for restricted domain
models there may be only a limited quantity of suitable in-domain text
available, so local estimates may need smoothing with global priors.

In addition, if language models are to be used for speech recognition
then it is good to train them on {\it precise} acoustic transcriptions
where possible -- that is, text which features the hesitations,
repetitions, word fragments, mistakes and all the other sources of
deviation from purely grammatical language that characterise everyday
speech. However, such acoustically accurate transcriptions are in
limited supply since they must be specifically prepared; real-world
transcripts as available for various other purposes almost
ubiquitously correct any disfluencies or mistakes made by speakers.

\mysubsect{Equivalence classes}{HLMequivalenceclasses}

The word $n$-gram model described in equation \ref{ngram_model} uses
an equivalence mapping on the word history which assumes that all
contexts which have the same most recent $n-1$ words all have the same
probability. This concept can be expressed more generally by defining
an equivalence class function that acts on word histories, $\mathcal
E(.)$, such that if $\mathcal E(x) = \mathcal E(y)$ then $\forall w$:
$P(w \;|\; x) = P(w \;|\; y)$:
\begin{equation}
P(w_i \;|\; w_1, w_2, \ldots, w_{i-1}) =
P(w_i \;|\; \mathcal E(w_1, w_2, \ldots, w_{i-1}))
\label{equiv_cond_prob_model}
\end{equation}
A definition of $\mathcal{E}$ that describes a word $n$-gram is thus:
\begin{equation}
\mathcal E_{\textrm{word-{\it n}-gram}}(w_1, \ldots, w_{i}) =
\mathcal E(w_{i-n+1}, \ldots, w_{i})
\end{equation}
In a good language model the choice of $\mathcal{E}$ should be such
that it provides a reliable predictor of the next word, resulting in
classes which occur frequently enough in the training text that they
can be well modelled, and does not result in so many distinct history
equivalence classes that it is infeasible to store or analyse all the
resultant separate probabilities.

\mysubsect{Class {\it n}-gram models}{HLMclassngram}
\label{classngram-description}

One method of reducing the number of word history equivalence classes
to be modelled in the $n$-gram case is to consider some words as
equivalent. This can be implemented by mapping a set of words to a
word class $g \in \mathbb{G}$ by using a classification function $G(w)
= g$. If any class contains more than one word then this mapping will
result in fewer distinct word classes than there are words,
$|\mathbb{G}| < |\mathbb{W}|$, thus reducing the number of separate
contexts that must be considered. The equivalence classes can then be
described as a sequence of these classes:
\begin{equation}
\mathcal E_{\textrm{class-{\it n}-gram}}(w_1, \ldots, w_{i}) =
\mathcal E(G(w_{i-n+1}), \ldots, G(w_{i}))
\label{equiv_classes}
\end{equation}

A deterministic word-to-class mapping like this has some advantages
over a word $n$-gram model since the reduction in the number of
distinct histories reduces the storage space and training data
requirements whilst improving the robustness of the probability
estimates for a given quantity of training data. Because multiple
words can be mapped to the same class, the model has the ability to
make more confident assumptions about infrequent words in a class
based on other more frequent words in the same class\footnote{Since it
is assumed that words are placed in the same class because they share
certain properties.} than is possible in the word $n$-gram case -- and
furthermore for the same reason it is able to make generalising
assumptions about words used in contexts which are not explicitly
encountered in the training text. These gains, however, clearly
correspond with a loss in the ability to distinguish between different
histories, although this might be offset by the ability to choose a
higher value of $n$.
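As a small illustration of equation \ref{equiv_classes}, the following
Python fragment applies a hypothetical hand-built class map to two
distinct word histories, which collapse onto a single class history
and therefore pool their statistics. Real class maps are normally
found automatically, as described in section \ref{clustering_section}.
\begin{verbatim}
# Hypothetical toy class map G(w); real class maps are usually found
# automatically from training data.
class_map = {
    "monday": "DAY", "tuesday": "DAY", "friday": "DAY",
    "flew": "TRAVELVERB", "drove": "TRAVELVERB",
    "to": "FUNC", "on": "FUNC",
}

def G(w):
    return class_map.get(w, "UNKNOWN")

def class_history(word_history):
    """Map a word history onto its class n-gram equivalence class."""
    return tuple(G(w) for w in word_history)

# Two distinct word histories ...
histories = [("flew", "on", "monday"), ("drove", "on", "friday")]
# ... share one class history, so their statistics are pooled.
print({class_history(h) for h in histories})
# {('TRAVELVERB', 'FUNC', 'DAY')}
\end{verbatim}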
The most commonly used form of class $n$-gram model uses a single
classification function, $G(.)$, as in equation \ref{equiv_classes},
which is applied to each word in the $n$-gram, including the word
which is being predicted. Considering for clarity the
bigram\footnote{By convention {\it unigram} refers to a 1-gram, {\it
bigram} indicates a 2-gram and {\it trigram} is a 3-gram. There is no
standard term for a 4-gram.} case, then given $G(.)$ the language
model has the terms $w_i$, $w_{i-1}$, $G(w_i)$ and $G(w_{i-1})$
available to it. The probability estimate can be decomposed as
follows:
\begin{eqnarray}
P_{\textrm{class'}}(w_i \;|\; w_{i-1})
& = & P(w_i \;|\; G(w_i), G(w_{i-1}), w_{i-1}) \nonumber\\
& \qquad\qquad\times & P(G(w_i) \;|\; G(w_{i-1}), w_{i-1})
\label{chap2equalToOne}
\end{eqnarray}
It is assumed that $P(w_i \;|\; G(w_i), G(w_{i-1}), w_{i-1})$ is
independent of $G(w_{i-1})$ and $w_{i-1}$ and that $P(G(w_i) \;|\;
G(w_{i-1}), w_{i-1})$ is independent of $w_{i-1}$, resulting in the
model:
\begin{equation}
P_{\textrm{class}}(w_i \;|\; w_{i-1}) = P(w_i \;|\; G(w_{i}))
\;\times\; P(G(w_i) \;|\; G(w_{i-1}))
\label{normclass}
\end{equation}

Almost all reported class $n$-gram work using statistically-found
classes is based on clustering algorithms which optimise $G(.)$ on the
basis of bigram training set likelihood, even if the class map is to
be used with longer-context models. It is interesting to note that
this approximation appears to work well, however, suggesting that the
class maps found are in some respects ``general'' and capture some
features of natural language which apply irrespective of the context
length used when finding these features.
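To make equation \ref{normclass} concrete, the sketch below builds
maximum likelihood estimates of both factors from a training word
list, given some word-to-class function \texttt{G}. It is illustrative
Python only: it applies no smoothing and assumes that every word
queried was seen in the training text.
\begin{verbatim}
from collections import Counter

def class_bigram_model(words, G):
    """ML estimates for P(w | G(w)) and P(g | g_prev) from training words."""
    word_count = Counter(words)
    class_count = Counter(G(w) for w in words)
    class_bigrams = Counter((G(words[i - 1]), G(words[i]))
                            for i in range(1, len(words)))

    def prob(w, w_prev):
        # P_class(w | w_prev) = P(w | G(w)) * P(G(w) | G(w_prev))
        g, g_prev = G(w), G(w_prev)
        p_w_given_g = word_count[w] / class_count[g]
        p_g_given_gprev = class_bigrams[(g_prev, g)] / class_count[g_prev]
        return p_w_given_g * p_g_given_gprev

    return prob

# Usage, given a word-to-class function G such as the toy one above:
#   prob = class_bigram_model(training_words, G)
#   prob("cat", "the")
\end{verbatim}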
\mysect{Statistically-derived Class Maps}{HLMclustering}
\label{clustering_section}

An obvious question that arises is how to compute or otherwise obtain
a class map for use in a language model. This section discusses one
strategy which has successfully been used.

Methods of statistical class map construction seek to maximise the
likelihood of the training text given the class model by making
iterative controlled changes to an initial class map -- in order to
make this problem more computationally feasible they typically use a
deterministic map.

\mysubsect{Word exchange algorithm}{HLMexchangealg}
\label{KN-clustering}

[Kneser and Ney 1993]\footnote{R. Kneser and H. Ney,
\textbf{``Improved Clustering Techniques for Class-Based Statistical
Language Modelling''}; \textit{Proceedings of the European Conference
on Speech Communication and Technology} 1993, pp.~973--976} describes
an algorithm to build a class map by starting from some initial guess
at a solution and then iteratively searching for changes to improve
the existing class map. This is repeated until some minimum change
threshold has been reached or a chosen number of iterations have been
performed. The initial guess at a class map is typically chosen by a
simple method such as randomly distributing words amongst classes or
placing all words in the first class except for the most frequent
words, which are put into singleton classes. Potential moves are then
evaluated and those which increase the likelihood of the training text
most are applied to the class map. The algorithm is described in
detail below, and is implemented in the \HTK\ tool \htool{Cluster}.

Let $\mathcal{W}$ be the training text list of words $(w_1, w_2, w_3,
\ldots)$ and let $\mathbb{W}$ be the set of all words in
$\mathcal{W}$. From equation \ref{cond_prob_model} it follows that:
\begin{equation}
P_\mathrm{class}(\mathcal{W}) \;=\; \prod_{x, y \in \mathbb{W}}
P_\mathrm{class}(x \;|\; y)^{C(x,y)}
\label{classnorm_totprob}
\end{equation}
where $(x, y)$ is some word pair `$x$' preceded by `$y$' and $C(x, y)$
is the number of times that the word pair `$y$ $x$' occurs in the list
$\mathcal{W}$.
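Equation \ref{classnorm_totprob} is the quantity that the exchange
procedure seeks to maximise. The Python sketch below is a deliberately
naive rendering of the idea and not the \htool{Cluster}
implementation: it ignores all efficiency considerations and simply
re-scores the whole training text after each tentative move, keeping
any single-word move that increases the class bigram log likelihood.
It assumes the class map \texttt{cmap} covers every training word.
\begin{verbatim}
import math
from collections import Counter

def log_likelihood(words, cmap):
    """Class-bigram log likelihood of the training word list under cmap."""
    wc = Counter(words)
    cc = Counter(cmap[w] for w in words)
    cb = Counter((cmap[words[i - 1]], cmap[words[i]])
                 for i in range(1, len(words)))
    ll = 0.0
    for i in range(1, len(words)):
        w, v = words[i], words[i - 1]
        g, h = cmap[w], cmap[v]
        # P_class(w | v) = P(w | g) * P(g | h), estimated by ML counts
        ll += math.log(wc[w] / cc[g]) + math.log(cb[(h, g)] / cc[h])
    return ll

def exchange(words, cmap, classes, iterations=2):
    """Naive exchange: try moving each word to every class, keep improvements."""
    best = log_likelihood(words, cmap)
    for _ in range(iterations):
        for w in set(words):
            for g in classes:
                old = cmap[w]
                if g == old:
                    continue
                cmap[w] = g
                ll = log_likelihood(words, cmap)
                if ll > best:
                    best = ll          # keep the move
                else:
                    cmap[w] = old      # undo the move
    return cmap, best
\end{verbatim}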