% hlmfund.tex
$\mathcal{W}$. In general, evaluating equation \ref{classnorm_totprob}
will lead to problematically small values, so logarithms can be used:
\begin{equation}
\log P_\mathrm{class}(\mathcal{W}) \;=\; \sum_{x, y \in \mathbb{W}}
  C(x, y) . \log P_\mathrm{class}(x \;|\; y)
\label{classnorm_logprob}
\end{equation}
Given the definition of a class $n$-gram model in equation
\ref{normclass}, the maximum likelihood bigram probability estimate of
a word is:
\begin{eqnarray}
P_\mathrm{class}(w_i \;|\; w_{i-1}) & = & \frac{C(w_i)}{C(G(w_i))}
  \times \frac{C\left(G(w_i), G(w_{i-1})\right)}{C(G(w_{i-1}))}
\label{classnorm_breakdown}
\end{eqnarray}
where $C(w)$ is the number of times that the word `$w$' occurs in the
list $\mathcal{W}$ and $C(G(w))$ is the number of times that the class
$G(w)$ occurs in the list resulting from applying $G(.)$ to each entry
of $\mathcal{W}$;\footnote{That is, $C(G(w))=\sum_{x:G(x)=G(w)}C(x)$.}
similarly $C(G(w_x), G(w_y))$ is the count of the class pair
`$G(w_y)$ $G(w_x)$' in that resultant list.

Substituting equation \ref{classnorm_breakdown} into equation
\ref{classnorm_logprob} and then rearranging gives:
\begin{eqnarray}
\log P_\mathrm{class}(\mathcal{W}) & \;=\; &
  \sum_{x,y \in \mathbb{W}} C(x,y) . \log\left( \frac{C(x)}{C(G(x))}
  \times \frac{C(G(x),G(y))}{C(G(y))} \right) \nonumber\\
&\;=\;& \sum_{x,y \in \mathbb{W}} C(x,y) . \log
  \left(\frac{C(x)}{C(G(x))}\right) \;+\; \sum_{x,y \in \mathbb{W}}
  C(x,y) . \log \left(\frac{C(G(x),G(y))}{C(G(y))}\right) \nonumber\\
&\;=\;& \sum_{x \in \mathbb{W}} C(x) . \log
  \left(\frac{C(x)}{C(G(x))}\right) \;+\; \sum_{g,h \in \mathbb{G}}
  C(g,h) . \log \left(\frac{C(g,h)}{C(h)}\right) \nonumber\\
&\;=\;& \sum_{x \in \mathbb{W}} C(x) . \log C(x) \;-\;
  \sum_{x \in \mathbb{W}} C(x) . \log C(G(x)) \nonumber\\
&&\;+\; \sum_{g,h \in \mathbb{G}} C(g,h) . \log C(g,h) \;-\;
  \sum_{g \in \mathbb{G}} C(g) . \log C(g) \nonumber\\
&\;=\;& \sum_{x \in \mathbb{W}} C(x) . \log C(x) \;+\;
  \sum_{g,h \in \mathbb{G}} C(g,h) . \log C(g,h) \nonumber\\
&&\;-\; 2 \sum_{g \in \mathbb{G}} C(g) . \log C(g)
\label{classnorm_ml}
\end{eqnarray}
where $(g,h)$ is some class sequence `$h$ $g$'.
Note that the first of these three terms in the final stage of equation
\ref{classnorm_ml}, ``$\sum_{x \in \mathbb{W}} C(x) . \log C(x)$'', is
independent of the class map function $G(.)$, therefore it is not
necessary to consider it when optimising $G(.)$. The value a class
map must seek to maximise, $F_{\mathrm{M}_\mathrm{C}}$, can now be
defined:
\begin{eqnarray}
F_{\mathrm{M}_\mathrm{C}} &\;=\;& \sum_{g,h \in \mathbb{G}} C(g,h) .
  \log C(g,h) \;-\; 2 \sum_{g \in \mathbb{G}} C(g) . \log C(g)
\label{classnorm_Fml}
\end{eqnarray}
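As an illustration only, $F_{\mathrm{M}_\mathrm{C}}$ can be computed
directly from the class unigram and class bigram counts implied by a
candidate class map. The sketch below is not part of any HTK tool and
its names (\texttt{f\_mc}, \texttt{class\_map}) are purely
illustrative; the training text is assumed to be available as a simple
list of words. Only the two $G(.)$-dependent terms of equation
\ref{classnorm_ml} are computed, since the first term plays no part in
the optimisation.
\begin{verbatim}
from collections import Counter
from math import log

def f_mc(words, class_map):
    # Map the training text onto its class sequence using G(.)
    classes = [class_map[w] for w in words]
    # Class unigram counts C(g) and class bigram counts C(g,h),
    # where (g,h) is the class h followed by the class g.
    uni = Counter(classes)
    bi = Counter(zip(classes[1:], classes[:-1]))
    # F_MC = sum C(g,h).log C(g,h)  -  2 sum C(g).log C(g)
    return (sum(c * log(c) for c in bi.values())
            - 2.0 * sum(c * log(c) for c in uni.values()))
\end{verbatim}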
A fixed number of classes must be decided before running the
algorithm, which can now be formally defined:
\begin{center}
\framebox[13.5cm]{\parbox{12cm}{
\vspace{0.5cm}
\begin{enumerate}
\item {\bfseries Initialise}:\label{Cstepone} $\forall w \in
  \mathbb{W}:\; G(w) = 1$\\
  Set up the class map so that all words are in the first class and all
  other classes are empty ({\it or} initialise using some other scheme)
\item {\bfseries Iterate}: $\forall i \in \{1\ldots n\} \; \wedge \; \neg s$\\
  For a given number of iterations $1 \ldots n$ or until some stop
  criterion $s$ is fulfilled
  \begin{enumerate}
  \item {\bfseries Iterate}: $\forall w \in \mathbb{W}$\\
    For each word $w$ in the vocabulary
    \begin{enumerate}\label{Csteptwo}
    \item {\bfseries Iterate}: $\forall c \in \mathbb{G}$\\
      For each class $c$
      \begin{enumerate}
      \item {\bfseries Move} word $w$ to class $c$, remembering its
        previous class
      \item {\bfseries Calculate} the change in
        $F_{\mathrm{M}_\mathrm{C}}$ for this move
      \item {\bfseries Move} word $w$ back to its previous class
      \end{enumerate}
    \item {\bfseries Move} word $w$ to the class which increased
      $F_{\mathrm{M}_\mathrm{C}}$ by the most, or do not move it if no
      move increased $F_{\mathrm{M}_\mathrm{C}}$
    \end{enumerate}
  \end{enumerate}
\end{enumerate}
\vspace{0.5cm}
}}
\end{center}
The initialisation scheme given here in step \ref{Cstepone} represents
a word unigram language model, making no assumptions about which words
should belong in which class.\footnote{Given this initialisation,
however, the first $(|\mathbb{G}|-1)$ moves will simply place words
into empty classes, since the class map which maximises
$F_{\mathrm{M}_\mathrm{C}}$ is the one which places each word into a
singleton class.} The algorithm is greedy and so can become stuck in a
local maximum; it is therefore not guaranteed to find the optimal class
map for the training text. In practice, however, the algorithm is
rarely run until total convergence, and it is found that an extra
iteration can compensate for even a deliberately poor choice of
initialisation.

The above algorithm requires the number of classes to be fixed before
running. It should be noted that as the number of classes utilised
increases, the overall likelihood of the training text will tend
towards that of the word model.\footnote{Which will be higher, given
maximum likelihood estimates.} This is why the algorithm does not
itself modify the number of classes, otherwise it would na\"{\i}vely
converge on $|\mathbb{W}|$ classes.
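The structure of the boxed algorithm can be sketched as follows. This
version simply recomputes $F_{\mathrm{M}_\mathrm{C}}$ with the
illustrative \texttt{f\_mc} function above for every candidate move,
which is far too slow for realistic vocabularies (a practical
implementation updates only the counts affected by each move), but it
makes the control flow explicit; all names are hypothetical.
\begin{verbatim}
def exchange_cluster(words, num_classes, iterations=2):
    vocab = sorted(set(words))
    # Initialise: place every word in the first class.
    class_map = {w: 0 for w in vocab}
    for _ in range(iterations):
        for w in vocab:
            best_c = class_map[w]
            best_f = f_mc(words, class_map)
            for c in range(num_classes):
                class_map[w] = c              # tentatively move w to c
                f = f_mc(words, class_map)
                if f > best_f:                # remember the best move
                    best_c, best_f = c, f
            class_map[w] = best_c             # keep the best move, if any
    return class_map
\end{verbatim}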
\mysect{Robust model estimation}{HLMrobustestimates}
\label{robust_estimation}

Given a suitably large amount of training data, an extremely long
$n$-gram could be trained to give a very good model of language, as per
equation \ref{cond_prob_model} -- in practice, however, any model that
can actually be built must be an approximation. Because it is an
approximation, it is detrimental to include within the model
information which is in fact just noise introduced by the limits of the
bounded sample set used to train the model -- such information may not
accurately represent text not contained within the training corpus. In
the same way, word sequences which were not observed in the training
text cannot be assumed to represent impossible sequences, so some
probability mass must be reserved for these. The issue of how to
redistribute the probability mass assigned by the maximum likelihood
estimates, derived from the raw statistics of a specific corpus, into a
sensible estimate of real-world text is addressed by various standard
methods, all of which aim to create more robust language models.

\subsection{Estimating probabilities}
\label{discounting_and_other_fun_things}

Language models seek to estimate the probability of each possible word
sequence event occurring. In order to calculate maximum likelihood
estimates this set of events must be finite so that the language model
can ensure that the sum of the probabilities of all events is 1 given
some context. In an $n$-gram model the combination of the finite
vocabulary and fixed length history limits the number of unique events
to $|\mathbb{W}|^n$. For any $n>1$ it is highly unlikely that all word
sequence events will be encountered in the training corpora, and many
that do occur may appear only once or twice. A language model should
not give any unseen event zero probability,\footnote{If it did then
from equation \ref{cond_prob_model} it follows that the probability of
any piece of text containing that event would also be zero, and that
text would have infinite perplexity.} but without an infinite quantity
of training text it is almost certain that there will be events it does
not encounter during training, so various mechanisms have been
developed to redistribute probability within the model such that these
unseen events are given some non-zero probability.

As in equation \ref{ngramcountdiv}, the maximum likelihood estimate of
the probability of an event $\mathcal{A}$ occurring is defined by the
number of times that event is observed, $a$, and the total number of
observations in the training set, $A$, where
$P(\mathcal{A}) = \frac{a}{A}$. With this definition, events that do
not occur in the training data are assigned zero probability since it
will be the case that $a=0$. [Katz 1987]\footnote{S.M. Katz,
\textbf{``Estimation of Probabilities from Sparse Data for the
Language Model Component of a Speech Recogniser''}; \textit{IEEE
Transactions on Acoustics, Speech and Signal Processing} 1987, vol. 35,
no. 3, pp. 400-401} suggests multiplying each observed count by a
discount coefficient, $d_a$, which is dependent upon the number of
times the event is observed, $a$, such that $a' = d_a \,.\, a$. Using
this discounted occurrence count, the probability of an event that
occurs $a$ times now becomes
$P_\mathrm{discount}(\mathcal{A}) = \frac{a'}{A}$.
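Purely for illustration, discounting of this general form can be
written as below, where \texttt{d} is assumed to be a dictionary
mapping each occurrence count $a$ to its coefficient $d_a$; no
particular discounting scheme is implied and the names are
hypothetical.
\begin{verbatim}
def ml_prob(a, A):
    # Maximum likelihood estimate: P(A) = a / A.
    return a / A

def discounted_prob(a, A, d):
    # Discounted estimate: P_discount(A) = (d_a . a) / A.
    # Counts with no entry in d are left undiscounted (d_a = 1).
    return d.get(a, 1.0) * a / A
\end{verbatim}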
Different discounting schemes have been proposed that define the
discount coefficient, $d_a$, in specific ways. The same discount
coefficient is used for all events that occur the same number of times,
on the basis of the symmetry requirement that two events which occur
with equal frequency, $a$, must have the same probability, $p_a$.
Defining $c_a$ as the number of events that occur exactly $a$ times,
such that $A = \sum_{a\ge 1} a\,.\,c_a$, it follows that the total
amount of reserved mass, left over for distribution amongst the unseen
events, is $1 \,-\, \frac{1}{A}\sum_{a\ge 1} d_a\,.\,c_a\,.\,a$.

\subsubsection{Discounting}

In [Good 1953]\footnote{I.J. Good, \textbf{``The Population Frequencies
of Species and the Estimation of Population Parameters''};
\textit{Biometrika} 1953, vol. 40, no. 3/4, pp. 237-264}
a method of discounting maximum likelihood estimates was proposed
whereby the count of an event occurring $a$ times is discounted with
\begin{equation}
d_a = (a+1) \frac{c_{a+1}}{a\,.\,c_a}
\end{equation}
A problem with this scheme, referred to as {\it Good-Turing}
discounting, is that due to the count in the denominator it will fail
whenever $c_a = 0$ for some $a$ while a higher count $c_b > 0$ with
$b > a$ still exists. Inevitably, as $a$ increases the count $c_a$ will
tend towards zero, and for high $a$ there are likely to be many such
zero counts. A solution to this problem was proposed in [Katz 1987],
which defines a cut-off value $k$ above which counts are not
discounted, so that $d_a = 1$ for $a > k$\footnote{It is suggested that
``$k=5$ or so is a good choice''.} -- this is justified by considering
these more frequently observed counts as reliable and therefore not
needing to be discounted. Katz then describes a revised discount
equation which preserves the same amount of mass for the unseen events:
\begin{equation}
d_a = \left\{ \begin{array}{c@{\quad:\quad}l}
\frac{(a+1) \frac{c_{a+1}}{a\,.\,c_a} \;-\; (k+1)\frac{c_{k+1}}{c_1}}
     {1 - (k+1)\frac{c_{k+1}}{c_1}} & 1 \le a \le k\\
1 & a>k
\end{array}\right.
\end{equation}
This method is itself unstable, however -- if, for example,
$(k+1)\,.\,c_{k+1} > c_1$ then the denominator becomes negative and the
resulting coefficients $d_a$ are no longer valid discounts for
$1 \le a \le k$.

\subsubsection{Absolute discounting}

An alternative discounting method is {\it absolute}
discounting,\footnote{H. Ney, U. Essen and R. Kneser, \textbf{``On
Structuring Probabilistic Dependences in Stochastic Language
Modelling''}; \textit{Computer Speech and Language} 1994, vol. 8,
no. 1, pp. 1-38} in which a constant value $m$ is subtracted from each
count. The effect of this is that the events with the lowest counts are
discounted relatively more than those with higher counts. The discount
coefficient is defined as
\begin{equation}
d_a = \frac{a-m}{a}
\end{equation}
In order to discount the same amount of probability mass as the
Good-Turing estimate, $m$ must be set to:
\begin{equation}
m = \frac{c_1}{\sum_{a\ge 1} c_a}
\end{equation}
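The discount coefficients described above can be sketched as follows,
where \texttt{cc} is assumed to be a dictionary of count-of-counts with
\texttt{cc[a]} $= c_a$. The functions are illustrative only; they
assume $c_1 > 0$, and the Katz variant further assumes $c_a > 0$ for
$1 \le a \le k$ and $(k+1)\,.\,c_{k+1} < c_1$.
\begin{verbatim}
def good_turing_discounts(cc, max_a):
    # d_a = (a+1) . c_{a+1} / (a . c_a); fails wherever c_a is zero.
    return {a: (a + 1) * cc.get(a + 1, 0) / (a * cc[a])
            for a in range(1, max_a + 1)}

def katz_discounts(cc, k=5):
    # Good-Turing discounts below the cut-off k, rescaled so that the
    # same total mass is reserved; counts above k are not discounted.
    mu = (k + 1) * cc.get(k + 1, 0) / cc[1]
    d = {}
    for a in range(1, k + 1):
        gt = (a + 1) * cc.get(a + 1, 0) / (a * cc[a])
        d[a] = (gt - mu) / (1.0 - mu)
    return d                     # implicitly d_a = 1 for a > k

def absolute_discounts(cc, m):
    # d_a = (a - m) / a for a constant offset m.
    return {a: (a - m) / a for a in cc}
\end{verbatim}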
\mysubsect{Smoothing probabilities}{HLMsmoothingprobs}
\label{smoothing_probs}

The above discounting schemes present various methods of
redistributing probability mass from observed events to unseen events.
Additionally, if events are infrequently observed then they can be
smoothed with less precise but more frequently observed events. In
[Katz 1987] a {\it back off} scheme is proposed and used alongside
Good-Turing discounting. In this method probabilities are redistributed
via the recursive utilisation of lower level conditional distributions.
In the $n$-gram case, if the $n$-tuple is not observed frequently
enough in the training text then a probability based on the occurrence
count of the shorter-context $(n-1)$-tuple is used instead -- using the
shorter context estimate is referred to as {\it backing off}. In
practice, probabilities are typically considered badly estimated if
their corresponding word sequences are not explicitly stored in the
language model, either because they did not occur in the training text
or because they have been discarded by some pruning mechanism. Katz
defines a function $\hat{\beta}(w_{i-n+1}, \ldots, w_{i-1})$ which
represents the total probability of all the unseen events in a
particular context.
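Schematically, backing off from a bigram to a unigram estimate can be
written as below. The table names and the back-off weight
\texttt{alpha} are hypothetical and this is not the storage format used
by the HLM tools; in Katz's formulation the weight for each context is
derived from $\hat{\beta}$ so that the conditional distribution still
sums to one.
\begin{verbatim}
def backed_off_bigram(w, h, bigram_p, unigram_p, alpha):
    # bigram_p holds the discounted estimates that are explicitly
    # stored in the model; anything else backs off to a weighted
    # unigram estimate for the context h.
    if (h, w) in bigram_p:
        return bigram_p[(h, w)]
    return alpha[h] * unigram_p[w]
\end{verbatim}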