% exampsys.tex
   -p 0.0 -s 5.0 dict tiedlist
\end{verbatim}
The options \texttt{-p} and \texttt{-s} set the
\textit{word insertion penalty}\index{word insertion penalty} and the
\textit{grammar scale factor},\index{grammar scale factor} respectively.
The word insertion penalty is a fixed value added to each token when it
transits from the end of one word to the start of the next. The grammar
scale factor is the amount by which the language model probability is
scaled before being added to each token as it transits from the end of
one word to the start of the next. These parameters can have a
significant effect on recognition performance and hence, some tuning on
development test data is well worthwhile.

The dictionary contains monophone transcriptions whereas the supplied
HMM list contains word internal triphones.
\htool{HVite}\index{hvite@\htool{HVite}} will make the necessary
conversions when loading the word network \texttt{wdnet}. However, if
the HMM list contained both monophones and context-dependent phones
then \htool{HVite} would become confused.
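Since suitable values for \texttt{-p} and \texttt{-s} are data-dependent, they are typically found by a simple grid search over a development test set. The sketch below only \emph{prints} the \htool{HVite} command for each candidate pair rather than running it; the script file name \texttt{test.scp}, the output MLF names and the candidate ranges are illustrative assumptions, and in practice each run would be scored with \htool{HResults} against the development references.

```shell
# Sketch of a grid search over the word insertion penalty (-p) and
# the grammar scale factor (-s).  The commands are only printed here;
# in practice each one would be executed and the resulting MLF scored
# with HResults to pick the best-performing pair.
for p in -10.0 -5.0 0.0 5.0; do
  for s in 1.0 5.0 10.0 15.0; do
    echo "HVite -H hmm15/macros -H hmm15/hmmdefs -S test.scp" \
         "-i out_p${p}_s${s}.mlf -w wdnet -p $p -s $s dict tiedlist"
  done
done > tune_cmds.txt
cat tune_cmds.txt
```

Selecting the pair that maximises the \htool{HResults} accuracy figure on the development data, rather than on the final test data, avoids tuning to the test set.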
The required form of word-internal network\index{networks!word-internal}
expansion can be forced by setting the configuration variable
\texttt{FORCECXTEXP}\index{forcecxtexp@\texttt{FORCECXTEXP}} to true and
\texttt{ALLOWXWRDEXP}\index{allowxwrdexp@\texttt{ALLOWXWRDEXP}} to false
(see chapter~\ref{c:netdict} for details).

\index{accuracy figure}Assuming that the MLF \texttt{testref.mlf}
contains word level transcriptions for each test
file\footnote{The \htool{HLEd} tool may have to be used to insert
silences at the start and end of each transcription or alternatively
\htool{HResults} can be used to ignore silences (or any other symbols)
using the \texttt{-e} option.}, the actual performance can be determined
by running \htool{HResults} as follows
\begin{verbatim}
   HResults -I testref.mlf tiedlist recout.mlf
\end{verbatim}
the result would be a print-out of the form
\begin{verbatim}
  ====================== HTK Results Analysis ==============
    Date: Sun Oct 22 16:14:45 1995
    Ref : testref.mlf
    Rec : recout.mlf
  ------------------------ Overall Results -----------------
  SENT: %Correct=98.50 [H=197, S=3, N=200]
  WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
  ==========================================================
\end{verbatim}
The line starting with \texttt{SENT:} indicates that of the 200 test
utterances, 197 (98.50\%) were correctly recognised. The following line
starting with \texttt{WORD:} gives the word level statistics and
indicates that of the 855 words in total, 853 (99.77\%) were recognised
correctly. There was 1 deletion error (\texttt{D}), 1
substitution\index{recognition!results analysis} error (\texttt{S}) and
1 insertion error (\texttt{I}). The accuracy figure (\texttt{Acc}) of
99.65\% is lower than the percentage correct (\texttt{Corr}) because it
takes account of the insertion errors which the latter ignores: in terms
of the counts, \texttt{\%Corr} $= (N-D-S)/N$ whereas \texttt{Acc}
$= (N-D-S-I)/N$.
\centrefig{step11}{120}{Step 11}

\mysect{Running the Recogniser Live}{egreclive}

The recogniser can also be run with live input\index{live input}.
\index{recognition!direct audio input}To do this it is only necessary to
set the configuration variables needed to convert the input audio to the
correct form of parameterisation. Specifically, the following needs to
be appended to the configuration file \texttt{config} to create a new
configuration file \texttt{config2}
\begin{verbatim}
   # Waveform capture
   SOURCERATE=625.0
   SOURCEKIND=HAUDIO
   SOURCEFORMAT=HTK
   ENORMALISE=F
   USESILDET=T
   MEASURESIL=F
   OUTSILWARN=T
\end{verbatim}
These indicate that the source is direct audio with sample period
62.5$\mu$secs. The silence detector is enabled and a measurement of the
background speech/silence levels should be made at start-up. The final
line makes sure that a warning is printed when this silence measurement
is being made.

Once the configuration file has been set-up for direct audio input,
\htool{HVite} can be run as in the previous step except that no files
need be given as arguments
\begin{verbatim}
   HVite -H hmm15/macros -H hmm15/hmmdefs -C config2 \
         -w wdnet -p 0.0 -s 5.0 dict tiedlist
\end{verbatim}
On start-up, \htool{HVite} will prompt the user to speak an arbitrary
sentence (approx.\ 4 secs) in order to measure the speech and background
silence levels. It will then repeatedly recognise and, if trace level
bit 1 is set, it will output each utterance to the terminal. A typical
session is as follows\index{recognition!output}
\begin{verbatim}
   Read 1648 physical / 4131 logical HMMs
   Read lattice with 26 nodes / 52 arcs
   Created network with 123 nodes / 151 links

   READY[1]>
   Please speak sentence - measuring levels
   Level measurement completed
   DIAL FOUR SIX FOUR TWO FOUR OH
    == [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)

   READY[2]>
   DIAL ZERO EIGHT SIX TWO
    == [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)

   READY[3]> etc
\end{verbatim}
During loading, information will be printed out regarding the different
recogniser components.
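The per-utterance figures in the trace above are related in a simple way: the average log likelihood per frame is the combined acoustic (\texttt{Ac}) and language model (\texttt{LM}) score divided by the number of frames. The quick check below (a sketch; the output file name is arbitrary) reproduces the printed averages up to the rounding of the displayed \texttt{Ac} and \texttt{LM} values.

```shell
# Check that the per-frame score in the trace output equals the
# combined acoustic (Ac) and language model (LM) score divided by
# the number of frames.  Values taken from the two utterances above.
awk 'BEGIN {
  printf "utt1 avg=%.4f\n", (-28630.2 + -329.8) / 303
  printf "utt2 avg=%.4f\n", (-22402.2 + -255.5) / 228
}' > avgcheck.txt
cat avgcheck.txt
```

The results (-95.5776 and -99.3759) agree with the printed -95.5773 and -99.3758 to within the one-decimal rounding of the displayed scores.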
The physical models are the distinct HMMs used by the system, while the
logical models include all model names. The number of logical models is
higher than the number of physical models because many logically
distinct models have been determined to be physically identical and have
been merged during the previous model building steps. The lattice
information refers to the number of links and nodes in the recognition
syntax. The network information refers to the actual recognition network
built by expanding the lattice using the current HMM set, dictionary and
any context expansion rules specified. After each utterance, the
numerical information gives the total number of frames, the average log
likelihood per frame, the total acoustic score, the total language model
score and the average number of models active.

Note that if it was required to recognise a new name, then the following
two changes would be needed
\begin{enumerate}
\item the grammar would be altered to include the new name
\item a pronunciation for the new name would be added to the dictionary
\end{enumerate}
If the new name required triphones which did not exist, then they could
be created by loading the existing triphone set into
\htool{HHEd}\index{hhed@\htool{HHEd}}, loading the decision trees using
the \texttt{LT} command\index{lt@\texttt{LT} command} and then using the
\texttt{AU} command\index{au@\texttt{AU} command} to generate a new
complete triphone set.\index{triphones!synthesising unseen}

\mysect{Adapting the HMMs}{exsysadapt}

The previous sections have described the stages required to build a
simple voice dialling system. To simplify this process, speaker
dependent models were developed using training data from a single user.
Consequently, recognition accuracy for any other users would be poor. To
overcome this limitation, a set of speaker independent models could be
constructed, but this would require large amounts of training data from
a variety of speakers.
An alternative is to adapt the current speaker dependent models to the
characteristics of a new speaker using a small amount of training or
adaptation data\index{adaptation}. In general, adaptation techniques are
applied to well trained speaker independent model sets to enable them to
better model the characteristics of particular speakers.

\HTK\ supports both supervised
adaptation\index{adaptation!supervised adaptation}, where the true
transcription of the data is known, and unsupervised
adaptation\index{adaptation!unsupervised adaptation}, where the
transcription is hypothesised. In \HTK, supervised adaptation is
performed offline by \htool{HEAdapt} using maximum likelihood linear
regression (MLLR)\index{adaptation!MLLR} and/or maximum
a-posteriori (MAP)\index{adaptation!MAP} techniques to estimate a series
of transforms or a transformed model set that reduces the mismatch
between the current model set and the adaptation data. Unsupervised
adaptation is provided by \htool{HVite}
(see section~\ref{s:unsup_adapt}), using just MLLR. The following
sections describe offline supervised adaptation (using MLLR) with the
use of \htool{HEAdapt}.

\subsection{Step 12 - Preparation of the Adaptation Data}

As in normal recogniser development, the first stage in adaptation
involves data preparation. Speech data from the new user is required for
both adapting the models and testing the adapted system. The data can be
obtained in a similar fashion to that taken to prepare the original test
data. Initially, prompt lists for the adaptation and test data will be
generated using \htool{HSGen}. For example, typing
\begin{verbatim}
   HSGen -l -n 20 wdnet dict > promptsAdapt
   HSGen -l -n 20 wdnet dict > promptsTest
\end{verbatim}
\noindent
would produce two prompt files for the adaptation and test data.
The amount of adaptation data required will normally be found
empirically, but a performance improvement should be observable after
just 30 seconds of speech. In this case, around 20 utterances should be
sufficient. \htool{HSLab} can be used to record the associated speech.

Assuming that the script files \texttt{codeAdapt.scp} and
\texttt{codeTest.scp} list the source and output files for the
adaptation and test data respectively, then both sets of speech can be
coded using the \htool{HCopy} commands given below.
\begin{verbatim}
   HCopy -C config -S codeAdapt.scp
   HCopy -C config -S codeTest.scp
\end{verbatim}
\noindent
The final stage of preparation involves generating context dependent
phone transcriptions of the adaptation data and word level
transcriptions of the test data for use in adapting the models and
evaluating their performance. The transcriptions of the test data can be
obtained using \texttt{prompts2mlf}.

To minimise the problem of multiple pronunciations, the phone level
transcriptions of the adaptation data can be obtained by using
\htool{HVite} to perform a \textit{forced alignment} of the adaptation
data. Assuming that word level transcriptions are listed in
\texttt{adaptWords.mlf}, then the following command will place the phone
transcriptions in \texttt{adaptPhones.mlf}.
\begin{verbatim}
   HVite -l '*' -o SWT -b silence -C config -a -H hmm15/macros \
         -H hmm15/hmmdefs -i adaptPhones.mlf -m -t 250.0 \
         -I adaptWords.mlf -y lab -S adapt.scp dict tiedlist
\end{verbatim}

\subsection{Step 13 - Generating the Transforms}
\index{adaptation!generating transforms}

\htool{HEAdapt} provides two forms of MLLR adaptation depending on the
amount of adaptation data available. If only small amounts are
available, a single global transform\index{adaptation!global transforms}
can be generated and applied to every output distribution of every
model.
As more adaptation data becomes available, more specific transforms can
be generated for particular groups of Gaussians. To identify the number
of transforms that can be estimated using the current adaptation data,
\htool{HEAdapt}\index{headapt@\htool{HEAdapt}} uses a regression class
tree\index{adaptation!regression tree} to cluster together groups of
output distributions that are to undergo the same transformation. The
\HTK\ tool \htool{HHEd} can be used to build a regression class tree and
store it as part of the HMM set. For example,
\begin{verbatim}
   HHEd -B -H hmm15/macros -H hmm15/hmmdefs -M hmm16 regtree.hed tiedlist
\end{verbatim}
\noindent
creates a regression class tree using the models stored in
\texttt{hmm15}. The models are written out to the \texttt{hmm16}
directory together with the regression class tree information. The
\htool{HHEd} edit script \texttt{regtree.hed} contains the following
commands
\begin{verbatim}
   RN "models"
   LS "stats"
   RC 32 "rtree"
\end{verbatim}
\noindent
The \texttt{RN}\index{rn@\texttt{RN} command} command assigns an
identifier to the HMM set. The \texttt{LS}\index{ls@\texttt{LS} command}
command loads the state occupation statistics file \texttt{stats}
generated by the last application of \htool{HERest} which created the
models in \texttt{hmm15}. The \texttt{RC}\index{rc@\texttt{RC} command}
command then attempts to build a regression class tree with 32 terminal
or leaf nodes using these statistics.

\htool{HEAdapt} can be used to perform either static adaptation, where
all the adaptation data is processed in a single block, or incremental
adaptation, where adaptation is performed after a specified number of
utterances; this is controlled by the \texttt{-i} option. In this
tutorial the default setting of static adaptation will be used.

A typical use of \htool{HEAdapt} involves two passes. On the first pass
a global adaptation is performed. The second pass then uses the global
transformation to transform the model s