%/* ----------------------------------------------------------- */
%/*                                                             */
%/*                          ___                                */
%/*                       |_| | |_/   SPEECH                    */
%/*                       | | | | \   RECOGNITION               */
%/*                       =========   SOFTWARE                  */
%/*                                                             */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/* developed at:                                               */
%/*                                                             */
%/*      Speech Vision and Robotics group                       */
%/*      Cambridge University Engineering Department            */
%/*      http://svr-www.eng.cam.ac.uk/                          */
%/*                                                             */
%/*      Entropic Cambridge Research Laboratory                 */
%/*      (now part of Microsoft)                                */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/*         Copyright: Microsoft Corporation                    */
%/*          1995-2000 Redmond, Washington USA                  */
%/*                    http://www.microsoft.com                 */
%/*                                                             */
%/*              2001  Cambridge University                     */
%/*                    Engineering Department                   */
%/*                                                             */
%/*   Use of this software is governed by a License Agreement   */
%/*    ** See the file License for the Conditions of Use  **    */
%/*    **     This banner notice must not be removed      **    */
%/*                                                             */
%/* ----------------------------------------------------------- */
%% HTKBook - Steve Young 24/11/97
%% revised by JBA and VV

\mychap{A Tutorial Example of Using HTK}{exampsys}

\sidepic{recipe}{80}{
This final chapter of the tutorial part of the book will describe the
construction of a recogniser for simple voice dialling applications. This
recogniser will be designed to recognise continuously spoken digit strings and
a limited set of names. It is sub-word based so that adding a new name to the
vocabulary involves only modification to the pronouncing dictionary and task
grammar. The HMMs will be continuous density mixture Gaussian tied-state
triphones with clustering performed using phonetic decision trees. Although
the voice dialling task itself is quite simple, the system design is
general-purpose and would be useful for a range of applications.
}

The system will be built from scratch even to the extent of recording training
and test data using the \HTK\ tool \htool{HSLab}.
To make this tractable, the
system will be speaker dependent\footnote{The final stage of the tutorial
deals with adapting the speaker dependent models for new speakers.}, but the
same design would be followed to build a speaker independent system. The only
difference is that data would be required from a large number of speakers and
there would be a consequential increase in model complexity.

Building a speech recogniser from scratch involves a number of inter-related
subtasks and pedagogically it is not obvious what the best order is to present
them. In the presentation here, the ordering is chronological so that in effect
the text provides a recipe that could be followed to construct a similar
system. The entire process is described in considerable detail in order to
give a clear view of the range of functions that \HTK\ addresses and thereby to
motivate the rest of the book.

The \HTK\ software distribution also contains an example of constructing a
recognition system for the 1000 word ARPA Naval Resource Management Task. This
is contained in the directory \texttt{RMHTK} of the \HTK\ distribution.
Further demonstration of \HTK's capabilities can be found in the directory
\texttt{HTKDemo}. Some example scripts that may be of assistance during the
tutorial are available in the \texttt{HTKTutorial} directory.

At each step of the tutorial presented in this chapter, the user is advised to
read the entire section thoroughly before executing the commands, and also to
consult the reference section for each \HTK\ tool being introduced
(chapter~\ref{c:toolref}), so that all command line options and arguments are
clearly understood.

\mysect{Data Preparation}{egdataprep}

The first stage of any recogniser development project is data preparation.
\index{data preparation} Speech data is needed both for training and for
testing. In the system to be built here, all of this speech will be recorded
from scratch, and to do this scripts are needed to prompt for each sentence.
In the case of the test data, these prompt scripts will also provide the
reference transcriptions against which the recogniser's performance can be
measured, and a convenient way to create them is to use the task grammar as a
random generator. In the case of the training data, the prompt scripts will be
used in conjunction with a pronunciation dictionary to provide the initial
phone level transcriptions needed to start the HMM training process. Since the
application requires that arbitrary names can be added to the recogniser,
training data with good phonetic balance and coverage is needed. Here for
convenience the prompt scripts needed for training are taken from the TIMIT
acoustic-phonetic database.

It follows from the above that before the data can be recorded, a phone set
must be defined, a dictionary must be constructed to cover both training and
testing, and a task grammar must be defined.

\subsection{Step 1 - the Task Grammar}

The goal of the system to be built here is to provide a voice-operated
interface for phone dialling. Thus, the recogniser must handle digit strings
and also personal name lists. Examples of typical inputs might be
\begin{quote}
Dial three three two six five four\\
Dial nine zero four one oh nine\\
Phone Woodland\\
Call Steve Young
\end{quote}
\HTK\ provides a grammar definition language for
specifying simple task grammars\index{task grammar} such as this. It consists
of a set of variable definitions followed by a regular expression describing
the words to recognise. For the voice dialling application, a suitable grammar
might be
\begin{verbatim}
    $digit = ONE | TWO | THREE | FOUR | FIVE |
             SIX | SEVEN | EIGHT | NINE | OH | ZERO;
    $name  = [ JOOP ] JANSEN |
             [ JULIAN ] ODELL |
             [ DAVE ] OLLASON |
             [ PHIL ] WOODLAND |
             [ STEVE ] YOUNG;
    ( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )
\end{verbatim}
where the vertical bars denote alternatives, the square brackets denote
optional items and the angle brackets denote one or more repetitions.
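
To make the notation concrete, the following Python sketch draws random word
sequences that the grammar above accepts. It is an illustration only and not
part of \HTK: the function name \texttt{random\_sentence}, the upper limit
\texttt{max\_digits} on the length of a digit string (the grammar itself
allows any number of repetitions) and the equal choice probabilities are all
assumptions made for this example.
\begin{verbatim}
import random

# $digit: eleven alternatives separated by vertical bars
DIGITS = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
          "SEVEN", "EIGHT", "NINE", "OH", "ZERO"]

# $name: (optional first name, surname) pairs
NAMES = [("JOOP", "JANSEN"), ("JULIAN", "ODELL"), ("DAVE", "OLLASON"),
         ("PHIL", "WOODLAND"), ("STEVE", "YOUNG")]


def random_sentence(rng, max_digits=7):
    """Return one word sequence accepted by the voice dialling grammar.

    max_digits is an assumed cap; the grammar allows unbounded repetition.
    """
    if rng.random() < 0.5:
        # DIAL <$digit>: angle brackets mean one or more repetitions
        count = rng.randint(1, max_digits)
        body = ["DIAL"] + [rng.choice(DIGITS) for _ in range(count)]
    else:
        # (PHONE|CALL) $name: the first name in square brackets is optional
        first, last = rng.choice(NAMES)
        body = [rng.choice(["PHONE", "CALL"])]
        if rng.random() < 0.5:
            body.append(first)
        body.append(last)
    return ["SENT-START"] + body + ["SENT-END"]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(random_sentence(rng)))
\end{verbatim}
Every generated sequence starts with \texttt{SENT-START}, ends with
\texttt{SENT-END} and contains either a digit string or a name, mirroring the
two branches of the top-level expression.
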
The complete grammar can be depicted as a network as shown in
Fig.~\href{f:dialnet}.

\centrefig{dialnet}{110}{Grammar for Voice Dialling}

\sidefig{step1}{25}{Step 1}{-4}{
The above high level representation of a task grammar
is provided for user convenience. The \HTK\ recogniser actually requires a
word network to be defined using a low level notation
called \HTK\ Standard Lattice Format\index{standard lattice format}
(SLF)\index{SLF}
in which each word instance and each word-to-word transition
is listed explicitly. This word network can be created
automatically from the grammar above using the \htool{HParse}
tool. Thus, assuming that the file \texttt{gram} contains the
above grammar, executing}
\index{hparse@\htool{HParse}}
\begin{verbatim}
    HParse gram wdnet
\end{verbatim}
will create an equivalent word network in the file \texttt{wdnet}
(see Fig.~\href{f:step1}).

\subsection{Step 2 - the Dictionary}

The first step in building a dictionary is to create a sorted list of the
required words. In the telephone dialling task pursued here, it is quite easy
to create a list of required words by hand. However, if the task were more
complex, it would be necessary to build a word list from the sample sentences
present in the training data. Furthermore, to build robust acoustic models, it
is necessary to train them on a large set of sentences containing many words
and preferably phonetically balanced. For these reasons, the training data will
consist of English sentences unrelated to the phone recognition task. Below, a
short example of creating a word list from sentence prompts will be given. As
noted above, the training sentences given here are extracted from some prompts
used with the TIMIT database\index{TIMIT database} and for convenience they
have been renumbered.
For example, the first few items might be as follows
\vspace{1cm}
\begin{verbatim}
    S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS
    S0002 TWO OTHER CASES ALSO WERE UNDER ADVISEMENT
    S0003 BOTH FIGURES WOULD GO HIGHER IN LATER YEARS
    S0004 THIS IS NOT A PROGRAM OF SOCIALIZED MEDICINE
    etc
\end{verbatim}
The desired training word list\index{word list} (\texttt{wlist}) could then be
extracted automatically from these. Before using \HTK, one would need to edit
the text into a suitable format. For example, it would be necessary to change
all white space to newlines and then to use the UNIX utilities \texttt{sort}
and \texttt{uniq} to sort the words into a unique alphabetically ordered set,
with one word per line. The script \texttt{prompts2wlist} from the
\texttt{HTKTutorial} directory can be used for this purpose.

The dictionary\index{dictionary!construction}\index{dictionary!format} itself
can be built from a standard source using
\htool{HDMan}\index{hdman@\htool{HDMan}}.
For this example, the British English BEEP pronouncing dictionary will be
used\footnote{Available by anonymous ftp from
\texttt{svr-ftp.eng.cam.ac.uk/pub/comp.speech/dictionaries/beep.tar.gz}.
Note that items beginning with unmatched quotes, found at the start
of the dictionary, should be removed.}. Its phone set will be adopted without
modification except that the stress marks will be removed and a short-pause
(\texttt{sp}) will be added to the end of every pronunciation. If the
dictionary contains any silence markers then the \texttt{MP} command will merge
the \texttt{sil} and \texttt{sp} phones into a single \texttt{sil}.
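
The effect of these three changes on a single pronunciation can be sketched in
Python as follows. This is an illustration only and not how \htool{HDMan} is
implemented: the function name \texttt{apply\_edits} is hypothetical, and the
sketch assumes that a stress mark is a single digit appended to the phone name.
\begin{verbatim}
import re


def apply_edits(phones):
    """Apply the three dictionary changes to one list of phone names."""
    # 1. Add a short pause to the end of the pronunciation.
    phones = list(phones) + ["sp"]
    # 2. Remove stress marks (assumed: one trailing digit, eh2 -> eh).
    phones = [re.sub(r"\d$", "", p) for p in phones]
    # 3. Merge each adjacent "sil sp" pair into a single sil.
    merged = []
    i = 0
    while i < len(phones):
        if phones[i:i + 2] == ["sil", "sp"]:
            merged.append("sil")
            i += 2
        else:
            merged.append(phones[i])
            i += 1
    return merged


if __name__ == "__main__":
    print(apply_edits(["s", "eh1", "v", "n"]))  # ['s', 'eh', 'v', 'n', 'sp']
    print(apply_edits(["sil"]))                 # ['sil']
\end{verbatim}
An ordinary word therefore ends in \texttt{sp}, whereas a pronunciation that
already ends in \texttt{sil} is left as a bare \texttt{sil}.
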
These changes can be applied using \htool{HDMan} and an edit script (stored in
\texttt{global.ded}) containing the three commands
\begin{verbatim}
    AS sp
    RS cmu
    MP sil sil sp
\end{verbatim}
where \texttt{cmu} refers to a style of stress marking\index{stress marking} in
which the lexical stress level is marked by a single digit appended to the
phone name (e.g.\ \texttt{eh2} means the phone \texttt{eh} with level 2
stress).

\centrefig{step2}{100}{Step 2}

\noindent
The command
\begin{verbatim}
    HDMan -m -w wlist -n monophones1 -l dlog dict beep names
\end{verbatim}
will create a new dictionary called \texttt{dict} by searching the source
dictionaries \texttt{beep} and \texttt{names} to find pronunciations for each
word in \texttt{wlist} (see Fig.~\href{f:step2}). Here, the \texttt{wlist} in
question needs only to be a sorted list of the words appearing in the task
grammar given above.

Note that \texttt{names} is a manually constructed file containing
pronunciations for the proper names used in the task grammar. The option
\texttt{-l} instructs \htool{HDMan} to output a log file \texttt{dlog} which
contains various statistics about the constructed dictionary. In particular,
it indicates if there are words missing. \htool{HDMan} can also output a list
of the phones used, here called \texttt{monophones1}. Once training and test
data has been recorded, an HMM will be estimated for each of these phones.

The general format of each dictionary entry\index{dictionary!entry} is
\begin{verbatim}
    WORD [outsym] p1 p2 p3 ....
\end{verbatim}
which means that the word \texttt{WORD} is pronounced as the sequence of phones
\texttt{p1 p2 p3 ...}. The string in square brackets specifies the string to
output when that word is recognised. If it is omitted then the word itself is
output.
If it is included but empty, then nothing is output.

To see what the dictionary is like, here are a few entries.
\begin{verbatim}
    A               ah sp
    A               ax sp
    A               ey sp
    CALL            k ao l sp
    DIAL            d ay ax l sp
    EIGHT           ey t sp
    PHONE           f ow n sp
    SENT-END        []  sil
    SENT-START      []  sil
    SEVEN           s eh v n sp
    TO              t ax sp
    TO              t uw sp
    ZERO            z ia r ow sp
\end{verbatim}
Notice that function words such as \texttt{A} and \texttt{TO}
have multiple pronunciations.
The entries for \texttt{SENT-START} and \texttt{SENT-END} have a silence
model \texttt{sil} as their pronunciations and null output symbols.

\subsection{Step 3 - Recording the Data}

The\index{recording speech} training and test data will be recorded using the
\HTK\ tool \htool{HSLab}\index{hslab@\htool{HSLab}}. This is a combined
waveform recording and labelling tool. In this example \htool{HSLab} will be
used just for recording, as labels already exist. However, if you do not have
pre-existing training sentences (such as those from the TIMIT database) you can
create them either from pre-existing text (as described above) or by labelling
your training utterances using \htool{HSLab}. \htool{HSLab} is invoked by typing