%% !HVER!hlmtutorial [SJY 05/04/97]
%% Updated (and about 80% rewritten) - Gareth Moore 16/01/02
%
\mychap{A Tutorial Example of Building Language Models}{hlmtutor}

This chapter describes the construction and evaluation of language models using the \HTK\ language modelling tools. The models will be built from scratch with the exception of the text conditioning stage necessary to transform the raw text into its most common and useful representation (e.g. number conversions, abbreviation expansion and punctuation filtering). All resources used in this tutorial can be found in the \texttt{LMTutorial} directory of the \HTK\ distribution.

The text data used to build and test the language models are the copyright-free texts of 50 Sherlock Holmes stories by Arthur Conan Doyle. The texts have been partitioned into training and test material (49 stories for training and 1 story for testing) and reside in the \texttt{train} and \texttt{test} subdirectories respectively.

\mysect{Database preparation}{HLMdatabaseprep}

The first stage of any language model development project is data preparation. As mentioned in the introduction, the text data used in this example has already been conditioned. If you examine each file you will observe that it contains a sequence of tagged sentences. When training a language model you need to include sentence start and end labelling because the tools cannot otherwise infer this. Although there is only one sentence per line in these files, this is not a restriction of the \HTK\ tools and is purely for clarity -- you can have the entire input text on a single line if you want. Notice that the default sentence start and sentence end tokens of {\tt <s>} and {\tt </s>} are used -- if you were to use different tokens for these you would need to pass suitable configuration parameters to the \HTK\ tools.\footnote{{\tt STARTWORD} and {\tt ENDWORD} to be precise.}
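For example, if your conditioned text marked sentence boundaries with {\tt <SENT\_START>} and {\tt <SENT\_END>} instead, a configuration file along the following lines could be passed to each tool with the standard {\tt -C} option. The token names here are purely illustrative -- substitute whatever markers your own text actually uses:
\begin{verbatim}
# config: override the default sentence boundary tokens
STARTWORD = <SENT_START>
ENDWORD   = <SENT_END>
\end{verbatim}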
An extremely simple text conditioning tool is supplied in the form of \htool{LCond.pl} in the {\tt LMTutorial/extras} folder -- this only segments text into sentences on the basis of punctuation, as well as converting to uppercase and stripping most punctuation symbols, and is not intended for serious use. In particular it does not convert numbers into words and will not expand abbreviations. Exactly what conditioning you perform on your source text depends on the task you are building a model for.

Once your text has been conditioned, the next step is to use the tool \htool{LGPrep} to scan the input text and produce a preliminary set of sorted $n$-gram files. In this tutorial all $n$-gram files created by \htool{LGPrep} will be stored in the \texttt{holmes.0} directory, so create this directory now. In a Unix-type system, for example, the standard command is
\begin{verbatim}
$ mkdir holmes.0
\end{verbatim} % $

The \HTK\ tools maintain a cumulative word map to which every new word is added and assigned a unique id. This means that you can add future $n$-gram files without having to rebuild existing ones so long as you start from the same word map, thus ensuring that each id remains unique. The side effect of this ability is that \htool{LGPrep} always expects to be given a word map, so to prepare the first $n$-gram file (also referred to elsewhere as a `gram' file) you must pass an empty word map file.

You can prepare an initial, empty word map using the \htool{LNewMap} tool. It needs to be passed the name to be used internally in the word map as well as a file name to write it to; optionally you may also change the default character escaping mode and request additional fields. Type the following:
\begin{verbatim}
$ LNewMap -f WFC Holmes empty.wmap
\end{verbatim} % $
and you'll see that an initial, empty word map file has been created for you in the file \texttt{empty.wmap}. Examine the file and you will see that it contains just a header and no words. It looks like this:
\begin{verbatim}
Name = Holmes
SeqNo = 0
Entries = 0
EscMode = RAW
Fields = ID,WFC
\Words\
\end{verbatim}

Pay particular attention to the {\tt SeqNo} field since this represents the sequence number of the word map. Each time you add words to the word map the sequence number will increase -- the tools will compare the sequence number in the word map with that in any data files they are passed, and if the word map is too old to contain all the necessary words then it will be rejected. The {\tt Name} field must also match, although initially you can set this to whatever you like.\footnote{The exception to this is that differing text may follow a {\tt \%} character.} The other fields specify that no \HTK\ character escaping will be used, and that we wish to store the (compulsory) word ID field as well as an optional count field, which will reveal how many times each word has been encountered to date. The {\tt ID} field is always present, which is why you did not need to pass it with the {\tt -f} option to \htool{LNewMap}.

To clarify, if we were to use the Sherlock Holmes texts together with other previously generated $n$-gram databases then the most recent available word map would have to be used instead of the prototype map file above. This would ensure that the map saved by \htool{LGPrep} once the new texts have been processed would be suitable for decoding all available $n$-gram files.

We'll now process the text data with the following command:
\begin{verbatim}
$ LGPrep -T 1 -a 100000 -b 200000 -d holmes.0 -n 4 -s "Sherlock Holmes" empty.wmap train/*.txt
\end{verbatim} % $

The \texttt{-a} option sets the maximum number of new words that can be encountered in the texts to 100,000 (in fact, this is the default). If, during processing, this limit is exceeded then \htool{LGPrep} will terminate with an error and the operation will have to be repeated with this limit set to a larger value.

The \texttt{-b} option sets the internal $n$-gram buffer size to 200,000 $n$-gram entries. This setting has a direct effect on the overall process size. The memory requirement for the internal buffer can be calculated according to $mem_{bytes} = (n+1)*4*b$ where $n$ is the $n$-gram size (set with the \texttt{-n} option) and $b$ is the buffer size. In the above example, the $n$-gram size is set to four, which will enable us to generate bigram, trigram and four-gram language models. The smaller the buffer, the more separate files will in general be written out -- each time the buffer fills, a new $n$-gram file is generated in the output directory, specified by the {\tt -d} option.
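To make the formula concrete, the settings used in the command above ($n = 4$ and $b = 200000$) give $mem_{bytes} = (4+1)*4*200000 = 4000000$ bytes, so the $n$-gram buffer alone accounts for roughly 4 MB of the overall process size.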
The {\tt -T 1} option switches on tracing at the lowest level. In general you should aim to run each tool with at least {\tt -T 1} since this will give you better feedback about the progress of the tool. Other useful options to pass are {\tt -D} to check the state of configuration variables -- very useful to check you have things set up correctly -- and {\tt -A} so that if you save the tool output you will be able to see what options it was run with. In fact it is good practice to always pass {\tt -T 1 -A -D} to every \HTK\ tool. You should also note that all \HTK\ tools require the option switches to be passed {\it before} the compulsory tool parameters -- trying to run {\tt LGPrep train/*.txt -T 1} will result in an error, for example.

Once the operation has completed, the \texttt{holmes.0} directory should contain the following files:
\begin{verbatim}
gram.0  gram.1  gram.2  wmap
\end{verbatim}
The saved word map file \texttt{wmap} has grown to include all newly encountered words and the identifiers that the tool has assigned them, and at the same time the map sequence count has been incremented by one.
\begin{verbatim}
Name = Holmes
SeqNo = 1
Entries = 18080
EscMode = RAW
Fields = ID,WFC
\Words\
<s>   65536  33669
IT    65537   8106
WAS   65538   7595
...
\end{verbatim}
Remember that the map sequence count, together with the map's name field, is used to verify the compatibility between the map and any $n$-gram files. The contents of the $n$-gram files can be inspected using the \htool{LGList} tool (if you are not using a Unix-type system you may need to omit the {\tt | more} and find some other way of viewing the output in a more manageable format; try redirecting with {\tt > file.txt} and viewing the resulting file):
\begin{verbatim}
$ LGList holmes.0/wmap holmes.0/gram.2 | more
4-Gram File holmes.0/gram.2[165674 entries]:
  Text Source: Sherlock Holmes
' IT IS NO : 1
'CAUSE I SAVED HER : 1
'EM </s> <s> WHO : 1
</s> <s> ' IT : 1
</s> <s> A BAND : 1
</s> <s> A BEAUTIFUL : 1
</s> <s> A BIG : 1
</s> <s> A BIT : 1
</s> <s> A BROKEN : 1
</s> <s> A BROWN : 2
</s> <s> A BUZZ : 1
</s> <s> A CAMP : 1
...
\end{verbatim} % $

If you examine the other $n$-gram files you will notice that whilst the contents of each $n$-gram file are sorted, the files themselves are not sequenced -- that is, one file does not carry on where the previous one left off; each is an independent set of $n$-grams. To derive a sequenced set of $n$-gram files, where no grams are repeated between files, the tool \htool{LGCopy} must be used on these existing gram files. For the purposes of this tutorial the new set of files will be stored in the \texttt{holmes.1} directory, so create this and then run {\tt LGCopy}:
\begin{verbatim}
$ mkdir holmes.1
$ LGCopy -T 1 -b 200000 -d holmes.1 holmes.0/wmap holmes.0/gram.*
Input file holmes.0/gram.0 added, weight=1.0000
Input file holmes.0/gram.1 added, weight=1.0000
Input file holmes.0/gram.2 added, weight=1.0000
Copying 3 input files to output files with 200000 entries
 saving 200000 ngrams to file holmes.1/data.0
 saving 200000 ngrams to file holmes.1/data.1
 saving 89516 ngrams to file holmes.1/data.2
489516 out of 489516 ngrams stored in 3 files
\end{verbatim}

The resulting $n$-gram files, together with the word map, can now be used to generate language models for a specific vocabulary list. Note that it is not necessary to sequence the files in this way before building a language model, but if you have too many separate unsequenced $n$-gram files then you may encounter performance problems or reach your filing system's limit on the number of open files -- in practice, therefore, it is a good idea to always sequence them.
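As an optional sanity check, you can confirm that the sequenced files are still decodable with the existing word map by listing one of them with \htool{LGList}, exactly as before. The file names below assume the output of the \htool{LGCopy} run shown above:
\begin{verbatim}
$ LGList holmes.0/wmap holmes.1/data.0 | more
\end{verbatim} % $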
\mysect{Mapping OOV words}{HLMmapoov}

An important step in building a language model is to decide on the system's vocabulary. For the purpose of this tutorial, we have supplied a word list in the file \texttt{5k.wlist} which contains the 5000 most common words found in the text. We'll build our language models and all intermediate files in the \texttt{lm\_5k} directory, so create it with a suitable command:
\begin{verbatim}
$ mkdir lm_5k
\end{verbatim} % $

Once the system's vocabulary has been specified, the tool \htool{LGCopy} should be used to filter out all out-of-vocabulary (OOV) words. To achieve this, the 5K word list is used as a special case of a class map which maps all OOVs into members of the ``unknown'' word class. The unknown class symbol defaults to \texttt{!!UNK}, although this can be changed via the configuration parameter \texttt{UNKNOWNNAME}. Run \htool{LGCopy} again:
\begin{verbatim}
$ LGCopy -T 1 -o -m lm_5k/5k.wmap -b 200000 -d lm_5k -w 5k.wlist holmes.0/wmap holmes.1/data.*
Input file holmes.1/data.0 added, weight=1.0000
Input file holmes.1/data.1 added, weight=1.0000
Input file holmes.1/data.2 added, weight=1.0000
Copying 3 input files to output files with 200000 entries