<P>The SAPI 5.1 SDK comes with a C++ example called TTSApp, which displays an
animated cartoon microphone whose mouth is drawn to represent each viseme. The
microphone is made up from a number of separate images that can all be loaded
into an image list. The additional <A
href="http://www.blong.com/Conferences/DCon2002/Speech/SAPI51/SAPI51.zip">demo
program</A> TextToSpeechAnimated.dpr makes use of these images to show how the
effect can be achieved.</P>
<TABLE bgColor=white border=1>
<TBODY>
<TR>
<TD><PRE><CODE><FONT color=black size=2>
<B>const</B>
Visemes: <B>array</B>[0..21] <B>of</B> Byte = (
0, <FONT color=#003399><I>// SP_VISEME_0 = 0, // Silence</I></FONT>
11, <FONT color=#003399><I>// SP_VISEME_1, // AE, AX, AH</I></FONT>
11, <FONT color=#003399><I>// SP_VISEME_2, // AA</I></FONT>
11, <FONT color=#003399><I>// SP_VISEME_3, // AO</I></FONT>
10, <FONT color=#003399><I>// SP_VISEME_4, // EY, EH, UH</I></FONT>
11, <FONT color=#003399><I>// SP_VISEME_5, // ER</I></FONT>
9, <FONT color=#003399><I>// SP_VISEME_6, // y, IY, IH, IX</I></FONT>
2, <FONT color=#003399><I>// SP_VISEME_7, // w, UW</I></FONT>
13, <FONT color=#003399><I>// SP_VISEME_8, // OW</I></FONT>
9, <FONT color=#003399><I>// SP_VISEME_9, // AW</I></FONT>
12, <FONT color=#003399><I>// SP_VISEME_10, // OY</I></FONT>
11, <FONT color=#003399><I>// SP_VISEME_11, // AY</I></FONT>
9, <FONT color=#003399><I>// SP_VISEME_12, // h</I></FONT>
3, <FONT color=#003399><I>// SP_VISEME_13, // r</I></FONT>
6, <FONT color=#003399><I>// SP_VISEME_14, // l</I></FONT>
7, <FONT color=#003399><I>// SP_VISEME_15, // s, z</I></FONT>
8, <FONT color=#003399><I>// SP_VISEME_16, // SH, CH, JH, ZH</I></FONT>
5, <FONT color=#003399><I>// SP_VISEME_17, // TH, DH</I></FONT>
4, <FONT color=#003399><I>// SP_VISEME_18, // f, v</I></FONT>
7, <FONT color=#003399><I>// SP_VISEME_19, // d, t, n</I></FONT>
9, <FONT color=#003399><I>// SP_VISEME_20, // k, g, NG</I></FONT>
1 <FONT color=#003399><I>// SP_VISEME_21, // p, b, m</I></FONT>
);
<B>procedure</B> TfrmTextToSpeech.SpVoiceViseme(Sender: TObject;
StreamNumber: Integer; StreamPosition: OleVariant; Duration: Integer;
NextVisemeId, Feature, CurrentVisemeId: TOleEnum);
<B>const</B>
EyesNarrow = 14;
EyesClosed = 15;
<B>begin</B>
imgsMic.Draw(pbMic.Canvas, 0, 0, Visemes[CurrentVisemeId]);
<B>if</B> Visemes[CurrentVisemeId] <B>mod</B> 6 = 2 <B>then</B>
imgsMic.Draw(pbMic.Canvas, 0, 0, EyesNarrow)
<B>else</B>
<B>if</B> Visemes[CurrentVisemeId] <B>mod</B> 6 = 5 <B>then</B>
imgsMic.Draw(pbMic.Canvas, 0, 0, EyesClosed);
<B>end</B>;
<B>procedure</B> TfrmTextToSpeech.pbMicPaint(Sender: TObject);
<B>begin</B>
imgsMic.Draw(pbMic.Canvas, 0, 0, 0);
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>The <FONT face="Courier New, Courier, mono">OnViseme</FONT> event gets the
image list to draw on a paint box component and the image to draw is identified
from a simple lookup table. There are 22 different visemes, but only 13 images
(as in the Disney approach). Occasionally the code also draws narrowed or closed
eyes, but whenever the silence viseme is received (at the start and end of each
sentence) the default microphone (the first image in the image list) is
drawn.</P>
<P align=center><IMG
src="Speech Synthesis & Speech Recognition Using SAPI 5_1.files/TextToSpeechAnimated.png"></P>
<P>You can take this idea further if you need, by using images of a person's
face saying each of the 22 visemes (for real people it seems to work best if you
use 22 images, rather than 13). This way you can animate a real person's face in
sync with the spoken text quite trivially.</P>
<P align=center><IMG
src="Speech Synthesis & Speech Recognition Using SAPI 5_1.files/TextToSpeechAnimatedReal.png"></P>
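<P>With one image per viseme the lookup table is no longer needed, as the viseme
number can index the image list directly. The following is only a minimal
sketch under stated assumptions: the image list <FONT
face="Courier New, Courier, mono">imgsFace</FONT> and paint box <FONT
face="Courier New, Courier, mono">pbFace</FONT> are hypothetical names, and the
22 images are assumed to be stored in viseme order.</P>
<TABLE bgColor=white border=1>
<TBODY>
<TR>
<TD><PRE><CODE><FONT color=black size=2>
<B>procedure</B> TfrmTextToSpeech.SpVoiceViseme(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant; Duration: Integer;
  NextVisemeId, Feature, CurrentVisemeId: TOleEnum);
<B>begin</B>
  <FONT color=#003399><I>//assumption: image N in imgsFace shows the face saying viseme N</I></FONT>
  <B>if</B> CurrentVisemeId &lt;= 21 <B>then</B>
    imgsFace.Draw(pbFace.Canvas, 0, 0, Integer(CurrentVisemeId));
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>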
<H3><A name=KeepingTrack>Keeping Track Of Spoken Text</A></H3>
<P>We can use <FONT face="Courier New, Courier, mono">OnWord</FONT> and <FONT
face="Courier New, Courier, mono">OnSentence</FONT> to highlight the currently
spoken word or sentence, as the events provide the character offset and length
of the pertinent characters in the text. So when a sentence is started, the
<FONT face="Courier New, Courier, mono">OnSentence</FONT> event tells you which
character in the text is the start of the sentence, and also how long the
sentence is.</P>
<TABLE bgColor=white border=1>
<TBODY>
<TR>
<TD><PRE><CODE><FONT color=black size=2>
<B>procedure</B> TfrmTextToSpeech.SetTextHilite(FirstChar, Len: Integer);
<B>begin</B>
reText.SelStart := FirstChar; <FONT color=#003399><I>//highlight word</I></FONT>
reText.SelLength := Len;
<B>end</B>;
<B>procedure</B> TfrmTextToSpeech.SetTextStyle(FirstChar, Len: Integer; Styles: TFontStyles);
<B>begin</B>
<B>with</B> reText <B>do</B>
<B>begin</B>
Lines.BeginUpdate;
<B>try</B>
SelStart := FirstChar; <FONT color=#003399><I>//highlight word</I></FONT>
SelLength := Len;
SelAttributes.Style := Styles; <FONT color=#003399><I>//apply requested style</I></FONT>
SelLength := 0; <FONT color=#003399><I>//unhighlight word</I></FONT>
<B>finally</B>
Lines.EndUpdate
<B>end</B>
<B>end</B>
<B>end</B>;
<B>procedure</B> TfrmTextToSpeech.SpVoiceSentence(Sender: TObject;
StreamNumber: Integer; StreamPosition: OleVariant; CharacterPosition,
Length: Integer);
<B>begin</B>
Log(<I>'OnSentence: stream %d, position: %s, char. pos. %d, length %d'</I>,
[StreamNumber, <B>String</B>(StreamPosition), CharacterPosition, Length]);
SetTextStyle(OldSentencePos, OldSentenceLen, []);
<B>if</B> Length > 0 <B>then</B>
<B>begin</B>
SetTextStyle(CharacterPosition, Length, [fsItalic]);
OldSentencePos := CharacterPosition;
OldSentenceLen := Length;
<B>end</B>;
<B>if</B> <B>not</B> StreamJustStarted <B>then</B>
memEnginePhonemes.Text := memEnginePhonemes.Text + #13#10;
StreamJustStarted := False;
<B>end</B>;
<B>procedure</B> TfrmTextToSpeech.SpVoiceWord(Sender: TObject;
StreamNumber: Integer; StreamPosition: OleVariant; CharacterPosition,
Length: Integer);
<B>begin</B>
Log(<I>'OnWord: stream %d, position: %s, char. pos. %d, length %d'</I>,
[StreamNumber, <B>String</B>(StreamPosition), CharacterPosition, Length]);
SetTextHilite(CharacterPosition, Length);
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>Each sentence that gets spoken is italicised through the <FONT
face="Courier New, Courier, mono">SetTextStyle</FONT> helper routine. The event
handler records the position details in <FONT
face="Courier New, Courier, mono">OldSentencePos</FONT> and <FONT
face="Courier New, Courier, mono">OldSentenceLen</FONT> so the sentence can be
set back to non-italic when the next sentence starts. Similarly, each spoken
word is highlighted using the <FONT
face="Courier New, Courier, mono">SetTextHilite</FONT> helper routine.</P>
<P><U><B>Note:</B></U> the <FONT
face="Courier New, Courier, mono">OnSentence</FONT> event handler checks that
the length is greater than zero because the last <FONT
face="Courier New, Courier, mono">OnSentence</FONT> event for some text has the
character position set to the last character and the length set to the negative
equivalent. This gives an opportunity to reset all the text formatting back to
the default styles. However, this final event only occurs if the text ends with
a full stop; if it does not, you can use the <FONT
face="Courier New, Courier, mono">OnEndStream</FONT> event for tidying up.</P>
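<P>A tidy-up handler along those lines might look like the following minimal
sketch. It assumes a handler named <FONT
face="Courier New, Courier, mono">SpVoiceEndStream</FONT> attached to the voice's
<FONT face="Courier New, Courier, mono">OnEndStream</FONT> event, with a
parameter list mirroring the other stream events (worth checking against the
imported type library). Resetting the whole text, rather than just the last
sentence, is a choice made here for illustration and is not taken from the demo
program.</P>
<TABLE bgColor=white border=1>
<TBODY>
<TR>
<TD><PRE><CODE><FONT color=black size=2>
<B>procedure</B> TfrmTextToSpeech.SpVoiceEndStream(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant);
<B>begin</B>
  <FONT color=#003399><I>//assumption: clear the styles across all the text</I></FONT>
  SetTextStyle(0, reText.GetTextLen, []);
  <FONT color=#003399><I>//remove any remaining word highlight</I></FONT>
  SetTextHilite(0, 0);
  <FONT color=#003399><I>//forget the last italicised sentence</I></FONT>
  OldSentencePos := 0;
  OldSentenceLen := 0;
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>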
<H3><A name=SpeakingDialogs>Speaking Dialogs</A></H3>
<P>As an example of using speech synthesis you can make all your VCL dialogs
talk to you using this small piece of code.</P>
<TABLE bgColor=white border=1>
<TBODY>
<TR>
<TD><PRE><CODE><FONT color=black size=2>
<B>uses</B>
ComObj;
<B>var</B>
Voice: Variant;
<B>procedure</B> TForm1.FormCreate(Sender: TObject);
<B>begin</B>
Screen.OnActiveFormChange := ScreenFormChange;
<B>end</B>;
<B>procedure</B> TForm1.ReadVCLDialog(Form: TCustomForm);
<B>var</B>
I: Integer;
ButtonCaptions, LabelCaption, DialogText: <B>string</B>;
<B>const</B>
SVSFlagsAsync = 1;
<B>begin</B>
<B>try</B>
<B>if</B> VarType(Voice) <> varDispatch <B>then</B>
Voice := CreateOleObject(<I>'SAPI.SpVoice'</I>);
<B>for</B> I := 0 <B>to</B> Form.ComponentCount - 1 <B>do</B>
<B>if</B> Form.Components[I] <B>is</B> TLabel <B>then</B>
LabelCaption := TLabel(Form.Components[I]).Caption
<B>else</B>
<B>if</B> Form.Components[I] <B>is</B> TButton <B>then</B>
ButtonCaptions := Format(<I>'%s%s, '</I>,
[ButtonCaptions, TButton(Form.Components[I]).Caption]);
ButtonCaptions := StringReplace(ButtonCaptions,<I>'&'</I>,<I>''</I>, [rfReplaceAll]);
DialogText := Format(<I>'%s.%s%s.%s%s'</I>,
[Form.Caption, sLineBreak, LabelCaption, sLineBreak, ButtonCaptions]);
Memo1.Text := DialogText;
Voice.Speak(DialogText, SVSFlagsAsync)
<B>except</B>
<FONT color=#003399><I>//pretend everything is okay</I></FONT>
<B>end</B>
<B>end</B>;
<B>procedure</B> TForm1.ScreenFormChange(Sender: TObject);
<B>begin</B>
<B>if</B> Assigned(Screen.ActiveForm) <B>and</B>
(Screen.ActiveForm.ClassName = <I>'TMessageForm'</I>) <B>then</B>
ReadVCLDialog(Screen.ActiveForm)
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>The form's <FONT face="Courier New, Courier, mono">OnCreate</FONT> event
handler sets up an <FONT
face="Courier New, Courier, mono">OnActiveFormChange</FONT> event handler for
the screen object. This is triggered each time a new form is displayed, which
includes VCL dialogs. Any call to <FONT
face="Courier New, Courier, mono">ShowMessage</FONT>, <FONT
face="Courier New, Courier, mono">MessageDlg</FONT> or related routines causes a
<FONT face="Courier New, Courier, mono">TMessageForm</FONT> to be displayed so
the code checks for this. If the form type is found, a textual version of what's
on the dialog is built up and then spoken through the SAPI Automation
component.</P>
<P>A statement such as:</P>
<TABLE bgColor=white border=1>
<TBODY>
<TR>
<TD><PRE><CODE><FONT color=black size=2>
MessageDlg(<I>'Save changes?'</I>, mtConfirmation, mbYesNoCancel, 0)
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>causes the <FONT face="Courier New, Courier, mono">ReadVCLDialog</FONT>
routine to build up and say this text:</P>
<TABLE bgColor=white border=1>
<TBODY>
<TR>
<TD><PRE><CODE><FONT color=black size=2>
Confirm.
Save changes?.
Yes, No, Cancel,
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>Notice the full stops added after the dialog caption and the message text;
they briefly pause the speech engine at those points before it moves on.</P>
<H2><A name=SR>Speech Recognition</A></H2>
<P>Continuous dictation is easy to set up as no specific grammar is required,
but Command and Control recognition will need a grammar to educate the
recogniser as to the permissible commands.</P>
<P>When you need SR you can either use a shared recogniser (<FONT
face="Courier New, Courier, mono">TSpSharedRecognizer</FONT>) or an in-process
recogniser (<FONT face="Courier New, Courier, mono">TSpInprocRecognizer</FONT>).
The in-process recogniser is more efficient (it resides in your process address
space) but means that no other SR applications can receive input from the
microphone until it is closed down. On the other hand the shared recogniser can
be used by multiple applications, and each one can access the microphone. It is
more common to use the shared recogniser in typical SAPI applications.</P>
<P>The recogniser uses the notion of a <I>recognition context</I> to identify
when it will be active (not to be confused with the use of context in a
context-free grammar or CFG). A context is represented by the <FONT
face="Courier New, Courier, mono">TSpInprocRecoContext</FONT> or <FONT
face="Courier New, Courier, mono">TSpSharedRecoContext</FONT> interfaces. An
application may use one context for each form that will use SR, or several
contexts for different application modes (Office XP has a dictation mode for
adding text to a document and a control mode for executing menu commands).</P>
<P>Recognition contexts enable you to start and stop recognition, set up the
grammar and receive important recognition notifications.</P>
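<P>To make the choice between the two recognisers concrete, here is a minimal
late-bound sketch in the style of the Speaking Dialogs code. The helper <FONT
face="Courier New, Courier, mono">CreateRecoContext</FONT> is a hypothetical
name and is not part of the demo programs; it simply creates one recogniser or
the other through Automation and asks it for a recognition context.</P>
<TABLE bgColor=white border=1>
<TBODY>
<TR>
<TD><PRE><CODE><FONT color=black size=2>
<B>uses</B>
  ComObj;

<FONT color=#003399><I>//hypothetical helper, for illustration only</I></FONT>
<B>function</B> CreateRecoContext(UseShared: Boolean): Variant;
<B>var</B>
  Recognizer: Variant;
<B>begin</B>
  <B>if</B> UseShared <B>then</B>
    <FONT color=#003399><I>//shared recogniser: other SR applications can still use the microphone</I></FONT>
    Recognizer := CreateOleObject(<I>'SAPI.SpSharedRecognizer'</I>)
  <B>else</B>
    <FONT color=#003399><I>//in-process recogniser: its audio input must also be set before it hears anything</I></FONT>
    Recognizer := CreateOleObject(<I>'SAPI.SpInprocRecognizer'</I>);
  <FONT color=#003399><I>//each recogniser can hand out one or more recognition contexts</I></FONT>
  Result := Recognizer.CreateRecoContext;
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>A late-bound Variant like this is convenient for a quick experiment, but it
does not give you an easy way to receive the recognition notifications, which is
one reason to prefer the imported context components in a real application.</P>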
<H3><A name=Grammars>Grammars</A></H3>