characters which are not both upper case letters, both lower case
letters, or both digits is implementation dependent and will get a
warning message. (E.g., [0-z] in ASCII is many more characters
than it is in EBCDIC). If it is desired to include the character
- in a character class, it should be first or last; thus
[-+0-9]
matches all the digits and the two signs.
In character classes, the ^ operator must appear as the first
character after the left bracket; it indicates that the resulting
string is to be complemented with respect to the computer
character set. Thus
[^abc]
matches all characters except a, b, or c, including all special or
control characters; or
[^a-zA-Z]
is any character which is not a letter. The \ character provides
the usual escapes within character class brackets.
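The class rules above can be sketched in C. This is only an illustration, not how Lex itself is implemented: the function name class_matches is hypothetical, the argument cls is assumed to hold the text between the brackets, and \ escapes are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does c belong to the class whose body (the text
 * between [ and ]) is cls?  Handles a leading ^ for complementation,
 * ranges such as a-z, and a literal - when it is the first or last
 * character.  Backslash escapes are not handled in this sketch. */
bool class_matches(const char *cls, char c)
{
    unsigned char uc = (unsigned char)c;
    bool negate = false;
    bool found = false;
    size_t n, i;

    if (cls[0] == '^') {            /* ^ is special only in first position */
        negate = true;
        cls++;
    }
    n = strlen(cls);
    for (i = 0; i < n; i++) {
        unsigned char lo = (unsigned char)cls[i];
        if (i + 2 < n && cls[i + 1] == '-') {
            /* a range lo-hi; compared by character code, so the
             * result depends on the character set in use */
            unsigned char hi = (unsigned char)cls[i + 2];
            if (lo <= uc && uc <= hi)
                found = true;
            i += 2;
        } else if (lo == uc) {      /* single character, including a */
            found = true;           /* leading or trailing -          */
        }
    }
    return negate ? !found : found;
}
```

With this sketch, class_matches("-+0-9", '-') is true, and class_matches("^abc", '\n') is true because a complemented class includes control characters.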
Arbitrary character. To match almost any character, the
operator character
.
is the class of all characters except newline. Escaping into
octal is possible although non-portable:
[\40-\176]
matches all printable characters in the ASCII character set, from
octal 40 (blank) to octal 176 (tilde).
Optional expressions. The operator ? indicates an optional
element of an expression. Thus
ab?c
matches either ac or abc.
Repeated expressions. Repetitions of classes are indicated
by the operators * and +.
a*
is any number of consecutive a characters, including zero; while
a+
is one or more instances of a. For example,
[a-z]+
is all strings of lower case letters. And
[A-Za-z][A-Za-z0-9]*
indicates all alphanumeric strings with a leading alphabetic
character. This is a typical expression for recognizing
identifiers in computer languages.
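For comparison, the identifier expression can be written by hand in C. The sketch below assumes the C locale, so that isalpha and isalnum agree with [A-Za-z] and [A-Za-z0-9]; the name match_identifier is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s that matches
 * [A-Za-z][A-Za-z0-9]*, or 0 if s does not begin with a letter.
 * Assumes the C locale. */
size_t match_identifier(const char *s)
{
    size_t n;

    if (!isalpha((unsigned char)s[0]))
        return 0;                   /* leading character must be a letter */
    n = 1;
    while (isalnum((unsigned char)s[n]))
        n++;                        /* then any run of letters or digits */
    return n;
}
```

On the input x25+y the function returns 3; on 9lives it returns 0.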
Alternation and Grouping. The operator | indicates
alternation:
(ab|cd)
matches either ab or cd. Note that parentheses are used for
grouping, although they are not necessary on the outside level;
ab|cd
would have sufficed. Parentheses can be used for more complex
expressions:
(ab|cd+)?(ef)*
matches such strings as abefef, efefef, cdef, or cddd; but not
abc, abcd, or abcdef.
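The strings listed above can be checked against a hand-written matcher for this one expression. This is a sketch with a hypothetical function name; it tests whether the whole string matches, which is how the examples above are read.

```c
#include <stdbool.h>
#include <string.h>

/* True when the whole of s matches (ab|cd+)?(ef)* */
bool matches_example(const char *s)
{
    /* optional group (ab|cd+)? */
    if (strncmp(s, "ab", 2) == 0) {
        s += 2;
    } else if (strncmp(s, "cd", 2) == 0) {
        s += 2;
        while (*s == 'd')           /* the + applies only to the d */
            s++;
    }
    /* (ef)* : zero or more copies of ef */
    while (strncmp(s, "ef", 2) == 0)
        s += 2;
    return *s == '\0';              /* nothing may be left over */
}
```

Taking the optional group greedily is safe here because what follows it can only begin with e.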
Context sensitivity. Lex will recognize a small amount of
surrounding context. The two simplest operators for this are ^
and $. If the first character of an expression is ^, the
expression will only be matched at the beginning of a line (after
a newline character, or at the beginning of the input stream).
This can never conflict with the other meaning of ^,
complementation of character classes, since that only applies within the
[] operators. If the very last character is $, the expression
will only be matched at the end of a line (when immediately
followed by newline). The latter operator is a special case of
the / operator character, which indicates trailing context. The
expression
ab/cd
matches the string ab, but only if followed by cd. Thus
ab$
is the same as
ab/\n
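The essential point of trailing context is that the context is examined but not consumed. A minimal C sketch of the two expressions above, with hypothetical function names, returns the number of characters that would end up in the matched text:

```c
#include <stddef.h>
#include <string.h>

/* ab/cd : match ab only when cd follows.  Returns the number of
 * characters consumed (2), or 0 when there is no match.  The cd is
 * looked at but left in the input. */
size_t match_ab_before_cd(const char *s)
{
    if (strncmp(s, "ab", 2) != 0)
        return 0;
    if (strncmp(s + 2, "cd", 2) != 0)
        return 0;
    return 2;
}

/* ab$ , that is ab/\n : match ab only when a newline follows. */
size_t match_ab_at_eol(const char *s)
{
    if (strncmp(s, "ab", 2) != 0)
        return 0;
    return s[2] == '\n' ? 2 : 0;
}
```

In both cases scanning would resume at the first character of the context, so that cd or the newline is still available to later rules.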
Left context is handled in Lex by start conditions as explained in
section 10. If a rule is only to be executed when the Lex
automaton interpreter is in start condition x, the rule should be
prefixed by
<x>
using the angle bracket operator characters. If we considered
``being at the beginning of a line'' to be start condition ONE,
then the ^ operator would be equivalent to
<ONE>
Start conditions are explained more fully later.
Repetitions and Definitions. The operators {} specify either
repetitions (if they enclose numbers) or definition expansion
(if they enclose a name). For example
{digit}
looks for a predefined string named digit and inserts it at that
point in the expression. The definitions are given in the first
part of the Lex input, before the rules. In contrast,
a{1,5}
looks for 1 to 5 occurrences of a.
Finally, initial % is special, being the separator for Lex
source segments.
4. Lex Actions.
When an expression written as above is matched, Lex executes
the corresponding action. This section describes some features of
Lex which aid in writing actions. Note that there is a default
action, which consists of copying the input to the output. This
is performed on all strings not otherwise matched. Thus the Lex
user who wishes to absorb the entire input, without producing any
output, must provide rules to match everything. When Lex is being
used with Yacc, this is the normal situation. One may consider
that actions are what is done instead of copying the input to the
output; thus, in general, a rule which merely copies can be
omitted. Also, a character combination which is omitted from
the rules and which appears as input is likely to be printed on
the output, thus calling attention to the gap in the rules.
One of the simplest things that can be done is to ignore the
input. Specifying a C null statement, ; as an action causes this
result. A frequent rule is
[ \t\n] ;
which causes the three spacing characters (blank, tab, and
newline) to be ignored.
Another easy way to avoid writing actions is the action
character |, which indicates that the action for this rule is the
action for the next rule. The previous example could also have
been written
" " |
"\t" |
"\n" ;
with the same result, although in different style. The quotes
around \n and \t are not required.
In more complex actions, the user will often want to know the
actual text that matched some expression like [a-z]+. Lex leaves
this text in an external character array named yytext. Thus, to
print the name found, a rule like
[a-z]+ printf("%s", yytext);
will print the string in yytext. The C function printf accepts a
format argument and data to be printed; in this case, the format
is ``print string'' (% indicating data conversion, and s
indicating string type), and the data are the characters in
yytext. So this just places the matched string on the output.
This action is so common that it may be written as ECHO:
[a-z]+ ECHO;
is the same as the above. Since the default action is just to
print the characters found, one might ask why give a rule, like
this one, which merely specifies the default action? Such rules
are often required to avoid matching some other rule which is
not desired. For example, if there is a rule which matches read
it will normally match the instances of read contained in bread or
readjust; to avoid this, a rule of the form [a-z]+ is needed.
This is explained further below.
Sometimes it is more convenient to know the end of what has
been found; hence Lex also provides a count yyleng of the number
of characters matched. To count both the number of words and the
number of characters in words in the input, the user might write
[a-zA-Z]+ {words++; chars += yyleng;}
which accumulates in chars the number of characters in the words
recognized. The last character in the string matched can be
accessed by
yytext[yyleng-1]
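The bookkeeping done by the counting rule can be imitated in plain C. This is a sketch, not generated Lex output: the structure word_counts and the function count_words are hypothetical, and the local variable len plays the part of yyleng.

```c
#include <ctype.h>
#include <stddef.h>

struct word_counts {
    size_t words;                   /* number of [a-zA-Z]+ matches   */
    size_t chars;                   /* total characters in those words */
};

/* Scan s, treating each maximal run of letters as one match of
 * [a-zA-Z]+ .  Assumes the C locale. */
struct word_counts count_words(const char *s)
{
    struct word_counts wc = {0, 0};

    while (*s != '\0') {
        size_t len = 0;             /* like yyleng for this match */
        while (isalpha((unsigned char)s[len]))
            len++;
        if (len > 0) {
            wc.words++;
            wc.chars += len;
            s += len;
        } else {
            s++;                    /* not part of a word: skip it */
        }
    }
    return wc;
}
```

For the input "the quick  fox" the result is 3 words and 11 characters.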
Occasionally, a Lex action may decide that a rule has not
recognized the correct span of characters. Two routines are
provided to aid with this situation. First, yymore() can be
called to indicate that the next input expression recognized is to
be tacked on to the end of this input. Normally, the next input
string would overwrite the current entry in yytext. Second,
yyless(n) may be called to indicate that not all the characters
matched by the currently successful expression are wanted right
now. The argument n indicates the number of characters in yytext
to be retained. Further characters previously matched are
returned to the input. This provides the same sort of lookahead
offered by the / operator, but in a different form.
Example: Consider a language which defines a string as a set
of characters between quotation (") marks, and provides that to
include a " in a string it must be preceded by a \. The regular
expression which matches that is somewhat confusing, so that it
might be preferable to write
\"[^"]* {
        if (yytext[yyleng-1] == '\\')
                yymore();
        else
                ... normal user processing
        }
which will, when faced with a string such as "abc\"def" first
match the five characters "abc\; then the call to yymore() will
cause the next part of the string, "def, to be tacked on the end.
Note that the final quote terminating the string should be picked
up in the code labeled ``normal user processing''.
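The accumulation performed by yymore() in this example can be traced with a small C simulation. It is a sketch only: scan_string is a hypothetical name, and len stands for the growing value of yyleng. Like the rule it imitates, it does not treat a doubled backslash specially.

```c
#include <stddef.h>

/* Simulate repeated matches of \"[^"]* glued together by yymore().
 * Returns the number of characters accumulated, not counting the
 * closing quote, or 0 if s does not begin with a quote. */
size_t scan_string(const char *s)
{
    size_t len = 0;                 /* accumulated length, like yyleng */

    while (s[len] == '"') {
        /* one application of the rule: a quote, then non-quotes */
        len++;
        while (s[len] != '\0' && s[len] != '"')
            len++;
        if (s[len - 1] != '\\')
            break;                  /* quote was not escaped: done */
        /* last character is \ : the next match is tacked on,
         * which is the effect of calling yymore() */
    }
    return len;
}
```

For the string "abc\"def" the first pass collects five characters, the second adds four more, and the function returns 9, leaving the closing quote as the next input character.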
The function yyless() might be used to reprocess text in
various circumstances. Consider the C problem of distinguishing
the ambiguity of ``=-a''. Suppose it is desired to treat this as
``=- a'' but print a message. A rule might be
=-[a-zA-Z] {
        printf("Op (=-) ambiguous\n");
        yyless(yyleng-1);
        ... action for =- ...
        }
which prints a message, returns the letter after the operator to
the input stream, and treats the operator as ``=-''.
Alternatively it might be desired to treat this as ``= -a''. To
do this, just return the minus sign as well as the letter to the
input:
=-[a-zA-Z] {
        printf("Op (=-) ambiguous\n");
        yyless(yyleng-2);
        ... action for = ...
        }
will perform the other interpretation. Note that the expressions
for the two cases might more easily be written
=-/[A-Za-z]
in the first case and
=/-[A-Za-z]
in the second; no backup would be required in the rule action. It
is not necessary to recognize the whole identifier to observe the
ambiguity. The possibility of ``=-3'', however, makes
=-/[^ \t\n]
a still better rule.
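The effect of yyless(n) on where scanning resumes can be stated in a few lines of C. The function below is a hypothetical illustration, not the Lex library routine: match is assumed to point at the matched text, and leng is its length.

```c
#include <stddef.h>

/* After a match of leng characters starting at match, a call in the
 * style of yyless(n) keeps the first n characters and returns the
 * rest to the input.  The result points at the next character that
 * will be scanned. */
const char *resume_after_yyless(const char *match, size_t leng, size_t n)
{
    if (n > leng)
        n = leng;                   /* cannot retain more than was matched */
    return match + n;
}
```

For the matched text =-a, retaining yyleng-1 characters resumes at the letter a, while retaining yyleng-2 resumes at the minus sign.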
In addition to these routines, Lex also permits access to the
I/O routines it uses. They are:
1) input() which returns the next input character;
2) output(c) which writes the character c on the output; and
3) unput(c) pushes the character c back onto the input stream to
be read later by input().
By default these routines are provided as macro definitions, but
the user can override them and supply private versions. These
routines define the relationship between external files and
internal characters, and must all be retained or modified
consistently. They may be redefined, to cause input or output to
be transmitted to or from strange places, including other programs
or internal memory; but the character set used must be consistent
in all routines; a value of zero returned by input must mean end
of file; and the relationship between unput and input must be
retained or the Lex lookahead will not work. Lex does not look
ahead at all if it does not have to, but every rule ending in + *
? or $ or containing / implies lookahead. Lookahead is also
necessary to match an expression that is a prefix of another
expression. See below for a discussion of the character set used
by Lex. The standard Lex library imposes a 100 character limit on
backup.
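As an illustration of routines that keep the required relationship between unput and input, the following C sketch reads from a string in memory. All names here (mem_set_source, mem_input, mem_unput, PUSHBACK_MAX) are hypothetical and are chosen so as not to collide with the Lex macros; a real replacement would have to use the names input and unput.

```c
#include <stddef.h>

#define PUSHBACK_MAX 100            /* mirrors the 100 character limit */

static const char *src = "";        /* in-memory input (an assumption) */
static char pushback[PUSHBACK_MAX]; /* characters returned by unput   */
static size_t npushed = 0;

/* Start reading from the string s. */
void mem_set_source(const char *s)
{
    src = s;
    npushed = 0;
}

/* Next input character; 0 means end of file. */
int mem_input(void)
{
    if (npushed > 0)                /* pushed-back characters come first, */
        return (unsigned char)pushback[--npushed];  /* last in, first out */
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Push c back so that the next mem_input() returns it.
 * Returns 0 on success, -1 if the pushback buffer is full. */
int mem_unput(int c)
{
    if (npushed >= PUSHBACK_MAX)
        return -1;
    pushback[npushed++] = (char)c;
    return 0;
}
```

Because zero is reserved for end of file, input containing null characters cannot be represented by these routines.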
Another Lex library routine that the user will sometimes want
to redefine is yywrap() which is called whenever Lex reaches an
end-of-file. If yywrap returns a 1, Lex continues with the normal
wrapup on end of input. Sometimes, however, it is convenient to
arrange for more input to arrive from a new source. In this case,
the user should provide a yywrap which arranges for new input and
returns 0. This instructs Lex to continue processing. The
default yywrap always returns 1.
This routine is also a convenient place to print tables,
summaries, etc. at the end of a program. Note that it is not
possible to write a normal rule which recognizes end-of-file; the
only access to this condition is through yywrap. In fact, unless
a private version of input() is supplied a file containing nulls
cannot be handled, since a value of 0 returned by input is taken
to be end-of-file.
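A private yywrap that arranges for more input might follow the pattern below. This is a sketch under stated assumptions: the inputs are strings held in memory, and the names pending, current_input, sketch_yywrap and sketch_current_input are all hypothetical. Only the return convention (0 to continue, 1 to wrap up) is taken from the description above.

```c
#include <stddef.h>

/* Hypothetical queue of further inputs to be scanned in order. */
static const char *const pending[] = { "second chunk", "third chunk" };
static size_t next_pending = 0;
static const char *current_input = "first chunk";

/* The input currently being scanned. */
const char *sketch_current_input(void)
{
    return current_input;
}

/* Same contract as yywrap: return 0 after arranging for more input,
 * or 1 when nothing is left and normal wrapup should proceed. */
int sketch_yywrap(void)
{
    if (next_pending < sizeof pending / sizeof pending[0]) {
        current_input = pending[next_pending++];
        return 0;                   /* continue processing */
    }
    /* a convenient place to print tables or summaries */
    return 1;
}
```

Called at each end of input, this returns 0 twice, switching to the second and third chunks, and then returns 1 from then on.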
5. Ambiguous Source Rules.
Lex can handle ambiguous specifications. When more than one
expression can match the current input, Lex chooses as follows:
1) The longest match is preferred.
2) Among rules which matched the same number of characters, the
rule given first is preferred.
Thus, suppose the rules
integer keyword action ...;
[a-z]+ identifier action ...;
to be given in that order. If the input is integers, it is taken
as an identifier, because [a-z]+ matches 8 characters while
integer matches only 7. If the input is integer, both rules match
7 characters, and the keyword rule is selected because it was
given first. Anything shorter (e.g. int) will not match the
expression integer and so the identifier interpretation is used.
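The two preference rules can be demonstrated with a hand-written C sketch covering just these two expressions. It is not how Lex resolves ambiguity internally; the names classify, struct match and the TOK_ constants are hypothetical, and islower is assumed to agree with [a-z] (the C locale).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token { TOK_NONE, TOK_KEYWORD, TOK_IDENTIFIER };

struct match {
    enum token tok;                 /* which rule was selected   */
    size_t len;                     /* how many characters it matched */
};

/* Characters matched at the start of s by the expression integer */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Characters matched at the start of s by [a-z]+ */
static size_t match_lower(const char *s)
{
    size_t n = 0;
    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

/* Apply both rules and choose as described above. */
struct match classify(const char *s)
{
    struct match m = { TOK_NONE, 0 };
    size_t k = match_keyword(s);    /* rule given first  */
    size_t i = match_lower(s);      /* rule given second */

    if (k == 0 && i == 0)
        return m;                   /* neither rule matches */
    if (k >= i) {                   /* on a tie the first rule is preferred */
        m.tok = TOK_KEYWORD;
        m.len = k;
    } else {                        /* otherwise the longest match is preferred */
        m.tok = TOK_IDENTIFIER;
        m.len = i;
    }
    return m;
}
```

Here integers is classified as an identifier of length 8, integer as the keyword, and int as an identifier of length 3.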
The principle of preferring the longest match makes rules
containing expressions like .* dangerous. For example,
'.*'