http://www.cs.ucsb.edu/~cs160/machines/lex-docs.txt
Lex - A Lexical Analyzer Generator
M. E. Lesk and E. Schmidt
ABSTRACT
Lex helps write programs whose control flow is directed by
instances of regular expressions in the input stream. It is well
suited for editor-script type transformations and for segmenting
input in preparation for a parsing routine.
Lex source is a table of regular expressions and
corresponding program fragments. The table is translated to a
program which reads an input stream, copying it to an output
stream and partitioning the input into strings which match the
given expressions. As each such string is recognized the
corresponding program fragment is executed. The recognition of the expressions is
performed by a deterministic finite automaton generated by Lex.
The program fragments written by the user are executed in the
order in which the corresponding regular expressions occur in the
input stream.
The lexical analysis programs written with Lex accept ambi-
guous specifications and choose the longest match possible at
each input point. If necessary, substantial lookahead is
performed on the input, but the input stream will be backed up to
the end of the current partition, so that the user has general
freedom to manipulate it.
Lex can generate analyzers in either C or Ratfor, a language
which can be translated automatically to portable Fortran. It is
available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems.
This manual, however, will only discuss generating analyzers in C
on the UNIX system, which is the only supported form of Lex under
UNIX Version 7. Lex is designed to simplify interfacing with
Yacc, for those with access to this compiler-compiler system.
1. Introduction.
Lex is a program generator designed for lexical processing of
character input streams. It accepts a high-level, problem
oriented specification for character string matching, and produces
a program in a general purpose language which recognizes regular
expressions. The regular expressions are specified by the user in
the source specifications given to Lex. The Lex written code
recognizes these expressions in an input stream and partitions the
input stream into strings matching the expressions. At the
boundaries between strings program sections provided by the user
are executed. The Lex source file associates the regular
expressions and the program fragments. As each expression appears
in the input to the program written by Lex, the corresponding
fragment is executed.
The user supplies the additional code beyond expression
matching needed to complete his tasks, possibly including code
written by other generators. The program that recognizes the
expressions is generated in the general purpose programming
language employed for the user's program fragments. Thus, a high
level expression language is provided to write the string
expressions to be matched while the user's freedom to write
actions is unimpaired. This avoids forcing the user who wishes to
use a string manipulation language for input analysis to write
processing programs in the same and often inappropriate string
handling language.
Lex is not a complete language, but rather a generator
representing a new language feature which can be added to
different programming languages, called ``host languages.'' Just
as general purpose languages can produce code to run on different
computer hardware, Lex can write code in different host languages.
The host language is used for the output code generated by Lex and
also for the program fragments added by the user. Compatible
run-time libraries for the different host languages are also
provided. This makes Lex adaptable to different environments and
different users. Each application may be directed to the
combination of hardware and host language appropriate to the task,
the user's background, and the properties of local
implementations. At present, the only supported host language is
C, although Fortran (in the form of Ratfor [2]) has been available
the past. Lex itself exists on UNIX, GCOS, and OS/370; but the
code generated by Lex may be taken anywhere the appropriate
compilers exist.
Lex turns the user's expressions and actions (called source
in this memo) into the host general-purpose language; the
generated program is named yylex. The yylex program will
recognize expressions in a stream (called input in this memo) and
perform the specified actions for each expression as it is
detected. See Figure 1.
+-------+
Source -> | Lex | -> yylex
+-------+
+-------+
Input -> | yylex | -> Output
+-------+
An overview of Lex
Figure 1
For a trivial example, consider a program to delete from the
input all blanks or tabs at the ends of lines.
%%
[ \t]+$ ;
is all that is required. The program contains a %% delimiter to
mark the beginning of the rules, and one rule. This rule contains
a regular expression which matches one or more instances of the
characters blank or tab (written \t for visibility, in accordance
with the C language convention) just prior to the end of a line.
The brackets indicate the character class made of blank and tab;
the + indicates ``one or more ...''; and the $ indicates ``end
of line,'' as in QED. No action is specified, so the program
generated by Lex (yylex) will ignore these characters. Everything
else will be copied. To change any remaining string of blanks or
tabs to a single blank, add another rule:
%%
[ \t]+$ ;
[ \t]+ printf(" ");
The finite automaton generated for this source will scan for both
rules at once, observing at the termination of the string of
blanks or tabs whether or not there is a newline character, and
executing the desired rule action. The first rule matches all
strings of blanks or tabs at the end of lines, and the second rule
all remaining strings of blanks or tabs.
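The combined effect of the two rules can be pictured in plain C.
The sketch below is not the code Lex emits; it is a hand-written
stand-in that illustrates the longest-match semantics: a maximal
run of blanks and tabs is deleted if a newline (or end of input)
follows, and replaced by one blank otherwise.

```c
#include <assert.h>
#include <string.h>

/* Hand-written sketch of the behaviour of the two rules
       [ \t]+$  ;
       [ \t]+   printf(" ");
   A maximal run of blanks/tabs before a newline is deleted;
   any other run is replaced by a single blank. */
void squeeze(const char *in, char *out)
{
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0'; ) {
        if (in[i] == ' ' || in[i] == '\t') {
            size_t j = i;
            while (in[j] == ' ' || in[j] == '\t')
                j++;                    /* longest match of [ \t]+ */
            if (in[j] == '\n' || in[j] == '\0')
                ;                       /* rule 1: trailing run, delete */
            else
                out[o++] = ' ';         /* rule 2: one blank */
            i = j;
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
}
```

Both rules are examined in a single left-to-right scan; the choice
between them is made only when the run of blanks or tabs ends.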
Lex can be used alone for simple transformations, or for
analysis and statistics gathering on a lexical level. Lex can
also be used with a parser generator to perform the lexical
analysis phase; it is particularly easy to interface Lex and Yacc
[3]. Lex programs recognize only regular expressions; Yacc writes
parsers that accept a large class of context free grammars, but
require a lower level analyzer to recognize input tokens. Thus, a
combination of Lex and Yacc is often appropriate. When used as
a preprocessor for a later parser generator, Lex is used to
partition the input stream, and the parser generator assigns
structure to the resulting pieces. The flow of control in such
a case (which might be the first half of a compiler, for
example) is shown in Figure 2. Additional programs, written by other
generators or by hand, can be added easily to programs written by
Lex.
lexical grammar
rules rules
| |
v v
+---------+ +---------+
| Lex | | Yacc |
+---------+ +---------+
| |
v v
+---------+ +---------+
Input -> | yylex | -> | yyparse | -> Parsed input
+---------+ +---------+
Lex with Yacc
Figure 2
Yacc users will realize that the name yylex is what Yacc expects
its lexical analyzer to be named, so that the use of this name by
Lex simplifies interfacing.
Lex generates a deterministic finite automaton from the
regular expressions in the source [4]. The automaton is
interpreted, rather than compiled, in order to save space. The
result is still a fast analyzer. In particular, the time taken by
a Lex program to recognize and partition an input stream is
proportional to the length of the input. The number of Lex rules
or the complexity of the rules is not important in determining
speed, unless rules which include forward context require a
significant amount of rescanning. What does increase with the
number and complexity of rules is the size of the finite
automaton, and therefore the size of the program generated by Lex.
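The linear-time claim follows from the table-driven structure:
each input character costs one transition-table lookup, regardless
of how many rules produced the table. The sketch below uses a
tiny hypothetical automaton (not Lex's actual table layout) for
the pattern ab*, with state 1 accepting and state 2 dead.

```c
#include <assert.h>

/* Illustrative table-driven DFA interpreter with hypothetical
   tables.  The automaton accepts the pattern ab*. */
enum { NSTATES = 3, NSYMS = 2 };        /* symbols: 0 = 'a', 1 = 'b' */

static const int delta[NSTATES][NSYMS] = {
    { 1, 2 },   /* state 0 (start): 'a' -> 1, 'b' -> dead */
    { 2, 1 },   /* state 1 (accepting): 'b' loops          */
    { 2, 2 },   /* state 2: dead state                     */
};
static const int accepting[NSTATES] = { 0, 1, 0 };

/* One table lookup per character: the work is proportional to the
   length of the input, not to the number or size of the rules. */
int dfa_match(const char *s)
{
    int state = 0;
    for (; *s; s++) {
        int sym = (*s == 'a') ? 0 : (*s == 'b') ? 1 : -1;
        if (sym < 0)
            return 0;
        state = delta[state][sym];
    }
    return accepting[state];
}
```

Adding rules grows the tables (more states and transitions) but
leaves the per-character cost unchanged, which is the point made
above.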
In the program written by Lex, the user's fragments
(representing the actions to be performed as each regular
expression is found) are gathered as cases of a switch. The
automaton interpreter directs the control flow. Opportunity is
provided for the user to insert either declarations or
additional statements in the routine containing the actions, or to add
subroutines outside this action routine.
Lex is not limited to source which can be interpreted on the
basis of one character lookahead. For example, if there are two
rules, one looking for ab and another for abcdefg, and the input
stream is abcdefh, Lex will recognize ab and leave the input
pointer just before cd. Such backup is more costly than the
processing of simpler languages.
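The ab / abcdefg example can be sketched directly: the scanner
runs forward as long as the input could still extend a match,
remembers the last accepting position it passed, and backs up to
it when the longer match fails. The helper below is hypothetical,
hard-wired to these two literal rules.

```c
#include <assert.h>
#include <stddef.h>

/* Longest-match with backup for the two literal rules "ab" and
   "abcdefg" (a hypothetical helper, not Lex's generated code).
   Returns the length of the token matched at p, or 0.  On input
   "abcdefh" it scans ahead through "abcdef", fails at 'h', and
   backs up to the accepting position after "ab". */
size_t match_at(const char *p)
{
    static const char longpat[] = "abcdefg";
    size_t i = 0, last_accept = 0;

    /* Scan forward while the input could still extend a match. */
    while (longpat[i] != '\0' && p[i] == longpat[i]) {
        i++;
        if (i == 2 || i == 7)
            last_accept = i;    /* "ab" and "abcdefg" both accept */
    }
    /* Back up to the last accepting position seen (0 if none). */
    return last_accept;
}
```

The characters cdef are scanned and then handed back, which is the
rescanning cost the previous paragraph refers to.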
2. Lex Source.
The general format of Lex source is:
{definitions}
%%
{rules}
%%
{user subroutines}
where the definitions and the user subroutines are often omitted.
The second %% is optional, but the first is required to mark the
beginning of the rules. The absolute minimum Lex program is thus
%%
(no definitions, no rules) which translates into a program which
copies the input to the output unchanged.
In the outline of Lex programs shown above, the rules
represent the user's control decisions; they are a table, in which
the left column contains regular expressions (see section 3) and
the right column contains actions, program fragments to be
executed when the expressions are recognized. Thus an individual
rule might appear
integer printf("found keyword INT");
to look for the string integer in the input stream and print the
message ``found keyword INT'' whenever it appears. In this
example the host procedural language is C and the C library
function printf is used to print the string. The end of the
expression is indicated by the first blank or tab character. If
the action is merely a single C expression, it can just be given
on the right side of the line; if it is compound, or takes more
than a line, it should be enclosed in braces. As a slightly more
useful example, suppose it is desired to change a number of words
from British to American spelling. Lex rules such as
colour printf("color");
mechanise printf("mechanize");
petrol printf("gas");
would be a start. These rules are not quite enough, since the
word petroleum would become gaseum; a way of dealing with this
will be described later.
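Assembled into a complete Lex source file, the spelling rules
might look as follows (a sketch only; the petroleum difficulty
just noted is still present, and the main shown merely invokes
the generated analyzer):

```lex
%%
colour          printf("color");
mechanise       printf("mechanize");
petrol          printf("gas");
%%
main()
{
        yylex();
}
```

All input not matched by some rule is copied to the output
unchanged, so the program is a complete, if naive, filter.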
3. Lex Regular Expressions.
The definitions of regular expressions are very similar to
those in QED [5]. A regular expression specifies a set of strings
to be matched. It contains text characters (which match the
corresponding characters in the strings being compared) and
operator characters (which specify repetitions, choices, and other
features). The letters of the alphabet and the digits are always
text characters; thus the regular expression
integer
matches the string integer wherever it appears and the expression
a57D
looks for the string a57D.
Operators. The operator characters are
" \ [ ] ^ - ? . * + | ( ) $ / { } % < >
and if they are to be used as text characters, an escape should be
used. The quotation mark operator (") indicates that whatever is
contained between a pair of quotes is to be taken as text
characters. Thus
xyz"++"
matches the string xyz++ when it appears. Note that a part of a
string may be quoted. It is harmless but unnecessary to quote an
ordinary text character; the expression
"xyz++"
is the same as the one above. Thus by quoting every
non-alphanumeric character being used as a text character, the
user can avoid remembering the list above of current operator
characters, and is safe should further extensions to Lex lengthen
the list.
An operator character may also be turned into a text
character by preceding it with \ as in
xyz\+\+
which is another, less readable, equivalent of the above
expressions. Another use of the quoting mechanism is to get a
blank into an expression; normally, as explained above, blanks or
tabs end a rule. Any blank character not contained within [] (see
below) must be quoted. Several normal C escapes with \ are
recognized: \n is newline, \t is tab, and \b is backspace. To
enter \ itself, use \\. Since newline is illegal in an
expression, \n must be used; it is not required to escape tab and
backspace. Every character but blank, tab, newline and the list
above is always a text character.
Character classes. Classes of characters can be specified
using the operator pair []. The construction [abc] matches a
single character, which may be a, b, or c. Within square
brackets, most operator meanings are ignored. Only three
characters are special: these are \ - and ^. The - character
indicates ranges. For example,
[a-z0-9<>_]
indicates the character class containing all the lower case
letters, the digits, the angle brackets, and underline. Ranges
may be given in either order. Using - between any pair of