?? yacc-docs.txt
字號:
http://www.csc.calpoly.edu/~gfisher/450/doc/yacc/paper.txt
Yacc: Yet Another Compiler-Compiler PS1:15-1
Yacc: Yet Another Compiler-Compiler
Stephen C. Johnson
AT&T Bell Laboratories
Murray Hill, New Jersey 07974
ABSTRACT
Computer program input generally has some struc-
ture; in fact, every computer program that does input
can be thought of as defining an ``input language''
which it accepts. An input language may be as complex
as a programming language, or as simple as a sequence
of numbers. Unfortunately, usual input facilities are
limited, difficult to use, and often are lax about
checking their inputs for validity.
Yacc provides a general tool for describing the
input to a computer program. The Yacc user specifies
the structures of his input, together with code to be
invoked as each such structure is recognized. Yacc
turns such a specification into a subroutine that han-
dles the input process; frequently, it is convenient
and appropriate to have most of the flow of control in
the user's application handled by this subroutine.
The input subroutine produced by Yacc calls a
user-supplied routine to return the next basic input
item. Thus, the user can specify his input in terms of
individual input characters, or in terms of higher
level constructs such as names and numbers. The user-
supplied routine may also handle idiomatic features
such as comment and continuation conventions, which
typically defy easy grammatical specification.
Yacc is written in portable C. The class of
specifications accepted is a very general one: LALR(1)
grammars with disambiguating rules.
In addition to compilers for C, APL, Pascal, RAT-
FOR, etc., Yacc has also been used for less conven-
tional languages, including a phototypesetter language,
several desk calculator languages, a document retrieval
system, and a Fortran debugging system.
PS1:15-2 Yacc: Yet Another Compiler-Compiler
0: Introduction
Yacc provides a general tool for imposing structure on the
input to a computer program. The Yacc user prepares a specifica-
tion of the input process; this includes rules describing the
input structure, code to be invoked when these rules are recog-
nized, and a low-level routine to do the basic input. Yacc then
generates a function to control the input process. This func-
tion, called a parser, calls the user-supplied low-level input
routine (the lexical analyzer) to pick up the basic items (called
tokens) from the input stream. These tokens are organized
according to the input structure rules, called grammar rules;
when one of these rules has been recognized, then user code sup-
plied for this rule, an action, is invoked; actions have the
ability to return values and make use of the values of other
actions.
Yacc is written in a portable dialect of C[1] and the
actions, and output subroutine, are in C as well. Moreover, many
of the syntactic conventions of Yacc follow C.
The heart of the input specification is a collection of
grammar rules. Each rule describes an allowable structure and
gives it a name. For example, one grammar rule might be
date : month_name day ',' year ;
Here, date, month_name, day, and year represent structures of
interest in the input process; presumably, month_name, day, and
year are defined elsewhere. The comma ``,'' is enclosed in sin-
gle quotes; this implies that the comma is to appear literally in
the input. The colon and semicolon merely serve as punctuation
in the rule, and have no significance in controlling the input.
Thus, with proper definitions, the input
July 4, 1776
might be matched by the above rule.
An important part of the input process is carried out by the
lexical analyzer. This user routine reads the input stream,
recognizing the lower level structures, and communicates these
tokens to the parser. For historical reasons, a structure recog-
nized by the lexical analyzer is called a terminal symbol, while
the structure recognized by the parser is called a nonterminal
symbol. To avoid confusion, terminal symbols will usually be
referred to as tokens.
There is considerable leeway in deciding whether to recog-
nize structures using the lexical analyzer or grammar rules. For
example, the rules
Yacc: Yet Another Compiler-Compiler PS1:15-3
month_name : 'J' 'a' 'n' ;
month_name : 'F' 'e' 'b' ;
. . .
month_name : 'D' 'e' 'c' ;
might be used in the above example. The lexical analyzer would
only need to recognize individual letters, and month_name would
be a nonterminal symbol. Such low-level rules tend to waste time
and space, and may complicate the specification beyond Yacc's
ability to deal with it. Usually, the lexical analyzer would
recognize the month names, and return an indication that a
month_name was seen; in this case, month_name would be a token.
Literal characters such as ``,'' must also be passed through
the lexical analyzer, and are also considered tokens.
Specification files are very flexible. It is realively easy
to add to the above example the rule
date : month '/' day '/' year ;
allowing
7 / 4 / 1776
as a synonym for
July 4, 1776
In most cases, this new rule could be ``slipped in'' to a working
system with minimal effort, and little danger of disrupting
existing input.
The input being read may not conform to the specifications.
These input errors are detected as early as is theoretically pos-
sible with a left-to-right scan; thus, not only is the chance of
reading and computing with bad input data substantially reduced,
but the bad data can usually be quickly found. Error handling,
provided as part of the input specifications, permits the reentry
of bad data, or the continuation of the input process after skip-
ping over the bad data.
In some cases, Yacc fails to produce a parser when given a
set of specifications. For example, the specifications may be
self contradictory, or they may require a more powerful recogni-
tion mechanism than that available to Yacc. The former cases
represent design errors; the latter cases can often be corrected
by making the lexical analyzer more powerful, or by rewriting
some of the grammar rules. While Yacc cannot handle all possible
specifications, its power compares favorably with similar sys-
tems; moreover, the constructions which are difficult for Yacc to
PS1:15-4 Yacc: Yet Another Compiler-Compiler
handle are also frequently difficult for human beings to handle.
Some users have reported that the discipline of formulating valid
Yacc specifications for their input revealed errors of conception
or design early in the program development.
The theory underlying Yacc has been described elsewhere.[2,
3, 4] Yacc has been extensively used in numerous practical appli-
cations, including lint,[5] the Portable C Compiler,[6] and a
system for typesetting mathematics.[7]
The next several sections describe the basic process of
preparing a Yacc specification; Section 1 describes the prepara-
tion of grammar rules, Section 2 the preparation of the user sup-
plied actions associated with these rules, and Section 3 the
preparation of lexical analyzers. Section 4 describes the opera-
tion of the parser. Section 5 discusses various reasons why Yacc
may be unable to produce a parser from a specification, and what
to do about it. Section 6 describes a simple mechanism for han-
dling operator precedences in arithmetic expressions. Section 7
discusses error detection and recovery. Section 8 discusses the
operating environment and special features of the parsers Yacc
produces. Section 9 gives some suggestions which should improve
the style and efficiency of the specifications. Section 10
discusses some advanced topics, and Section 11 gives acknowledge-
ments. Appendix A has a brief example, and Appendix B gives a
summary of the Yacc input syntax. Appendix C gives an example
using some of the more advanced features of Yacc, and, finally,
Appendix D describes mechanisms and syntax no longer actively
supported, but provided for historical continuity with older ver-
sions of Yacc.
1: Basic Specifications
Names refer to either tokens or nonterminal symbols. Yacc
requires token names to be declared as such. In addition, for
reasons discussed in Section 3, it is often desirable to include
the lexical analyzer as part of the specification file; it may be
useful to include other programs as well. Thus, every specifica-
tion file consists of three sections: the declarations, (grammar)
rules, and programs. The sections are separated by double per-
cent ``%%'' marks. (The percent ``%'' is generally used in Yacc
specifications as an escape character.)
In other words, a full specification file looks like
declarations
%%
rules
%%
programs
The declaration section may be empty. Moreover, if the pro-
grams section is omitted, the second %% mark may be omitted also;
Yacc: Yet Another Compiler-Compiler PS1:15-5
thus, the smallest legal Yacc specification is
%%
rules
Blanks, tabs, and newlines are ignored except that they may
not appear in names or multi-character reserved symbols. Com-
ments may appear wherever a name is legal; they are enclosed in
/* . . . */, as in C and PL/I.
The rules section is made up of one or more grammar rules.
A grammar rule has the form:
A : BODY ;
A represents a nonterminal name, and BODY represents a sequence
of zero or more names and literals. The colon and the semicolon
are Yacc punctuation.
Names may be of arbitrary length, and may be made up of
letters, dot ``.'', underscore ``_'', and non-initial digits.
Upper and lower case letters are distinct. The names used in the
body of a grammar rule may represent tokens or nonterminal sym-
bols.
A literal consists of a character enclosed in single quotes
``'''. As in C, the backslash ``\'' is an escape character
within literals, and all the C escapes are recognized. Thus
'\n' newline
'\r' return
'\'' single quote ``'''
'\\' backslash ``\''
'\t' tab
'\b' backspace
'\f' form feed
'\xxx' ``xxx'' in octal
For a number of technical reasons, the NUL character ('\0' or 0)
should never be used in grammar rules.
If there are several grammar rules with the same left hand
side, the vertical bar ``|'' can be used to avoid rewriting the
left hand side. In addition, the semicolon at the end of a rule
can be dropped before a vertical bar. Thus the grammar rules
A : B C D ;
A : E F ;
A : G ;
can be given to Yacc as
PS1:15-6 Yacc: Yet Another Compiler-Compiler
A : B C D
| E F
| G
;
It is not necessary that all grammar rules with the same left
side appear together in the grammar rules section, although it
makes the input much more readable, and easier to change.
If a nonterminal symbol matches the empty string, this can
be indicated in the obvious way:
empty : ;
Names representing tokens must be declared; this is most
simply done by writing
%token name1 name2 . . .
in the declarations section. (See Sections 3 , 5, and 6 for much
more discussion). Every name not defined in the declarations
section is assumed to represent a nonterminal symbol. Every non-
terminal symbol must appear on the left side of at least one
rule.
Of all the nonterminal symbols, one, called the start sym-
bol, has particular importance. The parser is designed to recog-
nize the start symbol; thus, this symbol represents the largest,
most general structure described by the grammar rules. By
default, the start symbol is taken to be the left hand side of
the first grammar rule in the rules section. It is possible, and
in fact desirable, to declare the start symbol explicitly in the
declarations section using the %start keyword:
%start symbol
The end of the input to the parser is signaled by a special
token, called the endmarker. If the tokens up to, but not
including, the endmarker form a structure which matches the start
symbol, the parser function returns to its caller after the end-
?? 快捷鍵說明
復制代碼
Ctrl + C
搜索代碼
Ctrl + F
全屏模式
F11
切換主題
Ctrl + Shift + D
顯示快捷鍵
?
增大字號
Ctrl + =
減小字號
Ctrl + -