The compiler description language Cocol/R
=========================================
(This is a modified version of parts of Professor Moessenboeck's 1990 paper
to allow for the fact that this implementation is for C/C++. The full
version of the paper should be consulted by serious users.)
A compiler description can be viewed as a module consisting of imports,
declarations and grammar rules that describe the lexical and syntactical
structure of a language as well as its translation into a target language.
The vocabulary of Cocol/R uses identifiers, strings and numbers in the usual
way:
ident  = letter { letter | digit } .
string = '"' { anyButQuote } '"' | "'" { anyButApostrophe } "'" .
number = digit { digit } .
Upper case letters are distinct from lower case letters. Strings must not
cross line borders. Coco/R keywords are
ANY CASE CHARACTERS CHR COMMENTS COMPILER CONTEXT END EOF FROM
IGNORE NAMES NESTED PRAGMAS PRODUCTIONS SYNC TO TOKENS WEAK
(NAMES is an extension over the original Oberon implementation.)
The following metasymbols are used to form EBNF expressions:
( )        for grouping
{ }        for iterations
[ ]        for options
< >        for attributes
(. .)      for semantic parts
= . | + -  as explained below
Comments are enclosed in "/*" and "*/" brackets, and may be nested. The
semantic parts may contain declarations or statements in a general purpose
programming language (in this case, C or C++).
The Oberon, Modula-2 and Pascal implementations use "(*" and "*)" for
comments; the C/C++ versions use C-like comments because "(*" can cause
problems in semantic actions, for example: (. while (*s) s++; .)
where "(*" clearly is not intended to begin a comment.
Overall Structure
=================
A compiler description is made up of the following parts
Cocol = "COMPILER" GoalIdentifier
          ArbitraryText
          ScannerSpecification
          ParserSpecification
        "END" GoalIdentifier "." .
The name after the keyword COMPILER is the grammar name and must match the
name after the keyword END. The grammar name also denotes the topmost
non-terminal (the start symbol).
After the grammar name, arbitrary C/C++ text may follow; this is not checked
by Coco/R. It usually contains C/C++ declarations of global objects
(constants, types, variables, or procedures) that are needed in the semantic
actions later on.
The remaining parts of the compiler description specify the lexical and
syntactical structure of the language to be processed. Effectively two
grammars are specified - one for the lexical analyser or scanner, and the
other for the syntax analyser or parser. The non-terminals (token classes)
recognized by the scanner are regarded as terminals by the parser.
Scanner Specification
=====================
A scanner has to read source text, skip meaningless characters, and recognize
tokens that have to be passed to the parser. Tokens may be classified as
literals or as token classes. Literals (like "END" and "!=") may be
introduced directly into productions as strings, and do not need to be named.
Token classes (such as identifiers or numbers) must be named, and have
structures that are specified by regular expressions, defined in EBNF.
In Cocol, a scanner specification consists of six optional parts, which may
be introduced in arbitrary order.
ScannerSpecification = { CharacterSets
                       | Ignorable
                       | Comments
                       | Tokens
                       | Pragmas
                       | UserNames
                       } .
CHARACTERS
----------
The CharacterSets component allows for the declaration of names for character
sets like letters or digits, and defines the characters that may occur as
members of these sets. These names then may be used in the other sections of
the scanner specification (but not, it should be noted, in the parser
specification).
CharacterSets = "CHARACTERS" { NamedCharSet } .
NamedCharSet  = SetIdent "=" CharacterSet "." .
CharacterSet  = SimpleSet { ( "+" | "-" ) SimpleSet } .
SimpleSet     = SetIdent | string | "ANY"
              | SingleChar [ ".." SingleChar ] .
SingleChar    = "CHR" "(" number ")" .
SetIdent      = identifier .
Simple character sets are denoted by one of
SetIdent          a previously declared character set with that name
string            a set consisting of all characters in the string
CHR(i)            a set of one character with ordinal value i
CHR(i) .. CHR(j)  a set consisting of all characters whose ordinal
                  values are in the range i ... j
ANY               the set of all characters acceptable to the
                  implementation
Simple sets may then be combined by the union (+) and difference operators
(-).
The ability to specify a range like CHR(7) .. CHR(31) is an extension over
the original Oberon implementation.
EXAMPLES:
digit     = "0123456789" .      The set of all digits
hexdigit  = digit + "ABCDEF" .  The set of all hexadecimal digits
eol       = CHR(13) .           End-of-line character
noDigit   = ANY - digit .       Any character that is not a digit
ctrlChars = CHR(1) .. CHR(31) . The ASCII control characters
COMMENTS AND IGNORABLE CHARACTERS
---------------------------------
Usually spaces within the source text of a program are irrelevant, and in
scanning for the start of a token, a Coco/R generated scanner will simply
ignore them. Other separators like tabs, line ends, and form feeds may also
be declared irrelevant, and some applications may prefer to ignore the
distinction between upper and lower case input.
Comments are difficult to specify with the regular expressions used to denote
tokens - indeed, nested comments may not be specified at all in this way.
Since comments are usually discarded by a parsing process, and may typically
appear in arbitrary places in source code, it makes sense to have a special
construct to express their structure.
Ignorable aspects of the scanning process are defined in Cocol by
Comments = "COMMENTS" "FROM" TokenExpr "TO" TokenExpr [ "NESTED" ] .
Ignorable = "IGNORE" ( "CASE" | CharacterSet ) .
where the optional keyword NESTED should have an obvious meaning. A practical
restriction is that comment brackets must not be longer than 2 characters. It
is possible to declare several kinds of comments within a single grammar, for
example, for C++:
COMMENTS FROM "/*" TO "*/"
COMMENTS FROM "//" TO eol
IGNORE CHR(9) .. CHR(13)
The set of ignorable characters in this example includes the standard white
space separators in ASCII files. The null character CHR(0) should not be
included in any ignorable set: it is used internally by Coco/R to mark the
end of the input file.
TOKENS
------
A very important part of the scanner specification declares the form of
terminal tokens:
Tokens      = "TOKENS" { Token } .
Token       = TokenSymbol [ "=" TokenExpr "." ] .
TokenExpr   = TokenTerm { "|" TokenTerm } .
TokenTerm   = TokenFactor { TokenFactor } [ "CONTEXT" "(" TokenExpr ")" ] .
TokenFactor = SetIdent | string
            | "(" TokenExpr ")"
            | "[" TokenExpr "]"
            | "{" TokenExpr "}" .
TokenSymbol = TokenIdent | string .
TokenIdent  = identifier .
Tokens may be declared in any order. A token declaration defines a
TokenSymbol together with its structure. Usually the symbol on the left-hand
side of the declaration is an identifier, which is then used in other parts of
the grammar to denote the structure described on the right-hand side of the
declaration by a regular expression (expressed in EBNF). This expression may
contain literals denoting themselves (for example "END"), or the names of
character sets (for example letter), denoting an arbitrary character from such
sets. The restriction to regular expressions means that it may not contain
the names of any other tokens.
While token specification is usually straightforward, there are a number of
subtleties that may need emphasizing:
- There is one predeclared token EOF that can be used in productions where
  it is necessary to check explicitly that the end of the source has been
  reached. Once the scanner detects the end of the source, further attempts
  to obtain a token return only this one.
- Since spaces are deemed to be irrelevant when they come between tokens in
  the input for most languages, one should not attempt to declare literal
  tokens that have spaces within them.
- The grammar for tokens allows for empty right-hand sides. This may seem
  strange, especially as no scanner is generated if the right-hand side of
  a declaration is missing. This facility is used if the user wishes to
  supply a hand-crafted scanner, rather than the one generated by Coco/R.
  In this case, the symbol on the left-hand side of a token declaration may
  also simply be specified by a string, with no right-hand side.
- Tokens specified without right-hand sides are numbered consecutively
  starting from 0, and the hand-crafted scanner has to return token codes
  according to this numbering scheme.
- The CONTEXT phrase in a TokenTerm means that the term is only recognized
  when its right-hand context in the input stream is the TokenExpr
  specified in brackets.
EXAMPLES:
ident  = letter { letter | digit } .
real   = digit { digit } "." { digit }
         [ "E" [ "+" | "-" ] digit { digit } ] .
number = digit { digit }
       | digit { digit } CONTEXT ( ".." ) .
and    = "&" .
The CONTEXT phrase in the above example allows a distinction between reals
(e.g. 1.23) and range constructs (e.g. 1..2) that could otherwise not be
scanned with a single character lookahead.
PRAGMAS
-------
A pragma, like a comment, is a token that may occur anywhere in the input
stream, but, unlike a comment, it cannot be ignored. Pragmas are often used
to allow programmers to select compiler switches dynamically. Since it
becomes impractical to modify the phrase structure grammar to handle this, a
special mechanism is provided for the recognition and treatment of pragmas.
In Cocol they are declared like tokens, but may have an associated semantic
action that is executed whenever they are recognized by the scanner.
Pragmas = "PRAGMAS" { Pragma } .
Pragma = Token [ Action ] .
Action = "(." arbitraryText ".)" .
EXAMPLE:
option = "$" { letter } .
    (. char str[50]; int i;
       S_GetString(S_Pos, S_Len, str);
       i = 0;
       while (i < S_Len) {
         switch (str[i]) {
           ...
         }
         i++;
       } .)
USER NAMES
----------
Coco/R, by default, generates symbolic names for token symbols, sometimes
having a rather stereotyped form. The UserNames section may be used to prefer
user-defined names, or to help resolve name clashes (for example, between the
default names that would be chosen for "point" and ".").
UserNames = "NAMES" { UserName } .
UserName = TokenIdent "=" ( identifier | string ) "." .
EXAMPLES:
NAMES
period = "." .
ellipsis = "..." .
For special purposes the symbol on the left-hand side may also be a string,
in which case no right-hand side may be specified; this is used if the user
wishes to supply a hand-crafted scanner. Indeed, if the right-hand side
of a declaration is missing, no scanner is generated.
The ability to use names is an extension over the original Oberon
implementation.
Parser Specification
====================
The parser specification is the main part of the input to Coco/R. It contains
the productions of an attributed grammar specifying the syntax of the language
to be recognized, as well as the action to be taken as each phrase or token is
recognized.
The form of the parser specification may itself be described in EBNF as
follows. For the Modula-2 and Pascal versions we have:
ParserSpecification = "PRODUCTIONS" { Production } .
Production          = NonTerminal [ FormalAttributes ]
                      [ LocalDeclarations ]      (* Modula-2 and Pascal *)
                      "=" Expression "." .
FormalAttributes    = "<" arbitraryText ">" | "<." arbitraryText ".>" .
LocalDeclarations   = "(." arbitraryText ".)" .
NonTerminal         = identifier .
For the C and C++ versions the LocalDeclarations follow the "=" instead:
Production = NonTerminal [ FormalAttributes ]
             "=" [ LocalDeclarations ]           /* C and C++ */
             Expression "." .
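As a hedged illustration of this C/C++ form (the nonterminal Term, the
attributes and the semantic actions below are invented for the sketch, not
drawn from a real grammar), a production might read:

```
Expression<int *result>
=                            (. int operand; .)
  Term<result>
  { "+" Term<&operand>       (. *result += operand; .)
  } .
```

Here the local declaration of operand follows the "=", as the C/C++ grammar
above requires, while the actions in (. .) brackets are executed as the
corresponding phrases are recognized.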
Any identifier appearing in a production that was not previously declared as a
terminal token is considered to be the name of a NonTerminal, and there must