?? lemon.html
字號:
<html><head><title>The Lemon Parser Generator</title></head><body bgcolor=white><h1 align=center>The Lemon Parser Generator</h1><p>Lemon is an LALR(1) parser generator for C or C++. It does the same job as ``bison'' and ``yacc''.But lemon is not another bison or yacc clone. Ituses a different grammar syntax which is designed toreduce the number of coding errors. Lemon also uses a moresophisticated parsing engine that is faster than yacc andbison and which is both reentrant and thread-safe.Furthermore, Lemon implements features that can be usedto eliminate resource leaks, making is suitable for usein long-running programs such as graphical user interfacesor embedded controllers.</p><p>This document is an introduction to the Lemonparser generator.</p><h2>Theory of Operation</h2><p>The main goal of Lemon is to translate a context free grammar (CFG)for a particular language into C code that implements a parser forthat language.The program has two inputs:<ul><li>The grammar specification.<li>A parser template file.</ul>Typically, only the grammar specification is supplied by the programmer.Lemon comes with a default parser template which works fine for mostapplications. But the user is free to substitute a different parsertemplate if desired.</p><p>Depending on command-line options, Lemon will generate betweenone and three files of outputs.<ul><li>C code to implement the parser.<li>A header file defining an integer ID for each terminal symbol.<li>An information file that describes the states of the generated parser automaton.</ul>By default, all three of these output files are generated.The header file is suppressed if the ``-m'' command-line option isused and the report file is omitted when ``-q'' is selected.</p><p>The grammar specification file uses a ``.y'' suffix, by convention.In the examples used in this document, we'll assume the name of thegrammar file is ``gram.y''. A typical use of Lemon would be thefollowing command:<pre> lemon gram.y</pre>This command will generate three output files named ``gram.c'',``gram.h'' and ``gram.out''.The first is C code to implement the parser. The secondis the header file that defines numerical values for allterminal symbols, and the last is the report that explainsthe states used by the parser automaton.</p><h3>Command Line Options</h3><p>The behavior of Lemon can be modified using command-line options.You can obtain a list of the available command-line options togetherwith a brief explanation of what each does by typing<pre> lemon -?</pre>As of this writing, the following command-line options are supported:<ul><li><tt>-b</tt><li><tt>-c</tt><li><tt>-g</tt><li><tt>-m</tt><li><tt>-q</tt><li><tt>-s</tt><li><tt>-x</tt></ul>The ``-b'' option reduces the amount of text in the report file byprinting only the basis of each parser state, rather than the fullconfiguration.The ``-c'' option suppresses action table compression. Using -cwill make the parser a little larger and slower but it will detectsyntax errors sooner.The ``-g'' option causes no output files to be generated at all.Instead, the input grammar file is printed on standard output butwith all comments, actions and other extraneous text deleted. Thisis a useful way to get a quick summary of a grammar.The ``-m'' option causes the output C source file to be compatiblewith the ``makeheaders'' program.Makeheaders is a program that automatically generates header filesfrom C source code. When the ``-m'' option is used, the headerfile is not output since the makeheaders program will take careof generated all header files automatically.The ``-q'' option suppresses the report file.Using ``-s'' causes a brief summary of parser statistics to beprinted. Like this:<pre> Parser statistics: 74 terminals, 70 nonterminals, 179 rules 340 states, 2026 parser table entries, 0 conflicts</pre>Finally, the ``-x'' option causes Lemon to print its version numberand then stops without attempting to read the grammar or generate a parser.</p><h3>The Parser Interface</h3><p>Lemon doesn't generate a complete, working program. It only generatesa few subroutines that implement a parser. This section describesthe interface to those subroutines. It is up to the programmer tocall these subroutines in an appropriate way in order to produce acomplete system.</p><p>Before a program begins using a Lemon-generated parser, the programmust first create the parser.A new parser is created as follows:<pre> void *pParser = ParseAlloc( malloc );</pre>The ParseAlloc() routine allocates and initializes a new parser andreturns a pointer to it.The actual data structure used to represent a parser is opaque --its internal structure is not visible or usable by the calling routine.For this reason, the ParseAlloc() routine returns a pointer to voidrather than a pointer to some particular structure.The sole argument to the ParseAlloc() routine is a pointer to thesubroutine used to allocate memory. Typically this means ``malloc()''.</p><p>After a program is finished using a parser, it can reclaim allmemory allocated by that parser by calling<pre> ParseFree(pParser, free);</pre>The first argument is the same pointer returned by ParseAlloc(). Thesecond argument is a pointer to the function used to release bulkmemory back to the system.</p><p>After a parser has been allocated using ParseAlloc(), the programmermust supply the parser with a sequence of tokens (terminal symbols) tobe parsed. This is accomplished by calling the following functiononce for each token:<pre> Parse(pParser, hTokenID, sTokenData, pArg);</pre>The first argument to the Parse() routine is the pointer returned byParseAlloc().The second argument is a small positive integer that tells the parse thetype of the next token in the data stream.There is one token type for each terminal symbol in the grammar.The gram.h file generated by Lemon contains #define statements thatmap symbolic terminal symbol names into appropriate integer values.(A value of 0 for the second argument is a special flag to theparser to indicate that the end of input has been reached.)The third argument is the value of the given token. By default,the type of the third argument is integer, but the grammar willusually redefine this type to be some kind of structure.Typically the second argument will be a broad category of tokenssuch as ``identifier'' or ``number'' and the third argument willbe the name of the identifier or the value of the number.</p><p>The Parse() function may have either three or four arguments,depending on the grammar. If the grammar specification file requestit, the Parse() function will have a fourth parameter that can beof any type chosen by the programmer. The parser doesn't do anythingwith this argument except to pass it through to action routines.This is a convenient mechanism for passing state information downto the action routines without having to use global variables.</p><p>A typical use of a Lemon parser might look something like thefollowing:<pre> 01 ParseTree *ParseFile(const char *zFilename){ 02 Tokenizer *pTokenizer; 03 void *pParser; 04 Token sToken; 05 int hTokenId; 06 ParserState sState; 07 08 pTokenizer = TokenizerCreate(zFilename); 09 pParser = ParseAlloc( malloc ); 10 InitParserState(&sState); 11 while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){ 12 Parse(pParser, hTokenId, sToken, &sState); 13 } 14 Parse(pParser, 0, sToken, &sState); 15 ParseFree(pParser, free ); 16 TokenizerFree(pTokenizer); 17 return sState.treeRoot; 18 }</pre>This example shows a user-written routine that parses a file oftext and returns a pointer to the parse tree.(We've omitted all error-handling from this example to keep itsimple.)We assume the existence of some kind of tokenizer which is createdusing TokenizerCreate() on line 8 and deleted by TokenizerFree()on line 16. The GetNextToken() function on line 11 retrieves thenext token from the input file and puts its type in the integer variable hTokenId. The sToken variable is assumed to besome kind of structure that contains details about each token,such as its complete text, what line it occurs on, etc. </p><p>This example also assumes the existence of structure of typeParserState that holds state information about a particular parse.An instance of such a structure is created on line 6 and initializedon line 10. A pointer to this structure is passed into the Parse()routine as the optional 4th argument.The action routine specified by the grammar for the parser can usethe ParserState structure to hold whatever information is useful andappropriate. In the example, we note that the treeRoot field ofthe ParserState structure is left pointing to the root of the parsetree.</p><p>The core of this example as it relates to Lemon is as follows:<pre> ParseFile(){ pParser = ParseAlloc( malloc ); while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){ Parse(pParser, hTokenId, sToken); } Parse(pParser, 0, sToken); ParseFree(pParser, free ); }</pre>Basically, what a program has to do to use a Lemon-generated parseris first create the parser, then send it lots of tokens obtained bytokenizing an input source. When the end of input is reached, theParse() routine should be called one last time with a token typeof 0. This step is necessary to inform the parser that the end ofinput has been reached. Finally, we reclaim memory used by theparser by calling ParseFree().</p><p>There is one other interface routine that should be mentionedbefore we move on.The ParseTrace() function can be used to generate debugging outputfrom the parser. A prototype for this routine is as follows:<pre> ParseTrace(FILE *stream, char *zPrefix);</pre>After this routine is called, a short (one-line) message is writtento the designated output stream every time the parser changes statesor calls an action routine. Each such message is prefaced usingthe text given by zPrefix. This debugging output can be turned offby calling ParseTrace() again with a first argument of NULL (0).</p><h3>Differences With YACC and BISON</h3><p>Programmers who have previously used the yacc or bison parsergenerator will notice several important differences between yacc and/orbison and Lemon.<ul><li>In yacc and bison, the parser calls the tokenizer. In Lemon, the tokenizer calls the parser.<li>Lemon uses no global variables. Yacc and bison use global variables to pass information between the tokenizer and parser.<li>Lemon allows multiple parsers to be running simultaneously. Yacc and bison do not.</ul>These differences may cause some initial confusion for programmerswith prior yacc and bison experience.But after years of experience using Lemon, I firmlybelieve that the Lemon way of doing things is better.</p><h2>Input File Syntax</h2><p>The main purpose of the grammar specification file for Lemon isto define the grammar for the parser. But the input file alsospecifies additional information Lemon requires to do its job.Most of the work in using Lemon is in writing an appropriategrammar file.</p><p>The grammar file for lemon is, for the most part, free format.It does not have sections or divisions like yacc or bison. Anydeclaration can occur at any point in the file.Lemon ignores whitespace (except where it is needed to separatetokens) and it honors the same commenting conventions as C and C++.</p><h3>Terminals and Nonterminals</h3><p>A terminal symbol (token) is any string of alphanumericand underscore charactersthat begins with an upper case letter.A terminal can contain lower class letters after the first character,but the usual convention is to make terminals all upper case.A nonterminal, on the other hand, is any string of alphanumericand underscore characters than begins with a lower case letter.Again, the usual convention is to make nonterminals use all lowercase letters.</p><p>In Lemon, terminal and nonterminal symbols do not need to be declared or identified in a separate section of the grammar file.Lemon is able to generate a list of all terminals and nonterminalsby examining the grammar rules, and it can always distinguish aterminal from a nonterminal by checking the case of the firstcharacter of the name.</p>
?? 快捷鍵說明
復制代碼
Ctrl + C
搜索代碼
Ctrl + F
全屏模式
F11
切換主題
Ctrl + Shift + D
顯示快捷鍵
?
增大字號
Ctrl + =
減小字號
Ctrl + -