Using the UnderC Tokenizer Class
It's often necessary to parse complex text files, where standard i/o
is too clumsy. C programmers fall back on strtok(), but this can be
tricky to use properly. Besides, you are still responsible for keeping
strtok() fed with new input, and I don't like the schlepp.
Tokenizer is a text-parsing input stream, modelled after the (undocumented)
VCL TParser class, and based on the UnderC tokenizing preprocessor front-end.
For example, consider a test.txt file containing this text:
one "hello" 'dolly' 10 2.3 *
;> #include <uc/tokens.h>
;> Tokenizer tok;
;> tok.open("test.txt");
(bool) true
;> tok.next();
(TokenType) 1 T_TOKEN
;> tok.get_str();
(char*) "one"
The next() method grabs the next token in the stream, and returns
a code identifying the token type. The token itself is available
using get_str(). next() will eventually return T_END (zero) when
there are no more tokens available.
This is what happens when next() and get_str() are repeatedly called
(I've used #define to save typing, as usual):
;> #define F tok.next(); tok.get_str()
;> F
(TokenType) 7 T_STRING
(char*) "hello"
;> F
(TokenType) 7 T_STRING
(char*) "dolly"
;> F
(TokenType) 3 T_DOUBLE
(char*) "10"
;> F
(TokenType) 3 T_DOUBLE
(char*) "2.3"
;> F
(TokenType) 42 <undef>
(char*) "*"
;> F
(TokenType) 0 T_END
(char*) ""
;>
Anything which is not classified is returned as a single non-whitespace character;
generally people are not interested in whitespace (though it would not be difficult
to make Tokenizer optionally respect it).
get_str() returns the token buffer, which will be overwritten each time next()
is called - so never store the returned char pointer. Either copy it into a string,
or call get_str() with an explicit char* argument.
;> char buff[80];
;> tok.get_str(buff);
(char*) "hello"
The idea behind separating token fetching into next() and get_str() is to
make quite sure that you have the appropriate format available for conversion.
;> tok.get_str();
(char*) "10"
;> tok.get_number();
(double) 10.0
Processing String Buffers and Searching Forward
Tokenizer can work on a given string buffer; here is an example which takes a
comma-separated list of items, and extracts them as individual strings:
int get_comma_list(string s, string vals[], int sz)
{
    Tokenizer ts;
    ts.set_str(s.c_str());
    int i = 0;
    while (ts.next()) {
        vals[i++] = ts.get_str();
        if (i == sz) break;
        ts.next(); // skip ','
    }
    return i;
}
This depends on your items being recognizable tokens, of course, but it does the
tricky part of ignoring commas inside strings. I've often found such operations
useful when parsing configuration files. You can of course use 'char** vals' instead
of 'string vals[]' but you will need to do a strdup() on the char pointer returned
by get_str().
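For comparison, here is roughly the same job in portable C++ without Tokenizer - a hypothetical split_commas() that treats double-quoted text as a single item, which is the tricky part mentioned above:

```cpp
#include <string>
#include <vector>

// A standard-C++ sketch of the same idea: split a comma-separated list,
// but treat anything inside double quotes as one item. This is a
// simplified stand-in for what Tokenizer does with T_STRING tokens.
std::vector<std::string> split_commas(const std::string& s)
{
    std::vector<std::string> vals;
    std::string item;
    bool in_quotes = false;
    for (char ch : s) {
        if (ch == '"') in_quotes = !in_quotes;  // toggle at each quote
        else if (ch == ',' && !in_quotes) {     // only split outside quotes
            vals.push_back(item);
            item.clear();
        }
        else item += ch;
    }
    vals.push_back(item);                       // the final field
    return vals;
}
```

So split_commas("a,\"x,y\",b") yields three items, with "x,y" kept whole.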
It is very useful to be able to move quickly to the first occurrence of a string
inside a file; this is done with the go_to() method. Here is a function which finds
a given section/key in an INI file:
bool find_key(Tokenizer& tok, string section, string key)
{
    string sect = "[" + section + "]";
    if (! tok.go_to(sect.c_str())) return false;  // section not present
    while (tok.next()) {
        if (key == tok.get_str()) {
            tok.next(); // skip '='
            return true;
        } else tok.discard_line();
    }
    return false;
}
For each key that doesn't match, I use discard_line() to force the next line to
be read. getline() does the same thing, except that it copies the rest of the line
into a buffer. So to pick up the value of the key (which may contain special
characters, like a file path) I can do this:
find_key(tok,"Files","Directory");
tok.getline(buff,sizeof(buff));
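The same section/key lookup can be sketched with a plain istream; ini_value() below is a hypothetical helper, not part of Tokenizer, assuming the simple [section] / key=value layout used above:

```cpp
#include <string>
#include <sstream>
#include <istream>

// Find key in [section] and return everything after the '='.
// Hypothetical helper: a plain-C++ analogue of find_key()+getline().
std::string ini_value(std::istream& in, const std::string& section,
                      const std::string& key)
{
    std::string line, want = "[" + section + "]";
    bool in_section = false;
    while (std::getline(in, line)) {
        if (line == want) { in_section = true; continue; }
        if (!in_section) continue;
        if (!line.empty() && line[0] == '[') break;  // next section starts
        std::string::size_type eq = line.find('=');
        if (eq != std::string::npos && line.substr(0, eq) == key)
            return line.substr(eq + 1);  // may contain spaces, paths, etc.
    }
    return "";
}
```

For instance, feeding it "[Files]\nDirectory=c:/data\n" and asking for ("Files", "Directory") gives "c:/data".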
There is also getch() which fetches characters one at a time - this is one Tokenizer
method which respects whitespace. This function will extract all block comments
from a C file:
void grab_comments(const char* file)
{
    Tokenizer tok(file);
    while (tok.go_to("/*")) {
        cout << tok.line() << ": ";
        do {
            cout << tok.getch();
        } while (! tok.next_is("*/"));
        cout << endl;
    }
}
next_is() is used to look ahead on the same line.
Making Tokenizer Respect C Strings
Please note that by default Tokenizer classifies text in both single and double quotes
as T_STRING, and all kinds of numbers as T_DOUBLE. If you want to make the usual
C-like distinctions, then use the C_MODE flag:
;> tok.rewind();
;> tok.set_flags(C_MODE);
;> F
(TokenType) 1 T_TOKEN
(char*) "one"
;> F
(TokenType) 7 T_STRING
(char*) "hello"
;> F
(TokenType) 6 T_CHAR
(char*) "dolly"
;> F
(TokenType) 2 T_INT
(char*) "10"
;> F
(TokenType) 3 T_DOUBLE
(char*) "2.3"
;> F
(TokenType) 42 <undef>
(char*) "*"
;> F
(TokenType) 0 T_END
(char*) "
"
Flags
C_NUMBER: integers beginning with zero are interpreted as octal, and 0x... indicates
the start of a hexadecimal number, as usual. get_int() will automatically
convert any hexadecimals.
C_STRING: a distinction is made between single and double quotes; any escape characters
inside strings are respected.
C_IDEN: words may contain '_' and numbers, like C identifiers.
STRIP_LINEFEEDS: explicitly remove linefeeds when reading text.
These flags will affect other operations as well. The next_float() method
moves along in the stream until it finds a valid T_NUMBER:
double Tokenizer::next_float()
{
    TokenType t;
    do {
        t = next();
        if (t == T_NUMBER) return get_float();
    } while (t != T_END);
    return 0.0;
}
If set_flags(C_NUMBER) has previously been called, then it will _only_ pick up explicit
floating-point numbers (i.e. those which have a fractional or an exponent part). This
function returns the sum of all the floating-point numbers found in a stream:
double sum_values(Tokenizer& tok)
{
    double sum = 0.0;
    tok.set_flags(C_NUMBER);
    while (! tok.eof()) sum += tok.next_float();
    return sum;
}
It will not include any integers in this summation, which is useful when you want
a quick statistic from a data file.
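The same "explicit floats only" filter can be sketched in portable C++ with strtod(); sum_explicit_floats() below is a hypothetical helper that keeps a value only if its text contains a fractional or exponent part, just as C_NUMBER does:

```cpp
#include <cstdlib>
#include <string>

// Scan a string with strtod(), summing only numbers written with a '.'
// or an exponent, so plain integers are skipped (hypothetical helper,
// a plain-C++ analogue of next_float() under C_NUMBER).
double sum_explicit_floats(const std::string& s)
{
    double sum = 0.0;
    const char* p = s.c_str();
    char* end;
    while (*p) {
        double v = strtod(p, &end);
        if (end == p) { ++p; continue; }     // not a number: step past it
        bool is_float = false;
        for (const char* q = p; q < end; ++q)
            if (*q == '.' || *q == 'e' || *q == 'E') is_float = true;
        if (is_float) sum += v;              // skip plain integers
        p = end;
    }
    return sum;
}
```

So sum_explicit_floats("10 2.5 3 1e1") gives 12.5: the "10" and "3" are ignored, while "2.5" and "1e1" are counted.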
Here is a more complex file format, which represents a digitized underground mine
plan. A typical MLS file will contain thousands of these segments:
...
face;
myid 4365992;
date 1995/10;
node 3,4366224;
node 1,4360216;
points (22341.1,-2118.5,2237.7),(22342.4,-2120.1,2238.7),(22344.2,-2122.2,2240)
,(22345.9,-2124.4,2241.2),(22347.8,-2126.4,2242.5),(22349.8,-2128.1,2243.7)
,(22352,-2130.1,2245.1),(22354,-2132.2,2246.5),(22356.1,-2134,2247.8)
,(22358.3,-2136.1,2249.2),(22360.4,-2138,2250.5),(22362.7,-2140.1,2252)
,(22362.9,-2140.1,2252);
endface;
...
I inherited this file format from a previous project, and it isn't ideal.
Extracting points from the comma-separated list is more laborious than it ought
to be. But it's straightforward to access these points using Tokenizer. For example,
this code works out the average of the x,y and z coordinates within a file:
double xsum = 0.0, ysum = 0.0, zsum = 0.0;
int knt = 0;  // number of points read
while (tok.go_to("points")) {
    char ch;
    do {
        xsum += tok.next_float();
        ysum += tok.next_float();
        zsum += tok.next_float();
        ch = (char)tok.next();
        if (ch == ')') ch = (char)tok.next();
        knt++;
    } while (ch != ';');
}
I had some serious 7 MB files in this format, and this code took just over a
second to parse the numbers. It's interesting to observe that the program took
only 35% more time using UnderC than with the Microsoft compiler.
Error Handling
Generally, exception handling produces clean code where the 'exceptional' cases are
handled differently from business-as-usual. But libraries imported into UnderC
cannot meaningfully throw exceptions, so supporting this style requires some
helper functions.
char buff[80];

void error_expecting(const char* msg)
{
    throw TokenException(msg);
}

void must_be(Tokenizer& tok, char ch)
{
    TokenType t = tok.next();
    if ((char)t != ch) {
        sprintf(buff, "expecting '%c'", ch);
        error_expecting(buff);
    }
}

char* must_be_string(Tokenizer& tok)
{
    TokenType t = tok.next();
    if (t != T_STRING) {
        error_expecting("string");
    }
    return tok.get_str();
}
You can then write code which concisely describes what tokens are expected
at each point. Here is part of an XML parser.
// pick up element attributes, if any
while (t != '/' && t != '>') {
    if (t != T_TOKEN) error_expecting("word");
    name = tok.get_str();
    must_be(tok, '=');
    val = must_be_string(tok);
    cout << "attrib " << name << '=' << val << endl;
    t = tok.next();
}
This parser is found in lib/examples/xml.cpp, and does a lot for something
that's only 125 lines!