?? wordtest.w
字號:
\datethis@* Introduction. This program is a simple filter that sorts and outputs alllines of input that do not appear in a given set of sorted files. It is called{\tt wordtest} because each line of input is considered to be a `word' andeach of the sorted files is considered to be a 'dictionary'. Words areoutput when they don't appear in any given dictionary.The character set and alphabetic order are flexible. Every 8-bitcharacter is mapped into an integer called its {\it ord}. A characteris called a {\it null\/} if its ord is zero; such characters arediscarded from the input. A character is called a {\it break\/} ifits ord is negative; such characters break the input into so-called words.Otherwise a character's ord is positive, and the character is called a{\it letter}. One letter precedes another in alphabetic order if and onlyif it has a smaller ord. Two letters are considered identical, forpurposes of sorting, if their ords are the same.The null character |'\n'| must have ord~0; thus, it must remain null.Otherwise the ord mapping is arbitrary. If the user doesn't specifyany special mapping, the default ord table simply maps every 8-bitcharacter code into itself, considering characters to be unsigned charvalues in the range 0--255, except that ASCII codes {\tt a-z} aremapped into the corresponding codes for {\tt A-Z}, and newline is abreak character. Optional command-line arguments, described below, canchange this default mapping to any other desired scheme.A word is any nonempty sequence of letters that is immediately precededand followed by break characters, when nulls are ignored. Technicallyspeaking, we pretend that a break character is present at the beginning of afile but not at the end; thus, all letters following the final break characterof a file are ignored, if any such letters are present. Two words are{\it equivalent\/} to each other if their letters have the same sequenceof ord values. If two or more words of the input are equivalent, onlythe first will be output, and it will be output only if it is notequivalent to any word in the given dictionary files. Words in eachdictionary are assumed to be in lexicographic order and to contain nonulls. Words in the output file will satisfy these conditions; therefore{\tt wordtest} can be used to generate and update the dictionaries it needs.Notice that if no dictionaries are given, {\tt wordtest} will act as asorting routine that simply discards nulls and duplicate lines.@ The \UNIX/ command line `{\tt wordtest} {\tt [options]} {\tt [dictionaries]}'is interpreted by executing option commands from left to right and then byregarding any remaining arguments as the names of dictionary files.Most of the option commands are designed to specify the |ord| table.Initially |ord[c]=c| for each unsigned char code~|c|. The command$$\line{\hskip5em\tt-b\it string\hfil}$$makes every character in the string a break character. If the string isempty, {\tt-b} makes every nonnull character a break (i.e., it sets|ord[c]=-1| for |1<=c<=255|). The command$$\line{\hskip5em\tt-n\it string\hfil}$$makes every character in the string a null character. If the string isempty, {\tt-n} makes every character null. The command$$\line{\hskip5em\tt-a\it string\hfil}$$sets the ord of the $k$th element of the string equal to $\delta+k$,where $\delta$ is an offset value (normally zero). The command$$\line{\hskip5em\tt-d\it offset\hfil}$$sets the value of $\delta$; the offset should be a decimal integer between0 and 255.There is also an option that has no effect on the |ord| table:$$\line{\hskip5em\tt-m\it length\hfil}$$defines the length of the longest word. If any word of a file hasmore than this many characters, a break is artificially insertedso that a word of this maximum length is obtained. The default value is 50.The maximum legal value is 1000.If the given options do not specify at least one break character,{\tt wordtest} applies the option commands$$\vbox{\line{\hskip5em\.{-b"\\}\hfil}\line{\.{" -d64 -a"abcdefghijklmnopqrstuvwxyz"}\hfil}}$$which generate the default mapping mentioned above (unless other ords werechanged).The program is designed to run fastest when there are at most twodictionary files (usually one large system dictionary and anotherpersonalized one), although it places no limit on the actual number ofdictionaries that can be mentioned on the command line. Users who wantto specify a multitude of dictionaries should ask themselves why theywouldn't prefer to merge their dictionaries together first (using{\tt wordtest}).@d MAX_LENGTH_DEFAULT 50@d MAX_LENGTH_LIMIT 1000@ The general organization of {\tt wordtest} is typical of applicationswritten in \CEE/, and its approach is quite simple. If any errors aredetected, an indication of the error is sent to the |stderr| file anda nonzero value is returned.@p#include <stdio.h>@#@<Typedefs@>@;int main(argc,argv) int argc; /* the number of command-line arguments */ char *argv[]; /* the arguments themselves */{ @<Local variables@>; @<Scan the command line arguments@>; @<Sort the input into memory@>; @<Output all input words that aren't in dictionaries@>; return 0;}@ @<Typedefs@>=typedef unsigned char byte; /* our bytes will range from 0 to 255 */@ @<Local variables@>=int targc; /* temporary modifications to |argc| */byte **targv; /* pointer to the current argument of interest */unsigned delta; /* the offset used in the \.{-a} and \.{-d} options */unsigned max_length=MAX_LENGTH_DEFAULT; /* longest allowable word */byte breakchar; /* break character to use in the output */int ord[256]; /* table of ord values */register int c; /* an all-purpose index */register byte *u,*v; /* pointer to current string characters */@ We try to use newline as the output break character, if possible.@<Scan the command line arguments@>=for (c=0;c<256;c++) ord[c]=c;delta=0;targc=argc-1;@+targv=(byte**)argv+1;while (targc && **targv=='-') { @<Execute the option command |targv|@>; targc--;@+targv++;}if (ord['\n']<0) breakchar='\n';else { breakchar='\0'; for (c=255;c;c--) if (ord[c]<0) breakchar=c; if (!breakchar) @<Set up the default ords@>;}@<Allocate data structures for a total of |targc| files@>;for (;targc;targc--,targv++) @<Open the dictionary file named |*targv|@>;@ @<Execute the option...@>=switch((*targv)[1]) {case 'a': for (c=delta,u=*targv+2;*u;u++) ord[*u]=++c;@+break;case 'b': if ((*targv)[2]) for (u=*targv+2;*u;u++) ord[*u]=-1; else for (c=1;c<256;c++) ord[c]=-1; break;case 'n': if ((*targv)[2]) for (u=*targv+2;*u;u++) ord[*u]=0; else for (c=1;c<256;c++) ord[c]=0; break;case 'd': if (sscanf((char*)*targv+2,"%u",&delta)==1 && delta<256) break; goto print_usage;case 'm': if (sscanf((char*)*targv+2,"%u",&max_length)==1 & max_length<=MAX_LENGTH_LIMIT) break; goto print_usage;default: print_usage: fprintf(stderr, "Usage: %s {-{{a|b|n}string|{d|m}number}}* dictionaryname*\n",*argv); return-1;}@ @<Set up the default ords@>={ ord['\n']=-1; /* newline is break character */ breakchar='\n'; for (c=1;c<=26;c++) ord['a'-1+c]='A'-1+c;}@*Treaps. The most interesting part of this program is its sorting algorithm,which is based on the ``treap'' data structure of Aragon and Seidel[{\sl 30th IEEE Symposium on Foundations of Computer Science\/} (1989),540--546].@^Aragon, Cecilia Rodriguez@>@^Seidel, Raimund@>A treap is a binary tree whose nodes have two key fields. The primarykey, which in our application is a word from the input, obeystree-search order: All descendants of the left child of node~$p$ havea primary key that is less than the primary key of~$p$, and all descendantsof its right child have a primary key that is greater. The secondary key,which in our application is a unique pseudorandom integer attached toeach input word, obeys heap order: The secondary key of~$p$'s childrenis greater than $p$'s own secondary key.A given set of nodes with distinct primary keys and distinct secondarykeys can be made into a treap in exactly one way. This unique treapcan be obtained, for example, by using ordinary tree insertion withrespect to primary keys while inserting nodes in order of theirsecondary keys. It follows that, if the secondary keys are random,the binary tree will almost always be quite well balanced.We will compute secondary keys as unsigned long integers, assigningthe key $(cn)\bmod 2^{32}$ to the $n$th node, where $c$ is an oddnumber. This will guarantee that the secondary keys are distinct.By choosing $c$ close to $2^{32}/\phi$, where $\phi$ is the goldenratio $(1+\sqrt5\,)/2$, we also spread the values out in a fashion thatis unlikely to match any existing order in the data.@d PHICLONE 2654435769 /* $\approx 2^{32}/\phi$ */@<Typedefs@>=typedef struct node_struct { struct node_struct *left,*right; /* children */ byte *keyword; /* primary key */ unsigned long rank; /* secondary key */} node; /* node of a treap */@ We want to be able to compare two strings rapidly with respect tolexicographic order, as defined by the |ord| table. This can be doneif one string is delimited by |'\0'| as usual, while the other isdelimited by a break character. Then we are sure to have an unequalcomparison, and the inner loop is fast.Here is a routine that checks to see if a word is already present in thetreap. The word is assumed to be in |buffer|, terminated by |breakchar|.The words in the treap are terminated by nulls. Thetreap is accessed by means of |root|, a pointer to its root node.@<Search for |buffer| in the treap; |goto found| if it's there@>={@+register node *p=root; while (p) { for (u=buffer,v=p->keyword;ord[*u]==ord[*v];u++,v++) ; if (*v=='\0' && *u==breakchar) goto found; if (ord[*u]<ord[*v]) p=p->left; else p=p->right; }}@ We don't need to insert nodes into the treap as often as we need tolook words up, so we don't mind repeating the comparisons already madewhen we discover that insertion is necessary. (Actually a more comprehensivestudy of this tradeoff ought to be done. But not today; I am tryinghere to keep the program short and sweet.)The insertion algorithm proceeds just as the lookup algorithm untilwe come to a node whose rank is larger than the rank of the nodeto be inserted. We insert the new node in its place, then split theold node and its descendants into two subtrees that will become theleft and right subtrees of the new node.@<Insert the |buffer| word into the treap@>={@+register node *p,**q,**qq,*r; current_rank += PHICLONE; /* unsigned addition mod $2^{32}$ */ p=root;@+q=&root; while (p) { if (p->rank>current_rank) break; /* end of the first phase */ for (u=buffer,v=p->keyword;ord[*u]==ord[*v];u++,v++) ; if (ord[*u]<ord[*v]) q=&(p->left), p=*q; else q=&(p->right), p=*q; } @<Set |r| to the address of a new node, and move |buffer| into it@>; r->rank=current_rank; *q=r; /* link the new node into the tree */ @<Split subtree |p| and attach it below node |r|@>;}@ @<Local...@>=unsigned long current_rank=0; /* pseudorandom number */@ At this point |p| may already be empty. If not, we can hook itsparts together easily. (A formal proof is a bit tricky, but the computerdoesn't slow down like people do when they get to a conceptually harderpart of an algorithm.)@<Split subtree |p| and attach it below node |r|@>=q=&(r->left);@+qq=&(r->right); /* slots to fill in as we split the subtree */while (p) { for (u=buffer,v=p->keyword;ord[*u]==ord[*v];u++,v++) ; if (ord[*u]<ord[*v]) { *qq=p; qq=&(p->left); p=*qq; } else { *q=p; q=&(p->right); p=*q;
?? 快捷鍵說明
復制代碼
Ctrl + C
搜索代碼
Ctrl + F
全屏模式
F11
切換主題
Ctrl + Shift + D
顯示快捷鍵
?
增大字號
Ctrl + =
減小字號
Ctrl + -