/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.regexp;

import java.io.Serializable;
import java.util.Vector;

/**
 * RE is an efficient, lightweight regular expression evaluator/matcher
 * class. Regular expressions are pattern descriptions which enable
 * sophisticated matching of strings. In addition to being able to
 * match a string against a pattern, you can also extract parts of the
 * match. This is especially useful in text parsing! Details on the
 * syntax of regular expression patterns are given below.
 *
 * <p>
 * To compile a regular expression (RE), you can simply construct an RE
 * matcher object from the string specification of the pattern, like this:
 *
 * <pre>
 *  RE r = new RE("a*b");
 * </pre>
 *
 * <p>
 * Once you have done this, you can call either of the RE.match methods to
 * perform matching on a String. For example:
 *
 * <pre>
 *  boolean matched = r.match("aaaab");
 * </pre>
 *
 * will cause the boolean matched to be set to true because the
 * pattern "a*b" matches the string "aaaab".
 *
 * <p>
 * If you were interested in the <i>number</i> of a's which matched the
 * first part of our example expression, you could change the expression to
 * "(a*)b".
 * Then when you compiled the expression and matched it against
 * something like "xaaaab", you would get results like this:
 *
 * <pre>
 *  RE r = new RE("(a*)b");                  // Compile expression
 *  boolean matched = r.match("xaaaab");     // Match against "xaaaab"
 *
 *  String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
 *  String insideParens = r.getParen(1);     // insideParens will be 'aaaa'
 *
 *  int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
 *  int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
 *  int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5
 *
 *  int startInside = r.getParenStart(1);    // startInside will be index 1
 *  int endInside = r.getParenEnd(1);        // endInside will be index 5
 *  int lenInside = r.getParenLength(1);     // lenInside will be 4
 * </pre>
 *
 * You can also refer to the contents of a parenthesized expression
 * within a regular expression itself. This is called a
 * 'backreference'. The first backreference in a regular expression is
 * denoted by \1, the second by \2 and so on. So the expression:
 *
 * <pre>
 *  ([0-9]+)=\1
 * </pre>
 *
 * will match any string of the form n=n (like 0=0 or 2=2).
 *
 * <p>
 * The full regular expression syntax accepted by RE is described here:
 *
 * <pre>
 *
 *  <b><font face=times roman>Characters</font></b>
 *
 *    <i>unicodeChar</i>   Matches any identical unicode character
 *    \                    Used to quote a meta-character (like '*')
 *    \\                   Matches a single '\' character
 *    \0nnn                Matches a given octal character
 *    \xhh                 Matches a given 8-bit hexadecimal character
 *    \\uhhhh              Matches a given 16-bit hexadecimal character
 *    \t                   Matches an ASCII tab character
 *    \n                   Matches an ASCII newline character
 *    \r                   Matches an ASCII return character
 *    \f                   Matches an ASCII form feed character
 *
 *
 *  <b><font face=times roman>Character Classes</font></b>
 *
 *    [abc]                Simple character class
 *    [a-zA-Z]             Character class with ranges
 *    [^abc]               Negated character class
 * </pre>
 *
 * <b>NOTE:</b> Incomplete ranges will be interpreted as "starts
 * from zero" or "ends with last character".
 * <br>
 * I.e. [-a] is the same as [\\u0000-a], and [a-] is the same as [a-\\uFFFF],
 * [-] means "all characters".
 *
 * <pre>
 *
 *  <b><font face=times roman>Standard POSIX Character Classes</font></b>
 *
 *    [:alnum:]            Alphanumeric characters.
 *    [:alpha:]            Alphabetic characters.
 *    [:blank:]            Space and tab characters.
 *    [:cntrl:]            Control characters.
 *    [:digit:]            Numeric characters.
 *    [:graph:]            Characters that are printable and are also visible.
 *                         (A space is printable, but not visible, while an
 *                         `a' is both.)
 *    [:lower:]            Lower-case alphabetic characters.
 *    [:print:]            Printable characters (characters that are not
 *                         control characters.)
 *    [:punct:]            Punctuation characters (characters that are not letters,
 *                         digits, control characters, or space characters).
 *    [:space:]            Space characters (such as space, tab, and formfeed,
 *                         to name a few).
 *    [:upper:]            Upper-case alphabetic characters.
 *    [:xdigit:]           Characters that are hexadecimal digits.
 *
 *
 *  <b><font face=times roman>Non-standard POSIX-style Character Classes</font></b>
 *
 *    [:javastart:]        Start of a Java identifier
 *    [:javapart:]         Part of a Java identifier
 *
 *
 *  <b><font face=times roman>Predefined Classes</font></b>
 *
 *    .                    Matches any character other than newline
 *    \w                   Matches a "word" character (alphanumeric plus "_")
 *    \W                   Matches a non-word character
 *    \s                   Matches a whitespace character
 *    \S                   Matches a non-whitespace character
 *    \d                   Matches a digit character
 *    \D                   Matches a non-digit character
 *
 *
 *  <b><font face=times roman>Boundary Matchers</font></b>
 *
 *    ^                    Matches only at the beginning of a line
 *    $                    Matches only at the end of a line
 *    \b                   Matches only at a word boundary
 *    \B                   Matches only at a non-word boundary
 *
 *
 *  <b><font face=times roman>Greedy Closures</font></b>
 *
 *    A*                   Matches A 0 or more times (greedy)
 *    A+                   Matches A 1 or more times (greedy)
 *    A?                   Matches A 1 or 0 times (greedy)
 *    A{n}                 Matches A exactly n times (greedy)
 *    A{n,}                Matches A at least n times (greedy)
 *    A{n,m}               Matches A at least n but not more than m times (greedy)
 *
 *
 *  <b><font face=times roman>Reluctant Closures</font></b>
 *
 *    A*?                  Matches A 0 or more times (reluctant)
 *    A+?                  Matches A 1 or more times (reluctant)
 *    A??                  Matches A 0 or 1 times (reluctant)
 *
 *
 *  <b><font face=times roman>Logical Operators</font></b>
 *
 *    AB                   Matches A followed by B
 *    A|B                  Matches either A or B
 *    (A)                  Used for subexpression grouping
 *    (?:A)                Used for subexpression clustering (just like grouping but
 *                         no backrefs)
 *
 *
 *  <b><font face=times roman>Backreferences</font></b>
 *
 *    \1                   Backreference to 1st parenthesized subexpression
 *    \2                   Backreference to 2nd parenthesized subexpression
 *    \3                   Backreference to 3rd parenthesized subexpression
 *    \4                   Backreference to 4th parenthesized subexpression
 *    \5                   Backreference to 5th parenthesized subexpression
 *    \6                   Backreference to 6th parenthesized subexpression
 *    \7                   Backreference to 7th parenthesized subexpression
 *    \8                   Backreference to 8th parenthesized subexpression
 *    \9                   Backreference to 9th parenthesized subexpression
 * </pre>
 *
 * <p>
 * All closure operators (+, *, ?, {m,n}) are greedy by default, meaning
 * that they match as many elements of the string as possible without
 * causing the overall match to fail. If you want a closure to be
 * reluctant (non-greedy), you can simply follow it with a '?'. A
 * reluctant closure will match as few elements of the string as
 * possible when finding matches. {m,n} closures don't currently
 * support reluctancy.
 *
 * <p>
 * <b><font face="times roman">Line terminators</font></b>
 * <br>
 * A line terminator is a one- or two-character sequence that marks
 * the end of a line of the input character sequence. The following
 * are recognized as line terminators:
 * <ul>
 * <li>A newline (line feed) character ('\n'),</li>
 * <li>A carriage-return character followed immediately by a newline character ("\r\n"),</li>
 * <li>A standalone carriage-return character ('\r'),</li>
 * <li>A next-line character ('\u0085'),</li>
 * <li>A line-separator character ('\u2028'), or</li>
 * <li>A paragraph-separator character ('\u2029').</li>
 * </ul>
 *
 * <p>
 * RE runs programs compiled by the RECompiler class.
 * But the RE matcher class does not include the actual regular expression
 * compiler for reasons of efficiency. In fact, if you want to pre-compile one
 * or more regular expressions, the 'recompile' class can be invoked
 * from the command line to produce compiled output like this:
 *
 * <pre>
 *    // Pre-compiled regular expression "a*b"
 *    char[] re1Instructions =
 *    {
 *        0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
 *        0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
 *        0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
 *        0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
 *        0x0000,
 *    };
 *
 *
 *    REProgram re1 = new REProgram(re1Instructions);
 * </pre>
 *
 * You can then construct a regular expression matcher (RE) object from
 * the pre-compiled expression re1 and thus avoid the overhead of
 * compiling the expression at runtime. If you require more dynamic
 * regular expressions, you can construct a single RECompiler object and
 * re-use it to compile each expression. Similarly, you can change the
 * program run by a given matcher object at any time. However, RE and
 * RECompiler are not threadsafe (for efficiency reasons, and because
 * requiring thread safety in this class is deemed to be a rare
 * requirement), so you will need to construct a separate compiler or
 * matcher object for each thread (unless you do thread synchronization
 * yourself). Once an expression is compiled into an REProgram object, that
 * REProgram can be safely shared across multiple threads and RE objects.
 *
 * <br><p><br>
 *
 * <font color="red">
 * <i>ISSUES:</i>
 *
 * <ul>
 *  <li>com.weusours.util.re is not currently compatible with all
 *      standard POSIX regcomp flags</li>
 *  <li>com.weusours.util.re does not support POSIX equivalence classes
 *      ([=foo=] syntax) (I18N/locale issue)</li>
 *  <li>com.weusours.util.re does not support nested POSIX character
 *      classes (definitely should, but not completely trivial)</li>
 *  <li>com.weusours.util.re does not support POSIX character collation
 *      concepts ([.foo.] syntax) (I18N/locale issue)</li>
 *  <li>Should there be different matching styles (simple, POSIX, Perl etc?)</li>
 *  <li>Should RE support character iterators (for backwards RE matching!)?</li>
 *  <li>Should RE support reluctant {m,n} closures (does anyone care)?</li>
 *  <li>Not *all* possibilities are considered for greediness when backreferences
 *      are involved (as POSIX suggests should be the case). The POSIX RE
 *      "(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match
 *      of acdacaa where \1 is "a". This is not the case in this RE package,
 *      and actually Perl doesn't go to this extent either! Until someone
 *      actually complains about this, I'm not sure it's worth "fixing".
 *      If it ever is fixed, test #137 in RETest.txt should be updated.</li>
 * </ul>
 *
 * </font>
 *
 * @see recompile
 * @see RECompiler
 *
 * @author <a href="mailto:jonl@muppetlabs.com">Jonathan Locke</a>
 * @author <a href="mailto:ts@sch-fer.de">Tobias Schäfer</a>
 * @version $Id: RE.java 518156 2007-03-14 14:31:26Z vgritsenko $
 */
public class RE implements Serializable
{
    /**
     * Specifies normal, case-sensitive matching behaviour.
     */
    public static final int MATCH_NORMAL          = 0x0000;

    /**
     * Flag to indicate that matching should be case-independent (folded)
     */
    public static final int MATCH_CASEINDEPENDENT = 0x0001;

    /**
     * Newlines should match as BOL/EOL (^ and $)
     */
    public static final int MATCH_MULTILINE       = 0x0002;

    /**
     * Consider all input a single body of text - newlines are matched by .
     */
    public static final int MATCH_SINGLELINE      = 0x0004;

    /************************************************
     *                                              *
     * The format of a node in a program is:        *
     *                                              *
     * [ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ] *
     *                                              *
     * char OPCODE - instruction                    *
     * char OPDATA - modifying data                 *
     * char OPNEXT - next node (relative offset)    *
     *                                              *
     ************************************************/

                 //   Opcode              Char       Opdata/Operand  Meaning
                 //   ----------          ---------- --------------- --------------------------------------------------
    static final char OP_END              = 'E';  //                 end of program
    static final char OP_BOL              = '^';  //                 match only if at beginning of line
    static final char OP_EOL              = '$';  //                 match only if at end of line
    static final char OP_ANY              = '.';  //                 match any single character except newline
    static final char OP_ANYOF            = '[';  // count/ranges    match any char in the list of ranges
    static final char OP_BRANCH           = '|';  // node            match this alternative or the next one
    static final char OP_ATOM             = 'A';  // length/string   length of string followed by string itself
    static final char OP_STAR             = '*';  // node            kleene closure
    static final char OP_PLUS             = '+';  // node            positive closure
    static final char OP_MAYBE            = '?';  // node            optional closure
    static final char OP_ESCAPE           = '\\'; // escape          special escape code char class (escape is E_* code)
    static final char OP_OPEN             = '(';  // number          nth opening paren
    static final char OP_OPEN_CLUSTER     = '<';  //                 opening cluster
    static final char OP_CLOSE            = ')';  // number          nth closing paren
    static final char OP_CLOSE_CLUSTER    = '>';  //                 closing cluster
    static final char OP_BACKREF          = '#';  // number          reference nth already matched parenthesized string
    static final char OP_GOTO             = 'G';  //                 nothing but a (back-)pointer
    static final char OP_NOTHING          = 'N';  //                 match null string such as in '(a|)'
    static final char OP_CONTINUE         = 'C';  //                 continue to the following command (ignore next)