<html><head><title>Pattern Matching (Programming Perl)</title><!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css"><!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly & Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Pattern Matching"><meta name="DC.Type" content="Text.Monograph"></head><body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map><!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch04_09.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="part2.htm">Part 2: The Gory Details</a></td><td align="right" valign="top" width="172"><a href="ch05_02.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left"><!-- SECTION BODY --><h1 class="chapter">Chapter 5. 
Pattern Matching</h1><div class="htmltoc"><h4 class="tochead">Contents:</h4><p><a href="ch05_01.htm">The Regular Expression Bestiary</a><br><a href="ch05_02.htm">Pattern-Matching Operators</a><br><a href="ch05_03.htm">Metacharacters and Metasymbols</a><br><a href="ch05_04.htm">Character Classes</a><br><a href="ch05_05.htm">Quantifiers</a><br><a href="ch05_06.htm">Positions</a><br><a href="ch05_07.htm">Capturing and Clustering</a><br><a href="ch05_08.htm">Alternation</a><br><a href="ch05_09.htm">Staying in Control</a><br><a href="ch05_10.htm">Fancy Patterns</a><br></p></div><p><a name="INDEX-1251"></a><a name="INDEX-1252"></a><a name="INDEX-1253"></a><a name="INDEX-1254"></a><a name="INDEX-1255"></a><a name="INDEX-1256"></a>Perl's built-in support for pattern matching lets you search large amounts of data conveniently and efficiently. Whether you run a huge commercial portal site scanning every newsfeed in existence for interesting tidbits, or a government organization dedicated to figuring out human demographics (or the human genome), or an educational institution just trying to get some dynamic information up on your web site, Perl is the tool of choice, in part because of its database connections, but largely because of its pattern-matching capabilities. If you take "text" in the widest possible sense, perhaps 90% of what you do is 90% text processing. That's really what Perl is all about and always has been about--in fact, it's even part of Perl's name: Practical <em class="emphasis">Extraction</em> and Report Language. Perl's patterns provide a powerful way to scan through mountains of mere data and extract useful information from it.</p><p><a name="INDEX-1257"></a>You specify a pattern by creating a <em class="emphasis">regular expression</em> (or <em class="emphasis">regex</em>), and Perl's regular expression engine (the "Engine", for the rest of this chapter) then takes that expression and determines whether (and how) the pattern matches your data. 
While most of your data will probably be text strings, there's nothing stopping you from using regexes to search and replace any byte sequence, even what you'd normally think of as "binary" data. To Perl, bytes are just characters that happen to have an ordinal value less than 256. (More on that in <a href="ch15_01.htm">Chapter 15, "Unicode"</a>.)</p><p>If you're acquainted with regular expressions from some other venue, we should warn you that regular expressions are a bit different in Perl. First, they aren't entirely "regular" in the theoretical sense of the word, which means they can do much more than the traditional regular expressions taught in computer science classes. Second, they are used so often in Perl that they have their own special variables, operators, and quoting conventions which are tightly integrated into the language, not just loosely bolted on like any other library. Programmers new to Perl often look in vain for functions like these:<blockquote><pre class="programlisting">match( $string, $pattern );
subst( $string, $pattern, $replacement );</pre></blockquote><a name="INDEX-1258"></a><a name="INDEX-1259"></a><a name="INDEX-1260"></a><a name="INDEX-1261"></a><a name="INDEX-1262"></a>But matching and substituting are such fundamental tasks in Perl that they merit one-letter operators: <tt class="literal">m/</tt><em class="replaceable">PATTERN</em><tt class="literal">/</tt> and <tt class="literal">s/</tt><em class="replaceable">PATTERN</em><tt class="literal">/</tt><em class="replaceable">REPLACEMENT</em><tt class="literal">/</tt> (<tt class="literal">m//</tt> and <tt class="literal">s///</tt>, for short). Not only are they syntactically brief, but they're also parsed like double-quoted strings rather than ordinary operators; nevertheless, they operate like operators, so we'll call them that. Throughout this chapter, you'll see these operators used to match patterns against a string. If some portion of the string fits the pattern, we say that the match is successful. 
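As a minimal sketch of how the two operators look in practice, assuming a made-up sample string and variable names of our own choosing (the <tt class="literal">=~</tt> binding operator used here is not introduced until the next section):<blockquote><pre class="programlisting">use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $string = "One ring to rule them all";

# m// asks whether the pattern occurs anywhere in $string.
if ($string =~ m/ring/) {
    print "matched\n";
}

# s/// replaces the first matched portion with the REPLACEMENT
# and returns the number of substitutions it made.
my $count = ($string =~ s/ring/sword/);
print "$string\n";    # One sword to rule them all</pre></blockquote>Note that <tt class="literal">s///</tt> modifies <tt class="literal">$string</tt> in place, so the second <tt class="literal">print</tt> shows the altered text.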
There are lots of cool things you can do with a successful pattern match. In particular, if you are using <tt class="literal">s///</tt>, a successful match causes the matched portion of the string to be replaced with whatever you specified as the <em class="replaceable">REPLACEMENT</em>.</p><p>This chapter is all about how to build and use patterns. Perl's regular expressions are potent, packing a lot of meaning into a small space. They can therefore be daunting if you try to intuit the meaning of a long pattern as a whole. But if you can break it up into its parts, and if you know how the Engine interprets those parts, you can understand any regular expression. It's not unusual to see a hundred-line C or Java program expressed with a one-line regular expression in Perl. That regex may be a little harder to understand than any single line out of the longer program; on the other hand, the regex will likely be much easier to understand than the longer program taken as a whole. You just have to keep these things in perspective.</p><h2 class="sect1">5.1. The Regular Expression Bestiary</h2><a name="INDEX-1263"></a><a name="INDEX-1264"></a><a name="INDEX-1265"></a><p>Before we dive into the rules for interpreting regular expressions, let's see what some patterns look like. Most characters in a regular expression simply match themselves. If you string several characters in a row, they must match in order, just as you'd expect. So if you write the pattern match:<blockquote><pre class="programlisting">/Frodo/</pre></blockquote>you can be sure that the pattern won't match unless the string contains the substring "<tt class="literal">Frodo</tt>" somewhere. (A <em class="emphasis">substring</em> is just a part of a string.) 
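To make that concrete, here is a small sketch that filters a list with the pattern; the candidate strings and variable names are invented purely for illustration:<blockquote><pre class="programlisting">use strict;
use warnings;

# Made-up strings for illustration only.
my @candidates = (
    "Frodo lives",
    "Mr. Frodo Baggins",
    "frodo",     # wrong case: ordinary characters match only themselves
    "Fro do",    # the five characters are not next to each other
);

# Keep only the strings that contain the substring "Frodo".
my @hits = grep { /Frodo/ } @candidates;
print "$_\n" for @hits;    # prints the first two strings only</pre></blockquote>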
The match could be anywhere in the string, just as long as those five characters occur somewhere, next to each other and in that order.</p><p><a name="INDEX-1266"></a><a name="INDEX-1267"></a><a name="INDEX-1268"></a> Other characters don't match themselves, but "misbehave" in some way. We call these <em class="emphasis">metacharacters</em>. (All metacharacters are naughty in their own right, but some are so bad that they also cause other nearby characters to misbehave as well.)</p><p><a name="INDEX-1269"></a><a name="INDEX-1270"></a><a name="INDEX-1271"></a><a name="INDEX-1272"></a><a name="INDEX-1273"></a><a name="INDEX-1274"></a><a name="INDEX-1275"></a><a name="INDEX-1276"></a><a name="INDEX-1277"></a><a name="INDEX-1278"></a><a name="INDEX-1279"></a>Here are the miscreants:<blockquote><pre class="programlisting">\ | ( ) [ { ^ $ * + ? .</pre></blockquote>Metacharacters are actually very useful and have special meanings inside patterns; we'll tell you all those meanings as we go along. But we do want to reassure you that you can always match any of these twelve characters literally by putting a backslash in front of it. For example, backslash is itself a metacharacter, so to match a literal backslash, you'd backslash the backslash: <tt class="literal">\\</tt>.</p><p>You see, backslash is one of those characters that makes other characters misbehave. It just works out that when you make a misbehaving metacharacter misbehave, it ends up behaving--a double negative, as it were. 
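Here is a small sketch of that double negative, assuming some made-up sample strings: a backslashed metacharacter matches the literal character, while an unescaped one keeps its special meaning. (The built-in <tt class="literal">quotemeta</tt> function, which this section has not introduced, backslashes every non-word character of a string for you.)<blockquote><pre class="programlisting">use strict;
use warnings;

# Made-up sample data for illustration.
my $price = 'cost: $4.50 (approx)';

# Backslashed metacharacters match literally.
print "literal dollar and dot\n" if $price =~ /\$4\.50/;
print "literal parens\n"         if $price =~ /\(approx\)/;

# Unescaped, the dot matches (almost) any single character,
# so this pattern also matches a string with no dot in it.
print "bare dot matched x\n"    if     '4x50' =~ /4.50/;
print "escaped dot did not\n"   unless '4x50' =~ /4\.50/;

# A literal backslash needs a backslash of its own.
my $path = 'C:\temp';
print "found backslash\n" if $path =~ /\\/;

# quotemeta() escapes the dot for us: 4\.50
my $safe = quotemeta('4.50');
print "quotemeta works\n" if $price =~ /$safe/;</pre></blockquote>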
So backslashing a character to get it to be taken literally works, but only on punctuational characters; backslashing an (ordinarily well-behaved) alphanumeric character does the opposite: it turns the literal character into something special. Whenever you see such a two-character sequence:<blockquote><pre class="programlisting">\b \D \t \3 \s</pre></blockquote></p><p><a name="INDEX-1280"></a><a name="INDEX-1281"></a><a name="INDEX-1282"></a><a name="INDEX-1283"></a><a name="INDEX-1284"></a>you'll know that the sequence is a <em class="emphasis">metasymbol</em> that matches something strange. For instance, <tt class="literal">\b</tt> matches a word boundary, while <tt class="literal">\t</tt> matches an ordinary tab character. Notice that a tab is one character wide, while a word boundary is zero characters wide because it's the spot between two characters. So we call <tt class="literal">\b</tt> a <em class="emphasis">zero-width</em> assertion. Still, <tt class="literal">\t</tt> and <tt class="literal">\b</tt> are alike in that they both assert something about a particular spot in the string. Whenever you <em class="emphasis">assert</em> something in a regular expression, you're just claiming that that particular something has to be true in order for the pattern to match.</p><p>Most pieces of a regular expression are some sort of assertion, including the ordinary characters that simply assert that they match themselves. To be precise, they also assert that the <em class="emphasis">next</em> thing will match one character later in the string, which is why we talk about the tab character being "one character wide". Some assertions