Initial Commit
This commit is contained in:
549
database/perl/lib/pods/perlrequick.pod
Normal file
549
database/perl/lib/pods/perlrequick.pod
Normal file
@@ -0,0 +1,549 @@
|
||||
=head1 NAME
|
||||
|
||||
perlrequick - Perl regular expressions quick start
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
This page covers the very basics of understanding, creating and
|
||||
using regular expressions ('regexes') in Perl.
|
||||
|
||||
|
||||
=head1 The Guide
|
||||
|
||||
This page assumes you already know things, like what a "pattern" is, and
|
||||
the basic syntax of using them. If you don't, see L<perlretut>.
|
||||
|
||||
=head2 Simple word matching
|
||||
|
||||
The simplest regex is simply a word, or more generally, a string of
|
||||
characters. A regex consisting of a word matches any string that
|
||||
contains that word:
|
||||
|
||||
"Hello World" =~ /World/; # matches
|
||||
|
||||
In this statement, C<World> is a regex and the C<//> enclosing
|
||||
C</World/> tells Perl to search a string for a match. The operator
|
||||
C<=~> associates the string with the regex match and produces a true
|
||||
value if the regex matched, or false if the regex did not match. In
|
||||
our case, C<World> matches the second word in C<"Hello World">, so the
|
||||
expression is true. This idea has several variations.
|
||||
|
||||
Expressions like this are useful in conditionals:
|
||||
|
||||
print "It matches\n" if "Hello World" =~ /World/;
|
||||
|
||||
The sense of the match can be reversed by using C<!~> operator:
|
||||
|
||||
print "It doesn't match\n" if "Hello World" !~ /World/;
|
||||
|
||||
The literal string in the regex can be replaced by a variable:
|
||||
|
||||
$greeting = "World";
|
||||
print "It matches\n" if "Hello World" =~ /$greeting/;
|
||||
|
||||
If you're matching against C<$_>, the C<$_ =~> part can be omitted:
|
||||
|
||||
$_ = "Hello World";
|
||||
print "It matches\n" if /World/;
|
||||
|
||||
Finally, the C<//> default delimiters for a match can be changed to
|
||||
arbitrary delimiters by putting an C<'m'> out front:
|
||||
|
||||
"Hello World" =~ m!World!; # matches, delimited by '!'
|
||||
"Hello World" =~ m{World}; # matches, note the matching '{}'
|
||||
"/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
|
||||
# '/' becomes an ordinary char
|
||||
|
||||
Regexes must match a part of the string I<exactly> in order for the
|
||||
statement to be true:
|
||||
|
||||
"Hello World" =~ /world/; # doesn't match, case sensitive
|
||||
"Hello World" =~ /o W/; # matches, ' ' is an ordinary char
|
||||
"Hello World" =~ /World /; # doesn't match, no ' ' at end
|
||||
|
||||
Perl will always match at the earliest possible point in the string:
|
||||
|
||||
"Hello World" =~ /o/; # matches 'o' in 'Hello'
|
||||
"That hat is red" =~ /hat/; # matches 'hat' in 'That'
|
||||
|
||||
Not all characters can be used 'as is' in a match. Some characters,
|
||||
called B<metacharacters>, are considered special, and reserved for use
|
||||
in regex notation. The metacharacters are
|
||||
|
||||
{}[]()^$.|*+?\
|
||||
|
||||
A metacharacter can be matched literally by putting a backslash before
|
||||
it:
|
||||
|
||||
"2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter
|
||||
"2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary +
|
||||
'C:\WIN32' =~ /C:\\WIN/; # matches
|
||||
"/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches
|
||||
|
||||
In the last regex, the forward slash C<'/'> is also backslashed,
|
||||
because it is used to delimit the regex.
|
||||
|
||||
Most of the metacharacters aren't always special, and other characters
|
||||
(such as the ones delimitting the pattern) become special under various
|
||||
circumstances. This can be confusing and lead to unexpected results.
|
||||
L<S<C<use re 'strict'>>|re/'strict' mode> can notify you of potential
|
||||
pitfalls.
|
||||
|
||||
Non-printable ASCII characters are represented by B<escape sequences>.
|
||||
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
|
||||
for a carriage return. Arbitrary bytes are represented by octal
|
||||
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
|
||||
e.g., C<\x1B>:
|
||||
|
||||
"1000\t2000" =~ m(0\t2) # matches
|
||||
"cat" =~ /\143\x61\x74/ # matches in ASCII, but
|
||||
# a weird way to spell cat
|
||||
|
||||
Regexes are treated mostly as double-quoted strings, so variable
|
||||
substitution works:
|
||||
|
||||
$foo = 'house';
|
||||
'cathouse' =~ /cat$foo/; # matches
|
||||
'housecat' =~ /${foo}cat/; # matches
|
||||
|
||||
With all of the regexes above, if the regex matched anywhere in the
|
||||
string, it was considered a match. To specify I<where> it should
|
||||
match, we would use the B<anchor> metacharacters C<^> and C<$>. The
|
||||
anchor C<^> means match at the beginning of the string and the anchor
|
||||
C<$> means match at the end of the string, or before a newline at the
|
||||
end of the string. Some examples:
|
||||
|
||||
"housekeeper" =~ /keeper/; # matches
|
||||
"housekeeper" =~ /^keeper/; # doesn't match
|
||||
"housekeeper" =~ /keeper$/; # matches
|
||||
"housekeeper\n" =~ /keeper$/; # matches
|
||||
"housekeeper" =~ /^housekeeper$/; # matches
|
||||
|
||||
=head2 Using character classes
|
||||
|
||||
A B<character class> allows a set of possible characters, rather than
|
||||
just a single character, to match at a particular point in a regex.
|
||||
There are a number of different types of character classes, but usually
|
||||
when people use this term, they are referring to the type described in
|
||||
this section, which are technically called "Bracketed character
|
||||
classes", because they are denoted by brackets C<[...]>, with the set of
|
||||
characters to be possibly matched inside. But we'll drop the "bracketed"
|
||||
below to correspond with common usage. Here are some examples of
|
||||
(bracketed) character classes:
|
||||
|
||||
/cat/; # matches 'cat'
|
||||
/[bcr]at/; # matches 'bat', 'cat', or 'rat'
|
||||
"abc" =~ /[cab]/; # matches 'a'
|
||||
|
||||
In the last statement, even though C<'c'> is the first character in
|
||||
the class, the earliest point at which the regex can match is C<'a'>.
|
||||
|
||||
/[yY][eE][sS]/; # match 'yes' in a case-insensitive way
|
||||
# 'yes', 'Yes', 'YES', etc.
|
||||
/yes/i; # also match 'yes' in a case-insensitive way
|
||||
|
||||
The last example shows a match with an C<'i'> B<modifier>, which makes
|
||||
the match case-insensitive.
|
||||
|
||||
Character classes also have ordinary and special characters, but the
|
||||
sets of ordinary and special characters inside a character class are
|
||||
different than those outside a character class. The special
|
||||
characters for a character class are C<-]\^$> and are matched using an
|
||||
escape:
|
||||
|
||||
/[\]c]def/; # matches ']def' or 'cdef'
|
||||
$x = 'bcr';
|
||||
/[$x]at/; # matches 'bat, 'cat', or 'rat'
|
||||
/[\$x]at/; # matches '$at' or 'xat'
|
||||
/[\\$x]at/; # matches '\at', 'bat, 'cat', or 'rat'
|
||||
|
||||
The special character C<'-'> acts as a range operator within character
|
||||
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
|
||||
become the svelte C<[0-9]> and C<[a-z]>:
|
||||
|
||||
/item[0-9]/; # matches 'item0' or ... or 'item9'
|
||||
/[0-9a-fA-F]/; # matches a hexadecimal digit
|
||||
|
||||
If C<'-'> is the first or last character in a character class, it is
|
||||
treated as an ordinary character.
|
||||
|
||||
The special character C<^> in the first position of a character class
|
||||
denotes a B<negated character class>, which matches any character but
|
||||
those in the brackets. Both C<[...]> and C<[^...]> must match a
|
||||
character, or the match fails. Then
|
||||
|
||||
/[^a]at/; # doesn't match 'aat' or 'at', but matches
|
||||
# all other 'bat', 'cat, '0at', '%at', etc.
|
||||
/[^0-9]/; # matches a non-numeric character
|
||||
/[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary
|
||||
|
||||
Perl has several abbreviations for common character classes. (These
|
||||
definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
|
||||
Otherwise they could match many more non-ASCII Unicode characters as
|
||||
well. See L<perlrecharclass/Backslash sequences> for details.)
|
||||
|
||||
=over 4
|
||||
|
||||
=item *
|
||||
|
||||
\d is a digit and represents
|
||||
|
||||
[0-9]
|
||||
|
||||
=item *
|
||||
|
||||
\s is a whitespace character and represents
|
||||
|
||||
[\ \t\r\n\f]
|
||||
|
||||
=item *
|
||||
|
||||
\w is a word character (alphanumeric or _) and represents
|
||||
|
||||
[0-9a-zA-Z_]
|
||||
|
||||
=item *
|
||||
|
||||
\D is a negated \d; it represents any character but a digit
|
||||
|
||||
[^0-9]
|
||||
|
||||
=item *
|
||||
|
||||
\S is a negated \s; it represents any non-whitespace character
|
||||
|
||||
[^\s]
|
||||
|
||||
=item *
|
||||
|
||||
\W is a negated \w; it represents any non-word character
|
||||
|
||||
[^\w]
|
||||
|
||||
=item *
|
||||
|
||||
The period '.' matches any character but "\n"
|
||||
|
||||
=back
|
||||
|
||||
The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
|
||||
of character classes. Here are some in use:
|
||||
|
||||
/\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
|
||||
/[\d\s]/; # matches any digit or whitespace character
|
||||
/\w\W\w/; # matches a word char, followed by a
|
||||
# non-word char, followed by a word char
|
||||
/..rt/; # matches any two chars, followed by 'rt'
|
||||
/end\./; # matches 'end.'
|
||||
/end[.]/; # same thing, matches 'end.'
|
||||
|
||||
The S<B<word anchor> > C<\b> matches a boundary between a word
|
||||
character and a non-word character C<\w\W> or C<\W\w>:
|
||||
|
||||
$x = "Housecat catenates house and cat";
|
||||
$x =~ /\bcat/; # matches cat in 'catenates'
|
||||
$x =~ /cat\b/; # matches cat in 'housecat'
|
||||
$x =~ /\bcat\b/; # matches 'cat' at end of string
|
||||
|
||||
In the last example, the end of the string is considered a word
|
||||
boundary.
|
||||
|
||||
For natural language processing (so that, for example, apostrophes are
|
||||
included in words), use instead C<\b{wb}>
|
||||
|
||||
"don't" =~ / .+? \b{wb} /x; # matches the whole string
|
||||
|
||||
=head2 Matching this or that
|
||||
|
||||
We can match different character strings with the B<alternation>
|
||||
metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex
|
||||
C<dog|cat>. As before, Perl will try to match the regex at the
|
||||
earliest possible point in the string. At each character position,
|
||||
Perl will first try to match the first alternative, C<dog>. If
|
||||
C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
|
||||
If C<cat> doesn't match either, then the match fails and Perl moves to
|
||||
the next position in the string. Some examples:
|
||||
|
||||
"cats and dogs" =~ /cat|dog|bird/; # matches "cat"
|
||||
"cats and dogs" =~ /dog|cat|bird/; # matches "cat"
|
||||
|
||||
Even though C<dog> is the first alternative in the second regex,
|
||||
C<cat> is able to match earlier in the string.
|
||||
|
||||
"cats" =~ /c|ca|cat|cats/; # matches "c"
|
||||
"cats" =~ /cats|cat|ca|c/; # matches "cats"
|
||||
|
||||
At a given character position, the first alternative that allows the
|
||||
regex match to succeed will be the one that matches. Here, all the
|
||||
alternatives match at the first string position, so the first matches.
|
||||
|
||||
=head2 Grouping things and hierarchical matching
|
||||
|
||||
The B<grouping> metacharacters C<()> allow a part of a regex to be
|
||||
treated as a single unit. Parts of a regex are grouped by enclosing
|
||||
them in parentheses. The regex C<house(cat|keeper)> means match
|
||||
C<house> followed by either C<cat> or C<keeper>. Some more examples
|
||||
are
|
||||
|
||||
/(a|b)b/; # matches 'ab' or 'bb'
|
||||
/(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere
|
||||
|
||||
/house(cat|)/; # matches either 'housecat' or 'house'
|
||||
/house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or
|
||||
# 'house'. Note groups can be nested.
|
||||
|
||||
"20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d',
|
||||
# because '20\d\d' can't match
|
||||
|
||||
=head2 Extracting matches
|
||||
|
||||
The grouping metacharacters C<()> also allow the extraction of the
|
||||
parts of a string that matched. For each grouping, the part that
|
||||
matched inside goes into the special variables C<$1>, C<$2>, etc.
|
||||
They can be used just as ordinary variables:
|
||||
|
||||
# extract hours, minutes, seconds
|
||||
$time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format
|
||||
$hours = $1;
|
||||
$minutes = $2;
|
||||
$seconds = $3;
|
||||
|
||||
In list context, a match C</regex/> with groupings will return the
|
||||
list of matched values C<($1,$2,...)>. So we could rewrite it as
|
||||
|
||||
($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
|
||||
|
||||
If the groupings in a regex are nested, C<$1> gets the group with the
|
||||
leftmost opening parenthesis, C<$2> the next opening parenthesis,
|
||||
etc. For example, here is a complex regex and the matching variables
|
||||
indicated below it:
|
||||
|
||||
/(ab(cd|ef)((gi)|j))/;
|
||||
1 2 34
|
||||
|
||||
Associated with the matching variables C<$1>, C<$2>, ... are
|
||||
the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
|
||||
matching variables that can be used I<inside> a regex:
|
||||
|
||||
/(\w\w\w)\s\g1/; # find sequences like 'the the' in string
|
||||
|
||||
C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
|
||||
C<\g2>, ... only inside a regex.
|
||||
|
||||
=head2 Matching repetitions
|
||||
|
||||
The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
|
||||
to determine the number of repeats of a portion of a regex we
|
||||
consider to be a match. Quantifiers are put immediately after the
|
||||
character, character class, or grouping that we want to specify. They
|
||||
have the following meanings:
|
||||
|
||||
=over 4
|
||||
|
||||
=item *
|
||||
|
||||
C<a?> = match 'a' 1 or 0 times
|
||||
|
||||
=item *
|
||||
|
||||
C<a*> = match 'a' 0 or more times, i.e., any number of times
|
||||
|
||||
=item *
|
||||
|
||||
C<a+> = match 'a' 1 or more times, i.e., at least once
|
||||
|
||||
=item *
|
||||
|
||||
C<a{n,m}> = match at least C<n> times, but not more than C<m>
|
||||
times.
|
||||
|
||||
=item *
|
||||
|
||||
C<a{n,}> = match at least C<n> or more times
|
||||
|
||||
=item *
|
||||
|
||||
C<a{n}> = match exactly C<n> times
|
||||
|
||||
=back
|
||||
|
||||
Here are some examples:
|
||||
|
||||
/[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
|
||||
# any number of digits
|
||||
/(\w+)\s+\g1/; # match doubled words of arbitrary length
|
||||
$year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more
|
||||
# than 4 digits
|
||||
$year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates
|
||||
|
||||
These quantifiers will try to match as much of the string as possible,
|
||||
while still allowing the regex to match. So we have
|
||||
|
||||
$x = 'the cat in the hat';
|
||||
$x =~ /^(.*)(at)(.*)$/; # matches,
|
||||
# $1 = 'the cat in the h'
|
||||
# $2 = 'at'
|
||||
# $3 = '' (0 matches)
|
||||
|
||||
The first quantifier C<.*> grabs as much of the string as possible
|
||||
while still having the regex match. The second quantifier C<.*> has
|
||||
no string left to it, so it matches 0 times.
|
||||
|
||||
=head2 More matching
|
||||
|
||||
There are a few more things you might want to know about matching
|
||||
operators.
|
||||
The global modifier C</g> allows the matching operator to match
|
||||
within a string as many times as possible. In scalar context,
|
||||
successive matches against a string will have C</g> jump from match
|
||||
to match, keeping track of position in the string as it goes along.
|
||||
You can get or set the position with the C<pos()> function.
|
||||
For example,
|
||||
|
||||
$x = "cat dog house"; # 3 words
|
||||
while ($x =~ /(\w+)/g) {
|
||||
print "Word is $1, ends at position ", pos $x, "\n";
|
||||
}
|
||||
|
||||
prints
|
||||
|
||||
Word is cat, ends at position 3
|
||||
Word is dog, ends at position 7
|
||||
Word is house, ends at position 13
|
||||
|
||||
A failed match or changing the target string resets the position. If
|
||||
you don't want the position reset after failure to match, add the
|
||||
C</c>, as in C</regex/gc>.
|
||||
|
||||
In list context, C</g> returns a list of matched groupings, or if
|
||||
there are no groupings, a list of matches to the whole regex. So
|
||||
|
||||
@words = ($x =~ /(\w+)/g); # matches,
|
||||
# $word[0] = 'cat'
|
||||
# $word[1] = 'dog'
|
||||
# $word[2] = 'house'
|
||||
|
||||
=head2 Search and replace
|
||||
|
||||
Search and replace is performed using C<s/regex/replacement/modifiers>.
|
||||
The C<replacement> is a Perl double-quoted string that replaces in the
|
||||
string whatever is matched with the C<regex>. The operator C<=~> is
|
||||
also used here to associate a string with C<s///>. If matching
|
||||
against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
|
||||
C<s///> returns the number of substitutions made; otherwise it returns
|
||||
false. Here are a few examples:
|
||||
|
||||
$x = "Time to feed the cat!";
|
||||
$x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
|
||||
$y = "'quoted words'";
|
||||
$y =~ s/^'(.*)'$/$1/; # strip single quotes,
|
||||
# $y contains "quoted words"
|
||||
|
||||
With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
|
||||
are immediately available for use in the replacement expression. With
|
||||
the global modifier, C<s///g> will search and replace all occurrences
|
||||
of the regex in the string:
|
||||
|
||||
$x = "I batted 4 for 4";
|
||||
$x =~ s/4/four/; # $x contains "I batted four for 4"
|
||||
$x = "I batted 4 for 4";
|
||||
$x =~ s/4/four/g; # $x contains "I batted four for four"
|
||||
|
||||
The non-destructive modifier C<s///r> causes the result of the substitution
|
||||
to be returned instead of modifying C<$_> (or whatever variable the
|
||||
substitute was bound to with C<=~>):
|
||||
|
||||
$x = "I like dogs.";
|
||||
$y = $x =~ s/dogs/cats/r;
|
||||
print "$x $y\n"; # prints "I like dogs. I like cats."
|
||||
|
||||
$x = "Cats are great.";
|
||||
print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
|
||||
s/Frogs/Hedgehogs/r, "\n";
|
||||
# prints "Hedgehogs are great."
|
||||
|
||||
@foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
|
||||
# @foo is now qw(X X X 1 2 3)
|
||||
|
||||
The evaluation modifier C<s///e> wraps an C<eval{...}> around the
|
||||
replacement string and the evaluated result is substituted for the
|
||||
matched substring. Some examples:
|
||||
|
||||
# reverse all the words in a string
|
||||
$x = "the cat in the hat";
|
||||
$x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah"
|
||||
|
||||
# convert percentage to decimal
|
||||
$x = "A 39% hit rate";
|
||||
$x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"
|
||||
|
||||
The last example shows that C<s///> can use other delimiters, such as
|
||||
C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used
|
||||
C<s'''>, then the regex and replacement are treated as single-quoted
|
||||
strings.
|
||||
|
||||
=head2 The split operator
|
||||
|
||||
C<split /regex/, string> splits C<string> into a list of substrings
|
||||
and returns that list. The regex determines the character sequence
|
||||
that C<string> is split with respect to. For example, to split a
|
||||
string into words, use
|
||||
|
||||
$x = "Calvin and Hobbes";
|
||||
@word = split /\s+/, $x; # $word[0] = 'Calvin'
|
||||
# $word[1] = 'and'
|
||||
# $word[2] = 'Hobbes'
|
||||
|
||||
To extract a comma-delimited list of numbers, use
|
||||
|
||||
$x = "1.618,2.718, 3.142";
|
||||
@const = split /,\s*/, $x; # $const[0] = '1.618'
|
||||
# $const[1] = '2.718'
|
||||
# $const[2] = '3.142'
|
||||
|
||||
If the empty regex C<//> is used, the string is split into individual
|
||||
characters. If the regex has groupings, then the list produced contains
|
||||
the matched substrings from the groupings as well:
|
||||
|
||||
$x = "/usr/bin";
|
||||
@parts = split m!(/)!, $x; # $parts[0] = ''
|
||||
# $parts[1] = '/'
|
||||
# $parts[2] = 'usr'
|
||||
# $parts[3] = '/'
|
||||
# $parts[4] = 'bin'
|
||||
|
||||
Since the first character of $x matched the regex, C<split> prepended
|
||||
an empty initial element to the list.
|
||||
|
||||
=head2 C<use re 'strict'>
|
||||
|
||||
New in v5.22, this applies stricter rules than otherwise when compiling
|
||||
regular expression patterns. It can find things that, while legal, may
|
||||
not be what you intended.
|
||||
|
||||
See L<'strict' in re|re/'strict' mode>.
|
||||
|
||||
=head1 BUGS
|
||||
|
||||
None.
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
This is just a quick start guide. For a more in-depth tutorial on
|
||||
regexes, see L<perlretut> and for the reference page, see L<perlre>.
|
||||
|
||||
=head1 AUTHOR AND COPYRIGHT
|
||||
|
||||
Copyright (c) 2000 Mark Kvale
|
||||
All rights reserved.
|
||||
|
||||
This document may be distributed under the same terms as Perl itself.
|
||||
|
||||
=head2 Acknowledgments
|
||||
|
||||
The author would like to thank Mark-Jason Dominus, Tom Christiansen,
|
||||
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
|
||||
comments.
|
||||
|
||||
=cut
|
||||
|
||||
Reference in New Issue
Block a user