=head1 NAME

perlrebackslash - Perl Regular Expression Backslash Sequences and Escapes

=head1 DESCRIPTION

The top-level documentation about Perl regular expressions
is found in L<perlre>.

This document describes all backslash and escape sequences. After
explaining the role of the backslash, it lists all the sequences that have
a special meaning in Perl regular expressions (in alphabetical order),
then describes each of them.

Most sequences are described in detail in different documents; the primary
purpose of this document is to have a quick reference guide describing all
backslash and escape sequences.

=head2 The backslash

In a regular expression, the backslash can perform one of two tasks:
it either takes away the special meaning of the character following it
(for instance, C<\|> matches a vertical bar; it's not an alternation),
or it is the start of a backslash or escape sequence.

The rules determining what it is are quite simple: if the character
following the backslash is an ASCII punctuation (non-word) character (that is,
anything that is not a letter, digit, or underscore), then the backslash just
takes away any special meaning of the character following it.

If the character following the backslash is an ASCII letter or an ASCII digit,
then the sequence may be special; if so, it's listed below. A few letters have
not been used yet, so escaping them with a backslash doesn't change them to be
special. A future version of Perl may assign a special meaning to them, so if
you have warnings turned on, Perl issues a warning if you use such a
sequence [1].

It is however guaranteed that backslash or escape sequences never have a
punctuation character following the backslash, not now, and not in a future
version of Perl 5. So it is safe to put a backslash in front of a non-word
character.

Note that the backslash itself is special; if you want to match a backslash,
you have to escape the backslash with a backslash: C</\\/> matches a single
backslash.

=over 4

=item [1]

There is one exception. If you use an alphanumeric character as the
delimiter of your pattern (which you probably shouldn't do for readability
reasons), you have to escape the delimiter if you want to match
it. Perl won't warn then. See also L<perlop/Gory details of parsing
quoted constructs>.

=back
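
The rules for the backslash can be tried out with a short sketch. The
sample string and the variable names are assumptions made only for this
illustration; they do not come from any real program.

```perl
use strict;
use warnings;

# Hypothetical sample string, used only for illustration.
# Single quotes keep the backslash as a literal character.
my $path = 'C:\temp|backup';

# A backslash before ASCII punctuation removes any special meaning.
my $has_bar       = $path =~ /\|/ ? 1 : 0;   # literal vertical bar
my $has_backslash = $path =~ /\\/ ? 1 : 0;   # literal backslash

# Escaping punctuation that is not a metacharacter is harmless.
my $has_colon     = $path =~ /\:/ ? 1 : 0;

print "bar=$has_bar backslash=$has_backslash colon=$has_colon\n";
```

All three tests succeed, because each backslash is followed by a
punctuation character and so only removes special meaning.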

=head2 All the sequences and escapes

Those not usable within a bracketed character class (like C<[\da-z]>) are marked
as C<Not in [].>

 \000              Octal escape sequence.  See also \o{}.
 \1                Absolute backreference.  Not in [].
 \a                Alarm or bell.
 \A                Beginning of string.  Not in [].
 \b{}, \b          Boundary.  (\b is a backspace in []).
 \B{}, \B          Not a boundary.  Not in [].
 \cX               Control-X.
 \d                Match any digit character.
 \D                Match any character that isn't a digit.
 \e                Escape character.
 \E                Turn off \Q, \L and \U processing.  Not in [].
 \f                Form feed.
 \F                Foldcase till \E.  Not in [].
 \g{}, \g1         Named, absolute or relative backreference.
                   Not in [].
 \G                Pos assertion.  Not in [].
 \h                Match any horizontal whitespace character.
 \H                Match any character that isn't horizontal whitespace.
 \k{}, \k<>, \k''  Named backreference.  Not in [].
 \K                Keep the stuff left of \K.  Not in [].
 \l                Lowercase next character.  Not in [].
 \L                Lowercase till \E.  Not in [].
 \n                (Logical) newline character.
 \N                Match any character but newline.  Not in [].
 \N{}              Named or numbered (Unicode) character or sequence.
 \o{}              Octal escape sequence.
 \p{}, \pP         Match any character with the given Unicode property.
 \P{}, \PP         Match any character without the given property.
 \Q                Quote (disable) pattern metacharacters till \E.  Not
                   in [].
 \r                Return character.
 \R                Generic new line.  Not in [].
 \s                Match any whitespace character.
 \S                Match any character that isn't a whitespace.
 \t                Tab character.
 \u                Titlecase next character.  Not in [].
 \U                Uppercase till \E.  Not in [].
 \v                Match any vertical whitespace character.
 \V                Match any character that isn't vertical whitespace.
 \w                Match any word character.
 \W                Match any character that isn't a word character.
 \x{}, \x00        Hexadecimal escape sequence.
 \X                Unicode "extended grapheme cluster".  Not in [].
 \z                End of string.  Not in [].
 \Z                End of string (or just before a trailing newline).
                   Not in [].

=head2 Character Escapes

=head3 Fixed characters

A handful of characters have a dedicated I<character escape>. The following
table shows them, along with their ASCII code points (in decimal and hex),
their ASCII name, the control escape on ASCII platforms and a short
description. (For EBCDIC platforms, see L<perlebcdic/OPERATOR DIFFERENCES>.)

 Seq.  Code Point  ASCII   Cntrl   Description.
       Dec    Hex
  \a     7     07    BEL    \cG    alarm or bell
  \b     8     08     BS    \cH    backspace [1]
  \e    27     1B    ESC    \c[    escape character
  \f    12     0C     FF    \cL    form feed
  \n    10     0A     LF    \cJ    line feed [2]
  \r    13     0D     CR    \cM    carriage return
  \t     9     09    TAB    \cI    tab

=over 4

=item [1]

C<\b> is the backspace character only inside a character class. Outside a
character class, C<\b> alone is a word-character/non-word-character
boundary, and C<\b{}> is some other type of boundary.

=item [2]

C<\n> matches a logical newline. Perl converts between C<\n> and your
OS's native newline character when reading from or writing to text files.

=back

=head4 Example

 $str =~ /\t/;   # Matches if $str contains a (horizontal) tab.

=head3 Control characters

C<\c> is used to denote a control character; the character following C<\c>
determines the value of the construct. For example the value of C<\cA> is
C<chr(1)>, and the value of C<\cb> is C<chr(2)>, etc.
The gory details are in L<perlop/"Regexp Quote-Like Operators">. A complete
list of what C<chr(1)>, etc. means for ASCII and EBCDIC platforms is in
L<perlebcdic/OPERATOR DIFFERENCES>.

Note that C<\c\> alone at the end of a regular expression (or double-quoted
string) is not valid. The backslash must be followed by another character.
That is, C<\c\I<X>> means C<chr(28) . 'I<X>'> for all characters I<X>.

To write platform-independent code, you must use C<\N{I<NAME>}> instead, like
C<\N{ESCAPE}> or C<\N{U+001B}>; see L<charnames>.

Mnemonic: I<c>ontrol character.

=head4 Example

 $str =~ /\cK/;  # Matches if $str contains a vertical tab (control-K).
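
The relationship between the fixed escapes, the C<\c> escapes and the
portable C<\N{U+...}> form can be checked with a small sketch. It assumes
an ASCII platform (the C<\c> values differ on EBCDIC), and the variable
names are illustrative only.

```perl
use strict;
use warnings;

# Assumes an ASCII platform; on EBCDIC the \c values differ.
my $ctrl_a_is_chr1 = "\cA" eq chr(1) ? 1 : 0;
my $case_ignored   = "\cb" eq chr(2) ? 1 : 0;   # \cb is the same as \cB
my $ctrl_i_is_tab  = "\cI" eq "\t"   ? 1 : 0;

# Portable spelling of the escape character, by Unicode code point.
my $portable_esc   = "\N{U+001B}" eq "\e" ? 1 : 0;

# The same escapes work inside a pattern.
my $matches_vtab   = chr(11) =~ /\cK/ ? 1 : 0;

print "$ctrl_a_is_chr1 $case_ignored $ctrl_i_is_tab ",
      "$portable_esc $matches_vtab\n";
```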

=head3 Named or numbered characters and character sequences

Unicode characters have a Unicode name and numeric code point (ordinal)
value. Use the
C<\N{}> construct to specify a character by either of these values.
Certain sequences of characters also have names.

To specify by name, the name of the character or character sequence goes
between the curly braces.

To specify a character by Unicode code point, use the form C<\N{U+I<code
point>}>, where I<code point> is a number in hexadecimal that gives the
code point that Unicode has assigned to the desired character. It is
customary but not required to use leading zeros to pad the number to 4
digits. Thus C<\N{U+0041}> means C<LATIN CAPITAL LETTER A>, and you will
rarely see it written without the two leading zeros. C<\N{U+0041}> means
"A" even on EBCDIC machines (where the ordinal value of "A" is not 0x41).

It is even possible to give your own names to characters and character
sequences by using the L<charnames> module. These custom names are
lexically scoped, and so a given code point may have different names
in different scopes. The name used is what is in effect at the time the
C<\N{}> is expanded. For patterns in double-quotish context, that means
at the time the pattern is parsed. But for patterns that are delimited
by single quotes, the expansion is deferred until pattern compilation
time, which may very well have a different C<charnames> translator in
effect.

(There is an expanded internal form that you may see in debug output:
C<\N{U+I<code point>.I<code point>...}>.
The C<...> means any number of these I<code point>s separated by dots.
This represents the sequence formed by the characters. This is an internal
form only, subject to change, and you should not try to use it yourself.)

Mnemonic: I<N>amed character.

Note that a character or character sequence expressed as a named
or numbered character is considered a character without special
meaning by the regex engine, and will match "as is".

=head4 Example

 $str =~ /\N{THAI CHARACTER SO SO}/; # Matches the Thai SO SO character

 use charnames 'Cyrillic';           # Loads Cyrillic names.
 $str =~ /\N{ZHE}\N{KA}/;            # Match "ZHE" followed by "KA".
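
As a further sketch, the name form and the code point form can be compared
directly, and the "as is" rule can be observed with a character that would
otherwise be a quantifier. The variable names are illustrative, and the
explicit C<use charnames ':full'> is an assumption made so the example also
runs on older Perls that do not load the names automatically.

```perl
use strict;
use warnings;
use charnames ':full';   # explicit load; newer Perls do this on demand

# The same character, specified by name and by code point.
my $by_name  = "A" =~ /\A\N{LATIN CAPITAL LETTER A}\z/ ? 1 : 0;
my $by_code  = "A" =~ /\A\N{U+0041}\z/ ? 1 : 0;
my $unpadded = "A" =~ /\A\N{U+41}\z/   ? 1 : 0;   # leading zeros optional

# A named character matches "as is": the plus sign is not a quantifier.
my $literal_plus = "a+" =~ /\Aa\N{PLUS SIGN}\z/ ? 1 : 0;
my $not_repeat   = "aa" =~ /\Aa\N{PLUS SIGN}\z/ ? 1 : 0;

print "$by_name $by_code $unpadded $literal_plus $not_repeat\n";
```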

=head3 Octal escapes

There are two forms of octal escapes. Each is used to specify a character by
its code point specified in octal notation.

One form, available starting in Perl 5.14, looks like C<\o{...}>, where the dots
represent one or more octal digits. It can be used for any Unicode character.

It was introduced to avoid the potential problems with the other form,
available in all Perls. That form consists of a backslash followed by three
octal digits. One problem with this form is that it can look exactly like an
old-style backreference (see
L</Disambiguation rules between old-style octal escapes and backreferences>
below). You can avoid this by making the first of the three digits always a
zero, but that makes \077 the largest code point specifiable.

In some contexts, a backslash followed by two or even one octal digits may be
interpreted as an octal escape, sometimes with a warning, and because of some
bugs, sometimes with surprising results. Also, if you are creating a regex
out of smaller snippets concatenated together, and you use fewer than three
digits, the beginning of one snippet may be interpreted as adding digits to the
ending of the snippet before it. See L</Absolute referencing> for more
discussion and examples of the snippet problem.

Note that a character expressed as an octal escape is considered
a character without special meaning by the regex engine, and will match
"as is".

To summarize, the C<\o{}> form is always safe to use, and the other form is
safe to use for code points through \077 when you use exactly three digits to
specify them.

Mnemonic: I<0>ctal or I<o>ctal.

=head4 Examples (assuming an ASCII platform)

 $str = "Perl";
 $str =~ /\o{120}/;  # Match, "\120" is "P".
 $str =~ /\120/;     # Same.
 $str =~ /\o{120}+/; # Match, "\120" is "P",
                     # it's repeated at least once.
 $str =~ /\120+/;    # Same.
 $str =~ /P\053/;    # No match, "\053" is "+" and taken literally.
 /\o{23073}/         # Black foreground, white background smiling face.
 /\o{4801234567}/    # Raises a warning, and yields chr(4).

=head4 Disambiguation rules between old-style octal escapes and backreferences

Octal escapes of the C<\000> form outside of bracketed character classes
potentially clash with old-style backreferences (see L</Absolute referencing>
below). They both consist of a backslash followed by numbers. So Perl has to
use heuristics to determine whether it is a backreference or an octal escape.
Perl uses the following rules to disambiguate:

=over 4

=item 1

If the backslash is followed by a single digit, it's a backreference.

=item 2

If the first digit following the backslash is a 0, it's an octal escape.

=item 3

If the number following the backslash is N (in decimal), and Perl already
has seen N capture groups, Perl considers this a backreference. Otherwise,
it considers it an octal escape. If N has more than three digits, Perl
takes only the first three for the octal escape; the rest are matched as is.

 my $pat  = "(" x 999;
    $pat .= "a";
    $pat .= ")" x 999;
 /^($pat)\1000$/;   #  Matches 'aa'; there are 1000 capture groups.
 /^$pat\1000$/;     #  Matches 'a@0'; there are 999 capture groups
                    #  and \1000 is seen as \100 (a '@') and a '0'.

=back

You can always force a backreference interpretation by using the C<\g{...}>
form. You can always force an octal interpretation by using the C<\o{...}>
form, or for numbers up through \077 (= 63 decimal), by using three digits,
beginning with a "0".
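
The disambiguation rules, and the two unambiguous spellings, can be
sketched as follows. The test strings and variable names are assumptions
chosen for illustration; the C<\o{...}> form needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $text = "ab\x08";   # 'a', 'b', then a backspace (code point 8)

# Only two groups have been seen, so \10 cannot be group 10:
# it is taken as octal 10, that is, chr(8).
my $as_octal   = $text =~ /\A(a)(b)\10\z/ ? 1 : 0;

# \o{...} is always an octal escape.
my $braced     = $text =~ /\Aab\o{10}\z/ ? 1 : 0;

# \g{...} is always a backreference; the digit after it is literal.
my $braced_ref = "aa0" =~ /\A(a)\g{1}0\z/ ? 1 : 0;

print "$as_octal $braced $braced_ref\n";
```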

=head3 Hexadecimal escapes

Like octal escapes, there are two forms of hexadecimal escapes, but both start
with the sequence C<\x>. This is followed by either exactly two hexadecimal
digits forming a number, or a hexadecimal number of arbitrary length surrounded
by curly braces. The hexadecimal number is the code point of the character you
want to express.

Note that a character expressed as one of these escapes is considered a
character without special meaning by the regex engine, and will match
"as is".

Mnemonic: heI<x>adecimal.

=head4 Examples (assuming an ASCII platform)

 $str = "Perl";
 $str =~ /\x50/;    # Match, "\x50" is "P".
 $str =~ /\x50+/;   # Match, "\x50" is "P", it is repeated at least once
 $str =~ /P\x2B/;   # No match, "\x2B" is "+" and taken literally.

 /\x{2603}\x{2602}/ # Snowman with an umbrella.
                    # The Unicode character 2603 is a snowman,
                    # the Unicode character 2602 is an umbrella.
 /\x{263B}/         # Black smiling face.
 /\x{263b}/         # Same, the hex digits A - F are case insensitive.
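
The two hexadecimal forms can be compared in a short sketch, assuming an
ASCII platform as the examples above do. The variable names are
illustrative only.

```perl
use strict;
use warnings;

# Two-digit form and braced form name the same code point.
my $two_digit = "Perl" =~ /\x50erl/   ? 1 : 0;
my $braced    = "Perl" =~ /\x{50}erl/ ? 1 : 0;

# An escaped "+" (hex 2B) is a literal plus sign, not a quantifier.
my $literal_plus  = "P+" =~ /\AP\x2B\z/ ? 1 : 0;
my $no_quantifier = "PP" =~ /\AP\x2B\z/ ? 1 : 0;

# Hex digits are case insensitive.
my $case_insensitive = "\x{263B}" =~ /\A\x{263b}\z/ ? 1 : 0;

print "$two_digit $braced $literal_plus $no_quantifier ",
      "$case_insensitive\n";
```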

=head2 Modifiers

A number of backslash sequences have to do with changing the character
or characters following them. C<\l> will lowercase the character following
it, while C<\u> will uppercase (or, more accurately, titlecase) the
character following it. They provide functionality similar to the
functions C<lcfirst> and C<ucfirst>.

To uppercase or lowercase several characters, one might want to use
C<\L> or C<\U>, which will lowercase/uppercase all characters following
them, until either the end of the pattern or the next occurrence of
C<\E>, whichever comes first. They provide functionality similar to what
the functions C<lc> and C<uc> provide.

C<\Q> is used to quote (disable) pattern metacharacters, up to the next
C<\E> or the end of the pattern. C<\Q> adds a backslash to any character
that could have special meaning to Perl. In the ASCII range, it quotes
every character that isn't a letter, digit, or underscore. See
L<perlfunc/quotemeta> for details on what gets quoted for non-ASCII
code points. Using this ensures that any character between C<\Q> and
C<\E> will be matched literally, not interpreted as a metacharacter by
the regex engine.

C<\F> can be used to casefold all characters following, up to the next C<\E>
or the end of the pattern. It provides functionality similar to
the C<fc> function.

Mnemonic: I<L>owercase, I<U>ppercase, I<F>old-case, I<Q>uotemeta, I<E>nd.

=head4 Examples

 $sid     = "sid";
 $greg    = "GrEg";
 $miranda = "(Miranda)";
 $str     =~ /\u$sid/;        # Matches 'Sid'
 $str     =~ /\L$greg/;       # Matches 'greg'
 $str     =~ /\Q$miranda\E/;  # Matches '(Miranda)', as if the pattern
                              #   had been written as /\(Miranda\)/
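
A runnable variant of the examples above, as a sketch, makes the effect of
C<\Q> visible by comparing it with the unquoted pattern. The variables
repeat the ones used above; the result names are illustrative.

```perl
use strict;
use warnings;

my $sid     = "sid";
my $greg    = "GrEg";
my $miranda = "(Miranda)";

my $titlecased = "Sid"  =~ /\A\u$sid\z/     ? 1 : 0;
my $lowercased = "greg" =~ /\A\L$greg\E\z/  ? 1 : 0;

# With \Q the parentheses are literal; without it they form a group,
# so the pattern would only match the bare word "Miranda".
my $quoted     = "(Miranda)" =~ /\A\Q$miranda\E\z/ ? 1 : 0;
my $unquoted   = "(Miranda)" =~ /\A$miranda\z/     ? 1 : 0;

# \Q ... \E in a string gives the same result as quotemeta().
my $same_as_quotemeta = "\Q$miranda\E" eq quotemeta($miranda) ? 1 : 0;

print "$titlecased $lowercased $quoted $unquoted $same_as_quotemeta\n";
```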

=head2 Character classes

Perl regular expressions have a large range of character classes. Some of
the character classes are written as a backslash sequence. We will briefly
discuss those here; full details of character classes can be found in
L<perlrecharclass>.

C<\w> is a character class that matches any single I<word> character
(letters, digits, Unicode marks, and connector punctuation (like the
underscore)). C<\d> is a character class that matches any decimal
digit, while the character class C<\s> matches any whitespace character.
New in Perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
and vertical whitespace characters.

The exact set of characters matched by C<\d>, C<\s>, and C<\w> varies
depending on various pragmas and regular expression modifiers. It is
possible to restrict the match to the ASCII range by using the C</a>
regular expression modifier. See L<perlrecharclass>.

The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
character classes that match, respectively, any character that isn't a
word character, digit, whitespace, horizontal whitespace, or vertical
whitespace.

Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.
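
The effect of the C</a> modifier, and the split between horizontal and
vertical whitespace, can be sketched like this. The sample character
(ARABIC-INDIC DIGIT THREE) and the variable names are assumptions chosen
for illustration; C</a> needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $arabic_digit = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

# Under Unicode rules \d matches any decimal digit; /a restricts
# it to the ASCII digits 0-9.
my $unicode_digit = $arabic_digit =~ /\A\d\z/  ? 1 : 0;
my $ascii_only    = $arabic_digit =~ /\A\d\z/a ? 1 : 0;

# \h and \v split whitespace into horizontal and vertical.
my $tab_is_horizontal   = "\t" =~ /\A\h\z/ ? 1 : 0;
my $newline_is_vertical = "\n" =~ /\A\v\z/ ? 1 : 0;
my $newline_not_horiz   = "\n" =~ /\A\H\z/ ? 1 : 0;

print "$unicode_digit $ascii_only $tab_is_horizontal ",
      "$newline_is_vertical $newline_not_horiz\n";
```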

=head3 Unicode classes

C<\pP> (where C<P> is a single letter) and C<\p{Property}> are used to
match a character that matches the given Unicode property; properties
include things like "letter", or "thai character". Capitalizing the
sequence to C<\PP> and C<\P{Property}> makes the sequence match a character
that doesn't match the given Unicode property. For more details, see
L<perlrecharclass/Backslash sequences> and
L<perlunicode/Unicode Character Properties>.

Mnemonic: I<p>roperty.
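
As a brief sketch, the single-letter and braced forms, and their negations,
behave as follows. The sample character and the choice of the C<L> (letter)
and C<Thai> (script) properties are assumptions made for illustration.

```perl
use strict;
use warnings;

my $so_so = "\x{0E0B}";   # THAI CHARACTER SO SO

# Single-letter and braced forms of the "letter" property.
my $letter_short  = $so_so =~ /\A\pL\z/   ? 1 : 0;
my $letter_braced = $so_so =~ /\A\p{L}\z/ ? 1 : 0;

# A script property.
my $is_thai = $so_so =~ /\A\p{Thai}\z/ ? 1 : 0;

# The capitalized forms negate the property.
my $digit_not_letter = "5" =~ /\A\PL\z/   ? 1 : 0;
my $a_not_letter     = "a" =~ /\A\P{L}\z/ ? 1 : 0;

print "$letter_short $letter_braced $is_thai ",
      "$digit_not_letter $a_not_letter\n";
```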

=head2 Referencing

If capturing parentheses are used in a regular expression, we can refer
to the part of the source string that was matched, and match exactly the
same thing. There are three ways of referring to such I<backreferences>:
absolutely, relatively, and by name.

=for later add link to perlrecapture

=head3 Absolute referencing

Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style), where I<N>
is a positive (unsigned) decimal number of any length, is an absolute reference
to a capturing group.

I<N> refers to the Nth set of parentheses, so C<\gI<N>> refers to whatever has
been matched by that set of parentheses. Thus C<\g1> refers to the first
capture group in the regex.

The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
which avoids ambiguity when building a regex by concatenating shorter
strings. Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained
C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is
probably not what you intended.

In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at
least I<N> capturing groups, or else I<N> is considered an octal escape
(but something like C<\18> is the same as C<\0018>; that is, the octal escape
C<"\001"> followed by a literal digit C<"8">).

Mnemonic: I<g>roup.

=head4 Examples

 /(\w+) \g1/;    # Finds a duplicated word, (e.g. "cat cat").
 /(\w+) \1/;     # Same thing; written old-style.
 /(.)(.)\g2\g1/; # Match a four-letter palindrome (e.g. "ABBA").
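
The concatenation problem described above can be reproduced in a sketch.
The snippet strings and variable names are hypothetical. The patterns are
built from variables so that a compilation failure happens at run time,
where C<eval> can catch it.

```perl
use strict;
use warnings;

# Hypothetical snippets that are later joined into one pattern.
my $head_plain  = '(.)\g1';
my $head_braced = '(.)\g{1}';
my $tail        = '37';

# '\g1' . '37' becomes \g137, a reference to a group that does not
# exist, so compiling the pattern fails.
my $plain_re  = eval { qr/$head_plain$tail/ };
my $plain_ok  = defined $plain_re ? 1 : 0;

# With braces the reference ends at the "}", and "37" stays literal.
my $braced_re = eval { qr/$head_braced$tail/ };
my $braced_ok = (defined($braced_re) && "xx37" =~ /\A$braced_re\z/) ? 1 : 0;

print "plain=$plain_ok braced=$braced_ok\n";
```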

=head3 Relative referencing

C<\g-I<N>> (starting in Perl 5.10.0) is used for relative addressing. (It can
be written as C<\g{-I<N>}>.) It refers to the I<N>th group before the
C<\g{-I<N>}>.

The big advantage of this form is that it makes it much easier to write
patterns with references that can be interpolated in larger patterns,
even if the larger pattern also contains capture groups.

=head4 Examples

 /(A)        # Group 1
  (          # Group 2
    (B)      # Group 3
    \g{-1}   # Refers to group 3 (B)
    \g{-3}   # Refers to group 1 (A)
  )
 /x;         # Matches "ABBA".

 my $qr = qr /(.)(.)\g{-2}\g{-1}/;  # Matches 'abab', 'cdcd', etc.
 /$qr$qr/                           # Matches 'ababcdcd'.
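
The advantage for interpolation can be seen by using the same sub-pattern
twice, once written with relative and once with absolute references. This
is only a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $relative = qr/(.)(.)\g{-2}\g{-1}/;
my $absolute = qr/(.)(.)\g{1}\g{2}/;

# Relative references still point at the groups of their own copy.
my $rel_twice = "ababcdcd" =~ /\A$relative$relative\z/ ? 1 : 0;

# Absolute references in the second copy still mean groups 1 and 2,
# which belong to the first copy ("a" and "b").
my $abs_twice      = "ababcdcd" =~ /\A$absolute$absolute\z/ ? 1 : 0;
my $abs_first_pair = "ababcdab" =~ /\A$absolute$absolute\z/ ? 1 : 0;

print "$rel_twice $abs_twice $abs_first_pair\n";
```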

=head3 Named referencing

C<\g{I<name>}> (starting in Perl 5.10.0) can be used to back refer to a
named capture group, dispensing completely with having to think about capture
buffer positions.

To be compatible with .Net regular expressions, C<\g{name}> may also be
written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.

To prevent any ambiguity, I<name> must not start with a digit nor contain a
hyphen.

=head4 Examples

 /(?<word>\w+) \g{word}/ # Finds duplicated word, (e.g. "cat cat")
 /(?<word>\w+) \k{word}/ # Same.
 /(?<word>\w+) \k<word>/ # Same.
 /(?<letter1>.)(?<letter2>.)\g{letter2}\g{letter1}/
                         # Match a four-letter palindrome (e.g. "ABBA")
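
A short sketch shows the C<\g{name}> and C<< \k<name> >> spellings side by
side, reading the captured text back through the C<%+> hash. The sample
sentence and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = "the cat cat sat";   # hypothetical sample sentence

# Both spellings refer to the group named "word"; after a successful
# match the captured text is available as $+{word}.
my $dup_g = $text =~ /\b(?<word>\w+) \g{word}\b/ ? $+{word} : '';
my $dup_k = $text =~ /\b(?<word>\w+) \k<word>\b/ ? $+{word} : '';

print "g=$dup_g k=$dup_k\n";
```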

=head2 Assertions

Assertions are conditions that have to be true; they don't actually
match any part of the string. There are six assertions that are written as
backslash sequences.

=over 4

=item \A

C<\A> only matches at the beginning of the string. If the C</m> modifier
isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m>
modifier is used, then C</^/> also matches just after internal newlines, but
the meaning of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at
the beginning of the string regardless of whether the C</m> modifier is used.

=item \z, \Z

C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't
used, then C</\Z/> is equivalent to C</$/>; that is, it matches at the
end of the string, or one before the newline at the end of the string. If the
C</m> modifier is used, then C</$/> also matches just before internal
newlines, but the meaning of C</\Z/> isn't changed by the C</m> modifier.
C<\Z> matches at the end of the string (or just before a trailing newline)
regardless of whether the C</m> modifier is used.

C<\z> is just like C<\Z>, except that it does not match before a trailing
newline. C<\z> matches at the end of the string only, regardless of the
modifiers used, and not just before a newline. It is how to anchor the
match to the true end of the string under all conditions.

=item \G

C<\G> is usually used only in combination with the C</g> modifier. If the
C</g> modifier is used and the match is done in scalar context, Perl
remembers where in the source string the last match ended, and the next time,
it will start the match from where it ended the previous time.

C<\G> matches the point where the previous match on that string ended,
or the beginning of that string if there was no previous match.

=for later add link to perlremodifiers

Mnemonic: I<G>lobal.

=item \b{}, \b, \B{}, \B

C<\b{...}>, available starting in v5.22, matches a boundary (between two
characters, or before the first character of the string, or after the
final character of the string) based on the Unicode rules for the
boundary type specified inside the braces. The boundary
types are given a few paragraphs below. C<\B{...}> matches at any place
between characters where C<\b{...}> of the same type doesn't match.

C<\b> when not immediately followed by a C<"{"> matches at any place
between a word (something matched by C<\w>) and a non-word character
(C<\W>); C<\B> when not immediately followed by a C<"{"> matches at any
place between characters where C<\b> doesn't match. To get better
word matching of natural language text, see L</\b{wb}> below.

C<\b>
and C<\B> assume there's a non-word character before the beginning and after
the end of the source string; so C<\b> will match at the beginning (or end)
of the source string if the source string begins (or ends) with a word
character. Otherwise, C<\B> will match.

Do not use something like C<\b=head\d\b> and expect it to match the
beginning of a line. It can't, because for there to be a boundary before
the non-word "=", there must be a word character immediately previous.
All plain C<\b> and C<\B> boundary determinations look for word
characters alone, not for
non-word characters nor for string ends. It may help to understand how
C<\b> and C<\B> work by equating them as follows:

    \b  really means    (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
    \B  really means    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

In contrast, C<\b{...}> and C<\B{...}> may or may not match at the
beginning and end of the line, depending on the boundary type. These
implement the Unicode default boundaries, specified in
L<https://www.unicode.org/reports/tr14/> and
L<https://www.unicode.org/reports/tr29/>.
The boundary types are:

=over

=item C<\b{gcb}> or C<\b{g}>

This matches a Unicode "Grapheme Cluster Boundary". (Actually Perl
always uses the improved "extended grapheme cluster".) These are
explained below under L</C<\X>>. In fact, C<\X> is another way to get
the same functionality. It is equivalent to C</.+?\b{gcb}/>. Use
whichever is most convenient for your situation.

=item C<\b{lb}>

This matches according to the default Unicode Line Breaking Algorithm
(L<https://www.unicode.org/reports/tr14/>), as customized in that
document
(L<Example 7 of revision 35|https://www.unicode.org/reports/tr14/tr14-35.html#Example7>)
for better handling of numeric expressions.

This is suitable for many purposes, but the L<Unicode::LineBreak> module,
available on CPAN, provides many more features, including
customization.

=item C<\b{sb}>

This matches a Unicode "Sentence Boundary". This is an aid to parsing
natural language sentences. It gives good, but imperfect results. For
example, it thinks that "Mr. Smith" is two sentences. More details are
at L<https://www.unicode.org/reports/tr29/>. Note also that it thinks
that anything matching L</\R> (except form feed and vertical tab) is a
sentence boundary. C<\b{sb}> works with text designed for
word-processors which wrap lines
automatically for display, but hard-coded line boundaries are considered
to be essentially the ends of text blocks (paragraphs really), and hence
the ends of sentences. C<\b{sb}> doesn't do well with text containing
embedded newlines, like the source text of the document you are reading.
Such text needs to be preprocessed to get rid of the line separators
before looking for sentence boundaries. Some people view this as a bug
in the Unicode standard, and this behavior is quite subject to change in
future Perl versions.

=item C<\b{wb}>

This matches a Unicode "Word Boundary", but tailored to Perl
expectations. This gives better (though not
perfect) results for natural language processing than plain C<\b>
(without braces) does. For example, it understands that apostrophes can
be in the middle of words and that parentheses aren't (see the examples
below). More details are at L<https://www.unicode.org/reports/tr29/>.

The current Unicode definition of a Word Boundary matches between every
white space character. Perl tailors this, starting in version 5.24, to
generally not break up spans of white space, just as plain C<\b> has
always functioned. This allows C<\b{wb}> to be a drop-in replacement for
C<\b>, but with generally better results for natural language
processing. (The exception to this tailoring is when a span of white
space is immediately followed by something like U+0303, COMBINING TILDE.
If the final space character in the span is a horizontal white space, it
is broken out so that it attaches instead to the combining character.
To be precise, if a span of white space that ends in a horizontal space
is immediately followed by a character that has any of the Word
Boundary property values "Extend", "Format" or "ZWJ", the boundary between the
final horizontal space character and the rest of the span matches
C<\b{wb}>. In all other cases the boundary between two white space
characters matches C<\B{wb}>.)

=back

It is important to realize when you use these Unicode boundaries,
that you are taking a risk that a future version of Perl which contains
a later version of the Unicode Standard will not work precisely the same
way as it did when your code was written. These rules are not
considered stable and have been somewhat more subject to change than the
rest of the Standard. Unicode reserves the right to change them at
will, and Perl reserves the right to update its implementation to
Unicode's new rules. In the past, some changes have been because new
characters have been added to the Standard which have different
characteristics than all previous characters, so new rules are
formulated for handling them. These should not cause any backward
compatibility issues. But some changes have changed the treatment of
existing characters because the Unicode Technical Committee has decided
that the change is warranted for whatever reason. This could be to fix
a bug, or because they think better results are obtained with the new
rule.

It is also important to realize that these are default boundary
definitions, and that implementations may wish to tailor the results for
particular purposes and locales. For example, some languages, such as
Japanese and Thai, require dictionary lookup to accurately determine
word boundaries.

Mnemonic: I<b>oundary.

=back
|
||||
|
||||
=head4 Examples
|
||||
|
||||
"cat" =~ /\Acat/; # Match.
|
||||
"cat" =~ /cat\Z/; # Match.
|
||||
"cat\n" =~ /cat\Z/; # Match.
|
||||
"cat\n" =~ /cat\z/; # No match.
|
||||
|
||||
"cat" =~ /\bcat\b/; # Matches.
|
||||
"cats" =~ /\bcat\b/; # No match.
|
||||
"cat" =~ /\bcat\B/; # No match.
|
||||
"cats" =~ /\bcat\B/; # Match.
|
||||
|
||||
while ("cat dog" =~ /(\w+)/g) {
|
||||
print $1; # Prints 'catdog'
|
||||
}
|
||||
while ("cat dog" =~ /\G(\w+)/g) {
|
||||
print $1; # Prints 'cat'
|
||||
}
|
||||
|
||||
my $s = "He said, \"Is pi 3.14? (I'm not sure).\"";
|
||||
print join("|", $s =~ m/ ( .+? \b ) /xg), "\n";
|
||||
print join("|", $s =~ m/ ( .+? \b{wb} ) /xg), "\n";
|
||||
prints
|
||||
He| |said|, "|Is| |pi| |3|.|14|? (|I|'|m| |not| |sure
|
||||
He| |said|,| |"|Is| |pi| |3.14|?| |(|I'm| |not| |sure|)|.|"
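
The C<\G> example above stops after C<'cat'> because C<\G> anchors each
attempt to the position where the previous match ended.  That same property
is what makes lexer-style scanning possible.  The following is a minimal
sketch, not code from the Perl distribution: the input string and the token
labels C<WORD> and C<SPACE> are made up for illustration.  It relies on the
C</gc> modifiers, which leave C<pos()> unchanged when a match fails, so each
alternative is tried at exactly the point where the last token ended.

  my $input = "cat dog";
  my @tokens;
  while (1) {
      if    ($input =~ /\G(\w+)/gc) { push @tokens, "WORD($1)" }
      elsif ($input =~ /\G(\s+)/gc) { push @tokens, "SPACE"    }
      else                          { last }   # end of string, or no rule fits
  }
  print join(" ", @tokens), "\n";   # Prints 'WORD(cat) SPACE WORD(dog)'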

=head2 Misc

Here we document the backslash sequences that don't fall in one of the
categories above.  These are:

=over 4

=item \K

This appeared in perl 5.10.0.  Anything matched left of C<\K> is
not included in C<$&>, and will not be replaced if the pattern is
used in a substitution.  This lets you write C<s/PAT1 \K PAT2/REPL/x>
instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.

Mnemonic: I<K>eep.

=item \N

This feature, available starting in v5.12, matches any character
that is B<not> a newline.  It is a short-hand for writing C<[^\n]>, and is
identical to the C<.> metasymbol, except under the C</s> flag, which changes
the meaning of C<.>, but not C<\N>.
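
As a small sketch of that difference (the test string is arbitrary and
chosen only for illustration), the following compares C<.>, C<\N> and
C<[^\n]> under C</s>.  It assumes a perl of at least v5.12.

  my $text  = "a\nb";
  my $dot   = $text =~ /a.b/s     ? 1 : 0;   # 1: /s lets . match "\n"
  my $not_n = $text =~ /a\Nb/s    ? 1 : 0;   # 0: \N never matches "\n"
  my $class = $text =~ /a[^\n]b/s ? 1 : 0;   # 0: behaves like \N
  print "$dot $not_n $class\n";              # Prints '1 0 0'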

Note that C<\N{...}> can mean a
L<named or numbered character
|/Named or numbered characters and character sequences>.

Mnemonic: Complement of I<\n>.

=item \R
X<\R>

C<\R> matches a I<generic newline>; that is, anything considered a
linebreak sequence by Unicode.  This includes all characters matched by
C<\v> (vertical whitespace), and the multi-character sequence C<"\x0D\x0A">
(carriage return followed by a line feed, sometimes called the network
newline; it's the end-of-line sequence used in Microsoft text files opened
in binary mode).  C<\R> is equivalent to C<< (?>\x0D\x0A|\v) >>.  (The
reason it doesn't backtrack is that the sequence is considered
inseparable.  That means that

  "\x0D\x0A" =~ /^\R\x0A$/   # No match

fails, because the C<\R> matches the entire string, and won't backtrack
to match just the C<"\x0D">.)  Since
C<\R> can match a sequence of more than one character, it cannot be put
inside a bracketed character class; C</[\R]/> is an error; use C<\v>
instead.  C<\R> was introduced in perl 5.10.0.

Note that this does not respect any locale that might be in effect; it
matches according to the platform's native character set.

Mnemonic: none really.  C<\R> was picked because PCRE already uses C<\R>,
and more importantly because Unicode recommends such a regular expression
metacharacter, and suggests C<\R> as its notation.

=item \X
X<\X>

This matches a Unicode I<extended grapheme cluster>.

C<\X> matches quite well what normal (non-Unicode-programmer) usage
would consider a single character.  As an example, consider a G with some sort
of diacritic mark, such as an arrow.  There is no such single character in
Unicode, but one can be composed by using a G followed by a Unicode "COMBINING
UPWARDS ARROW BELOW", which Unicode-aware software would display as if it
were a single character.

The match is greedy and non-backtracking, so that the cluster is never
broken up into smaller components.

See also L<C<\b{gcb}>|/\b{}, \b, \B{}, \B>.

Mnemonic: eI<X>tended Unicode character.

=back

=head4 Examples

  $str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
  $str =~ s/(.)\K\g1//g;    # Delete the second of each pair of adjacent
                            # identical characters.
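
To make the effect of C<\K> on C<$&> concrete, here is a short sketch (the
sample strings and variable names are invented for illustration).  It also
checks that, for this fixed-width prefix, the C<\K> form gives the same
result as the capture and lookbehind forms described under C<\K> above.

  "foobar" =~ /foo\Kbar/;
  my $matched = $&;                             # "bar", not "foobar"

  my $str = "foobar barfoo foobar";
  (my $keep   = $str) =~ s/foo\Kbar/baz/g;      # \K
  (my $capt   = $str) =~ s/(foo)bar/${1}baz/g;  # capture and re-insert
  (my $behind = $str) =~ s/(?<=foo)bar/baz/g;   # lookbehind
  print "$matched\n";                           # Prints 'bar'
  print "$keep\n";                              # Prints 'foobaz barfoo foobaz'
  print "same\n" if $keep eq $capt and $capt eq $behind;   # Prints 'same'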

  "\n"   =~ /^\R$/;         # Match, \n   is a generic newline.
  "\r"   =~ /^\R$/;         # Match, \r   is a generic newline.
  "\r\n" =~ /^\R$/;         # Match, \r\n is a generic newline.
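
One use of C<\R> that follows from the description above is normalizing
mixed line endings.  The following sketch (the sample text is made up)
rewrites every generic newline as C<"\n"> and splits on it.  Because C<\R>
treats C<"\r\n"> as a single unit, that pair is replaced once, not twice.

  my $text = "one\r\ntwo\rthree\nfour";
  (my $unix = $text) =~ s/\R/\n/g;
  my @lines = split /\R/, $text;
  print scalar(@lines), "\n";                   # Prints '4'
  print "ok\n" if $unix eq "one\ntwo\nthree\nfour";   # Prints 'ok'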

  "P\x{307}" =~ /^\X$/;     # \X matches a P with a dot above.
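
As a further sketch (the string is chosen for illustration only), C<\X> can
be used to count what a reader would perceive as characters.  C<length>
counts code points instead, so the two numbers differ when combining marks
are present.

  my $str      = "P\x{307}e\x{301}z";   # P + dot above, e + acute accent, z
  my @clusters = $str =~ /\X/g;
  print length($str), "\n";             # Prints '5' (code points)
  print scalar(@clusters), "\n";        # Prints '3' (grapheme clusters)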

=cut