782 lines
32 KiB
Plaintext
782 lines
32 KiB
Plaintext
=head1 NAME
|
|
|
|
perlrebackslash - Perl Regular Expression Backslash Sequences and Escapes
|
|
|
|
=head1 DESCRIPTION
|
|
|
|
The top level documentation about Perl regular expressions
|
|
is found in L<perlre>.
|
|
|
|
This document describes all backslash and escape sequences. After
|
|
explaining the role of the backslash, it lists all the sequences that have
|
|
a special meaning in Perl regular expressions (in alphabetical order),
|
|
then describes each of them.
|
|
|
|
Most sequences are described in detail in different documents; the primary
|
|
purpose of this document is to have a quick reference guide describing all
|
|
backslash and escape sequences.
|
|
|
|
=head2 The backslash
|
|
|
|
In a regular expression, the backslash can perform one of two tasks:
|
|
it either takes away the special meaning of the character following it
|
|
(for instance, C<\|> matches a vertical bar, it's not an alternation),
|
|
or it is the start of a backslash or escape sequence.
|
|
|
|
The rules determining what it is are quite simple: if the character
|
|
following the backslash is an ASCII punctuation (non-word) character (that is,
|
|
anything that is not a letter, digit, or underscore), then the backslash just
|
|
takes away any special meaning of the character following it.
|
|
|
|
If the character following the backslash is an ASCII letter or an ASCII digit,
|
|
then the sequence may be special; if so, it's listed below. A few letters have
|
|
not been used yet, so escaping them with a backslash doesn't change them to be
|
|
special. A future version of Perl may assign a special meaning to them, so if
|
|
you have warnings turned on, Perl issues a warning if you use such a
|
|
sequence. [1].
|
|
|
|
It is however guaranteed that backslash or escape sequences never have a
|
|
punctuation character following the backslash, not now, and not in a future
|
|
version of Perl 5. So it is safe to put a backslash in front of a non-word
|
|
character.
|
|
|
|
Note that the backslash itself is special; if you want to match a backslash,
|
|
you have to escape the backslash with a backslash: C</\\/> matches a single
|
|
backslash.
|
|
|
|
=over 4
|
|
|
|
=item [1]
|
|
|
|
There is one exception. If you use an alphanumeric character as the
|
|
delimiter of your pattern (which you probably shouldn't do for readability
|
|
reasons), you have to escape the delimiter if you want to match
|
|
it. Perl won't warn then. See also L<perlop/Gory details of parsing
|
|
quoted constructs>.
|
|
|
|
=back
|
|
|
|
|
|
=head2 All the sequences and escapes
|
|
|
|
Those not usable within a bracketed character class (like C<[\da-z]>) are marked
|
|
as C<Not in [].>
|
|
|
|
\000 Octal escape sequence. See also \o{}.
|
|
\1 Absolute backreference. Not in [].
|
|
\a Alarm or bell.
|
|
\A Beginning of string. Not in [].
|
|
\b{}, \b Boundary. (\b is a backspace in []).
|
|
\B{}, \B Not a boundary. Not in [].
|
|
\cX Control-X.
|
|
\d Match any digit character.
|
|
\D Match any character that isn't a digit.
|
|
\e Escape character.
|
|
\E Turn off \Q, \L and \U processing. Not in [].
|
|
\f Form feed.
|
|
\F Foldcase till \E. Not in [].
|
|
\g{}, \g1 Named, absolute or relative backreference.
|
|
Not in [].
|
|
\G Pos assertion. Not in [].
|
|
\h Match any horizontal whitespace character.
|
|
\H Match any character that isn't horizontal whitespace.
|
|
\k{}, \k<>, \k'' Named backreference. Not in [].
|
|
\K Keep the stuff left of \K. Not in [].
|
|
\l Lowercase next character. Not in [].
|
|
\L Lowercase till \E. Not in [].
|
|
\n (Logical) newline character.
|
|
\N Match any character but newline. Not in [].
|
|
\N{} Named or numbered (Unicode) character or sequence.
|
|
\o{} Octal escape sequence.
|
|
\p{}, \pP Match any character with the given Unicode property.
|
|
\P{}, \PP Match any character without the given property.
|
|
\Q Quote (disable) pattern metacharacters till \E. Not
|
|
in [].
|
|
\r Return character.
|
|
\R Generic new line. Not in [].
|
|
\s Match any whitespace character.
|
|
\S Match any character that isn't a whitespace.
|
|
\t Tab character.
|
|
\u Titlecase next character. Not in [].
|
|
\U Uppercase till \E. Not in [].
|
|
\v Match any vertical whitespace character.
|
|
\V Match any character that isn't vertical whitespace
|
|
\w Match any word character.
|
|
\W Match any character that isn't a word character.
|
|
\x{}, \x00 Hexadecimal escape sequence.
|
|
\X Unicode "extended grapheme cluster". Not in [].
|
|
\z End of string. Not in [].
|
|
\Z End of string. Not in [].
|
|
|
|
=head2 Character Escapes
|
|
|
|
=head3 Fixed characters
|
|
|
|
A handful of characters have a dedicated I<character escape>. The following
|
|
table shows them, along with their ASCII code points (in decimal and hex),
|
|
their ASCII name, the control escape on ASCII platforms and a short
|
|
description. (For EBCDIC platforms, see L<perlebcdic/OPERATOR DIFFERENCES>.)
|
|
|
|
Seq. Code Point ASCII Cntrl Description.
|
|
Dec Hex
|
|
\a 7 07 BEL \cG alarm or bell
|
|
\b 8 08 BS \cH backspace [1]
|
|
\e 27 1B ESC \c[ escape character
|
|
\f 12 0C FF \cL form feed
|
|
\n 10 0A LF \cJ line feed [2]
|
|
\r 13 0D CR \cM carriage return
|
|
\t 9 09 TAB \cI tab
|
|
|
|
=over 4
|
|
|
|
=item [1]
|
|
|
|
C<\b> is the backspace character only inside a character class. Outside a
|
|
character class, C<\b> alone is a word-character/non-word-character
|
|
boundary, and C<\b{}> is some other type of boundary.
|
|
|
|
=item [2]
|
|
|
|
C<\n> matches a logical newline. Perl converts between C<\n> and your
|
|
OS's native newline character when reading from or writing to text files.
|
|
|
|
=back
|
|
|
|
=head4 Example
|
|
|
|
$str =~ /\t/; # Matches if $str contains a (horizontal) tab.
|
|
|
|
=head3 Control characters
|
|
|
|
C<\c> is used to denote a control character; the character following C<\c>
|
|
determines the value of the construct. For example the value of C<\cA> is
|
|
C<chr(1)>, and the value of C<\cb> is C<chr(2)>, etc.
|
|
The gory details are in L<perlop/"Regexp Quote-Like Operators">. A complete
|
|
list of what C<chr(1)>, etc. means for ASCII and EBCDIC platforms is in
|
|
L<perlebcdic/OPERATOR DIFFERENCES>.
|
|
|
|
Note that C<\c\> alone at the end of a regular expression (or doubled-quoted
|
|
string) is not valid. The backslash must be followed by another character.
|
|
That is, C<\c\I<X>> means C<chr(28) . 'I<X>'> for all characters I<X>.
|
|
|
|
To write platform-independent code, you must use C<\N{I<NAME>}> instead, like
|
|
C<\N{ESCAPE}> or C<\N{U+001B}>, see L<charnames>.
|
|
|
|
Mnemonic: I<c>ontrol character.
|
|
|
|
=head4 Example
|
|
|
|
$str =~ /\cK/; # Matches if $str contains a vertical tab (control-K).
|
|
|
|
=head3 Named or numbered characters and character sequences
|
|
|
|
Unicode characters have a Unicode name and numeric code point (ordinal)
|
|
value. Use the
|
|
C<\N{}> construct to specify a character by either of these values.
|
|
Certain sequences of characters also have names.
|
|
|
|
To specify by name, the name of the character or character sequence goes
|
|
between the curly braces.
|
|
|
|
To specify a character by Unicode code point, use the form C<\N{U+I<code
|
|
point>}>, where I<code point> is a number in hexadecimal that gives the
|
|
code point that Unicode has assigned to the desired character. It is
|
|
customary but not required to use leading zeros to pad the number to 4
|
|
digits. Thus C<\N{U+0041}> means C<LATIN CAPITAL LETTER A>, and you will
|
|
rarely see it written without the two leading zeros. C<\N{U+0041}> means
|
|
"A" even on EBCDIC machines (where the ordinal value of "A" is not 0x41).
|
|
|
|
It is even possible to give your own names to characters and character
|
|
sequences by using the L<charnames> module. These custom names are
|
|
lexically scoped, and so a given code point may have different names
|
|
in different scopes. The name used is what is in effect at the time the
|
|
C<\N{}> is expanded. For patterns in double-quotish context, that means
|
|
at the time the pattern is parsed. But for patterns that are delimitted
|
|
by single quotes, the expansion is deferred until pattern compilation
|
|
time, which may very well have a different C<charnames> translator in
|
|
effect.
|
|
|
|
(There is an expanded internal form that you may see in debug output:
|
|
C<\N{U+I<code point>.I<code point>...}>.
|
|
The C<...> means any number of these I<code point>s separated by dots.
|
|
This represents the sequence formed by the characters. This is an internal
|
|
form only, subject to change, and you should not try to use it yourself.)
|
|
|
|
Mnemonic: I<N>amed character.
|
|
|
|
Note that a character or character sequence expressed as a named
|
|
or numbered character is considered a character without special
|
|
meaning by the regex engine, and will match "as is".
|
|
|
|
=head4 Example
|
|
|
|
$str =~ /\N{THAI CHARACTER SO SO}/; # Matches the Thai SO SO character
|
|
|
|
use charnames 'Cyrillic'; # Loads Cyrillic names.
|
|
$str =~ /\N{ZHE}\N{KA}/; # Match "ZHE" followed by "KA".
|
|
|
|
=head3 Octal escapes
|
|
|
|
There are two forms of octal escapes. Each is used to specify a character by
|
|
its code point specified in octal notation.
|
|
|
|
One form, available starting in Perl 5.14 looks like C<\o{...}>, where the dots
|
|
represent one or more octal digits. It can be used for any Unicode character.
|
|
|
|
It was introduced to avoid the potential problems with the other form,
|
|
available in all Perls. That form consists of a backslash followed by three
|
|
octal digits. One problem with this form is that it can look exactly like an
|
|
old-style backreference (see
|
|
L</Disambiguation rules between old-style octal escapes and backreferences>
|
|
below.) You can avoid this by making the first of the three digits always a
|
|
zero, but that makes \077 the largest code point specifiable.
|
|
|
|
In some contexts, a backslash followed by two or even one octal digits may be
|
|
interpreted as an octal escape, sometimes with a warning, and because of some
|
|
bugs, sometimes with surprising results. Also, if you are creating a regex
|
|
out of smaller snippets concatenated together, and you use fewer than three
|
|
digits, the beginning of one snippet may be interpreted as adding digits to the
|
|
ending of the snippet before it. See L</Absolute referencing> for more
|
|
discussion and examples of the snippet problem.
|
|
|
|
Note that a character expressed as an octal escape is considered
|
|
a character without special meaning by the regex engine, and will match
|
|
"as is".
|
|
|
|
To summarize, the C<\o{}> form is always safe to use, and the other form is
|
|
safe to use for code points through \077 when you use exactly three digits to
|
|
specify them.
|
|
|
|
Mnemonic: I<0>ctal or I<o>ctal.
|
|
|
|
=head4 Examples (assuming an ASCII platform)
|
|
|
|
$str = "Perl";
|
|
$str =~ /\o{120}/; # Match, "\120" is "P".
|
|
$str =~ /\120/; # Same.
|
|
$str =~ /\o{120}+/; # Match, "\120" is "P",
|
|
# it's repeated at least once.
|
|
$str =~ /\120+/; # Same.
|
|
$str =~ /P\053/; # No match, "\053" is "+" and taken literally.
|
|
/\o{23073}/ # Black foreground, white background smiling face.
|
|
/\o{4801234567}/ # Raises a warning, and yields chr(4).
|
|
|
|
=head4 Disambiguation rules between old-style octal escapes and backreferences
|
|
|
|
Octal escapes of the C<\000> form outside of bracketed character classes
|
|
potentially clash with old-style backreferences (see L</Absolute referencing>
|
|
below). They both consist of a backslash followed by numbers. So Perl has to
|
|
use heuristics to determine whether it is a backreference or an octal escape.
|
|
Perl uses the following rules to disambiguate:
|
|
|
|
=over 4
|
|
|
|
=item 1
|
|
|
|
If the backslash is followed by a single digit, it's a backreference.
|
|
|
|
=item 2
|
|
|
|
If the first digit following the backslash is a 0, it's an octal escape.
|
|
|
|
=item 3
|
|
|
|
If the number following the backslash is N (in decimal), and Perl already
|
|
has seen N capture groups, Perl considers this a backreference. Otherwise,
|
|
it considers it an octal escape. If N has more than three digits, Perl
|
|
takes only the first three for the octal escape; the rest are matched as is.
|
|
|
|
my $pat = "(" x 999;
|
|
$pat .= "a";
|
|
$pat .= ")" x 999;
|
|
/^($pat)\1000$/; # Matches 'aa'; there are 1000 capture groups.
|
|
/^$pat\1000$/; # Matches 'a@0'; there are 999 capture groups
|
|
# and \1000 is seen as \100 (a '@') and a '0'.
|
|
|
|
=back
|
|
|
|
You can force a backreference interpretation always by using the C<\g{...}>
|
|
form. You can the force an octal interpretation always by using the C<\o{...}>
|
|
form, or for numbers up through \077 (= 63 decimal), by using three digits,
|
|
beginning with a "0".
|
|
|
|
=head3 Hexadecimal escapes
|
|
|
|
Like octal escapes, there are two forms of hexadecimal escapes, but both start
|
|
with the sequence C<\x>. This is followed by either exactly two hexadecimal
|
|
digits forming a number, or a hexadecimal number of arbitrary length surrounded
|
|
by curly braces. The hexadecimal number is the code point of the character you
|
|
want to express.
|
|
|
|
Note that a character expressed as one of these escapes is considered a
|
|
character without special meaning by the regex engine, and will match
|
|
"as is".
|
|
|
|
Mnemonic: heI<x>adecimal.
|
|
|
|
=head4 Examples (assuming an ASCII platform)
|
|
|
|
$str = "Perl";
|
|
$str =~ /\x50/; # Match, "\x50" is "P".
|
|
$str =~ /\x50+/; # Match, "\x50" is "P", it is repeated at least once
|
|
$str =~ /P\x2B/; # No match, "\x2B" is "+" and taken literally.
|
|
|
|
/\x{2603}\x{2602}/ # Snowman with an umbrella.
|
|
# The Unicode character 2603 is a snowman,
|
|
# the Unicode character 2602 is an umbrella.
|
|
/\x{263B}/ # Black smiling face.
|
|
/\x{263b}/ # Same, the hex digits A - F are case insensitive.
|
|
|
|
=head2 Modifiers
|
|
|
|
A number of backslash sequences have to do with changing the character,
|
|
or characters following them. C<\l> will lowercase the character following
|
|
it, while C<\u> will uppercase (or, more accurately, titlecase) the
|
|
character following it. They provide functionality similar to the
|
|
functions C<lcfirst> and C<ucfirst>.
|
|
|
|
To uppercase or lowercase several characters, one might want to use
|
|
C<\L> or C<\U>, which will lowercase/uppercase all characters following
|
|
them, until either the end of the pattern or the next occurrence of
|
|
C<\E>, whichever comes first. They provide functionality similar to what
|
|
the functions C<lc> and C<uc> provide.
|
|
|
|
C<\Q> is used to quote (disable) pattern metacharacters, up to the next
|
|
C<\E> or the end of the pattern. C<\Q> adds a backslash to any character
|
|
that could have special meaning to Perl. In the ASCII range, it quotes
|
|
every character that isn't a letter, digit, or underscore. See
|
|
L<perlfunc/quotemeta> for details on what gets quoted for non-ASCII
|
|
code points. Using this ensures that any character between C<\Q> and
|
|
C<\E> will be matched literally, not interpreted as a metacharacter by
|
|
the regex engine.
|
|
|
|
C<\F> can be used to casefold all characters following, up to the next C<\E>
|
|
or the end of the pattern. It provides the functionality similar to
|
|
the C<fc> function.
|
|
|
|
Mnemonic: I<L>owercase, I<U>ppercase, I<F>old-case, I<Q>uotemeta, I<E>nd.
|
|
|
|
=head4 Examples
|
|
|
|
$sid = "sid";
|
|
$greg = "GrEg";
|
|
$miranda = "(Miranda)";
|
|
$str =~ /\u$sid/; # Matches 'Sid'
|
|
$str =~ /\L$greg/; # Matches 'greg'
|
|
$str =~ /\Q$miranda\E/; # Matches '(Miranda)', as if the pattern
|
|
# had been written as /\(Miranda\)/
|
|
|
|
=head2 Character classes
|
|
|
|
Perl regular expressions have a large range of character classes. Some of
|
|
the character classes are written as a backslash sequence. We will briefly
|
|
discuss those here; full details of character classes can be found in
|
|
L<perlrecharclass>.
|
|
|
|
C<\w> is a character class that matches any single I<word> character
|
|
(letters, digits, Unicode marks, and connector punctuation (like the
|
|
underscore)). C<\d> is a character class that matches any decimal
|
|
digit, while the character class C<\s> matches any whitespace character.
|
|
New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
|
|
and vertical whitespace characters.
|
|
|
|
The exact set of characters matched by C<\d>, C<\s>, and C<\w> varies
|
|
depending on various pragma and regular expression modifiers. It is
|
|
possible to restrict the match to the ASCII range by using the C</a>
|
|
regular expression modifier. See L<perlrecharclass>.
|
|
|
|
The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
|
|
character classes that match, respectively, any character that isn't a
|
|
word character, digit, whitespace, horizontal whitespace, or vertical
|
|
whitespace.
|
|
|
|
Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.
|
|
|
|
=head3 Unicode classes
|
|
|
|
C<\pP> (where C<P> is a single letter) and C<\p{Property}> are used to
|
|
match a character that matches the given Unicode property; properties
|
|
include things like "letter", or "thai character". Capitalizing the
|
|
sequence to C<\PP> and C<\P{Property}> make the sequence match a character
|
|
that doesn't match the given Unicode property. For more details, see
|
|
L<perlrecharclass/Backslash sequences> and
|
|
L<perlunicode/Unicode Character Properties>.
|
|
|
|
Mnemonic: I<p>roperty.
|
|
|
|
=head2 Referencing
|
|
|
|
If capturing parenthesis are used in a regular expression, we can refer
|
|
to the part of the source string that was matched, and match exactly the
|
|
same thing. There are three ways of referring to such I<backreference>:
|
|
absolutely, relatively, and by name.
|
|
|
|
=for later add link to perlrecapture
|
|
|
|
=head3 Absolute referencing
|
|
|
|
Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style) where I<N>
|
|
is a positive (unsigned) decimal number of any length is an absolute reference
|
|
to a capturing group.
|
|
|
|
I<N> refers to the Nth set of parentheses, so C<\gI<N>> refers to whatever has
|
|
been matched by that set of parentheses. Thus C<\g1> refers to the first
|
|
capture group in the regex.
|
|
|
|
The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
|
|
which avoids ambiguity when building a regex by concatenating shorter
|
|
strings. Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained
|
|
C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is
|
|
probably not what you intended.
|
|
|
|
In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at
|
|
least I<N> capturing groups, or else I<N> is considered an octal escape
|
|
(but something like C<\18> is the same as C<\0018>; that is, the octal escape
|
|
C<"\001"> followed by a literal digit C<"8">).
|
|
|
|
Mnemonic: I<g>roup.
|
|
|
|
=head4 Examples
|
|
|
|
/(\w+) \g1/; # Finds a duplicated word, (e.g. "cat cat").
|
|
/(\w+) \1/; # Same thing; written old-style.
|
|
/(.)(.)\g2\g1/; # Match a four letter palindrome (e.g. "ABBA").
|
|
|
|
|
|
=head3 Relative referencing
|
|
|
|
C<\g-I<N>> (starting in Perl 5.10.0) is used for relative addressing. (It can
|
|
be written as C<\g{-I<N>}>.) It refers to the I<N>th group before the
|
|
C<\g{-I<N>}>.
|
|
|
|
The big advantage of this form is that it makes it much easier to write
|
|
patterns with references that can be interpolated in larger patterns,
|
|
even if the larger pattern also contains capture groups.
|
|
|
|
=head4 Examples
|
|
|
|
/(A) # Group 1
|
|
( # Group 2
|
|
(B) # Group 3
|
|
\g{-1} # Refers to group 3 (B)
|
|
\g{-3} # Refers to group 1 (A)
|
|
)
|
|
/x; # Matches "ABBA".
|
|
|
|
my $qr = qr /(.)(.)\g{-2}\g{-1}/; # Matches 'abab', 'cdcd', etc.
|
|
/$qr$qr/ # Matches 'ababcdcd'.
|
|
|
|
=head3 Named referencing
|
|
|
|
C<\g{I<name>}> (starting in Perl 5.10.0) can be used to back refer to a
|
|
named capture group, dispensing completely with having to think about capture
|
|
buffer positions.
|
|
|
|
To be compatible with .Net regular expressions, C<\g{name}> may also be
|
|
written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.
|
|
|
|
To prevent any ambiguity, I<name> must not start with a digit nor contain a
|
|
hyphen.
|
|
|
|
=head4 Examples
|
|
|
|
/(?<word>\w+) \g{word}/ # Finds duplicated word, (e.g. "cat cat")
|
|
/(?<word>\w+) \k{word}/ # Same.
|
|
/(?<word>\w+) \k<word>/ # Same.
|
|
/(?<letter1>.)(?<letter2>.)\g{letter2}\g{letter1}/
|
|
# Match a four letter palindrome (e.g. "ABBA")
|
|
|
|
=head2 Assertions
|
|
|
|
Assertions are conditions that have to be true; they don't actually
|
|
match parts of the substring. There are six assertions that are written as
|
|
backslash sequences.
|
|
|
|
=over 4
|
|
|
|
=item \A
|
|
|
|
C<\A> only matches at the beginning of the string. If the C</m> modifier
|
|
isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m>
|
|
modifier is used, then C</^/> matches internal newlines, but the meaning
|
|
of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at the beginning
|
|
of the string regardless whether the C</m> modifier is used.
|
|
|
|
=item \z, \Z
|
|
|
|
C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't
|
|
used, then C</\Z/> is equivalent to C</$/>; that is, it matches at the
|
|
end of the string, or one before the newline at the end of the string. If the
|
|
C</m> modifier is used, then C</$/> matches at internal newlines, but the
|
|
meaning of C</\Z/> isn't changed by the C</m> modifier. C<\Z> matches at
|
|
the end of the string (or just before a trailing newline) regardless whether
|
|
the C</m> modifier is used.
|
|
|
|
C<\z> is just like C<\Z>, except that it does not match before a trailing
|
|
newline. C<\z> matches at the end of the string only, regardless of the
|
|
modifiers used, and not just before a newline. It is how to anchor the
|
|
match to the true end of the string under all conditions.
|
|
|
|
=item \G
|
|
|
|
C<\G> is usually used only in combination with the C</g> modifier. If the
|
|
C</g> modifier is used and the match is done in scalar context, Perl
|
|
remembers where in the source string the last match ended, and the next time,
|
|
it will start the match from where it ended the previous time.
|
|
|
|
C<\G> matches the point where the previous match on that string ended,
|
|
or the beginning of that string if there was no previous match.
|
|
|
|
=for later add link to perlremodifiers
|
|
|
|
Mnemonic: I<G>lobal.
|
|
|
|
=item \b{}, \b, \B{}, \B
|
|
|
|
C<\b{...}>, available starting in v5.22, matches a boundary (between two
|
|
characters, or before the first character of the string, or after the
|
|
final character of the string) based on the Unicode rules for the
|
|
boundary type specified inside the braces. The boundary
|
|
types are given a few paragraphs below. C<\B{...}> matches at any place
|
|
between characters where C<\b{...}> of the same type doesn't match.
|
|
|
|
C<\b> when not immediately followed by a C<"{"> matches at any place
|
|
between a word (something matched by C<\w>) and a non-word character
|
|
(C<\W>); C<\B> when not immediately followed by a C<"{"> matches at any
|
|
place between characters where C<\b> doesn't match. To get better
|
|
word matching of natural language text, see L</\b{wb}> below.
|
|
|
|
C<\b>
|
|
and C<\B> assume there's a non-word character before the beginning and after
|
|
the end of the source string; so C<\b> will match at the beginning (or end)
|
|
of the source string if the source string begins (or ends) with a word
|
|
character. Otherwise, C<\B> will match.
|
|
|
|
Do not use something like C<\b=head\d\b> and expect it to match the
|
|
beginning of a line. It can't, because for there to be a boundary before
|
|
the non-word "=", there must be a word character immediately previous.
|
|
All plain C<\b> and C<\B> boundary determinations look for word
|
|
characters alone, not for
|
|
non-word characters nor for string ends. It may help to understand how
|
|
C<\b> and C<\B> work by equating them as follows:
|
|
|
|
\b really means (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
|
|
\B really means (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
|
|
|
|
In contrast, C<\b{...}> and C<\B{...}> may or may not match at the
|
|
beginning and end of the line, depending on the boundary type. These
|
|
implement the Unicode default boundaries, specified in
|
|
L<https://www.unicode.org/reports/tr14/> and
|
|
L<https://www.unicode.org/reports/tr29/>.
|
|
The boundary types are:
|
|
|
|
=over
|
|
|
|
=item C<\b{gcb}> or C<\b{g}>
|
|
|
|
This matches a Unicode "Grapheme Cluster Boundary". (Actually Perl
|
|
always uses the improved "extended" grapheme cluster"). These are
|
|
explained below under L</C<\X>>. In fact, C<\X> is another way to get
|
|
the same functionality. It is equivalent to C</.+?\b{gcb}/>. Use
|
|
whichever is most convenient for your situation.
|
|
|
|
=item C<\b{lb}>
|
|
|
|
This matches according to the default Unicode Line Breaking Algorithm
|
|
(L<https://www.unicode.org/reports/tr14/>), as customized in that
|
|
document
|
|
(L<Example 7 of revision 35|https://www.unicode.org/reports/tr14/tr14-35.html#Example7>)
|
|
for better handling of numeric expressions.
|
|
|
|
This is suitable for many purposes, but the L<Unicode::LineBreak> module
|
|
is available on CPAN that provides many more features, including
|
|
customization.
|
|
|
|
=item C<\b{sb}>
|
|
|
|
This matches a Unicode "Sentence Boundary". This is an aid to parsing
|
|
natural language sentences. It gives good, but imperfect results. For
|
|
example, it thinks that "Mr. Smith" is two sentences. More details are
|
|
at L<https://www.unicode.org/reports/tr29/>. Note also that it thinks
|
|
that anything matching L</\R> (except form feed and vertical tab) is a
|
|
sentence boundary. C<\b{sb}> works with text designed for
|
|
word-processors which wrap lines
|
|
automatically for display, but hard-coded line boundaries are considered
|
|
to be essentially the ends of text blocks (paragraphs really), and hence
|
|
the ends of sentences. C<\b{sb}> doesn't do well with text containing
|
|
embedded newlines, like the source text of the document you are reading.
|
|
Such text needs to be preprocessed to get rid of the line separators
|
|
before looking for sentence boundaries. Some people view this as a bug
|
|
in the Unicode standard, and this behavior is quite subject to change in
|
|
future Perl versions.
|
|
|
|
=item C<\b{wb}>
|
|
|
|
This matches a Unicode "Word Boundary", but tailored to Perl
|
|
expectations. This gives better (though not
|
|
perfect) results for natural language processing than plain C<\b>
|
|
(without braces) does. For example, it understands that apostrophes can
|
|
be in the middle of words and that parentheses aren't (see the examples
|
|
below). More details are at L<https://www.unicode.org/reports/tr29/>.
|
|
|
|
The current Unicode definition of a Word Boundary matches between every
|
|
white space character. Perl tailors this, starting in version 5.24, to
|
|
generally not break up spans of white space, just as plain C<\b> has
|
|
always functioned. This allows C<\b{wb}> to be a drop-in replacement for
|
|
C<\b>, but with generally better results for natural language
|
|
processing. (The exception to this tailoring is when a span of white
|
|
space is immediately followed by something like U+0303, COMBINING TILDE.
|
|
If the final space character in the span is a horizontal white space, it
|
|
is broken out so that it attaches instead to the combining character.
|
|
To be precise, if a span of white space that ends in a horizontal space
|
|
has the character immediately following it have any of the Word
|
|
Boundary property values "Extend", "Format" or "ZWJ", the boundary between the
|
|
final horizontal space character and the rest of the span matches
|
|
C<\b{wb}>. In all other cases the boundary between two white space
|
|
characters matches C<\B{wb}>.)
|
|
|
|
=back
|
|
|
|
It is important to realize when you use these Unicode boundaries,
|
|
that you are taking a risk that a future version of Perl which contains
|
|
a later version of the Unicode Standard will not work precisely the same
|
|
way as it did when your code was written. These rules are not
|
|
considered stable and have been somewhat more subject to change than the
|
|
rest of the Standard. Unicode reserves the right to change them at
|
|
will, and Perl reserves the right to update its implementation to
|
|
Unicode's new rules. In the past, some changes have been because new
|
|
characters have been added to the Standard which have different
|
|
characteristics than all previous characters, so new rules are
|
|
formulated for handling them. These should not cause any backward
|
|
compatibility issues. But some changes have changed the treatment of
|
|
existing characters because the Unicode Technical Committee has decided
|
|
that the change is warranted for whatever reason. This could be to fix
|
|
a bug, or because they think better results are obtained with the new
|
|
rule.
|
|
|
|
It is also important to realize that these are default boundary
|
|
definitions, and that implementations may wish to tailor the results for
|
|
particular purposes and locales. For example, some languages, such as
|
|
Japanese and Thai, require dictionary lookup to accurately determine
|
|
word boundaries.
|
|
|
|
Mnemonic: I<b>oundary.
|
|
|
|
=back
|
|
|
|
=head4 Examples
|
|
|
|
"cat" =~ /\Acat/; # Match.
|
|
"cat" =~ /cat\Z/; # Match.
|
|
"cat\n" =~ /cat\Z/; # Match.
|
|
"cat\n" =~ /cat\z/; # No match.
|
|
|
|
"cat" =~ /\bcat\b/; # Matches.
|
|
"cats" =~ /\bcat\b/; # No match.
|
|
"cat" =~ /\bcat\B/; # No match.
|
|
"cats" =~ /\bcat\B/; # Match.
|
|
|
|
while ("cat dog" =~ /(\w+)/g) {
|
|
print $1; # Prints 'catdog'
|
|
}
|
|
while ("cat dog" =~ /\G(\w+)/g) {
|
|
print $1; # Prints 'cat'
|
|
}
|
|
|
|
my $s = "He said, \"Is pi 3.14? (I'm not sure).\"";
|
|
print join("|", $s =~ m/ ( .+? \b ) /xg), "\n";
|
|
print join("|", $s =~ m/ ( .+? \b{wb} ) /xg), "\n";
|
|
prints
|
|
He| |said|, "|Is| |pi| |3|.|14|? (|I|'|m| |not| |sure
|
|
He| |said|,| |"|Is| |pi| |3.14|?| |(|I'm| |not| |sure|)|.|"
|
|
|
|
=head2 Misc
|
|
|
|
Here we document the backslash sequences that don't fall in one of the
|
|
categories above. These are:
|
|
|
|
=over 4
|
|
|
|
=item \K
|
|
|
|
This appeared in perl 5.10.0. Anything matched left of C<\K> is
|
|
not included in C<$&>, and will not be replaced if the pattern is
|
|
used in a substitution. This lets you write C<s/PAT1 \K PAT2/REPL/x>
|
|
instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.
|
|
|
|
Mnemonic: I<K>eep.
|
|
|
|
=item \N
|
|
|
|
This feature, available starting in v5.12, matches any character
|
|
that is B<not> a newline. It is a short-hand for writing C<[^\n]>, and is
|
|
identical to the C<.> metasymbol, except under the C</s> flag, which changes
|
|
the meaning of C<.>, but not C<\N>.
|
|
|
|
Note that C<\N{...}> can mean a
|
|
L<named or numbered character
|
|
|/Named or numbered characters and character sequences>.
|
|
|
|
Mnemonic: Complement of I<\n>.
|
|
|
|
=item \R
|
|
X<\R>
|
|
|
|
C<\R> matches a I<generic newline>; that is, anything considered a
|
|
linebreak sequence by Unicode. This includes all characters matched by
|
|
C<\v> (vertical whitespace), and the multi character sequence C<"\x0D\x0A">
|
|
(carriage return followed by a line feed, sometimes called the network
|
|
newline; it's the end of line sequence used in Microsoft text files opened
|
|
in binary mode). C<\R> is equivalent to C<< (?>\x0D\x0A|\v) >>. (The
|
|
reason it doesn't backtrack is that the sequence is considered
|
|
inseparable. That means that
|
|
|
|
"\x0D\x0A" =~ /^\R\x0A$/ # No match
|
|
|
|
fails, because the C<\R> matches the entire string, and won't backtrack
|
|
to match just the C<"\x0D">.) Since
|
|
C<\R> can match a sequence of more than one character, it cannot be put
|
|
inside a bracketed character class; C</[\R]/> is an error; use C<\v>
|
|
instead. C<\R> was introduced in perl 5.10.0.
|
|
|
|
Note that this does not respect any locale that might be in effect; it
|
|
matches according to the platform's native character set.
|
|
|
|
Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>,
|
|
and more importantly because Unicode recommends such a regular expression
|
|
metacharacter, and suggests C<\R> as its notation.
|
|
|
|
=item \X
|
|
X<\X>
|
|
|
|
This matches a Unicode I<extended grapheme cluster>.
|
|
|
|
C<\X> matches quite well what normal (non-Unicode-programmer) usage
|
|
would consider a single character. As an example, consider a G with some sort
|
|
of diacritic mark, such as an arrow. There is no such single character in
|
|
Unicode, but one can be composed by using a G followed by a Unicode "COMBINING
|
|
UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
|
|
were a single character.
|
|
|
|
The match is greedy and non-backtracking, so that the cluster is never
|
|
broken up into smaller components.
|
|
|
|
See also L<C<\b{gcb}>|/\b{}, \b, \B{}, \B>.
|
|
|
|
Mnemonic: eI<X>tended Unicode character.
|
|
|
|
=back
|
|
|
|
=head4 Examples
|
|
|
|
$str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
|
|
$str =~ s/(.)\K\g1//g; # Delete duplicated characters.
|
|
|
|
"\n" =~ /^\R$/; # Match, \n is a generic newline.
|
|
"\r" =~ /^\R$/; # Match, \r is a generic newline.
|
|
"\r\n" =~ /^\R$/; # Match, \r\n is a generic newline.
|
|
|
|
"P\x{307}" =~ /^\X$/ # \X matches a P with a dot above.
|
|
|
|
=cut
|