Initial Commit

2025-12-03 16:38:10 +01:00
parent c5e26bf594
commit b732d8d4b5
17680 changed files with 5977495 additions and 2 deletions
--- a/database/perl/lib/pods/perlrebackslash.pod
+++ b/database/perl/lib/pods/perlrebackslash.pod
@@ -0,0 +1,781 @@
+=head1 NAME
+
+perlrebackslash - Perl Regular Expression Backslash Sequences and Escapes
+
+=head1 DESCRIPTION
+
+The top level documentation about Perl regular expressions
+is found in L<perlre>.
+
+This document describes all backslash and escape sequences. After
+explaining the role of the backslash, it lists all the sequences that have
+a special meaning in Perl regular expressions (in alphabetical order),
+then describes each of them.
+
+Most sequences are described in detail in different documents; the primary
+purpose of this document is to have a quick reference guide describing all
+backslash and escape sequences.
+
+=head2 The backslash
+
+In a regular expression, the backslash can perform one of two tasks:
+it either takes away the special meaning of the character following it
+(for instance, C<\|> matches a vertical bar; it is not an alternation),
+or it is the start of a backslash or escape sequence.
+
+The rules determining what it is are quite simple: if the character
+following the backslash is an ASCII punctuation (non-word) character (that is,
+anything that is not a letter, digit, or underscore), then the backslash just
+takes away any special meaning of the character following it.
+
+If the character following the backslash is an ASCII letter or an ASCII digit,
+then the sequence may be special; if so, it's listed below. A few letters have
+not been used yet, so escaping them with a backslash doesn't change them to be
+special.  A future version of Perl may assign a special meaning to them, so if
+you have warnings turned on, Perl issues a warning if you use such a
+sequence [1].
+
+It is however guaranteed that backslash or escape sequences never have a
+punctuation character following the backslash, not now, and not in a future
+version of Perl 5. So it is safe to put a backslash in front of a non-word
+character.
+
+Note that the backslash itself is special; if you want to match a backslash,
+you have to escape the backslash with a backslash: C</\\/> matches a single
+backslash.
+
+=over 4
+
+=item [1]
+
+There is one exception. If you use an alphanumeric character as the
+delimiter of your pattern (which you probably shouldn't do for readability
+reasons), you have to escape the delimiter if you want to match
+it. Perl won't warn then. See also L<perlop/Gory details of parsing
+quoted constructs>.
+
+=back
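+
+As a minimal sketch of these rules (the variable C<$str> and the sample
+string are illustrative only), a backslash in front of a non-word
+character always yields that literal character:
+
+ my $str = 'a|b \ c';                   # literal "|" and one backslash
+ print "bar\n"   if $str =~ /a\|b/;     # \| is a literal vertical bar
+ print "slash\n" if $str =~ /\\/;       # \\ is a single backslash
+ print "space\n" if $str =~ /b\ \\\ c/; # escaping a space is harmless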
+
+
+=head2 All the sequences and escapes
+
+Those not usable within a bracketed character class (like C<[\da-z]>) are marked
+as C<Not in [].>
+
+ \000              Octal escape sequence.  See also \o{}.
+ \1                Absolute backreference.  Not in [].
+ \a                Alarm or bell.
+ \A                Beginning of string.  Not in [].
+ \b{}, \b          Boundary. (\b is a backspace in []).
+ \B{}, \B          Not a boundary.  Not in [].
+ \cX               Control-X.
+ \d                Match any digit character.
+ \D                Match any character that isn't a digit.
+ \e                Escape character.
+ \E                Turn off \Q, \L and \U processing.  Not in [].
+ \f                Form feed.
+ \F                Foldcase till \E.  Not in [].
+ \g{}, \g1         Named, absolute or relative backreference.
+                   Not in [].
+ \G                Pos assertion.  Not in [].
+ \h                Match any horizontal whitespace character.
+ \H                Match any character that isn't horizontal whitespace.
+ \k{}, \k<>, \k''  Named backreference.  Not in [].
+ \K                Keep the stuff left of \K.  Not in [].
+ \l                Lowercase next character.  Not in [].
+ \L                Lowercase till \E.  Not in [].
+ \n                (Logical) newline character.
+ \N                Match any character but newline.  Not in [].
+ \N{}              Named or numbered (Unicode) character or sequence.
+ \o{}              Octal escape sequence.
+ \p{}, \pP         Match any character with the given Unicode property.
+ \P{}, \PP         Match any character without the given property.
+ \Q                Quote (disable) pattern metacharacters till \E.  Not
+                   in [].
+ \r                Return character.
+ \R                Generic new line.  Not in [].
+ \s                Match any whitespace character.
+ \S                Match any character that isn't a whitespace.
+ \t                Tab character.
+ \u                Titlecase next character.  Not in [].
+ \U                Uppercase till \E.  Not in [].
+ \v                Match any vertical whitespace character.
+ \V                Match any character that isn't vertical whitespace.
+ \w                Match any word character.
+ \W                Match any character that isn't a word character.
+ \x{}, \x00        Hexadecimal escape sequence.
+ \X                Unicode "extended grapheme cluster".  Not in [].
+ \z                End of string.  Not in [].
+ \Z                End of string, or just before a trailing newline.
+                   Not in [].
+
+=head2 Character Escapes
+
+=head3 Fixed characters
+
+A handful of characters have a dedicated I<character escape>. The following
+table shows them, along with their ASCII code points (in decimal and hex),
+their ASCII name, the control escape on ASCII platforms and a short
+description.  (For EBCDIC platforms, see L<perlebcdic/OPERATOR DIFFERENCES>.)
+
+ Seq.  Code Point  ASCII   Cntrl   Description.
+       Dec    Hex
+  \a     7     07    BEL    \cG    alarm or bell
+  \b     8     08     BS    \cH    backspace [1]
+  \e    27     1B    ESC    \c[    escape character
+  \f    12     0C     FF    \cL    form feed
+  \n    10     0A     LF    \cJ    line feed [2]
+  \r    13     0D     CR    \cM    carriage return
+  \t     9     09    TAB    \cI    tab
+
+=over 4
+
+=item [1]
+
+C<\b> is the backspace character only inside a character class. Outside a
+character class, C<\b> alone is a word-character/non-word-character
+boundary, and C<\b{}> is some other type of boundary.
+
+=item [2]
+
+C<\n> matches a logical newline. Perl converts between C<\n> and your
+OS's native newline character when reading from or writing to text files.
+
+=back
+
+=head4 Example
+
+ $str =~ /\t/;   # Matches if $str contains a (horizontal) tab.
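+
+The dual role of C<\b> noted in [1] can be seen in a short sketch (the
+variable C<$bs> is just an illustration, and an ASCII platform is assumed):
+
+ my $bs = "a\x08b";                         # "a", BACKSPACE, "b"
+ print "class\n"    if     $bs =~ /a[\b]b/; # in [], \b is a backspace
+ print "boundary\n" unless $bs =~ /a\bb/;   # elsewhere, a word boundary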
+
+=head3 Control characters
+
+C<\c> is used to denote a control character; the character following C<\c>
+determines the value of the construct.  For example the value of C<\cA> is
+C<chr(1)>, and the value of C<\cb> is C<chr(2)>, etc.
+The gory details are in L<perlop/"Regexp Quote-Like Operators">.  A complete
+list of what C<chr(1)>, etc. means for ASCII and EBCDIC platforms is in
+L<perlebcdic/OPERATOR DIFFERENCES>.
+
+Note that C<\c\> alone at the end of a regular expression (or double-quoted
+string) is not valid.  The backslash must be followed by another character.
+That is, C<\c\I<X>> means C<chr(28) . 'I<X>'> for all characters I<X>.
+
+To write platform-independent code, you must use C<\N{I<NAME>}> instead, like
+C<\N{ESCAPE}> or C<\N{U+001B}>; see L<charnames>.
+
+Mnemonic: I<c>ontrol character.
+
+=head4 Example
+
+ $str =~ /\cK/;  # Matches if $str contains a vertical tab (control-K).
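+
+A further sketch, assuming an ASCII platform for the first line; the
+second line uses the portable C<\N{U+...}> spelling recommended above:
+
+ print "ctrl-A\n" if "\cA" eq chr(1);        # \cA is chr(1) on ASCII
+ print "escape\n" if "\e"  =~ /\N{U+001B}/;  # ESC, written portably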
+
+=head3 Named or numbered characters and character sequences
+
+Unicode characters have a Unicode name and numeric code point (ordinal)
+value.  Use the
+C<\N{}> construct to specify a character by either of these values.
+Certain sequences of characters also have names.
+
+To specify by name, the name of the character or character sequence goes
+between the curly braces.
+
+To specify a character by Unicode code point, use the form C<\N{U+I<code
+point>}>, where I<code point> is a number in hexadecimal that gives the
+code point that Unicode has assigned to the desired character.  It is
+customary but not required to use leading zeros to pad the number to 4
+digits.  Thus C<\N{U+0041}> means C<LATIN CAPITAL LETTER A>, and you will
+rarely see it written without the two leading zeros.  C<\N{U+0041}> means
+"A" even on EBCDIC machines (where the ordinal value of "A" is not 0x41).
+
+It is even possible to give your own names to characters and character
+sequences by using the L<charnames> module.  These custom names are
+lexically scoped, and so a given code point may have different names
+in different scopes.  The name used is what is in effect at the time the
+C<\N{}> is expanded.  For patterns in double-quotish context, that means
+at the time the pattern is parsed.  But for patterns that are delimited
+by single quotes, the expansion is deferred until pattern compilation
+time, which may very well have a different C<charnames> translator in
+effect.
+
+(There is an expanded internal form that you may see in debug output:
+C<\N{U+I<code point>.I<code point>...}>.
+The C<...> means any number of these I<code point>s separated by dots.
+This represents the sequence formed by the characters.  This is an internal
+form only, subject to change, and you should not try to use it yourself.)
+
+Mnemonic: I<N>amed character.
+
+Note that a character or character sequence expressed as a named
+or numbered character is considered a character without special
+meaning by the regex engine, and will match "as is".
+
+=head4 Example
+
+ $str =~ /\N{THAI CHARACTER SO SO}/;  # Matches the Thai SO SO character
+
+ use charnames 'Cyrillic';            # Loads Cyrillic names.
+ $str =~ /\N{ZHE}\N{KA}/;             # Match "ZHE" followed by "KA".
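+
+The following sketch shows that the number of leading zeros is
+immaterial, and that a named or numbered character is always matched
+literally.  It assumes the full Unicode names have been loaded with
+C<use charnames ':full'>, and the test strings are illustrative only:
+
+ use charnames ':full';
+ print "zeros\n"   if "A"  =~ /^\N{U+41}\z/;       # same as \N{U+0041}
+ print "name\n"    if "A"  =~ /^\N{LATIN CAPITAL LETTER A}\z/;
+ print "literal\n" if "a+" =~ /^a\N{PLUS SIGN}\z/; # "+" is no quantifier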
+
+=head3 Octal escapes
+
+There are two forms of octal escapes.  Each is used to specify a character by
+its code point specified in octal notation.
+
+One form, available starting in Perl 5.14, looks like C<\o{...}>, where the dots
+represent one or more octal digits.  It can be used for any Unicode character.
+
+It was introduced to avoid the potential problems with the other form,
+available in all Perls.  That form consists of a backslash followed by three
+octal digits.  One problem with this form is that it can look exactly like an
+old-style backreference (see
+L</Disambiguation rules between old-style octal escapes and backreferences>
+below.)  You can avoid this by making the first of the three digits always a
+zero, but that makes \077 the largest code point specifiable.
+
+In some contexts, a backslash followed by two octal digits, or even one, may be
+interpreted as an octal escape, sometimes with a warning, and because of some
+bugs, sometimes with surprising results.  Also, if you are creating a regex
+out of smaller snippets concatenated together, and you use fewer than three
+digits, the beginning of one snippet may be interpreted as adding digits to the
+ending of the snippet before it.  See L</Absolute referencing> for more
+discussion and examples of the snippet problem.
+
+Note that a character expressed as an octal escape is considered
+a character without special meaning by the regex engine, and will match
+"as is".
+
+To summarize, the C<\o{}> form is always safe to use, and the other form is
+safe to use for code points through \077 when you use exactly three digits to
+specify them.
+
+Mnemonic: I<0>ctal or I<o>ctal.
+
+=head4 Examples (assuming an ASCII platform)
+
+ $str = "Perl";
+ $str =~ /\o{120}/;  # Match, "\120" is "P".
+ $str =~ /\120/;     # Same.
+ $str =~ /\o{120}+/; # Match, "\120" is "P",
+                     # it's repeated at least once.
+ $str =~ /\120+/;    # Same.
+ $str =~ /P\053/;    # No match, "\053" is "+" and taken literally.
+ /\o{23073}/         # Black foreground, white background smiling face.
+ /\o{4801234567}/    # Raises a warning, and yields chr(4).
+
+=head4 Disambiguation rules between old-style octal escapes and backreferences
+
+Octal escapes of the C<\000> form outside of bracketed character classes
+potentially clash with old-style backreferences (see L</Absolute referencing>
+below).  They both consist of a backslash followed by numbers.  So Perl has to
+use heuristics to determine whether it is a backreference or an octal escape.
+Perl uses the following rules to disambiguate:
+
+=over 4
+
+=item 1
+
+If the backslash is followed by a single digit, it's a backreference.
+
+=item 2
+
+If the first digit following the backslash is a 0, it's an octal escape.
+
+=item 3
+
+If the number following the backslash is N (in decimal), and Perl already
+has seen N capture groups, Perl considers this a backreference.  Otherwise,
+it considers it an octal escape. If N has more than three digits, Perl
+takes only the first three for the octal escape; the rest are matched as is.
+
+ my $pat  = "(" x 999;
+    $pat .= "a";
+    $pat .= ")" x 999;
+ /^($pat)\1000$/;   #  Matches 'aa'; there are 1000 capture groups.
+ /^$pat\1000$/;     #  Matches 'a@0'; there are 999 capture groups
+                    #  and \1000 is seen as \100 (a '@') and a '0'.
+
+=back
+
+You can force a backreference interpretation always by using the C<\g{...}>
+form.  You can force an octal interpretation always by using the C<\o{...}>
+form, or for numbers up through \077 (= 63 decimal), by using three digits,
+beginning with a "0".
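+
+The three rules can be sketched as follows (assuming an ASCII platform,
+where C<\011> is a tab and C<\111> is "I"; the test strings are
+illustrative only):
+
+ print "rule 1\n" if "aa"  =~ /^(a)\1$/;      # single digit: backreference
+ print "rule 2\n" if "a\t" =~ /^(a)\011$/;    # leading 0: octal escape
+ print "rule 3\n" if "aI"  =~ /^(a)\111$/;    # only 1 group: octal "I"
+ print "forced\n" if "aI"  =~ /^(a)\o{111}$/; # \o{} is always octal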
+
+=head3 Hexadecimal escapes
+
+Like octal escapes, there are two forms of hexadecimal escapes, but both start
+with the sequence C<\x>.  This is followed by either exactly two hexadecimal
+digits forming a number, or a hexadecimal number of arbitrary length surrounded
+by curly braces. The hexadecimal number is the code point of the character you
+want to express.
+
+Note that a character expressed as one of these escapes is considered a
+character without special meaning by the regex engine, and will match
+"as is".
+
+Mnemonic: heI<x>adecimal.
+
+=head4 Examples (assuming an ASCII platform)
+
+ $str = "Perl";
+ $str =~ /\x50/;    # Match, "\x50" is "P".
+ $str =~ /\x50+/;   # Match, "\x50" is "P", it is repeated at least once.
+ $str =~ /P\x2B/;   # No match, "\x2B" is "+" and taken literally.
+
+ /\x{2603}\x{2602}/ # Snowman with an umbrella.
+                    # The Unicode character 2603 is a snowman,
+                    # the Unicode character 2602 is an umbrella.
+ /\x{263B}/         # Black smiling face.
+ /\x{263b}/         # Same, the hex digits A - F are case insensitive.
+
+=head2 Modifiers
+
+A number of backslash sequences have to do with changing the character,
+or characters following them. C<\l> will lowercase the character following
+it, while C<\u> will uppercase (or, more accurately, titlecase) the
+character following it. They provide functionality similar to the
+functions C<lcfirst> and C<ucfirst>.
+
+To uppercase or lowercase several characters, one might want to use
+C<\L> or C<\U>, which will lowercase/uppercase all characters following
+them, until either the end of the pattern or the next occurrence of
+C<\E>, whichever comes first. They provide functionality similar to what
+the functions C<lc> and C<uc> provide.
+
+C<\Q> is used to quote (disable) pattern metacharacters, up to the next
+C<\E> or the end of the pattern. C<\Q> adds a backslash to any character
+that could have special meaning to Perl.  In the ASCII range, it quotes
+every character that isn't a letter, digit, or underscore.  See
+L<perlfunc/quotemeta> for details on what gets quoted for non-ASCII
+code points.  Using this ensures that any character between C<\Q> and
+C<\E> will be matched literally, not interpreted as a metacharacter by
+the regex engine.
+
+C<\F> can be used to casefold all characters following, up to the next C<\E>
+or the end of the pattern. It provides the functionality similar to
+the C<fc> function.
+
+Mnemonic: I<L>owercase, I<U>ppercase, I<F>old-case, I<Q>uotemeta, I<E>nd.
+
+=head4 Examples
+
+ $sid     = "sid";
+ $greg    = "GrEg";
+ $miranda = "(Miranda)";
+ $str     =~ /\u$sid/;        # Matches 'Sid'
+ $str     =~ /\L$greg/;       # Matches 'greg'
+ $str     =~ /\Q$miranda\E/;  # Matches '(Miranda)', as if the pattern
+                              #   had been written as /\(Miranda\)/
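+
+A common use of C<\Q> is matching interpolated text literally.  In this
+sketch the variable C<$input> is hypothetical:
+
+ my $input = "1+1=2";
+ print "quoted\n"   if     "1+1=2" =~ /^\Q$input\E$/; # "+" is literal
+ print "unquoted\n" unless "1+1=2" =~ /^$input$/;     # "+" quantifies "1"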
+
+=head2 Character classes
+
+Perl regular expressions have a large range of character classes. Some of
+the character classes are written as a backslash sequence. We will briefly
+discuss those here; full details of character classes can be found in
+L<perlrecharclass>.
+
+C<\w> is a character class that matches any single I<word> character
+(letters, digits, Unicode marks, and connector punctuation (like the
+underscore)).  C<\d> is a character class that matches any decimal
+digit, while the character class C<\s> matches any whitespace character.
+New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
+and vertical whitespace characters.
+
+The exact set of characters matched by C<\d>, C<\s>, and C<\w> varies
+depending on various pragma and regular expression modifiers.  It is
+possible to restrict the match to the ASCII range by using the C</a>
+regular expression modifier.  See L<perlrecharclass>.
+
+The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
+character classes that match, respectively, any character that isn't a
+word character, digit, whitespace, horizontal whitespace, or vertical
+whitespace.
+
+Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.
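+
+As an illustrative sketch (the sample line and variable names are
+assumptions), several of these classes can be combined to pick apart a
+simple C<name = value> line, and the C</a> modifier can be seen
+restricting C<\d> to the ASCII digits:
+
+ my $line = "id_42\t= 7\n";
+ my ($name, $value) = $line =~ /^(\w+)\h*=\h*(\d+)\v\z/;
+ print "$name=$value\n";                       # prints "id_42=7"
+
+ my $digit = "\x{0663}";                       # ARABIC-INDIC DIGIT THREE
+ print "unicode\n" if     $digit =~ /^\d\z/;   # \d matches it by default
+ print "ascii\n"   unless $digit =~ /^\d\z/a;  # /a restricts \d to 0-9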
+
+=head3 Unicode classes
+
+C<\pP> (where C<P> is a single letter) and C<\p{Property}> are used to
+match a character that matches the given Unicode property; properties
+include things like "letter", or "thai character". Capitalizing the
+sequence to C<\PP> and C<\P{Property}> makes the sequence match a character
+that doesn't match the given Unicode property. For more details, see
+L<perlrecharclass/Backslash sequences> and
+L<perlunicode/Unicode Character Properties>.
+
+Mnemonic: I<p>roperty.
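+
+A short sketch of both spellings and of the negated form.  The property
+names C<L> (letter), C<Lu> (uppercase letter) and C<Greek> (script) are
+standard Unicode properties; the test strings are illustrative only:
+
+ print "letter\n" if "q"       =~ /^\pL\z/;        # single-letter form
+ print "upper\n"  if "Q"       =~ /^\p{Lu}\z/;     # braced form
+ print "greek\n"  if "\x{3B1}" =~ /^\p{Greek}\z/;  # GREEK SMALL LETTER ALPHA
+ print "negate\n" if "7"       =~ /^\PL\z/;        # a digit is not a letter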
+
+=head2 Referencing
+
+If capturing parentheses are used in a regular expression, we can refer
+to the part of the source string that was matched, and match exactly the
+same thing. There are three ways of referring to such a I<backreference>:
+absolutely, relatively, and by name.
+
+=for later add link to perlrecapture
+
+=head3 Absolute referencing
+
+Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style) where I<N>
+is a positive (unsigned) decimal number of any length is an absolute reference
+to a capturing group.
+
+I<N> refers to the Nth set of parentheses, so C<\gI<N>> refers to whatever has
+been matched by that set of parentheses.  Thus C<\g1> refers to the first
+capture group in the regex.
+
+The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
+which avoids ambiguity when building a regex by concatenating shorter
+strings.  Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained
+C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is
+probably not what you intended.
+
+In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at
+least I<N> capturing groups, or else I<N> is considered an octal escape
+(but something like C<\18> is the same as C<\0018>; that is, the octal escape
+C<"\001"> followed by a literal digit C<"8">).
+
+Mnemonic: I<g>roup.
+
+=head4 Examples
+
+ /(\w+) \g1/;    # Finds a duplicated word, (e.g. "cat cat").
+ /(\w+) \1/;     # Same thing; written old-style.
+ /(.)(.)\g2\g1/;  # Match a four letter palindrome (e.g. "ABBA").
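+
+The concatenation problem described above can be sketched like this (the
+variable names are illustrative).  The braced form keeps the group
+number separate from the digits that follow, whereas the unbraced form
+turns into a reference to group 137, which is assumed here to fail at
+pattern compilation because no such group exists:
+
+ my $digits = '37';
+ my $braced = '(\w)\g{1}';
+ print "braced\n" if "aa37" =~ /^$braced$digits$/;
+
+ my $bare = '(\w)\g1';
+ my $re   = eval { qr/^$bare$digits$/ };  # becomes \g137: no such group
+ print "unbraced failed to compile\n" unless defined $re;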
+
+
+=head3 Relative referencing
+
+C<\g-I<N>> (starting in Perl 5.10.0) is used for relative addressing.  (It can
+be written as C<\g{-I<N>}>.)  It refers to the I<N>th group before the
+C<\g{-I<N>}>.
+
+The big advantage of this form is that it makes it much easier to write
+patterns with references that can be interpolated in larger patterns,
+even if the larger pattern also contains capture groups.
+
+=head4 Examples
+
+ /(A)        # Group 1
+  (          # Group 2
+    (B)      # Group 3
+    \g{-1}   # Refers to group 3 (B)
+    \g{-3}   # Refers to group 1 (A)
+  )
+ /x;         # Matches "ABBA".
+
+ my $qr = qr /(.)(.)\g{-2}\g{-1}/;  # Matches 'abab', 'cdcd', etc.
+ /$qr$qr/                           # Matches 'ababcdcd'.
+
+=head3 Named referencing
+
+C<\g{I<name>}> (starting in Perl 5.10.0) can be used to back refer to a
+named capture group, dispensing completely with having to think about capture
+buffer positions.
+
+To be compatible with .NET regular expressions, C<\g{name}> may also be
+written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.
+
+To prevent any ambiguity, I<name> must not start with a digit nor contain a
+hyphen.
+
+=head4 Examples
+
+ /(?<word>\w+) \g{word}/ # Finds duplicated word, (e.g. "cat cat")
+ /(?<word>\w+) \k{word}/ # Same.
+ /(?<word>\w+) \k<word>/ # Same.
+ /(?<letter1>.)(?<letter2>.)\g{letter2}\g{letter1}/
+                         # Match a four letter palindrome (e.g. "ABBA")
+
+=head2 Assertions
+
+Assertions are conditions that have to be true; they don't actually
+match parts of the substring. There are six assertions that are written as
+backslash sequences.
+
+=over 4
+
+=item \A
+
+C<\A> only matches at the beginning of the string. If the C</m> modifier
+isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m>
+modifier is used, then C</^/> matches internal newlines, but the meaning
+of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at the beginning
+of the string regardless of whether the C</m> modifier is used.
+
+=item \z, \Z
+
+C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't
+used, then C</\Z/> is equivalent to C</$/>; that is, it matches at the
+end of the string, or just before the newline at the end of the string. If the
+C</m> modifier is used, then C</$/> matches at internal newlines, but the
+meaning of C</\Z/> isn't changed by the C</m> modifier. C<\Z> matches at
+the end of the string (or just before a trailing newline) regardless of whether
+the C</m> modifier is used.
+
+C<\z> is just like C<\Z>, except that it does not match before a trailing
+newline. C<\z> matches at the end of the string only, regardless of the
+modifiers used, and not just before a newline.  It is how to anchor the
+match to the true end of the string under all conditions.
+
+=item \G
+
+C<\G> is usually used only in combination with the C</g> modifier. If the
+C</g> modifier is used and the match is done in scalar context, Perl
+remembers where in the source string the last match ended, and the next time,
+it will start the match from where it ended the previous time.
+
+C<\G> matches the point where the previous match on that string ended,
+or the beginning of that string if there was no previous match.
+
+=for later add link to perlremodifiers
+
+Mnemonic: I<G>lobal.
+
+=item \b{}, \b, \B{}, \B
+
+C<\b{...}>, available starting in v5.22, matches a boundary (between two
+characters, or before the first character of the string, or after the
+final character of the string) based on the Unicode rules for the
+boundary type specified inside the braces.  The boundary
+types are given a few paragraphs below.  C<\B{...}> matches at any place
+between characters where C<\b{...}> of the same type doesn't match.
+
+C<\b> when not immediately followed by a C<"{"> matches at any place
+between a word (something matched by C<\w>) and a non-word character
+(C<\W>); C<\B> when not immediately followed by a C<"{"> matches at any
+place between characters where C<\b> doesn't match.  To get better
+word matching of natural language text, see L</\b{wb}> below.
+
+C<\b>
+and C<\B> assume there's a non-word character before the beginning and after
+the end of the source string; so C<\b> will match at the beginning (or end)
+of the source string if the source string begins (or ends) with a word
+character. Otherwise, C<\B> will match.
+
+Do not use something like C<\b=head\d\b> and expect it to match the
+beginning of a line.  It can't, because for there to be a boundary before
+the non-word "=", there must be a word character immediately previous.
+All plain C<\b> and C<\B> boundary determinations look for word
+characters alone, not for
+non-word characters nor for string ends.  It may help to understand how
+C<\b> and C<\B> work by equating them as follows:
+
+    \b	really means	(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
+    \B	really means	(?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
+
+In contrast, C<\b{...}> and C<\B{...}> may or may not match at the
+beginning and end of the line, depending on the boundary type.  These
+implement the Unicode default boundaries, specified in
+L<https://www.unicode.org/reports/tr14/> and
+L<https://www.unicode.org/reports/tr29/>.
+The boundary types are:
+
+=over
+
+=item C<\b{gcb}> or C<\b{g}>
+
+This matches a Unicode "Grapheme Cluster Boundary".  (Actually Perl
+always uses the improved "extended" grapheme cluster.)  These are
+explained below under L</C<\X>>.  In fact, C<\X> is another way to get
+the same functionality.  It is equivalent to C</.+?\b{gcb}/>.  Use
+whichever is most convenient for your situation.
+
+=item C<\b{lb}>
+
+This matches according to the default Unicode Line Breaking Algorithm
+(L<https://www.unicode.org/reports/tr14/>), as customized in that
+document
+(L<Example 7 of revision 35|https://www.unicode.org/reports/tr14/tr14-35.html#Example7>)
+for better handling of numeric expressions.
+
+This is suitable for many purposes, but the L<Unicode::LineBreak> module,
+available on CPAN, provides many more features, including
+customization.
+
+=item C<\b{sb}>
+
+This matches a Unicode "Sentence Boundary".  This is an aid to parsing
+natural language sentences.  It gives good, but imperfect results.  For
+example, it thinks that "Mr. Smith" is two sentences.  More details are
+at L<https://www.unicode.org/reports/tr29/>.  Note also that it thinks
+that anything matching L</\R> (except form feed and vertical tab) is a
+sentence boundary.  C<\b{sb}> works with text designed for
+word-processors which wrap lines
+automatically for display, but hard-coded line boundaries are considered
+to be essentially the ends of text blocks (paragraphs really), and hence
+the ends of sentences.  C<\b{sb}> doesn't do well with text containing
+embedded newlines, like the source text of the document you are reading.
+Such text needs to be preprocessed to get rid of the line separators
+before looking for sentence boundaries.  Some people view this as a bug
+in the Unicode standard, and this behavior is quite subject to change in
+future Perl versions.
+
+=item C<\b{wb}>
+
+This matches a Unicode "Word Boundary", but tailored to Perl
+expectations.  This gives better (though not
+perfect) results for natural language processing than plain C<\b>
+(without braces) does.  For example, it understands that apostrophes can
+be in the middle of words and that parentheses aren't (see the examples
+below).  More details are at L<https://www.unicode.org/reports/tr29/>.
+
+The current Unicode definition of a Word Boundary matches between every
+white space character.  Perl tailors this, starting in version 5.24, to
+generally not break up spans of white space, just as plain C<\b> has
+always functioned.  This allows C<\b{wb}> to be a drop-in replacement for
+C<\b>, but with generally better results for natural language
+processing.  (The exception to this tailoring is when a span of white
+space is immediately followed by something like U+0303, COMBINING TILDE.
+If the final space character in the span is a horizontal white space, it
+is broken out so that it attaches instead to the combining character.
+To be precise, if a span of white space that ends in a horizontal space
+has the character immediately following it have any of the Word
+Boundary property values "Extend", "Format" or "ZWJ", the boundary between the
+final horizontal space character and the rest of the span matches
+C<\b{wb}>.  In all other cases the boundary between two white space
+characters matches C<\B{wb}>.)
+
+=back
+
+It is important to realize when you use these Unicode boundaries,
+that you are taking a risk that a future version of Perl which contains
+a later version of the Unicode Standard will not work precisely the same
+way as it did when your code was written.  These rules are not
+considered stable and have been somewhat more subject to change than the
+rest of the Standard.  Unicode reserves the right to change them at
+will, and Perl reserves the right to update its implementation to
+Unicode's new rules.  In the past, some changes have been because new
+characters have been added to the Standard which have different
+characteristics than all previous characters, so new rules are
+formulated for handling them.  These should not cause any backward
+compatibility issues.  But some changes have changed the treatment of
+existing characters because the Unicode Technical Committee has decided
+that the change is warranted for whatever reason.  This could be to fix
+a bug, or because they think better results are obtained with the new
+rule.
+
+It is also important to realize that these are default boundary
+definitions, and that implementations may wish to tailor the results for
+particular purposes and locales.  For example, some languages, such as
+Japanese and Thai, require dictionary lookup to accurately determine
+word boundaries.
+
+Mnemonic: I<b>oundary.
+
+=back
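+
+The lookaround equivalence given for plain C<\b> above can be checked
+with a small sketch.  The variable names and sample strings are
+illustrative; the loop records the offset of every zero-length match of
+each form and compares the two lists:
+
+ my $long_b = qr/(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))/;
+ for my $s ("cat", "=head1 NAME", "a  b") {
+     my (@short, @long);
+     push @short, $-[0] while $s =~ /\b/g;        # offsets of \b
+     push @long,  $-[0] while $s =~ /$long_b/g;   # offsets of long form
+     print "$s: ", ("@short" eq "@long" ? "same" : "different"), "\n";
+ }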
+
+=head4 Examples
+
+  "cat"   =~ /\Acat/;     # Match.
+  "cat"   =~ /cat\Z/;     # Match.
+  "cat\n" =~ /cat\Z/;     # Match.
+  "cat\n" =~ /cat\z/;     # No match.
+
+  "cat"   =~ /\bcat\b/;   # Matches.
+  "cats"  =~ /\bcat\b/;   # No match.
+  "cat"   =~ /\bcat\B/;   # No match.
+  "cats"  =~ /\bcat\B/;   # Match.
+
+  while ("cat dog" =~ /(\w+)/g) {
+      print $1;           # Prints 'catdog'
+  }
+  while ("cat dog" =~ /\G(\w+)/g) {
+      print $1;           # Prints 'cat'
+  }
+
+  my $s = "He said, \"Is pi 3.14? (I'm not sure).\"";
+  print join("|", $s =~ m/ ( .+? \b     ) /xg), "\n";
+  print join("|", $s =~ m/ ( .+? \b{wb} ) /xg), "\n";
+ prints
+  He| |said|, "|Is| |pi| |3|.|14|? (|I|'|m| |not| |sure
+  He| |said|,| |"|Is| |pi| |3.14|?| |(|I'm| |not| |sure|)|.|"
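+
+Two further sketches, using made-up sample strings.  The first contrasts
+C<^> with C<\A> under C</m>; the second uses C<\G> together with C</gc>
+to tokenize a string from left to right without skipping anything:
+
+ my $text  = "one\ntwo\n";
+ my @caret = $text =~ /^(\w+)/mg;    # ("one", "two")
+ my @start = $text =~ /\A(\w+)/mg;   # ("one") only
+ print scalar(@caret), " ", scalar(@start), "\n";   # prints "2 1"
+
+ my $src = "12+34";
+ my @tokens;
+ push @tokens, $1 while $src =~ /\G(\d+|\+)/gc;
+ print "@tokens\n";                  # prints "12 + 34"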
+
+=head2 Misc
+
+Here we document the backslash sequences that don't fall in one of the
+categories above. These are:
+
+=over 4
+
+=item \K
+
+This appeared in perl 5.10.0. Anything matched left of C<\K> is
+not included in C<$&>, and will not be replaced if the pattern is
+used in a substitution. This lets you write C<s/PAT1 \K PAT2/REPL/x>
+instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.
+
+Mnemonic: I<K>eep.
+
+=item \N
+
+This feature, available starting in v5.12, matches any character
+that is B<not> a newline.  It is a short-hand for writing C<[^\n]>, and is
+identical to the C<.> metasymbol, except under the C</s> flag, which changes
+the meaning of C<.>, but not C<\N>.
+
+Note that C<\N{...}> can mean a
+L<named or numbered character
+|/Named or numbered characters and character sequences>.
+
+Mnemonic: Complement of I<\n>.
+
+=item \R
+X<\R>
+
+C<\R> matches a I<generic newline>; that is, anything considered a
+linebreak sequence by Unicode. This includes all characters matched by
+C<\v> (vertical whitespace), and the multi character sequence C<"\x0D\x0A">
+(carriage return followed by a line feed, sometimes called the network
+newline; it's the end of line sequence used in Microsoft text files opened
+in binary mode). C<\R> is equivalent to C<< (?>\x0D\x0A|\v) >>.  (The
+reason it doesn't backtrack is that the sequence is considered
+inseparable.  That means that
+
+ "\x0D\x0A" =~ /^\R\x0A$/   # No match
+
+fails, because the C<\R> matches the entire string, and won't backtrack
+to match just the C<"\x0D">.)  Since
+C<\R> can match a sequence of more than one character, it cannot be put
+inside a bracketed character class; C</[\R]/> is an error; use C<\v>
+instead.  C<\R> was introduced in perl 5.10.0.
+
+Note that this does not respect any locale that might be in effect; it
+matches according to the platform's native character set.
+
+Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>,
+and more importantly because Unicode recommends such a regular expression
+metacharacter, and suggests C<\R> as its notation.
+
+=item \X
+X<\X>
+
+This matches a Unicode I<extended grapheme cluster>.
+
+C<\X> matches quite well what normal (non-Unicode-programmer) usage
+would consider a single character.  As an example, consider a G with some sort
+of diacritic mark, such as an arrow.  There is no such single character in
+Unicode, but one can be composed by using a G followed by a Unicode "COMBINING
+UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
+were a single character.
+
+The match is greedy and non-backtracking, so that the cluster is never
+broken up into smaller components.
+
+See also L<C<\b{gcb}>|/\b{}, \b, \B{}, \B>.
+
+Mnemonic: eI<X>tended Unicode character.
+
+=back
+
+=head4 Examples
+
+ $str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
+ $str =~ s/(.)\K\g1//g;    # Delete duplicated characters.
+
+ "\n"   =~ /^\R$/;         # Match, \n   is a generic newline.
+ "\r"   =~ /^\R$/;         # Match, \r   is a generic newline.
+ "\r\n" =~ /^\R$/;         # Match, \r\n is a generic newline.
+
+ "P\x{307}" =~ /^\X$/     # \X matches a P with a dot above.
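+
+A few more sketches for the sequences in this section; the sample
+strings and variable names are illustrative only:
+
+ print "dot\n" if     "a\nb" =~ /^a.b\z/s;   # /s lets . match a newline
+ print "N\n"   unless "a\nb" =~ /^a\Nb\z/s;  # \N still refuses to
+
+ my @lines = split /\R/, "a\r\nb\nc\rd";     # CRLF counts as one break
+ print scalar(@lines), "\n";                 # prints "4"
+
+ my @clusters = "e\x{301}x" =~ /\X/g;        # "e" + COMBINING ACUTE, "x"
+ print scalar(@clusters), "\n";              # prints "2"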
+
+=cut