Initial Commit

2025-12-03 16:38:10 +01:00
parent c5e26bf594
commit b732d8d4b5
17680 changed files with 5977495 additions and 2 deletions
--- a/database/perl/lib/pods/perlreguts.pod
+++ b/database/perl/lib/pods/perlreguts.pod
@@ -0,0 +1,913 @@
+=head1 NAME
+
+perlreguts - Description of the Perl regular expression engine.
+
+=head1 DESCRIPTION
+
+This document is an attempt to shine some light on the guts of the regex
+engine and how it works. The regex engine represents a significant chunk
+of the perl codebase, but is relatively poorly understood. This document
+is a meagre attempt at addressing this situation. It is derived from the
+author's experience, comments in the source code, other papers on the
+regex engine, feedback on the perl5-porters mail list, and no doubt other
+places as well.
+
+B<NOTICE!> It should be clearly understood that the behavior and
+structures discussed in this represents the state of the engine as the
+author understood it at the time of writing. It is B<NOT> an API
+definition, it is purely an internals guide for those who want to hack
+the regex engine, or understand how the regex engine works. Readers of
+this document are expected to understand perl's regex syntax and its
+usage in detail. If you want to learn about the basics of Perl's
+regular expressions, see L<perlre>. And if you want to replace the
+regex engine with your own, see L<perlreapi>.
+
+=head1 OVERVIEW
+
+=head2 A quick note on terms
+
+There is some debate as to whether to say "regexp" or "regex". In this
+document we will use the term "regex" unless there is a special reason
+not to, in which case we will explain why.
+
+When speaking about regexes we need to distinguish between their source
+code form and their internal form. In this document we will use the term
+"pattern" when we speak of their textual, source code form, and the term
+"program" when we speak of their internal representation. These
+correspond to the terms I<S-regex> and I<B-regex> that Mark Jason
+Dominus employs in his paper on "Rx" ([1] in L</REFERENCES>).
+
+=head2 What is a regular expression engine?
+
+A regular expression engine is a program that takes a set of constraints
+specified in a mini-language, and then applies those constraints to a
+target string, and determines whether or not the string satisfies the
+constraints. See L<perlre> for a full definition of the language.
+
+In less grandiose terms, the first part of the job is to turn a pattern into
+something the computer can efficiently use to find the matching point in
+the string, and the second part is performing the search itself.
+
+To do this we need to produce a program by parsing the text. We then
+need to execute the program to find the point in the string that
+matches. And we need to do the whole thing efficiently.
+
+=head2 Structure of a Regexp Program
+
+=head3 High Level
+
+Although it is a bit confusing and some people object to the terminology, it
+is worth taking a look at a comment that has
+been in F<regexp.h> for years:
+
+I<This is essentially a linear encoding of a nondeterministic
+finite-state machine (aka syntax charts or "railroad normal form" in
+parsing technology).>
+
+The term "railroad normal form" is a bit esoteric, with "syntax
+diagram/charts", or "railroad diagram/charts" being more common terms.
+Nevertheless it provides a useful mental image of a regex program: each
+node can be thought of as a unit of track, with a single entry and in
+most cases a single exit point (there are pieces of track that fork, but
+statistically not many), and the whole forms a layout with a
+single entry and single exit point. The matching process can be thought
+of as a car that moves along the track, with the particular route through
+the system being determined by the character read at each possible
+connector point. A car can fall off the track at any point but it may
+only proceed as long as it matches the track.
+
+Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as the
+following chart:
+
+                      [start]
+                         |
+                       <foo>
+                         |
+                   +-----+-----+
+                   |     |     |
+                 <\w+> <\d+> <\s+>
+                   |     |     |
+                   +-----+-----+
+                         |
+                       <bar>
+                         |
+                       [end]
+
+The truth of the matter is that perl's regular expressions these days are
+much more complex than this kind of structure, but visualising it this way
+can help when trying to get your bearings, and it matches the
+current implementation pretty closely.
+
+To be more precise, we will say that a regex program is an encoding
+of a graph. Each node in the graph corresponds to part of
+the original regex pattern, such as a literal string or a branch,
+and has a pointer to the nodes representing the next component
+to be matched. Since "node" and "opcode" already have other meanings in the
+perl source, we will call the nodes in a regex program "regops".
+
+The program is represented by an array of C<regnode> structures, one or
+more of which represent a single regop of the program. Struct
+C<regnode> is the smallest struct needed, and has a field structure which is
+shared with all the other larger structures.  (Outside this document, the term
+"regnode" is sometimes used to mean "regop", which could be confusing.)
+
+The "next" pointers of all regops except C<BRANCH> implement concatenation;
+a "next" pointer with a C<BRANCH> on both ends of it is connecting two
+alternatives.  [Here we have one of the subtle syntax dependencies: an
+individual C<BRANCH> (as opposed to a collection of them) is never
+concatenated with anything because of operator precedence.]
+
+The operand of some types of regop is a literal string; for others,
+it is a regop leading into a sub-program.  In particular, the operand
+of a C<BRANCH> node is the first regop of the branch.
+
+B<NOTE>: As the railroad metaphor suggests, this is B<not> a tree
+structure:  the tail of the branch connects to the thing following the
+set of C<BRANCH>es.  It is a like a single line of railway track that
+splits as it goes into a station or railway yard and rejoins as it comes
+out the other side.
+
+=head3 Regops
+
+The base structure of a regop is defined in F<regexp.h> as follows:
+
+    struct regnode {
+        U8  flags;    /* Various purposes, sometimes overridden */
+        U8  type;     /* Opcode value as specified by regnodes.h */
+        U16 next_off; /* Offset in size regnode */
+    };
+
+Other larger C<regnode>-like structures are defined in F<regcomp.h>. They
+are almost like subclasses in that they have the same fields as
+C<regnode>, with possibly additional fields following in
+the structure, and in some cases the specific meaning (and name)
+of some of base fields are overridden. The following is a more
+complete description.
+
+=over 4
+
+=item C<regnode_1>
+
+=item C<regnode_2>
+
+C<regnode_1> structures have the same header, followed by a single
+four-byte argument; C<regnode_2> structures contain two two-byte
+arguments instead:
+
+    regnode_1                U32 arg1;
+    regnode_2                U16 arg1;  U16 arg2;
+
+=item C<regnode_string>
+
+C<regnode_string> structures, used for literal strings, follow the header
+with a one-byte length and then the string data. Strings are padded on
+the tail end with zero bytes so that the total length of the node is a
+multiple of four bytes:
+
+    regnode_string           char string[1];
+                             U8 str_len; /* overrides flags */
+
+=item C<regnode_charclass>
+
+Bracketed character classes are represented by C<regnode_charclass>
+structures, which have a four-byte argument and then a 32-byte (256-bit)
+bitmap indicating which characters in the Latin1 range are included in
+the class.
+
+    regnode_charclass        U32 arg1;
+                             char bitmap[ANYOF_BITMAP_SIZE];
+
+Various flags whose names begin with C<ANYOF_> are used for special
+situations.  Above Latin1 matches and things not known until run-time
+are stored in L</Perl's pprivate structure>.
+
+=item C<regnode_charclass_posixl>
+
+There is also a larger form of a char class structure used to represent
+POSIX char classes under C</l> matching,
+called C<regnode_charclass_posixl> which has an
+additional 32-bit bitmap indicating which POSIX char classes
+have been included.
+
+   regnode_charclass_posixl U32 arg1;
+                            char bitmap[ANYOF_BITMAP_SIZE];
+                            U32 classflags;
+
+=back
+
+F<regnodes.h> defines an array called C<regarglen[]> which gives the size
+of each opcode in units of C<size regnode> (4-byte). A macro is used
+to calculate the size of an C<EXACT> node based on its C<str_len> field.
+
+The regops are defined in F<regnodes.h> which is generated from
+F<regcomp.sym> by F<regcomp.pl>. Currently the maximum possible number
+of distinct regops is restricted to 256, with about a quarter already
+used.
+
+A set of macros makes accessing the fields
+easier and more consistent. These include C<OP()>, which is used to determine
+the type of a C<regnode>-like structure; C<NEXT_OFF()>, which is the offset to
+the next node (more on this later); C<ARG()>, C<ARG1()>, C<ARG2()>, C<ARG_SET()>,
+and equivalents for reading and setting the arguments; and C<STR_LEN()>,
+C<STRING()> and C<OPERAND()> for manipulating strings and regop bearing
+types.
+
+=head3 What regop is next?
+
+There are three distinct concepts of "next" in the regex engine, and
+it is important to keep them clear.
+
+=over 4
+
+=item *
+
+There is the "next regnode" from a given regnode, a value which is
+rarely useful except that sometimes it matches up in terms of value
+with one of the others, and that sometimes the code assumes this to
+always be so.
+
+=item *
+
+There is the "next regop" from a given regop/regnode. This is the
+regop physically located after the current one, as determined by
+the size of the current regop. This is often useful, such as when
+dumping the structure we use this order to traverse. Sometimes the code
+assumes that the "next regnode" is the same as the "next regop", or in
+other words assumes that the sizeof a given regop type is always going
+to be one regnode large.
+
+=item *
+
+There is the "regnext" from a given regop. This is the regop which
+is reached by jumping forward by the value of C<NEXT_OFF()>,
+or in a few cases for longer jumps by the C<arg1> field of the C<regnode_1>
+structure. The subroutine C<regnext()> handles this transparently.
+This is the logical successor of the node, which in some cases, like
+that of the C<BRANCH> regop, has special meaning.
+
+=back
+
+=head1 Process Overview
+
+Broadly speaking, performing a match of a string against a pattern
+involves the following steps:
+
+=over 5
+
+=item A. Compilation
+
+=over 5
+
+=item 1. Parsing
+
+=item 2. Peep-hole optimisation and analysis
+
+=back
+
+=item B. Execution
+
+=over 5
+
+=item 3. Start position and no-match optimisations
+
+=item 4. Program execution
+
+=back
+
+=back
+
+
+Where these steps occur in the actual execution of a perl program is
+determined by whether the pattern involves interpolating any string
+variables. If interpolation occurs, then compilation happens at run time. If it
+does not, then compilation is performed at compile time. (The C</o> modifier changes this,
+as does C<qr//> to a certain extent.) The engine doesn't really care that
+much.
+
+=head2 Compilation
+
+This code resides primarily in F<regcomp.c>, along with the header files
+F<regcomp.h>, F<regexp.h> and F<regnodes.h>.
+
+Compilation starts with C<pregcomp()>, which is mostly an initialisation
+wrapper which farms work out to two other routines for the heavy lifting: the
+first is C<reg()>, which is the start point for parsing; the second,
+C<study_chunk()>, is responsible for optimisation.
+
+Initialisation in C<pregcomp()> mostly involves the creation and data-filling
+of a special structure, C<RExC_state_t> (defined in F<regcomp.c>).
+Almost all internally-used routines in F<regcomp.h> take a pointer to one
+of these structures as their first argument, with the name C<pRExC_state>.
+This structure is used to store the compilation state and contains many
+fields. Likewise there are many macros which operate on this
+variable: anything that looks like C<RExC_xxxx> is a macro that operates on
+this pointer/structure.
+
+C<reg()> is the start of the parse process. It is responsible for
+parsing an arbitrary chunk of pattern up to either the end of the
+string, or the first closing parenthesis it encounters in the pattern.
+This means it can be used to parse the top-level regex, or any section
+inside of a grouping parenthesis. It also handles the "special parens"
+that perl's regexes have. For instance when parsing C</x(?:foo)y/>,
+C<reg()> will at one point be called to parse from the "?" symbol up to
+and including the ")".
+
+Additionally, C<reg()> is responsible for parsing the one or more
+branches from the pattern, and for "finishing them off" by correctly
+setting their next pointers. In order to do the parsing, it repeatedly
+calls out to C<regbranch()>, which is responsible for handling up to the
+first C<|> symbol it sees.
+
+C<regbranch()> in turn calls C<regpiece()> which
+handles "things" followed by a quantifier. In order to parse the
+"things", C<regatom()> is called. This is the lowest level routine, which
+parses out constant strings, character classes, and the
+various special symbols like C<$>. If C<regatom()> encounters a "("
+character it in turn calls C<reg()>.
+
+There used to be two main passes involved in parsing, the first to
+calculate the size of the compiled program, and the second to actually
+compile it.  But now there is only one main pass, with an initial crude
+guess based on the length of the input pattern, which is increased if
+necessary as parsing proceeds, and afterwards, trimmed to the actual
+amount used.
+
+However, it may happen that parsing must be restarted at the beginning
+when various circumstances occur along the way.  An example is if the
+program turns out to be so large that there are jumps in it that won't
+fit in the normal 16 bits available.  There are two special regops that
+can hold bigger jump destinations, BRANCHJ and LONGBRANCH.  The parse is
+restarted, and these are used instead of the normal shorter ones.
+Whenever restarting the parse is required, the function returns failure
+and sets a flag as to what needs to be done.  This is passed up to the
+top level routine which takes the appropriate action and restarts from
+scratch.  In the case of needing longer jumps, the C<RExC_use_BRANCHJ>
+flag is set in the C<RExC_state_t> structure, which the functions know
+to inspect before deciding how to do branches.
+
+In most instances, the function that discovers the issue sets the causal
+flag and returns failure immediately.  L</Parsing complications>
+contains an explicit example of how this works.  In other cases, such as
+a forward reference to a numbered parenthetical grouping, we need to
+finish the parse to know if that numbered grouping actually appears in
+the pattern.  In those cases, the parse is just redone at the end, with
+the knowledge of how many groupings occur in it.
+
+The routine C<regtail()> is called by both C<reg()> and C<regbranch()>
+in order to "set the tail pointer" correctly. When executing and
+we get to the end of a branch, we need to go to the node following the
+grouping parens. When parsing, however, we don't know where the end will
+be until we get there, so when we do we must go back and update the
+offsets as appropriate. C<regtail> is used to make this easier.
+
+A subtlety of the parsing process means that a regex like C</foo/> is
+originally parsed into an alternation with a single branch. It is only
+afterwards that the optimiser converts single branch alternations into the
+simpler form.
+
+=head3 Parse Call Graph and a Grammar
+
+The call graph looks like this:
+
+ reg()                        # parse a top level regex, or inside of
+                              # parens
+     regbranch()              # parse a single branch of an alternation
+         regpiece()           # parse a pattern followed by a quantifier
+             regatom()        # parse a simple pattern
+                 regclass()   #   used to handle a class
+                 reg()        #   used to handle a parenthesised
+                              #   subpattern
+                 ....
+         ...
+         regtail()            # finish off the branch
+     ...
+     regtail()                # finish off the branch sequence. Tie each
+                              # branch's tail to the tail of the
+                              # sequence
+                              # (NEW) In Debug mode this is
+                              # regtail_study().
+
+A grammar form might be something like this:
+
+    atom  : constant | class
+    quant : '*' | '+' | '?' | '{min,max}'
+    _branch: piece
+           | piece _branch
+           | nothing
+    branch: _branch
+          | _branch '|' branch
+    group : '(' branch ')'
+    _piece: atom | group
+    piece : _piece
+          | _piece quant
+
+=head3 Parsing complications
+
+The implication of the above description is that a pattern containing nested
+parentheses will result in a call graph which cycles through C<reg()>,
+C<regbranch()>, C<regpiece()>, C<regatom()>, C<reg()>, C<regbranch()> I<etc>
+multiple times, until the deepest level of nesting is reached. All the above
+routines return a pointer to a C<regnode>, which is usually the last regnode
+added to the program. However, one complication is that reg() returns NULL
+for parsing C<(?:)> syntax for embedded modifiers, setting the flag
+C<TRYAGAIN>. The C<TRYAGAIN> propagates upwards until it is captured, in
+some cases by C<regatom()>, but otherwise unconditionally by
+C<regbranch()>. Hence it will never be returned by C<regbranch()> to
+C<reg()>. This flag permits patterns such as C<(?i)+> to be detected as
+errors (I<Quantifier follows nothing in regex; marked by <-- HERE in m/(?i)+
+<-- HERE />).
+
+Another complication is that the representation used for the program differs
+if it needs to store Unicode, but it's not always possible to know for sure
+whether it does until midway through parsing. The Unicode representation for
+the program is larger, and cannot be matched as efficiently. (See L</Unicode
+and Localisation Support> below for more details as to why.)  If the pattern
+contains literal Unicode, it's obvious that the program needs to store
+Unicode. Otherwise, the parser optimistically assumes that the more
+efficient representation can be used, and starts sizing on this basis.
+However, if it then encounters something in the pattern which must be stored
+as Unicode, such as an C<\x{...}> escape sequence representing a character
+literal, then this means that all previously calculated sizes need to be
+redone, using values appropriate for the Unicode representation.  This
+is another instance where the parsing needs to be restarted, and it can
+and is done immediately.  The function returns failure, and sets the
+flag C<RESTART_UTF8> (encapsulated by using the macro C<REQUIRE_UTF8>).
+This restart request is propagated up the call chain in a similar
+fashion, until it is "caught" in C<Perl_re_op_compile()>, which marks
+the pattern as containing Unicode, and restarts the sizing pass. It is
+also possible for constructions within run-time code blocks to turn out
+to need Unicode representation., which is signalled by
+C<S_compile_runtime_code()> returning false to C<Perl_re_op_compile()>.
+
+The restart was previously implemented using a C<longjmp> in C<regatom()>
+back to a C<setjmp> in C<Perl_re_op_compile()>, but this proved to be
+problematic as the latter is a large function containing many automatic
+variables, which interact badly with the emergent control flow of C<setjmp>.
+
+=head3 Debug Output
+
+Starting in the 5.9.x development version of perl you can C<< use re
+Debug => 'PARSE' >> to see some trace information about the parse
+process. We will start with some simple patterns and build up to more
+complex patterns.
+
+So when we parse C</foo/> we see something like the following table. The
+left shows what is being parsed, and the number indicates where the next regop
+would go. The stuff on the right is the trace output of the graph. The
+names are chosen to be short to make it less dense on the screen. 'tsdy'
+is a special form of C<regtail()> which does some extra analysis.
+
+ >foo<             1    reg
+                          brnc
+                            piec
+                              atom
+ ><                4      tsdy~ EXACT <foo> (EXACT) (1)
+                              ~ attach to END (3) offset to 2
+
+The resulting program then looks like:
+
+   1: EXACT <foo>(3)
+   3: END(0)
+
+As you can see, even though we parsed out a branch and a piece, it was ultimately
+only an atom. The final program shows us how things work. We have an C<EXACT> regop,
+followed by an C<END> regop. The number in parens indicates where the C<regnext> of
+the node goes. The C<regnext> of an C<END> regop is unused, as C<END> regops mean
+we have successfully matched. The number on the left indicates the position of
+the regop in the regnode array.
+
+Now let's try a harder pattern. We will add a quantifier, so now we have the pattern
+C</foo+/>. We will see that C<regbranch()> calls C<regpiece()> twice.
+
+ >foo+<            1    reg
+                          brnc
+                            piec
+                              atom
+ >o+<              3        piec
+                              atom
+ ><                6        tail~ EXACT <fo> (1)
+                   7      tsdy~ EXACT <fo> (EXACT) (1)
+                              ~ PLUS (END) (3)
+                              ~ attach to END (6) offset to 3
+
+And we end up with the program:
+
+   1: EXACT <fo>(3)
+   3: PLUS(6)
+   4:   EXACT <o>(0)
+   6: END(0)
+
+Now we have a special case. The C<EXACT> regop has a C<regnext> of 0. This is
+because if it matches it should try to match itself again. The C<PLUS> regop
+handles the actual failure of the C<EXACT> regop and acts appropriately (going
+to regnode 6 if the C<EXACT> matched at least once, or failing if it didn't).
+
+Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/>
+
+ >x(?:foo*|b...    1    reg
+                          brnc
+                            piec
+                              atom
+ >(?:foo*|b[...    3        piec
+                              atom
+ >?:foo*|b[a...                 reg
+ >foo*|b[a][...                   brnc
+                                    piec
+                                      atom
+ >o*|b[a][rR...    5                piec
+                                      atom
+ >|b[a][rR])...    8                tail~ EXACT <fo> (3)
+ >b[a][rR])(...    9              brnc
+                  10                piec
+                                      atom
+ >[a][rR])(f...   12                piec
+                                      atom
+ >a][rR])(fo...                         clas
+ >[rR])(foo|...   14                tail~ EXACT <b> (10)
+                                    piec
+                                      atom
+ >rR])(foo|b...                         clas
+ >)(foo|bar)...   25                tail~ EXACT <a> (12)
+                                  tail~ BRANCH (3)
+                  26              tsdy~ BRANCH (END) (9)
+                                      ~ attach to TAIL (25) offset to 16
+                                  tsdy~ EXACT <fo> (EXACT) (4)
+                                      ~ STAR (END) (6)
+                                      ~ attach to TAIL (25) offset to 19
+                                  tsdy~ EXACT <b> (EXACT) (10)
+                                      ~ EXACT <a> (EXACT) (12)
+                                      ~ ANYOF[Rr] (END) (14)
+                                      ~ attach to TAIL (25) offset to 11
+ >(foo|bar)$<               tail~ EXACT <x> (1)
+                            piec
+                              atom
+ >foo|bar)$<                    reg
+                  28              brnc
+                                    piec
+                                      atom
+ >|bar)$<         31              tail~ OPEN1 (26)
+ >bar)$<                          brnc
+                  32                piec
+                                      atom
+ >)$<             34              tail~ BRANCH (28)
+                  36              tsdy~ BRANCH (END) (31)
+                                     ~ attach to CLOSE1 (34) offset to 3
+                                  tsdy~ EXACT <foo> (EXACT) (29)
+                                     ~ attach to CLOSE1 (34) offset to 5
+                                  tsdy~ EXACT <bar> (EXACT) (32)
+                                     ~ attach to CLOSE1 (34) offset to 2
+ >$<                        tail~ BRANCH (3)
+                                ~ BRANCH (9)
+                                ~ TAIL (25)
+                            piec
+                              atom
+ ><               37        tail~ OPEN1 (26)
+                                ~ BRANCH (28)
+                                ~ BRANCH (31)
+                                ~ CLOSE1 (34)
+                  38      tsdy~ EXACT <x> (EXACT) (1)
+                              ~ BRANCH (END) (3)
+                              ~ BRANCH (END) (9)
+                              ~ TAIL (END) (25)
+                              ~ OPEN1 (END) (26)
+                              ~ BRANCH (END) (28)
+                              ~ BRANCH (END) (31)
+                              ~ CLOSE1 (END) (34)
+                              ~ EOL (END) (36)
+                              ~ attach to END (37) offset to 1
+
+Resulting in the program
+
+   1: EXACT <x>(3)
+   3: BRANCH(9)
+   4:   EXACT <fo>(6)
+   6:   STAR(26)
+   7:     EXACT <o>(0)
+   9: BRANCH(25)
+  10:   EXACT <ba>(14)
+  12:   OPTIMIZED (2 nodes)
+  14:   ANYOF[Rr](26)
+  25: TAIL(26)
+  26: OPEN1(28)
+  28:   TRIE-EXACT(34)
+        [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
+          <foo>
+          <bar>
+  30:   OPTIMIZED (4 nodes)
+  34: CLOSE1(36)
+  36: EOL(37)
+  37: END(0)
+
+Here we can see a much more complex program, with various optimisations in
+play. At regnode 10 we see an example where a character class with only
+one character in it was turned into an C<EXACT> node. We can also see where
+an entire alternation was turned into a C<TRIE-EXACT> node. As a consequence,
+some of the regnodes have been marked as optimised away. We can see that
+the C<$> symbol has been converted into an C<EOL> regop, a special piece of
+code that looks for C<\n> or the end of the string.
+
+The next pointer for C<BRANCH>es is interesting in that it points at where
+execution should go if the branch fails. When executing, if the engine
+tries to traverse from a branch to a C<regnext> that isn't a branch then
+the engine will know that the entire set of branches has failed.
+
+=head3 Peep-hole Optimisation and Analysis
+
+The regular expression engine can be a weighty tool to wield. On long
+strings and complex patterns it can end up having to do a lot of work
+to find a match, and even more to decide that no match is possible.
+Consider a situation like the following pattern.
+
+   'ababababababababababab' =~ /(a|b)*z/
+
+The C<(a|b)*> part can match at every char in the string, and then fail
+every time because there is no C<z> in the string. So obviously we can
+avoid using the regex engine unless there is a C<z> in the string.
+Likewise in a pattern like:
+
+   /foo(\w+)bar/
+
+In this case we know that the string must contain a C<foo> which must be
+followed by C<bar>. We can use Fast Boyer-Moore matching as implemented
+in C<fbm_instr()> to find the location of these strings. If they don't exist
+then we don't need to resort to the much more expensive regex engine.
+Even better, if they do exist then we can use their positions to
+reduce the search space that the regex engine needs to cover to determine
+if the entire pattern matches.
+
+There are various aspects of the pattern that can be used to facilitate
+optimisations along these lines:
+
+=over 5
+
+=item * anchored fixed strings
+
+=item * floating fixed strings
+
+=item * minimum and maximum length requirements
+
+=item * start class
+
+=item * Beginning/End of line positions
+
+=back
+
+Another form of optimisation that can occur is the post-parse "peep-hole"
+optimisation, where inefficient constructs are replaced by more efficient
+constructs. The C<TAIL> regops which are used during parsing to mark the end
+of branches and the end of groups are examples of this. These regops are used
+as place-holders during construction and "always match" so they can be
+"optimised away" by making the things that point to the C<TAIL> point to the
+thing that C<TAIL> points to, thus "skipping" the node.
+
+Another optimisation that can occur is that of "C<EXACT> merging" which is
+where two consecutive C<EXACT> nodes are merged into a single
+regop. An even more aggressive form of this is that a branch
+sequence of the form C<EXACT BRANCH ... EXACT> can be converted into a
+C<TRIE-EXACT> regop.
+
+All of this occurs in the routine C<study_chunk()> which uses a special
+structure C<scan_data_t> to store the analysis that it has performed, and
+does the "peep-hole" optimisations as it goes.
+
+The code involved in C<study_chunk()> is extremely cryptic. Be careful. :-)
+
+=head2 Execution
+
+Execution of a regex generally involves two phases, the first being
+finding the start point in the string where we should match from,
+and the second being running the regop interpreter.
+
+If we can tell that there is no valid start point then we don't bother running
+the interpreter at all. Likewise, if we know from the analysis phase that we
+cannot detect a short-cut to the start position, we go straight to the
+interpreter.
+
+The two entry points are C<re_intuit_start()> and C<pregexec()>. These routines
+have a somewhat incestuous relationship with overlap between their functions,
+and C<pregexec()> may even call C<re_intuit_start()> on its own. Nevertheless
+other parts of the perl source code may call into either, or both.
+
+Execution of the interpreter itself used to be recursive, but thanks to the
+efforts of Dave Mitchell in the 5.9.x development track, that has changed: now an
+internal stack is maintained on the heap and the routine is fully
+iterative. This can make it tricky as the code is quite conservative
+about what state it stores, with the result that two consecutive lines in the
+code can actually be running in totally different contexts due to the
+simulated recursion.
+
+=head3 Start position and no-match optimisations
+
+C<re_intuit_start()> is responsible for handling start points and no-match
+optimisations as determined by the results of the analysis done by
+C<study_chunk()> (and described in L</Peep-hole Optimisation and Analysis>).
+
+The basic structure of this routine is to try to find the start- and/or
+end-points of where the pattern could match, and to ensure that the string
+is long enough to match the pattern. It tries to use more efficient
+methods over less efficient methods and may involve considerable
+cross-checking of constraints to find the place in the string that matches.
+For instance it may try to determine that a given fixed string must be
+not only present but a certain number of chars before the end of the
+string, or whatever.
+
+It calls several other routines, such as C<fbm_instr()> which does
+Fast Boyer Moore matching and C<find_byclass()> which is responsible for
+finding the start using the first mandatory regop in the program.
+
+When the optimisation criteria have been satisfied, C<reg_try()> is called
+to perform the match.
+
+=head3 Program execution
+
+C<pregexec()> is the main entry point for running a regex. It contains
+support for initialising the regex interpreter's state, running
+C<re_intuit_start()> if needed, and running the interpreter on the string
+from various start positions as needed. When it is necessary to use
+the regex interpreter C<pregexec()> calls C<regtry()>.
+
+C<regtry()> is the entry point into the regex interpreter. It expects
+as arguments a pointer to a C<regmatch_info> structure and a pointer to
+a string.  It returns an integer 1 for success and a 0 for failure.
+It is basically a set-up wrapper around C<regmatch()>.
+
+C<regmatch> is the main "recursive loop" of the interpreter. It is
+basically a giant switch statement that implements a state machine, where
+the possible states are the regops themselves, plus a number of additional
+intermediate and failure states. A few of the states are implemented as
+subroutines but the bulk are inline code.
+
+=head1 MISCELLANEOUS
+
+=head2 Unicode and Localisation Support
+
+When dealing with strings containing characters that cannot be represented
+using an eight-bit character set, perl uses an internal representation
+that is a permissive version of Unicode's UTF-8 encoding[2]. This uses single
+bytes to represent characters from the ASCII character set, and sequences
+of two or more bytes for all other characters. (See L<perlunitut>
+for more information about the relationship between UTF-8 and perl's
+encoding, utf8. The difference isn't important for this discussion.)
+
+No matter how you look at it, Unicode support is going to be a pain in a
+regex engine. Tricks that might be fine when you have 256 possible
+characters often won't scale to handle the size of the UTF-8 character
+set.  Things you can take for granted with ASCII may not be true with
+Unicode. For instance, in ASCII, it is safe to assume that
+C<sizeof(char1) == sizeof(char2)>, but in UTF-8 it isn't. Unicode case folding is
+vastly more complex than the simple rules of ASCII, and even when not
+using Unicode but only localised single byte encodings, things can get
+tricky (for example, B<LATIN SMALL LETTER SHARP S> (U+00DF, E<szlig>)
+should match 'SS' in localised case-insensitive matching).
+
+Making things worse is that UTF-8 support was a later addition to the
+regex engine (as it was to perl) and this necessarily  made things a lot
+more complicated. Obviously it is easier to design a regex engine with
+Unicode support in mind from the beginning than it is to retrofit it to
+one that wasn't.
+
+Nearly all regops that involve looking at the input string have
+two cases, one for UTF-8, and one not. In fact, it's often more complex
+than that, as the pattern may be UTF-8 as well.
+
+Care must be taken when making changes to make sure that you handle
+UTF-8 properly, both at compile time and at execution time, including
+when the string and pattern are mismatched.
+
+=head2 Base Structures
+
+The C<regexp> structure described in L<perlreapi> is common to all
+regex engines. Two of its fields are intended for the private use
+of the regex engine that compiled the pattern. These are the
+C<intflags> and pprivate members. The C<pprivate> is a void pointer to
+an arbitrary structure whose use and management is the responsibility
+of the compiling engine. perl will never modify either of these
+values. In the case of the stock engine the structure pointed to by
+C<pprivate> is called C<regexp_internal>.
+
+Its C<pprivate> and C<intflags> fields contain data
+specific to each engine.
+
+There are two structures used to store a compiled regular expression.
+One, the C<regexp> structure described in L<perlreapi> is populated by
+the engine currently being. used and some of its fields read by perl to
+implement things such as the stringification of C<qr//>.
+
+
+The other structure is pointed to by the C<regexp> struct's
+C<pprivate> and is in addition to C<intflags> in the same struct
+considered to be the property of the regex engine which compiled the
+regular expression;
+
+The regexp structure contains all the data that perl needs to be aware of
+to properly work with the regular expression. It includes data about
+optimisations that perl can use to determine if the regex engine should
+really be used, and various other control info that is needed to properly
+execute patterns in various contexts such as is the pattern anchored in
+some way, or what flags were used during the compile, or whether the
+program contains special constructs that perl needs to be aware of.
+
+In addition it contains two fields that are intended for the private use
+of the regex engine that compiled the pattern. These are the C<intflags>
+and pprivate members. The C<pprivate> is a void pointer to an arbitrary
+structure whose use and management is the responsibility of the compiling
+engine. perl will never modify either of these values.
+
+As mentioned earlier, in the case of the default engines, the C<pprivate>
+will be a pointer to a regexp_internal structure which holds the compiled
+program and any additional data that is private to the regex engine
+implementation.
+
+=head3 Perl's C<pprivate> structure
+
+The following structure is used as the C<pprivate> struct by perl's
+regex engine. Since it is specific to perl it is only of curiosity
+value to other engine implementations.
+
+ typedef struct regexp_internal {
+         U32 *offsets;           /* offset annotations 20001228 MJD
+                                  * data about mapping the program to
+                                  * the string*/
+         regnode *regstclass;    /* Optional startclass as identified or
+                                  * constructed by the optimiser */
+         struct reg_data *data;  /* Additional miscellaneous data used
+                                  * by the program.  Used to make it
+                                  * easier to clone and free arbitrary
+                                  * data that the regops need. Often the
+                                  * ARG field of a regop is an index
+                                  * into this structure */
+         regnode program[1];     /* Unwarranted chumminess with
+                                  * compiler. */
+ } regexp_internal;
+
+=over 5
+
+=item C<offsets>
+
+Offsets holds a mapping of offset in the C<program>
+to offset in the C<precomp> string. This is only used by ActiveState's
+visual regex debugger.
+
+=item C<regstclass>
+
+Special regop that is used by C<re_intuit_start()> to check if a pattern
+can match at a certain position. For instance if the regex engine knows
+that the pattern must start with a 'Z' then it can scan the string until
+it finds one and then launch the regex engine from there. The routine
+that handles this is called C<find_by_class()>. Sometimes this field
+points at a regop embedded in the program, and sometimes it points at
+an independent synthetic regop that has been constructed by the optimiser.
+
+=item C<data>
+
+This field points at a C<reg_data> structure, which is defined as follows
+
+    struct reg_data {
+        U32 count;
+        U8 *what;
+        void* data[1];
+    };
+
+This structure is used for handling data structures that the regex engine
+needs to handle specially during a clone or free operation on the compiled
+product. Each element in the data array has a corresponding element in the
+what array. During compilation regops that need special structures stored
+will add an element to each array using the add_data() routine and then store
+the index in the regop.
+
+=item C<program>
+
+Compiled program. Inlined into the structure so the entire struct can be
+treated as a single blob.
+
+=back
+
+=head1 SEE ALSO
+
+L<perlreapi>
+
+L<perlre>
+
+L<perlunitut>
+
+=head1 AUTHOR
+
+by Yves Orton, 2006.
+
+With excerpts from Perl, and contributions and suggestions from
+Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus,
+Stephen McCamant, and David Landgren.
+
+Now maintained by Perl 5 Porters.
+
+=head1 LICENCE
+
+Same terms as Perl.
+
+=head1 REFERENCES
+
+[1] L<https://perl.plover.com/Rx/paper/>
+
+[2] L<https://www.unicode.org/>
+
+=cut