Manpage For Perl Regular Expressions


       perlre - Perl regular expressions


DESCRIPTION

       For a description of how to use regular expressions in
       matching operations, see m// and s/// in the perlop
       manpage.  The matching operations can have various
       modifiers, some of which relate to the interpretation of
       the regular expression inside.  These are:

           i   Do case-insensitive pattern matching.
           m   Treat string as multiple lines.
           s   Treat string as single line.
           x   Use extended regular expressions.

       These are usually written as "the /x modifier", even
       though the delimiter in question might not actually be a
       slash.  In fact, any of these modifiers may also be
       embedded within the regular expression itself using the
       new (?...) construct.  See below.

       The /x modifier itself needs a little more explanation.
       It tells the regular expression parser to ignore
       whitespace that is not backslashed or within a character
       class.  You can use this to break up your regular
       expression into (slightly) more readable parts.  Together
       with the capability of embedding comments described later,
       this goes a long way towards making Perl 5 a readable
       language.  See the C comment deletion code in the perlop
       manpage.
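
       As a minimal sketch of the /x modifier (the input string
       and variable names here are illustrative assumptions, not
       taken from perlop), a pattern can be spread over several
       lines with a comment on each part:

```perl
use strict;
use warnings;

my $line = "Time: 12:34:56";

# With /x, unescaped whitespace in the pattern is ignored, and "#"
# starts a comment that runs to the end of the line.
my ($h, $m, $s) = $line =~ /
    Time: \s     # literal label, then one whitespace character
    (\d\d) :     # hours
    (\d\d) :     # minutes
    (\d\d)       # seconds
/x;

print "$h $m $s\n";
```

       Note that a literal space has to be written as \s (or
       backslashed) once /x is in effect.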

       Regular Expressions

       The patterns used in pattern matching are regular
       expressions such as those supplied in the Version 8 regexp
       routines.  (In fact, the routines are derived (distantly)
       from Henry Spencer's freely redistributable
       reimplementation of the V8 routines.)  See the section on
       Version 8 Regular Expressions for details.

       In particular the following metacharacters have their
       standard egrep-ish meanings:

           \   Quote the next metacharacter
           ^   Match the beginning of the line
           .   Match any character (except newline)
           $   Match the end of the line
           |   Alternation
           ()  Grouping
           []  Character class

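       By default, the "^" character is guaranteed to match only
       at the beginning of the string, the "$" character only at
       the end (or before the newline at the end), and Perl does
       certain optimizations with the assumption that the string
       contains only one line.  Embedded newlines will not be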
       matched by "^" or "$".  You may, however, wish to treat a
       string as a multi-line buffer, such that the "^" will
       match after any newline within the string, and "$" will
       match before any newline.  At the cost of a little more
       overhead, you can do this by using the /m modifier on the
       pattern match operator.  (Older programs did this by
       setting $*, but this practice is deprecated in Perl 5.)

       To facilitate multi-line substitutions, the "." character
       never matches a newline unless you use the /s modifier,
       which tells Perl to pretend the string is a single
       line--even if it isn't.  The /s modifier also overrides
       the setting of $*, in case you have some (badly behaved)
       older code that sets it in another module.
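
       The effect of /m and /s can be sketched as follows.  The
       sample text and variable names are assumptions made for
       illustration only:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the very start of the string.
my @default = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." refuses to cross the newline between the lines.
my $dot_default = $text =~ /first.*second/  ? 1 : 0;

# With /s, "." matches the newline too, so the match succeeds.
my $dot_single  = $text =~ /first.*second/s ? 1 : 0;

print "default: @default\n";
print "multi:   @multi\n";
print "dot: $dot_default, dot with /s: $dot_single\n";
```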

       The following standard quantifiers are recognized:

           *      Match 0 or more times
           +      Match 1 or more times
           ?      Match 1 or 0 times
           {n}    Match exactly n times
           {n,}   Match at least n times
           {n,m}  Match at least n but not more than m times

       (If a curly bracket occurs in any other context, it is
       treated as a regular character.)  The "*" quantifier is
       equivalent to {0,}, the "+" quantifier to {1,}, and the
       "?" quantifier to {0,1}.  There is no limit to the size of
       n or m, but large numbers will chew up more memory.
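
       A short sketch of the counted quantifiers, using made-up
       sample strings:

```perl
use strict;
use warnings;

my @candidates = ("555-1234", "55-1234", "5555-1234");

# {n}: exactly three digits, a dash, then exactly four digits.
my @valid = grep { /^[0-9]{3}-[0-9]{4}$/ } @candidates;

# {n,m}: at least two but not more than three "a" characters.
my ($run) = "aaaaa" =~ /(a{2,3})/;

# {n,}: at least two "a" characters, with no upper bound.
my ($at_least) = "xaaaaay" =~ /(a{2,})/;

print "valid: @valid\n";
print "run: $run, at least: $at_least\n";
```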

       By default, a quantified subpattern is "greedy", that is,
       it will match as many times as possible (given a
       particular starting location) without causing the rest of
       the pattern to fail.  If you want it to match the minimum
       number of times possible, follow the quantifier with a
       "?".  Note that the meanings don't change, just the
       "gravity":

           *?     Match 0 or more times
           +?     Match 1 or more times
           ??     Match 0 or 1 time
           {n}?   Match exactly n times
           {n,}?  Match at least n times
           {n,m}? Match at least n but not more than m times
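
       The difference between greedy and minimal matching can be
       sketched like this (the sample string is an assumption
       chosen to make the difference visible):

```perl
use strict;
use warnings;

my $s = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ grabs as much as it can, so the match runs from the
# first "<" to the last ">" in the string.
my ($greedy) = $s =~ /<(.+)>/;

# Minimal: .+? stops at the first ">" that lets the match succeed.
my ($minimal) = $s =~ /<(.+?)>/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```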

       Since patterns are processed as double-quoted strings, the
       following also work:

           \t          tab
           \n          newline
           \r          return
           \f          form feed
           \v          vertical tab, whatever that is
           \a          alarm (bell)
           \e          escape
           \033        octal char
           \x1b        hex char
           \c[         control char
           \l          lowercase next char
           \u          uppercase next char
           \L          lowercase till \E
           \U          uppercase till \E
           \E          end case modification
           \Q          quote regexp metacharacters till \E

       In addition, Perl defines the following:

           \w  Match a "word" character (alphanumeric plus "_")
           \W  Match a non-word character
           \s  Match a whitespace character
           \S  Match a non-whitespace character
           \d  Match a digit character
           \D  Match a non-digit character

       Note that \w matches a single alphanumeric character, not
       a whole word.  To match a word you'd need to say \w+.  You
       may use \w, \W, \s, \S, \d and \D within character classes
       (though not as either end of a range).
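
       As an illustration of these classes (the strings and
       variable names are hypothetical):

```perl
use strict;
use warnings;

# \w matches one character; \w+ matches a whole word.
my ($one)   = "word" =~ /(\w)/;
my ($whole) = "word" =~ /(\w+)/;

# \w, \s and \d used directly in a pattern.
my ($name, $num) = "alice_01  42" =~ /^(\w+)\s+(\d+)$/;

# \w used inside a character class alongside "." and "-".
my ($host, $port) = "host-1.example_net:8080" =~ /^([\w.-]+):(\d+)$/;

print "$one $whole\n";
print "$name $num\n";
print "$host $port\n";
```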

       Perl defines the following zero-width assertions:

           \b  Match a word boundary
           \B  Match a non-(word boundary)
           \A  Match only at beginning of string
           \Z  Match only at end of string
           \G  Match only where previous m//g left off

       A word boundary (\b) is defined as a spot between two
       characters that has a \w on one side of it and a \W on
       the other side of it (in either order), counting the
       imaginary characters off the beginning and end of the
       string as matching a \W.  (Within character classes \b
       represents backspace rather than a word boundary.)  The \A
       and \Z are just like "^" and "$" except that they won't
       match multiple times when the /m modifier is used, while
       "^" and "$" will match at every internal line boundary.

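       A small sketch of these assertions, counting matches in a
       made-up two-line string:

```perl
use strict;
use warnings;

my $text = "cat concat\ncat";

# \b: "cat" as a whole word; the "cat" inside "concat" is skipped.
my $words = () = $text =~ /\bcat\b/g;

# With /m, "^" and "$" match at every internal line boundary...
my $caret  = () = $text =~ /^cat/mg;
my $dollar = () = $text =~ /cat$/mg;

# ...while \A and \Z still match only once.
my $start = () = $text =~ /\Acat/mg;
my $end   = () = $text =~ /cat\Z/mg;

print "words=$words caret=$caret dollar=$dollar start=$start end=$end\n";
```
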
       When the bracketing construct ( ... ) is used, \<digit>
       matches the digit'th substring.  (Outside of the pattern,
       always use "$" instead of "\" in front of the digit.  The
       scope of $<digit> (and $`, $&, and $') extends to the end
       of the enclosing BLOCK or eval string, or to the next
       pattern match with subexpressions.  If you want to use
       parentheses to delimit a subpattern (e.g. a set of
       alternatives) without saving it as a subpattern, follow
       the ( with ?: (see (?:regexp) below).  The \<digit>
       notation sometimes works outside the current pattern, but
       should not be relied upon.)  You may have as many
       parentheses as you wish.  If you have more than 9
       substrings, the variables $10, $11, ... refer to the
       corresponding substring.  Within the pattern, \10, \11,
       etc. refer back to substrings if there have been at least
       that many left parens before the backreference.  Otherwise
       (for backward compatibility) \10 is the same as \010, a
       backspace, and \11 the same as \011, a tab.  And so on.
       (\1 through \9 are always backreferences.)
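
       For instance, \1 inside the pattern and $1 outside it can
       be combined to find a doubled word.  This is only a
       sketch; the sample sentence is an assumption:

```perl
use strict;
use warnings;

my $text = "this is is a test";
my $dup;

# \1 refers back, inside the pattern, to what (\w+) just matched.
if ($text =~ /\b(\w+)\s+\1\b/) {
    # Outside the pattern the same substring is available as $1.
    $dup = $1;
}

print "doubled word: $dup\n";
```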

       $+ returns whatever the last bracket match matched.  $&
       returns the entire matched string.  ($0 used to return the
       same thing, but not any more.)  $` returns everything
       before the matched string.  $' returns everything after
       the matched string.  Examples:

           s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

           if (/Time: (..):(..):(..)/) {
               $hours = $1;
               $minutes = $2;
               $seconds = $3;
           }
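
       The match variables can be inspected together after a
       single match, as in this sketch (the input string is
       made up for illustration):

```perl
use strict;
use warnings;

my $str = "hello world";
my ($pre, $match, $post, $last);

if ($str =~ /(o) (w)/) {
    # $` is the text before the match, $& the match itself,
    # $' the text after it, and $+ the last bracket that matched.
    ($pre, $match, $post, $last) = ($`, $&, $', $+);
}

print "[$pre] [$match] [$post] [$last]\n";
```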

       You will note that all backslashed metacharacters in Perl
       are alphanumeric, such as \b, \w, \n.  Unlike some other
       regular expression languages, there are no backslashed
       symbols that aren't alphanumeric.  So anything that looks
       like \\, \(, \), \<, \>, \{, or \} is always interpreted
       as a literal character, not a metacharacter.  This makes
       it simple to quote a string that you want to use for a
       pattern but that you are afraid might contain
       metacharacters.  Simply quote all the non-alphanumeric
       characters:

           $pattern =~ s/(\W)/\\$1/g;

       You can also use the built-in quotemeta() function to do
       this.  An even easier way to quote metacharacters right in
       the match operator is to say

           /$unquoted\Q$quoted\E$unquoted/
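
       A sketch of both approaches, using a hypothetical string
       that contains two metacharacters:

```perl
use strict;
use warnings;

my $literal = "a.b*c";

# quotemeta() backslashes every non-alphanumeric character.
my $quoted = quotemeta($literal);

# Interpolated as is, "." and "*" keep their special meanings.
my $loose = "xxaXbbbcxx" =~ /$literal/ ? 1 : 0;

# Between \Q and \E the same string is matched literally.
my $strict_miss = "xxaXbbbcxx" =~ /\Q$literal\E/ ? 1 : 0;
my $strict_hit  = "xxa.b*cxx"  =~ /\Q$literal\E/ ? 1 : 0;

print "$quoted\n";
print "loose=$loose strict_miss=$strict_miss strict_hit=$strict_hit\n";
```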

       Perl 5 defines a consistent extension syntax for regular
       expressions.  The syntax is a pair of parens with a
       question mark as the first thing within the parens (this
       was a syntax error in Perl 4).  The character after the
       question mark gives the function of the extension.

       (?:regexp)
                 This groups things like "()" but doesn't make
                 backreferences like "()" does.  So

                     split(/\b(?:a|b|c)\b/)

                 is like

                     split(/\b(a|b|c)\b/)

                 but doesn't spit out extra fields.

       (?=regexp)
                 A zero-width positive lookahead assertion.  For
                 example, /\w+(?=\t)/ matches a word followed by
                 a tab, without including the tab in $&.

       (?!regexp)
                 A zero-width negative lookahead assertion.  For
                 example /foo(?!bar)/ matches any occurrence of
                 "foo" that isn't followed by "bar".  Note
                 however that lookahead and lookbehind are NOT
                 the same thing.  You cannot use this for
                 lookbehind: /(?!foo)bar/ will not find an
                 occurrence of "bar" that is preceded by
                 something which is not "foo".  That's because
                 the (?!foo) is just saying that the next thing
                 cannot be "foo"--and it's not, it's a "bar", so
                 "foobar" will match.  You would have to do
                 something like /(?!foo)...bar/ for that.  We say
                 "like" because there's the case of your "bar"
                 not having three characters before it.  You
                 could cover that this way:
                 /(?:(?!foo)...|^.{0,2})bar/.  Sometimes it's
                 still easier just to say:

                     if (/bar/ && $` !~ /foo$/)


       (?imsx)   One or more embedded pattern-match modifiers.
                 This is particularly useful for patterns that
                 are specified in a table somewhere, some of
                 which want to be case sensitive, and some of
                 which don't.  The case insensitive ones merely
                 need to include (?i) at the front of the
                 pattern.  For example:

                     $pattern = "foobar";
                     if ( /$pattern/i )

                     # more flexible:

                     $pattern = "(?i)foobar";
                     if ( /$pattern/ )


       The specific choice of question mark for this and the new
       minimal matching construct was because 1) question mark is
       pretty rare in older regular expressions, and 2) whenever
       you see one, you should stop and "question" exactly what
       is going on.  That's psychology...
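
       The extension constructs described above can be sketched
       together as follows.  All sample strings, and the idea of
       a two-entry pattern table, are assumptions made for
       illustration:

```perl
use strict;
use warnings;

# (?:...) groups without capturing, so split returns no extra fields.
my @with    = split(/\s*(,|;)\s*/,   "a, b; c");
my @without = split(/\s*(?:,|;)\s*/, "a, b; c");

# (?=...) looks ahead without consuming: the tab is not part of $&.
my $word = "name\tvalue" =~ /\w+(?=\t)/ ? $& : undef;

# (?!...) keeps only a "foo" that is not followed by "bar".
my @foo_hits = grep { /foo(?!bar)/ } ("foobar", "foobaz", "food");

# A "bar" not preceded by "foo", tested with $` and a negated match.
my @bar_hits = grep { /bar/ && $` !~ /foo$/ } ("foobar", "rebar", "bar");

# (?i) embedded in a pattern that is stored as a string.
my @table   = ("(?i)foobar", "foobar");
my @matched = grep { "FooBar" =~ /$_/ } @table;

print "with: @with\nwithout: @without\n";
print "word: $word\nfoo: @foo_hits\nbar: @bar_hits\n";
print "matched: @matched\n";
```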

       Version 8 Regular Expressions

       In case you're not familiar with the "regular" Version 8
       regexp routines, here are the pattern-matching rules not
       described above.

       Any single character matches itself, unless it is a
       metacharacter with a special meaning described here or
       above.  You can cause characters which normally function
       as metacharacters to be interpreted literally by prefixing
       them with a "\" (e.g. "\." matches a ".", not any
       character; "\\" matches a "\").  A series of characters
       matches that series of characters in the target string, so
       the pattern blurfl would match "blurfl" in the target
       string.

       You can specify a character class, by enclosing a list of
       characters in "[]", which will match any one of the
       characters in the list.  If the first character after the
       "[" is "^", the class matches any character not in the
       list.  Within a list, the "-" character is used to specify
       a range, so that "a-z" represents all the
       characters between "a" and "z", inclusive.
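
       A brief sketch of a range and its negation, applied to an
       arbitrary sample string:

```perl
use strict;
use warnings;

my $sample = "Perl 5 Rules";

# [a-z] matches any one lowercase letter in the range.
my $lower = join "", $sample =~ /[a-z]/g;

# [^a-z] matches any one character that is not in that range.
my $other = join "", $sample =~ /[^a-z]/g;

print "lower: $lower\n";
print "other: $other\n";
```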

       Characters may be specified using a metacharacter syntax
       much like that used in C: "\n" matches a newline, "\t" a
       tab, "\r" a carriage return, "\f" a form feed, etc.  More
       generally, \nnn, where nnn is a string of octal digits,
       matches the character whose ASCII value is nnn.
       Similarly, \xnn, where nn are hexadecimal digits, matches
       the character whose ASCII value is nn.  The expression \cx
       matches the ASCII character control-x.  Finally, the "."
       metacharacter matches any character except "\n" (unless
       you use /s).
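
       As a rough check of these notations, the escape character
       can be written four ways (a sketch; the variable names
       are arbitrary):

```perl
use strict;
use warnings;

my $esc = "\e";

# Octal, hex and control notations all name the same character.
my $same_string = ($esc eq "\033" && $esc eq "\x1b" && $esc eq "\c[") ? 1 : 0;

# The octal and hex forms work the same way inside a pattern.
my $same_pattern = ($esc =~ /\033/ && $esc =~ /\x1b/) ? 1 : 0;

# "." skips a newline unless /s is given.
my $dot_plain  = "\n" =~ /./  ? 1 : 0;
my $dot_single = "\n" =~ /./s ? 1 : 0;

print "$same_string $same_pattern $dot_plain $dot_single\n";
```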

       You can specify a series of alternatives for a pattern
       using "|" to separate them, so that fee|fie|foe will match
       any of "fee", "fie", or "foe" in the target string (as
       would f(e|i|o)e).  Note that the first alternative
       includes everything from the last pattern delimiter ("(",
       "[", or the beginning of the pattern) up to the first "|",
       and the last alternative contains everything from the last
       "|" to the next pattern delimiter.  For this reason, it's
       common practice to include alternatives in parentheses, to
       minimize confusion about where they start and end.  Note
       however that "|" is interpreted as a literal within square
       brackets, so if you write [fee|fie|foe] you're really only
       matching [feio|].
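
       The reach of each alternative can be sketched with a few
       made-up strings:

```perl
use strict;
use warnings;

# Both forms find the same three words.
my @plain   = "fee fie foe fum" =~ /fee|fie|foe/g;
my @grouped = "fee fie foe fum" =~ /f(?:e|i|o)e/g;

# Without parentheses, "^" belongs only to the first alternative
# and "$" only to the last, so a string merely ending in "fie" matches.
my $loose = "coffie" =~ /^fee|fie$/     ? 1 : 0;
my $tight = "coffie" =~ /^(?:fee|fie)$/ ? 1 : 0;

# Inside square brackets "|" is just another character.
my $class = join "", "fee|fum" =~ /[fee|fie|foe]/g;

print "@plain\n@grouped\n";
print "loose=$loose tight=$tight class=$class\n";
```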

       Within a pattern, you may designate subpatterns for later
       reference by enclosing them in parentheses, and you may
       refer back to the nth subpattern later in the pattern
       using the metacharacter \n.  Subpatterns are numbered
       based on the left to right order of their opening
       parenthesis.  Note that a backreference matches whatever
       actually matched the subpattern in the string being
       examined, not the rules for that subpattern.  Therefore,
       (0|0x)\d*\s\1\d* will match "0x1234 0x4321", but not
       "0x1234 01234", since subpattern 1 actually matched "0x",
       even though the rule 0|0x could potentially match the
       leading 0 in the second number.
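
       The example above can be tried directly.  In this sketch
       the pattern is held in a string, which is an assumption
       made only to avoid repeating it:

```perl
use strict;
use warnings;

my $re = '(0|0x)\d*\s\1\d*';

# Both numbers carry the same "0x" prefix, so \1 can match it again.
my $prefix = "0x1234 0x4321" =~ /$re/ ? $1 : undef;

# Here \1 would have to match "0x" a second time, and it cannot.
my $mixed = "0x1234 01234" =~ /$re/ ? 1 : 0;

print "prefix: $prefix, mixed: $mixed\n";
```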

