(This section is borrowed from the perldoc section on regular expressions, but somewhat modified.)
The patterns used in Perl pattern matching derive from those supplied in the Version 8 regex routines. (The routines are derived, distantly, from Henry Spencer's freely redistributable reimplementation of the V8 routines.) In particular the following metacharacters have their standard egrep-ish meanings:
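    \    Quote the next metacharacter
    ^    Match the beginning of the line
    .    Match any character (except newline)
    $    Match the end of the line (or before newline at the end)
    |    Alternation
    ()   Grouping
    []   Character class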
By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by "^" or "$". You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string, and "$" will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator.
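The effect of /m on the two anchors can be sketched as follows. This is a minimal illustration, not part of the original text; the sample string and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical two-line buffer with a trailing newline.
my $text = "first line\nsecond line\n";

# Without /m, "^" anchors only at the very start of the string.
my @starts_plain = $text =~ /^(\w+)/g;    # ("first")

# With /m, "^" also matches after each embedded newline.
my @starts_multi = $text =~ /^(\w+)/mg;   # ("first", "second")

# Without /m, "$" matches only at the end (or before the final newline).
my @ends_plain = $text =~ /(\w+)$/g;      # ("line")

# With /m, "$" matches before every newline.
my @ends_multi = $text =~ /(\w+)$/mg;     # ("line", "line")

print "line starts without /m: @starts_plain\n";
print "line starts with /m:    @starts_multi\n";
```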
To simplify multi-line substitutions, the "." character never matches a newline unless you use the /s modifier, which in effect tells Perl to pretend the string is a single line, even if it isn't. The /s modifier also overrides the setting of $*, in case you have some (badly behaved) older code that sets it in another module.
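A short sketch of how /s changes what "." matches. The string and variable names are illustrative only, and the obsolete $* variable is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical string with one embedded newline.
my $text = "alpha\nbeta";

# Without /s, "." refuses to match the newline, so the pattern fails.
my $without_s = $text =~ /alpha.beta/  ? 1 : 0;   # 0

# With /s, "." matches any character, including the newline.
my $with_s    = $text =~ /alpha.beta/s ? 1 : 0;   # 1

print "without /s: $without_s, with /s: $with_s\n";
```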
The following standard quantifiers are recognized:
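    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times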
(If a curly bracket occurs in any other context, it is treated as a regular character.) The "*" quantifier is equivalent to {0,}, the "+" quantifier to {1,}, and the "?" quantifier to {0,1}. n and m are limited to integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms. The actual limit appears in the error message Perl generates when a pattern uses a quantifier count that exceeds it.
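One way to provoke that error message is sketched below. This is an illustration under assumptions, not the original listing: the oversized count 99999999 and the variable names are arbitrary, and the number printed in the message depends on how your perl was built.

```perl
use strict;
use warnings;

# Build the pattern from a string so it is compiled at run time, inside
# eval, instead of aborting compilation of the whole script.
my $too_big = 'a{1,99999999}';    # hypothetical, deliberately oversized count

my $compiled = eval { qr/$too_big/ };
my $error    = defined $compiled ? '' : $@;

# The message names the limit of the running perl.
print $error;
```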
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":
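    *?      Match 0 or more times
    +?      Match 1 or more times
    ??      Match 0 or 1 time
    {n}?    Match exactly n times
    {n,}?   Match at least n times
    {n,m}?  Match at least n but not more than m times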
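The difference between greedy and minimal matching is easiest to see on the same input. The sketch below uses a made-up string and variable names.

```perl
use strict;
use warnings;

# Hypothetical input with several angle-bracketed spans.
my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: ".+" grabs as much as it can while still letting ">" match.
my ($greedy)  = $html =~ /<(.+)>/;    # "b>bold</b> and <i>italic</i"

# Minimal: ".+?" stops at the first place where ">" can match.
my ($minimal) = $html =~ /<(.+?)>/;   # "b"

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```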
Because patterns are processed as double quoted strings, the following also work:
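    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char (think of a PDP-11)
    \x1B        hex char
    \x{263a}    wide hex char         (Unicode SMILEY)
    \c[         control char
    \N{name}    named char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E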
If use locale is in effect, the case map used by \l, \L, \u and \U is taken from the current locale. See perllocale. For documentation of \N{name}, see charnames.
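Because these escapes are applied while the pattern is interpolated, they can change the case of a variable's contents before matching. A small sketch, assuming no locale is in effect; the variable names are illustrative.

```perl
use strict;
use warnings;

my $name = "perl";    # hypothetical lowercase input

# \u uppercases only the next character: the pattern becomes /\APerl\b/.
my $capitalized = "Perl is fun" =~ /\A\u$name\b/ ? 1 : 0;   # 1

# \U ... \E uppercases everything in between: the pattern becomes /\APERL\z/.
my $shouted = "PERL" =~ /\A\U$name\E\z/ ? 1 : 0;             # 1

# Matching stays case-sensitive, so the lowercase original no longer matches.
my $lowercase = "perl" =~ /\A\u$name/ ? 1 : 0;               # 0

print "$capitalized $shouted $lowercase\n";
```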
You cannot include a literal $ or @ within a \Q sequence. An unescaped $ or @ interpolates the corresponding variable, while escaping will cause the literal string \$ to be matched. You'll need to write something like m/\Quser\E\@\Qhost/.
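The workaround of closing the \Q sequence around the "@" can be checked directly. The address below is a made-up example.

```perl
use strict;
use warnings;

my $address = 'user@host';    # hypothetical input

# End the quoted run with \E, match a literal "@" as \@, then resume quoting.
my $matched = $address =~ m/\Quser\E\@\Qhost/ ? 1 : 0;    # 1

# Inside \Q ... \E a "." is literal, so it no longer matches any character.
my $dot_is_literal = 'abc' =~ m/\Qa.c\E/ ? 0 : 1;         # 1

print "matched: $matched, dot literal: $dot_is_literal\n";
```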
In addition, Perl defines the following:
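    \w   Match a "word" character (alphanumeric plus "_")
    \W   Match a non-word character
    \s   Match a whitespace character
    \S   Match a non-whitespace character
    \d   Match a digit character
    \D   Match a non-digit character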
A \w matches a single alphanumeric character (or underscore), not a whole word. Use \w+ to match a string of Perl-identifier characters (which isn't the same as matching an English word). You may use \w, \W, \s, \S, \d, and \D within character classes (i.e., within square brackets), but if you try to use them as endpoints of a range, no range is formed and the "-" is understood literally.
Perl defines the following zero-width assertions:
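    \b   Match a word boundary
    \B   Match a non-(word boundary)
    \A   Match only at beginning of string
    \Z   Match only at end of string, or before newline at the end
    \z   Match only at end of string
    \G   Match only at pos() (e.g. at the end-of-match position of prior m//g)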
A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. (Within character classes \b represents backspace rather than a word boundary, just as it normally does in any double-quoted string.) The \A and \Z are just like "^" and "$", except that they won't match multiple times when the /m modifier is used, while "^" and "$" will match at every internal line boundary. To match the actual end of the string and not ignore an optional trailing newline, use \z.
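The following sketch exercises \b, \A, \Z and \z on small made-up strings. The variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "cat concat cat\n";    # hypothetical input with a trailing newline

# \b keeps "cat" from matching inside "concat".
my @whole_words = $text =~ /\bcat\b/g;          # two matches

# \Z tolerates the trailing newline; \z demands the true end of the string.
my $big_z   = $text =~ /cat\Z/ ? 1 : 0;         # 1
my $small_z = $text =~ /cat\z/ ? 1 : 0;         # 0

# Under /m, "^" matches at every line start but \A only at the very start.
my $lines   = "one\ntwo\n";
my @by_caret = $lines =~ /^\w+/mg;              # ("one", "two")
my @by_A     = $lines =~ /\A\w+/mg;             # ("one")

print scalar(@whole_words), " whole-word matches\n";
print "\\Z: $big_z, \\z: $small_z\n";
```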
The \G assertion can be used to chain global matches (using m//g), as described in perlop/"Regexp Quote-Like Operators".
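A common use of \G is a small tokenizer in which each match must start exactly where the previous one stopped. The sketch below is one possible arrangement, not taken from the original text: the token labels and variable names are invented, and it relies on the /c modifier (not discussed above) so that a failed match leaves pos() untouched.

```perl
use strict;
use warnings;

my $input = "12 + 34";    # hypothetical expression to tokenize
my @tokens;

while (1) {
    # Each alternative is anchored with \G at the current pos($input).
    if ($input =~ /\G\s+/gc)         { next }                          # skip blanks
    if ($input =~ /\G(\d+)/gc)       { push @tokens, "NUM:$1"; next }
    if ($input =~ /\G([-+*\/])/gc)   { push @tokens, "OP:$1";  next }
    last;    # nothing matched at this position: stop
}

print join(" ", @tokens), "\n";    # NUM:12 OP:+ NUM:34
```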
The bracketing construct ( ... ) creates capture buffers. To refer to the digit'th buffer use \<digit> within the match. Outside the match use "$" instead of "\". (The \<digit> notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.) Referring back to another part of the match is called a backreference.
There is no limit to the number of captured substrings that you may use. However Perl also uses \10, \11, etc. as aliases for \010, \011, etc. See the full documentation for what to do in that case.
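As an illustration of capture buffers and backreferences, the sketch below swaps two words, finds a doubled character with \1, and pulls fields out of a time stamp. All strings and variable names are made up for the example.

```perl
use strict;
use warnings;

# Swap the first two words; $1 and $2 are used outside the pattern.
my $name = "Ada Lovelace";
(my $swapped = $name) =~ s/^(\S+)\s+(\S+)/$2 $1/;     # "Lovelace Ada"

# \1 inside the pattern refers back to whatever the first group matched.
my ($doubled) = "bookkeeper" =~ /(.)\1/;              # "o"

# In list context a match returns the captured substrings in order.
my ($hours, $minutes, $seconds) =
    "Time: 12:34:56" =~ /Time: (..):(..):(..)/;

print "$swapped\n";
print "first doubled character: $doubled\n";
print "$hours h $minutes m $seconds s\n";
```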
Several special variables also refer back to portions of the previous match. $+ returns whatever the last bracket match matched. $& returns the entire matched string. (At one point $0 did also, but now it returns the name of the program.) $` returns everything before the matched string. And $' returns everything after the matched string.
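The four variables can be inspected after a single successful match, as in this sketch. The input string and variable names are illustrative; the values are copied immediately because a later match would overwrite them.

```perl
use strict;
use warnings;

my ($before, $matched, $after, $last_group);

if ("hello world again" =~ /(wor)(ld)/) {
    # Copy the special variables right away.
    ($before, $matched, $after, $last_group) = ($`, $&, $', $+);
}

print "before: '$before'\n";        # 'hello '
print "matched: '$matched'\n";      # 'world'
print "after: '$after'\n";          # ' again'
print "last group: '$last_group'\n";    # 'ld'
```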
The numbered variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, and $') are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See perlsyn/"Compound Statements".)
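The scoping rule means that a match inside a nested block does not disturb the capture variables of the surrounding block. A minimal sketch with made-up strings and variable names:

```perl
use strict;
use warnings;

my ($inner, $outer);

if ('abc' =~ /(b)/) {
    {
        # A successful match in the nested block sets its own $1.
        $inner = $1 if 'xyz' =~ /(y)/;      # "y"
    }
    # Leaving the nested block restores the outer match's $1.
    $outer = $1;                            # "b"
}

print "inner: $inner, outer: $outer\n";
```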
WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc., so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
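Grouping without capturing looks like this in practice. The date string and variable names are invented for the sketch, which shows only the grouping behaviour and does not measure any speed difference.

```perl
use strict;
use warnings;

my $date = "2024-06-01";    # hypothetical input

# Only the middle group captures; the (?: ... ) groups merely group.
my @captured = $date =~ /(?:\d{4})-(\d{2})-(?:\d{2})/;    # ("06")

# (?: ... ) still works as a unit for alternation and quantifiers.
my $repeated = "abab" =~ /\A(?:ab)+\z/ ? 1 : 0;           # 1

print "captured: @captured\n";
```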
Backslashed metacharacters in Perl are alphanumeric, such as \b, \w, \n. Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply backslash every non-alphanumeric character, for example with a substitution such as s/(\W)/\\$1/g applied to the pattern string.
Today it is more common to use the quotemeta() function or the \Q metaquoting escape sequence to disable all metacharacters' special meanings, as in /$unquoted\Q$quoted\E$unquoted/.
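The three quoting approaches can be compared side by side. In this sketch the sample string and variable names are hypothetical, and the input is plain ASCII, for which the hand-rolled substitution and quotemeta() give the same result.

```perl
use strict;
use warnings;

my $literal = 'a.b*c';    # hypothetical text containing metacharacters

# Older idiom: backslash every non-word character by hand.
(my $manual = $literal) =~ s/(\W)/\\$1/g;               # a\.b\*c

# The built-in function produces the same escaped string here.
my $escaped = quotemeta($literal);                       # a\.b\*c

# Unquoted, "." and "*" keep their special meanings.
my $loose  = 'aXbbc' =~ /\A$literal\z/       ? 1 : 0;   # 1

# Quoted with \Q ... \E, only the exact text matches.
my $strict = 'aXbbc' =~ /\A\Q$literal\E\z/   ? 1 : 0;   # 0
my $exact  = 'a.b*c' =~ /\A\Q$literal\E\z/   ? 1 : 0;   # 1

print "$manual\n$escaped\n";
```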
Beware that if you put literal backslashes (those not inside interpolated variables) between \Q and \E, double-quotish backslash interpolation may lead to confusing results. If you need to use literal backslashes within \Q...\E, consult perlop/"Gory details of parsing quoted constructs".