Regular Expressions & String Matching

Mirrored from http://www.algonquinc.on.ca/~pincka/dat2366/perl/regExp.htm without permission - the original server has been replaced by a web hosting company?

Basic Match Operation
form: /pattern_to_match/
returns true if pattern_to_match found in the current line ($_); returns false otherwise.
for example,
```
    while (<STDIN>)  # read line into $_ until EOF on standard input
    {   if (/PERL/)  # if $_ (current line) contains the word "PERL"
        {   print;   # then print this line
        }
    }
```
ignore case option
by appending the character i to the match operator, a "match" is considered to have been found even if there are case differences between the target string (i.e. the current line, by default) and the pattern_to_match.
For example, if in the previous example the if statement were changed to:
if (/PERL/i)
then lines containing "Perl", "perl", "pERl", etc. would be printed (and not just those containing "PERL").
using a different target string than the current line ($_)
$target_string_name =~ /pattern_to_match/
=~ is called the binding operator; note that this is not an assignment operation or an equality test of any form.
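
As a minimal sketch of the binding operator, the following tests a named variable instead of $_. The variable $title and its contents are hypothetical; the negated form !~ is standard Perl, although it is not covered in the text above.

```perl
use strict;
use warnings;

# $title is a hypothetical variable used only for illustration.
my $title = "Learning Perl in 21 days";

# =~ binds the match to $title instead of the default $_.
my $has_perl = ($title =~ /PERL/i) ? 1 : 0;

# !~ is the negated binding: true when the pattern is NOT found.
my $no_java = ($title !~ /java/i) ? 1 : 0;

print "mentions Perl\n"      if $has_perl;
print "no mention of Java\n" if $no_java;
```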

Regular Expressions

A regular expression is a template pattern to be used in finding a match within a string; regular expressions typically contain elements with special meaning to allow for more flexibility in matching.

Regular Expression Elements

. any character except a newline

+ one or more of the preceding character (or group);
for example /be+t/ would match "he bet on the race" and "she ate a beet" and "I hope you feel better" but not "Their debt increased";
similarly /b(an)+/ would match "the band" and "banana", but not "batter"

? zero or one of the preceding character (or group);
for example, /be?t/ would match "he bet" and "in debt", but not "ate a beet"

* zero or more of the preceding character (or group);
for example, /be*t/ would match "better", "beet", and "debt"
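
The three quantifiers +, ? and * can be compared side by side. This is only a sketch: the sample strings are borrowed from the examples in this list, and the %result hash is a hypothetical helper for collecting the outcome.

```perl
use strict;
use warnings;

my @samples = ("he bet", "ate a beet", "in debt");

# For each sample, record which of the three patterns match.
my %result;
for my $text (@samples) {
    $result{$text} = join ",",
        ($text =~ /be+t/ ? "+" : ()),   # one or more 'e'
        ($text =~ /be?t/ ? "?" : ()),   # zero or one 'e'
        ($text =~ /be*t/ ? "*" : ());   # zero or more 'e'
    printf "%-12s %s\n", $text, $result{$text};
}
```

Only the * form matches all three strings, since it accepts no 'e' at all ("debt") as well as several ("beet").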

[abcd] match any one of 'a', 'b', 'c', or 'd';
for example, /b[aeiou]t/ would match "rabbit" and "batter", but not "debt" nor "byte"

[a-d] match any one in the range of characters from 'a' to 'd' inclusive; for example, /[a-z][A-Z]/ (any lower case letter immediately followed by an upper case letter) would match "deSotto" and "MacIntosh" but not "Susie Smith"

[a-dA-D0-4] match any one in the range of characters from 'a' to 'd' or from 'A' to 'D' or from '0' to '4' inclusive;
for example, /0x[0-9A-F]/i ("0x" followed by a hexadecimal digit, with the case ignored) would match "carriage return = 0x0D" and "line feed is 0XA"

[^....] match something that is not one of the listed elements following the caret (^);
for example, /[^a-zA-Z\s]/ would match any character that is not a letter or a whitespace character
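
A short sketch of character classes in use, assuming two made-up input strings ($line and $codes). In list context with the g option (described under Substitution, and also valid on a match), a match returns every matching piece of the string.

```perl
use strict;
use warnings;

# Both strings are hypothetical sample data.
my $line  = "Total: 42 items, 7 missing!";
my $codes = "CR = 0x0D, LF = 0XA";

# Collect every character that is neither a letter nor whitespace.
my @others = $line =~ /[^a-zA-Z\s]/g;

# Collect hexadecimal values using a class built from ranges.
my @hex = $codes =~ /0x[0-9A-F]+/ig;

print "non-letters: @others\n";
print "hex values: @hex\n";
```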

{x,y} match preceding character (or group) at least x but no more than y times;
for example /\s[0-9]{2,5}\s/ (at least 2 but not more than 5 digits between two whitespaces) would match "there are 88 keys on a piano" and "1 year = 365 days", but not "I saw 3 blind mice" nor "there are over 1000000 neurons in the brain"

{x} match the preceding character (or group) exactly x times

{x,} match the preceding character (or group) at least x times (with no upper limit)
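
The three counted forms can be tried against the sample sentences used above. This is a sketch only; the array names are hypothetical, and grep is used simply to keep the lines for which the match succeeds.

```perl
use strict;
use warnings;

my @lines = (
    "there are 88 keys on a piano",
    "I saw 3 blind mice",
    "there are over 1000000 neurons in the brain",
);

# {2,5}: a run of two to five digits with whitespace on both sides.
my @range = grep { /\s[0-9]{2,5}\s/ } @lines;

# {7}: exactly seven digits between whitespace.
my @exact = grep { /\s[0-9]{7}\s/ } @lines;

# {4,}: four or more digits in a row, anywhere in the line.
my @at_least = grep { /[0-9]{4,}/ } @lines;

print "2 to 5 digits: @range\n";
print "exactly 7:     @exact\n";
print "4 or more:     @at_least\n";
```

The surrounding \s matters in the first pattern: without it, /[0-9]{2,5}/ would also succeed on "1000000", because any five of its digits satisfy the count.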

(abcd|iou|xyz) match any one of the substrings "abcd", "iou", or "xyz";
for example, /profit|loss|income|expense/i would match "His net income was $37,000" or "the expenses were quite high", but not "SanPaulos seems like a dream"
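
One detail worth illustrating is why the alternatives are usually wrapped in parentheses: | separates everything to its left from everything to its right, so anchors (^ and $, described further down) attach to only one alternative unless a group is used. A minimal sketch, with a made-up test string:

```perl
use strict;
use warnings;

# $text is a hypothetical sample string.
my $text = "total loss";

# Without a group, | splits the whole pattern: (^profit) or (loss$).
my $loose = ($text =~ /^profit|loss$/) ? 1 : 0;

# With a group, ^ and $ apply to both alternatives.
my $tight = ($text =~ /^(profit|loss)$/) ? 1 : 0;

print "ungrouped: $loose, grouped: $tight\n";   # prints "ungrouped: 1, grouped: 0"
```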

\ escapes the special meaning of a regular expression character when followed by one of the regular expression characters + ? . * ( ) { } [ ] | \ or /;
for example, /[0-9]\*/ would match "multiply 2*number_of_years_in_jail", but not "there were 357 orangutangs in my bed"

\r a carriage return

\n a newline (or line feed)

\t a tab

\f a form feed

\d a digit (same as [0-9])

\D a non-digit (same as [^0-9])

\w a word character (same as [0-9a-zA-Z_]; note that the underscore is included)

\W a non-word character (same as [^0-9a-zA-Z_])

\s a whitespace character (same as [ \r\t\n\f], i.e. a space, carriage return, tab, newline, or form feed)

\S a non-space or non-whitespace

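\b a word boundary: the position between a word character (\w) and a non-word character (\W), or between a word character and the beginning or end of the string; it matches a position, not a character

\B any position that is not a word boundary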

^ the beginning of the string

$ the end of the string (or just before a newline at the end of the string)
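
The anchors ^, $ and \b match positions rather than characters, which is easiest to see by counting matches. The following is a sketch with a made-up string and hypothetical variable names:

```perl
use strict;
use warnings;

# $text is a hypothetical sample string.
my $text = "cat concatenate cat";

my @anywhere = $text =~ /cat/g;       # also finds "cat" inside "concatenate"
my @whole    = $text =~ /\bcat\b/g;   # only "cat" as a separate word

my $starts = ($text =~ /^cat/)  ? 1 : 0;   # "cat" is at the start of the string
my $ends   = ($text =~ /nate$/) ? 1 : 0;   # "nate" is not at the end

print scalar(@anywhere), " matches anywhere, ", scalar(@whole), " whole words\n";
print "starts with cat: $starts, ends with nate: $ends\n";
```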

(...) group of characters

$1,...,$9 reference to a group which matched (used by substitute operation);
$1 is the first group that matched within the string, $2 is the second group, etc.

$& the text matched by the most recent successful pattern match

$` portion of the target string to the left of the most recent successful pattern match

$' portion of the target string to the right of the most recent successful pattern match
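
The group variables and the three match variables can be inspected together after a single match. This is a sketch under assumptions: $entry is a made-up input line, and the other variable names are hypothetical.

```perl
use strict;
use warnings;

# $entry is a hypothetical log line used only for illustration.
my $entry = "Date: 2024-03-15 (Friday)";
my ($year, $month, $day, $before, $matched, $after);

if ($entry =~ /(\d{4})-(\d{2})-(\d{2})/) {
    # Copy the special variables right away: a later successful
    # match in the same scope would overwrite them.
    ($year, $month, $day)       = ($1, $2, $3);
    ($before, $matched, $after) = ($`, $&, $');
    print "year=$year month=$month day=$day\n";
    print "before=[$before] matched=[$matched] after=[$after]\n";
}
```

Testing the match in an if before reading $1 and friends is the safer habit, since after a failed match they still hold the values from an earlier success.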

Substitution
form: s/pattern_to_match/substitution_string/options
if the pattern_to_match is found in the current line ($_) then the matched pattern is replaced by substitution_string; by default this substitution is only done once. The options are:
- g "global"; substitute for every match, not just the first
- i ignore case in determining a match
- e evaluate the substitution_string as a Perl expression before performing the replacement
for example,
```
    $_ = "The C++ language is to be used in this course";
    s/C\+\+/Java/;
    print;          # outputs "The Java language is to be used in this course"
```
The substitution operator returns a value equal to the number of substitutions made; for example,
```
    $_ = <STDIN>;                      # input a line into $_
    my $count = s/[aeiou]/x/ig || 0;   # s/// returns an empty string when nothing matched
    print "\nThere were $count vowels in this line\n";
```
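
The g and e options are often combined. The sketch below assumes a made-up input string ($prices) and works on copies so that the original is left untouched; the variable names are hypothetical.

```perl
use strict;
use warnings;

# $prices is a hypothetical input string used only for illustration.
my $prices = "apples 3, pears 12";

# /e evaluates the replacement as Perl code; /g repeats it for every match.
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;

# With /g, the return value is the number of replacements made.
my $masked = $prices;
my $count  = ($masked =~ s/\d/#/g);

print "$doubled\n";                    # prints "apples 6, pears 24"
print "$masked ($count digits)\n";     # prints "apples #, pears ## (3 digits)"
```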
