These classnotes are deprecated. As of 2005, I no longer teach the classes. The notes will remain online for legacy purposes.

UNIX01/Perl Regular Expression Matching


Regular Expression Matching

One thing we've mentioned briefly before without really exploring is Regular Expression Matching. This concept is key to using many UNIX applications, but it is also a very rich and deep subject, so you will explore it more and more as these UNIX courses progress. For now, we will only concern ourselves with the basic matching we will use today. If you would like to immerse yourself in Perl's Regular Expressions, I recommend reading this document: http://www.perldoc.com/perl5.6/pod/perlre.html

What They Are

Regular expressions are a pattern syntax, implemented in Perl and certain other environments, that makes it not only possible but easy to do the following:

  • Complex string comparisons
 $string =~ m/sought_text/;
 # m before the first slash is the "match" operator.
  • Complex string selections
 $string =~ m/whatever(sought_text)whatever2/;
 $soughtText = $1;
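  • Complex string replacements
 $string =~ s/originaltext/newtext/;
 # s before the first slash is the "substitute" operator.
 # (tr/// is the separate "translate" operator; it maps single
 #  characters to other characters and does not use patterns.)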
  • Parsing based on the above abilities
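
The first three abilities can be sketched together as follows. This is a minimal illustration only: the sample string and the variable names are made up, and the replacement uses Perl's s/// substitution operator.

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the class notes.
my $string = "name=Alice; city=Paris";

# Comparison: true if the pattern occurs anywhere in $string.
my $has_city = ($string =~ m/city=/) ? 1 : 0;

# Selection: parentheses capture the matched text into $1.
my $name = '';
if ($string =~ m/name=(\w+);/) {
    $name = $1;
}

# Replacement: s/// replaces the first match of the pattern.
# Copy first so the original string is left untouched.
(my $changed = $string) =~ s/Paris/London/;

print "has_city=$has_city\n";
print "name=$name\n";
print "changed=$changed\n";
```

Only read $1 after checking that the match succeeded; otherwise it may still hold a value from an earlier match.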

Doing String Comparisons

We start with string comparisons because they're the easiest, and yet most of what's contained here is applicable in selecting and replacing text.

Simple String Comparisons

The most basic string comparison is

 $string =~ m/sought_text/;

The above returns true if string $string contains substring "sought_text", false otherwise. If you want only those strings where the sought text appears at the very beginning, you could write the following:

 $string =~ m/^sought_text/;

Similarly, the $ operator indicates "end of string". If you wanted to find out if the sought text was the very last text in the string, you could write this:

 $string =~ m/sought_text$/;

Now, if you want the comparison to be true only if $string contains the sought text and nothing but the sought text, simply do this:

 $string =~ m/^sought_text$/;

Now what if you want the comparison to be case insensitive? All you do is add the letter i after the ending delimiter:

 $string =~ m/^sought_text$/i;
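
To see how the anchors change the result, here is a small sketch that runs the same pattern four ways against a few made-up strings. The helper name anchor_flags and the test words are assumptions for illustration.

```perl
use strict;
use warnings;

# Returns four flags (1 = match, 0 = no match) for the unanchored,
# ^, $, and ^...$ forms of the same pattern.
sub anchor_flags {
    my ($string) = @_;
    return join '',
        ($string =~ m/cat/   ? 1 : 0),
        ($string =~ m/^cat/  ? 1 : 0),
        ($string =~ m/cat$/  ? 1 : 0),
        ($string =~ m/^cat$/ ? 1 : 0);
}

for my $s ('cat', 'catalog', 'bobcat', 'scatter') {
    printf "%-8s %s\n", $s, anchor_flags($s);
}

# The i modifier makes the whole match case insensitive.
my $ci = ('CAT' =~ m/^cat$/i) ? 1 : 0;
print "case-insensitive match: $ci\n";
```

One detail worth knowing: in Perl, $ also matches just before a newline at the very end of the string, so a line read from a file still matches m/cat$/ even if it has not been chomped.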

Using Simple "Wildcards" and "Repetitions"

Calling these "wildcards" may conflict with Perl's formal grammar and terminology, but it is the most intuitive way to think of them, and it will not lead to any coding mistakes.

 .   Match any character
 \w  Match "word" character (alphanumeric plus "_")
 \W  Match non-word character
 \s  Match whitespace character
 \S  Match non-whitespace character
 \d  Match digit character
 \D  Match non-digit character
 \t  Match tab
 \n  Match newline
 \r  Match return
 \f  Match formfeed
 \a  Match alarm (bell, beep, etc)
 \e  Match escape
 \021  Match octal char (in this case 21 octal)
 \xf0  Match hex char (in this case f0 hexadecimal)
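
A quick way to get a feel for these is to test single characters against them. The sketch below uses a hypothetical helper, classify, that reports which class the first character of a string falls into; the order of the tests matters, because every digit is also a word character.

```perl
use strict;
use warnings;

# Hypothetical helper: label the first character of a string.
sub classify {
    my ($char) = @_;
    return 'digit'      if $char =~ m/^\d/;   # test \d before \w
    return 'word'       if $char =~ m/^\w/;
    return 'whitespace' if $char =~ m/^\s/;
    return 'other';
}

for my $c ('7', 'a', '_', "\t", '-') {
    # Show the tab as a visible label instead of printing it raw.
    my $shown = $c eq "\t" ? '\t' : $c;
    printf "%-3s %s\n", $shown, classify($c);
}

# By default the dot matches any character except a newline.
my $dot_matches_newline = ("\n" =~ m/./) ? 1 : 0;

# \x41 is hexadecimal 41, the letter A.
my $hex_matches_A = ('A' =~ m/\x41/) ? 1 : 0;

print "dot matches newline: $dot_matches_newline\n";
print "\\x41 matches A: $hex_matches_A\n";
```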

You can follow any character or wildcard with a repetition. To repeat a series of characters and/or wildcards, wrap the series in parentheses first. Here's where you start getting some power:

 *      Match 0 or more times
 +      Match 1 or more times
 ?      Match 1 or 0 times
 {n}    Match exactly n times
 {n,}   Match at least n times
 {n,m}  Match at least n but not more than m times

Now for some examples:

 $string =~ m/^\s*rem/i;
 # true if the first printable text is rem or REM
 # (the leading ^ is required; without it, rem anywhere would match)

 $string =~ m/^\S{1,8}\.\S{0,3}/;
 # check for DOS 8.3 filename
 #  (note: a few illegal names can sneak through, since \S allows
 #   characters DOS does not and there is no $ at the end)
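
The two examples above can be tried out with a short script. The sketch below is an assumption-laden variation, not the definitive check: the helper names are made up, the 8.3 test is anchored at both ends, the extension is made optional, and \w stands in for the set of legal filename characters, which is a simplification.

```perl
use strict;
use warnings;

# True if the first printable text on the line is rem (any case).
sub is_remark {
    my ($line) = @_;
    return $line =~ m/^\s*rem/i ? 1 : 0;
}

# Stricter 8.3 sketch: 1 to 8 word characters, then optionally a dot
# and up to 3 more, and nothing else before the end of the string.
sub looks_like_8_3 {
    my ($name) = @_;
    return $name =~ m/^\w{1,8}(\.\w{0,3})?$/ ? 1 : 0;
}

for my $name ('AUTOEXEC.BAT', 'README', 'LONGFILENAME.TXT', 'A.HTML') {
    printf "%-18s %d\n", $name, looks_like_8_3($name);
}

# Without the trailing $, an over-long extension still matches,
# because the pattern is satisfied by the prefix "A.HTM".
my $loose = ('A.HTML' =~ m/^\S{1,8}\.\S{0,3}/) ? 1 : 0;
print "unanchored pattern accepts A.HTML: $loose\n";

print "remark: ", is_remark('   REM clear the screen'), "\n";
print "remark: ", is_remark('echo rem'), "\n";
```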

Using Groups ( ) in Matching

Note: Many situations can be handled with either groups ( ) or character classes [ ]. Groups are less quirky and more often yield the results you were looking for.

Groups are regular expression characters surrounded by parentheses. They have two major uses:

  • To allow alternative phrases as in /(Clinton|Bush|Reagan)/i. Note that for single character alternatives, you can also use character classes.
  • As a means of retrieving selected text in selection and substitution, used with the $1, $2, etc. scalars.

This section will discuss only the first use. Powerful regular expressions can be made with groups. At its simplest, you can match a name whether or not its initials are capitalized, like this:

 if($string =~ m/(B|b)ill (C|c)linton/)
  {print "It is Clinton, all right!\n"}

Detect all strings containing vowels:

 if($string =~ m/(A|E|I|O|U|Y|a|e|i|o|u|y)/)
  {print "String contains a vowel!\n"}

Detect if the line starts with the name of any of these three presidents:

 if($string =~ m/^(Clinton|Bush|Reagan)/i)
  {print "$string\n"};
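
The parentheses do real work in that last pattern: they limit how far the | alternatives reach. The sketch below, with a made-up helper name and made-up test lines, compares the grouped pattern with an ungrouped one.

```perl
use strict;
use warnings;

# True if the line starts with one of the three names (any case).
sub starts_with_president {
    my ($line) = @_;
    return $line =~ m/^(Clinton|Bush|Reagan)/i ? 1 : 0;
}

for my $line ('bush signed the bill', 'President Clinton', 'Reaganomics') {
    printf "%-22s %d\n", $line, starts_with_president($line);
}

# Without the group, ^ applies only to the first alternative, so the
# pattern means "starts with Clinton, OR contains Bush anywhere".
my $ungrouped = ('President Bush' =~ m/^Clinton|Bush/i) ? 1 : 0;
my $grouped   = starts_with_president('President Bush');
print "ungrouped: $ungrouped, grouped: $grouped\n";
```

Note that "Reaganomics" matches too: the pattern only requires that the line start with the name, not that the name be a whole word.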

Using Character Classes [ ]

Character classes are alternative single characters within square brackets. If not used carefully, they can yield unexpected results. Remember that groups are an alternative.

Character classes have three main advantages:

  • Shorthand notation, as [AEIOUY] instead of (A|E|I|O|U|Y). This advantage is minor at best.
  • Character Ranges, such as [A-Z].
  • One-to-one mapping from one set of characters to another, as in tr/a-z/A-Z/. (Note that tr takes bare ranges; it does not use the square brackets.)
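
As a rough illustration of ranges and character mapping, the sketch below uses a made-up sample string. It relies on two standard behaviours of Perl's tr operator: it returns the number of characters it matched, and with an empty replacement list it counts without changing the string.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = 'Hello, World';

# A range inside a character class: does the text contain a capital?
my $has_capital = ($text =~ m/[A-Z]/) ? 1 : 0;

# tr maps each character in the first range to the character at the
# same position in the second range. Copy first to keep $text intact.
(my $upper = $text) =~ tr/a-z/A-Z/;

# With an empty replacement list, tr only counts the listed characters.
my $vowel_count = ($text =~ tr/aeiouAEIOU//);

print "has_capital=$has_capital\n";
print "upper=$upper\n";
print "vowels=$vowel_count\n";
```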

Note that everything in the square brackets counts as one character. It may be tempting to do something like this:

 if($string =~ /[Clinton|Bush|Reagan]/){$office = "President"}

The above may even appear to work upon casual testing. Don't do it. Remember that everything inside the brackets represents ONE character, simply listing all its alternative possibilities.
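
Here is a small sketch of why the bracketed version misleads. The test words are made up; any string containing even one of the letters listed inside the brackets (or the | character itself) will match the character class.

```perl
use strict;
use warnings;

# Returns "class,group": whether the character class version and the
# group version of the pattern each match the given string.
sub compare_patterns {
    my ($string) = @_;
    my $class_hit = $string =~ m/[Clinton|Bush|Reagan]/ ? 1 : 0;
    my $group_hit = $string =~ m/(Clinton|Bush|Reagan)/ ? 1 : 0;
    return "$class_hit,$group_hit";
}

for my $s ('Clinton', 'Carter', 'banana', 'XYZ') {
    printf "%-8s %s\n", $s, compare_patterns($s);
}
```

"Carter" and "banana" satisfy the character class because they contain letters such as C, a, or n, even though neither is one of the three names. Only the group version tells them apart.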



Last edited July 26, 2003 3:47 am (diff)
(C) Copyright 2003 Samuel Hart
This work is licensed under a Creative Commons License.