Regular expressions



Last revision August 2, 2004

Table of Contents:
  1. Editor choices on Unix
  2. Characteristics, advantages, and disadvantages of vi
  3. Basic text editing operations in vi
  4. Regular expressions
  5. File searching with grep
  6. More about regular expressions
  7. Intermediate text editing with vi
  8. Vi Quick Reference

A regular expression is a pattern or template used in a string matching or searching operation. Regular expressions are used by many programs that need to search for text in a file or perform substitutions or other operations on text. Such programs include the vi editor, the grep file searching program, and the data matching and manipulation utilities expr, awk, and sed.

For either searching or replacing, regular expressions allow you to work with patterns of characters, not just fixed strings of characters. This allows greater power and flexibility in your commands.

In addition to regular characters, regular expressions contain special characters called "metacharacters". A metacharacter stands for something other than itself, much as a variable in a program stands for a value rather than for its own name.

Remember that many metacharacters in regular expressions also have a special (different) meaning to the shell. So if you are typing a command at the shell prompt, such as grep, that requires a regular expression as an argument, be sure to enclose the regular expression in a pair of single quotes.

In regular expressions, all characters that are not "special" match themselves only. If you want a metacharacter to stand for itself, rather than its special meaning, precede it with the escape character \ (backslash). To match a backslash character itself, you need two in a row (the first "escapes" the special meaning of the second backslash as an escape character).
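The escaping rule can be tried out in a short sketch. Python's re module is used here only as a convenient stand-in for grep or vi (an assumption of this example, not something the tutorial uses), and the sample strings are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
line = r"cost: 3.50 (see C:\temp)"

# An unescaped period is a metacharacter: it matches any single character.
unescaped = re.findall(r"3.5", "3.5 3x5 345")

# A backslash-escaped period matches only a literal period.
escaped = re.findall(r"3\.5", "3.5 3x5 345")

# Two backslashes in the pattern match one literal backslash in the text.
backslash = re.search(r"C:\\temp", line)

print(unescaped)          # ['3.5', '3x5', '345']
print(escaped)            # ['3.5']
print(backslash.group())  # C:\temp
```

The patterns are written as Python raw strings (r"...") so that Python itself does not interpret the backslashes before the regular expression engine sees them, which plays the same role as the single quotes at the shell prompt.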

The basic metacharacters permit matches of arbitrary characters.

. (period) matches any single character.
 
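[list] matches any single character of the set given in list (any one of the characters between the brackets). If the set of possible matching characters is in ascending ASCII collating sequence, you can abbreviate the list as a-b, where a and b are the end points of the sequence you want to allow, for example, [a-z] for all lowercase letters. How to include -, ] or ^ in the list depends on the program. In grep and other POSIX tools, a backslash is not an escape character inside brackets, so put ] first in the list, put - first or last, and put ^ anywhere except first. Some editors, such as Vim, also accept a backslash (\) before these characters.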
 
[^list] matches any single character which is not in list. The syntax for the list is the same as shown above.
 
* An asterisk that follows a single character, a period, or a bracketed list means to match zero or more occurrences of that expression.

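Thus, ab* would match a followed by zero or more occurrences of b. This is different from the behavior of the asterisk when interpreted by the shell as a file name wildcard: there, * on its own stands for any string of characters, whereas in a regular expression it only repeats the expression that precedes it.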

.* would match zero or more occurrences of any character, so it matches anything, including nothing.
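The basic metacharacters described above can be compared side by side in a small sketch. It again assumes Python's re module as a stand-in for grep or vi. The word list and the helper function matching are hypothetical, and re.fullmatch is used so that each pattern must account for the whole word, whereas grep would look for the pattern anywhere in a line.

```python
import re

# Hypothetical word list, chosen only for illustration.
words = ["ct", "cat", "cot", "cut", "c9t", "caat"]

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

any_char = matching(r"c.t", words)       # any one character between c and t
in_list = matching(r"c[ao]t", words)     # only a or o
in_range = matching(r"c[a-z]t", words)   # any lowercase letter
negated = matching(r"c[^a-z]t", words)   # anything except a lowercase letter
star = matching(r"ca*t", words)          # zero or more occurrences of a
dot_star = matching(r"c.*t", words)      # anything, including nothing

print(any_char)   # ['cat', 'cot', 'cut', 'c9t']
print(in_list)    # ['cat', 'cot']
print(in_range)   # ['cat', 'cot', 'cut']
print(negated)    # ['c9t']
print(star)       # ['ct', 'cat', 'caat']
print(dot_star)   # ['ct', 'cat', 'cot', 'cut', 'c9t', 'caat']
```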
 

A second group of metacharacters allows you to "anchor" the match to a location in the line.

^ at the start of a regular expression means that it will only match if it occurs at the beginning of a line.
 
$ at the end of a regular expression means that it will only match if it occurs at the end of a line.
 

^ and $ only have special meanings if used at the beginning or end, respectively, of a regular expression (or for the case of ^, also if at the beginning of a list of characters in square brackets) -- otherwise they are ordinary characters.
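A minimal sketch of anchoring, again assuming Python's re module as a stand-in and using made-up sample lines. One difference to keep in mind: Python treats ^ and $ as anchors wherever they appear outside brackets, so the rule that they are ordinary characters elsewhere in the expression applies to grep and vi but not to this sketch.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["the end", "bathe", "then", "the", "at the start"]

# ^the matches only where "the" begins the line.
starts = [s for s in lines if re.search(r"^the", s)]

# the$ matches only where "the" ends the line.
ends = [s for s in lines if re.search(r"the$", s)]

# Using both anchors requires the line to consist of "the" and nothing else.
whole = [s for s in lines if re.search(r"^the$", s)]

print(starts)  # ['the end', 'then', 'the']
print(ends)    # ['bathe', 'the']
print(whole)   # ['the']
```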

The general rule is that a regular expression matches the longest among the possible leftmost matches in a line. For example, if you use the regular expression t.*e with a substitution command in an editor such as vi, and the next line has "the tree is bare", the expression will match not just the first word "the", but the entire phrase "the tree is bare", which starts with the "t" character, has any number of other characters following, and ends with the "e" character.
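The example above can be checked with a short sketch. It assumes Python's re module as a stand-in for the editor's substitution command. Python's engine is a backtracking one rather than a strict leftmost-longest matcher, but for this pattern its greedy * gives the same result. The replacement text X and the second sample line are hypothetical.

```python
import re

line = "the tree is bare"

# The match starts at the leftmost "t" and extends to the last "e" on the line.
match = re.search(r"t.*e", line)
print(match.group())  # the tree is bare
print(match.span())   # (0, 16)

# A substitution therefore replaces the whole phrase, not just the word "the".
replaced = re.sub(r"t.*e", "X", line, count=1)
print(replaced)       # X

# Text after the last "e" is left alone.
longer = re.sub(r"t.*e", "X", "the tree is bare now", count=1)
print(longer)         # X now
```

To match only the first word, the pattern has to be made more specific, for example t[^ ]*e, which does not let the match run across a space.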
