Harley Hahn's Guide to Unix and Linux

A regular expression is a specification, based on metacharacters and abbreviations, that provides a compact way of unambiguously describing a pattern of characters.

The term is abbreviated as "regex" or, more simply, "re".

Review Question #2:

Within a regular expression, explain what the following metacharacters match:

. ^ $ \< \> [list] [^list]

Explain what the following repetition operators match:

* + ? {n}

Answer

Metacharacter	Meaning
.	match any single character except newline
^	anchor: match the beginning of a line
$	anchor: match the end of a line
\<	anchor: match the beginning of a word
\>	anchor: match the end of a word
[list]	character class: match any character in list
[^list]	character class: match any character not in list

Repetition Operator	Meaning
*	match zero or more times
+	match one or more times
?	match zero or one times
{n}	bound: match exactly n times
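
The two tables above can be tried out with a short script. This is only a sketch: the sample words are made up for illustration, and it assumes a grep that accepts -E for extended regular expressions (needed here for +, ? and {n}). Each count is the number of matching lines.

```shell
# Hypothetical sample text, one item per line.
sample=$(printf '%s\n' 'cat' 'cot' 'coat' 'ct' 'the cat sat')

# . matches exactly one character: cat, cot, the cat sat (not ct, not coat)
dot_hits=$(printf '%s\n' "$sample" | grep -c 'c.t')

# ^ and $ anchor the match to the whole line; [ao] matches a or o: cat, cot
class_hits=$(printf '%s\n' "$sample" | grep -c '^c[ao]t$')

# [^a] matches any one character except a: cot
negated_hits=$(printf '%s\n' "$sample" | grep -c 'c[^a]t')

# * zero or more o's: ct, cot
star_hits=$(printf '%s\n' "$sample" | grep -c '^co*t$')

# + one or more o's: cot
plus_hits=$(printf '%s\n' "$sample" | grep -E -c '^co+t$')

# ? zero or one o: ct, cot
opt_hits=$(printf '%s\n' "$sample" | grep -E -c '^co?t$')

# {3} exactly three lowercase letters on the line: cat, cot
bound_hits=$(printf '%s\n' "$sample" | grep -E -c '^[a-z]{3}$')

echo "$dot_hits $class_hits $negated_hits $star_hits $plus_hits $opt_hits $bound_hits"
```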

Review Question #3:

For each of the following predefined character classes, give the definition and specify the equivalent range:

[:lower:] [:upper:] [:alpha:] [:digit:] [:alnum:]

For example, [:lower:] represents all the lowercase letters; the equivalent range is a-z.

Answer

Class	Meaning	Similar to...
[:lower:]	lowercase letters	a-z
[:upper:]	uppercase letters	A-Z
[:alpha:]	upper- and lowercase letters	A-Za-z
[:digit:]	digits	0-9
[:alnum:]	upper- and lowercase letters, numbers	A-Za-z0-9
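
When you use a predefined character class with grep, the class name goes inside a bracket expression, which is why the brackets are doubled, as in [[:lower:]]. The following sketch uses made-up sample lines to show a few of the classes from the table; each count is the number of matching lines.

```shell
# Hypothetical sample text, one item per line.
sample=$(printf '%s\n' 'abc' 'ABC' 'abc123' '42' '---')

# Lines containing at least one lowercase letter: abc, abc123
lower_hits=$(printf '%s\n' "$sample" | grep -c '[[:lower:]]')

# Lines containing at least one uppercase letter: ABC
upper_hits=$(printf '%s\n' "$sample" | grep -c '[[:upper:]]')

# Lines consisting only of digits (one or more): 42
digit_only=$(printf '%s\n' "$sample" | grep -c '^[[:digit:]][[:digit:]]*$')

# Lines with no letters or digits at all: ---
no_alnum=$(printf '%s\n' "$sample" | grep -v -c '[[:alnum:]]')

echo "$lower_hits $upper_hits $digit_only $no_alnum"
```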

Review Question #4:

By default, your system uses the dictionary collating sequence, but you want to use the C collating sequence. How do you make the change?

What command would you use for the Bourne Shell family? For the C-Shell family?

In which initialization file would you put such a command?

Answer

To change to the C collating sequence, set the environment variable LC_COLLATE to the value C.

For the Bourne shell family, use:

export LC_COLLATE=C

For the C-Shell family, use:

setenv LC_COLLATE C

You would put such a command in your login file (see Chapter 14).
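
One way to see the effect of the C collating sequence is to sort a few letters. This sketch is for the Bourne shell family. It sets LC_COLLATE for a single command rather than exporting it, and it clears LC_ALL for that command, because LC_ALL, when it is set, takes precedence over LC_COLLATE.

```shell
# Sort four letters using the C collating sequence.
# In the C sequence, all uppercase letters come before all lowercase letters.
c_order=$(printf '%s\n' b A a B | LC_ALL= LC_COLLATE=C sort | tr -d '\n')

echo "$c_order"   # prints ABab
```

With the dictionary collating sequence, the same sort would typically keep each uppercase letter next to its lowercase counterpart instead.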

Applying Your Knowledge #1:

Create regular expressions to match:

• "hello"
• the word "hello"
• either the word "hello" or the word "Hello"
• "hello" at the beginning of a line
• "hello" at the end of a line
• a line consisting only of "hello"

Use grep to test your answers.

Answer

The regexes to use are:

hello
\<hello\>
\<[hH]ello\>
^hello
hello$
^hello$
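
To test these with grep, you might use something like the following. The sample lines are made up for illustration, and the \< and \> anchors assume a grep that supports them (GNU grep does). Each count is the number of matching lines.

```shell
# Hypothetical test data, one line per case.
sample=$(printf '%s\n' 'hello' 'hello world' 'say hello' 'Hello there' 'othello' 'helloworld')

# "hello" anywhere: every line except "Hello there"
anywhere=$(printf '%s\n' "$sample" | grep -c 'hello')

# the word "hello": hello, hello world, say hello
word=$(printf '%s\n' "$sample" | grep -c '\<hello\>')

# the word "hello" or "Hello": the three above plus "Hello there"
either=$(printf '%s\n' "$sample" | grep -c '\<[hH]ello\>')

# "hello" at the beginning of a line: hello, hello world, helloworld
at_start=$(printf '%s\n' "$sample" | grep -c '^hello')

# "hello" at the end of a line: hello, say hello, othello
at_end=$(printf '%s\n' "$sample" | grep -c 'hello$')

# a line consisting only of "hello"
whole=$(printf '%s\n' "$sample" | grep -c '^hello$')

echo "$anywhere $word $either $at_start $at_end $whole"
```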

Applying Your Knowledge #2:

Using repetition operators, create regular expressions to match:

• "start", followed by 0 or more numbers, followed by "end"
• "start", followed by 1 or more numbers, followed by "end"
• "start", followed by 0 or 1 number, followed by "end"
• "start", followed by exactly 3 numbers, followed by "end"

Use grep to test your answers.

Hint: Make sure grep is using extended (not basic) regular expressions.

Answer

The regexes to use are:

start[0-9]*end
start[0-9]+end
start[0-9]?end
start[0-9]{3}end
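
A quick way to check these is to run them with grep -E, which selects extended regular expressions. The sample lines below are made up for illustration; each count is the number of matching lines.

```shell
# Hypothetical test data, one line per case.
sample=$(printf '%s\n' 'startend' 'start7end' 'start42end' 'start123end' 'start1234end' 'startXend')

# 0 or more digits: every line except startXend
zero_or_more=$(printf '%s\n' "$sample" | grep -E -c 'start[0-9]*end')

# 1 or more digits: start7end, start42end, start123end, start1234end
one_or_more=$(printf '%s\n' "$sample" | grep -E -c 'start[0-9]+end')

# 0 or 1 digit: startend, start7end
zero_or_one=$(printf '%s\n' "$sample" | grep -E -c 'start[0-9]?end')

# exactly 3 digits: start123end
exactly_three=$(printf '%s\n' "$sample" | grep -E -c 'start[0-9]{3}end')

echo "$zero_or_more $one_or_more $zero_or_one $exactly_three"
```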

Applying Your Knowledge #3:

As we discussed in Chapter 20, the following two commands find all the lines in the file data that contain at least one non-alphabetic character:

grep '[^A-Za-z]' data
grep '[^[:alpha:]]' data

What command would you use to find all the lines that do not contain even a single alphabetic character?

Answer

Use grep with the -v (reverse) option:

grep -v '[A-Za-z]' data
grep -v '[[:alpha:]]' data
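
The difference between the two kinds of command is easy to see with a small test file. In this sketch, the temporary file stands in for the file named data, and its contents are made up for illustration.

```shell
# Create a hypothetical data file with four lines.
data_file=$(mktemp)
printf '%s\n' 'abc' '123' 'a1' '---' > "$data_file"

# Lines with at least one non-alphabetic character: 123, a1, ---
some_nonalpha=$(grep -c '[^[:alpha:]]' "$data_file")

# Lines with no alphabetic characters at all: 123, ---
no_alpha=$(grep -v -c '[[:alpha:]]' "$data_file")

rm "$data_file"
echo "$some_nonalpha $no_alpha"
```

The line a1 is counted by the first command but not the second, because it contains both a letter and a non-letter.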

Applying Your Knowledge #4:

Within the Usenet global discussion group system, freedom of expression is very important. However, it is also important that people should be able to avoid offensive postings if they so desire. The solution is to encode potentially offensive text in such a way that it looks like gibberish to the casual observer. However, the encoded text can be decoded easily by anyone who chooses.

The system used for the coding is called Rot-13. It works as follows. Each letter of the alphabet is replaced by the letter 13 places later in the alphabet, wrapping back to the beginning if necessary:

B → O...   O → B
C → P...   P → C
D → Q...   Q → D...

Create a single command that reads from a file named input, encodes the text using Rot-13, and writes the encoded text to standard output. Then create a command that reads encoded Rot-13 data and converts it back to ordinary text.

Test your solutions by creating a text file named input, encoding it and then decoding it.

Answer

To encode data using Rot-13:

tr A-Za-z N-ZA-Mn-za-m < input

To decode, run the same tr command on the encoded text. Applying Rot-13 twice restores the original text.
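
The round trip can be tested with a short script like this one. It is only a sketch: the temporary directory and the sample sentence are made up for illustration, and the file inside the directory plays the role of the file named input.

```shell
# Create a hypothetical input file in a temporary directory.
workdir=$(mktemp -d)
printf 'Hello, World!\n' > "$workdir/input"

# Encode: each letter is replaced by the letter 13 places later.
encoded=$(tr 'A-Za-z' 'N-ZA-Mn-za-m' < "$workdir/input")

# Decode: run the encoded text through the very same translation.
decoded=$(printf '%s\n' "$encoded" | tr 'A-Za-z' 'N-ZA-Mn-za-m')

rm -r "$workdir"
echo "$encoded"   # prints Uryyb, Jbeyq!
echo "$decoded"   # prints Hello, World!
```

Characters that are not letters, such as the comma, the space and the exclamation mark, pass through unchanged.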

For Further Thought #1:

The term "regular expression" comes from an abstract computer science concept. Is it a good idea or a bad idea to use such names?

Would it make much difference if the term "regular expression" were replaced by a more straightforward name, such as "pattern matching expression" or "pattern matcher"? Why?

Answer

In many cases, it is a bad idea to use names that refer to abstract computer science concepts because they are not meaningful to anyone but abstract computer scientists. For example, the words "regular" and "expression" have a meaning in English that is far removed from how they are used in computer science.

If "regular expression" were replaced by a more straightforward term, it would help newcomers, who would at least have an understanding of what they were learning. Such nomenclature would also make it easier to read and understand documentation that refers to such pattern matching, especially for users who do not have a computer science background.

For Further Thought #2:

With the introduction of locales to support internationalization, regular expression patterns that worked for years stopped working on some systems. In particular, regular expressions that depended on the traditional C collating sequence do not always work with the dictionary collating sequence. In most cases, the solution is to use predefined character classes instead of ranges. For example, you should use [:lower:] instead of a-z. (See Figure 20-3 for the full set.) What do you think of this arrangement?

Give three reasons why the old system was better.

Give three reasons why the new system is better.

Answer

Old system is better because:

• People were used to it

• Shell scripts worked

• Makes sense when you read it

• New names are easy to mistype, especially when you use something like [[:lower:]]

New system is better because:

• Makes regexes independent of the collating sequence

• Can be extended to any language

• Removes any ambiguity
