Harley Hahn's Guide to
Exercises and Answers for Chapter 20... Regular Expressions

Review Question #1: What is a regular expression? What are two common abbreviations for "regular expression"?

Answer
A regular expression is a specification, based on metacharacters and abbreviations, that provides a compact way of unambiguously describing a pattern of characters. It is abbreviated as "regex" or, more simply, "re".

Review Question #2: Within a regular expression, explain what the following metacharacters match: . ^ $ \< \> [list] [^list] Explain what the following repetition operators match: * + ? {n}

Answer
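One way to explore the metacharacters and repetition operators named in Review Question #2 is to try each of them with grep against a few sample lines. The sketch below is only an illustration: the sample words and the helper functions count and count_e are made up for this example, and it assumes a grep that supports \< and \> (GNU grep does).

```shell
# Hypothetical sample data, one entry per line.
sample=$(printf '%s\n' 'cat' 'concat' 'cats and dogs' 'ct' 'caat' 'c9t')

# count REGEX: number of sample lines matching a basic regular expression.
count() { printf '%s\n' "$sample" | grep -c -e "$1"; }

# count_e REGEX: the same, using extended regular expressions (grep -E).
count_e() { printf '%s\n' "$sample" | grep -E -c -e "$1"; }

# Metacharacters
echo "c.t      $(count 'c.t')"       # .       any single character
echo "^cat     $(count '^cat')"      # ^       beginning of the line
echo "cat\$     $(count 'cat$')"      # $       end of the line
echo "\<cat\>  $(count '\<cat\>')"   # \< \>   beginning and end of a word
echo "c[a9]t   $(count 'c[a9]t')"    # [list]  any one character in the list
echo "c[^a]t   $(count 'c[^a]t')"    # [^list] any one character not in the list

# Repetition operators (extended regular expressions)
echo "ca*t     $(count_e 'ca*t')"    # *   zero or more times
echo "ca+t     $(count_e 'ca+t')"    # +   one or more times
echo "ca?t     $(count_e 'ca?t')"    # ?   zero or one time
echo "ca{2}t   $(count_e 'ca{2}t')"  # {n} exactly n times
```

Each output line shows a regex followed by the number of sample lines it matches. Changing the sample data and rerunning is a quick way to check your own explanation of each operator.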
Review Question #3: For each of the following predefined character classes, give the definition and specify the equivalent range: [:lower:] [:upper:] [:alpha:] [:digit:] [:alnum:] For example, [:lower:] represents all the lowercase letters; the equivalent range is a-z.

Answer
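To check a proposed range against its predefined character class, you can compare how many lines each one matches. This is a minimal sketch, not part of the original answer: the sample lines and the helper functions count and compare are hypothetical, and LC_ALL=C is set on the grep command so that the ranges are interpreted using the C collating sequence.

```shell
# Hypothetical sample data, one entry per line.
sample=$(printf '%s\n' 'abc' 'ABC' '123' 'a1B2' '---')

# count REGEX: number of sample lines matching an extended regex, C locale.
count() { printf '%s\n' "$sample" | LC_ALL=C grep -E -c -e "$1"; }

# compare CLASS RANGE: count the lines made up entirely of characters in
# the class, and entirely of characters in the range, side by side.
# Inside a bracket expression the class name keeps its own brackets,
# so the class [:lower:] is written [[:lower:]].
compare() {
    echo "$1: class=$(count "^[[:$1:]]+\$") range=$(count "^[$2]+\$")"
}

compare lower 'a-z'
compare upper 'A-Z'
compare alpha 'A-Za-z'
compare digit '0-9'
compare alnum 'A-Za-z0-9'
```

If a class and its range are equivalent for the sample data, the two counts on each output line are the same.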
Review Question #4: By default, your system uses the dictionary collating sequence, but you want to use the C collating sequence. How do you make the change? What command would you use for the Bourne Shell family? For the C-Shell family? In which initialization file would you put such a command?

Answer
To change to the C collating sequence, set the environment variable LC_COLLATE to the value C.

For the Bourne Shell family, use:
export LC_COLLATE=C

For the C-Shell family, use:
setenv LC_COLLATE C

You would put such a command in your login file (see Chapter 14).

Applying Your Knowledge #1: Create regular expressions to match:
• "hello"
Use grep to test your answers.

Answer
The regexes to use are:
hello
Applying Your Knowledge #2: Using repetition operators, create regular expressions to match:
• "start", followed by 0 or more numbers, followed by "end"
Use grep to test your answers. Hint: Make sure grep is using extended (not basic) regular expressions.

Answer
The regexes to use are:
start[0-9]*end
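As the exercise suggests, you can test this regex with grep. The following is a small sketch using made-up sample lines; the -E option selects extended regular expressions, as the hint recommends.

```shell
# Hypothetical sample data, one entry per line.
sample=$(printf '%s\n' 'startend' 'start42end' 'start4x2end' \
    'the start007end here' 'start end')

# Select the lines that contain "start", then 0 or more digits, then "end".
matches=$(printf '%s\n' "$sample" | grep -E 'start[0-9]*end')
printf '%s\n' "$matches"
```

The lines startend, start42end and "the start007end here" are selected. The other two are rejected because something other than a digit (an x, a space) appears between "start" and "end". The * operator works the same way in basic and extended regular expressions; the -E option matters for operators such as +, ? and {n}.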
Applying Your Knowledge #3: As we discussed in Chapter 20, the following two commands find all the lines in the file data that contain at least one non-alphabetic character:
grep '[^A-Za-z]' data
What command would you use to find all the lines that do not contain even a single alphabetic character?

Answer
Use grep with the -v (reverse) option:
grep -v '[A-Za-z]' data
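The difference between the two commands is easy to miss, so here is a short sketch that runs both against the same made-up input. The sample lines are hypothetical, the data is piped in rather than read from the file data, and LC_ALL=C is set so the ranges use the C collating sequence.

```shell
# Hypothetical sample data, one entry per line.
sample=$(printf '%s\n' 'hello' 'hello123' '12345' '!!! ---')

# Lines that contain at least one non-alphabetic character.
some_nonalpha=$(printf '%s\n' "$sample" | LC_ALL=C grep '[^A-Za-z]')

# Lines that do not contain even a single alphabetic character.
no_alpha=$(printf '%s\n' "$sample" | LC_ALL=C grep -v '[A-Za-z]')

echo "At least one non-alphabetic character:"
printf '%s\n' "$some_nonalpha"
echo "No alphabetic characters at all:"
printf '%s\n' "$no_alpha"
```

The line hello123 appears only in the first list: it contains non-alphabetic characters, but it also contains letters, so grep -v rejects it.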
Applying Your Knowledge #4: Within the Usenet global discussion group system, freedom of expression is very important. However, it is also important that people be able to avoid offensive postings if they so desire. The solution is to encode potentially offensive text so that it looks like gibberish to the casual observer, yet can be decoded easily by anyone who chooses.

The system used for the coding is called Rot-13. It works as follows. Each letter of the alphabet is replaced by the letter 13 places later in the alphabet, wrapping back to the beginning if necessary:
B → O... O → B
Create a single command that reads from a file named input, encodes the text using Rot-13, and writes the encoded text to standard output. Then create a command that reads encoded Rot-13 data and converts it back to ordinary text. Test your solutions by creating a text file named input, encoding it and then decoding it.

Answer
To encode data using Rot-13:
tr A-Za-z N-ZA-Mn-za-m < input

To decode, use the same tr command, reading the encoded data instead. Applying Rot-13 twice moves each letter 26 places, which brings it back to where it started.

For Further Thought #1: The term "regular expression" comes from an abstract computer science concept. Is it a good idea or a bad idea to use such names? Would it make much difference if the term "regular expression" were replaced by a more straightforward name, such as "pattern matching expression" or "pattern matcher"? Why?

Answer
In many cases, it is a bad idea to use names that refer to abstract computer science concepts, because they are not meaningful to anyone but abstract computer scientists. For example, the words "regular" and "expression" have a meaning in English that is far removed from how they are used in computer science. If "regular expression" were replaced by a more straightforward term, it would help newcomers, who would at least have an understanding of what they were learning. Such nomenclature would also make it easier to read and understand documentation that refers to such pattern matching, especially for users who do not have a computer science background.

For Further Thought #2: With the introduction of locales to support internationalization, regular expression patterns that worked for years stopped working on some systems. In particular, regular expressions that depended on the traditional C collating sequence do not always work with the dictionary collating sequence. In most cases, the solution is to use predefined character classes instead of ranges. For example, you should use [:lower:] instead of a-z. (See Figure 20-3 for the full set.) What do you think of this arrangement? Give three reasons why the old system was better. Give three reasons why the new system is better.

Answer
The old system is better because:
• People were used to it
• Shell scripts worked
• Makes sense when you read it
• New names are easy to mistype, especially when you use

The new system is better because:
• Makes regexes independent of the collating sequence
• Can be extended to any language
• Removes any ambiguity
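The C collating sequence mentioned in For Further Thought #2 can be observed directly by sorting a few words. This is only a sketch with made-up sample words. It assumes a Bourne Shell family shell, and it sets LC_ALL to the empty string for the one command because a non-empty LC_ALL in the environment would take precedence over LC_COLLATE.

```shell
# Hypothetical sample data, one word per line.
words=$(printf '%s\n' 'banana' 'Apple' 'apple' 'Banana')

# Sort using the C collating sequence: every uppercase letter comes
# before every lowercase letter.
c_sorted=$(printf '%s\n' "$words" | LC_ALL= LC_COLLATE=C sort)
printf '%s\n' "$c_sorted"

# A predefined character class does not rely on the order of letters:
# count the words that begin with a lowercase letter.
starts_lower=$(printf '%s\n' "$words" | grep -c '^[[:lower:]]')
echo "Words starting with a lowercase letter: $starts_lower"
```

With the C collating sequence the sorted order is Apple, Banana, apple, banana. Running sort without the LC_COLLATE setting on a system that uses the dictionary collating sequence would typically group apple with Apple instead.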
© All contents Copyright 2024, Harley Hahn