Harley Hahn's Guide to Unix and Linux
Chapter 20: Regular Expressions
Regular expressions are used to specify patterns of characters. The most common use for regular expressions is to search for strings of characters. As such, regular expressions are often used in search-and-replace operations. As a tool, regular expressions are so useful and so powerful that, literally, there are entire books and Web sites devoted to the topic. Certainly, mastering the art of using regular expressions is one of the most important things you can do to become proficient with Unix. It is possible to create regular expressions that are very complex. However, most of the time, the regular expressions you require will be simple and straightforward, so all you need to do is learn a few simple rules and then practice, practice, practice. The goal of this chapter is to get you started.

Note: Before we start, there is one thing I want to mention. Within this chapter, I will be showing you a great many examples using the grep command (Chapter 19). If you find that some of the regular expression features don't work with your version of grep, you may have to use egrep or grep -E instead. In such cases, you can set up an alias to use one of these variations automatically. You will find the details in Chapter 19, as part of the discussion of grep and egrep. The reasons for this will be clear when we talk about extended and basic regular expressions later in the chapter.
Within this chapter, I will be showing you examples of regular expressions using the grep command, which we discussed in Chapter 19. Although these are convenient examples, you should know that regular expressions can be used with many different Unix programs, such as the vi and Emacs text editors, less, sed, and many more. In addition, regular expressions can be used with many programming languages, such as Awk, C, C++, C#, Java, Perl, PHP, Python, Ruby, Tcl and VB.NET. The concepts I will be teaching you are typical of regular expressions in general. Once you master the basic rules, all you will ever need to learn are a few variations as the need arises. Although the more esoteric features of regular expressions can vary from one program to another — for example, Perl has a whole set of advanced features — the basic ideas are always the same. If you ever have a problem, all you have to do is check the documentation for the program you are using.

A REGULAR EXPRESSION, often abbreviated as REGEX or RE, is a compact way of specifying a pattern of characters. For example, consider the following set of three character strings: harley1 harley2 harley3 As a regular expression, you could represent this set of patterns as harley[123]. Here is another example. You want to describe the set of character strings consisting of the uppercase letter "H", followed by any number of lowercase letters, followed by the lowercase letter "y". The regular expression to use is H[[:lower:]]*y. As you can see, the power of regular expressions comes from using metacharacters and abbreviations that have special meanings. We will discuss the details in the following sections. For reference, the syntax for using regular expressions is summarized in Figures 20-1, 20-2 and 20-3. Take a moment to skim them now, and you can refer back to them as necessary as you read the rest of the chapter.

Figure 20-1: Regular expressions: Basic matching
Figure 20-2: Regular expressions: Repetition operators
Figure 20-3: Regular expressions: Predefined character classes
The term "regular expression" comes from computer science and refers to a set of rules for specifying patterns. The name comes from the work of the eminent American mathematician and computer scientist Stephen Kleene (1909–1994). (His name is pronounced "Klay-nee".)

In the early 1940s, two neuroscientists, Walter Pitts and Warren McCulloch, developed a mathematical model of how they believed neurons (nerve cells) worked. As part of their model, they used very simple, imaginary machines, called automata. In the mid-1950s, Kleene developed a way of describing automata mathematically using what he called "regular sets": sets that could be described using a small number of simple properties. Kleene then created a simple notation, which he called regular expressions, that could be used to describe such sets.

In 1966, Ken Thompson — the programmer who would later develop Unix — joined Bell Labs. One of the first things he did was to program a version of the QED text editor, which he had used at U.C. Berkeley. Thompson extended QED significantly, adding, among other features, a pattern-matching facility that used an enhanced form of Kleene's regular expressions. Until that time, text editors could only search for exact character strings. Now, using regular expressions, the QED editor could search for patterns as well.

In 1969, Thompson created the first, primitive version of Unix (see Chapters 1 and 2). Not long afterwards, he wrote the first Unix editor, ed (pronounced "ee-dee"), which he designed to use a simplified form of regular expressions, less powerful than those he had used with QED. The ed program was part of what was called UNIX Version 1, which was released in 1971. The popularity of ed led to the use of regular expressions with grep and, later, with many other Unix programs. Today, regular expressions are used widely, not only within Unix, but throughout the world of computing.
(Interestingly enough, the features Thompson left out when he wrote ed have now been added back in.)

hint: The next time you are at a Unix party and people start talking about regular expressions, you can casually remark that they correspond to Type 3 Grammars within the Chomsky hierarchy. Once you have everyone's attention, you can then explain that it is possible to construct a simple mapping between any regular expression and an NFA (nondeterministic finite automaton), because every regular expression has finite length and, thus, a finite number of operators. Within moments, you will be the most popular person in the room.(*)

(*) Well, it's always worked for me.
This section is a reference and, on your first reading, it may be a bit confusing. Don't worry. Later, after you have read the rest of the chapter and practiced a bit, everything will make sense. As we discussed in the previous section, regular expressions became part of Unix when Ken Thompson created the ed text editor, which was released with UNIX Version 1 in 1971. The original regular expression facility was useful, but limited, and over the years, it has been extended significantly. This has given rise to a variety of regex systems of varying complexity, which can be confusing to beginners. For practical purposes, you only need to remember a few fundamental ideas. I'd like to take a moment to discuss these ideas now, so you won't have a problem later on when you practice with the examples in this chapter. Unix supports two major variations of regular expressions: a modern version and an older, obsolete version. The modern version is the EXTENDED REGULAR EXPRESSION or ERE. It is the current standard, and is part of the overall IEEE 1003.2 standard (part of POSIX; see Chapter 11). The older version is the BASIC REGULAR EXPRESSION or BRE. It is a more primitive type of regular expression that was used for many years, until it was replaced by the 1003.2 standardization. Basic regular expressions are less powerful than extended regular expressions, and have a slightly more confusing syntax. For these reasons, BREs are now considered obsolete. They are retained only for compatibility with older programs. In this chapter, I will be teaching you extended regular expressions, the default for modern Unix and Linux systems. However, from time to time, you will encounter an old program that accepts only basic regular expressions. In such cases, I want you to know what you are doing, so here are a few words of explanation. The two commands most likely to give you a problem are grep and sed (both of which are covered in Chapter 19). 
To see what type of regular expressions they support on your system, check your man pages:
man grep
If your system uses the GNU utilities — which is the case with Linux and FreeBSD — you will find that some commands have been updated to offer a -E option, which allows you to use extended regular expressions. For example, this is the case with grep. In general, you can check if a command offers the -E option, either by looking at its man page, or by using the --help option to display its syntax (discussed in Chapter 10). hint With Linux and FreeBSD, some commands offer a -E option to allow you to use extended regular expressions. Since extended regexes are always preferable, you should get in the habit of using -E. If you use such a command regularly, you can create an alias to force -E to be used automatically. For an example of how to do this, see the discussion of grep and egrep in Chapter 19. Even though extended regular expressions are the modern standard, and even though some programs offer the -E option, there will be times when you have no choice but to use basic regular expressions. For example, you may want to use an older program that, on your system, only supports basic regular expressions. (The most common example of this is sed.) In such cases, it behooves you to know what to do, so I'm going to take a moment to explain the difference between basic and extended regular expressions. Of course, this discussion is a bit premature, because we haven't, as yet, talked about the technical details of regexes. However, as I mentioned earlier, if there's anything you don't understand on first reading, you will later, when you come back to it. As I mentioned earlier in the chapter, the power of regular expressions comes from using metacharacters that have special meanings. We will spend a lot of time in this chapter talking about how to use these metacharacters. (For reference, they are summarized in Figures 20-1, 20-2 and 20-3.) 
The chief difference between basic and extended regular expressions is that, with basic regexes, certain metacharacters cannot be used and others must be quoted with a backslash. (Quoting is discussed in Chapter 13.) The metacharacters that cannot be used are the question mark, plus sign, and vertical bar: ? + | The metacharacters that must be escaped are the brace brackets and parentheses: { } ( ) For reference, I have summarized these limitations in Figure 20-4. This summary won't mean much the first time you look at it, but it will make sense by the time you finish the chapter. Figure 20-4: Extended and basic regular expressions
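To see the difference in action, here is a minimal sketch you can try yourself. The file name and contents are made up for illustration; the point is that ERE brace brackets must be escaped with backslashes before a BRE-only program will treat them as metacharacters:

```shell
# Create a small throwaway sample file (hypothetical contents)
printf 'ab\nabb\nabbb\n' > /tmp/bre_demo

# Extended regular expression: the braces are metacharacters as written
grep -E 'ab{2}$' /tmp/bre_demo     # selects "abb"

# Basic regular expression: the same braces must be escaped
grep 'ab\{2\}$' /tmp/bre_demo      # selects "abb"
```

Both commands select only the line "abb": the pattern requires an "a" followed by exactly two "b" characters at the end of the line, so "ab" has too few and "abbb" has a stray "b" before the end of the line.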
As I explained earlier, a regular expression (or regex) is a compact way of specifying a pattern of characters. To create a regular expression, you put together ordinary characters and metacharacters according to certain rules. You then use the regex to search for the character strings you want to find. When a regular expression corresponds to a particular string of characters, we say that it MATCHES the string. For example, the regex harley[123] matches any one of harley1, harley2 or harley3. For now, you don't need to worry about the details, except to realize that harley and 123 are ordinary characters, and [ and ] are metacharacters. Another way of saying this is that, within the regular expression, harley and 123 match themselves, while the [ and ] (bracket) characters have a special meaning. Eventually, you will learn all the metacharacters and their special meanings. In this section, we'll cover certain metacharacters, called ANCHORS, that are used to match locations at the beginning or end of a character string. For example, the regex harley$ matches the string harley, but only if it comes at the end of a line. This is because $ is a metacharacter that acts as an anchor by matching the end of a line. (Don't worry about the details for now.) To begin our adventures with regular expressions, we will start with the basic rule: All ordinary characters, such as letters and numbers, match themselves. Here are a few examples, to show you how it works. Let's say that you have a file named data that contains the following four lines:
Harley is smart
Harley
I like Harley
I like harley
You want to use grep to find all the lines that contain "Harley" anywhere in the line. You would use: grep Harley data In this case, Harley is actually a regular expression that will cause grep to select lines 1, 2 and 3, but not line 4:
Harley is smart
Harley
I like Harley
This is nothing new, but it does show how, within a regular expression, an H matches an "H", an a matches an "a", an r matches an "r", and so on. All regexes derive from this basic idea. To expand the power of regular expressions, you can use anchors to specify the location of the pattern for which you are looking. The ^ (circumflex) metacharacter is an anchor that matches the beginning of a line. Thus, to search for only those lines that start with "Harley", you would use: grep '^Harley' data In our example, this command would select lines 1 and 2, but not 3 or 4 (because they don't start with "Harley"):
Harley is smart
Harley
You will notice that, in the last command, I have quoted the regular expression. You should do this whenever you are using a regex that contains metacharacters, to ensure that the shell will leave these characters alone and pass them on to your program (in this case, grep). If you are not sure if you need to quote the regular expression, go ahead and do it anyway. It can't cause a problem. You will notice that, to be safe, I used strong quotes (single quotes) rather than weak quotes (double quotes). This ensures that all metacharacters, not just some of them, will be quoted properly. (If you need to review the difference between strong quotes and weak quotes, see Chapter 13.) The anchor to match the end of a line is the $ (dollar) metacharacter. For example, to search for only those lines that end with "Harley", you would use: grep 'Harley$' data In our example, this would select only lines 2 and 3:
Harley
I like Harley
You can combine ^ and $ in the same regular expression as long as what you are doing makes sense. For example, to search for all the lines that consist entirely of "Harley", you would use both anchors: grep '^Harley$' data In our example, this would select line 2, because it is the only line that begins and ends with "Harley": Harley Using both anchors with nothing in between is an easy way to look for empty lines. For example, the following command counts all the empty lines in the file data: grep '^$' data | wc -l In a similar fashion, there are anchors you can use to match the beginning or end of a word, or both. To match the beginning of a word, you use the 2-character combination \<. To match the end of a word, you use \>. For example, say you want to search a file named data for all the lines that contain the letters "kn", but only if they occur at the beginning of a word. You would use: grep '\<kn' data To find the letters "ow", but only at the end of a word, use: grep 'ow\>' data To search for complete words, use both \< and \>. For example, to search for "know", but only as a complete word, use: grep '\<know\>' data This command would select the line: I know who you are, and I saw what you did. But it would not select the line: Who knows what evil lurks in the hearts of men? As a convenience, on systems that use the GNU utilities — such as Linux and FreeBSD — you can use \b as an alternate anchor to take the place of both \< and \>. For example, the following commands are equivalent:
grep '\<know\>' data
grep '\bknow\b' data
You can think of \b as meaning "boundary marker". It is up to you to choose which word-boundary anchors you like better. I use \< and \> because, to my eye, they are easier to see. However, many people prefer \b, because it is easier to type, and you can use the same anchor at the beginning and end of a word. When we use regular expressions, the definition of a "word" is more flexible than in English. Within a regex, a WORD is a self-contained, contiguous sequence of characters consisting of letters, numbers, or _ (underscore) characters. Thus, within a regex, all of the following are considered to be words: fussbudget Weedly 1952 error_code_5 This same definition holds within many Unix programs. For example, grep uses this definition when you use the -w option to match complete words. hint When you use grep to look for complete words, it is often easier to use the -w (word) option than it is to use multiple instances of \< and \>, or \b. For example, the following three commands are equivalent:
grep -w 'cat' data
grep '\<cat\>' data
grep '\bcat\b' data
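If you want to convince yourself that -w really does restrict matching to complete words, here is a quick sketch using a throwaway file (the file name and contents are illustrative):

```shell
# Create a sample file: "know" appears as a word and inside "knows"
printf 'I know who you are\nWho knows what evil lurks\n' > /tmp/word_demo

# Without -w, "know" also matches inside the longer word "knows"
grep -c 'know' /tmp/word_demo      # counts 2 matching lines

# With -w, only the complete word "know" matches
grep -cw 'know' /tmp/word_demo     # counts 1 matching line
```

The -c option, which we met in Chapter 19, counts the matching lines rather than displaying them, which makes the difference easy to see.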
Within a regular expression, the metacharacter . (dot) matches any single character except newline. (In Unix, a newline marks the end of a line; see Chapter 7.) For example, say you want to search a file named data for all the lines that contain the following pattern: the letters "Har", followed by any two characters, followed by the letter "y". You would use the command: grep 'Har..y' data This command would find lines that contain, for example: Harley Harxxy Harlly Har12y You will find the . metacharacter to be very useful, and you will use it a lot. Nevertheless, there will be times when you will want to be more specific: a . will match any character, but you may want to match particular characters. For example, you might want to search for an uppercase "H" followed by either "a" or "A". In such cases, you can specify the characters you want to find by placing them within square brackets [ ]. Such a construction is called a CHARACTER CLASS. For example, to search the file data for all lines that contain the letter "H", followed by either "a" or "A", you would use: grep 'H[aA]' data Before moving on, I want to make an important point. Strictly speaking, the character class does not include the brackets. For example, in the previous command, the character class is aA, not [aA]. Although the brackets are required when you use a character class, they are not considered to be part of the character class itself. This distinction will be important later when we talk about special abbreviations called "predefined character classes". To continue, here is an example that uses more than one character class in the same regular expression. The following command searches for lines that contain the word "license", even if it is misspelled by mixing up the "c" and the "s": grep 'li[cs]en[cs]e' data To make the command more useful, we can use \< and \> or \b to match only whole words:
grep '\<li[cs]en[cs]e\>' data
Those two commands will match any of the following: licence license lisence lisense
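Here is a quick way to see the character classes at work on your own system; the file name and contents below are made up for the demonstration:

```shell
# Sample file: two acceptable c/s spellings and one unrelated word
printf 'license\nlisense\nlicker\n' > /tmp/spell_demo

# Each bracketed class matches a single "c" or "s", so both
# spellings are selected, but "licker" is not
grep -c 'li[cs]en[cs]e' /tmp/spell_demo    # counts 2 lines
```

Notice that even though the regex contains two character classes, each class still matches exactly one character per position.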
Some sets of characters are so common, they are given names to make them easy to use. These sets are called PREDEFINED CHARACTER CLASSES, and you see them in Figure 20-3, earlier in the chapter. (Stop and take a look now, before you continue, because I want you to be familiar with the various names and what they mean.) Using predefined character classes is straightforward except for one odd rule: the brackets are actually part of the name. Thus, when you use them, you must include a second set of brackets to maintain the proper syntax. (You will remember that, earlier, I told you that when you use a character class, the outer brackets are not part of the class.) For example, the following command uses grep to find all the lines in the file named data that contain the number 21 followed by a single lower- or uppercase letter: grep '21[[:alpha:]]' data The next command finds all the lines that contain two uppercase letters in a row, followed by a single digit, followed by one lowercase letter: grep '[[:upper:]][[:upper:]][[:digit:]][[:lower:]]' data Aside from predefined character classes, there is another way to specify a set of letters or numbers. You can use a RANGE of characters, where the first and last characters are separated by a hyphen. For example, to search for all the lines of the file data that contain a number from 3 to 7, you can use: grep '[3-7]' data The range 0-9 means the same as [:digit:] . For example, to search for lines that contain an uppercase "X", followed by any two digits, you can use either of these commands:
grep 'X[0-9][0-9]' data
grep 'X[[:digit:]][[:digit:]]' data
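As a small experiment, you can verify that both digits are required; the sample file below is hypothetical:

```shell
# Sample file: only the first line has two digits after the X
printf 'X12\nX1a\nXab\n' > /tmp/class_demo

# Each [[:digit:]] class matches exactly one digit
grep 'X[[:digit:]][[:digit:]]' /tmp/class_demo    # selects "X12"
```

The lines "X1a" and "Xab" are rejected because, in each case, at least one of the two positions after the "X" is not a digit.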
To conclude this section, let us consider one more situation: you want to match characters that are not within a particular character class. You can do so simply by putting a ^ (circumflex) metacharacter after the initial left bracket. In this context, the ^ acts as a negation operator. For example, the following command searches a file named data for all the lines that contain the letter "X", as long as it is not followed by "a" or "o": grep 'X[^ao]' data The next two commands find all the lines that contain at least one non-alphabetic character:
grep '[^A-Za-z]' data
grep '[^[:alpha:]]' data
hint The trick to understanding a complex regular expression is to remember that each character class — no matter how complex it might look — represents only a single character.
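To see negation at work, here is a short sketch; the file name and contents are invented for the example:

```shell
# Sample file: "X" followed by various single characters
printf 'Xa\nXo\nXy\nX9\n' > /tmp/neg_demo

# [^ao] matches any single character except "a" or "o"
grep 'X[^ao]' /tmp/neg_demo
```

This selects the lines "Xy" and "X9". Note that the negated class still has to match some character: a line containing only "X" at the end would not be selected, because there is nothing after the "X" for [^ao] to match.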
At this point, you might be wondering if you can use other ranges instead of predefined character classes. For example, could you use a-z instead of [:lower:]? Similarly, could you use A-Z instead of [:upper:]; or A-Za-z instead of [:alpha:]; or a-zA-Z0-9 instead of [:alnum:]? The answer is, maybe. On some systems it will work; on others it won't. Before I can explain to you why this is the case, we need to talk about the idea of "locales". When you write 0-9, it is an abbreviation for 0123...9, which only makes sense. However, when you write a-z, it does not necessarily mean abcd...z. This is because the order of the alphabet on your particular system depends on what is called your "locale", which can vary from one system to another. Why should this be the case? Before the 1990s, the character encoding used by Unix (and most computer systems) was the ASCII CODE, often referred to as ASCII. The name stands for "American Standard Code for Information Interchange". The ASCII code was created in 1967. It specifies a 7-bit pattern for every character, 128 in all. These bit patterns range from 0000000 (0 in decimal) to 1111111 (127 in decimal). The ASCII code includes all the control characters we discussed in Chapter 7, as well as 95 printable characters: the letters of the alphabet, numbers and punctuation. The printable characters are as follows. (Note that the first character is a space.)
 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
The order of the printable characters is the order in which I have listed them above. They range from character #32 (space) to character #126 (tilde). For reference, you can see a chart of the entire ASCII code in Appendix D. (Take a moment to look at it now.) As a convenience, most Unix systems have a reference page that contains the ASCII code. This is handy, as it allows you to look at the code quickly whenever you want. Unfortunately, the ASCII code page is not standardized, so the way in which you display it depends on which system you are using. See Figure 20-5 for the details. Figure 20-5: Displaying the ASCII code
As we discussed in Chapter 19, within a character coding scheme, the order in which the characters are organized is called a collating sequence. The collating sequence is used whenever you need to put characters in order, for example, when you sort data or when you use a range within a regular expression. When you use Unix or Linux, the collating sequence your system will use depends on your locale. The concept of a locale is part of the POSIX 1003.2 standard, which we discussed in Chapter 11 and Chapter 19. As I explained in Chapter 19, your locale — which is set by an environment variable — tells your programs which language conventions you want to use. This enables anyone in the world to choose a locale to match his or her local language.

On some Unix systems, the locale is set so that the default collating sequence matches the order of the characters in the ASCII code. In particular, as you can see above, all the uppercase letters are grouped together, and they come before the lowercase letters. This sequence is called the C collating sequence (Chapter 19), because it is used by the C programming language. With other Unix systems, including many Linux systems, the locale is set in such a way that the default collating sequence groups the upper- and lowercase letters in pairs: aAbBcCdD...zZ. The advantage of this collating sequence is that it is easy to search for words or characters in the same order as you would find them in a dictionary. Thus, it is called the dictionary collating sequence (Chapter 19). With regular expressions, you can run into problems because Unix expands a-z and A-Z according to whichever collating sequence is used on your system.
If you are using the C collating sequence, all the uppercase letters are in a single group, as are all the lowercase letters. This means that, when you specify all the upper- or lowercase letters, you can use ranges instead of predefined character classes: A-Z instead of [:upper:]; a-z instead of [:lower:]; A-Za-z instead of [:alpha:]; and A-Za-z0-9 instead of [:alnum:]. You can see this in Figure 20-3. If you are using the dictionary collating sequence, the letters will be in a different order: AaBbCcDd...Zz. This means that the ranges will work differently. For example, a-z would represent aBbCc...YyZz. Notice that the uppercase "A" is missing. (Can you see why?) Similarly, A-Z would represent AaBbCc...YyZ. Notice the lowercase "z" is missing. As an example, let's say you want to search the file data for all the lines that contain an upper- or lowercase letter from "A" to "E". If your locale uses the C collating sequence, you would use: grep '[A-Ea-e]' data In this case, the regex is equivalent to ABCDEabcde, and it is what most experienced Unix users would expect. However, if your locale uses the dictionary collating sequence you would use: grep '[A-e]' data Traditionally, Unix has used the C collating sequence, and many veteran Unix users assume that a-z always refers to the lowercase letters (only), and A-Z always refers to the uppercase letters (only). However, this is a poor assumption, as some types of Unix, including many Linux distributions, use the dictionary collating sequence by default, not the C collating sequence. However, regardless of which collating sequence happens to be the default for your locale, there is a way to ensure it is the C collating sequence, which is what most Unix users prefer. First, you need to determine which collating sequence is the default on your system. To do so, create a short file named data using the command: cat > data Type the following two lines and then press ^D to end the command:
A
a
Now type the following command: grep '[a-z]' data If the output consists of both lines of the file (A and a), you are using the dictionary collating sequence. If the output consists of only the a line, you are using the C collating sequence. (Why is this?) If your system uses the C collating sequence, you don't need to do anything. However, please read through the rest of this section, as one day, you will encounter this problem on another system. If your system uses the dictionary collating sequence, you will probably want to change to the C collating sequence. To do this, you set an environment variable named LC_COLLATE to either C or POSIX. Either of the following commands will do the job with the Bourne Shell family:
export LC_COLLATE=C
export LC_COLLATE=POSIX
With the C-Shell family, you would use either of the following:
setenv LC_COLLATE C
setenv LC_COLLATE POSIX
To make the change permanent, all you need to do is put one of these commands into your login file. (The login file is discussed in Chapter 14; environment variables are discussed in Chapter 12.) Once your system uses the C collating sequence, you can substitute the appropriate ranges for the predefined character classes, as shown in Figure 20-3. For the rest of this chapter, I will assume that you are, indeed, using the C collating sequence (so, if you are not, put the appropriate command in your login file right now). To display information about locales on your system, you can use the locale command. The syntax is: locale [-a] Your locale is maintained by setting a number of global variables, including LC_COLLATE. To see the current value of these variables, enter the command by itself: locale To display a list of all the available locales on your system, use the -a (all) option: locale -a
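If you want to force the C collating sequence for just a single command, rather than your whole session, you can set the variable on the command line itself. The sketch below uses LC_ALL, which overrides all the locale categories at once (including LC_COLLATE) for that one command; the sample file is the same two-line file described above:

```shell
# Recreate the two-line test file from this section
printf 'A\na\n' > /tmp/collate_demo

# Force the C collating sequence for this one command only:
# with LC_ALL=C, the range [a-z] means exactly the lowercase letters
LC_ALL=C grep '[a-z]' /tmp/collate_demo    # selects only "a"
```

Placing a variable assignment in front of a command, as shown here, changes the environment for that command alone; your shell's own settings are untouched.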
Once you make sure you are using the C collating sequence (as described in the previous section) you have some flexibility. When you want to match all the upper- or lowercase letters, you can use either a predefined character class or a range. For example, the following two commands both search a file named data for all the lines that contain the letter "H", followed by any lowercase letter from "a" to "z". For example, "Ha", "Hb", "Hc", and so on:
grep 'H[[:lower:]]' data
grep 'H[a-z]' data
The next two commands search for all the lines that contain a single upper- or lowercase letter, followed by a single digit, followed by a lowercase letter:
grep '[A-Za-z][0-9][a-z]' data
grep '[[:alpha:]][[:digit:]][[:lower:]]' data
Here is a more complex example that searches for Canadian postal codes. These codes have the format "letter number letter space number letter number", where all the letters are uppercase, for example, M5P 3G4. Take a moment to analyze both these commands until you understand them completely:
grep '[A-Z][0-9][A-Z] [0-9][A-Z][0-9]' data
grep '[[:upper:]][[:digit:]][[:upper:]] [[:digit:]][[:upper:]][[:digit:]]' data
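To check your analysis, you can run the postal-code pattern against a small sample; the file contents below are invented, and include one valid code, one code missing its space, and one line of plain digits:

```shell
# Sample file: only the first line is a correctly formatted postal code
printf 'M5P 3G4\nM5P3G4\n12345\n' > /tmp/postal_demo

# letter digit letter, space, digit letter digit
grep -c '[A-Z][0-9][A-Z] [0-9][A-Z][0-9]' /tmp/postal_demo    # counts 1 line
```

The second line fails because the literal space in the regex must match a real space, and the third line fails at the very first position, which requires an uppercase letter.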
The choice of which type of character class to use — a range or a predefined name — is up to you. Many old-time Unix users prefer to use ranges, because that's what they learned. Moreover, ranges are easier to type than names, which require colons and extra brackets. (See the previous example.) However, names are more readable, which makes them better to use in shell scripts. Also, names are designed to always work properly, regardless of your locale or your language, so they are more portable. For example, say you are working with text that contains non-English characters, such as é (an "e" with an acute accent). By using [:lower:], you will be sure to pick up the é. This might not be the case if you used a-z.
Within a regular expression, a single character (such as A) or a character class (such as A-Za-z or [:alpha:]) matches only one character. To match more than one character at a time, you use a REPETITION OPERATOR. The most useful repetition operator is the * (star) metacharacter. A * matches zero or more occurrences of the preceding character. (See Chapter 10 for a discussion of the idea of "zero or more".) For example, let's say you want to search a file named data for all the lines that contain the uppercase letter "H", followed by zero or more lowercase letters. You can use either of the following commands:
grep 'H[a-z]*' data
grep 'H[[:lower:]]*' data
These commands will find patterns like: H Har Harley Harmonica Harpoon HarDeeHarHar The most common combination is to use a . (dot) followed by a *. This will match zero or more occurrences of any character. For example, the following command searches for lines that contain "error" followed by zero or more characters, followed by "code": grep 'error.*code' data As an example, this command would select the following lines:
Be sure to document the error code.
The following example searches for lines that contain a colon, followed by zero or more occurrences of any other characters, followed by another colon: grep ':.*:' data At times, you may want to match one or more characters, rather than zero or more. To do so, use the + (plus) metacharacter instead of *. For example, the following commands search the file data and select any lines that contain the letters "variable" followed by one or more numbers:
grep 'variable[0-9]+' data
grep 'variable[[:digit:]]+' data
These commands would select lines such as:
You can use variable1 if you want.
They would not select lines such as:
Remember to use variable 1, 2 or 3.
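Here is a quick way to confirm the difference, using grep -E, since the + operator is an extended regular expression feature:

```shell
# Sample lines; the second one separates "variable" from its number:
printf 'You can use variable1 if you want.\nRemember to use variable 1, 2 or 3.\nvariable42 is set\n' > data

grep -E 'variable[0-9]+' data
# Selects lines 1 and 3: at least one digit must follow "variable"
# immediately, so "variable 1" (with a space) does not match.

rm data
```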
What if you want to match either an uppercase "V" or lowercase "v" at the beginning of the pattern? Just change the first letter to a character class:

grep '[vV]ariable[0-9]+' data

The next repetition operator is the ? (question mark) metacharacter. This allows you to match either zero or one instances of something. Another way to say this is that a ? makes something optional. For example, let's say you want to find all the lines in the file data that contain the word "color" (American spelling) or "colour" (British spelling). You can use:

grep 'colou?r' data

The final repetition operators let you specify as many occurrences as you want by using brace brackets to create what is called a BOUND. There are four different types of bounds. They are:

{n}      match exactly n times
{n,}     match at least n times
{,m}     match at most m times
{n,m}    match at least n times, but no more than m times
Note: The third construction {,m} is not part of the POSIX 1003.2 standard, and will not work with some programs. Here are some examples. The first example matches exactly 3 digits; the second matches at least 3 digits; the third matches 5 or fewer digits; the final example matches 3 to 5 digits.
[0-9]{3}
[0-9]{3,}
[0-9]{,5}
[0-9]{3,5}
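As a quick check of both the ? operator and a simple bound (again with grep -E, since these are extended features), try the following demonstration with invented sample data:

```shell
# The ? makes the "u" optional:
printf 'color\ncolour\ncolouur\n' > data
grep -E 'colou?r' data
# Selects "color" and "colour"; "colouur" has one "u" too many.

# A bound: exactly 3 digits, anchored to the whole line:
printf '12\n123\n1234\n' > data
grep -E '^[0-9]{3}$' data
# Selects only "123".

rm data
```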
To show you how you might use a bound with grep, the following command finds all the lines in a file named data that contain either 2- or 3-digit numbers. Notice the use of \< and \> to match complete numbers:

grep '\<[0-9]{2,3}\>' data

So far, we have used repetition operators with only single characters. You can use them with multiple characters if you enclose the characters in parentheses. Such a pattern is called a GROUP. By creating a group, you can treat a sequence of characters as a single unit. For example, to match the letters "xyz" 5 times in a row, you can use either of the following regular expressions:
xyzxyzxyzxyzxyz
(xyz){5}
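You can confirm that the group form behaves as expected (with grep -E, since groups and bounds are extended features):

```shell
# Five repetitions of the group (xyz) match:
echo 'xyzxyzxyzxyzxyz' | grep -cE '^(xyz){5}$'
# Prints 1.

# Four repetitions do not:
echo 'xyzxyzxyzxyz' | grep -cE '^(xyz){5}$'
# Prints 0.
```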
The last repetition operator, the | (vertical bar) character, allows us to use alternation. That is, we can match either one thing or another. For example, say we want to search a file for all the lines that contain any of the following words: cat, dog, bird, hamster. Using alternation, it's easy:

grep 'cat|dog|bird|hamster' data

Obviously, this is a powerful tool. However, in this case, can you see a problem? We are searching for character strings, not complete words. Thus, the above command would also find lines that contain words like "concatenate" or "dogmatic". To find only complete words, we need to explicitly match word boundaries:

grep '\<(cat|dog|bird|hamster)\>' data

Notice the use of the parentheses to create a group. This allows us to treat the entire pattern as a single unit. Take a moment to think about this until it makes sense. Here is another example. Suppose we want to find all the lines in the file data that contain one of the words "pathname", "filename" or "basename". (These are three technical terms we will meet in Chapter 24.) Either of the following regular expressions will match these words, although the second regex is more compact:
pathname|filename|basename
(path|file|base)name
Adding in the match for word boundaries, here are two equivalent commands to do the job:
grep '\<(pathname|filename|basename)\>' data
grep '\<(path|file|base)name\>' data
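To see what the word boundaries buy you, try this sketch. The \< and \> markers are a common extension supported by GNU grep; the sample lines are invented:

```shell
printf 'my cat sleeps\nconcatenate the files\na dogmatic view\nthe dog barks\n' > data

# Alternation alone matches substrings, so all four lines are selected:
grep -E 'cat|dog' data

# With word boundaries, only the complete words match (lines 1 and 4):
grep -E '\<(cat|dog)\>' data

rm data
```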
To finish this section, I will explain one last metacharacter. As you know, metacharacters have special meaning within a regular expression. The question arises: What if you want to match one of these characters? For example, what if you want to match an actual * (star), . (dot) or | (vertical bar) character? The answer is, you can quote the character with a \ (backslash). This changes it from a metacharacter to a regular character, so it will be interpreted literally. For example, to search the file data for all the lines that contain a "$" character, use:

grep '\$' data

If you want to search for a backslash character itself, just use two backslashes in a row. For example, to find all the lines that contain the characters "\*", followed by any number of characters, followed by one or more letters, followed by "$", use:

grep '\\\*.*[A-Za-z]+\$' data
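Here is a small demonstration of quoting; the sample lines are invented:

```shell
printf 'price is $10\nno dollars here\nstar * here\n' > data

# A quoted $ matches a literal dollar sign (not end-of-line):
grep '\$' data
# Selects "price is $10".

# A quoted * matches a literal star:
grep '\*' data
# Selects "star * here".

rm data
```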
Once you understand the rules, most regular expressions are easy to write. However, they can be hard to read, especially if they are lengthy. In fact, experienced Unix people often have trouble understanding regular expressions they themselves have written(*).

* Footnote: Thus, the old riddle, "If God can do anything, can he create a regular expression even he can't understand?" No doubt this was what Thomas Aquinas was referring to when he wrote in The Summa Theologica: "There may be doubt as to the precise meaning of the word 'all' when we say that God can do all things."

Here is a simple technique I have developed over the years to help understand otherwise cryptic regexes. When you encounter a regular expression that gives you trouble, write it on a piece of paper. Then break the regex into parts, writing the parts vertically, one above the other. Take each part in turn and write its meaning on the same line. For example, consider the regex:

\\\*.*[A-Za-z]+\$

We can break this down as follows:

\\           a backslash
\*           a star
.*           zero or more characters
[A-Za-z]+    one or more letters
\$           a dollar sign
To conclude our discussion, I will show you three interesting puzzles that we will solve using regular expressions. To solve the first two puzzles, we will use a file that, in itself, is interesting: the DICTIONARY FILE. The dictionary file, which has been included with Unix from the very beginning, contains a very long list of English words, including most of the words commonly found in a concise dictionary. Each word is on a line by itself and the lines are in alphabetical order, which makes the file easy to search. Once you get used to using the dictionary file imaginatively, you will be able to do all kinds of amazing things. Some Unix commands, such as look (discussed in Chapter 19), use the dictionary file to do their work. The name of the dictionary file is words. In the early versions of Unix, the words file was stored in a directory named /usr/dict. In recent years, however, the Unix file structure has been reorganized and, on most modern systems — including Linux and FreeBSD — the words file is stored in a directory named /usr/share/dict. On a few systems, such as Solaris, the file is stored in /usr/share/lib/dict. Thus, the pathname of the dictionary file may vary from one system to another. (We'll talk about the Unix file system and pathnames in Chapter 23.) For reference, here are the most likely places you will find the dictionary file:
/usr/share/dict/words
/usr/dict/words
/usr/share/lib/dict/words
In the examples below, I will use the first pathname, which is the most common. If this name doesn't work for you, try one of the others. To start, here is a simple puzzle: What are all the English words that begin with "qu" and end with "y"? To solve this puzzle, all we need to do is grep(*) the dictionary file using the following regular expression:

grep '^qu[a-z]+y$' /usr/share/dict/words

* Footnote: As I mentioned in Chapter 19, the word "grep" is often used as a verb.

To understand this regular expression, we will use the technique I mentioned earlier. We will break the regex into parts and write the parts vertically, one above the other. The breakdown of the regular expression is as follows:

^         beginning of line
qu        the letters "qu"
[a-z]+    one or more lowercase letters
y         the letter "y"
$         end of line
Remember that each line of the dictionary file contains only a single word. Thus, we start our search at the beginning of a line and finish it at the end of a line. The next puzzle is an old one: Find a common English word that contains all five vowels — a, e, i, o, u — in that order. The letters do not have to be adjacent, but they must be in alphabetical order. That is, "a" must come before "e", which must come before "i", and so on. To solve the puzzle, we can grep the dictionary file for any words that contain the letter "a", followed by zero or more lowercase letters, followed by "e", followed by zero or more lowercase letters, and so on. This time, let's start by writing down the various parts, which we will then put together. This is often a useful technique when you are creating a complicated regular expression:

a         the letter "a"
[a-z]*    zero or more lowercase letters
e         the letter "e"
[a-z]*    zero or more lowercase letters
i         the letter "i"
[a-z]*    zero or more lowercase letters
o         the letter "o"
[a-z]*    zero or more lowercase letters
u         the letter "u"
Thus, the full command is:

grep 'a[a-z]*e[a-z]*i[a-z]*o[a-z]*u' /usr/share/dict/words

To avoid undue suspense, I will tell you now that you should find a number of words, most of them obscure. However, there are only three such words that are common. They are(*):
adventitious
facetious
sacrilegious
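If you don't have a dictionary file handy, you can verify the pattern directly; the sample words are fed in by hand:

```shell
# "facetious" has a, e, i, o, u in order, so it matches:
echo 'facetious' | grep -c 'a[a-z]*e[a-z]*i[a-z]*o[a-z]*u'
# Prints 1.

# "education" has all five vowels, but out of order, so it does not:
echo 'education' | grep -c 'a[a-z]*e[a-z]*i[a-z]*o[a-z]*u'
# Prints 0.
```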
* Footnote: Strictly speaking, there are six vowels in English: a, e, i, o, u and (sometimes) y. If you want words that contain all six vowels, just turn these three words into adverbs: "adventitiously", "facetiously" and "sacrilegiously".

Our last puzzle involves a search of the Unix file system for historical artifacts. Many of the original Unix commands were two letters long: the text editor was ed, the copy program was cp, and so on. Let us find all such commands. To solve the puzzle, you need to know that the very oldest Unix programs reside in the /bin directory. To list all the files in this directory, we use the ls command (discussed in Chapter 24):

ls /bin

To analyze the output of ls, we can pipe it to grep. When we do this, ls will automatically place each name on a separate line, because the output is going to a filter. Using grep, we can then search for lines that consist of only two lowercase letters. The full pipeline is as follows:

ls /bin | grep '^[a-z]{2}$'

On some systems, grep will not return the results you want, because it will not recognize the brace brackets as being metacharacters. If this happens to you, you have two choices. You can use egrep instead:

ls /bin | egrep '^[a-z]{2}$'

Or, you can eliminate the need for brace brackets. Simply repeat the character class, and you won't need to use a bound:

ls /bin | grep '^[a-z][a-z]$'

Try these commands on your system, and see what you find. When you see a name and you want to find out more about the command, just look it up in the online manual (Chapter 9). For example:

man ed cp

Aside from the files you will find in /bin, there are other old Unix commands in /usr/bin. To search this directory for 2-character command names, just modify the previous command slightly:

ls /usr/bin | grep '^[a-z]{2}$'

To count how many such commands there are, use grep with the -c (count) option:
ls /bin | grep -c '^[a-z]{2}$'
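Since the contents of /bin vary from one system to another, here is the same idea with a fixed, made-up listing, so you can predict the result:

```shell
# Simulate the output of ls with a few well-known command names:
printf 'ed\ncp\nls\ngrep\nawk\nmv\n' | grep -E '^[a-z]{2}$'
# Selects ed, cp, ls and mv: exactly two lowercase letters each.

# Count them:
printf 'ed\ncp\nls\ngrep\nawk\nmv\n' | grep -cE '^[a-z]{2}$'
# Prints 4.
```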
Note: When you look in /usr/bin, you may find some 2-character commands that are not old. To see if a command dates from the early days of Unix, check its man page.
Review Question #1:
What is a regular expression? What are two common abbreviations for "regular expression"?

Review Question #2:
Within a regular expression, explain what the following metacharacters match: . ^ $ \< \> [list] [^list]
Explain what the following repetition operators match: * + ? {n}

Review Question #3:
For each of the following predefined character classes, give the definition and specify the equivalent range: [:lower:] [:upper:] [:alpha:] [:digit:] [:alnum:]
For example, [:lower:] represents all the lowercase letters; the equivalent range is a-z.

Review Question #4:
By default, your system uses the dictionary collating sequence, but you want to use the C collating sequence. How do you make the change? What command would you use for the Bourne Shell family? For the C-Shell family? In which initialization file would you put such a command?

Applying Your Knowledge #1:
Create regular expressions to match:
• "hello"
Use grep to test your answers.

Applying Your Knowledge #2:
Using repetition operators, create regular expressions to match:
• "start", followed by 0 or more numbers, followed by "end"
Use grep to test your answers. Hint: Make sure grep is using extended (not basic) regular expressions.

Applying Your Knowledge #3:
As we discussed in Chapter 20, the following two commands find all the lines in the file data that contain at least one non-alphabetic character:
grep '[^A-Za-z]' data
grep '[^[:alpha:]]' data
What command would you use to find all the lines that do not contain even a single alphabetic character?

Applying Your Knowledge #4:
Within the Usenet global discussion group system, freedom of expression is very important. However, it is also important that people should be able to avoid offensive postings if they so desire. The solution is to encode potentially offensive text so that it looks like gibberish to the casual observer. However, the encoded text can be decoded easily by anyone who chooses. The system used for the encoding is called Rot-13. It works as follows. Each letter of the alphabet is replaced by the letter 13 places later in the alphabet, wrapping back to the beginning if necessary:
B → O... O → B
Create a single command that reads from a file named input, encodes the text using Rot-13, and writes the encoded text to standard output. Then create a command that reads encoded Rot-13 data and converts it back to ordinary text. Test your solutions by creating a text file named input, encoding it and then decoding it.

For Further Thought #1:
The term "regular expression" comes from an abstract computer science concept. Is it a good idea or a bad idea to use such names? Would it make much difference if the term "regular expression" was replaced by a more straightforward name, such as "pattern matching expression" or "pattern matcher"? Why?

For Further Thought #2:
With the introduction of locales to support internationalization, regular expression patterns that worked for years stopped working on some systems. In particular, regular expressions that depended on the traditional C collating sequence do not always work with the dictionary collating sequence. In most cases, the solution is to use predefined character classes instead of ranges. For example, you should use [:lower:] instead of a-z. (See Figure 20-3 for the full set.) What do you think of this arrangement? Give three reasons why the old system was better. Give three reasons why the new system is better.
© All contents Copyright 2024, Harley Hahn