Chapter 19: Filters: Selecting, Sorting, Combining, and Changing
In this chapter, we conclude our discussion of filters by talking about the most interesting and powerful filters in the Unix toolbox: the programs that select data, sort data, combine data, and change data. These programs are incredibly useful, so it behooves us to take the time to discuss them at length. As you know, powerful programs can take a long time to learn, and that is certainly the case with the filters we will be discussing in this chapter. In fact, these programs are so powerful, you will probably never master all the nuances. That's okay. I'll make sure you understand the basics, and I'll show you a great many examples. Over time, as your skills and your needs develop, you can check the online manual for more advanced details, and you can use the Web and Usenet to look for help from other people. Most important: whenever you get a chance to talk to a Unix geek in person, get him or her to show you their favorite tricks using the filters in this chapter. That is the very best way to learn Unix. This is the last of four chapters devoted to filters (Chapters 16-19). In Chapter 20, we will discuss regular expressions, which are used to specify patterns. Regular expressions increase the power of filters significantly and in Chapter 20, you will find many examples that pertain to the filters in this chapter, particularly grep, perhaps the most important filter of them all.
Related filters: look, strings The grep program reads from standard input or from one or more files, and extracts all the lines that contain a specified pattern, writing the lines to standard output. For example, you might use grep to search 10 long files for all the lines that contain the word Harley. Or, you might use the sort program (discussed later in the chapter) to sort a large amount of data, and then pipe that data to grep to extract all the lines that contain the characters "note:". Aside from searching for specific strings of characters, you can use grep with what we call "regular expressions" to search for patterns. When you do so, grep becomes a very powerful tool. In fact, regular expressions are so important, we will discuss them separately in Chapter 20, where you will see a lot of examples using grep. (In fact, as you will see at the end of this section, the re in the name grep stands for "regular expression".) The syntax for grep is: grep [-cilLnrsvwx] pattern [file...] where pattern is the pattern to search for, and file is the name of an input file. Let's start with a simple example of how you might use grep. In Chapter 11, I explained that most Unix systems keep the basic information about each userid in a file named /etc/passwd. Each userid has one line of information in this file. You can display the information about your userid by using grep to search the file for that pattern. For example, to display information about userid harley, use the command: grep harley /etc/passwd If grep does not find any lines that match the specified pattern, there will be no output or warning message. Like most Unix commands, grep is terse. When there is nothing to say, grep says nothing.(*) * Footnote Wouldn't it be nice if everyone you knew had the same philosophy? When you specify a pattern that contains punctuation or special characters, you should quote them so the shell will interpret the command properly. (See Chapter 13 for a discussion of quoting.) For example, to search a file named info for all the lines that contain a colon followed by a space, use the command: grep ': ' info As useful as grep is for searching individual files, where it really comes into its own is in a pipeline. This is because grep can quickly reduce a large amount of raw data into a small amount of useful information. This is a very important capability that makes grep one of the most important programs in the Unix toolbox. Ask any experienced Unix person, and you will find that he or she would not want to live without grep. It will take time for you to appreciate the power of this wonderful program, but we can start with a few simple examples. When you share a multiuser system with other people, you can use the w program (Chapter 8) to display information about all the users and what they are doing. Here is some sample output:
8:44pm up 9 days, 7:02, 5 users, load: 0.11, 0.02, 0.00
Say that you want to display all the users who logged in during the afternoon or evening. You can search for lines of output that contain the pattern "pm". Use the pipeline: w -h | grep pm (Notice that I used w with the -h option. This suppresses the header, that is, the first two lines.) Using the above data, the output of the previous command would be:
harley ttyp1 5:47pm 15:11 w
Math: problems 12-10 to 12-33, due Monday
To list all the assignments that are not yet finished, enter: grep -v DONE homework The output is:
Math: problems 12-10 to 12-33, due Monday
If you want to see the number of assignments that are not finished, combine -c with -v: grep -cv DONE homework In this case, the output is: 2 On occasion, you may want to find the lines in which the search pattern consists of the entire line. To do so, use the -x option. For example, say the file names contains the lines:
Harley
If you want to find all the lines that contain "Harley", use: grep Harley names If you want to find only those lines in which "Harley" is the entire line, use the ‑x option: grep -x Harley names In this case, grep will select only the first and last lines. To search an entire directory tree (see Chapter 23), use the -r (recursive) option. For example, let's say you want to search for the word "initialize" within all the files in the directory named admin, including all subdirectories, all files in those subdirectories, and so on. You would use: grep -r initialize admin When you use -r on large directory trees, you will often see error messages telling you that grep cannot read certain files, either because the files don't exist or because you don't have permission to read them. (We will discuss file permissions in Chapter 25.) Typically, you will see one of the following two messages:
No such file or directory
If you don't want to see such messages, use the -s (suppress) option. For example, say you are logged in as superuser, and you want to search all the files on the system for the words "shutdown now". As we will discuss in Chapter 23, the designation / refers to the root (main) directory of the entire file system. Thus, if we start from the / directory and use the -r (recursive) option, grep will search the entire file system. The command is: grep -rs 'shutdown now' / Notice I quoted the search pattern because it contains a space. (Quoting is explained in Chapter 13.)
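One more option from the syntax summary is worth a quick look here: -i tells grep to ignore any differences between upper- and lowercase letters. As a small sketch, using the names file from the earlier example, the following command matches "Harley", "harley", "HARLEY", and so on:
grep -i harley names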
In the olden days (the 1970s and 1980s), it was common for people to use two other versions of grep: fgrep and egrep. The fgrep program is a fast version of grep that searches only for "fixed-character" strings. (Hence the name fgrep.) This means that fgrep does not allow the use of regular expressions for matching patterns. When computers were slow and memory was limited, fgrep was more efficient than grep as long as you didn't need regular expressions. Today, computers are fast and have lots of memory, so there is no need to use fgrep. I mention it only for historical reasons. The egrep program is an extended version of grep. (Hence the name egrep.) The original grep allowed only "basic regular expressions". The egrep program, which came later, supported the more powerful "extended regular expressions". We'll discuss the differences in Chapter 20. For now, all you need to know is that extended regular expressions are better, and you should always use them when you have a choice. Modern Unix systems allow you to use extended regular expressions by using either egrep or grep -E. However, most experienced Unix users would rather type grep. The solution is to create an alias (see Chapter 13) to change grep to either egrep or grep -E. With the Bourne shell family, you would use one of the following commands:
alias grep='egrep'
alias grep='grep -E'
With the C-Shell family, you would use one of these commands:
alias grep 'egrep'
alias grep 'grep -E'
Once you define such an alias, you can type grep and get the full functionality of extended regular expressions. To make such a change permanent, all you need to do is put the appropriate alias command into your environment file (see Chapter 14). Indeed, this is such a useful alias that I suggest you take a moment right now and add it to your environment file. In fact, when you get to Chapter 20, I will assume you are using extended regular expressions. Note: If you use Solaris (from Sun Microsystems), the version of egrep you want is in a special directory named /usr/xpg4/bin/(*), which means you must use different aliases. The examples below are only for Solaris. The first one is for the Bourne Shell family; the second is for the C-Shell family:
alias grep='/usr/xpg4/bin/egrep'
alias grep '/usr/xpg4/bin/egrep'
* Footnote The name xpg4 stands for "X/Open Portability Guide, Issue 4", an old (1992) standard for how Unix systems should behave. The programs in this directory have been modified to behave in accordance with the XPG4 standard.
Related filters: grep The look program searches data that is in alphabetical order and finds all the lines that begin with a specified pattern. There are two ways to use look. You can use sorted data from one or more files, or you can have look search a dictionary file (explained in the next section). When you use look to search one or more files, the syntax is: look [-df] pattern [file...] where pattern is the pattern to search for, and file is the name of a file. Here is an example. You are a student at a school where, every term, all the students evaluate their professors. This term, you are in charge of the project. You have a large file called evaluations that contains a summary of the evaluations for over a hundred professors. The data is in alphabetical order. Each line of the file contains a ranking (A, B, C, D or F), followed by two spaces, followed by the name of a professor. For example:
A William Wisenheimer
Your job is to create five lists to post on a Web site. The lists should contain the names of the professors who received an A rating, a B rating, and so on. Since the data is in alphabetical order, you can create the first list (the A professors) by using look to select all the lines of the file that begin with A: look A evaluations Although this command will do the job, we can improve upon it. As I mentioned, each line in the data file begins with a single-letter ranking, followed by two spaces. Once you have the names you want, you can use colrm (Chapter 16) to remove the first three characters of each line. The following examples do just that for each of the rankings: they select the appropriate lines from the data file, use colrm to remove the first three characters from each line, and then redirect the output to a file:
look A evaluations | colrm 1 3 > a-professors
Unlike the other programs covered in this chapter, look cannot read from the standard input: it must take its input from one or more files. This means that, strictly speaking, look is not a filter. The reason for this restriction is that, with standard input, a program can read only one line at a time. However, look uses a search method called a "binary search" that requires access to all the data at once. For this reason, you cannot use look within a pipeline, although you can use it at the beginning of a pipeline. When you have multiple steps, the best strategy is to prepare your data, save it in a file, and then use look to search the file. For example, let's say the four files frosh, soph, junior and senior contain the raw, unsorted evaluation data as described above. Before you can use look to search the data, you must combine and sort the contents of the four files and save the output in a new file, for example:
sort frosh soph junior senior > evaluations
We will discuss the sort program later in the chapter. At that time, you will learn about two particular options that are relevant to look. The -d (dictionary) option tells sort to consider only letters and numbers. You use -d when you want look to ignore punctuation and other special characters. The -f (fold) option tells sort to ignore differences between upper- and lowercase letters. For example, when you use -f, "Harley" and "harley" are considered the same. If you use either of these sort options to prepare data, you must use the same options with look, so look will know what type of data to expect. For example:
sort -df frosh soph junior senior > evaluations
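Once the data has been prepared this way, remember to give look the matching options. For example, a command along the following lines would then select the A professors from data sorted with -df:
look -df A evaluations | colrm 1 3 > a-professors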
Both look and grep select lines from text files based on a specified pattern. For this reason, it makes sense to ask, when do you use look and when do you use grep? Similar questions arise in many situations, because Unix often offers more than one way to solve a problem. For this reason, it is important to be able to analyze your options wisely, so as to pick the best tool for the job at hand. As an example, let us compare look and grep. The look program is limited in three important ways. First, it requires sorted input; second, it can read only from a file, not from standard input; third, it can only search for patterns at the beginning of a line. However, within the scope of these limitations, look has two advantages: it is simple to use and it is very fast. The grep program is a lot more flexible: it does not require sorted input; it can read either from a file or from standard input (which means you can use it in the middle of a pipeline); and it can search for a pattern anywhere, not just at the beginning of a line. Moreover, grep allows "regular expressions", which enable you to specify generalized patterns, not just simple characters. For example, you can search for "the letters har, followed by one or more characters, followed by the letters ley, followed by zero or more numbers". (Regular expressions are very powerful, and we will talk about them in detail in Chapter 20.) By using regular expressions, it is possible to make grep do anything look can do. However, grep will be slower, and the syntax is more awkward. So here is my advice: Whenever you need to select lines from a file, ask yourself if look can do the job. If so, use it, because look is fast and simple. If look can't do the job (which will be most of the time), use grep. As a general rule, you should always use the simplest possible solution to solve a problem. But what about speed? I mentioned that look is faster than grep. How important is that? In the early days of Unix, speed was an important consideration, as Unix systems were shared with other users and computers were relatively slow. When you selected lines of text from a very large file, you could actually notice the difference between look and grep. Today, however, virtually all Unix systems run on computers which, for practical purposes, are blindingly fast. Thus, the speed at which Unix executes a single command — at least for the commands in this chapter — is irrelevant. For example, any example in this chapter will run so quickly as to seem instantaneous. More specifically, if you compare a look command to the equivalent grep command, there is no way you are going to notice the difference in speed. So my advice is to choose your tools based on simplicity and ease of use, not on tiny differences in speed or efficiency. This is especially important when you are writing programs, including shell scripts. If a program or script is too slow, it is usually possible to find one or two bottlenecks and speed them up. However, if a program is unnecessarily complex or difficult to use, it will, in the long run, waste a lot of your time, which is far more valuable than computer time. hint Whenever you have a choice of tools, use the simplest one that will do the job.
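To make the comparison concrete, here is a small preview of Chapter 20. In a regular expression, the character ^ anchors a pattern to the beginning of a line. Thus, assuming the file names is sorted, the following two commands select the same lines; the first is simpler, the second is more flexible:
look Harley names
grep '^Harley' names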
I mentioned earlier that you can use look to search a dictionary file. You do so when you want to find all the words that begin with a specific pattern, for example, all the words that begin with the letters "simult". When you use look in this way, the syntax is simple: look pattern where pattern is the pattern to search for. The "dictionary file" is not an actual dictionary. It is a long, comprehensive list of words that has existed since the early versions of Unix. (Of course, the list has been updated over the years.) The words in the dictionary file are in alphabetical order, one word per line, which makes it easy to search the file using look. The dictionary file was originally created to use with a program named spell, which provided a crude way to spellcheck documents. The job of spell was to display a list of all the words within a document that were not in the dictionary file. In the olden days, spell could save you a lot of time by finding possible spelling mistakes. Today, there are much better spellcheck tools and spell is rarely used: indeed, you won't even find it on most Linux or Unix systems. Instead, people use either the spellcheck feature within their word processor or, with text files, an interactive program called aspell, which is one of the GNU utilities. If you want to try aspell, use: aspell -c file where file is the name of a file containing plain text. The -c option indicates that you want to check the spelling of the words in the file. Although spell is not used anymore, the dictionary file still exists, and you can use it in a variety of ways. In particular, you can use the look program to find all the words that begin with a specific pattern. This comes in handy when you are having trouble spelling a word. For example, say that you want to type the word "simultaneous", but you are not sure how to spell it. Enter: look simult You will see a list similar to the following:
simultaneity
It is now a simple task to pick out the correct word and — if you wish — to copy and paste it from one window to another. (See Chapter 6 for instructions on how to copy and paste.) We'll talk about the dictionary file again in Chapter 20, at which time I'll show you where to find the actual file on your system, and how to use it to help solve word puzzles. (By the way, a "simulty" is a private grudge or quarrel.) hint When you are working with the vi text editor (see Chapter 22), you can display a list of words by using :r! to issue a quick look command. For example: :r !look simult This command inserts all the words that begin with "simult" into your editing buffer. You can now choose the word you want and delete all the others.
Related filters: tsort, uniq The sort program can perform two related tasks: sorting data, and checking to see if data is already sorted. We'll start with the basics. The syntax for sorting data is: sort [-dfnru] [-o outfile] [infile...] where outfile is the name of a file to hold the output, and infile is the name of a file that contains input. The sort program has a great deal of flexibility. You can compare either entire lines or selected portions of each line (fields). The simplest way to use sort is to sort a single file, compare entire lines, and display the results on your screen. As an example, let's say you have a file called names that contains the following four lines:
Barbara
To sort this data and display the results, enter: sort names You will see:
Al
To save the sorted data to a file named masterfile, you can redirect the standard output: sort names > masterfile This last example saves the sorted data in a new file. However, there will be many times when you want to save the data in the same file. That is, you will want to replace a file with the same data in sorted order. Unfortunately, you cannot use a command that redirects the output to the input file: sort names > names You will recall that, in Chapter 15, I explained that when you redirect the standard output, the shell sets up the output file before running the command. In this case, since names is the output file, the shell will empty it before running the sort command. Thus, by the time sort is ready to read its input, names will be empty. Thus, the result of entering this command would be to silently wipe out the contents of your input file (*). * Footnote Unless you have set the noclobber shell variable. See Chapter 15. For this reason, sort provides a special option to allow you to save your output to any file you want. Use -o (output) followed by the name of your output file. If the output file is the same as one of your input files, sort will make sure to protect your data. Thus, to sort a file and save the output in the same file, use a command like the following: sort -o names names In this case, the original data in names will be preserved until the sort is complete. The output will then be written to the file. To sort data from more than one file, just specify more than one input file name. For example, to sort the combined contents of the files oldnames, names and extranames, and save the output in the file masterfile, use: sort oldnames names extranames > masterfile To sort these same files while saving the output in names (one of the input files), use: sort -o names oldnames names extranames The sort program is often used as part of a pipeline to process data that has been produced by another program. The following example combines two files, extracts only those lines that contain the characters "Harley", sorts those lines, and then sends the output to less to be displayed: cat newnames oldnames | grep Harley | sort | less By default, sort looks at the entire line when it sorts data. However, if you want, you can tell sort to examine only one or more fields, that is, parts of each line. (We discussed the concept of fields in Chapter 17, when we talked about the cut program.) The options that allow you to use fields with sort afford a great deal of control. However, they are very complex, and I won't go into the details here. If you ever find yourself needing to sort with fields, you will find the details in the Info file (info sort). If your system doesn't have Info files (see Chapter 9), the details will be in the man page instead (man sort).
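If you would like a small taste of field-based sorting before reading the full documentation, most modern versions of sort accept the -k option followed by a field number. The following sketch is only an illustration; it assumes the evaluations file described earlier in the chapter (ranking, first name, last name) and sorts on the third field, the last name:
sort -k 3 evaluations > by-lastname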
There are a number of options you can use to control how the sort program works: The -d (dictionary) option looks only at letters, numerals and whitespace (spaces and tabs). Use this option when your data contains characters that will get in the way of the sorting process, such as punctuation. The -f (fold) option treats lowercase letters as if they were uppercase. Use this option when you want to ignore the distinctions between upper- and lowercase letters. For example, when you use -f, the words harley and Harley are considered to be the same as HARLEY. (The term "fold" is explained below.) The -n (numeric) option recognizes numbers at the beginning of a line or a field and sorts them numerically. Such numbers may include leading spaces, negative signs and decimal points. Use this option to tell sort that you are using numeric data. For example, let's say you want to sort:
11
If you use sort with no options, the output is:
1
If you use sort -n, you get:
1
The -r (reverse) option sorts the data in reverse order. For example, if you sort the data in the last example using sort -nr, the output is:
20
In my experience, you will find yourself using the -r option a lot more than you might think. This is because it is useful to be able to list information in reverse alphabetical order or reverse numeric order. The final option, -u (unique), tells sort to check for identical lines and suppress all but one. For example, let's say you use sort -u to sort the following data:
Barbara
The output is:
Al
hint As an alternative to sort -u, you can use uniq (discussed later in the chapter). The uniq program is simpler but, unlike sort, it does not let you work with specific fields should that be necessary. What's in a Name? Fold There are a variety of Unix programs that have an option to ignore the differences between upper- and lowercase letters. Sometimes, the option is called -i, for "ignore", which only makes sense. Much of the time, however, the option is -f, which stands for FOLD: a technical term indicating that lowercase letters are to be treated as if they were uppercase, or vice versa, without changing the original data. (The use of the term "fold" in this way has nothing to do with the fold program, so don't be confused.) The term "fold" is most often used as an adjective: "To make sort case insensitive, use the fold option." At times, however, you will see "fold" used as a verb: "When you use the -f option, sort folds lowercase letters into uppercase." Here is something interesting: the original version of the Unix sort program folded uppercase letters into lowercase. That is, when you used -f, sort treated all letters as if they were lowercase. Modern versions of sort fold lowercase into uppercase. That is, they treat all letters as if they were uppercase. Is the difference significant? The answer is, sometimes, as you will see when we discuss collating sequences. By the way, no one knows the origin of the term "fold", so feel free to make up your own metaphor.
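Returning to the options themselves, you can combine them as needed. For example, the following sketch (using a hypothetical file named wordlist) sorts the data while ignoring punctuation and differences in case, and discards duplicate lines:
sort -dfu wordlist > cleaned-wordlist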
As I mentioned earlier, sort can perform two related tasks: sorting data, and checking to see if data is already sorted. In this section, we'll talk about checking data. When you sort in this way, the syntax is: sort -c[u] [file] where file is the name of a file. The -c (check) option tells sort that you don't want to sort the data, you only want to know if it is already sorted. For example, to see if the data within the file names is sorted, you would use: sort -c names If the data is sorted, sort will display nothing. (No news is good news.) If the data is not sorted, sort will display a message, for example: sort: names:5: disorder: Polly Ester In this case, the message means that the data in names is not sorted (that is, there is "disorder"), starting with line 5, which contains the data Polly Ester. You can use sort -c within a pipeline to check data that has been written to standard output by another program. For example, let's say you have a program named poetry-generator that generates a large amount of output. The output is supposed to be sorted, but you suspect there may be a problem, so you check it with sort -c: poetry-generator | sort -c If you combine -c with the -u (unique) option, sort will check your data in two ways at the same time. While it is looking for unsorted data, it will also look for consecutive lines that are the same. You use -cu when you want to ensure (1) your data is sorted, and (2) all the lines are unique. For example, the file friends contains the following data:
Al Packa
You enter: sort -cu friends Although the data is sorted, sort detects a duplicate line: sort: friends:4: disorder: Patty Cake
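Although I won't go into the details here, it is worth knowing that sort -c also reports its result through the command's exit status: 0 (success) when the data is sorted, non-zero otherwise. This makes it easy to test for sorted data within a shell script; a minimal sketch (the || operator runs the second command only if the first one fails):
sort -c names || echo "file names is not sorted"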
Suppose you use the sort program to sort the following data. What will the output be?
zzz
On some systems, you will get:
AAA
On other systems, you will get:
AAA
How can this be? In the early days of Unix, there was just one way of organizing characters. Today, this is not the case, and the results you see when you run sort depend on how characters are organized on your particular system. Here is the story. Before the 1990s, the character encoding used by Unix (and most computer systems) was the ASCII CODE, often referred to as ASCII. The name stands for "American Standard Code for Information Interchange". The ASCII code was created in 1967. It specifies a 7-bit pattern for every character, 128 in all. These bit patterns range from 0000000 (0 in decimal) to 1111111 (127 in decimal). For this reason, the 128 ASCII characters are numbered from 0 to 127. The 128 characters that comprise the ASCII code consist of 33 "control characters" and 95 "printable characters". The control characters were discussed in Chapter 7. The printable characters, shown below, are the 52 letters of the alphabet (26 uppercase, 26 lowercase), 10 numbers, 32 punctuation symbols, and the space character (listed first below):
!"#$%&'()*+,-./0123456789:;<=>?
The order of the printable characters is the order in which I have listed them. They range from character #32 (space) to character #126 (tilde). (Remember, numbering starts at 0, so the space is actually the 33rd character.) For reference, Appendix D contains a table of the entire ASCII code. You may want to take a moment and look at it now. For practical purposes, it is convenient to consider the tab to be a printable character even though, strictly speaking, it is actually a control character. The tab is character #9, which places it before the other printable characters. Thus, I offer the following definition: the 96 PRINTABLE CHARACTERS are the tab, space, punctuation symbols, numbers, and letters. As a convenience, most Unix systems have a reference page showing the ASCII code to allow you to look at it quickly whenever you want. Unfortunately, the ASCII reference page is not standardized, so the way in which you display it depends on which system you are using. See Figure 19-1 for the details. Figure 19-1: Displaying the ASCII code
With respect to a character coding scheme, the order in which the characters are organized is called the COLLATING SEQUENCE. The collating sequence is used whenever you need to put characters in order; for example, when you use the sort program, or when you use a range within a regular expression (discussed in Chapter 20). With the ASCII code, the collating sequence is simply the order in which the characters appear in the code. This is summarized in Figure 19-2. For a more detailed reference, see Appendix D. Figure 19-2: The order of characters in the ASCII code
It is important to be familiar with the ASCII code collating sequence, as it is used by default on many Unix systems and programming languages. Although you don't have to memorize the entire ASCII code, you do need to memorize three basic principles:
• Spaces come before numbers.
• Numbers come before uppercase letters.
• Uppercase letters come before lowercase letters.
Here is an example. Assume that your system uses the ASCII collating sequence. You use the sort program to sort the following data (in which the third line starts with a space):
hello
The output is:
hello
hint When it comes to the order of characters in the ASCII code, all you need to memorize is: Space, Numbers, Uppercase letters, and Lowercase letters, in that order. Just remember "SNUL".(*) * Footnote If you have trouble remembering the acronym SNUL, let me show you a memory trick used by many smart people. All you need to do is relate the item you want to remember to your everyday life. For example, let's say you are a mathematician specializing in difference calculus, and you happen to be working with fourth order difference equations satisfied by those Laguerre-Hahn polynomials that are orthogonal on special non-uniform lattices. To remember SNUL, you would just think of "special non- uniform lattices". See how easy it is to be smart?
In the early days of Unix, everyone used the ASCII code and that was that. However, ASCII is based on English and, as the use of Unix, Linux and the Internet spread throughout the world, it became necessary to devise a system that would work with a large number of languages and a variety of cultural conventions. In the 1990s, a new system was developed, based on the idea of a "locale", part of the POSIX 1003.2 standard. (POSIX is discussed in Chapters 11 and 16.) A LOCALE is a technical specification describing the language and conventions that should be used when communicating with a user from a particular culture. The intention is that a user can choose whichever locale he wants, and the programs he runs will communicate with him accordingly. For example, if a user chooses the American English locale, his programs should display messages in English, write dates in the format "month-day-year", use "$" as a currency symbol, and so on. Within Unix, your locale is defined by a set of environment variables that identify your language, your date format, your time format, your currency symbol, and other cultural conventions. Whenever a program needs to know your preferences, all it has to do is look at the appropriate environment variables. In particular, there is an environment variable named LC_COLLATE that specifies which collating sequence you want to use. (The variables all have default values, which you can change if you want.) To display the current value of all the locale variables on your system — including LC_COLLATE — you use the locale command: locale If you are wondering which locales are supported on your system, you can display them all by using the -a (all) option: locale -a In the United States, Unix systems default to one of two locales. The two locales are basically the same, but have different collating sequences, which means that when you run a program such as sort, your results can vary depending on which locale is being used. Since many people are unaware of locales, even experienced programmers can be perplexed when they change from one Unix system to another and, all of a sudden, programs like sort do not behave "properly". For this reason, I am going to take a moment to discuss the two American locales and explain what you need to know about them. If you live outside the U.S., the ideas will still apply, but the details will be different. The first American locale is based on the ASCII code. This locale has two names. It is known as either the C locale (named after the C programming language) or the POSIX locale: you can use whichever name you want. The second American locale is based on American English, and is named en_US, although you will see variations of this name. The C locale was designed for compatibility, in order to preserve the conventions used by old-time programs (and old-time programmers). The en_US locale was designed to fit into a modern international framework in which American English is only one of many different languages. As I mentioned, both these locales are the same except for the collating sequence. The C locale uses the ASCII collating sequence in which uppercase letters come before lowercase letters: ABC... XYZabc...z. This pattern is called the C COLLATING SEQUENCE, because it is used by the C programming language. The en_US locale uses a different collating sequence in which the lowercase letters and uppercase letters are grouped in pairs: aAbBcCdD... zZ.
This pattern is more natural, as it organizes words and characters in the same order as you would find in a dictionary. For this reason, this pattern is called the DICTIONARY COLLATING SEQUENCE. Until the end of the 1990s, all Unix systems used the C collating sequence, based on the ASCII code, and this is still the case with the systems that use the C/POSIX locale. Today, however, some Unix systems, including a few Linux distributions, are designed to have a more international flavor. As such, they use the en_US locale and the dictionary collating sequence. Can you see a possible source of confusion? Whenever you run a program that depends on the order of upper- and lowercase letters, the output is affected by your collating sequence. Thus, you can get different results depending on which locale your system uses by default. This may happen, for example, when you use the sort program, or when you use certain types of regular expressions called "character classes" (see Chapter 20). For reference, Figure 19-3 shows the two collating sequences. Notice that there are significant differences, not only in the order of the letters, but in the order of the punctuation symbols. Figure 19-3: Collating sequences for the C and en_US locales
As an example of how your choice of locale can make a difference, consider what happens when you sort the following data (in which the third line starts with a space):
hello
With the C locale (C collating sequence), the output is:
hello
With the en_US locale (dictionary collating sequence), the output is:
hello
So which locale should you use? In my experience, if you use the en_US locale, you will eventually encounter unexpected problems that will be difficult to track down. For example, as we will discuss in Chapter 25, you use the rm (remove) program to delete files. Let's say you want to delete all your files whose names begin with an uppercase letter. The traditional Unix command to use is: rm [A-Z]* This will work fine if you are using the C locale. However, if you are using the en_US locale, you will end up deleting all the files whose names begin with any upper- or lowercase letter, except the letter a. (Don't worry about the details; they will be explained in Chapter 20.)(*) * Footnote There are lots of situations in which the C locale works better than the en_US locale. Here is another one: You are writing a C or C++ program. In your directory, you have files containing code with names that have all lowercase letters, such as program1.c, program2.cpp, data.h, and so on. You also have extra files with names that begin with an uppercase letter, such as Makefile, RCS, README. When you list the contents of the directory using the ls program (Chapter 24), all the "uppercase" files will be listed first, separating the extra files from the code files. My advice is to set your default to be the C locale, because it uses the traditional ASCII collating sequence. In the long run, this will create fewer problems than using the en_US locale and the dictionary collating sequence. In fact, as you read this book, I assume that you are using the C locale. So how do you specify your locale? The first step is to determine which collating sequence is the default on your system. If the C locale is already the default on your system, fine. If not, you need to change it. One way to determine your default collating sequence is to enter the locale command and check the value of the LC_COLLATE environment variable. Is it C or POSIX? Or is it some variation of en_US? Another way to determine your default collating sequence is to perform the following short test. Create a small file named data using the command: cat > data Type the following three lines and then press ^D to end the command:
AAA
Now sort the contents of the file: sort data If you are using the C/POSIX locale, the output will be sorted using the C (ASCII) collating sequence:
AAA
If you are using the en_US locale, the output will be sorted using the dictionary collating sequence:
Before you continue, take a moment to look at the collating sequences in Figure 19-3 and make sure these examples make sense to you. If your Unix system uses the C or POSIX locale by default, you don't need to do anything. (However, please read through the rest of this section, as one day, you will encounter this problem on another system.) If your system uses the en_US locale, you need to change the LC_COLLATE environment variable to either C or POSIX. Either of the following commands will do the job with the Bourne Shell family:
export LC_COLLATE=C
export LC_COLLATE=POSIX
With the C-Shell family, you would use:
setenv LC_COLLATE C
To make the change permanent, all you need to do is put one of these commands into your login file. (Environment variables are discussed in Chapter 12; the login file is discussed in Chapter 14.) For the rest of this book, I will assume that you are, indeed, using the C collating sequence so, if you are not, put the appropriate command in your login file right now. hint From time to time, you may want to run a single program with a collating sequence that is different from the default. To do so, you can use a subshell to change the value of LC_COLLATE temporarily while you run the program. (We discuss subshells in Chapter 15.) For example, let's say you are using the C locale, and you want to run the sort program using the en_US (dictionary) collating sequence. You can use: (export LC_COLLATE=en_US; sort data) When you run the program in this way, the change you make to LC_COLLATE is temporary, because it exists only within the subshell.
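With the Bourne Shell family, there is also a shorter idiom you will often see: placing the variable assignment on the same line, directly in front of the command, changes the variable only for that one command. For example:
LC_COLLATE=en_US sort data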
Related filters: sort Unix has a number of specialized filters designed to work with sorted data. The most useful such filter is uniq, which examines data line by line, looking for consecutive, duplicate lines. The uniq program can perform four different tasks:
• Eliminate duplicate lines
• Display only the duplicate lines
• Display only the unique (non-duplicate) lines
• Count how many times each line appears
The syntax is: uniq [-cdu] [infile [outfile]] where infile is the name of an input file, and outfile is the name of an output file. Let's start with a simple example. The file data contains the following lines:
Al
(Remember, because input for uniq must be sorted, duplicate lines will be consecutive.) You want a list of all the lines in the file with no duplications. The command to use is: uniq data The output is straightforward:
Al
If you want to save the output to another file, say, processed-data, you can specify its name as part of the command: uniq data processed-data To see only the duplicate lines, use the -d option: uniq -d data Using our sample file, the output is:
Al
To see only the unique (non-duplicate) lines, use -u: uniq -u data In our sample, there is only one such line: Charles Question: What do you think happens if you use both -d and -u at the same time? (Try it and see.) To count how many times each line appears, use the -c option: uniq -c data With our sample, the output is:
2 Al
So far, our example has been simple. The real power of uniq comes when you use it within a pipeline. For example, it is common to combine and sort several files, and then pipe the output to uniq, as in the following two examples:
sort file1 file2 file3 | uniq
hint If you are using uniq without options, you have an alternative. You can use sort -u instead. For example, the following three commands all have the same effect:
sort -u file1 file2 file3
sort file1 file2 file3 | uniq
cat file1 file2 file3 | sort | uniq
(See the discussion on sort -u earlier in the chapter.) Here is a real-life example to show you how powerful such constructions can be. Ashley is a student at a large Southern California school. During the upcoming winter break, her cousin Jessica will be coming to visit from the East Coast, where she goes to a small, progressive liberal arts school. Jessica wants to meet guys, but she is very picky: she only likes very smart guys who are athletic. It happens that Ashley is on her sorority's Ethics Committee, which gives her access to the student/academic database (don't ask). Using her special status, Ashley logs into the system and creates two files. The first file, math237, contains the names of all the male students taking Math 237 (Advanced Calculus). The second file, pe35, contains the names of all the male students taking Physical Education 35 (Surfing Appreciation). Ashley's idea is to make a list of possible dates for Jessica by finding all the guys who are taking both courses. Because the files are too large to compare by hand, Ashley (who is both beautiful and smart) uses Unix. Specifically, she uses the uniq program with the -d option, saving the output to a file named possible-guys: sort math237 pe35 | uniq -d > possible-guys Ashley then emails the list to Jessica, who is able to check out the guys on Instagram before her trip. hint You must always make sure that input to uniq is sorted. If not, uniq will not be able to detect the duplications. The results will not be what you expect, but there will be no error message to warn you that something has gone wrong.(*) * Footnote This is one time where — even for Ashley and Jessica — a "sorted" affair is considered to be a good thing.
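Before we leave uniq, here is one more pipeline pattern that comes up again and again. If you combine uniq -c with a numeric, reversed sort, you can see which lines occur most often in your data. As a sketch, assuming a hypothetical file named requests, the following pipeline displays the lines in order of frequency, starting with the most common:
sort requests | uniq -c | sort -nr | less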
Related filters: colrm, cut, paste Of all the specialized Unix filters designed to work with sorted data, the most interesting is join, which combines two sorted files based on the values of a particular field. The syntax is: join [-i] [-a1|-v1] [-a2|-v2] [-1 field1] [-2 field2] file1 file2 where field1 and field2 are numbers referring to specific fields; and file1 and file2 are the names of files containing sorted data. Before we get to the details, I'd like to show you an example. Let's say you have two sorted files containing information about various people, each of whom has a unique identification number. Within the first file, called names, each line contains an ID number followed by a first name and last name:
111 Hugh Mungus
In the second file, phone, each line contains an ID number followed by a phone number:
111 101-555-1111
The join program allows you to combine the two files, based on their common values, in this case, the ID number: join names phone The output is:
111 Hugh Mungus 101-555-1111
When join reads its input, it ignores leading whitespace, that is, spaces or tabs at the beginning of a line. For example, the following two lines are considered the same:
111 Hugh Mungus 101-555-1111
Before we discuss the details of the join program, I'd like to take a moment to go over some terminology. In Chapter 17, we talked about the idea of fields and delimiters. When you have a file in which every line contains a data record, each separate item within the line is called a field. In our example, each line in the file names contains three fields: an ID number, a first name, and a last name. The file phone contains two fields: an ID number and a phone number. Within each line, the characters that separate fields are called delimiters. In our example, the delimiters are spaces, although you will often see tabs and commas used in this way. By default, join assumes that each pair of fields is separated by whitespace, that is, by one or more spaces or tabs. When we combine two sets of data based on matching fields, it is called a JOIN. (The name comes from database theory.) The specific field used for the match is called the JOIN FIELD. By default, join assumes that the join field is the first field of each file but, as you will see in a moment, this can be changed. To create a join, the program looks for pairs of lines, one from each file, that contain the same value in their join field. For each pair, join generates an output line consisting of three parts: the common join field value, the rest of the line from the first file, and the rest of the line from the second file. As an example, consider the first line in each of the two files above. The join field has the value 111. Thus, the first line of output consists of 111, a space, Hugh, a space, Mungus, a space, and 101-555-1111. (By default, join uses a single space to separate fields in the output.) In the example above, every line in the first file matches a line in the second file. However, this might not always be the case. For example, consider the following files. You are making a list of your friends' birthdays and their favorite gifts. The first file, birthdays, contains two fields: first name and birthday:
Al May-10-1987
The second file, gifts, also contains two fields: first name and favorite gift:
Al money
In this case, you have birthday information and gift information for Al, Barbara and Dave. However, you do not have gift information for Frances and George, and you do not have birthday information for Edward. Consider what happens when you use join: join birthdays gifts Because only three lines have matching join fields (the lines for Al, Barbara and Dave), there are only three lines of output:
Al May-10-1987 money
However, suppose you want to see all the people with birthday information, even if they do not have gift information. You can use the -a (all) option, followed by a 1: join -a1 birthdays gifts This tells join to output all the names in file #1, even if there is no gift information:
Al May-10-1987 money
Similarly, if you want to see all the people with gift information (from file #2), even if they do not have birthday information, you can use -a2: join -a2 birthdays gifts The output is:
Al May-10-1987 money
To list all the names from both files, use both options: join -a1 -a2 birthdays gifts The output is:
Al May-10-1987 money
When you use join in the regular way (without the -a option) as we did in our first example, the result is called an INNER JOIN. (The term comes from database theory.) With an inner join, the output comes only from lines where the join field matched. When you use either -a1 or -a2, the output includes lines in which the join field did not match. We call this an OUTER JOIN. I won't go into the details because a discussion of database theory, however interesting, is beyond the scope of this book. All I want you to remember is that, if you work with what are called "relational databases", the distinction between inner and outer joins is important. To continue, if you want to see only those lines that don't match, you can use the -v1 or -v2 (reverse) options. When you use -v1, join outputs only those lines from file #1 that don't match, leaving out all the matches. For example: join -v1 birthdays gifts The output is:
Frances Oct-15-1991
When you use -v2, you get only those lines from file #2 that don't match: join -v2 birthdays gifts The output is:
Charles music
Of course, you can use both options to get all the lines from both files that don't match: join -v1 -v2 birthdays gifts The output is now:
Charles music
Because join depends on its data being sorted, there are several options to help you control the sorting requirements. First, you can use the -i (ignore) option to tell join to ignore any differences between upper and lower case. For example, when you use this option, CHARLES is treated the same as Charles. hint You will often use sort to prepare data for join. Remember: With sort, you ignore differences in upper and lower case by using the -f (fold) option. With join, you use the -i (ignore) option. (See the discussion on "fold" earlier in the chapter.) I mentioned earlier that join assumes the join field is the first field of each file. You can specify that you want to use different join fields by using the -1 and -2 options. To change the join field for file #1, use -1 followed by the number of the field you want to use. For example, the following command joins two files, data and statistics, using the 3rd field of file #1 and (by default) the 1st field of file #2: join -1 3 data statistics To change the join field for file #2, use the -2 option. For example, the following command joins the same two files using the 3rd field of file #1 and the 4th field of file #2: join -1 3 -2 4 data statistics To conclude our discussion, I would like to remind you that, because join works with sorted data, the results you get may depend on your locale and your collating sequences; that is, on the value of the LC_COLLATE environment variable. (See the discussion about locales earlier in the chapter.) hint The most common mistake in using join is forgetting to sort the two input files. If one or both of the files are not sorted properly with respect to the join fields, you will see either no output or partial output, and there will be no error message to warn you that something has gone wrong.
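To guard against that mistake, a typical join session prepares both files first and then joins them. Here is a sketch using the birthdays and gifts files from the examples above:
sort -o birthdays birthdays
sort -o gifts gifts
join birthdays gifts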
Related filters: sort Consider the following problem. You are planning your evening activities, and you have a number of constraints:
• You must clean the dishes before you can watch TV.
As you can see, this is a bit confusing. What you need is a master list that specifies when each activity should be done, such that all of the constraints are satisfied. In mathematical terms, each of these constraints is called a PARTIAL ORDERING, because they specify the order of some (but not all) of the activities. In our example, each of the partial orderings specifies the order of two activities. Should you be able to construct a master list, it would be a TOTAL ORDERING, because it would specify the order of all of the activities. The job of the tsort program is to analyze a set of partial orderings, each of which represents a single constraint, and calculate a total ordering that satisfies all the constraints. The syntax is simple: tsort [file] where file is the name of a file. Each line of input must consist of a pair of character strings separated by whitespace (spaces or tabs), such that each pair represents a partial ordering. For example, let's say that the file activities contains the following data:
clean-dishes watch-TV
Notice that each line in the file consists of two character strings separated by whitespace (in this case, a single space). Each line represents a partial ordering that matches one of the constraints listed above. For example, the first line says that you must clean the dishes before you can watch TV; the second line says you must eat before you can clean the dishes; and so on. The tsort program will turn the set of partial orderings into a single total ordering. Use the command: tsort activities The output is:
shop
Thus, the solution to the problem is:
• Shop
In general, any set of partial orderings can be combined into a total ordering, as long as there are no loops. For example, consider the following partial orderings:
study watch-TV
There can be no total ordering, because you can't study before you watch TV, if you insist on watching TV before you study (although many people try). If you were to send this data to tsort, it would display an error message telling you the input contains a loop. What's in a Name? tsort Mathematically, it is possible to represent a set of partial orderings using what is called a "directed graph". If there are no loops, it is called a "directed acyclic graph" or DAG. For example, a tree (see Chapter 9) is a DAG. Once you have a DAG, you can create a total ordering out of the partial orderings by sorting the elements of the graph based on their relative positions, rather than their values. In fact, this is how tsort does its job (although we don't need to worry about the details). In mathematics, we use the word "topological" to describe properties that depend on relative positions. Thus, tsort stands for "topological sort".
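By the way, tsort, like most of the filters in this chapter, reads from standard input when you do not name a file, and its input is simply a sequence of strings taken two at a time. Thus, you can experiment without creating a file at all; a quick sketch:
echo "eat clean-dishes clean-dishes watch-TV" | tsort
The output is eat, clean-dishes and watch-TV, one per line.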
Related filters: grep To use the strings program, you need to understand the difference between text files and binary files. Consider the following three definitions: 1. There are 96 printable characters: tab, space, punctuation symbols, numbers, and letters. Any sequence of printable characters is called a CHARACTER STRING or, more informally, a STRING. For example, "Harley" is a string of length 6. (We discussed printable characters earlier in the chapter.) 2. A file that contains only printable characters (with a newline character at the end of each line) is called a TEXT FILE or an ASCII FILE. For the most part, Unix filters are designed to work with text files. Indeed, within this chapter, all the sample files are text files. 3. A BINARY FILE is any non-empty file that is not a text file, that is, any file that contains at least some non-textual data. Some common examples of binary files are executable programs, object files, images, sound files, video files, word processing documents, spreadsheets and databases. If you are a programmer, you will work with executable programs and object files ("pieces" of programs), all of which are binary files. If you could look inside an executable program or an object file, most of what you would see would be encoded machine instructions, which look like meaningless gibberish. However, most programs do contain some recognizable character strings such as error messages, help information, and so on. The strings program was created as a tool for programmers to display character strings that are embedded within executable programs and object files. For example, there used to be a custom that programmers would insert a character string into every program showing the version of that program. This allowed anyone to use strings to extract the version of a program from the program itself. Today, programmers and users have better ways to keep track of such information(*) and the strings program is not used much. Still, you can use it, just for fun, to look "inside" any type of binary file. Although there is rarely a practical reason for doing so, it is cool to check out binary files for hidden messages. The syntax is: strings [-length] [file...] where length is the minimum length character string to display, and file is the name of a file, most often a pathname. * Footnote As we discussed in Chapter 10, most of the GNU utilities (used with Linux and FreeBSD) support the --version option to display version information. As an example, let's say you want to look inside the sort program. To start, you use the whereis program to find the pathname — that is, the exact location — of the file that contains the program. (We'll discuss pathnames and whereis in Chapter 24, so don't worry about the details for now.) The command to use is: whereis sort Typical output would be: sort: /usr/bin/sort /usr/share/man/man1/sort.1.gz The output shows us the exact locations of the program and its man page. We are only interested in the program, so to use strings to look inside the sort program, we use: strings /usr/bin/sort Such commands usually generate a lot of output. There are, however, three things you can do to make the output more manageable. First, by default, strings will only extract character strings that are at least 4 characters long. The idea is to eliminate short, meaningless sequences of characters. Even so, you are likely to see a great many spurious character strings. However, you can eliminate a lot of them by specifying a longer minimum length. 
To do so, you use an option consisting of hyphen (-) followed by a number. For example, to specify that you only want to see strings that are at least 7 characters long (a good number), you would use: strings -7 /usr/bin/sort Next, you can sort the output and remove duplicate lines. To do so, just pipe the output to sort -iu (discussed earlier in the chapter): strings -7 /usr/bin/sort | sort -iu Finally, if there is so much output that it scrolls off your screen before you can read it, you can use less (Chapter 21) to display the output one screenful at a time: strings -7 /usr/bin/sort | sort -iu | less If the idea of looking inside programs for hidden messages appeals to you, here is an easy way to use strings to explore a variety of programs. The most important Unix utilities are stored in the two directories /bin and /usr/bin. (We will discuss this in Chapter 23.) Let's say you want to look inside some of the programs in these directories. To start, enter either of the following two cd (change directory) commands. This will change your "working directory" to whichever directory you choose:
cd /bin
cd /usr/bin
Now use the ls (list) program to display a list of all the files in that directory: ls All of these files are programs, and you can use strings to look at any of them. Moreover, because the files are in your working directory, you don't have to specify the entire pathname. In this case, the file name by itself is enough. For example, if your working directory is /bin, where the date program resides, you can look inside the date program by using the command: strings -7 date | sort -iu | less In this way, you can look for hidden character strings inside the most important Unix utilities. Once you are finished experimenting, enter the command: cd This will change your working directory back to your home directory (explained in Chapter 23).
Related filters: sed The tr (translate) program can perform three different operations on characters. First, it can change characters to other characters. For example, you might change lowercase characters to uppercase characters, or tabs to spaces. Or, you might change every instance of the number "0" to the letter "X". When you do this, we say that you TRANSLATE the characters. Second, you can specify that if a translated character occurs more than once in a row, it should be replaced by only a single character. For example, you might replace one or more numbers in a row by the letter "X". Or, you might replace multiple spaces by a single space. When you make such a change, we say that you SQUEEZE the characters. Finally, tr can delete specified characters. For example, you might delete all the tabs in a file. Or, you might delete all the characters that are not letters or numbers. In the next few sections, we will examine each of these operations in turn. Before we start, however, let's take a look at the syntax: tr [-cds] [set1 [set2]] where set1 and set2 are sets of characters(*). * Footnote If you are using Solaris, you should use the Berkeley Unix version of tr. Such programs are stored in the directory /usr/ucb, so all you have to do is make sure this directory is at the beginning of your search path. (The name ucb stands for University of California, Berkeley.) We discuss Berkeley Unix in Chapter 2, and the search path in Chapter 13. Notice that the syntax does not let you specify a file name, either for input or output. This is because tr is a pure filter that reads only from standard input and writes only to standard output. If you want to read from a file, you must redirect standard input; if you want to write to a file (to save the output), you must redirect standard output. This will make sense when you see the examples. (Redirection is explained in Chapter 15.) The basic operation performed by the tr program is translation. You specify two sets of characters. As tr reads the data, it looks for characters in the first set. Whenever tr finds such characters, it replaces them with corresponding characters from the second set. For example, say you have a file named old. You want to change all the "a" characters to "A". The command to do so is: tr a A < old To save the output, just redirect it to a file, for example: tr a A < old > new By defining longer sets of characters, you can replace more than one character at the same time. The following command looks for and makes three different replacements: "a" is replaced by "A"; "b" is replaced by "B"; and "c" is replaced by "C". tr abc ABC < old > new If the second set of characters is shorter than the first, the last character in the second set is duplicated. For example, the following two commands are equivalent:
tr abcde Ax < old > new
tr abcde Axxxx < old > new
They both replace "a" with "A", and the other four characters ("b", "c", "d", "e") with "x". When you specify characters that have special meaning to the shell, you must quote them (see Chapter 13) to tell the shell to treat the characters literally. You can use either single or double quotes, although, in most cases, single quotes work best. However, if you are quoting only a single character, it is easier to use a backslash (again, see Chapter 13). As a general rule, it is a good idea to quote all characters that are not numbers or letters. For example, let's say you want to change all the colons, semicolons and question marks to periods. You would use: tr ':;?' \. < old > new Much of the power of tr comes from its ability to work with ranges of characters. Consider, for example, the following command, which changes all uppercase letters to lowercase: tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz <old >new The correspondence between upper- and lowercase letters is clear. However, it's a bother to have to type the entire alphabet twice. Instead, you can use a hyphen (-) to define a range of characters, according to the following syntax: start-end where start is the first character in the range, and end is the last character in the range. For example, the previous example can be rewritten as follows: tr A-Z a-z < old > new A range can be any set of characters you want, as long as they form a consecutive sequence within the collating sequence you are using. (Collating sequences are discussed earlier in the chapter.) For example, the following command implements a secret code you might use to encode numeric data. The digits 0 through 8 are replaced by the first nine letters of the alphabet, A through I, respectively. (The second set is one character shorter than the first, so its last character is duplicated, which means 9 is also replaced by I.) For example, 375 is replaced by DHF. tr 0-9 A-I < old > new As a convenience, there are several abbreviations you can use instead of ranges. These abbreviations are called "predefined character classes", and we will discuss them in detail in Chapter 20 when we talk about regular expressions. For now, all you need to know is that you can use [:lower:] instead of a-z; [:upper:] instead of A-Z; and [:digit:] instead of 0-9. For example, the following two commands are equivalent:
tr A-Z a-z < old > new
tr [:upper:] [:lower:] < old > new
As are these two commands:
tr 0-9 A-I < old > new
tr [:digit:] A-I < old > new
(Note that the square brackets and colons are part of the name.) For practical purposes, these three predefined character classes are the ones you are most likely to use with the tr program. However, there are more predefined character classes available if you need them. You will find the full list in Figure 20-3 in Chapter 20. hint Compared to other filters, tr is unusual in that it does not allow you to specify the names of an input file or output file directly. To read from a file, you must redirect standard input; to write to a file, you must redirect standard output. For this reason, the most common mistake beginners make with tr is to forget the redirection. For example, the following commands will not work:
tr abc ABC old
tr abc ABC old new
Linux will display a vague message telling you that there is an "extra operand". Other types of Unix will display messages that are even less helpful. For this reason, you may, one day, find yourself spending a lot of time trying to figure out why your tr commands don't work. The solution is to never forget: When you use tr with files, you always need redirection:
tr abc ABC < old
tr abc ABC < old > new
So far, all our examples have been straightforward. Still, they were a bit contrived. After all, how many times in your life will you need to change colons, semicolons and question marks to periods? Or change the letters "abc" to "ABC"? Or use a secret code that changes numbers to letters? Traditionally, the tr program has been used for more esoteric translations, often involving non-printable characters. Here is a typical example to give you the idea. In Chapter 7, I explained that, within a text file, Unix marks the end of each line by a newline (^J) character(*1) and Windows uses a return followed by a newline (^M^J). Old versions of the Macintosh operating system, up to OS 9, used a return (^M) character(*2). *1 Footnote As we discussed in Chapter 7, when Unix people write the names of control keys, they often use ^ as an abbreviation for "Ctrl". Thus, ^J refers to <Ctrl-J>. *2 Footnote For many years, the Macintosh operating system (Mac OS) used ^M to mark the end of a line of text. As I mentioned, this was the case up to OS 9. In 2001, OS 9 was replaced by OS X, which is based on Unix. Like other Unix-based systems, OS X uses ^J to mark the end of a line of text. Suppose you have a text file that came from an old Macintosh. Within the file, the end of each line is marked by a return. Before you can use the file with Unix, you need to change all the returns to newlines. This is easy with tr. However, in order to do so, you need a way to represent both the newline and return characters. You have two choices. First, you can use special codes that are recognized by the tr program: \r for a return, and \n for newline. Alternatively, you can use a \ (backslash) character followed by the 3-digit octal value for the character. In this case, \015 for return, and \012 for newline. For reference, Figure 19-4 shows the special codes and octal values you are most likely to use with tr. The octal values are simply the base 8(*) number of the character within the ASCII code. For reference, you will find the octal values for the entire ASCII code in Appendix D. * Footnote In general, we count in base 10 (decimal), using the 10 digits 0 through 9. When we use computers, however, there are three other bases that are important:
• Base 2 (binary): uses 2 digits, 0-1
• Base 8 (octal): uses 8 digits, 0-7
• Base 16 (hexadecimal): uses 16 digits, 0-9 and A-F
We will talk about these number systems in Chapter 21. Figure 19-4: Codes used by the tr program to represent control characters
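The figure itself is not reproduced here, but the codes it covers are standard: they come from the ASCII code and from the escape sequences recognized by tr, not from anything specific to this book. As a quick reference, the codes and octal values you are most likely to need are:
\b backspace (octal \010)
\t tab (octal \011)
\n newline/linefeed (octal \012)
\r return (octal \015)
\\ backslash (octal \134)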
Let us consider, then, how to use tr to change returns to newlines. Let's say we have a text file named macfile in which each line ends with a return. We want to change all the returns to newlines and save the output in a file named unixfile. Either of the following commands will do the job:
tr '\r' '\n' < macfile > unixfile
tr '\015' '\012' < macfile > unixfile
As you can see, using these codes is simple once you understand how they work. For example, the following two commands change all the tabs in the file olddata to spaces, saving the output in newdata(*):
tr '\t' ' ' < olddata > newdata
tr '\011' ' ' < olddata > newdata
* Footnote When you change tabs to spaces, or spaces to tabs, it is often better to use expand and unexpand (Chapter 18). These two programs were designed specifically to make such changes and, hence, offer more flexibility.
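As a quick sketch of what that looks like (reusing the hypothetical file names olddata and newdata), expand changes tabs to spaces, and unexpand changes spaces back to tabs. Like most filters, both programs read a named file or standard input and write to standard output:
expand olddata > newdata
unexpand -a newdata > olddata
The -a option tells unexpand to convert all runs of spaces, not just the ones at the beginning of each line. Unlike tr, these programs accept file names directly, so no input redirection is needed.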
So far, we have discussed how to use tr for straightforward substitutions, where one character is replaced by another character. We will now turn our attention to more advanced topics. Before we do, here is a reminder of syntax we will be using: tr [-cds] [set1 [set2]] where set1 and set2 are sets of characters. The -s option tells tr that multiple consecutive characters from the first set should be replaced by a single character. As I mentioned earlier, when we do this, we say that we squeeze the characters. Here is an example. The following two commands replace any digit (0-9) with the uppercase letter "X". The input is read from a file named olddata, and the output is written to a file named newdata:
tr 0-9 X < olddata > newdata
tr [:digit:] X < olddata > newdata
Now these commands replace each occurrence of a digit with an "X". For example, the 6-digit number 120357 would be changed to XXXXXX. Let's say, however, you want to change all multi-digit numbers, no matter how long they are, into a single "X". You would use the -s option:
tr -s [:digit:] X < olddata > newdata
This tells tr to squeeze all multi-digit numbers into a single character. For example, the number 120357 is now changed to X. Here is a useful example in which we squeeze multiple characters, without actually changing the character. You want to replace consecutive spaces with a single space. The solution is to replace a space with a space, while squeezing out the extras: tr -s ' ' ' ' < olddata > newdata The next option, -d, deletes the characters you specify. As such, when you use -d, you define only one set of characters. For example, to delete all the left and right parentheses, use: tr -d '()' < olddata > newdata To delete all numbers, use either of the commands:
tr -d 0-9 < olddata > newdata
tr -d [:digit:] < olddata > newdata
The final option, -c, is the most complex and the most powerful. This option tells tr to match all the characters that are not in the first set(*). For example, the following command replaces all characters except a space or a newline with an "X": tr -c ' \n' X < olddata > newdata * Footnote The name -c stands for "complement", a mathematical term. In set theory, the complement of a set refers to all the elements that are not part of the set. For example, with respect to the integers, the complement of the set of all even numbers is the set of all odd numbers. With respect to all the uppercase letters, the complement of the set {ABCDWXYZ} is the set {EFGHIJKLMNOPQRSTUV}. The effect of this command is to preserve the "image" of the text, without the meaning. For instance, let's say the file olddata contains:
Do you really think you were designed to spend most of
The previous command will generate:
XX XXX XXXXXX XXXXX XXX XXXX XXXXXXXX XX XXXXX XXXX XX
To finish the discussion of tr, here is an interesting example in which we combine the -c (complement) and -s (squeeze) options to count unique words. Let's say you have written two history essays, stored in text files named greek and roman. You want to count the unique words found in both files. The strategy is as follows:
• Use cat to combine the files
• Use tr to place each word on a separate line
• Use sort to sort the words, eliminating duplicates and ignoring differences in case
• Use wc -l to count the lines
To place each word on a separate line (step 2), all we need to do is use tr to replace every character that is not part of a word with a newline. For example, let's say we have the words: As you can see This would change to:
As
you
can
see
To keep things simple, we will say words are constructed from a set of 53 different characters: 26 uppercase letters, 26 lowercase letters, and the apostrophe (that is, the single quote). The following three commands — choose the one you like — will do the job:
tr -cs A-Za-z\' "\n"
tr -cs [:upper:][:lower:]\' "\n"
tr -cs [:alpha:]\' "\n"
The -c option changes the characters that are not in the first set; and the -s option squeezes out repeated characters. The net effect is to replace all characters that are not a letter or an apostrophe with a newline. Once you have isolated the words, one per line, it is a simple matter to sort them. Just use the sort program with -u (unique) to eliminate duplicate lines, and -f (fold) to ignore differences between upper and lower case. You can then use wc -l to count the number of lines. Here, then, is the complete pipeline:
cat greek roman | tr -cs [:alpha:]\' "\n" | sort -fu | wc -l
More generally:
cat file1... | tr -cs [:alpha:]\' "\n" | sort -fu | wc -l
In this way, a single Unix pipeline can count how many unique words are contained in a set of input files. If you want to save the list of words, all you need to do is redirect the output of the sort program: cat file1... | tr -cs [:alpha:]\' "\n" | sort -fu > file
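To make the pipeline concrete, here is a tiny, hypothetical illustration. Suppose each of the two files contains a single line: greek contains "Athens fought Sparta." and roman contains "Rome fought Carthage." The tr command breaks the text into one word per line (Athens, fought, Sparta, Rome, fought, Carthage); sort -fu reduces the list to the five unique words; and wc -l reports the count:
cat greek roman | tr -cs [:alpha:]\' "\n" | sort -fu | wc -l
5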
A text editor is a program that enables you to perform operations on lines of text. Typically, you can insert, delete, make changes, search, and so on. The two most important Unix text editors are vi (which we will discuss in Chapter 22), and Emacs. There are also several other, less important, but simpler text editors which we discussed in Chapter 14: kedit, gedit, Pico and Nano. The characteristic all these programs have in common is that they are interactive. That is, you work with them by opening a file and then entering commands, one after another until you are done. In this section, I am going to introduce you to a text editor, called sed, which is non-interactive. With a non-interactive text editor, you compose your commands ahead of time. You then send the commands to the program, which carries them out automatically. Using a non-interactive text editor allows you to automate a large variety of tasks which, otherwise, you would have to carry out by hand. You can use sed in two ways. First, you can have sed read its input from a file. This allows you to make changes in an existing file automatically. For example, you might want to read a file and change all the occurrences of "harley" to "Harley". Second, you can use sed as a filter in a pipeline. This allows you to edit the output of a program. It also allows you to pipe the output of sed to yet another program for further processing. Before we get started, here is a bit of terminology. When you think of data being sent from one program to another in a pipeline, it conjures up the metaphor of water running along a path. For this reason, when data flows from one program to another, we call it a STREAM. More specifically, when data is read by a program, we call the data an INPUT STREAM. When data is written by a program, we call the data an OUTPUT STREAM. Of all the filters we have discussed, sed is, by far, the most powerful. This is because sed is more than a single-purpose program. It is actually an interpreter for a portable, shell-independent language designed to perform text transformations on a stream of data. Hence the name: sed is an abbreviation of "stream editor". A full discussion of everything that sed can do is beyond the scope of this book. However, the most useful operation you can perform with sed is to make simple substitutions, so that is what I will teach you. Still, I am leaving out a lot, so when you get a spare moment, look on the Web for a sed tutorial to learn more. If you need a reference, check the man page on your system (man sed). The syntax to use sed in this way is: sed [-i] command | -e command... [file...] where command is a sed command, and file is the name of an input file. To show you what it looks like to use sed, here is a typical example in which we change every occurrence of "harley" to "Harley". The input comes from a text file named names; the output is written to a file named newnames: sed 's/harley/Harley/g' names > newnames I'll explain the details of the actual command in a moment. First, though, we need to talk about input and output files. The sed program reads one line at a time from the data stream, processing all the data from beginning to end, according to a 3-step procedure: (1) read the next line from the input stream; (2) perform the specified editing commands on that line; (3) write the result to the output stream.
By default, sed writes its output to standard output, which means sed does not change the input file. In some cases this is fine, because you don't want to change the original file; you want to redirect standard output to another file. You can see this in the example above. The input comes from names, and the output goes to newnames. The file names is left untouched. Most of the time, however, you do want to change the original file. To do so, you must use the -i (in-place) option. This causes sed to save its output to a temporary file. Once all the data is processed successfully, sed copies the temporary file to the original file. The net effect is to change the original file, but only if sed finishes without an error. Here is a typical sed command using -i: sed -i 's/harley/Harley/g' names In this case, sed modifies the input file names by changing all occurrences of "harley" to "Harley".(*) * Footnote The -i option is available only with the GNU version of sed. If your system does not use the GNU utilities — for example, if you use Solaris — you cannot use -i. Instead, to use sed to change a file, you must save the output to a temporary file. You then use the cp (copy) program to copy the temporary file to the original file, and the rm (remove) program to delete the temporary file. For example:
sed 's/harley/Harley/g' names > temp
cp temp names
rm temp
In other words, you must do by hand what the -i option does for you automatically. When you use sed -i, you must be careful. The changes you make to the input file are permanent, and there is no "undo" command. hint Before you use sed to change a file, it is a good idea to preview the changes by running the program without the -i option. For example: sed 's/xx/XXX/g' file | less This allows you to look at the output, and see if it is what you expected. If so, you can rerun the command with -i to make the changes(*): sed -i 's/xx/XXX/g' file * Footnote There is a general Unix principle that says, before you make important permanent changes, preview them if possible. We used a similar strategy in Chapter 13 with the history list and with aliases. Both times, we discussed how to avoid deleting the wrong files accidentally by previewing the results before performing the actual deletion. This principle is so important, I want you to remember it forever or until you die (whichever comes first).
Related filters: tr The power of sed comes from the operations you can have it perform. The most important operation is substitution, for which you use the s command. The syntax is: [address|/pattern/]s/search/replacement/[g] where address is the address of one or more lines within the input stream; pattern is a character string; search is a regular expression; and replacement is the replacement text. In its simplest form, you use the substitute command by specifying a search string and a replacement string. For example: s/harley/Harley/ This command tells sed to search each line of the input stream for the character string "harley". If the string is found, change it to "Harley". By default, sed changes only the first occurrence of the search string on each line. For example, let's say the following line is part of the input stream: I like harley. harley is smart. harley is great. The above command will change this line to: I like Harley. harley is smart. harley is great. If you want to change all occurrences of the search string, type the suffix g (for global) at the end of the command: s/harley/Harley/g In our example, adding the g causes the original line to be changed to: I like Harley. Harley is smart. Harley is great. In my experience, when you use sed to make a substitution, you usually want to use g to change all the occurrences of the search string, not just the first one in each line. This is why I have included the g suffix in all our examples. So far, we have searched only for simple character strings. However, you can make your search a lot more powerful by using what is called a "regular expression" (often abbreviated as "regex"). Using a regular expression allows you to specify a pattern, which gives you more flexibility. However, regexes can be complex, and it will take you a while to learn how to use them well. I won't go into the details of using regular expressions now. In fact, they are so complicated — and so powerful — that I have devoted an entire chapter to them, Chapter 20. Once you have read that chapter, I want you to come back to this section and spend some time experimenting with regular expressions and sed. (Be sure to use the handy reference tables in Figures 20-1, 20-2 and 20-3.) For now, I'll show you just two examples that use regular expressions with sed. To start, let's say you have a file named calendar that contains information about your plans for the next several months. You want to change all occurrences of the string "mon" or "Mon" to the word "Monday". Here is a command that makes the change by using a regular expression: sed -i 's/[mM]on/Monday/g' calendar To understand this command, all you need to know is that, within a regular expression, the notation [...] matches any single element within the brackets; in this case, either an "m" or an "M". Thus, the search string is either "mon" or "Mon". This second example is a bit trickier. Earlier in the chapter, when we discussed the tr program, we talked about how Unix, Windows, and the Macintosh all use different characters to mark the end of a line of text. Unix uses a newline (^J); Windows uses a return followed by a newline (^M^J); and the Macintosh uses a return (^M). (These characters are discussed in detail in Chapter 7.) During the discussion, I showed you how to convert a text file in Macintosh format to Unix format.
You do so by using tr to change all the returns to newlines: tr '\r' '\n' < macfile > unixfile But what do you do if you have a text file in Windows format and you want to use the file with Unix? In other words, how do you change the "return newline" at the end of each line of text to a simple newline? You can't use tr, because you need to change two characters (^M^J) into one (^J); tr can only change one character into another character. We can, however, use sed, because sed can change anything into anything. To create the command, we use the fact that the return character (^M) will be at the end of the line, just before the newline (^J). All we need to do is find and delete the ^M. Here are two commands that will do the conversion. The first command reads its input from a file named winfile, and writes the output to a file named unixfile. The second command uses -i to change the original file itself:
sed 's/.$//' winfile > unixfile
sed -i 's/.$//' winfile
So how does this work? Within a regular expression, a . (dot) character matches any single character; the $ (dollar sign) character matches the end of a line. Thus, the search string .$ refers to the character just before the newline. Look carefully at the replacement string. Notice that it is empty. This tells sed to change the search string to nothing. That is, we are telling sed to delete the search string. This has the effect of removing the spurious return character from each line in the file. If you have never used regular expressions before, I don't expect you to feel completely comfortable with the last several commands. However, I promise you, by the time you finish Chapter 20, these examples, and others like them, will be very easy to understand. hint To delete a character string with sed, you search for the string and replace it with nothing. This is an important technique to remember, as you can use it with any program that allows search and replace operations. (In fact, you will often use this technique within a text editor.)
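As a quick, hypothetical example of the technique (the file name report and the string "DRAFT " are made up), the following commands delete every occurrence of "DRAFT ", including the trailing space, from a file. As suggested above, preview the output first, then make the change with -i:
sed 's/DRAFT //g' report | less
sed -i 's/DRAFT //g' report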
By default, sed performs its operations on every line in the data stream. To change this, you can preface your command with an "address". This tells sed to operate only on the lines with that address. An address has the following syntax: number[,number] | /regex/ where number is a line number, and regex is a regular expression. In its simplest form, an address is a single line number. For example, the following command changes only the 5th line of the data stream: sed '5s/harley/Harley/g' names To specify a range of lines, separate the two line numbers with a comma. For example, the following command changes lines 5 through 10: sed '5,10s/harley/Harley/g' names As a convenience, you can designate the last line of the data stream by the $ (dollar sign) character. For example, to change only the last line of the data stream, you would use: sed '$s/harley/Harley/g' names To change lines 5 through the last line, you would use: sed '5,$s/harley/Harley/g' names As an alternative to specifying line numbers, you can use a regular expression or a character string(*) enclosed in / (slash) characters. This tells sed to process only those lines that contain the specified pattern. For example, to make a change to only those lines that contain the string "OK", you would use: sed '/OK/s/harley/Harley/g' names * Footnote As we will discuss in Chapter 20, character strings are considered to be regular expressions. Here is a more complex example. The following command changes only those lines that contain 2 digits in a row: sed '/[0-9][0-9]/s/harley/Harley/g' names (The notation [0-9] refers to a single digit from 0 to 9. See Chapter 20 for the details.)
As I mentioned earlier, sed is actually an interpreter for a text-manipulation programming language. As such, you can write programs — consisting of as many sed commands as you want — which you can store in files and run whenever you want. To do so, you identify the program file by using the -f command. For example, to run the sed program stored in a file named instructions, using data from a file named input, you would use: sed -f instructions input The use of sed to write programs, alas, is beyond the scope of this book. In this chapter, we are concentrating on how to use sed as a filter. Nevertheless, there will be times when you will want sed to perform several operations; in effect, to run a tiny program. When this need arises, you can specify as many sed commands as you want, as long as you precede each one by the -e (editing command) option. Here is an example. You have a file named calendar in which you keep your schedule. Within the file, you have various abbreviations you would like to expand. In particular, you want to change "mon" to "Monday". The command to use is: sed -i 's/mon/Monday/g' calendar However, you also want to change "tue" to "Tuesday". This requires two separate sed commands, both of which must be preceded by the -e option: sed -i -e 's/mon/Monday/g' -e 's/tue/Tuesday/g' calendar By now, you can see the pattern. You are going to need seven separate sed commands, one for each day of the week. This, however, will require a very long command line. As we discussed in Chapter 13, the best way to enter a very long command is to break it onto multiple lines. All you have to do is type a \ (backslash) before you press the <Return> key. The backslash quotes the newline, which allows you to break the command onto more than one line. As an example, here is a long sed command that changes the abbreviations for all seven days of the week. Notice that all the lines, except the last one, are continued. What you see here is, in reality, one very long command line:
sed -i \
-e 's/mon/Monday/g' \
-e 's/tue/Tuesday/g' \
-e 's/wed/Wednesday/g' \
-e 's/thu/Thursday/g' \
-e 's/fri/Friday/g' \
-e 's/sat/Saturday/g' \
-e 's/sun/Sunday/g' \
calendar
hint When you type \<Return> to continue a line, most shells display a special prompt, called the SECONDARY PROMPT, to indicate that a command is being continued. Within the Bourne Shell family (Bash, Korn Shell), the default secondary prompt is a > (greater-than) character. You can change the secondary prompt by modifying the PS2 shell variable (although most people don't). Within the C-Shell family, only the Tcsh has a secondary prompt. By default, it is a ? (question mark). You can change the secondary prompt by modifying the prompt2 shell variable. (The commands to modify shell variables are explained in Chapter 12. Putting such commands in one of your initialization files is discussed in Chapter 14.)
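To illustrate (assuming Bash with its default prompts; your own prompts may look different), here is roughly what you would see if you typed a short, two-line version of the command above, with the shell supplying the > secondary prompt at the start of the continued line:
$ sed -i -e 's/mon/Monday/g' \
> -e 's/tue/Tuesday/g' calendar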
Review Question #1: Of all the filters, grep is the most important. What does grep do? Why is it especially useful in a pipeline? Explain the meaning of the following options: -c, -i, -l, -L, -n, -r, -s, -v, -w and -x.
Review Question #2: What two tasks can the sort program perform? Explain the meaning of the following options: -d, -f, -n, -o, -r and -u. Why is the -o option necessary?
Review Question #3: What is a collating sequence? What is a locale? What is the connection between the two?
Review Question #4: What four tasks can the uniq program perform?
Review Question #5: What three tasks can the tr program perform? When using tr, what special codes do you use to represent: backspace, tab, newline/linefeed, return and backslash?
Applying Your Knowledge #1: As we will discuss in Chapter 23, the /etc directory is used to hold configuration files (explained in Chapter 6). Create a command that looks through all the files in the /etc directory, searching for lines that contain the word "root". The output should be displayed one screenful at a time. Hint: To specify the file names, use the pattern /etc/*. Searching through the files in the /etc directory will generate a few spurious error messages. Create a second version of the command that suppresses all such messages.
Applying Your Knowledge #2: Someone bets you that, without using a dictionary, you can't find more than 5 English words that begin with the letters "book". You are, however, allowed a single Unix command. What command should you use?
Applying Your Knowledge #3: You are running an online dating service for your school. You have three files containing user registrations: reg1, reg2 and reg3. Within these files, each line contains information about a single person (no pun intended). Create a pipeline that processes all three files, selecting only those lines that contain the word "female" or "male" (your choice). After eliminating all duplications, the results should be saved in a file named prospects. Once this is done, create a second pipeline that displays a list of all the people (male or female) who have registered more than once. Hint: Look for duplicate lines within the files.
Applying Your Knowledge #4: You have a text file named data. Create a pipeline that displays all instances of double words, for example, "hello hello". (Assume that a "word" consists of consecutive upper- or lowercase letters.) Hint: First create a list of all the words, one per line. Then pipe the output to a program that searches for consecutive identical lines.
For Further Thought #1: In an earlier question, I observed that grep is the most important filter, and I asked you to explain why it is especially useful in a pipeline. Considering your answer to that question, what is it about the nature of human beings that makes grep seem so powerful and useful?
For Further Thought #2: Originally, Unix was based on American English and American data processing standards (such as the ASCII code). With the development of internationalization tools and standards (such as locales), Unix can now be used by people from a variety of cultures. Such users are able to interact with Unix in their own languages using their own data processing conventions. What are some of the tradeoffs in expanding Unix in this way? List three advantages and three disadvantages.
For Further Thought #3: In this chapter, we talked about the tr and sed programs in detail. As you can see, both of these programs can be very useful.
However, they are complex tools that require a lot of time and effort to master. For some people, this is not a problem. For many other people, however, taking the time to learn how to use a complex tool well is an uncomfortable experience. Why do you think this is so? Should all tools be designed to be easy to learn? For Further Thought #4: Comment on the following statement: There is no program in the entire Unix toolbox that can't be mastered in less time than it takes to learn how to play the piano well.