Harley Hahn's Guide to
Chapter 18... Filters: Counting and Formatting
This is the third of four chapters (16-19) in which we discuss filters: programs that read and write textual data, one line at a time, reading from standard input and writing to standard output. In this chapter, we will talk about how to manipulate text. In particular, I'll show you how to work with line numbers; how to count lines, words and characters; and how to format text in a variety of useful ways. Along the way, we'll digress a bit so I can cover several very interesting topics, including how Unix handles tabs and spaces, and why you will often see 80-character lines in text files: a story that is a lot more interesting than you might think.
Related filters: wc The nl filter provides a simple but useful service: it inserts line numbers into text. The syntax is: nl [-v start] [-i increment] [-b a] [-n ln|rn|rz] [file...] where start is the starting number, increment is the increment, and file is the name of a file. The nl program comes in handy in two situations. First, when you want to insert permanent line numbers into some data, which you will then save. Second, when you want to insert temporary line numbers into the output of a command to make the output easier to understand. Let's start with a simple, but useful example. You are going on a blind date, and you want to be sure to make a good impression. To prepare for the date, you create a file named books containing a list of your favorite books:
Crime and Punishment
To number the list, you can use the command: nl books The output is:
1 Crime and Punishment
This looks good, so you save the numbered list by redirecting the standard output to a file named best-books: nl books > best-books You now have a list of favorite books, complete with line numbers, with which to impress your date. The nl program is an old one, dating back to the early days of Unix. Traditionally, nl is used to insert line numbers into text before printing. For example, let's say you have two files of scientific data: measurements1 and measurements2. You want to print all the data and, to help you interpret it, you want to number each line on the printout. However, you don't want to change the original data. The strategy is to use nl to number the lines and then redirect the output to the lpr program, which sends the data to your default printer. (The two principal Unix programs to print files are lp and lpr.) nl measurements1 measurements2 | lpr In this way, you create temporary numbers that are used once and then thrown away. In the early days of Unix, terminals printed their output on paper, which was slow, and it was common to send data to a real printer, which could print a lot faster than a terminal. Today, it usually makes more sense to display data on your screen. In this case, all you need to do is pipe the output of nl to less (Chapter 21), which will display the output one screenful at a time: nl measurements1 measurements2 | less Again, the original data is not changed. When you use nl, line numbers are always temporary, unless you save the output to a file. By default, nl generates the numbers 1, 2, 3, and so on, which is fine. If the need arises, however, there are a few options you can use to control the numbering. You can change the starting number by using the -v option, and change the increment by using the -i option. To show you how it works, here are some examples using a file called data that contains several lines of text. The first example starts numbering at 100: nl -v 100 data The output is:
100 First line of text.
The next example starts numbering at 1 (the default) with an increment of 5: nl -i 5 data The output is:
1 First line of text.
The third example uses both options to start numbering at 100 with an increment of 5: nl -i 5 -v 100 data The output is:
100 First line of text.
In addition to -v and -i, the nl program has a variety of formatting options. However, there are only two you are likely to need. First, by default, if your data has blank lines, nl will not number them. To force nl to number all lines, use the -b (body numbering) option followed by the letter a (all lines): nl -b a file The -b option has other variations, but they are rarely used. Second, you can control the format of the numbers by using the -n (number format) option followed by a code:
ln = left-justified, no leading zeros
Here is an example: nl -v 100 -i 5 -b a -n rz file This command generates numbers starting with 100, using an increment of 5. All lines are numbered, even blank lines, and the numbers are right-justified with leading zeros.
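If you want to try these options without touching real data, here is a minimal sketch. The sample text and the temporary file are made up for illustration, and the sketch assumes the usual nl defaults (a 6-character number field followed by a tab).

```shell
# Create a throwaway three-line file; the second line is blank.
tmp=$(mktemp)
printf 'First line of text.\n\nThird line of text.\n' > "$tmp"

# Start at 100, step by 5, number every line (including the blank one),
# and right-justify the numbers with leading zeros.
numbered=$(nl -v 100 -i 5 -b a -n rz "$tmp")
printf '%s\n' "$numbered"

rm -f "$tmp"
```

Under those assumptions, the three lines should be numbered 000100, 000105 and 000110. Leave out -b a, and the blank line should appear without a number.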
Related filters: nl
The wc (word count) program counts lines, words and characters. The data may come from another program or from one or more files. The syntax is simple:
wc [-clLw] [file...]
where file is the name of a file.
The wc program is very useful; in fact, it is more useful than you might realize at first. This is because you can use wc within a pipeline to analyze textual output from any program you want. For example, let's say you want to know how many files are in a particular directory. You can count all the files by hand, or you can generate a list and then pipe it to wc to count the lines for you. (I'll show you an example in a moment.)
Let's start with the basics. By default, the output of wc consists of three numbers: the number of lines, words and characters in the data. For example, in a moment I'll show you a file that wc will find contains exactly 2 lines, 13 words and 71 characters. When input comes from a file, wc will write the file name after the three numbers. If you specify more than one file, wc will display one line of output for each file, and an extra line showing the total count (lines, words and characters) for all the files put together.
Here is the example. You are writing a romantic poem for your sweetheart for Valentine's Day. This is all you have written so far:
There was a young man from Nantucket,
To count the lines, words and characters in the poem, use: wc poem The output is: 2 13 71 poem In other words, the file has 2 lines, 13 words and 71 characters. If you forget which number is which, just remember: Lines, Words, Characters. (If you are a man, you can remember the acronym LWC, "Look at Women Carefully".) Here are the technical details: • A "character" is a letter, number, punctuation symbol, space, tab or newline. • A "word" is an unbroken sequence of characters, delimited by spaces, tabs or newlines. • A "line" is a sequence of characters ending with a newline. (The newline character is discussed in Chapter 7.) As I mentioned, if you specify more than one file at a time, wc will also show you total statistics. For example: wc poem message story Here is some typical output(*):
2 13 71 poem
* Footnote These files contain real-life data. The poem is the sample poem we used above; the message is an email message from my editor, dated February 16, 2006, asking when the book would be done; and the story is "Late One Night", written by me, which you can find on my Web site www.harley.com. By convention, output is always displayed in the following order: number of lines, number of words, number of characters. If you do not want all three numbers, you can use the options: -l counts lines, -w counts words, and -c counts characters. When you use options, wc displays only the numbers you ask for. For example, to see only the number of lines in the file story, use: wc -l story The output is: 43 story To see how many words and characters are in the file message, use: wc -wc message The output is: 61 447 message The -c (character), -l (line), and -w (word) options have been part of the wc command for decades and can be used with any type of Unix or Linux. With Linux, there is also another option, -L. This option displays the length of the longest line in the input. For example, let's say you are planning a big party, and you need a list to give to the bouncers of all the people who are not allowed entry. You have created the file do-not-admit, with the following four lines:
Britney
To display the length of the longest line in this file, you would use:
wc -L do-not-admit
In this case, the first and fourth lines have 7 characters, so the output is:
7 do-not-admit
The -L option comes in handy when you need to decide if a file needs some type of formatting. For example, if a file has any lines longer than, say, 70 characters, you might use fmt to format the text before sending it to pr to prepare for printing (explained later in the chapter).
As you gain experience, you will find that there are two main ways to use wc. First, there are times when you need a quick measure of the size of a file. For example, let's say you email a text file to someone. The file is important, and you want to double-check that it arrived intact. Run the wc command on the original file. Then tell the recipient to run wc on the other file. If the two sets of results match, you can feel confident the file arrived intact. Similarly, suppose you are writing an essay that must be at least 2,000 words. From time to time, you can use wc -w to see how close you are getting to your target.
The second use for wc is different, but just as important: you can pipe the output of a command to wc and check how many lines of text were generated. Because many programs generate one item of information per line, you can, by counting the lines, know how much information was produced. Here are two examples.
First, the ls program (Chapter 24) lists the names of files in a directory. For example, the following command displays the names of all the files in the /etc directory. (We will discuss directories in Chapter 24.)
ls /etc
The ls program has many options. However, there is no option for counting the number of files. To do so, you pipe the output of ls to wc. Thus, to count the number of files in the /etc directory, you use:
ls /etc | wc -l
(Try it on your system.)
Here is the second example. In Chapter 8, I showed you how to use who to find out which userids are logged in to your system.
To display the number of userids(*) logged in, all you have to do is count the lines of output of the who command: who | wc -l * Footnote A userid ("user-eye-dee") is not a person. It is a name used to log in to a Unix system. As we discussed in Chapter 4, Unix knows only about userids, not users. If you want to get fancy, you can combine this last pipeline with the echo program (Chapter 12) and command substitution (Chapter 13) to display a message showing the number of userids currently logged in: echo "There are `who | wc -l` userids logged in right now." For example, if 5 people are logged in, you will see: There are 5 userids logged in right now. If you share a multiuser system, this is an interesting command to put in your login file (see Chapter 14).
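To see the two main uses of wc side by side, here is a short sketch. The sample text is made up, and the variable names are only for illustration. The sketch uses $(...) for command substitution, which does the same job as the backquotes shown above, and it reads the file from standard input so the file name does not appear in the output.

```shell
# A made-up two-line sample, stored in a temporary file.
tmp=$(mktemp)
printf 'one two three\nfour five\n' > "$tmp"

# Reading from standard input keeps the file name out of the output;
# tr removes the padding that some versions of wc put before the number.
lines=$(wc -l < "$tmp" | tr -d ' ')
words=$(wc -w < "$tmp" | tr -d ' ')
chars=$(wc -c < "$tmp" | tr -d ' ')
echo "The sample has $lines lines, $words words and $chars characters."

# Counting the lines produced by another command.
entries=$(ls / | wc -l | tr -d ' ')
echo "The root directory holds $entries entries."

rm -f "$tmp"
```

The first message should report 2 lines, 5 words and 24 characters (the count includes the two newlines). The second number will depend on your system.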
When you look at your keyboard, you see a <Tab> key. This key is a holdover from the days when tabs were used on typewriters. Although we don't use typewriter tabs anymore, we still use <Tab> keys, and Unix still uses tab settings. To understand why, let's take a moment and travel back in time to the days when typewriters were the dominant form of life in the office machine community.
The word "tab" is an abbreviation for "tabulate", which means to organize information into a table. The <Tab> keys on the old typewriters were designed to help line up information in columns and to indent text at the beginning of paragraphs. Here is an example showing how it used to work.
You are using an old typewriter, and you want to type a table with three columns. The columns should line up at positions 1, 15 and 25. To prepare, you set two small mechanical markers, called TAB STOPS, at positions 15 and 25. Once this is done, pressing the <Tab> key will cause the carriage to move horizontally to the next tab stop. For example, if you are at position 8 and you press <Tab>, the carriage will move to position 15. If you are at position 19, and you press <Tab>, the carriage will move to position 25. Thus, setting the tab stops in this way gives you an easy way to jump directly to positions 15 and 25 without having to press the <Space> bar repeatedly (and without having to back up if you go too far).
You are now ready to type your table. To start, you put in a piece of paper and position the carriage at the beginning of the line. You type the information for the first column and then press the <Tab> key. This causes the carriage to move to position 15. You type the information for the second column and press <Tab> again. The carriage now moves to position 25. You then type the information for the third column. You are now finished typing the first row of your table.
You press the carriage return lever all the way to the left, which moves the carriage to the beginning of the next line, leaving you ready to type the next row.
Although the original Unix terminals (see Chapter 7) were not typewriters, they did print on paper and they were able to jump horizontally when they encountered a tab character. For this reason, Unix was designed so that whenever a terminal encountered a tab character, it would act like a typewriter by moving the cursor to the next tab stop on the current line. To this day, that is still the case: Unix terminals "display" a tab character by moving the cursor forward to the next tab stop.
By default, Unix assumes that there are tab stops every 8 characters, starting with position 1. Thus, the default Unix tab positions are 1, 9, 17, 25, 33, and so on. When you are typing text and you press the <Tab> key, Unix inserts an invisible tab character. Later, when you look at the text, your terminal will "display" the tab character by creating enough horizontal space to jump to the next tab stop, just like the <Tab> key on an old typewriter.
Consider this example. You have a one-line file containing the letter "A", a tab character, the letters "BBBB", another tab character, and the letters "CCC":
A<Tab>BBBB<Tab>CCC
If you use the cat command to display this file, you will see:
A       BBBB    CCC
The A is at position 1. The BBBB starts at position 9, and the CCC starts at position 17.
Of course, you can't see the tabs: they look like empty space. Thus, as far as your eye is concerned, the gaps between the letters might as well be space characters. For example, when you look at the output of the cat command above, you can't tell if the empty space between A and BBBB is (in this case) 1 tab or 7 spaces. So, the question arises: when you want to indent text or align data into columns, which is better to use, tabs or spaces?
This question has been debated for a long time by programmers, because they use empty space to indent control-flow constructs (if-then-else, while loops, and so on).
Some programmers prefer to use tabs for indentation, because they are simpler. For example, each time you press <Tab>, it inserts a single tab character which automatically indents the text to the position of the next tab stop. If you use spaces, you need to press the <Space> bar multiple times, in order to line up the text by hand. In addition, tabs are more flexible than spaces. For example, if you want to change the amount of indentation you see on your screen, you need only change the tab stop settings within your text editor program. If you use spaces, you have to go to each line in the program and add or delete actual space characters.
Other programmers argue for using spaces for indentation. Tabs, they say, are clumsy to use because the amount of space they generate varies. A single tab character, they point out, can represent 1 to 8 positions depending on its location within the line. When you use spaces, what you use is what you get: type 4 spaces, and you get 4 spaces. Moreover, although it is true you can adjust the tab stop settings within most text editors, much of the time you will be stuck with the default: positions 1, 9, 17, 25, and so on. This is too much spacing, as it creates large indentations, making the text hard to read. By using real spaces, you can indent 2 or 3 or 4 positions (whatever you want) and it will work exactly the way you want, no matter what text editor or other programs you use.
Of course, the need to create horizontal spacing applies to more than computer programs. Whenever you work with any type of text that requires indentation or that is organized into columns, you must choose whether to use tabs or spaces. I can't tell you what to use, because everyone has their own preference and, over time, you will figure out which one works better for you.
(Personally, I prefer spaces to tabs.)(*) What I can tell you is that — whichever choice you end up making — there are two Unix programs to make your life easier (expand and unexpand), which we will cover in a moment. First, however, we need to discuss a more fundamental question. * Footnote When I write programs, I indent 4 positions. When I write HTML (Web pages), I indent 2 positions.
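The claim that a single tab can stand for anywhere from 1 to 8 positions follows from simple arithmetic on the default tab stops (1, 9, 17, and so on). Here is a small sketch of that arithmetic. The function name next_stop is made up for this illustration; it is not a Unix command.

```shell
# next_stop POSITION
# Print the position the cursor jumps to when a tab is encountered at
# POSITION, assuming tab stops every 8 positions: 1, 9, 17, 25, ...
next_stop() {
    echo $(( ( ($1 - 1) / 8 + 1 ) * 8 + 1 ))
}

next_stop 1     # a tab at position 1 jumps 8 positions, to 9
next_stop 8     # a tab at position 8 jumps only 1 position, also to 9
next_stop 9     # a tab at position 9 jumps a full 8 positions, to 17

# In the line A<Tab>BBBB<Tab>CCC, the second tab is met at position 13.
stop_after_bbbb=$(next_stop 13)
echo "$stop_after_bbbb"
```

The last command should print 17, which matches the position where CCC starts in the earlier example.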
When you work with a file that contains tabs and spaces, a problem arises. Since tabs and spaces are invisible, how can you tell where they are? This can be important when you are working with programs like expand and unexpand (which we will discuss in the next two sections). The expand program changes tabs into spaces; unexpand changes spaces into tabs. If you can't see the tabs and spaces, how do you know the commands did what you wanted them to do?
The simplest solution is to view the file from within a text editor or word processor that lets you turn on an option to view invisible characters. There are two choices. With the vi editor (Chapter 22), the command to use is:
:set list
Spaces will still be invisible, but tabs will show up as ^I, the control character that represents a tab in the ASCII code. To turn off the option, use:
:set nolist
If you know how to use vi, this is an excellent solution to the problem: quick and easy. In fact, this is how I look at whitespace in files.
If you don't know vi, you can use the Nano or Pico editors, which I mentioned in Chapter 14. (They are basically the same editor; Nano is the GNU version of Pico.) Within Nano/Pico, you view spaces and tabs by pressing <Esc>P (that is, press the <Esc> key, then press the <P> key). This turns on "Whitespace display mode". To turn it off, just press <Esc>P a second time. Before you can use <Esc>P, however, you must add the following line to your Nano/Pico initialization file, either .nanorc or .picorc respectively (see Chapter 14):
set whitespace "xy"
where x is the character you want to indicate a tab, and y is the character you want to indicate a space. For example, if you want a tab to show up as a + (plus) character, and a space to show up as a | (vertical bar) character, put the following line in your Nano/Pico initialization file:
set whitespace "+|"
Using vi or Nano/Pico to view spaces and tabs works well.
Unfortunately, vi is complicated (as you will see in Chapter 22), and it will take you a long time to learn how to use it well. Nano/Pico is a lot simpler but, like vi, it does take time to learn.
So what about the very simple GUI-based editors, Kedit and Gedit, we discussed in Chapter 14? As I mentioned, these editors are very easy to use. However, they are not very powerful. In particular, they do not allow you to view invisible characters, so let's forget about them for now.
If your system has Open Office, a collection of open source office applications, there is another solution. The Open Office word processor makes it easy to view tabs and spaces within a file. Just pull down the View menu and select "Nonprinting Characters". You turn off the feature the same way. Simple and easy.
Aside from viewing a file in a text editor or word processor, there is a way to check the effects of expand or unexpand indirectly. You can use the wc -c command (discussed earlier in the chapter) to display the number of characters in the file. Since each tab is a single character, when you use expand to change tabs to spaces within a file, the number of characters in the file will usually increase. Similarly, when you use unexpand to change spaces to tabs, the number of characters in the file will usually decrease. Although using wc -c won't show you the invisible characters, it will give you an indication as to whether or not expand or unexpand worked.
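Here is a sketch of the indirect check, using the one-line A/BBBB/CCC file from earlier in the chapter. As an extra, the sketch uses the l (ell) command of sed to make the tabs visible from the command line. That technique is not one of the editor-based methods described above; it is offered here only as one more possibility.

```shell
# A one-line sample: A, a tab, BBBB, a tab, CCC.
tmp=$(mktemp)
printf 'A\tBBBB\tCCC\n' > "$tmp"

# The l (ell) command of sed prints each line unambiguously:
# a tab appears as \t and the end of the line is marked with $.
visible=$(sed -n l "$tmp")
printf '%s\n' "$visible"

# Character counts before and after changing tabs to spaces.
before=$(wc -c < "$tmp" | tr -d ' ')
after=$(expand "$tmp" | wc -c | tr -d ' ')
echo "before: $before characters, after: $after characters"

rm -f "$tmp"
```

The count should rise from 11 to 20, because the first tab becomes 7 spaces and the second becomes 4.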
Related filters: unexpand
As we discussed earlier in the chapter, when you need to indent text or align data into columns, you can use either tabs or spaces. The choice is yours. Regardless of your preference, however, there will be times when you will find yourself working with data that has tabs, which you need to change into spaces. Similarly, there will be times when you have data with spaces, which you need to change into tabs. In such cases, you can use the expand program to change tabs to spaces, and the unexpand program to change spaces to tabs. Let's start with expand. The syntax is:
expand [-i] [-t size | -t list] [file...]
where size is the size of fixed-width tabs, list is a list of tab stops, and file is the name of a file.
The expand program changes all the tabs in the input file to spaces, while maintaining the same alignment as the original text. By default, expand uses the Unix convention that tab stops are set for every 8 positions: 1, 9, 17, 25, 33, and so on. Thus, each tab in the input will be replaced by 1 to 8 spaces in the output. (Think about that until it makes sense.)
As an example, consider the following file, named animals, that contains data organized into columns. When you display animals, you see the following:
kitten cat
As you can see, the file contains four lines of data. What you can't see, is that each line consists of two words separated by a tab:
kitten<Tab>cat
(I have used the designation <Tab> to represent a single tab character.) The following command expands each tab to the appropriate number of spaces, saving the output in a file named animals-expanded: expand animals > animals-expanded The new file now contains the following:
kitten<Space><Space>cat
(I have used the designation <Space> to represent a single space character.) If you display the new file using cat or less, it will look the same as the original file. However, if you use a text editor or word processor to view the invisible characters (as I described in the previous section), you will see that all the tabs have been changed to spaces. More specifically, within each line, expand has removed the tab and inserted enough spaces so the following word starts at the next tab stop, in this case, position 9. As I mentioned, expand uses the Unix default by assuming that there are 8 positions between each tab stop. You can change this with the -t (tab stop) option. There are two variations. First, if all the tab stops are the same distance apart, use -t followed by that number. For example, let's say you have a large file named data that contains some tabs. You want to expand the tabs into spaces and save the output in the file data-new. However, you want the tab stops to be set at every 4 characters, rather than every 8 characters; that is, you want: 1, 5, 9, 13, and so on. Use the command: expand -t 4 data > data-new Using this notation, we can say that -t 8 would be the same as the Unix default, tab stops at every 8 positions; -t 4 creates tab stops at every 4 positions. The -t has a second variation. If you want the tab stops to be at specific positions, you can specify a list with more than one number, separated by commas. Within the list, numbering starts at 0. That is, 0 refers to the first position on the line; 1 refers to the second position on the line; and so on. For example, to set tab stops at positions 8, 16, 22 and 57, you would use: expand -t 7,15,21,56 data > data-new Finally, there is an option to use when you want to expand tabs, but only at the beginning of a line. In such cases, use the -i (initial) option, for example: expand -i -t 4 data > data-new This command expands tabs, but only at the beginning of a line. 
All other tabs are left alone. In this case, because of the -t option, the tab stops are considered to be at positions 1, 5, 9, and so on. hint The expand program is useful for pre-processing text files with tabs before sending the files to a program that expects columns to line up exactly. For example, the following pipeline replaces all the tabs in a file named statistics, which has tab stops at every 4 positions. After the tabs are replaced, the first 15 characters of each line are extracted, and the result is sorted: expand -t 4 statistics | cut -c 1-15 | sort
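The following sketch shows the effect of -t and -i on a couple of made-up lines. To make the results visible, spaces are displayed as dots, and the last result is passed through sed -n l so that any remaining tab shows up as \t. The helper function show is defined only for this illustration. The -i option is used as described above; if your version of expand does not have it, that last command will fail.

```shell
# Show spaces as dots so the result of expand is visible.
show() { tr ' ' '.'; }

# Default tab stops (1, 9, 17, ...): "ab" is followed by 6 spaces.
default=$(printf 'ab\tcd\n' | expand | show)
printf '%s\n' "$default"

# Tab stops every 4 positions (1, 5, 9, ...): only 2 spaces are needed.
every4=$(printf 'ab\tcd\n' | expand -t 4 | show)
printf '%s\n' "$every4"

# With -i, only the tab at the beginning of the line is expanded;
# the tab after "a" is left alone (sed -n l displays it as \t).
initial=$(printf '\ta\tb\n' | expand -i -t 4 | sed -n l)
printf '%s\n' "$initial"
```

The first two commands should display ab......cd and ab..cd. The third should show four spaces, the letter a, and then an untouched tab before the b.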
Related filters: expand
To change spaces to tabs, you use the unexpand program. The syntax is:
unexpand [-a] [-t size | -t list] [file...]
where size is the size of fixed-width tabs, list is a list of tab stops, and file is the name of a file.
The unexpand program works as you would expect, like expand in reverse, replacing spaces with tabs in such a way that the original alignment of the data is maintained. As with expand, the default tab settings are every 8 positions: 1, 9, 17, and so on. To change this, you use the same two forms of the -t option as with expand: a fixed interval (such as -t 4; every 4 positions), or a list of tab stops (such as -t 7,15,21,56). If you use a list, numbering starts at 0. That is, 0 refers to the first position on the line; 1 refers to the second position on the line; and so on.
One important difference between expand and unexpand is that, by default, unexpand only replaces spaces at the beginning of a line. This is because, most of the time, you would only use unexpand with lines that are indented. You would probably not want to replace spaces in the middle of a line. If, however, you do want to override this default, you can use the -a (all) option. This tells unexpand to replace all spaces, even those that are not at the beginning of a line.
As an example, let's say that you are a student at a prestigious West Coast university, majoring in Contemporary American Culture. You have just attended a lecture about Mickey Mouse, during which you have taken careful notes. When you get home, you use a text editor to type your notes into a file named rough-notes. It happens that, when you indent lines, your text editor inserts 4 spaces for each level of indentation:
Mickey Mouse (1928-)
You want to change all the initial spaces to tabs and save the data in a file called mickey. The command to use is:
unexpand -t 4 rough-notes > mickey
Once you run this command, the file mickey contains:
Mickey Mouse (1928-)
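To convince yourself that the spaces really have become tabs, you can run a small experiment like the following sketch. The two indented lines are made up, and sed -n l is used only to make the tabs visible as \t.

```shell
# Two made-up lines, indented with 4 and 8 spaces.
# unexpand -t 4 turns each group of 4 leading spaces into one tab.
tabbed=$(printf '    Mickey\n        Mouse\n' | unexpand -t 4)

# Display the result with tabs made visible as \t.
visible=$(printf '%s\n' "$tabbed" | sed -n l)
printf '%s\n' "$visible"
```

The first line should begin with one \t and the second with two. One caution, based on my reading of the GNU documentation: in the GNU version of unexpand, using -t also turns on -a, so runs of spaces in the middle of a line may be converted as well. If that matters, GNU offers a --first-only option to restrict the changes to the beginning of each line.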
"Man cannot survive except through his mind. But the mind is an attribute of the individual.
You want to format this paragraph into 40-character lines. To do so, you use the command: fold -w 40 speech However, fold breaks the lines at exactly 40 characters, which means that some of the lines are broken in the middle of words:
"Man cannot survive except through his m
Instead, you use the -s option to tell fold not to break words: fold -s -w 40 speech The output is:
"Man cannot survive except through his
Although the right margin is a bit ragged, the words are kept intact. If you want to save the formatted text, just redirect the output: fold -s -w 40 speech > speech-formatted hint With some programs, you will find yourself using the same options every time you use the program. To streamline your work, you can define an alias that includes the options (see Chapter 13), so you won't need to type them every time. As an example, let's say you always use fold with -s -w 40. You can put one of the following alias definitions in your environment file (see Chapter 14). The first definition is for the Bourne Shell family; the second is for the C-Shell family:
alias fold="fold -s -w 40"
Now, whenever you use fold, you will automatically get the -s -w 40 options. If, from time to time, you want to run fold without -s -w 40, you can (as we discussed in Chapter 13) suspend the alias temporarily by typing a \ (backslash) character in front of the command name. For example, if you want to use fold with -w 60 instead of -s -w 40, you can use the command:
\fold -w 60 long-text > short-text
To run fold with no options, use:
\fold long-text > short-text
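If you would like to see the difference the -s option makes on a small scale, here is a sketch using a made-up line of text and a width of 10. The sed command at the end only trims the space that fold leaves at the end of a line when it breaks after a word.

```shell
text='hello wonderful world'

# Without -s, lines are cut at exactly 10 characters, even mid-word.
hard=$(printf '%s\n' "$text" | fold -w 10)
printf '%s\n' "$hard"

# With -s, each break falls after a space, so words stay whole.
# (sed trims the space that fold leaves at the end of a line.)
soft=$(printf '%s\n' "$text" | fold -s -w 10 | sed 's/ *$//')
printf '%s\n' "$soft"
```

The first command should split "wonderful" and "world" in the middle; the second should put each of the three words on its own line.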
For many years, programmers have used 80 characters per line of text, and terminals have displayed 80 characters per line of output. With the advent of GUIs, which allow you to resize windows dynamically, the magic number 80 pops up less often as an exact line length. Nevertheless, it is still the case that many Unix programs use a default of 80 characters/line, for example:
• The fold program (which we discussed in the previous section) breaks lines, by default, at position 80.
• If you look closely at pages in the online Unix manual (Chapter 9), you can see that they are formatted for an 80-character line.
• When you use a terminal emulator program (Chapter 3), the default line width is usually 80 characters.
Why should this be the case? What's so special about 80 characters/line? Here's the story.
In 1879, the American inventor Herman Hollerith (1860-1929) was working on a system to handle the information for the upcoming U.S. census of 1880. He borrowed an idea from the weaving industry which, since the early part of the 19th century, had been using large cards with holes to control automated looms. Hollerith adapted this idea and developed a system in which census data was stored on punched cards, one card per person. Hollerith designed the cards to be the same size as U.S. banknotes(*), which allowed him to use existing currency equipment, such as filing bins, to process the cards. The cards, which came to be known as PUNCH CARDS, had 20 columns, which was later expanded to 45 columns.
* Footnote In 1862, the U.S. government issued its very first banknote, a $1 bill that measured 7 3/8 inches x 3 1/8 inches. This was the size of the banknotes in Hollerith's time and, hence, the size of his punch cards. In 1929, the government reduced the dimensions of banknotes by about 20 percent, to 6 1/8 inches x 2 5/8 inches, the size which is used today.
Hollerith's system proved to be so useful that, in 1896, he founded the Tabulating Machine Company (TMC) to manufacture his own machines. In 1911, TMC merged with two other companies (the Computing Scale Company of America and International Time Recording Company) to form the Computing Tabulating Recording Company (CTR). In addition to tabulators and punch cards, CTR also manufactured commercial scales, industrial time recorders, and meat and cheese slicers. In 1924, CTR formally changed its name to International Business Machines (IBM).
By 1929, IBM's technology had advanced to the point where they were able to increase the number of columns on punch cards. Using this new technology, the IBM punch card (which, as you remember, was the size of the old dollar bill) was just large enough to hold 80 columns, each of which could store a single character. Thus, when the first IBM computers were developed in the late 1950s, they used punch cards that stored 80 characters/card. As a result, programs and data were stored as 80-character lines and, within a short time, this became the de facto standard.
By the 1980s, punch cards were phased out as programmers began to use terminals and, later, personal computers. However, the 80-character standard persisted, as both terminals and PCs used screens that displayed 80 characters per line, this being what programmers (and programs) expected. It was during this era that Unix was developed, so it was only natural the 80-character line would be incorporated into the Unix culture.
Now you understand why, over 25 years later and in spite of the popularity of GUIs, the 80-character line survives in various nooks and crannies within the world of Unix.
Related filters: fold, pr
The fmt program formats paragraphs. The goal is to join the lines within a paragraph so as to make the paragraph as short and compact as possible, without changing the content or the whitespace. In other words, fmt makes text look nice. The syntax is:
fmt [-su] [-w width] [file...]
where width is the maximum width of a line, and file is the name of a file.
When fmt reads text, it assumes that paragraphs are separated by blank lines. Thus, a "paragraph" is one or more contiguous lines of text, not containing a blank line. The fmt program works by reading and formatting one paragraph at a time according to the following rules:
• Line width: Make each line as long as possible, but no longer than a specific length. By default, the maximum line width is 75 characters, although you can change it with the -w option. To do so, use -w followed by the line width you want, for example, -w 50.
• Sentences: Whenever possible, break lines at the end of sentences. Avoid breaking lines after the first word of a sentence or before the last word of a sentence.
• Whitespace: Preserve all indentations, spaces between words, and blank lines. This can be modified by using the -u option (see below).
• Tabs: Expand all tabs into spaces as the text is read and insert new tabs, as appropriate, into the final output.
As an example, let's say you have a file named secret-raw that contains the following three paragraphs of text. Notice that the lines are not formatted evenly:
As we all know, real
You want to format the text using a line length of 50 characters and save the result in a file named secret-formatted. The command to do so is: fmt -w 50 secret-raw > secret-formatted The contents of secret-formatted are now as follows:
As we all know, real success comes slowly and is
* Footnote This example is taken from an essay entitled "The Secret of My Success". If you want to read the entire essay, you can find it on my Web site.

The fmt program has several other options, but only two are important. The -u (uniform spacing) option tells fmt to decrease white space so that there is no more than a single space between words, and no more than two spaces at the end of a sentence, a style called FRENCH SPACING(*).

* Footnote With French spacing, sentences are followed by two spaces instead of one. This style is generally used with monospaced fonts, where all the characters are the same width. With such fonts, it helps the eye to have an extra space at the end of a sentence. The two principal Unix text editors, vi and Emacs, both recognize French spacing, which allows them to detect where sentences begin and end. This allows vi and Emacs to work with sentences as complete units. For example, you can delete two sentences, change a single sentence, jump back three sentences, and so on. For this reason, many Unix people form the habit of using two spaces after a sentence. (I do, even when I type email.)

For example, the file joke contains the following text:
A man walks into a drug store and goes up to the
You format this with: fmt -u -w 50 joke The output is:
A man walks into a drug store and goes up to the
Notice there is only a single space at the end of the first sentence. This is because there was only one space in the original file, and fmt does not add spaces; it only removes them.

The final option, -s (split only), tells fmt to split long lines, but not to join short lines. Use this option when you are working with text that has formatting you want to preserve, for example, when you are writing a computer program.
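The three behaviors described above (the default refill, -s and -u) are easy to compare side by side. The following is a minimal sketch, assuming the GNU version of fmt; the file name sample.txt and its contents are made up for illustration and are not part of the examples in this chapter.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
tmp=$(mktemp -d)

# A hypothetical test file: two short lines (one with extra spaces
# between words) and one line that is longer than 20 characters.
printf '%s\n' 'one two' 'three   four' 'five six seven eight nine ten' > "$tmp/sample.txt"

echo '--- default: join short lines, break long ones'
fmt -w 20 "$tmp/sample.txt"

echo '--- -s: break long lines only; short lines are left alone'
fmt -s -w 20 "$tmp/sample.txt"

echo '--- -u: no more than one space between words'
fmt -u -w 20 "$tmp/sample.txt"
```

Comparing the three outputs shows why -s is the safer choice for text whose line breaks matter: the short lines come through untouched, and only the long line is split.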
In the next two sections, we are going to discuss pr, a program that was created in the early days of Unix, the 1970s. This was a time when printers were so expensive that no one had his own and pr was designed to prepare files for printing in a shared environment. However, as you will see, pr has important capabilities that — printing aside — are useful in their own right for formatting text. Before we cover these topics, though, I want to take a moment to lay the foundation by describing what it was like to print files in the early days of Unix. Because printers were expensive, they were almost always shared by a group of people. Whenever a user wanted to print a file, he would enter the appropriate commands on his terminal to format and print the file. Or he might run a program that generated printed output. Each request for printing was called a "print job" and, each time a print job was generated, it was put into the "print queue" to wait its turn. In this way, one print job after another would be generated, stored and, ultimately, printed. The actual printer would be in a computer room, a common area used by many people, usually programmers. Output was printed on continuous, fan-fold computer paper, and the output of a single print job was called a "printout". As printouts accumulated, someone — often an "operator", working in the computer room — would separate the printouts by tearing the paper at a perforation. He would then put each printout in a bin, where it would be picked up later by the person who initiated the print request. Because of how the system was organized, there had to be a way for the operator to be able to take a stack of printed paper and divide it into separate print jobs. The pr program was designed to meet the needs of both the user and the operator by offering two services. First, pr would format the text into pages; second, pr would make sure that each page had its own header, margin and page number. 
In that way, printed output would not only look nice (for the user), it would be easy to organize (for the operator).

Today, many people think of the pr program as being only for printing, which is a mistake. True, pr can still do what it was designed to do: prepare output to be sent to a printer (hence the name pr). This is still a useful function, and we will discuss these aspects of pr in the next section. However, pr can do a lot more for you than simply break text into pages and generate headers, margins and page numbers. It can format text in several very useful ways, especially when you learn how to combine pr with fold and fmt. For example, you can use pr to arrange text from a single file into columns. You can also merge text from multiple files, each file having its own column. So once we finish talking about the basic functions of pr, I'll show you how to use it in ways that have nothing to do with printing and everything to do with being efficient and clever.

hint

Today, most people do not use text-based tools for printing ordinary files. Instead, they use graphical tools, such as word processors, which make it easy to control the formatting and pagination. However, if you are a programmer, you will find that, when the need arises, the traditional Unix tools (pr, fmt, nl, fold) are excellent for printing source code.
Related filters: fold, fmt

The primary function of pr is to format text into pages suitable for printing. The pr program can also format text into columns, as well as merge text from multiple files, which we will talk about in the next section. The basic syntax for pr is below:

pr [-dt] [+beg[:end]] [-h text] [-l n] [-o margin] [-W width] [file...]

where beg is the first page to format, and end is the last page to format; text is text for the middle of the header; n is the number of lines per page; margin is the size of the left margin; width is the width of the output; and file is the name of a file.

It is common to use pr as part of a pipeline to format text before it is sent to a printer. For example, let's say you have a program named calculate that generates data which you want to print. The following pipeline sends the output of calculate to pr to be formatted, and then to lpr to be printed. (The two principal Unix programs to print files are lp and lpr.)

calculate | pr | lpr

Here is a similar example. You want to combine, format and print the contents of three files:

cat data1 data2 data3 | pr | lpr

By default, pr formats a page by inserting a header at the top, a margin on the left, and a trailer at the bottom. Both the header and trailer take up five lines. The left margin and the trailer are just for spacing, so they are blank. The header, however, contains information on its middle line: the date and time the file was last modified, the name of the file, and the page number. (These details can vary slightly depending on the version of pr you are using.) As an example, here is a typical header. Leaving out the blank lines, this is what you might see if you formatted a file named logfile:

2008-12-21 10:30 logfile Page 1

By default, pr assumes pages have 66 lines. This is because old printers used 11-inch paper and printed 6 lines/inch.
The header (at the top) and the trailer (at the bottom) each take up 5 lines, which leaves 56 lines/page for single-spaced text. When pr creates pages, it starts with page 1 and continues, in order (page 2, page 3, and so on) until all the data is formatted.

If you want to test pr and see how it works, an easy way is to format a file and send the output to less (Chapter 21). This will allow you to look at the formatted output one screenful at a time. For example, let's say you are taking a class and you have written an essay, which you have stored in a file named essay. To take a look at how pr would format the text, you can use:

pr essay | less

If you like what you see, you can then send it to the printer:

pr essay | lpr

Or, you can save it to a file:

pr essay > essay-formatted

If your essay was originally written using a word processor, it will have very long lines. This is because word processors store each paragraph as one long line(*).

* Footnote Word processing documents are stored in a special binary format. Almost all Unix filters, however, assume that data is stored as text. Thus, if you want to work with a word processing document using the Unix programs in this chapter, you must first save the document as plain text from within the word processor program. For instance, in the examples above, the file essay is the plain text version of a word processing document named essay.doc.

In this case, you can first break the lines appropriately by using fold -s or fmt, whichever works best with your particular text:
fmt essay | pr | less
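The page arithmetic described above (66 lines per page, minus a 5-line header and a 5-line trailer) can be checked without a printer. The following is a small sketch, not a definitive test: it assumes the seq program is available to generate stand-in data, and that your version of pr writes the word "Page" in its header when run with LC_ALL=C, as the GNU version does.

```shell
# Stand-in data: 120 numbered lines.
lines=120

# 66 lines per page, less a 5-line header and a 5-line trailer.
body=$(( 66 - 5 - 5 ))

# Pages needed, rounding up.
pages=$(( (lines + body - 1) / body ))
echo "text lines per page: $body"
echo "pages expected:      $pages"

# Count the header lines that pr actually generates.
# LC_ALL=C is an assumption made here so the header says "Page".
headers=$(seq 1 "$lines" | LC_ALL=C pr | grep -c 'Page [0-9]')
echo "headers found:       $headers"
```

If the two page counts agree, your version of pr is using the same defaults as the ones described in this section.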
Almost all of the time, you will find that, when you use pr to format pages for printing, the defaults are just fine. However, if the need arises, you can change them by using several options. The most commonly used option is -d, which tells pr to use double-spaced text. Consider the following example, which formats and prints the text file essay. This simple pipeline starts by using fmt to format the lines of the file. The output of fmt is sent to pr, where it is formatted into pages with double-spaced text. The output of pr is then sent to the printer: fmt essay | pr -d | lpr The result is a spiffy, double-spaced, printed copy of your essay, suitable for editing or for submitting to your teacher. Note that -d does not modify the original text: all it does is specify what type of spacing to use when the text is formatted into pages. The original file — essay in this case — is left unchanged. If you want to control which pages are formatted, use the syntax: pr +begin[:end] where begin is the first page to format, and end is the last page to format. For example, to skip pages 1 and 2 — that is, to start from page 3 and continue to the end of the file — use: fmt essay | pr -d +3 | lpr
To format and print pages 3 through 6 only, use: fmt essay | pr -d +3:6 | lpr If you want to specify the text for the middle part of the header, use the -h option, for example: fmt essay | pr -h "My Essay by Harley" | lpr To change the total number of lines per page, use -l, followed by a number. For example, say you want only 40 lines of text per page. Counting the header (5 lines) and the trailer (also 5 lines) you need a total of 50 lines/page: fmt essay | pr -l 50 | lpr To eliminate the header, use the -t option. When you use -t, there will be no breaks between pages, which means all the lines will be used for text. This is useful when you are formatting text you do not want to print. For example, you might want to change single-spaced text into double-spaced text. The following command formats the contents of essay, double-spaced with no headers, and saves the output to a file named essay-double-spaced: fmt essay | pr -t -d > essay-double-spaced By default, pr does not insert a left margin. This is fine because, most likely, your printer will be set up to create margins automatically. However, if you want to add an extra left margin of your own, use the -o (offset) option, followed by the size of the extra margin in spaces. In addition, you can change the width of the output (the default is 72 characters), by using the -W option. (Note the uppercase W.) When you use -W, lines that are too long are truncated, so you must be careful not to lose text. The following is a particularly useful example which illustrates how you might use these two options. If you are a student, you know there are times when you need your essays to be a bit longer. For example, you might have an 8-page essay, but your teacher has asked for a 10-page essay. Of course, you could rework your notes, do more research, and rewrite the essay. Or, you could simply print the essay double-spaced with wide margins.(*) * Footnote You did not read this here. 
The following example formats the contents of the file essay using double-spaced pages, with a line width of 50 characters, and a left margin of 5 spaces; that is, 45 characters of text per line. The result is an essay that looks significantly larger than it really is:

fmt -w 45 essay | pr -d -o 5 -W 50 | lpr

If you want to check the output before you print it, use:

fmt -w 45 essay | pr -d -o 5 -W 50 | less

Note: It is necessary to use the option -w 45 with fmt because, by default, fmt produces lines that are up to 75 characters long. However, in this case, we have asked pr to limit the output to 45 characters of text per line. Thus, we need to make sure none of the lines are longer than 45 characters: otherwise, they would be truncated.

As an alternative to fmt, you could use fold -s, which yields similar formatting:

fold -s -w 45 essay | pr -d -o 5 -W 50 | lpr

hint

Both fold and fmt can be used to format lines of text. How do you know which one to use?

The fold program does only one thing: break lines. By default, fold breaks lines at a specific column. However, with the -s option, fold breaks lines between words. This leaves the text a bit ragged on the right, but preserves the words.

The fmt program formats paragraphs. Like fold, fmt breaks long lines. However, unlike fold, fmt will also join together short lines.

Thus, if you need to break lines at a specific point, you use fold. If you need to format text that contains short lines, you use fmt. That much is clear. But what about when you need to break lines at word boundaries and the text is already formatted? In that case, you can use either fold -s or fmt. Note: These two commands will sometimes yield slightly different output, so try both and see which one works best with your particular data.

hint

When you pre-process text with fold or fmt before sending it to pr, it is wise to specify the exact line width you want, because the three programs have different defaults:
• fold: 80 characters/line
• fmt: 75 characters/line
• pr: 72 characters/line
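If you would rather check the defaults than memorize them, you can feed each program some artificial input and measure the lines that come out. The following sketch assumes seq, sed and awk are available; the test data (a line of 200 x characters, and 50 copies of the word "word") is invented for the purpose.

```shell
# One long line of 200 x characters.
long=$(printf 'x%.0s' $(seq 1 200))

echo '--- fold, default width: length of each output line'
printf '%s\n' "$long" | fold | awk '{ print length }'

echo '--- fold -w 45: length of each output line'
printf '%s\n' "$long" | fold -w 45 | awk '{ print length }'

echo '--- fmt, default width: length of the longest output line'
seq 1 50 | sed 's/.*/word/' | fmt |
    awk '{ if (length > max) max = length } END { print max }'
```

With fold, every line except the last should be exactly as long as the width in effect. With fmt, the longest line should not exceed the maximum, although it may well be shorter, because fmt only breaks lines between words.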
As we discussed in the previous section, the purpose of pr is to format text into pages suitable for printing. In addition to the simple formatting we have discussed, pr can also format text into columns. The input data can come from a single file or from several files. When you use pr to create columns, the syntax is as follows:

pr [-mt] [-columns] [-l lines] [-W width] [file...]

where columns is the number of output columns; lines is the number of lines per page; width is the width of the output; and file is the name of a file.

Let's start with a single file. To specify the number of output columns, use a - (hyphen) character followed by a number. For example, for two columns, use -2. To control the length of the columns, use the -l option.

Here is a typical example. You are an undergraduate student at a small but prestigious liberal arts college. Your academic advisor believes you will have a better chance of being accepted to medical school if you participate in extracurricular activities, so you join the Anti-Mayonnaise Society. As part of your duties, you agree to work on the newsletter. You have just finished writing an article on why mayonnaise is bad. You have saved the article, as plain text, to a file named article. Before you can import the article into your newsletter program, you need to format the text into two columns, with a page length of 48 lines. The following pipeline will do the job. Note the use of fmt to format the text into 35-character lines before sending it to pr:

fmt -w 35 article | pr -2 -l 48 > article-columns

The formatted text you need is now stored in a file named article-columns.

Where did the number 35 come from? When you use pr, the default line width is 72. There will be at least one space at the end of each of the two columns, which leaves a maximum of 70 characters of text. Divide this by two to get a maximum of 35 characters/column.
(If you use the -W option to change the line width, you must change your calculations accordingly.)

hint

When you format text into columns, pr will blindly truncate lines that are too long. Thus, if your text contains lines that are longer than the column width, you must break the lines using fold -s or fmt before you send the text to pr.

By default, pr aligns columns using tabs, not spaces. If you would rather have spaces, all you have to do is use the expand program (discussed earlier in the chapter) to change the tabs into spaces. For example, the following command line pipes the output of pr to expand before saving the data:

fmt -w 35 article | pr -2 -l 48 | expand > article-columns

If you examine the output of this command carefully, you will see that the alignment is maintained by using spaces, not tabs. (For help in visualizing tabs and spaces within text, see the discussion earlier in the chapter.)

The final use for pr is to format multiple files into separate columns. Use the -m (merge) option, and pr will output each file in its own column. For example, to format three files into three separate columns, use:

pr -m file1 file2 file3

When you format three files in this way, the maximum width of each column is, by default, 23 characters.(*) If the input files contain lines longer than the column width, pr will truncate the lines. To avoid this, you must format the text before sending it to pr.

* Footnote The default line width is 72 characters. At the end of each column, there will be at least one space. Since we are creating three columns, subtract 3 from 72 to get a maximum of 69 characters of text. Dividing by 3 gives us a maximum of 23 characters/column.

Here is an example. You are preparing a newsletter, and you have written three news stories, which you have saved in the files n1, n2 and n3. You want to use pr to format the stories into three columns, one per story, all on a single page.
Before you use pr, you must use fold -s or fmt to format the text so that none of the lines are longer than 23 characters. The following commands do the work, saving the output in three files f1, f2 and f3:
fmt -w 23 n1 > f1
fmt -w 23 n2 > f2
fmt -w 23 n3 > f3
You can now use pr to format the three articles, each in its own column: pr -m f1 f2 f3 > formatted-articles When you merge multiple files in this way, it is often handy to use -t to get rid of the headers: pr -mt f1 f2 f3 > formatted-articles This will give you long, continuous columns of text without interruptions.
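To see how pr fills columns, it helps to experiment with input that is short enough to read at a glance. The following is a rough sketch, assuming the GNU version of pr. The scratch files c1 and c2 and their contents are invented for this illustration; they are not the newsletter files used above. The two calculations at the start simply repeat the column-width arithmetic from this section.

```shell
# Column widths for the default 72-character line: one space is
# reserved at the end of each column, and the rest is shared equally.
echo "two columns:   $(( (72 - 2) / 2 )) characters each"
echo "three columns: $(( (72 - 3) / 3 )) characters each"

# One file, two columns: pr fills the first column from top to
# bottom before it starts the second one.
seq 1 10 | pr -2 -t -l 5 | expand

# Two files, one column each. (c1 and c2 are hypothetical scratch files.)
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmp/c1"
printf '%s\n' red yellow dark-red > "$tmp/c2"
pr -m -t "$tmp/c1" "$tmp/c2" | expand
```

Notice the difference between the two cases. With a single file, the numbers run down the first column and then down the second. With -m, each output line holds one line from each file, side by side.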
Review Question #1: What are the three principal options of the wc program, and what does each one do? What is it about wc that makes it such a useful tool? When you use wc, what is the definition of a "line"?

Review Question #2: When you use tabs with Unix, what are the default tab positions?

Review Question #3: The fold, fmt and pr programs can all be used to reformat text. What are the principal differences between these programs?

Review Question #4: The fold, fmt and pr programs have different default line lengths. What are they?

Applying Your Knowledge #1: Use the command less /etc/passwd to look at the password file on your system. Notice that the file contains one line per userid, and that each line contains a number of fields, separated by : characters. The first field is the userid. Create a pipeline that generates a sorted, numbered list of all the userids on your system. Hint: use cut (Chapter 17), then sort, then nl.

Applying Your Knowledge #2: The command ls displays a list of all the files in your working directory (except dotfiles). Create a pipeline that counts the number of files. Hint: use ls, then wc with the appropriate option. Next, create a command that displays output like the following (where xx is the number of files):

I have xx files in my working directory.

Hint: Use echo (Chapter 12) with command substitution (Chapter 13), making use of the pipeline you just created.

Applying Your Knowledge #3: Go to a Web site of your choice, and copy some text into the clipboard. From the Unix command line, use the cat program to create a file named webtext:

cat > webtext

Paste in the text and press ^D. (Copy and paste is discussed in Chapter 6.) You now have a file containing the text from the Web site. Create a pipeline that formats this text into 40-character lines, changing multiple spaces to single spaces. At the end of the pipeline, display the text one screenful at a time.

Applying Your Knowledge #4: Using the webtext file from the last exercise, format the text into pages with two columns, suitable for printing. Each column should be 30 characters wide, and the pages should be 20 lines long. The columns should be created with spaces, not tabs. Display the formatted output one screenful at a time. Once you are satisfied that the output is correct, save it to a file named columns.

For Further Thought #1: As we discussed in the chapter, 80-column lines were used widely in the world of computing because, in the 1950s, the first IBM computers used punch cards, which could hold 80 characters per card. The number 80 was mere serendipity, as the size of the punch card was taken from the size of the old U.S. dollar bill. This is an example of how old technology influences new technology. In a similar manner, when IBM introduced the PC in 1981, the keyboard design was based on the standard typewriter, which used the so-called QWERTY layout (named after the six keys at the top left). It is widely accepted that the QWERTY layout is a poor one, as the most important keys are in particularly awkward locations. Although there exist much better keyboard layouts, the QWERTY keyboard is still the standard. Why do you think old technology has such a strong influence on new technology? Why is this bad? Why is this good? (Take a moment to look up the layout of the Dvorak keyboard, an intelligent alternative to the QWERTY. I have been using a Dvorak keyboard for years, and I would never switch back.)
© All contents Copyright 2025, Harley Hahn