The Joy Of Awk
How much can you do with just one line of code?
The Doctor shows off his favourite little language...
Let's deal with the strange name first. Awk is named after the surnames of its inventors - Alfred Aho, Peter Weinberger and Brian Kernighan. For me, Awk is quintessential Linux. Why? Well, it's a text filter and plays very nicely in pipelines, but most importantly, it lets you do Really Useful Stuff with very little code.
Awk is a fully-fledged programming language. It has variables, arithmetic, arrays, loops, branches, functions and all the other stuff you'd expect a regular programming language to have. But I have made no attempt to deliver a systematic description of the language here: rather, my purpose is to show and explain a few simple examples of how you can do useful things in just a line or two of code. A 'small is beautiful' celebration, if you like.
Basic instinct
Awk's basic instinct is to read its input stream line by line and parse the line apart into fields. Within your Awk program, the fields are referenced as $1, $2, and so on. By default the fields are separated by whitespace characters, but you can change that, as we'll see shortly. The example in the figure shows a structured 'shopping list' and how Awk splits it apart into fields.
An Awk program consists of a series of patterns and actions like this:
pattern { action }
pattern { action }
The pattern selects the lines to be processed, and the action describes what's to be done on the lines that match the pattern. Since at this point we don't know what either a pattern or an action might look like, this probably doesn't help much, so here's a simple example to get us started. This Awk command will display the mount points (the second field) of all the lines in /etc/fstab:
$ awk '{ print $2 }' /etc/fstab
Here, '{ print $2 }' is the action part (print the second field). So where's the pattern, you ask? Good question! There isn't one, and in this case (which is very common), Awk will perform the action on every line.
If you try this on your own fstab you'll probably get some spurious output from the comment lines in the file. The comment lines begin with a # and we can use a pattern (in this case a regular expression match enclosed in slashes) to filter these out, like this:
$ awk '/^[^#]/ { print $2 }' /etc/fstab
In case you don't speak regex, the regular expression (between the slashes) says "lines that begin with something that isn't a #". Here's another example of simply picking off a specified field. The output from the date command is structured like this:
Mon Aug 5 19:12:42 BST 2013
and we can print the time field like this:
$ date | awk '{ print $4 }'
The important difference here is that Awk is processing the data stream from a pipe, not a file. (In this particular case, the stream only contains one line - this is quite common in Awk usage examples.) To make an obvious (but important) point, using Awk to process text requires you to understand how your input data is structured - specifically, what data appears in each field.
Now there are lots of structured text files in Linux that use : as the field separator - /etc/passwd comes to mind - so sometimes we need to be able to tell Awk to use a different character as the field separator, as in this example, that prints just the usernames (first field) from the password file:
$ awk -F: '{ print $1 }' /etc/passwd
Notice the use of the -F option to set the field separator. (There are other ways of splitting fields, including fixed-width fields and fancier methods based on regular expression matches.)
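As a taste of those fancier methods, the -F option will also accept a regular expression as the field separator. Here's a small sketch (my own example, not from a real config file) that splits on runs of commas and spaces:

```shell
# -F can take a regex: here, one or more commas or spaces count as one separator
printf 'alpha,,beta, gamma\n' | awk -F'[, ]+' '{ print $2 }'
```

The line splits into three fields (alpha, beta, gamma), so this prints beta. Without the regex, the empty field between the two commas would throw the numbering off.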
Let's go a step further and print just the usernames that correspond to ordinary user accounts. On a Red Hat system these can be identified by having a UID (third field) of at least 500, and we can select these lines with an Awk pattern like this:
$ awk -F: '$3 >= 500 {print $1}' /etc/passwd
This innocent-looking example actually starts to hint at the real power of Awk. It can do arithmetic, something that's beyond tools like grep and sed.
Here's another one-line program that also does arithmetic. This one displays the average of numbers entered on a line:
$ awk '{ sum = 0; for (i = 1; i <= NF; i++) sum += $i; print sum/NF }'
4 6 14
8
72 123 45 17
64.25
I didn't specify an input file, so in true filter style, Awk is taking input from the keyboard and I'm entering the data manually. Look carefully, and you'll see that input is interspersed with output on alternate lines. The program uses a classic C-style for loop to process each field on the line, adding them up. The built-in variable NF is set to the number of fields on the current input line. Keep in mind that this one-liner has no PATTERN, so our little program runs on every line of input.
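Of course, being a filter, the same one-liner works just as happily with its input arriving from a pipe rather than the keyboard:

```shell
# The averaging one-liner, fed from a pipe instead of the keyboard
printf '4 6 14\n72 123 45 17\n' |
  awk '{ sum = 0; for (i = 1; i <= NF; i++) sum += $i; print sum/NF }'
```

This prints 8 and 64.25, one average per input line, just as in the interactive session above.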
> Awk's basic behaviour is to split input lines into fields, referenced as $1, $2 and so on. NF is the number of fields; NR is the record number.
Command substitution
Command substitution is a shell programming trick that is not directly related to Awk, but is often used around Awk commands. Suppose I have a file called hostlist that contains a list of machine names, one per line.
$ ping `cat hostlist`
will run the cat command inside the backquotes, collect its output, and substitute it back on to the command line, forming the arguments to the ping command. There's an alternative syntax (which I prefer because my aging eyes have trouble distinguishing the various sorts of quotes) that looks like this:
$ ping $(cat hostlist)
Command substitution is often used in shell scripts to assign the output of a command to a variable, something like this:
me=$(whoami)
Awk in the real world
Awk finds widespread use in Linux sysadmin scripts, and the majority of them are simple one-liners. Here's one from /etc/init/mounted-tmp.conf on my Ubuntu 12.04 system. I've left in the surrounding lines, so that you can see the context:
# Check if we have enough space in /tmp, and if not, mount
# a tmpfs there
avail=`df -kP /tmp | awk 'NR == 2 { print $4 }'`
if [ "$avail" -lt 1000 ]; then
mount -t tmpfs -o size=1048576,mode=1777 overflow /tmp
fi
This is classic usage, where Awk appears inside a command substitution (see the boxout) to set the value of a variable within a script. Here, Awk is selecting the fourth field from the second line of the output from df. This isn't some arbitrary decision; whoever wrote the script clearly knew exactly what the output of the df command was going to look like.
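The NR == 2 pattern is doing the line selection here: it matches only the second record, skipping df's header line. Since real df output varies from system to system, here's the same trick on a made-up stand-in for that output:

```shell
# Stand-in for df's output: a header line, then one data line;
# NR == 2 selects only the data line, and $4 picks its fourth field
printf 'Filesystem 1024-blocks Used Available\ntmpfs 1000 200 800\n' |
  awk 'NR == 2 { print $4 }'
```

This prints 800, the value in the Available column.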
Beyond one-liners
As our Awk programs get longer, typing them in on the command line becomes distinctly tedious, so before we go any further let's see how we can put them into an external script file. It's easy: you just put your Awk statements into a file, then reference it on the command line with the -f option. Here's a program that will find the largest UID in the password file:
BEGIN { maxuid = 0; FS = ":" }
{ if ($3 > maxuid) maxuid = $3 }
END { print "the largest UID is ", maxuid }
If I put these three lines into a file called maxuid, I can run it like this:
$ awk -f maxuid /etc/passwd
Let's dissect the program. It has three statements. The special BEGIN pattern matches just before we process the first line of the file. Here we use it to initialise a couple of variables including the built-in variable FS that defines the field separator. This provides an alternative to using the -F option on the command line. Sometimes the BEGIN pattern is also used to print headings. The second statement has no pattern so the action is performed for every line. This is the logic that tracks the largest UID (in the third field). Finally, the END pattern matches after we've processed the last line and is often used to print out final results, as we do here.
Notice that we do not have to pre-declare variables or give them a type. They spring into existence at the mere mention of their name, and their type will be inferred from whatever type you assign to them. This is very different from more traditional languages like C that require you to declare all variables, and specify their type, before you use them. But this "dynamic typing" is common in more modern languages like PHP and Python.
If you run this program you'll probably find that the answer is 65534. This UID belongs to a rather spurious account called "nobody". To ignore this line we could modify our program to add a pattern to the second statement like this:
BEGIN { maxuid = 0; FS = ":" }
$1 != "nobody" { if ($3 > maxuid) maxuid = $3 }
END { print "the largest UID is ", maxuid }
Here's another example captured from the wild on Ubuntu, from the file /etc/init.d/vmware:
count=`/sbin/lsmod | awk 'BEGIN {n = 0} {if ($1 == "'"$driver"'") n = $3} END {print n}'`
Again, it's an example of command substitution, and there's some tricksy quoting going on around the variable $driver (defined earlier in the script).
Emulating other tools
Awk is a general-purpose tool that can emulate all sorts of special-purpose programs. For example, this one-liner prepends line numbers to the file foo and is equivalent to cat -n:
$ awk '{ print NR, $0 }' foo
This example counts the words in its input, equivalent to wc -w:
$ awk '{ w += NF } END { print w }' foo
And this example is basically equivalent to grep:
$ awk '/^chris/' /etc/passwd
chris:x:1000:1000:Chris Brown,,,:/home/chris:/bin/bash
In this example the program has a pattern (the regular expression match /^chris/) but no action. Awk's default action is simply to print the line, like grep does.
The following examples are all based around a 'shopping list', shown in the figure. This file is purposefully structured, and is exactly the sort of thing that Awk loves to pick apart. Take a moment to look at the file, otherwise the examples won't make sense.
First, let's display the total number of items we want to buy. We just have to add up the values in the second column:
$ awk '{ items += $2 } END { print items }' shopping
Next, we'll display the total shopping bill. This involves multiplying the quantity by the unit price for each item, then adding them up:
$ awk '{ cost += $2 * $4} END { print cost }' shopping
Suppose we wanted to add up only the spending on DIY items. Adding a simple regex pattern does the trick:
$ awk '/^DIY/ { cost += $2 * $4} END { print cost }' shopping
Now let's find the most expensive item:
$ awk '{ if ($4 > max) max = $4 } END { print max }' shopping
Here, max is just a variable name I made up. It's not a built-in function or anything. Notice again the dynamic typing - I don't need to pre-declare max or initialise it to zero.
Next, we'll obtain a list (in the file "lotsofthem") of all items that we need 10 or more of. This example is mainly here to show you how to redirect output to a file:
$ awk '$2 >= 10 { print $3 > "lotsofthem" }' shopping
To be clear, the > here is not being interpreted by the shell but by Awk. It says "write the output to this file".
Now here's a really neat example that splits the data out into multiple files, one per category:
$ awk '{ print > $1 }' shopping
Here, $1 (the shopping category) defines the output file name. You end up with files called Bakers, DIY and so on. I love this example! If you're not convinced by the power of one-liners yet, I give up!
This next example breaks down the expenditure by category. Here, I've put the Awk program into an external file called catcost:
{ cost[$1] += $2 * $4}
END { for (cat in cost) print cat, cost[cat] }
... and I ran the program like this:
$ awk -f catcost shopping
Bakers 39.7
Clothes 134.99
DIY 283.3
Supermarket 69.55
This example uses an associative array (an array whose subscripts are strings); the array is called cost and the subscripts are the shopping category names.
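The same associative-array technique works for counting as well as summing. Here's a little sketch of my own (using a few lines of the shopping data inline, so you can see the output) that counts how many items fall into each category:

```shell
# Count the number of lines in each shopping category
# (for-in visits subscripts in no particular order, hence the sort)
printf 'DIY 1 Hosepipe 15.00\nBakers 3 Bread 2.40\nDIY 2 Doorknob 8.40\n' |
  awk '{ n[$1]++ } END { for (cat in n) print cat, n[cat] }' | sort
```

This prints "Bakers 1" and "DIY 2". Note the sort: Awk makes no promises about the order in which for-in visits the array subscripts.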
Here's one last example that uses our shopping file. It extends the previous example to determine which shopping category we spent the most money on. Here's the script; I've added line numbers for reference:
1. # Calculate the most expensive category
2. { cost[$1] += $2 * $4 }
3. END {
4. max=0;
5. for (cat in cost) {
6. if (cost[cat] > max) {
7. max = cost[cat];
8. maxcat = cat;
9. }
10. }
11. print maxcat;
12. }
Some explanation is in order: line 1 is a comment. Line 2 accumulates the category costs into an associative array, as in the previous example. Lines 3-12 are all part of the END action. Notice it's starting to look more like a regular program now, and we see ; used as a statement terminator. Lines 5-10 loop over the categories, scanning for the largest cost. In case you're starting to hyperventilate, you can breathe easy - that's the longest Awk program I'm going to show!
Writing self-contained scripts
You can also create self-contained Awk scripts that run directly from the shell. This is slightly more convenient than typing awk -f on the command line each time. Here's how:
First, add a 'shebang' line at the top of your Awk script file so that Linux knows which interpreter to use, something like this:
#!/usr/bin/awk -f
{ cost[$1] += $2 * $4 }
END { for (cat in cost) print cat, cost[cat] }
Now make the script executable as you would any other:
$ chmod u+x catscript
Now you can run the script directly as a command:
$ ./catscript shopping
If you like the one-liners, take a look at www.pement.org/awk/awk1line.txt. If you'd like to see some much longer Awk programs, download the gawk manual at www.gnu.org/software/gawk/manual. The classic text is The Awk Programming Language by Aho, Weinberger and Kernighan. It's worth hunting one down on eBay because of the exquisite clarity of the writing. Bye for now!
Wait - there's more
Awk has lots of built-in functions. There are mathematical functions such as sin(), cos(), log() and sqrt(), which you may never use, and string-handling functions you may find more useful, ranging from length() which simply returns the length of a string, to split() which divides a string into pieces, and gsub() which does text replacement based on a regex match. You can also define your own functions.
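Of those string functions, gsub() is probably the one you'll reach for most. It replaces every match on the line and, handily, returns the number of substitutions it made:

```shell
# gsub() replaces every match in $0 and returns the substitution count
echo 'the cat sat on the mat' |
  awk '{ n = gsub(/at/, "og"); print n, $0 }'
```

This prints "3 the cog sog on the mog" - three matches replaced, and $0 has been modified in place.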
Supermarket 1 Chicken 4.55
Supermarket 50 Clothespegs 1.25
Bakers 3 Bread 2.40
DIY 1 Hosepipe 15.00
Clothes 1 Trousers 24.99
DIY 2 Doorknob 8.40
Supermarket 2 Milk 1.25
Clothes 6 Socks 9.00
DIY 2 Screwdriver 2.00
Clothes 2 Skirt 28.00
DIY 20 Sandpaper 10.00
Bakers 10 Muffin 1.95
Bakers 2 Quiche 6.50
DIY 50 Nails 0.95
> The structured shopping list file used in some of the examples.
Variable Meaning
ARGC, ARGV Provide access to the command line arguments passed to an awk program
NF Number of fields in the current line
NR The current record number (line number)
ENVIRON An associative array that provides access to the program's environment variables
FS The input field separator (whitespace by default)
RS The record separator (newline by default, but can be a regex)
OFS Output field separator - used to separate the fields printed by a print statement. Default is a space
IGNORECASE If set, regex matches in awk are not case-sensitive
FIELDWIDTHS A space-separated list of field widths for splitting input with fixed column boundaries
> A few of Awk's built-in variables. The first four provide information; the rest control behaviour.
The history of Awk
Awk was originally written in 1977 and released into Version 7 Unix in 1978. In keeping with the spirit of this tutorial the authors wrote: "We knew how the language was supposed to be used, and so we only wrote one-liners." The original authors extended the language in 1985, adding user-defined functions amongst other things. Their classic book The AWK Programming Language was written in 1988. The language was formally defined in a POSIX standard in 1992. Gawk (the version you'll find on Linux) comes from the GNU project and adds many extensions.
Dr Brown's Administeria, Linux Format, December 2013, pp56-59