The Making of Effective awk Programming
Fixing Index Entries
The next task was to work on the indexing entries. The original gawk.texi
file already had a number of index entries that I had placed there. makeinfo
translated them into DocBook <indexterm> entries, but they
still needed some work. For example, occasionally additional material appeared
on the same line as the closing </indexterm> tag. More importantly,
special characters in the text of an index entry, such as <
and >, were not turned into &lt; and &gt;
in the generated DocBook. Also, O'Reilly's convention is to not have any font
changes in the contents of an index entry. The fixindex.awk script
dealt with all of these. The first part handles splitting off any trailing text:
#! /bin/gawk -f
# <indexterm> always comes at the beginning of a line.
# 1. If there's anything after the </indexterm>, insert a newline
# 2. Remove markup in the indexed items
/<indexterm>/ {
    if (match($0, /<\/indexterm>./)) {
        front = substr($0, 1, RSTART + 11)
        rest = substr($0, RSTART + RLENGTH - 1)
    } else {
        front = $0
        rest = ""
    }
If the text of the index entry has font changes in it, the next part extracts
the contents of the entry, removes the font changes, and then puts the tags
back in:
    if (match(front, /<(literal|command|filename)>/)) {
        text = gensub(/<indexterm>(.+)<\/indexterm>/, "\\1", 1, front)
        gsub(/<\/?(literal|command|filename)>/, "", text)
        front = "<indexterm>" text "</indexterm>"
    }
Looking at this now, sometime later, I see that the removal and restoration
of the <indexterm> tags isn't necessary. Nevertheless, I
leave it here to show the code as I wrote it then.
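As a minimal sketch of that simplification (not the code used for the book), the font tags can be deleted in place, since the pattern never matches the <indexterm> tags themselves. The function name strip_index_markup is hypothetical. The sketch reads standard input, uses only POSIX awk features, and assumes any trailing text has already been split off onto its own line:

```shell
# Hypothetical helper (not from the article): strip font markup from index
# lines without first extracting the text between the <indexterm> tags.
strip_index_markup() {
    awk '
    /<indexterm>/ {
        # The pattern matches only the three font tags, so the
        # surrounding <indexterm> tags are left alone.
        gsub(/<\/?(literal|command|filename)>/, "")
    }
    { print }
    '
}
```

Lines that do not contain <indexterm> pass through unchanged, so font markup in the running text is preserved.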
The rest of the rule deals with index entries for the <, <=,
>, and >= operators, converting them into the
appropriate DocBook entities. Finally, it prints the modified line and any trailing
material that may have been present, and then gets the next input line with
next. The final rule simply prints lines that aren't indexing lines:
    gsub(/><=/, ">\\&lt;=", front)
    gsub(/>< /, ">\\&lt; ", front)
    gsub(/>>=/, ">\\&gt;=", front)
    gsub(/>> /, ">\\&gt; ", front)
    print front
    if (rest)
        print rest
    next
}
{ print }
Fixing Options
As you may have noticed, the scripts have been progressing from larger-scope fixes to smaller-scope fixes. This next script deals with a fine-grained, typographical detail.
In the Italic font O'Reilly uses to represent options, the correct character
to use for a hyphen or dash is the en-dash, discussed earlier. This is represented
by the DocBook &ndash; entity. Furthermore, gawk's
long options start with two dashes, not one. In both the Italic font in the
text and in the Roman font in the index, the two dashes run together when printed,
making them difficult to distinguish.
This next script solves both problems. It converts plain dash characters to
&ndash;, and inserts an &thinsp; character
between two en-dashes. The &thinsp; is a very small amount
of horizontal spacing whose job is to provide just such tiny amounts of separation
between characters. This script also works by setting RS to a regular
expression matching the text of interest, modifying the matched text that gawk saves in RT,
and then printing the record and the new text back out.
The <primary> and <secondary> tags only
appear inside <indexterm> tags. The <option>
tags delimit options in the book's main text:
#! /bin/awk -f
BEGIN {
    RS = "<(primary|secondary|option)>-(-|[A-Za-z])+"
}
{
    if (RT != "") {
        new = RT
        new = gensub(/--/, "\\&ndash;\\&thinsp;\\&ndash;", "g", new)
        new = gensub(/-/, "\\&ndash;", "g", new)
    } else
        new = ""
    printf("%s%s", $0, new)
}
Manual Work
After going through all the above scripts, the book was almost ready for prime
time. All my scripts had produced a DocBook XML document that was quite close
to what I would have produced had I been entering the text directly in DocBook.
It took considerably less effort than if I had tried to convert the text from Texinfo
to DocBook either with the sed stream editor or manually, using
editor commands (the colon prompt within vim).
Nevertheless, my Notes file lists a fair number of manual changes
that I had to make, things that weren't amenable to scripting. Most of these,
though, could be tackled using the vim command line. (Believe me, if I could have fixed these with a script too, I would have. But sometimes there are things that a program just isn't quite smart enough to handle.)
After all of these changes, I was at the final stage. In fact, this was during the technical review stage, and for a brief while before submitting the book to O'Reilly's Production department, I was making edits in parallel, in both the Texinfo and the DocBook versions of the book. The main reason for this was to avoid having to remake all the manual edits. It was easier to make a few incremental changes in parallel than to just edit the Texinfo file, regenerate DocBook, and then have to redo all the manual edits.
Fixing Identifiers
One final transformation was needed before submitting the book to Production.
O'Reilly has a standard convention for naming chapters, sections, tables, and
figures within the id="..." clause of the appropriate
tags. For example, <sect2 id="eap3-ch-3-sect-2.1">. These
same identifiers are used in <xref> tags for cross references.
However, makeinfo produced identifiers based on the original names
of the @node lines in the gawk.texi file. For example,
<sect1 id="How20To20Contribute">. (Here, the spaces in the
original node name are replaced by 20, which is the numeric value
of the space character, in hexadecimal.) I needed to transform these generated
identifiers into ones that followed the O'Reilly convention.
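To make the encoding concrete, here is a small sketch that reproduces it for the example above. The function name node_to_id is hypothetical, and the sketch assumes that spaces are the only characters needing replacement. How makeinfo treats other special characters is not covered here.

```shell
# Hypothetical helper: mimic the space-to-"20" encoding described above.
# Assumption: only spaces are encoded; other characters pass through as-is.
node_to_id() {
    printf '%s\n' "$1" | awk '{ gsub(/ /, "20"); print }'
}
```

Calling node_to_id "How To Contribute" prints How20To20Contribute, matching the generated id shown above.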
The following script, redoids.awk (re-do ids), does this job.
It makes two passes over the input. The first pass extracts the existing ids
from chapter, section, and table tags. It maintains the appropriate chapter and
section level counts, and by using them, generates the correct new tag for the
given item. The first pass builds up a table (an associative array),
mapping the old ids to the new ones.
The second pass goes through the file, actually making the substitutions of new id for old. It can't be done all in one pass since there are cross references, both forwards and backwards, scattered throughout the text.
Setting Up Two Passes
The BEGIN block checks that exactly one argument was given, and
prints an error message if not. It then sets some global variables, namely,
the book name and IGNORECASE, which causes gawk to
ignore case when doing regular expression matching:
#! /bin/gawk -f
BEGIN {
    if (ARGC != 2) {
        print("usage: redoids file > newfile\n") > "/dev/stderr"
        abnormal = 1
        exit 1
    }
    book = "eap3"
    IGNORECASE = 1
This next part actually sets up two passes over the input. It first initializes
Pass to 1. Next, it appends a variable assignment, Pass=2, to ARGV, followed by a
second copy of the input filename, incrementing ARGC each time.
Because gawk performs a command-line variable assignment only when it reaches it
in ARGV, the upshot is that gawk reads through the file twice, with the
variable Pass being set appropriately each time through. The code
for the two passes then distinguishes which pass is which by testing Pass:
    # set up two passes
    Pass = 1
    ARGV[ARGC++] = "Pass=2"
    ARGV[ARGC++] = ARGV[1]
}
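The same ARGV technique can be tried on a toy input. The following is only a sketch under stated assumptions, not the redoids.awk script: the function name two_pass_demo, the "define NAME" and "ref NAME" line formats, and the "sect-N" ids are all invented for illustration. Pass 1 builds the old-to-new table, and pass 2 rewrites both definitions and references, so a reference that precedes its definition still resolves.

```shell
# Hypothetical demo of the two-pass setup; works with any POSIX awk.
# Usage: two_pass_demo FILE
two_pass_demo() {
    awk '
    BEGIN {
        Pass = 1
        ARGV[ARGC++] = "Pass=2"    # assignment takes effect after the first read
        ARGV[ARGC++] = ARGV[1]     # then read the same file a second time
    }
    # Pass 1: map each defined name to a new sequential id; print nothing.
    Pass == 1 && $1 == "define" { n++; id[$2] = "sect-" n }
    # Pass 2: substitute the new id in definitions and references.
    Pass == 2 && ($1 == "define" || $1 == "ref") { print $1, id[$2]; next }
    # Pass 2: all other lines are copied through unchanged.
    Pass == 2 { print }
    ' "$1"
}
```

Given a file containing the three lines "ref beta", "define alpha", and "define beta", the function prints "ref sect-2", "define sect-1", and "define sect-2".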





