The Making of Effective awk Programming
The First Pass
Top-level section headings within a chapter are often referred to in publishing
as "A-level headings," or just "A heads" for short. Similarly, the next levels of
section headings are "B heads," "C heads," and so on. The variables ah,
bh, ch, and dh count the headings seen so far at each level.
Whenever a heading at one level is seen, the variables for all the levels below
it must be reset to zero. The variable tab holds the current table number within
a chapter. The chnum variable tracks the current chapter number. Thus, this first
rule sets all the per-chapter counters to zero, extracts the current id, and
computes a new one:
Pass == 1 && /^<chapter/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = sprintf("ch-%d", ++chnum)
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}
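To see what this rule computes, here is a minimal sketch that runs the same bookkeeping on two sample chapter tags. It is not the script from the article: the sample ids and the book value "eap3" are made up for illustration, and match() with substr() stands in for gawk's gensub() so that the sketch runs under any POSIX awk.

```sh
# Sketch only: sample ids and the book name "eap3" are hypothetical.
chapter_ids() {
  awk -v book="eap3" '
    /^<chapter/ {
        # Find id="..." and keep only the text between the quotes.
        if (match($0, /id="[^"]+"/)) {
            oldid = substr($0, RSTART + 4, RLENGTH - 5)
            newtag = sprintf("%s-ch-%d", book, ++chnum)
            printf "%s -> %s\n", oldid, newtag
        }
    }'
}

printf '%s\n' '<chapter id="Getting Started">' '<para>text</para>' \
              '<chapter id="Regexp">' | chapter_ids
```

With this input the sketch prints "Getting Started -> eap3-ch-1" followed by "Regexp -> eap3-ch-2"; in the real script the pair is stored in the tags array instead of being printed.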
The next few rules are similar, and handle chapter-level items that aren't
actually chapters:
Pass == 1 && /^<preface/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = "ch-0"
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}
Pass == 1 && /^<appendix/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    applet = substr("abcdefghijklmnopqrstuvwxyz", ++appnum, 1)
    curchap = sprintf("ap-%s", applet)
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}
Pass == 1 && /^<glossary/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = "glossary"
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}
Next comes code that deals with section tags. The first rule handles a special
case. Two of the appendixes in Effective awk Programming are the
GNU General Public License (GPL), which covers the gawk source code, and the
GNU Free Documentation License (FDL), which covers the book itself. The sections
in these appendixes don't have ids, nor do they need them. The first rule skips
them.
The second rule does much of the real work. It extracts the old id, and then it extracts the level of the section (1, 2, 3, etc.). Based on the level, it resets the lower-level heading variables and sets up the new id.
The third rule handles tables. Tables are numbered sequentially within each
chapter, and the number is always formatted with two digits:
Pass == 1 && /<sect[1-4]>/ { next } # skip licenses
Pass == 1 && /^<sect[1-4]/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    level = substr($1, 6, 1) + 0 # get level
    if (level == 1) {
        ++ah
        sectnum = ah
        bh = ch = dh = 0
    } else if (level == 2) {
        ++bh
        sectnum = ah "." bh
        ch = dh = 0
    } else if (level == 3) {
        ++ch
        sectnum = ah "." bh "." ch
        dh = 0
    } else {
        ++dh
        sectnum = ah "." bh "." ch "." dh
    }
    newtag = sprintf("%s-%s-sect-%s", book, curchap, sectnum)
    tags[oldid] = newtag
}
Pass == 1 && /^<table/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    newtag = sprintf("%s-%s-tab-%02d", book, curchap, ++tab)
    tags[oldid] = newtag
}
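The way the heading counters interact is easiest to see on a short run of section tags. The following sketch keeps only the counter logic from the second rule and prints the section number it would assign; the sample ids are invented, and the compact one-line branches are my own layout rather than the article's.

```sh
# Sketch only: prints the section number assigned to each sample tag.
section_numbers() {
  awk '
    /^<sect[1-4]/ {
        level = substr($1, 6, 1) + 0    # "<sect2" -> 2
        if (level == 1)      { ++ah; bh = ch = dh = 0; num = ah }
        else if (level == 2) { ++bh; ch = dh = 0;      num = ah "." bh }
        else if (level == 3) { ++ch; dh = 0;           num = ah "." bh "." ch }
        else                 { ++dh;                   num = ah "." bh "." ch "." dh }
        print num
    }'
}

printf '%s\n' '<sect1 id="a">' '<sect2 id="b">' '<sect2 id="c">' \
              '<sect3 id="d">' '<sect1 id="e">' '<sect2 id="f">' | section_numbers
```

The output is 1, 1.1, 1.2, 1.2.1, 2, 2.1, one per line. The last two lines show why the lower-level counters must be reset: without the reset, the B head under the second A head would come out as 2.3.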
The Second Pass
By using -v Debug=1 on the gawk command line, I could
debug the code that gathered old ids and built new ones. When Debug
is true, the program simply skips the second pass, reading through the file
and doing nothing. More debug code appears in the END rule, below:
Pass == 2 && Debug { next }
If not debugging, this next rule replaces the old ids in the various tags with
the new ones:
Pass == 2 && /^<(chapter|preface|appendix|glossary|sect[1-4]|table)/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    tagtype = gensub(/<(chapter|preface|appendix|glossary|sect[1-4]|table).*/, "\\1", 1, $0)
    printf "<%s id=\"%s\">\n", tagtype, tags[oldid]
    next
}
The following rule updates cross references. Cross-reference tags contain a
linkend="..." clause pointing to the id of the place
they reference. Since I knew that linkend= only appeared in cross
references, that was all I had to look for. The while loop handles
multiple cross references in a single line. The loop body works by splitting
apart the line into three pieces: the part before the linkend=,
the linkend clause itself, and the rest of the line after it. It
then builds up the output line by concatenating the preceding text with the
new linkend clause:
Pass == 2 && /linkend=/ {
    str = $0
    out = ""
    while (match(str, /linkend="[^"]+"/)) {
        before = substr(str, 1, RSTART - 1)
        xreftag = substr(str, RSTART, RLENGTH)
        after = substr(str, RSTART + RLENGTH)
        oldid = gensub(/linkend="([^"]+)"/, "\\1", 1, xreftag)
        # parentheses are required: "!" binds more tightly than "in"
        if (! (oldid in tags)) {
            printf("warning: xref to %s not in tags!\n", oldid) > "/dev/stderr"
            tags[oldid] = "UNKNOWNLINK"
        }
        out = out before "linkend=\"" tags[oldid] "\""
        str = after
    }
    if (str)
        out = out str
    print out
    next
}
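The loop is easier to follow on a line that carries several cross references. This sketch is a simplified stand-in for the rule above, not a copy of it: the two-entry tags table is hypothetical, substr() replaces gensub() so that any POSIX awk can run it, and unknown ids are mapped to UNKNOWNLINK without printing a warning.

```sh
# Sketch only: the tags table below is a made-up mapping.
relink() {
  awk '
    BEGIN { tags["old-a"] = "new-a"; tags["old-b"] = "new-b" }
    {
        str = $0; out = ""
        while (match(str, /linkend="[^"]+"/)) {
            oldid = substr(str, RSTART + 9, RLENGTH - 10)   # text between the quotes
            newid = (oldid in tags) ? tags[oldid] : "UNKNOWNLINK"
            # text before the clause, then the rewritten clause
            out = out substr(str, 1, RSTART - 1) "linkend=\"" newid "\""
            str = substr(str, RSTART + RLENGTH)             # keep scanning the rest
        }
        print out str
    }'
}

printf '%s\n' 'See <xref linkend="old-a"/> and <xref linkend="old-b"/>, or <xref linkend="nope"/>.' | relink
```

Each pass through the loop consumes the line up to and including one linkend clause, so the loop ends once the remainder holds no more clauses, and that remainder is appended unchanged.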
Finally, the last rule is the catch-all that prints out lines that don't need
updating:
Pass == 2 { print }
The END Rule
The END rule does simple cleanup. The abnormal variable
is true if the wrong number of arguments was provided. The if
statement tests it and exits immediately if it's true, avoiding execution of
the rest of the rule.
It turns out that the rest of the rule isn't that involved. It simply
dumps the table mapping of the old ids to the new ones if debugging is turned on:
END {
    if (abnormal)
        exit 1
    if (Debug) {
        for (i in tags)
            printf "%s -> %s\n", i, tags[i]
        exit
    }
}
Production and Post-Production
Once the new ids were in place, that was it. Since the O'Reilly DocBook tools work on separate per-chapter files, all that remained was to split the large file up into separate files, and then print them. I verified that everything went through their tools with no problems, and submitted the files to Production.
Production went quite quickly. A large part of this was due to the fact that copy editing had already been done on the Texinfo version. Usually it's done as part of the production cycle.
O'Reilly published the book, and I released gawk 3.1.0 at about
the same time. The gawk.texi shipped with gawk included
all of O'Reilly's editorial input.
It would seem that all ended happily, and for the most part it did, but one nontrivial problem remained.
A major aspect of book production done after the author submits his files is
indexing. While gawk.texi contained a number of index entries,
most of which I had provided, this served only as an initial basis upon which
to build. Indexing is a separate art, requiring training and experience to do
well, and I make no pretensions that I'm good at it.
Nancy Crumpton, a professional indexer, turned my amateur index into a real one. Also, during final production, there were the few, inevitable changes made to the text to fix gaffes in English grammar or to improve the book's layout.
I was thus left with a quandary. While the vast majority of O'Reilly's editorial input had been used to improve the Texinfo version of the book, there were now a number of new changes that existed only in the DocBook version. I really wanted to have those included in the Texinfo version as well.
The solution involved one more script and a fair amount of manual work. The
following script, desgml.awk, removes DocBook markup from a file,
leaving just the text. The BEGIN block sets up a table of translations
from DocBook entities to simple textual equivalents. (Some of these entities
are specific to Effective awk Programming.) The specials
array handles tags that must be special-cased (as opposed to entities):
#! /bin/awk -f
# needs gawk: uses a regexp RS, RT, and gensub()
BEGIN {
    entities["darkcorner"] = "(d.c.)"
    entities["TeX"] = "TeX"
    entities["LaTeX"] = "LaTeX"
    entities["BIBTeX"] = "BIBTeX"
    entities["vellip"] = "\n\t.\n\t.\n\t.\n"
    entities["hellip"] = "..."
    entities["lowbar"] = "_"
    entities["frac18"] = "1/8"
    entities["frac38"] = "3/8"
    # > 300 entities removed for brevity ...
    specials["<?lb?>"] = specials["<?lb>"] = " "
    specials["<keycap>"] = " "
    RS = "<[^>]+>"
    entity = "&[^;&]+;"
}
As in many of the other scripts seen so far, this one also uses RS
as a regular expression that matches tags, with the variable entity
holding the regular expression for an entity.
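For readers who have not seen this idiom, here is a small sketch of what a regexp RS does. It needs gawk, because RT (the text that actually matched RS) is a gawk extension. The sample line and the bracketed echo of xref tags are my own illustration and not part of desgml.awk.

```sh
# Sketch only; requires gawk because RT is a gawk extension.
strip_tags() {
  gawk '
    BEGIN { RS = "<[^>]+>" }       # every tag terminates a record
    {
        printf "%s", $0            # the text between two tags
        if (RT ~ /^<xref/)         # RT holds the tag that ended this record
            printf "[%s]", RT
    }'
}

printf '%s\n' '<para>See <xref linkend="intro"/> for <emphasis>details</emphasis>.</para>' | strip_tags
```

The tags never appear in $0. They are visible only through RT, which is why simply printing each record drops the markup, and why any tag that needs special treatment has to be examined via RT.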
The single rule processes records, looking for entities to replace. The first
part handles the simple case where there are no entities (match()
returns zero). In such a case, all that's necessary is to check the tag for
special cases:
{
    if (match($0, entity) == 0) {
        printf "%s", $0
        special_case()
        next
    }
The next part handles replacing entities, again using a loop to pull the line
apart around the text of the entity. If the entity exists in the table, it's
replaced. Otherwise it's used as-is, minus the & and ;
characters:
    # have a match
    text = $0
    out = ""
    do {
        front = substr(text, 1, RSTART - 1)
        object = substr(text, RSTART+1, RLENGTH-2) # strip & and ;
        rest = substr(text, RSTART + RLENGTH)
        if (object in entities)
            replace = entities[object]
        else
            replace = object
        out = out front replace
        text = rest
    } while (match(text, entity) != 0)
    if (length(text) > 0)
        out = out text
    printf("%s", out)
    special_case()
}
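The entity loop can be tried on its own, apart from the tag handling. The sketch below is a cut-down version under a few assumptions of mine: it reads ordinary lines instead of tag-delimited records, uses a while loop instead of the do-while (so no separate no-match case is needed), and carries only two entries from the entities table.

```sh
# Sketch only: a two-entry entities table and plain line-oriented input.
expand_entities() {
  awk '
    BEGIN {
        entities["hellip"] = "..."
        entities["lowbar"] = "_"
    }
    {
        text = $0; out = ""
        while (match(text, /&[^;&]+;/)) {
            name = substr(text, RSTART + 1, RLENGTH - 2)    # strip & and ;
            # known entities are translated, unknown ones keep their bare name
            out = out substr(text, 1, RSTART - 1) ((name in entities) ? entities[name] : name)
            text = substr(text, RSTART + RLENGTH)
        }
        print out text
    }'
}

printf '%s\n' 'foo&lowbar;bar &hellip; &unknown; done' | expand_entities
```

This prints "foo_bar ... unknown done": the two known entities are translated, and the unknown one is reduced to its name, just as the prose above describes.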
The special_case() function translates any special tags into white
space and handles cross references, replacing them with an upper-case version
of the id:
function special_case(    rt, ref)
{
    # a few special cases
    rt = tolower(RT)
    if (rt in specials) {
        printf "%s", specials[rt]
    } else if (rt ~ /<xref/) {
        ref = gensub(/<xref +linkend="([^"]*)".*>/, "\\1", 1, rt)
        ref = toupper(ref)
        printf "%s", ref
    }
}
I ran both my original XML files and O'Reilly's final XML files through the
desgml.awk script to produce text versions of each chapter. I then
used diff to produce a context diff of the chapters, and went through
each diff looking for indexing and wording changes. Each such change I then
added back into gawk.texi. This process was tedious and time-consuming,
and it took several weeks.
However, the end result is that gawk.texi is now once again the
"master version" of the documentation, and whenever work starts on the fourth
edition of Effective awk Programming, I expect to be able to generate
new DocBook XML files that still contain all the work that O'Reilly contributed.
Conclusion and Acknowledgements
Translating something the size of a whole book from Texinfo to DocBook was
certainly a challenge. Using gawk made the cleanup work fairly
straightforward, so I was able to concentrate on revising the contents of the
book without worrying too much about the production. Furthermore, the use of
Texinfo did not impede the book's production since O'Reilly received DocBook
XML files that went through their tool suite, and the distributed version of
the documentation benefited enormously from their input.
I would like to thank Philippe Martin for his original DocBook changes and Karl Berry, Texinfo's maintainer, for his help and support. Many thanks go to Chuck Toporek and the O'Reilly production staff. Working with them on Effective awk Programming really was a pleasure. Thanks to Nelson H.F. Beebe, Karl Berry, Len Muellner, and Jim Meyering as well as O'Reilly folk Betsy Waliszewski, Bruce Stewart, and Tara McGoldrick for reviewing preliminary drafts of this article.