Editor's note: Arnold Robbins has been an O'Reilly author for more than eight years, authoring or coauthoring some of its best-selling, most enduring titles, including the sixth edition of Learning the vi Editor, the third edition of Unix in a Nutshell: System V Edition, the second edition of sed & awk, and the second edition of Learning the Korn Shell.
In all that time, he's learned a thing or two about the O'Reilly book production process. But when he was ready to update Effective awk Programming he wanted to use the Texinfo markup language. O'Reilly production prefers authors use the DocBook markup language. Arnold compromised by agreeing to manage the conversion process for the final production of his book. In this article, he chronicles the challenges he faced translating his book from Texinfo to DocBook. The breadth of technical detail and the extensive code examples he provides here offer unique insight into one author's experience working with O'Reilly's book production department to create a book.
O'Reilly & Associates published the third edition of
Effective awk Programming in May 2001. The book provides thorough coverage of the awk programming language,
as standardized by the IEEE POSIX standard for portable operating
system applications. This standard is based on Unix and its utilities.
Effective awk Programming also doubles as the user's guide for GNU awk
(known as gawk), explaining the extensions and features that
are unique to gawk. It includes a wealth of sample programs and
library functions that demonstrate good awk programming style.
gawk is the standard version of awk on GNU/Linux
and most BSD-based systems. It is also popular on commercial Unix and
Windows systems because it has a number of useful extensions,
and because it can handle large data sets (records with hundreds or
thousands of fields, arrays with thousands of elements) that often cause
other implementations to give up. The third edition of Effective awk Programming describes the current version of gawk, 3.1.
The GNU project uses the Texinfo markup language for all of its documentation. Texinfo is a pleasant markup language in which to work. It is semantically driven: you mark up what something is, not how to print it; it allows easy nesting of different constructs; it is not as painful to type as HTML or DocBook XML; and it provides for translation into multiple output formats.
Printed documents may be generated directly from Texinfo input
files by using TeX. The Texinfo distribution includes the file
texinfo.tex, which is a set of TeX macros that directly
implement the Texinfo language, and scripts for running TeX.
Other output formats are generated by the makeinfo program,
which is a rather large and complicated C program that knows how to
produce GNU Info, HTML, and these days, DocBook XML.
The use of Texinfo for Effective awk Programming presented a problem for O'Reilly.
Their production process prefers the use of DocBook markup (particularly
the XML variant) since it may be used to produce both printed and
browsable versions of the same book. (Browsable versions are necessary
for the CD-ROM editions of their books, as well as for
the Safari Bookshelf.)
Furthermore, O'Reilly has a series design used for all their books:
the TeX output from texinfo.tex, while reasonable enough,
doesn't look anything like an O'Reilly book.
By the time of the initial discussions with O'Reilly, I had produced
four O'Reilly books in DocBook SGML, so I was quite comfortable with it.
And as the author of gawk.texi, I was also very comfortable
with Texinfo. Therefore, because both O'Reilly and I were committed to
getting Effective awk Programming published, I promised to manage the conversion from
Texinfo into DocBook for the final book production.
I reasoned that since makeinfo could already produce HTML,
and since HTML and DocBook are conceptually similar, it shouldn't
be that hard to modify the code to generate DocBook. I had
worked with the makeinfo source code in the past, so I wasn't
scared, even if I was a bit naive.
Delaying the conversion to DocBook until the end had two other related,
significant advantages. First, I was able to use the Texinfo version
for the technical review, incorporating all the changes from the review
into the documentation that would eventually ship with gawk.
And second, O'Reilly agreed to do their copy editing on a paper copy of
the Texinfo version of the manuscript. I then entered the copy edits
into the Texinfo source file, again allowing the distributed version to
benefit from O'Reilly's considerable editorial expertise.
(At this point I'd like to pause and acknowledge the significant contributions made by Chuck Toporek, my editor. His comments helped to enormously improve the organization and presentation of the material in the book. Mary Sheehan's copy edits were also very valuable. I learned a lot about good writing during the work on this book.)
Furthermore, Chuck and the rest of the people at O'Reilly bent
over backwards to make sure that they complied with the
GNU Free Documentation License
(FDL), under which the book is published. (The final DocBook XML source for the
book is available from the
O'Reilly Web site.
The Texinfo version, of course, is part of the gawk
distribution.)
Fortunately, I didn't have to write the DocBook changes for
makeinfo from scratch. Philippe Martin had done the
bulk of this already, and I was able to obtain his patches to the
makeinfo source code. His code did the vast majority of what
I needed.
Philippe's version generated DocBook SGML. At the time, O'Reilly
was moving away from SGML, towards the XML version of DocBook.
The differences boiled down mostly to using lowercase for tags,
always providing a full closing tag (<emph>whatever</emph> versus
<emph>whatever</>), using the trailing-slash version of tags
that don't enclose objects (such as <xref linkend="..."/>),
and fully quoting all the parameters inside of tags (<colspec
colnum="1"/> vs. <colspec colnum=1>).
Also, Philippe's code often generated a single DocBook tag for multiple
different Texinfo commands, when in fact DocBook has tags that correspond
to the original Texinfo commands. For example, it might produce
<literal> for both @command{} and @file{}.
This needed to be fixed, so that the generated output would contain
separate <command> and <filename> tags. In other words,
as much as possible, it was necessary to preserve the semantic-based
nature of the Texinfo markup in the generated DocBook.
This work was straightforward, and over a week or two, I did the bulk of
it, getting makeinfo to the point where it produced a basic
DocBook XML version of gawk.texi on which I could do further
post-processing.
The current release of Texinfo includes Philippe's original changes, as well
as my improvements. Philippe has gone further with the development, and besides
DocBook XML, makeinfo can produce a variant of XML that uses a
Texinfo DTD that is similar to the DocBook XML DTD. Indeed, most of the reformatting
fixes described below are no longer needed with the current version. For
further details, see the Texinfo
distribution.
Generating technically correct DocBook markup was just the beginning of the
process. While the file might go through an XML parser without any problems,
it would still need to be readable, so that O'Reilly's production editors could
work with it directly. It also needed to adhere to O'Reilly's markup conventions,
such as the id="..." parameter in <chapter>
and section tags, and in <xref> tags for cross references.
There was still a ways to go.
First, the makeinfo output needed lots of simple cleanups. Some
of these related to anomalies in the output, others to removing Texinfo-specific
output features which were better expressed using different fonts in DocBook.
The first script, fixup.awk, evolved to handle many of these. This
section presents the most interesting of the changes that had to be made.
makeinfo generated some boiler-plate material at the front of
the file that wasn't necessary for O'Reilly's DocBook tools. It looks like this:
<!-- This is /home/arnold/ORA/db/gawk.sgml, produced by makeinfo
version 4.0 from gawk.texi. --><para>
<!DOCTYPE book PUBLIC "-//Davenport//DTD DocBook V3.0//EN">
<book>
<title>The GNU Awk User's Guide</title>
</para>
Notice that the <para> and </para> tags
are misplaced. This early version of makeinfo was over-zealous
about wrapping things in paragraph tags. The first part of fixup.awk
strips off this leading junk. It works by having the first rule look for the
first <chapter> tag. When that's seen, it sets a flag. The
second rule checks the flag. If it hasn't been seen yet, the next
statement gets the next line of input:
#! /bin/gawk -f
# strip leading gunk from file
/<chapter/ { chapter_seen = 1 }
! chapter_seen { next }
The next bit removes trailing white space (space and TAB characters) and removes
leading white space inside lists and examples. The first rule uses the sub()
function to unconditionally remove trailing white space. (This is needed only
because I find such white space gets in the way when editing.)
The in_term variable indicates being inside the terms of a variable
list. Inside list item bodies or examples, the stripspaces
variable is true (non-zero), so the sub() function removes all
leading white space. The closing tags decrement stripspaces, returning it
to false (zero) once every enclosing list item or example has ended:
# strip trailing white space
/[ \t]+$/ { sub(/[ \t]+$/, "") }
# strip leading spaces inside lists
/<listitem>/ { stripspaces++ ; in_term = 0 }
/<\/listitem>/ { stripspaces-- }
# fix up examples
/<screen>/ { in_screen++ ; stripspaces++ }
stripspaces != 0 { sub(/^ +/, "") }
/<\/screen>/ { in_screen-- ; stripspaces-- }
The Texinfo command @var{} is used to describe something that
is variable, such as a value the user supplies. It corresponds to the
DocBook <replaceable> tag. In an O'Reilly book, <replaceable> items
get printed in a Constant Width Italic font. This is entirely
appropriate in most contexts, such as within examples, or in lists
where items represent a combination of a command and its parameters.
However, O'Reilly conventions indicate that variable items should be in regular italics when used in prose discussion. For example:
<!-- Correctly marked up DocBook XML -->
<variablelist>
  <varlistentry><term>
    <literal>ls -l</literal> <replaceable>file</replaceable>
  </term>
  <listitem><para>
    The <command>ls</command> with the <option>-l</option> gives
    extra information about <emphasis>file</emphasis>.
  </para></listitem>
  </varlistentry>
  ...
</variablelist>
The generated DocBook used <replaceable> everywhere. This next bit of code
makes the context-sensitive transformation for us:
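The listing for this step is not reproduced in this copy of the article. The following is only a minimal sketch of the idea, not the original code. It assumes that `<replaceable>` should be left alone inside `<screen>` examples and on lines belonging to a `<term>`, and rewritten to `<emphasis>` everywhere else. The function name `fix_replaceable` and its flag handling are hypothetical.

```shell
# Hypothetical helper (not the article's original code): rewrite
# <replaceable> as <emphasis> in running prose, leaving it untouched
# inside <screen> examples and <term> entries.
fix_replaceable() {
    awk '
    /<screen>/   { in_screen++ }      # entering an example
    /<term>/     { in_term = 1 }      # entering a list term
    ! in_screen && ! in_term {
        gsub(/<replaceable>/, "<emphasis>")
        gsub(/<\/replaceable>/, "</emphasis>")
    }
    /<\/screen>/ { in_screen-- }      # leaving an example
    /<\/term>/   { in_term = 0 }      # leaving a list term
    { print }
    '
}
```

The flags are cleared only after the rewriting rule has run, so a line that closes a term or an example is still treated as being inside it.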
O'Reilly books use a Constant Width Bold font to indicate user input in examples
and a plain Constant Width font for computer output. Texinfo only uses plain
Constant Width, distinguishing computer output with a leading glyph, in this
case, -|. (TeX output uses a similar, but nicer looking symbol.)
Error messages are prefixed with a different glyph that comes out in the DocBook
file as error-->. This next bit removes these glyphs. It also
supplies the <userinput> tags for any line whose first character
is either $ or &gt; (the > symbol).
These represent the Bourne shell primary and secondary prompts, respectively,
which are used in printed examples of interactive use:
in_screen != 0 {
    gsub(/-\| */, "");
    gsub(/error--> /, "");
    if (/^(\$|&gt;) /)
        $0 = gensub(/ (.+)/, " <userinput>\\1</userinput>", "g")
}
The gensub() ("general substitution") function is a gawk
extension. The first argument is the regular expression to match. The second
is the replacement text. The third is either a number indicating which match
of the text to replace, or "g", meaning that the change should
be done globally (on all matches). The fourth argument, if present, is the
string to search. When not supplied, the current input record ($0)
is used. The return value is the new text after the substitution has taken
place; the string being searched is not itself modified.
Here the replacement text includes \\1, which means "use the text
matched by the part of the regular expression enclosed in the first set of parentheses."
What this ends up doing is enclosing the command entered by the user in <userinput>
tags, leaving the rest of the line alone.
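As a small illustration of those arguments, here is a sketch that runs gensub() on a fixed string. It assumes gawk is installed under the name `gawk`; the function name `gensub_demo` and the sample string are hypothetical.

```shell
# Hypothetical demo of the gensub() arguments; requires gawk.
gensub_demo() {
    gawk '
    BEGIN {
        s = "a-b-c"
        print gensub(/-/, "+", 1, s)              # first match only
        print gensub(/-/, "+", 2, s)              # second match only
        print gensub(/-/, "+", "g", s)            # every match
        print gensub(/(.)-(.)/, "\\2-\\1", 1, s)  # backreferences swap a and b
        print s                                   # s itself is unchanged
    }
    '
}
```

Unlike sub() and gsub(), gensub() returns the changed text rather than modifying its target, which is why the scripts here assign its result back to $0.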
Texinfo doesn't have sidebars, which are blocks of text set off
to the side for separate, isolated discussion of issues. They are typically
used for more in depth discussion items or for longer examples. In gawk.texi,
I got around the lack of sidebars by using regular sections and adding the
words "Advanced Notes" to the section title. This next bit of code looks for
sections that have the words "Advanced Notes" in their titles and converts
them into sidebars. While it's at it, it removes all inline font changes from
the contents between <title> and </title>
tags, since such font changes are against O'Reilly conventions:
# deal with Advanced Notes, turn them into sidebars
/^<sect/ { save_sect = $0 ; next }
/<title>/ {
    if (/Advanced Notes/) {
        in_sidebar++
        print "<sidebar>"
        sub(/Advanced Notes: /, "")
    } else if (save_sect) {
        print save_sect
    }
    save_sect = ""
    # remove font changes from titles
    if (match($0, /<title>.+<\/title>/)) {
        before = substr($0, 1, RSTART - 1)
        text = substr($0, RSTART + 7, RLENGTH - 15)
        after = substr($0, RSTART + RLENGTH)
        gsub(/<[^>]+>/, "", text)
        print before "<title>" text "</title>" after
        next
    }
}
/<\/sect/ {
    if (in_sidebar) {
        print "</sidebar>"
        in_sidebar = 0
        next
    }
}
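The offsets in the title-cleaning code above follow from the tag lengths: `<title>` is 7 characters and `</title>` is 8, so the title text starts 7 characters into the match and is 15 characters shorter than it. The following sketch isolates just that step so it can be tried on a single line. The function name `strip_title_markup` is hypothetical, and only portable awk features are used.

```shell
# Hypothetical helper: remove inline tags from the text of a <title>.
strip_title_markup() {
    awk '
    match($0, /<title>.+<\/title>/) {
        before = substr($0, 1, RSTART - 1)
        # Skip the 7 characters of "<title>"; drop 7 + 8 = 15 for both tags.
        text   = substr($0, RSTART + 7, RLENGTH - 15)
        after  = substr($0, RSTART + RLENGTH)
        gsub(/<[^>]+>/, "", text)     # strip any tags inside the title
        $0 = before "<title>" text "</title>" after
    }
    { print }
    '
}
```

For instance, a line such as `<sect1><title>Using <command>gawk</command></title>` should come out with a plain `Using gawk` title.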
There are three different kinds of dashes used in typography. "Em-dashes" are
the width of the letter "m." "En-dashes" are the width of the letter "n,"
so they are shorter than em-dashes. Plain dashes, or hyphens, are the shortest
of all. The makeinfo output represents an em-dash as two dashes.
This last chunk turns them into the &mdash; DocBook entity.
This change is not done inside examples (! in_screen). The very
last rule simply prints the (possibly modified) input record to the output:
/([a-z]|(<\/[a-z]+>))--[a-z]/ && ! in_screen {
    $0 = gensub(/([a-z]|(<\/[a-z]+>)?)--([a-z])/, "\\1\\&mdash;\\3", "g", $0)
}
{ print }
As mentioned earlier, the early DocBook version of makeinfo generated
lots of unnecessary <para> tags. The output had numerous
empty paragraphs, and removing them by hand was just too painful. The following
simple script, rmpara.awk strips out empty paragraphs.
This script works by taking advantage of gawk's ability to specify
a regular expression as the record separator. Here, records are separated by
the markup for empty paragraphs. By setting the output record separator to the
null string (ORS = ""), a print statement prints the
preceding part of the file.
#! /usr/local/bin/gawk -f
BEGIN {
RS = "<para>[ \t\n]+</para>\n*"
ORS = ""
}
And since we're working with paragraph tags, the following small rule puts
<para> tags inside lists and index entries on their own lines.
This makes the DocBook file easier to work with. The final rule simply prints
the record, which is all text in the file up to an empty paragraph:
/(indexterm|variablelist)><para>/ {
sub(/<para>/, "\n&")
}
{ print }
A significant problem, requiring a separate script, had to do with the formatting
of tables. The Texinfo @multitable ... @end multitable
translates pretty directly into a DocBook <table>. However,
the formatting of the output, while fine for machine processing, was essentially
impossible for a human to work with directly. For example:
<para>
<table> <title></title> <tgroup cols="2"><colspec colwidth="31*">
<colspec colwidth="49*"> <tbody> <row>
<entry><literal>[:alnum:]</literal> </entry>
<entry> Alphanumeric characters. </entry> </row><row> <entry>
<literal>[:alpha:]</literal> </entry> <entry> Alphabetic characters.
</entry> </row><row> <entry><literal>[:blank:]</literal>
</entry> <entry> Space and tab characters. </entry> </row><row>
<entry> <literal>[:cntrl:]</literal> </entry> <entry> Control
characters. </entry> </row></tbody> </tgroup> </table>
</para>
Each row in a table should be separate, and each entry (column) in a row should
have its own line (or lines). For this, I wrote the next script, fixtable.awk.
It is similar to the rmpara.awk script, in that it uses a regular
expression for RS. This time the regular expression matches DocBook
tags. Thus the record is all text up to a tag, and the record separator is the
tag itself plus any trailing white space.
The associative array tab (for "table") contains all the table-related
tags that should be on their own lines. The <colspec> tag
contains parameters, thus it does not have the closing > character
in it:
#! /bin/gawk -f
BEGIN {
RS = "<[^>]+> *"
tab["<table>"] = 1
tab["<colspec"] = 1
tab["<tbody>"] = 1
tab["<tgroup"] = 1
tab["</tgroup>"] = 1
tab["</tbody>"] = 1
tab["<row>"] = 1
tab["</row>"] = 1
}
gawk sets the variable RT (record terminator) to
the actual text that matched the RS regular expression. Any trailing
white space in RT is saved in the variable white,
and then removed from RT. This is necessary in case the tag in
RT isn't one for tables. Then the white space has to be put back
into the output to preserve the original file's contents:
{
# remove trailing white
# gensub returns the original string if the re doesn't match
if (RT ~ / +$/)
white = gensub(/.*>( +$)/, "\\1", 1, RT)
else
white = ""
sub(/ +$/, "", RT)
This next part does the work. It splits RT around white space.
(This is necessary for the <colspec> tag.) If the tag is
in the tab array, we print the preceding record, a newline, and then the whole tag
on its own line. <entry> tags are printed at the start of a new line.
Finally, any other tags are printed together with the preceding record, without
intervening newlines, and with the original trailing white space:
split(RT, a, " ")
if (a[1] in tab)
printf ("%s\n%s\n", $0, RT)
else if (a[1] == "<entry>")
printf ("%s\n%s", $0, RT)
else
printf ("%s%s", $0, RT white)
}
The result of running this script on the above input is:
<para>
<table>
<title></title>
<tgroup cols="2">
<colspec colwidth="31*">
<colspec colwidth="49*">
<tbody>
<row>
<entry><literal>[:alnum:]</literal> </entry>
<entry>Alphanumeric
characters. </entry>
</row>
<row>
<entry><literal>[:alpha:]</literal> </entry>
<entry>Alphabetic
characters. </entry>
</row>
<row>
<entry><literal>[:blank:]</literal> </entry>
<entry>Space and
tab characters. </entry>
</row>
<row>
<entry><literal>[:cntrl:]</literal> </entry>
<entry>Control
characters. </entry>
</row>
</tbody>
</tgroup>
</table>
</para>
Although there are still extra newlines, at least now the table is readable, and further manual cleaning up isn't difficult.
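To see how RS and RT divide the input in a script like fixtable.awk, it can help to print them side by side. This is a minimal sketch that assumes gawk is available as `gawk`, since RT is a gawk extension; the function name `show_rs_rt` is hypothetical.

```shell
# Hypothetical demo: show each record and the tag text that ended it.
show_rs_rt() {
    gawk '
    BEGIN { RS = "<[^>]+> *" }      # a tag plus trailing spaces ends a record
    { printf "record=[%s] RT=[%s]\n", $0, RT }
    '
}
```

Feeding it a fragment such as `<row> <entry>Text</entry>tail` shows that the text between tags arrives in $0, the tag itself (with any trailing spaces) arrives in RT, and RT is empty for text that runs to the end of the input.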
The next task was to work on the indexing entries. The original gawk.texi
file already had a number of index entries that I had placed there. makeinfo
translated them into DocBook <indexterm> entries, but they
still needed some work. For example, occasionally additional material appeared
on the same line as the closing </indexterm> tag. More importantly,
special characters in the text of an index entry, such as <
and >, were not turned into &lt; and &gt;
in the generated DocBook. Also, O'Reilly's convention is to not have any font
changes in the contents of an index entry. The fixindex.awk script
dealt with all of these. The first part handles splitting off any trailing text:
#! /bin/gawk -f
# <indexterm> always comes at the beginning of a line.
# 1. If there's anything after the </indexterm>, insert a newline
# 2. Remove markup in the indexed items
/<indexterm>/ {
if (match($0, /<\/indexterm>./)) {
front = substr($0, 1, RSTART + 11);
rest = substr($0, RSTART + RLENGTH - 1)
} else {
front = $0
rest = ""
}
If the text of the index entry has font changes in it, the next part extracts
the contents of the entry, removes the font changes, and then puts the tags
back in:
if (match(front, /<(literal|command|filename)>/)) {
text = gensub(/<indexterm>(.+)<\/indexterm>/, "\\1", 1, front)
gsub(/<\/?(literal|command|filename)>/, "", text)
front = "<indexterm>" text "</indexterm>"
}
Looking at this now, sometime later, I see that the removal and restoration
of the <indexterm> tags isn't necessary. Nevertheless, I
leave it here to show the code as I wrote it then.
The rest of the rule deals with index entries for the <, <=,
>, and >= operators, converting them into the
appropriate DocBook entities. Finally, it prints the modified line and any trailing
material that may have been present, and then gets the next input line with
next. The final rule simply prints lines that aren't indexing lines:
    gsub(/><=/, ">\\&lt;=", front)
    gsub(/>< /, ">\\&lt; ", front)
    gsub(/>>=/, ">\\&gt;=", front)
    gsub(/>> /, ">\\&gt; ", front)
    print front
    if (rest)
        print rest
    next
}
{ print }
As you may have noticed, the scripts have been progressing from larger-scope fixes to smaller-scope fixes. This next script deals with a fine-grained, typographical detail.
In the Italic font O'Reilly uses to represent options, the correct character
to use for a hyphen or dash is the en-dash, discussed earlier. This is represented
by the DocBook &ndash; entity. Furthermore, gawk's
long options start with two dashes, not one. In both the Italic font in the
text and in the Roman font in the index, the two dashes run together when printed,
making them difficult to distinguish.
This next script solves both problems. It converts plain dash characters to
&ndash;, and inserts an &hairsp; character
between two en-dashes. The &hairsp; is a very small amount
of horizontal spacing whose job is to provide just such tiny amounts of separation
between characters. This script also works by setting RS to a regular
expression matching the text of interest, modifying the matched text saved in RT,
and then printing the record and new text back out.
The <primary> and <secondary> tags only
appear inside <indexterm> tags. The <option>
tags delimit options in the book's main text:
#! /bin/awk -f
BEGIN {
    RS = "<(primary|secondary|option)>-(-|[A-Za-z])+"
}
{
    if (RT != "") {
        new = RT
        new = gensub(/--/, "\\&ndash;\\&hairsp;\\&ndash;", "g", new)
        new = gensub(/-/, "\\&ndash;", "g", new)
    } else
        new = ""
    printf("%s%s", $0, new)
}
After going through all the above scripts, the book was almost ready for prime
time. All my scripts had produced a DocBook XML document that was quite close
to what I would have produced had I been entering the text directly in DocBook.
It took considerably less effort than if I had tried to convert the text from Texinfo
to DocBook using either the sed stream editor, or manually, using
editor commands (the colon prompt within vim).
Nevertheless, my Notes file lists a fair number of manual changes
that I had to make, things that weren't amenable to scripting. Most of these,
though, could be tackled using the vim command line. (Believe me, if I could have fixed these with a script too, I would have. But sometimes there are things that a program just isn't quite smart enough to handle.)
After all of these changes, I was at the final stage. In fact, this was during the technical review stage, and for a brief while before submitting the book to O'Reilly's Production department, I was making edits in parallel, in both the Texinfo and the DocBook versions of the book. The main reason for this was to avoid having to remake all the manual edits. It was easier to make a few incremental changes in parallel than to just edit the Texinfo file, regenerate DocBook, and then have to redo all the manual edits.
One final transformation was needed before submitting the book to Production.
O'Reilly has a standard convention for naming chapters, sections, tables, and
figures within the id="..." clause of the appropriate
tags. For example, <sect2 id="eap3-ch-3-sect-2.1">. These
same identifiers are used in <xref> tags for cross references.
However, makeinfo produced identifiers based on the original names
of the @node lines in the gawk.texi file. For example,
<sect1 id="How20To20Contribute">. (Here, the spaces in the
original node name are replaced by 20, which is the numeric value
of the space character, in hexadecimal.) I needed to transform these generated
identifiers into ones that followed the O'Reilly convention.
The following script, redoids.awk (re-do ids), does this job.
It makes two passes over the input. The first pass extracts the existing ids
from chapter, section, and table tags. It maintains the appropriate chapter and
section level counts, and by using them, generates the correct new tag for the
given item. The first pass builds up a table (an associative array),
mapping the old ids to the new ones.
The second pass goes through the file, actually making the substitutions of new id for old. It can't be done all in one pass since there are cross references, both forwards and backwards, scattered throughout the text.
The BEGIN block checks that exactly one argument was given, and
prints an error message if not. It then sets some global variables, namely,
the book name and IGNORECASE, which causes gawk to
ignore case when doing regular expression matching:
#! /bin/gawk -f
BEGIN {
if (ARGC != 2) {
print("usage: redoids file > newfile\n") > "/dev/stderr"
abnormal = 1
exit 1
}
book = "eap3"
IGNORECASE = 1
This next part actually sets up two passes over the input. It first initializes
Pass to 1. Next, it adds a variable assignment, Pass=2, to ARGV, and then the
input filename, and increments ARGC.
The upshot is that gawk reads through the file twice, with the
variable Pass being set appropriately each time through. The code
for the two passes then distinguishes which pass is which by testing Pass:
# set up two passes
Pass = 1
ARGV[ARGC++] = "Pass=2"
ARGV[ARGC++] = ARGV[1]
}
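The same ARGV trick works in any awk, not just gawk. Here is a minimal sketch, separate from redoids.awk, that counts lines on the first pass and uses the total on the second. The function name `two_pass_demo` and its output format are hypothetical.

```shell
# Hypothetical demo: print each line with its position out of the total.
two_pass_demo() {
    awk '
    BEGIN {
        if (ARGC != 2) {
            print "usage: two_pass_demo file" > "/dev/stderr"
            exit 1
        }
        Pass = 1
        ARGV[ARGC++] = "Pass=2"     # assignment takes effect between the reads
        ARGV[ARGC++] = ARGV[1]      # read the same file a second time
    }
    Pass == 1 { total++ }                               # first pass: gather
    Pass == 2 { printf "%d/%d %s\n", FNR, total, $0 }   # second pass: use
    ' "$1"
}
```

Because awk processes a `var=value` argument only when it reaches it in the argument list, Pass stays at 1 for the whole first read of the file and changes to 2 just before the second.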
Top level section headings within a chapter are often referred to in publishing
as "A-level headings," or just "A heads" for short. Similarly, the next level
section headings are "B heads," "C heads," and so on. The variables ah,
bh, ch, and dh, represent heading levels.
At each level, the variable for the levels below it must be set to zero. The
variable tab represents the current table number within a chapter.
The chnum variable tracks the current chapter. Thus, this first
rule sets all the variables to zero, extracts the current id, and computes a
new one:
Pass == 1 && /^<chapter/ {
ah = bh = ch = dh = tab = 0
oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
curchap = sprintf("ch-%d", ++chnum)
newtag = sprintf("%s-%s", book, curchap)
tags[oldid] = newtag
}
The next few rules are similar, and handle chapter-level items that aren't
actually chapters:
Pass == 1 && /^<preface/ {
ah = bh = ch = dh = tab = 0
oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
curchap = "ch-0"
newtag = sprintf("%s-%s", book, curchap)
tags[oldid] = newtag
}
Pass == 1 && /^<appendix/ {
ah = bh = ch = dh = tab = 0
oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
applet = substr("abcdefghijklmnopqrstuvwxyz", ++appnum, 1)
curchap = sprintf("ap-%s", applet)
newtag = sprintf("%s-%s", book, curchap)
tags[oldid] = newtag
}
Pass == 1 && /^<glossary/ {
ah = bh = ch = dh = tab = 0
oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
curchap = "glossary"
newtag = sprintf("%s-%s", book, curchap)
tags[oldid] = newtag
}
Next comes code that deals with section tags. The first rule handles a special
case. Two of the appendixes in Effective awk Programming are the
GNU General Public License
(GPL), which covers the gawk source code and the GNU Free
Documentation License (FDL), which covers the book itself. The sections
in these appendixes don't have ids, nor do they need them. The first rule skips
them.
The second rule does much of the real work. It extracts the old id, and then it extracts the level of the section (1, 2, 3, etc.). Based on the level, it resets the lower-level heading variables and sets up the new id.
The third rule handles tables. Table numbers increase monotonically through
the whole chapter and have two-digit numbers:
Pass == 1 && /<sect[1-4]>/ { next } # skip licenses
Pass == 1 && /^<sect[1-4]/ {
oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
level = substr($1, 6, 1) + 0 # get level
if (level == 1) {
++ah
sectnum = ah
bh = ch = dh = 0
} else if (level == 2) {
++bh
sectnum = ah "." bh
ch = dh = 0
} else if (level == 3) {
++ch
sectnum = ah "." bh "." ch
dh = 0
} else {
++dh
sectnum = ah "." bh "." ch "." dh
}
newtag = sprintf("%s-%s-sect-%s", book, curchap, sectnum)
tags[oldid] = newtag
}
Pass == 1 && /^<table/ {
oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
newtag = sprintf("%s-%s-tab-%02d", book, curchap, ++tab)
tags[oldid] = newtag
}
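To make the numbering scheme concrete, here is a cut-down sketch that handles only chapters and the first two section levels, and prints the new ids instead of storing them. It is an illustration under those simplifying assumptions, not a replacement for the rules above; the function name `demo_ids` is hypothetical.

```shell
# Hypothetical demo: generate O'Reilly-style ids for a small outline.
demo_ids() {
    awk -v book="eap3" '
    /^<chapter/ {                       # new chapter: reset section counters
        ah = bh = 0
        curchap = "ch-" (++chnum)
        print book "-" curchap
    }
    /^<sect1/ { ++ah; bh = 0; print book "-" curchap "-sect-" ah }
    /^<sect2/ { ++bh;         print book "-" curchap "-sect-" ah "." bh }
    '
}
```

Given a chapter containing a sect1 with two sect2 children, the sect2 ids should end in `sect-1.1` and `sect-1.2`, and the B-head counter starts over at the next sect1.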
By using -v Debug=1 on the gawk command line, I could
debug the code that gathered old ids and built new ones. When debugging
is enabled, the program simply skips the second pass, by reading through the file
and doing nothing. More debug code appears in the END rule, below:
Pass == 2 && Debug { next }
If not debugging, this next rule is what replaces old ids in various tags with
the new one:
Pass == 2 && /^<(chapter|preface|appendix|glossary|sect[1-4]|table)/ {
oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
tagtype = gensub(/<(chapter|preface|appendix|glossary|sect[1-4]|table).*/, "\\1", 1, $0)
printf "<%s id=\"%s\">\n", tagtype, tags[oldid]
next
}
The following rule updates cross references. Cross-reference tags contain a
linkend="..." clause pointing to the id of the place
they reference. Since I knew that linkend= only appeared in cross
references, that was all I had to look for. The while loop handles
multiple cross references in a single line. The loop body works by splitting
apart the line into three pieces: the part before the linkend=,
the linkend clause itself, and the rest of the line after it. It
then builds up the output line by concatenating the preceding text with the
new linkend clause:
Pass == 2 && /linkend=/ {
    str = $0
    out = ""
    while (match(str, /linkend="[^"]+"/)) {
        before = substr(str, 1, RSTART - 1)
        xreftag = substr(str, RSTART, RLENGTH)
        after = substr(str, RSTART + RLENGTH)
        oldid = gensub(/linkend="([^"]+)"/, "\\1", 1, xreftag)
        # parentheses are required: ! binds more tightly than in
        if (! (oldid in tags)) {
            printf("warning: xref to %s not in tags!\n", oldid) > "/dev/stderr"
            tags[oldid] = "UNKNOWNLINK"
        }
        out = out before "linkend=\"" tags[oldid] "\""
        str = after
    }
    if (str)
        out = out str
    print out
    next
}
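A note on the membership test in this loop: awk's unary ! binds more tightly than the in operator, so the lookup has to be parenthesized for the negation to apply to it. The following sketch shows the difference using portable awk; the function name `in_precedence_demo` and the sample ids are hypothetical.

```shell
# Hypothetical demo: negating an "in" test with and without parentheses.
in_precedence_demo() {
    awk '
    BEGIN {
        tags["known"] = "eap3-ch-1"
        oldid = "missing"
        # Unparenthesized: parsed as (!oldid) in tags, that is, 0 in tags.
        if (! oldid in tags)
            print "unparenthesized: reported missing"
        else
            print "unparenthesized: NOT reported"
        # Parenthesized: negates the membership test itself.
        if (! (oldid in tags))
            print "parenthesized: reported missing"
        else
            print "parenthesized: NOT reported"
    }
    '
}
```

Without the parentheses, the warning for an unknown cross reference is not reached for a non-empty id, because the test becomes a lookup of the index 0.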
Finally, the last rule is the catch-all that prints out lines that don't need
updating:
Pass == 2 { print }
The END rule does simple cleanup. The abnormal variable
is true if the wrong number of arguments were provided. The if
statement tests it and exits immediately if it's true, avoiding execution of
the rest of the rule.
It turns out that the rest of the rule isn't that involved. It simply
dumps the table mapping of the old ids to the new ones if debugging is turned on:
END {
if (abnormal)
exit 1
if (Debug) {
for (i in tags)
printf "%s -> %s\n", i, tags[i]
exit
}
}
Once the new ids were in place, that was it. Since the O'Reilly DocBook tools work on separate per-chapter files, all that remained was to split the large file up into separate files, and then print them. I verified that everything went through their tools with no problems, and submitted the files to Production.
Production went quite quickly. A large part of this was due to the fact that copy editing had already been done on the Texinfo version. Usually it's done as part of the production cycle.
O'Reilly published the book, and I released gawk 3.1.0 at about
the same time. The gawk.texi shipped with gawk included
all of O'Reilly's editorial input.
It would seem that all ended happily. Alas, this was mostly true, but one non-trivial problem remained.
A major aspect of book production done after the author submits his files is
indexing. While gawk.texi contained a number of index entries,
most of which I had provided, this served only as an initial basis upon which
to build. Indexing is a separate art, requiring training and experience to do
well, and I make no pretensions that I'm good at it.
Nancy Crumpton, a professional indexer, turned my amateur index into a real one. Also, during final production, there were the few, inevitable changes made to the text to fix gaffes in English grammar or to improve the book's layout.
I was thus left with a quandary. While the vast majority of O'Reilly's editorial input had been used to improve the Texinfo version of the book, there were now a number of new changes that existed only in the DocBook version. I really wanted to have those included in the Texinfo version as well.
The solution involved one more script and a fair amount of manual work. The
following script, desgml.awk, removes DocBook markup from a file,
leaving just the text. The BEGIN block sets up a table of translations
from DocBook entities to simple textual equivalents. (Some of these entities
are specific to Effective awk Programming.) The specials
array handles tags that must be special-cased (as opposed to entities):
#! /bin/awk -f
BEGIN {
    entities["darkcorner"] = "(d.c.)"
    entities["TeX"] = "TeX"
    entities["LaTeX"] = "LaTeX"
    entities["BIBTeX"] = "BIBTeX"
    entities["vellip"] = "\n\t.\n\t.\n\t.\n"
    entities["hellip"] = "..."
    entities["lowbar"] = "_"
    entities["frac18"] = "1/8"
    entities["frac38"] = "3/8"
    # > 300 entities removed for brevity ...
    specials["<?lb?>"] = specials["<?lb>"] = " "
    specials["<keycap>"] = " "
    RS = "<[^>]+>"
    entity = "&[^;&]+;"
}
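As in many of the other scripts seen so far, this one sets RS
to a regular expression that matches tags. The variable entity
holds the regular expression that matches an entity. Because RS matches the
tags themselves, each record is the text between two tags, and gawk stores the
tag that ended the record in RT.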
The single rule processes records, looking for entities to replace. The first
part handles the simple case where there are no entities (match()
returns zero). In such a case, all that's necessary is to check the tag for
special cases:
{
    if (match($0, entity) == 0) {
        printf "%s", $0
        special_case()
        next
    }
The next part handles replacing entities, again using a loop to pull the line
apart around the text of the entity. If the entity exists in the table, it's
replaced. Otherwise it's used as-is, minus the & and ;
characters:
    # have a match
    text = $0
    out = ""
    do {
        front = substr(text, 1, RSTART - 1)
        object = substr(text, RSTART+1, RLENGTH-2)  # strip & and ;
        rest = substr(text, RSTART + RLENGTH)
        if (object in entities)
            replace = entities[object]
        else
            replace = object
        out = out front replace
        text = rest
    } while (match(text, entity) != 0)
    if (length(text) > 0)
        out = out text
    printf("%s", out)
    special_case()
}
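To see the replacement loop in isolation, here is a small sketch that applies the same match() and substr() technique to a single line of text. It uses a two-entry table and a made-up input line, and it is written as a plain while loop rather than the do-while form above, so it should not be read as the script itself.

```shell
result=$(printf '%s\n' 'Use &lowbar; and &hellip; but keep &unknown; text' |
awk '
BEGIN {
    entities["lowbar"] = "_"
    entities["hellip"] = "..."
    entity = "&[^;&]+;"
}
{
    text = $0
    out = ""
    while (match(text, entity) != 0) {
        object = substr(text, RSTART + 1, RLENGTH - 2)    # strip & and ;
        if (object in entities)
            replace = entities[object]
        else
            replace = object                # unknown entity: keep its name
        out = out substr(text, 1, RSTART - 1) replace
        text = substr(text, RSTART + RLENGTH)
    }
    print out text                          # append whatever follows the last entity
}')
printf '%s\n' "$result"
```

This prints "Use _ and ... but keep unknown text": the two known entities are replaced from the table, and the unknown one is reduced to its bare name.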
The special_case() function translates any special tags into white
space and handles cross references, replacing them with an upper-case version
of the id:
function special_case(    rt, ref)
{
    # a few special cases
    rt = tolower(RT)
    if (rt in specials) {
        printf "%s", specials[rt]
    } else if (rt ~ /<xref/) {
        ref = gensub(/<xref +linkend="([^"]*)".*>/,"\\1", 1, rt)
        ref = toupper(ref)
        printf "%s", ref
    }
}
I ran both my original XML files and O'Reilly's final XML files through the
desgml.awk script to produce text versions of each chapter. I then
used diff to produce a context diff of the chapters, and went through
each diff looking for indexing and wording changes. I then added each such
change back into gawk.texi. This process took several weeks, as it was
tedious and time-consuming.
However, the end result is that gawk.texi is now once again the
"master version" of the documentation, and whenever work starts on the fourth
edition of Effective awk Programming, I expect to be able to generate
new DocBook XML files that still contain all the work that O'Reilly contributed.
Translating something the size of a whole book from Texinfo to DocBook was
certainly a challenge. Using gawk made the cleanup work fairly
straightforward, so I was able to concentrate on revising the contents of the
book without worrying too much about the production. Furthermore, the use of
Texinfo did not impede the book's production since O'Reilly received DocBook
XML files that went through their tool suite, and the distributed version of
the documentation benefited enormously from their input.
I would like to thank Philippe Martin for his original DocBook changes and Karl Berry, Texinfo's maintainer, for his help and support. Many thanks go to Chuck Toporek and the O'Reilly production staff. Working with them on Effective awk Programming really was a pleasure. Thanks to Nelson H.F. Beebe, Karl Berry, Len Muellner, and Jim Meyering as well as O'Reilly folk Betsy Waliszewski, Bruce Stewart, and Tara McGoldrick for reviewing preliminary drafts of this article.
Copyright © 2007 O'Reilly Media, Inc.