The Making of Effective awk Programming
O'Reilly books use a Constant Width Bold font to indicate user input in examples
and a plain Constant Width font for computer output. Texinfo uses only plain
Constant Width, distinguishing computer output with a leading glyph, in this
case, -|. (TeX output uses a similar but nicer-looking symbol.)
Error messages are prefixed with a different glyph, which comes out in the DocBook
file as error-->. This next bit removes these glyphs. It also
supplies the <userinput> tags for any line that begins
with either $ or &gt; (the > symbol), followed by a space.
These represent the Bourne shell primary and secondary prompts, respectively,
which are used in printed examples of interactive use:
in_screen != 0 {
    gsub(/-\| */, "");
    gsub(/error--> /, "");
    if (/^(\$|&gt;) /)
        $0 = gensub(/ (.+)/, " <userinput>\\1</userinput>", "g")
}
The gensub() ("general substitution") function is a gawk
extension. The first argument is the regular expression to match. The second
is the replacement text. The third is either a number indicating which match
of the text to replace, or "g", meaning that the change should
be done globally (on all matches). The fourth argument, if present, is the
string to search. When it is not supplied, the current input record ($0)
is used. The return value is the new text after the substitution has taken
place; the original string itself is not modified.
Here the replacement text includes \\1, which means "use the text
matched by the part of the regular expression enclosed in the first set of parentheses."
What this ends up doing is enclosing the command entered by the user in <userinput>
tags, leaving the rest of the line alone.
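As a minimal sketch of how the gensub() arguments behave, the following gawk program first tries the third argument on a made-up string, then applies the same substitution as the rule above. The function name tag_input() is hypothetical and is not part of the original script. It accepts the secondary prompt either as a literal > or as the entity &gt;, since which form appears in a given DocBook file is an assumption here.

```awk
# tag_input() is a hypothetical wrapper, not part of the original script.
# Assumption: the secondary prompt may appear as > or as the entity &gt;.
function tag_input(line) {
    if (line ~ /^(\$|>|&gt;) /)
        return gensub(/ (.+)/, " <userinput>\\1</userinput>", "g", line)
    return line
}

BEGIN {
    s = "a-b-c"
    print gensub(/-/, "+", "g", s)   # a+b+c  (every match)
    print gensub(/-/, "+", 2, s)     # a-b+c  (second match only)
    print s                          # a-b-c  (the target is not modified)

    # .+ is greedy, so the single match runs from the first space to the
    # end of the line, and "g" finds only that one match to replace
    print tag_input("$ gawk -f prog.awk data")
    print tag_input("some program output")
}
```

The last two lines print `$ <userinput>gawk -f prog.awk data</userinput>` and the untouched `some program output`, since only lines that start with a prompt are changed.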
Texinfo doesn't have sidebars, which are blocks of text set off
to the side for separate, isolated discussion of issues. They are typically
used for more in-depth discussion or for longer examples. In gawk.texi,
I got around the lack of sidebars by using regular sections and adding the
words "Advanced Notes" to the section title. This next bit of code looks for
sections that have the words "Advanced Notes" in their titles and converts
them into sidebars. While it's at it, it removes all inline font changes from
the text between the <title> and </title>
tags, since such font changes are against O'Reilly conventions:
# deal with Advanced Notes, turn them into sidebars
/^<sect/ { save_sect = $0 ; next }
/<title>/ {
    if (/Advanced Notes/) {
        in_sidebar++
        print "<sidebar>"
        sub(/Advanced Notes: /, "")
    } else if (save_sect) {
        print save_sect
    }
    save_sect = ""
    # remove font changes from titles
    if (match($0, /<title>.+<\/title>/)) {
        before = substr($0, 1, RSTART - 1)
        text = substr($0, RSTART + 7, RLENGTH - 15)
        after = substr($0, RSTART + RLENGTH)
        gsub(/<[^>]+>/, "", text)
        print before "<title>" text "</title>" after
        next
    }
}
/<\/sect/ {
    if (in_sidebar) {
        print "</sidebar>"
        in_sidebar = 0
        next
    }
}
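The title-cleaning step can be tried on its own. The sketch below wraps the same match() and substr() arithmetic in a function; the name clean_title() and the sample line are made up for illustration. The constants come from the tag lengths: "<title>" is 7 characters and "</title>" is 8, so the text between them is RLENGTH - 15 characters long.

```awk
# clean_title() is a hypothetical name; the body follows the rule above.
function clean_title(line,    before, text, after) {
    if (! match(line, /<title>.+<\/title>/))
        return line                       # no complete title on this line
    before = substr(line, 1, RSTART - 1)
    # skip the 7 characters of "<title>"; drop those plus the 8 of "</title>"
    text = substr(line, RSTART + 7, RLENGTH - 15)
    after = substr(line, RSTART + RLENGTH)
    gsub(/<[^>]+>/, "", text)             # remove any tags inside the title
    return before "<title>" text "</title>" after
}

BEGIN {
    print clean_title("<sect2><title>Using <command>gawk</command></title>")
}
```

This prints `<sect2><title>Using gawk</title>`: the inline <command> markup is gone, while the text before and after the title is left as it was.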
There are three different kinds of dashes used in typography. "Em-dashes" are
the width of the letter "m." "En-dashes" are the width of the letter "n,"
so they are shorter than em-dashes. Plain dashes, or hyphens, are the shortest
of all. The makeinfo output represents an em-dash as two dashes.
This last chunk turns them into the &mdash; DocBook entity.
This change is not done inside examples (! in_screen). The very
last rule simply prints the (possibly modified) input record to the output:
/([a-z]|(<\/[a-z]+>))--[a-z]/ && ! in_screen {
    $0 = gensub(/([a-z]|(<\/[a-z]+>)?)--([a-z])/, "\\1\\&mdash;\\3", "g", $0)
}
{ print }
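The dash substitution can be checked in isolation with a small sketch like the one below. The function name emdash() and the sample sentence are hypothetical. In a gensub() replacement string a bare & stands for the matched text, so the literal ampersand of the entity has to be written as \\&.

```awk
# emdash() is a hypothetical wrapper around the gensub() call shown above.
function emdash(line) {
    # \\1 is the letter or closing tag before the dashes, \\3 the letter after
    return gensub(/([a-z]|(<\/[a-z]+>)?)--([a-z])/, "\\1\\&mdash;\\3", "g", line)
}

BEGIN {
    print emdash("records--lines, by default--are read one at a time")
}
```

This prints `records&mdash;lines, by default&mdash;are read one at a time`. Both pairs of dashes are converted because the third argument is the string "g".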
As mentioned earlier, the early DocBook version of makeinfo generated
lots of unnecessary <para> tags. The output had numerous
empty paragraphs, and removing them by hand was just too painful. The following
simple script, rmpara.awk, strips out empty paragraphs.
This script works by taking advantage of gawk's ability to specify
a regular expression as the record separator. Here, records are separated by
the markup for empty paragraphs. Because the output record separator is set to the
null string (ORS = ""), each print statement writes only the text that
preceded an empty paragraph, and the empty paragraph itself is never copied to the output.
#! /usr/local/bin/gawk -f
BEGIN {
    RS = "<para>[ \t\n]+</para>\n*"
    ORS = ""
}
And since we're working with paragraph tags, the following small rule puts
<para> tags inside lists and index entries on their own lines.
This makes the DocBook file easier to work with. The final rule simply prints
the record, which is all text in the file up to an empty paragraph:
/(indexterm|variablelist)><para>/ {
    sub(/<para>/, "\n&")
}
{ print }
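To see the record-separator technique at work without a DocBook file at hand, here is a small sketch that reads a few lines from a shell command instead of from standard input. The function name strip_empty_paras(), the sample text, and the use of the shell's printf to supply it are all assumptions made for the demonstration. The RS value is the one from rmpara.awk.

```awk
# Hypothetical helper: return the output of a shell command with empty
# <para> elements removed, using a regular expression as RS.
function strip_empty_paras(cmd,    old_rs, rec, out) {
    old_rs = RS
    RS = "<para>[ \t\n]+</para>\n*"    # gawk extension: RS as a regexp
    out = ""
    while ((cmd | getline rec) > 0)
        out = out rec                  # the separators are never copied
    close(cmd)
    RS = old_rs
    return out
}

BEGIN {
    # sample input: two real paragraphs with an empty one between them
    cmd = "printf '%s' '<para>Text.</para>\n<para> \n</para>\n<para>More.</para>\n'"
    printf "%s", strip_empty_paras(cmd)
}
```

Concatenating the records with nothing between them plays the same role as ORS = "" in the script: whatever matched RS simply drops out of the result.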
Fixing Tables
A significant problem, requiring a separate script, had to do with the formatting
of tables. The Texinfo @multitable ... @end multitable
translates pretty directly into a DocBook <table>. However,
the formatting of the output, while fine for machine processing, was essentially
impossible for a human to work with directly. For example:
<para>
<table> <title></title> <tgroup cols="2"><colspec colwidth="31*">
<colspec colwidth="49*"> <tbody> <row>
<entry><literal>[:alnum:]</literal> </entry>
<entry> Alphanumeric characters. </entry> </row><row> <entry>
<literal>[:alpha:]</literal> </entry> <entry> Alphabetic characters.
</entry> </row><row> <entry><literal>[:blank:]</literal>
</entry> <entry> Space and tab characters. </entry> </row><row>
<entry> <literal>[:cntrl:]</literal> </entry> <entry> Control
characters. </entry> </row></tbody> </tgroup> </table>
</para>
Each row in a table should be separate, and each entry (column) in a row should
have its own line (or lines). For this, I wrote the next script, fixtable.awk.
It is similar to the rmpara.awk script, in that it uses a regular
expression for RS. This time the regular expression matches DocBook
tags. Thus the record is all text up to a tag, and the record separator is the
tag itself plus any trailing white space.
The associative array tab (for "table") contains all the table-related
tags that should be on their own lines. The <colspec> and <tgroup> tags
take attributes, so their entries in the array omit the closing > character:
#! /bin/gawk -f
BEGIN {
    RS = "<[^>]+> *"
    tab["<table>"] = 1
    tab["<colspec"] = 1
    tab["<tbody>"] = 1
    tab["<tgroup"] = 1
    tab["</tgroup>"] = 1
    tab["</tbody>"] = 1
    tab["<row>"] = 1
    tab["</row>"] = 1
}
gawk sets the variable RT (record terminator) to
the actual text that matched the RS regular expression. Any trailing
white space in RT is saved in the variable white,
and then removed from RT. This is necessary in case the tag in
RT isn't one for tables. Then the white space has to be put back
into the output to preserve the original file's contents:
{
    # remove trailing white
    # gensub returns the original string if the re doesn't match
    if (RT ~ / +$/)
        white = gensub(/.*>( +$)/, "\\1", 1, RT)
    else
        white = ""
    sub(/ +$/, "", RT)
This next part does the work. It splits RT around white space.
(This is necessary for the <colspec> and <tgroup> tags.) If the tag is
in the tab array, we print the preceding record, a newline, and then the whole tag
on its own line. An <entry> tag starts a new line but is not followed by a
newline, so the entry's text stays on the same line as the tag.
Finally, any other tags are printed together with the preceding record, without
intervening newlines, and with the original trailing white space:
    split(RT, a, " ")
    if (a[1] in tab)
        printf ("%s\n%s\n", $0, RT)
    else if (a[1] == "<entry>")
        printf ("%s\n%s", $0, RT)
    else
        printf ("%s%s", $0, RT white)
}
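The three cases are easier to follow when the decision is separated from the input handling. The sketch below takes the record and its terminator as ordinary parameters and returns what the rule above would print. The function name place_tag() and its parameter names are hypothetical, and only a subset of the tab entries is filled in.

```awk
BEGIN {
    # a subset of the table-related tags from the script above
    tab["<table>"] = 1
    tab["<tgroup"] = 1
    tab["<colspec"] = 1
    tab["<row>"] = 1
    tab["</row>"] = 1
}

# place_tag() is a hypothetical helper: rec stands in for $0, rt for RT.
function place_tag(rec, rt,    white, a) {
    # save trailing spaces separately, then strip them from the tag
    if (rt ~ / +$/)
        white = gensub(/.*>( +$)/, "\\1", 1, rt)
    else
        white = ""
    sub(/ +$/, "", rt)

    split(rt, a, " ")                # a[1] is the tag name without attributes
    if (a[1] in tab)
        return rec "\n" rt "\n"      # table tag: on a line of its own
    if (a[1] == "<entry>")
        return rec "\n" rt           # entry: starts a new line
    return rec rt white              # anything else: left as it was
}

BEGIN {
    printf "%s", place_tag("", "<tgroup cols=\"2\">")
    printf "%s", place_tag(" ", "<entry> ")
    printf "%s", place_tag("Control characters. ", "</entry> ")
    printf "%s", place_tag("", "</row>")
}
```

Because place_tag() modifies only its own copy of rt, the caller's string is untouched; the original script can edit RT directly because RT is set afresh for every record.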
The result of running this script on the above input is:
<para>
<table>
<title></title>
<tgroup cols="2">
<colspec colwidth="31*">
<colspec colwidth="49*">
<tbody>
<row>
<entry><literal>[:alnum:]</literal> </entry>
<entry>Alphanumeric
characters. </entry>
</row>
<row>
<entry><literal>[:alpha:]</literal> </entry>
<entry>Alphabetic
characters. </entry>
</row>
<row>
<entry><literal>[:blank:]</literal> </entry>
<entry>Space and
tab characters. </entry>
</row>
<row>
<entry><literal>[:cntrl:]</literal> </entry>
<entry>Control
characters. </entry>
</row>
</tbody>
</tgroup>
</table>
</para>
Although there are still extra newlines, at least now the table is readable, and further manual cleaning up isn't difficult.





