The Code
So, after trying the orthodox approach, I started again. I broke the
rule about parsing HTML with regular expressions and wrote the
following code:
#!/usr/bin/perl -w
use strict;
my (@s) = m{
    >                   # close of previous tag
    ([^<]+)             # text (name of part, e.g., q/BLACK CARTRIDGE/)
    <br>
    ([^<]+)             # part number (e.g., q/HP Part Number: HP C9724A/+)
    (?:<[^>]+>\s*){4}   # separated by four tags
    (\d+)               # percent remaining
  |                     # --or--
    (?:
        # different text values
        (?:
            Pages\sRemaining
          | Low\sReached
          | Serial\sNumber
          | Pages\sprinted\swith\sthis\ssupply
        )
        : (?:\s*<[^>]+>){6}\s*   # colon, separated by six tags
        # or just this, within the current element
      | Based\son\shistorical\s\S+\spage\scoverage\sof\s
    )
    (\w+)               # and the value we want
}gx;
A single regular expression (albeit with a /g
modifier for global matching) pulls out everything I want. Actually,
it's not quite perfect: each match also drops a few undefs
into the resulting array, one for every capturing group
on the side of the | alternation that did not take part
in the match. This is easily handled by adding a simple
next unless $index to
any foreach loop on @s.
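To see where the undefs come from, here is a minimal sketch built on a much smaller pattern than the one above. The sample string, the pattern, and the names $text, @found, and @values are invented for illustration and are not part of the original program. The loop also tests with defined rather than plain truth, which is an assumption that goes beyond the text: a bare next unless $index would skip a legitimate value of 0 along with the undefs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the printer's status page.
my $text = 'ink=40 pages:1200 ink=0';

# Two alternatives, each with its own capturing group. In list context
# with /g, every match contributes one slot per group, so the group on
# the branch that did not match comes back as undef.
my @found = $text =~ m{
    ink=(\d+)        # group 1, filled only by the first branch
  |                  # --or--
    pages:(\d+)      # group 2, filled only by the second branch
}gx;

# @found is now (40, undef, undef, 1200, 0, undef)

my @values;
foreach my $index (@found) {
    next unless defined $index;   # skip unfilled groups, keep a real 0
    push @values, $index;
}

print join(',', @values), "\n";   # prints 40,1200,0
```

The same filtering can be written in one step as grep { defined } @found if the loop body does nothing else.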
Is the code fragile? Not really. The HTML has errors in it, such as
<td valign= op">, which can trip up some
modules that expect perfectly formed HTML, but
HTML::TreeBuilder coped just fine with this too.
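Part of the reason the regular expression tolerates such markup is the way it skips tags: <[^>]+> needs only an opening <, a closing >, and no > in between, so it never looks at attributes or their quoting. A minimal check, using the malformed tag quoted above (the variable names are illustrative, not from the original program):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The malformed tag exactly as quoted in the text.
my $tag = '<td valign= op">';

# The tag-skipping subpattern from the main expression, anchored here
# so that it must consume the whole string.
my $ok = ($tag =~ m{\A<[^>]+>\z}) ? 1 : 0;

print "tag skipped cleanly: $ok\n";   # prints tag skipped cleanly: 1
```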