
Scrape Customer Advice
Screen scraping can give you access to community features not yet
implemented through the API, such as customer buying advice.
The Code

This Perl script, get_advice.pl, splits the advice page into two
variables based on the headings "in addition to" and "instead of." It
then loops through those sections, using regular expressions to match
each product's information, and finally formats and prints the results.

#!/usr/bin/perl
# get_advice.pl
# A script to scrape Amazon to retrieve customer buying advice
# Usage: perl get_advice.pl <asin>

use strict;
use warnings;
use LWP::Simple;

# Take the ASIN from the command line
my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

# Assemble the URL
my $url = "http://amazon.com/o/tg/detail/-/" . $asin . "/?vi=advice";

# Set up unescape-HTML rules (entity => literal character)
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL
my $content = get($url);
die "Could not retrieve $url\n" unless $content;

# Split the page into the "in addition to" and "instead of" sections
my ($inAddition) = $content =~ m!in addition to(.*?)<tr><td colspan=3><br></td></tr>!mis;
my ($instead) = $content =~ m!recommendations instead of(.*?)</table>!mis;

# Treat a missing section as empty so the loops below simply find nothing
$inAddition = '' unless defined $inAddition;
$instead = '' unless defined $instead;

# Loop through the HTML looking for "in addition" advice
print "-- In Addition To --\n\n";
while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
    my ($place, $thisAsin, $title, $number) = ($1||'', $2||'', $3||'', $4||'');
    $title =~ s/($unescape_re)/$unescape{$1}/g; # unescape HTML
    # Print the results
    print $place . " " .
          $title . " (" . $thisAsin . ")\n(" .
          "Recommendations: " . $number . ")" .
          "\n\n";
}

# Loop through the HTML looking for "instead of" advice
print "-- Instead Of --\n\n";
while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
    my ($place, $thisAsin, $title, $number) = ($1||'', $2||'', $3||'', $4||'');
    $title =~ s/($unescape_re)/$unescape{$1}/g; # unescape HTML
    # Print the results
    print $place . " " .
          $title . " (" . $thisAsin . ")\n(" .
          "Recommendations: " . $number . ")" .
          "\n\n";
}
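
The two loops in the script use the same regular expression, so the matching step can be tried offline without fetching a page. The following is a minimal sketch under stated assumptions: the subroutine name parse_advice is hypothetical and not part of the original script, and the HTML in $sample is invented to fit the pattern. It is not a capture of a real Amazon page, whose markup may differ.

```perl
#!/usr/bin/perl
# Offline sketch: run the advice-row pattern against a made-up HTML fragment.
use strict;
use warnings;

# Entity table (entity => literal character), as in get_advice.pl
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', keys %unescape;

# parse_advice is a hypothetical helper: it returns one hash reference
# per advice row found in the HTML passed to it.
sub parse_advice {
    my ($html) = @_;
    my @items;
    while ($html =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!gis) {
        # Copy the captures before the substitution below overwrites $1
        my ($place, $asin, $title, $number) = ($1, $2, $3, $4);
        $title =~ s/($unescape_re)/$unescape{$1}/g; # unescape HTML
        push @items, {
            place  => $place,
            asin   => $asin,
            title  => $title,
            number => $number,
        };
    }
    return @items;
}

# Invented sample rows (placeholder ASINs and titles, for illustration only)
my $sample = join "\n",
    '<tr>',
    '<td width=10>1.</td>',
    '<td width=90%><a href="/exec/obidos/ASIN/0000000001/ref=x">Cats &amp; Dogs</a></td>',
    '<td width=10% align=center>12</td>',
    '</tr>',
    '<tr>',
    '<td width=10>2.</td>',
    '<td width=90%><a href="/exec/obidos/ASIN/0000000002/ref=x">The &quot;Big&quot; Book</a></td>',
    '<td width=10% align=center>7</td>',
    '</tr>';

# Print each row in the same layout the script uses
for my $item (parse_advice($sample)) {
    print "$item->{place} $item->{title} ($item->{asin})\n",
          "(Recommendations: $item->{number})\n\n";
}
```

Keeping the pattern in one subroutine means that a change to the page layout needs only one fix, and either section of the page can be passed to the same helper.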

© 2007 O'Reilly Media, Inc.