

O'Reilly Book Excerpts: Spidering Hacks

More Spidering Hacks

by Morbus Iff and Tara Calishain


Editor's note: In last week's sample hacks, excerpted from Spidering Hacks, we showed you two workarounds that will save you time and extra trips to your favorite web sites. This week we offer two more hacks on grabbing--or scraping--the information you need, whether it's the link count for a particular Yahoo! category, or the quick answer for the word that's just on the tip of your tongue. Enjoy.

Hack #49: Yahoo! Directory Mindshare in Google

How does link popularity compare in Yahoo!'s searchable subject index versus Google's full-text index? Find out by calculating mindshare!

Yahoo! and Google are two very different animals. Yahoo! indexes only a site's main URL, title, and description, while Google builds full-text indexes of entire sites. Surely there's some interesting cross-pollination when you combine results from the two.

This hack scrapes all the URLs in a specified subcategory of the Yahoo! directory. It then takes each URL and gets its link count from Google. Each link count provides a nice snapshot of how a particular Yahoo! category and its listed sites stack up on the popularity scale.

TIP: What's a link count? It's simply the total number of pages in Google's index that link to a specific URL.
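The script asks Google for this number with a link: query. Stripped to its essence, a single lookup looks like the following sketch, which reuses the $google_search object and $google_key set up in the full code below (the URL here is just an example):

# count the pages in Google's index that link to one URL.
my $results = $google_search->doGoogleSearch(
                    $google_key, "link:http://www.oreilly.com/", 0, 1,
                    "true", "", "false", "", "", "");
print $results->{estimatedTotalResultsCount}, "\n";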

There are a couple of ways you can use your knowledge of a subcategory's link count. If you find a subcategory whose URLs have only a few links each in Google, you may have found a subcategory that isn't getting a lot of attention from Yahoo!'s editors. Consider going elsewhere for your research. If you're a webmaster and you're considering paying to have Yahoo! add you to their directory, run this hack on the category in which you want to be listed. Are most of the links really popular? If they are, are you sure your site will stand out and get clicks? Maybe you should choose a different category.

We got this idea from a similar experiment Jon Udell (http://weblog.infoworld.com/udell/) did in 2001. He used AltaVista instead of Google; see mindshare-script.txt. We appreciate the inspiration, Jon!

The Code

You will need a Google API account (http://api.google.com/), as well as the SOAP::Lite and HTML::LinkExtor Perl modules.
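If either module is missing from your system, the standard CPAN shell will install it:

% perl -MCPAN -e 'install SOAP::Lite'
% perl -MCPAN -e 'install HTML::LinkExtor'

With those in place, save the following script as mindshare.pl: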

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = "your API key goes here";
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML__".
                  "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# download the Yahoo! directory.
my $data = get("http://dir.yahoo.com" . $yahoo_dir) or die $!;

# create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);
sub mindshare { # for each link we find...

    my ($tag, %attr) = @_;

    # continue on only if the tag was a link,
    # and the URL matches Yahoo!'s redirectory.
    return if $tag ne 'a';
    return unless $attr{href} =~ /srd.yahoo/;
    return unless $attr{href} =~ /\*http/;

    # now get our real URL.
    $attr{href} =~ /\*(http.*)/; my $url = $1;

    # and process each URL through Google.
    my $results = $google_search->doGoogleSearch(
                        $google_key, "link:$url", 0, 1,
                        "true", "", "false", "", "", ""
                  ); # wheee, that was easy, guvner.
    $urls{$url} = $results->{estimatedTotalResultsCount};
}

# now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }

Running the Hack

The hack's only configuration, the Yahoo! directory you're interested in, is passed as a single argument (in quotes) on the command line. If you don't pass a directory of your own, a default will be used instead.

% perl mindshare.pl "/Entertainment/Humor/Procrastination/"

Your results show the URLs in that directory, sorted by total Google links:

340: http://www.p45.net/
246: http://www.ishouldbeworking.com/
81: http://www.india.com/
33: http://www.jlc.net/~useless/
23: http://www.geocities.com/SouthBeach/1915/
18: http://www.eskimo.com/~spban/creed.html
13: http://www.black-schaffer.org/scp/
3: http://www.angelfire.com/mi/psociety
2: http://www.geocities.com/wastingstatetime/

Hacking the Hack

Yahoo! isn't the only searchable subject index out there, of course; there's also the Open Directory Project (DMOZ, http://www.dmoz.org/), which is the product of thousands of volunteers busily cataloging and categorizing sites on the Web — the web community's Yahoo!, if you will. This hack works just as well on DMOZ as it does on Yahoo!; they're very similar in structure.

Replace the default Yahoo! directory with its DMOZ equivalent:

my $dmoz_dir = shift || "/Reference/Libraries/Library_and_Information_".
               "Science/Technical_Services/Cataloguing/Metadata/RDF/".
               "Applications/RSS/News_Readers/";

You'll also need to change the download instructions:

# download the Dmoz.org directory.
my $data = get("http://dmoz.org" . $dmoz_dir) or die $!;

Next, replace the lines that check whether a URL should be measured for mindshare. When we were scraping Yahoo! in our original script, all directory entries were prepended with http://srd.yahoo.com/ and then the URL itself. Thus, to ensure we received a proper URL, we skipped over the link unless it matched those criteria:

return unless $attr{href} =~ /srd.yahoo/;
return unless $attr{href} =~ /\*http/;

Since DMOZ is an entirely different site, our checks for validity have to change. DMOZ doesn't modify the outgoing URL, so our previous Yahoo! checks have no relevance here. Instead, we'll make sure it's a full-blooded location (i.e., it starts with http://) and it doesn't match any of DMOZ's internal page links. Likewise, we'll ignore searches on other engines:

return unless $attr{href} =~ /^http/;
return if $attr{href} =~ /dmoz|google|altavista|lycos|yahoo|alltheweb/;

Our last change is to modify the bit of code that gets the real URL from Yahoo!'s modified version. Instead of "finding the URL within the URL":

# now get our real URL.
$attr{href} =~ /\*(http.*)/; my $url = $1;

we simply assign the URL that HTML::LinkExtor has found:

# now get our real URL.
my $url = $attr{href};

Can you go even further with this? Sure! You might want to search a more specialized directory, such as the FishHoo! fishing search engine (http://www.fishhoo.com/).
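Adapting the script for another directory follows the same pattern as the DMOZ changes above. Here's a hedged sketch for FishHoo!; the category path and the link filter are assumptions, so view the source of the category page you care about and adjust the regular expressions to match:

# download a FishHoo! category page (the path here is hypothetical).
my $fishhoo_dir = shift || "/some/category/path/";
my $data = get("http://www.fishhoo.com" . $fishhoo_dir) or die $!;

# ...and inside &mindshare, keep only outbound links, skipping
# FishHoo!'s own pages and other search engines.
return unless $attr{href} =~ /^http/;
return if $attr{href} =~ /fishhoo|google|altavista|lycos|yahoo|alltheweb/;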

You might want to return only the most linked-to URL from the directory, which is quite easy, by piping the results [Hack #28] to another common Unix utility:

% perl mindshare.pl | head -1

Alternatively, you might want to go ahead and grab the top 10 Google matches for the URL that has the most mindshare. To do so, add the following code to the bottom of the script:

print "\nMost popular URLs for the strongest mindshare:\n";
my $most_popular = shift @sorted_urls;
my $results = $google_search->doGoogleSearch(
                    $google_key, "$most_popular", 0, 10,
                    "true", "", "false", "", "", "" );

foreach my $element (@{$results->{resultElements}}) {
   next if $element->{URL} eq $most_popular;
   print " * $element->{URL}\n";
   print "   \"$element->More Spidering Hacks\"\n\n";
}

Then, run the script as usual (the output here uses the default hardcoded directory):

% perl mindshare.pl
27800: http://radio.userland.com/
6670: http://www.oreillynet.com/meerkat/
5460: http://www.newsisfree.com/
3280: http://ranchero.com/software/netnewswire/
1840: http://www.disobey.com/amphetadesk/
847: http://www.feedreader.com/
797: http://www.serence.com/site.php?page=prod_klipfolio
674: http://bitworking.org/Aggie.html
492: http://www.newzcrawler.com/
387: http://www.sharpreader.net/
112: http://www.awasu.com/
102: http://www.bloglines.com/
67: http://www.blueelephantsoftware.com/
57: http://www.blogtrack.com/
50: http://www.proggle.com/novobot/

Most popular URLs for the strongest mindshare:
 * http://groups.yahoo.com/group/radio-userland/
   "Yahoo! Groups : radio-userland"

 * http://groups.yahoo.com/group/radio-userland-francophone/message/76
   "Yahoo! Groupes : radio-userland-francophone Messages : Message 76 ... "

 * http://www.fuzzygroup.com/writing/radiouserland_faq.htm
   "Fuzzygroup :: Radio UserLand FAQ"
...

Hack #78: Super Word Lookup

Working on a paper, book, or thesis and need a nerdy definition of one word, and alternatives to another?

You're writing a paper and getting sick of constantly looking up words in your dictionary and thesaurus. As with most of the hacks in this book, you can scratch your itch with a little bit of Perl. This script uses the dict protocol (http://www.dict.org/) and Thesaurus.com (http://www.thesaurus.com/) to find all you need to know about a word.

By using the dict protocol, DICT.org and several other dictionary sites make our task easier, since we do not need to filter through HTML code to get what we are looking for. A quick look through CPAN (http://www.cpan.org/) reveals that the dict protocol has already been implemented as a Perl module (http://search.cpan.org/author/NEILB/Net-Dict/lib/Net/Dict.pod). Reading through the documentation, you will find it is well-written and easy to implement; with just a few lines, you have more definitions than you can shake a stick at. Next problem.
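To give a sense of how little code that takes, here's a minimal sketch using the same Net::Dict calls that appear in the full script below (the word is just an example):

#!/usr/bin/perl -w
use strict;
use Net::Dict;

# connect to the public DICT.org server and fetch every definition
# it has for one word; define() returns [database, definition] pairs.
my $dict = Net::Dict->new('dict.org');
my $defs = $dict->define('serendipity');
foreach my $def (@{$defs}) {
    my ($db, $definition) = @{$def};
    print "[$db]\n$definition\n";
}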

Unfortunately, the thesaurus part of our program will not be as simple. However, there is a great online thesaurus (http://www.thesaurus.com/) that we will use to get the information we need. The main page of the site offers a form to look up a word, and the results take us to exactly what we want. A quick look at the URL shows this will be an easy hurdle to overcome — using LWP, we can grab the page we want and need to worry only about parsing through it.

Since some words have multiple forms (noun, verb, etc.), there might be more than one entry for a word; this needs to be kept in mind. Looking at the HTML source, you can see that each row of the data is on its own line, starting with some table tags, then the header for the line (Concept, Function, etc.), followed by the content.

The easiest way to handle this is to go through each section individually, grabbing from Entry to Source, and then parse out what's between. Since we want only synonyms for the exact word we searched for, we will grab only sections where the content for the entry line contains only the word we are looking for and is between the highlighting tag used by the site. Once we have this, we can strip out those highlighting tags and proceed to finding the synonym and antonym lines, which might not be available for every section.

The easiest thing to do here is to throw it all in an array; this makes it easier to sort, remove duplicate words, and display. In cases in which you are parsing through long HTML, you might find it easier to put the common HTML strings in variables and use them in the regular expressions; it makes the code easier to read. With a long list of all the words, we use the Sort::Array module to get an alphabetical, and unique, listing of results.

The Code

Save the following code as dict.pl:

#!/usr/bin/perl -w
#
# Dict - looks up definitions, synonyms and antonyms of words.
# Comments, suggestions, contempt? Email adam@bregenzer.net.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
use LWP;
use Net::Dict;
use Sort::Array "Discard_Duplicates";
use URI::Escape;

my $word = $ARGV[0]; # the word to look up
die "You didn't pass a word!\n" unless $word;
print "Definitions for word '$word':\n";

# get the dict.org results.
my $dict = Net::Dict->new('dict.org');
my $defs = $dict->define($word);
foreach my $def (@{$defs}) {
    my ($db, $definition) = @{$def};
    print $definition . "\n";
}

# base URL for thesaurus.com requests
# as well as the surrounding HTML of
# the data we want. cleaner regexps.
my $base_url       = "http://thesaurus.reference.com/search?q=";
my $middle_html    = ":</b>&nbsp;&nbsp;</td><td>";
my $end_html       = "</td></tr>";
my $highlight_html = "<b style=\"background: #ffffaa\">";

# grab the thesaurus results.
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');
my $data = $ua->get("$base_url" . uri_escape($word))->content;

# holders for matches.
my (@synonyms, @antonyms);

# and now loop through them all.
while ($data =~ /Entry(.*?)<b>Source:<\/b>(.*)/) {
    my $match = $1; $data = $2;

    # strip out the bold marks around the matched word.
    $match =~ s/${highlight_html}([^<]+)<\/b>/$1/;

    # push our results into our various arrays.
    if ($match =~ /Synonyms${middle_html}([^<]*)${end_html}/) {
        push @synonyms, (split /, /, $1);
    }
    elsif ($match =~ /Antonyms${middle_html}([^<]*)${end_html}/) {
        push @antonyms, (split /, /, $1);
    }
}

# sort them with sort::array,
# and return unique matches.
if (@synonyms) {
    @synonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@synonyms,
    );

    print "Synonyms for $word:\n";
    my $quotes = ''; # purtier.
    foreach my $nym (@synonyms) {
        print $quotes . $nym;
        $quotes = ', ';
    } print "\n\n";
}

# same thing as above.
if (@antonyms) {
    @antonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@antonyms,
    );

    print "Antonyms for $word:\n";
    my $quotes = ''; # purtier.
    foreach my $nym (@antonyms) {
        print $quotes . $nym;
        $quotes = ', ';
    } print "\n";
}

Running the Hack

Invoke the script on the command line, passing it one word at a time. As far as I know, these sites work with English words only. This script has a tendency to generate a lot of output, so you might want to pipe it to less or redirect it to a file.
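For instance (the filename here is arbitrary):

% perl dict.pl "hack" | less
% perl dict.pl "hack" > hack-words.txt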

Here is an example where I look up the word "hack":

% perl dict.pl "hack"
Definitions for word 'hack':
<snip>
hack
 
   <jargon> 1. Originally, a quick job that produces what is
   needed, but not well.
 
   2.  An incredibly good, and perhaps very time-consuming, piece
   of work that produces exactly what is needed.

<snip>
 
   See also {neat hack}, {real hack}.
 
   [{Jargon File}]
 
   (1996-08-26)
 
Synonyms for hack:
be at, block out, bother, bug, bum, carve, chip, chisel, chop, cleave, 
crack, cut, dissect, dissever, disunite, divide, divorce, dog, drudge, 
engrave, etch, exasperate, fashion, form, gall, get, get to, grate, grave, 
greasy grind, grind, grub, grubber, grubstreet, hack, hew, hireling, incise, 
indent, insculp, irk, irritate, lackey, machine, mercenary, model, mold, 
mould, nag, needle, nettle, old pro, open, part, pattern, peeve, pester, 
pick on, pierce, pique, plodder, potboiler, pro, provoke, rend, rip, rive, 
rough-hew, sculpt, sculpture, separate, servant, sever, shape, slash, slave, 
slice, stab, stipple, sunder, tear asunder, tease, tool, trim, vex, whittle, 
wig, workhorse
 
Antonyms for hack:
appease, aristocratic, attach, calm, cultured, gladden, high-class, humor, 
join, make happy, meld, mollify, pacify, refined, sophisticated, superior, 
unite

Hacking the Hack

There are a few ways you can improve upon this hack.

Using specific dictionaries
You can either use a different dict server or you can use only certain dictionaries within the dict server. The DICT.org server uses 13 dictionaries; you can limit it to use only the 1913 edition of Webster's Revised Unabridged Dictionary by changing the $dict->define line to:

my $defs = $dict->define($word, 'web1913');

The $dict->dbs method will get you a list of dictionaries available.
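For example, this small sketch prints each available dictionary's short name alongside its description (assuming, per the Net::Dict documentation, that dbs returns a hash of name/description pairs):

# list the dictionaries the server offers; the short names are what
# you pass as the second argument to define().
my %dbs = $dict->dbs;
foreach my $name (sort keys %dbs) {
    print "$name\t$dbs{$name}\n";
}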

Clarifying the thesaurus
For brevity, the thesaurus section prints all the synonyms and antonyms for a particular word. It would be more useful if it separated them according to the function of the word and possibly the definition.
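As a starting point, here's a hedged sketch that replaces the matching loop in the script above: it also captures the Function row in each Entry-to-Source section and groups synonyms by part of speech. It reuses the $middle_html and $end_html patterns from the script, and the "Function" row label is an assumption about Thesaurus.com's markup:

# walk the same Entry..Source sections, but remember each section's
# part of speech and file its synonyms under it.
my %syns_by_function;
while ($data =~ /Entry(.*?)<b>Source:<\/b>(.*)/) {
    my $match = $1; $data = $2;

    # the "Function:" row names the part of speech (noun, verb, ...).
    my ($function) = $match =~ /Function${middle_html}([^<]*)${end_html}/;
    $function ||= 'unknown';

    if ($match =~ /Synonyms${middle_html}([^<]*)${end_html}/) {
        push @{ $syns_by_function{$function} }, split(/, /, $1);
    }
}

foreach my $function (sort keys %syns_by_function) {
    print "Synonyms ($function): ",
          join(', ', @{ $syns_by_function{$function} }), "\n";
}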

Adam Bregenzer

Morbus Iff is the coauthor of Mac OS X Hacks, the author of Spidering Hacks, and the creator of disobey.com, which bills itself as "content for the discontented."

Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.



Copyright © 2009 O'Reilly Media, Inc.