O'Reilly Hacks
Yahoo! Hacks
By Paul Bausch
October 2005

HACK #22
Yahoo! Directory Mindshare in Google
How does link popularity compare in Yahoo!'s searchable subject index versus Google's full-text index? Find out by calculating mindshare.

The Code

To run the following code, you will need a Google API account (http://api.google.com) and the Perl modules SOAP::Lite (http://www.soaplite.com) and HTML::LinkExtor (http://search.cpan.org/author/GAAS/HTML-Parser/lib/HTML/LinkExtor.pm). The script also uses LWP::Simple to fetch the directory page. You'll also need a copy of the Google WSDL file (http://api.google.com/GoogleSearch.wsdl) in the same directory as the script. Save the following code to a file called mindshare.pl:

	#!/usr/bin/perl -w
	
	use strict;
	use LWP::Simple;
	use HTML::LinkExtor;
	use SOAP::Lite;

	my $google_key = "your API key goes here";
	my $google_wsdl = "GoogleSearch.wsdl";
	my $yahoo_dir = shift || "/Computers_and_Internet/Data_Formats/XML__".
				"eXtensible_Markup_Language_/RSS/Aggregators/";

	# download the Yahoo! directory.
	my $data = get("http://dir.yahoo.com" . $yahoo_dir)
		or die "Couldn't fetch http://dir.yahoo.com$yahoo_dir\n";

	# create our Google object.
	my $google_search = SOAP::Lite->service("file:$google_wsdl");
	my %urls; # where we keep our link counts, keyed by URL.

	# extract all the links and parse 'em.
	HTML::LinkExtor->new(\&mindshare)->parse($data);
	sub mindshare { # for each link we find...

		my ($tag, %attr) = @_;

		# only continue on if the tag was a link with an href,
		# the URL isn't one of Yahoo!'s own redirects,
		# and the URL is absolute.
		return if $tag ne 'a';
		return unless defined $attr{href};
		return if $attr{href} =~ /us\.rd\.yahoo/;
		return unless $attr{href} =~ /^http/;

		# and process each URL through Google.
		my $results = $google_search->doGoogleSearch(
				$google_key, "link:$attr{href}", 0, 1,
				"true", "", "false", "", "", ""
				); # wheee, that was easy, guvner.
		$urls{$attr{href}} = $results->{estimatedTotalResultsCount};
	}

	# now sort and display, highest link count first.
	my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
	foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }
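
The script prints raw link counts. If you'd rather see each site's slice of the total, one possible approach is to turn the counts into percentages. The sketch below is only an illustration, not part of the original hack: the mindshare_percentages function and the sample counts are hypothetical, and it assumes "mindshare" means a URL's share of all the links counted in the category.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical link counts, standing in for the %urls hash that
# mindshare.pl fills in. These numbers are made up for illustration.
my %urls = (
    'http://example.com/a' => 150,
    'http://example.com/b' => 50,
    'http://example.com/c' => 0,
);

# Return a hash mapping each URL to its percentage of the total count.
# (mindshare_percentages is a hypothetical helper, not from the hack.)
sub mindshare_percentages {
    my (%counts) = @_;

    my $total = 0;
    $total += $_ for values %counts;

    my %share;
    foreach my $url (keys %counts) {
        # guard against dividing by zero when no links were found at all.
        $share{$url} = $total ? 100 * $counts{$url} / $total : 0;
    }
    return %share;
}

my %share = mindshare_percentages(%urls);

# sort by share, highest first; fall back to the URL to break ties.
foreach my $url (sort { $share{$b} <=> $share{$a} || $a cmp $b } keys %share) {
    printf "%5.1f%%  %6d  %s\n", $share{$url}, $urls{$url}, $url;
}
```

To try this with real data, you could drop the sample hash and call mindshare_percentages(%urls) at the end of mindshare.pl, once the Google lookups have filled in %urls.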

