Hacking Maps
Automatically Geocode U.S. Addresses

Hack #80



Use the geocoder.us web services to geocode batches of addresses.

In [Hack #79], you saw how easy it was to geocode an individual address. But what about a whole database of addresses? What about geocoding addresses as people enter them into a web form? You don't need to screen-scrape geocoder.us! The site offers three different web service interfaces: XML-RPC, a lightweight RESTful interface, and an embryonic SOAP interface. (For more information and a code sample, consult the geocoder.us web site.)

A web service is a way for a program to communicate with another program over the Web. In this case, it is as though you had a magical assistant entering addresses into the geocoder.us site and handing the resulting coordinates back to your program.

Except this assistant is itself a program, and it is optimized to get just the information that you need and return that information to your program. One site that uses the geocoder this way is "Caffeinated and Unstrung: A Guide to Seattle's Free Wireless Coffee Shops" (http://seattle.wifimug.org), created by Kellan Elliot-McCrea. You can go to the site, select the "Search Nearby" option, enter your address, and find a spot that provides both coffee and a wireless connection. At first glance, this may seem like overkill, but given the coffee habit of the Seattle wireless community, a place to get connected is a fine thing!

geocoder.us can be queried via the XML-RPC and RESTful interfaces, which can be used from any reasonable programming language. The basic steps are:

  1. Get an address from a web form, database, or file.

  2. Format that address and create a web service request.

  3. Call the geocoder.

  4. Do something interesting with the result.
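The four steps above can be sketched as a small loop. This is only a minimal sketch using core Perl: the helper name geocode_all is hypothetical, and the geocoder is passed in as a code reference so that the stand-in used here (which returns a made-up point and never touches the network) could be swapped for a real XML-RPC or REST call.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run each address through a geocoder callback.
# $geocoder is any code ref that takes an address string and returns
# a reference to an array of match hashes (with lat and long keys).
sub geocode_all {
    my ($addresses, $geocoder) = @_;
    my @rows;
    for my $address (@$addresses) {             # step 1: get an address
        (my $clean = $address) =~ s/\s+/ /g;    # step 2: tidy it for the request
        $clean =~ s/^ | $//g;
        my $matches = $geocoder->($clean);      # step 3: call the geocoder
        push @rows, { address => $clean, matches => $matches || [] };
    }
    return \@rows;
}

# Stand-in geocoder for illustration only; it always "finds" the same
# made-up point instead of contacting a real service.
my $stub = sub { return [ { lat => '0.000000', long => '0.000000' } ] };

# step 4: do something interesting with the result
for my $row (@{ geocode_all([ '  111 Main St.,   Anytown, KS ' ], $stub) }) {
    my $count = scalar @{ $row->{matches} };
    print "$row->{address}: $count match(es)\n";
}
```

Keeping the geocoder behind a code reference also makes it easy to try the loop on a file of addresses before pointing it at a live service.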

Geocoding with XML-RPC

XML-RPC is a way of making a request to a remote system (a Remote Procedure Call, or RPC) in which both the request and the response are encoded as XML. The XML response from geocoder.us is easy to handle in Perl by using the XMLRPC::Lite module, and most modern languages have a library that will parse XML-RPC and return the results in an easy-to-manage form:

#!/usr/bin/perl

use strict;
use warnings;
use XMLRPC::Lite;
use Data::Dumper;

# The address to geocode comes from the command line
my $where = shift @ARGV
    or die "Usage: $0 \"111 Main St., Anytown, KS\"\n";

# Call the remote geocode method and unpack the response
my $result = XMLRPC::Lite
  -> proxy( 'http://rpc.geocoder.us/service/xmlrpc' )
  -> geocode( $where )
  -> result;

print Dumper $result;

Before running the code, you need to install the XMLRPC::Lite Perl module. This can be done via CPAN from the shell by typing sudo perl -MCPAN -e 'install XMLRPC::Lite'.
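XMLRPC::Lite hides the XML that actually travels over the wire. As a rough illustration, here is a sketch that builds the kind of request body the XML-RPC specification describes for a call to the geocode method with a single string argument. The helper names xml_escape and xmlrpc_request are hypothetical, and the exact whitespace and HTTP headers that XMLRPC::Lite sends may well differ.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: build an XML-RPC methodCall document that
# carries one string parameter.
sub xmlrpc_request {
    my ($method, $arg) = @_;
    return join '',
        qq{<?xml version="1.0"?>\n},
        '<methodCall>',
        '<methodName>', xml_escape($method), '</methodName>',
        '<params><param><value><string>', xml_escape($arg),
        '</string></value></param></params>',
        "</methodCall>\n";
}

print xmlrpc_request('geocode', '1005 Gravenstein Hwy North, Sebastopol, CA');
```

A document like this is sent in the body of an HTTP POST, and the response comes back as a similar XML document that the library unpacks into Perl data.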

Running the Hack

Save the previous script in a file called simplest_xmlrpc.pl, make it executable (chmod +x simplest_xmlrpc.pl), and run it like this:

./simplest_xmlrpc.pl "1005 Gravenstein Hwy North, Sebastopol, CA  95472"

It should show you the following data structure:

$VAR1 = [
          {
            'lat' => '38.411908',
            'state' => 'CA',
            'zip' => '95472',
            'prefix' => '',
            'long' => '-122.842232',
            'suffix' => 'N',
            'number' => '1005',
            'type' => 'Hwy',
            'city' => 'Sebastopol',
            'street' => 'Gravenstein'
          }
        ];

In this example, the geocoder found exactly one match, but an ambiguous address can return several possible matches. For example, try geocoding this address:

./simplest_xmlrpc.pl "800 Oxford Ave., Los Angeles, CA"

The geocoder finds three possible addresses in different parts of the city: an Oxford Avenue North, South, and unknown. This often occurs when you try to identify a location from incomplete information, and it is also a potential trouble spot if you are geocoding a full database where you don't have additional context. Fortunately, each of the Oxford Avenues is in a different ZIP Code, so they can be disambiguated by supplying the full address, including the directional. The important point to remember is that you can get multiple results, so plan accordingly. In the sample batch geocoding script in Section 7.4.4 (later in this hack), addresses with multiple matches will be marked and specifically not geocoded, following the theory that bad data is worse than no data.
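One way to "plan accordingly" is to classify each result before using it. The following is a minimal sketch, not the geocoder's own logic: the helper name classify_result is hypothetical, the sample data is made up, and it assumes a failed lookup shows up either as an empty list or as entries without coordinates.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a geocoder result (a reference to an
# array of match hashes) as 'none', 'ok', or 'ambiguous'. Only entries
# that carry both a latitude and a longitude count as matches.
sub classify_result {
    my ($result) = @_;
    my @matches = grep { defined $_->{lat} && defined $_->{long} }
                  @{ $result || [] };
    return ('none',      undef)       if @matches == 0;
    return ('ok',        $matches[0]) if @matches == 1;
    return ('ambiguous', undef);
}

# Made-up sample data shaped like the geocoder's output.
my $one  = [ { lat => '38.411908', long => '-122.842232' } ];
my $many = [ { lat => '34.0', long => '-118.3', prefix => 'N' },
             { lat => '34.1', long => '-118.3', prefix => 'S' } ];

for my $case ($one, $many, []) {
    my ($status, $match) = classify_result($case);
    my $line = $status eq 'ok'
        ? "$match->{lat},$match->{long}"
        : "skipped ($status)";
    print "$line\n";
}
```

Only the unambiguous case yields coordinates; the other two are reported and skipped rather than guessed at.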

The XMLRPC::Lite call returns a reference to an array of hash refs, with one array element for each possible match. Processing the returned value is trivial in Perl. The last line of the sample is:

print Dumper $result;

This uses Perl's core Data::Dumper module to print complex data structures. Replacing that line with the following code will walk through all the returned addresses and print out the city, state, ZIP, latitude, and longitude:

foreach my $row (@$result) {
        print $row->{city} . ',' . $row->{state} . ',';
        print $row->{zip} . ',' . $row->{lat} . ',' . $row->{long} . "\n";
}

The geocoder also returns the address that you passed to it in a cleaned-up form, split back into fields. So you can use the XML-RPC interface as a poor man's address parser.
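As a sketch of the address-parser idea, the returned fields can be glued back together into one normalized line. The helper name format_address and the field order (number, prefix, street, type, suffix) are assumptions for illustration; the sample hash simply repeats the Sebastopol result shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild a one-line address from the fields the
# geocoder returns, skipping any street parts that are empty.
sub format_address {
    my ($row) = @_;
    my $street = join ' ',
        grep { defined($_) && length($_) }
        @{$row}{qw(number prefix street type suffix)};
    return "$street, $row->{city}, $row->{state} $row->{zip}";
}

# The fields from the Sebastopol example shown earlier.
my %match = (
    number => '1005', prefix => '', street => 'Gravenstein',
    type   => 'Hwy',  suffix => 'N',
    city   => 'Sebastopol', state => 'CA', zip => '95472',
);

print format_address(\%match), "\n";
# prints 1005 Gravenstein Hwy N, Sebastopol, CA 95472
```

Note that the input said "North" while the cleaned-up form says "N"; that normalization is what makes the result useful for comparing addresses.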

Casey West is working on a Perl module to extract the address-splitting functionality of the geocoder and put it into its own module. As always, keep an eye on CPAN!

Geocoding with the RDF/REST Interface

REST stands for "Representational State Transfer." In practice, it means that web service requests are made as ordinary HTTP GET and POST requests, with the parameters carried in a normal, human-readable URL. To make a RESTful request to the geocoder, you need to create a URI-safe version of the address: one that can appear on the address line of your browser, which means replacing spaces with + signs and replacing other special characters with escape sequences. Here is an example of a RESTful call. The advantage over the XML-RPC version is that you can paste this directly into your browser, so you don't need an XML-RPC client library to make the request:

http://rpc.geocoder.us/service/rest?address=1005+Gravenstein+Hwy+N+sebastopol+ca

This returns an RDF/XML document that includes the results of your request, which will be displayed in different ways depending on your browser. Apple's Safari browser displays the full RDF/XML document, as shown in Figure 7-3.

Figure 7-3. The results of a REST-ful RDF request shown in Safari

Older or non-RDF-aware browsers will ignore the tags that they don't recognize (such as <geo:Point>), leaving just the coordinates. Opera reveals the bare coordinates:

-122.842232 38.411908
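The URI-safe form of an address can be produced with core Perl alone. This is a minimal sketch under stated assumptions: the helper name query_escape is hypothetical, and it handles only ASCII input (the URI::Escape module is the more general tool).

```perl
use strict;
use warnings;

# Hypothetical helper: make an ASCII address safe for a query string.
# Letters, digits, and - . _ ~ pass through, spaces become +, and
# every other character becomes a %XX escape.
sub query_escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~ ])/sprintf('%%%02X', ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

my $address = '1005 Gravenstein Hwy N, Sebastopol, CA';
print 'http://rpc.geocoder.us/service/rest?address=',
      query_escape($address), "\n";
# prints http://rpc.geocoder.us/service/rest?address=1005+Gravenstein+Hwy+N%2C+Sebastopol%2C+CA
```

The commas turn into %2C because they are not in the set of characters that may appear unescaped.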

Here's an example of a simple program to script the REST interface with Perl:

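#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;
use URI::Escape;

my $where = shift @ARGV
    or die "Usage: $0 \"111 Main St, Anytown, KS\"\n";

# Escape the address so it is safe to embed in the query string
my $addr = uri_escape($where);
my $rdf  = get "http://rpc.geocoder.us/service/rest?address=$addr";
die "No response from the geocoder\n" unless defined $rdf;
print $rdf;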

Call the program by putting an address on the command line:

./simplest_rest.pl "1005 Gravenstein Hwy North, Sebastopol, CA"

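You can also substitute + for spaces and skip the quotes. (Because uri_escape encodes each + as %2B, this relies on the geocoder treating a literal plus sign as a separator.)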

./simplest_rest.pl 1005+Gravenstein+Hwy+North+Sebastopol+CA

The full RDF document as shown in Figure 7-3 is returned. This can be parsed with the Perl module RDF::Simple::Parser.
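If all you want is the coordinates, a pair of regular expressions can stand in for a full RDF parser. The sketch below is only that: the helper name extract_points is hypothetical, and the sample document is hand-written on the assumption that the response wraps geo:long and geo:lat elements (from the W3C Basic Geo vocabulary) inside each geo:Point. Check it against a real response before relying on it.

```perl
use strict;
use warnings;

# Hypothetical helper: pull latitude/longitude pairs out of RDF/XML
# text with regular expressions. This is a shortcut for simple,
# well-formed responses, not a real RDF parser.
sub extract_points {
    my ($rdf) = @_;
    my @points;
    while ($rdf =~ m{<geo:Point\b[^>]*>(.*?)</geo:Point>}gs) {
        my $body = $1;
        my ($lat)  = $body =~ m{<geo:lat>\s*([-+\d.]+)\s*</geo:lat>};
        my ($long) = $body =~ m{<geo:long>\s*([-+\d.]+)\s*</geo:long>};
        push @points, { lat => $lat, long => $long }
            if defined($lat) && defined($long);
    }
    return @points;
}

# Hand-written sample, assumed to resemble the service's response.
my $sample = <<'RDF';
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <geo:Point>
    <geo:long>-122.842232</geo:long>
    <geo:lat>38.411908</geo:lat>
  </geo:Point>
</rdf:RDF>
RDF

for my $point (extract_points($sample)) {
    print "$point->{lat},$point->{long}\n";
}
```

Because an ambiguous address can produce more than one point, the helper returns a list rather than a single pair.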

Geocoding a List of Addresses

The Monterey Express is a dive boat in Monterey, California. A list of dive-related resources for the boat is maintained at http://www.montereyexpress.com/DiveLinks.htm. A real-world application would be to geocode these addresses in order to create a "find your closest dive resource" application. This sample Perl code fetches the list, does a simplistic (and demonstrably wrong in some cases) parse to get the addresses, geocodes the addresses, and returns the results:

#!/usr/bin/perl

# divecode.pl - Geocode the Monterey Express dive resources list

use strict;
use warnings;
use LWP::Simple;
use XMLRPC::Lite;

# Read the page from a file argument or piped STDIN if there is one;
# otherwise fetch it from the web.
my $lines;
if (@ARGV or not -t STDIN) {
        local $/;       # slurp mode
        $lines = <>;
} else {
        $lines = get "http://www.montereyexpress.com/DiveLinks.htm";
}
die "No input to parse\n" unless defined $lines;

my $shop_name = '';
while ($lines =~ s/(.+)<br>//m) {
        my $st = $1;
        chomp $st;
        # is this the address?
        my ($shop_address) = ($st =~ /^\s*Address:\s*(.+)/);
        if ($shop_address) {
                $shop_address =~ s/<br>//;
                my $result = XMLRPC::Lite
                  -> proxy( 'http://rpc.geocoder.us/service/xmlrpc' )
                  -> geocode( $shop_address )
                  -> result;
                # assume we only get one address; leave blanks if there is none
                my $match = ($result && $result->[0]) || {};
                print join(',', $shop_name, $shop_address,
                        map { defined $_ ? $_ : '' } @{$match}{qw(lat long)}), "\n";
        }
        # just assume that the shop name is the line before the address
        $shop_name = $st;
}

To run this hack, just execute the script, which produces the following results:

Rich-Gibson-iBook:~/wa/geohacks/geocode_web_service rich$ ./divecode.pl 
Aquatic Dreams Scuba Center,1212 Kansas Avenue, Modesto, CA 95351,37.647585,-121.028297
Bamboo Reef (Monterey),614 Lighthouse Avenue, Monterey, CA 93940,36.613716,-121.901494
Bamboo Reef,584 4th Street, San Francisco, CA 94107,37.778529,-122.396631
...

Out of 26 "legitimate" addresses, all but four (22, or about 85%) were successfully geocoded. The remaining four didn't work because the addresses were broken across extra lines in the original document, and I didn't write a very good parser. Better results could be obtained by using the HTML::Parser module and spending a bit more time studying this particular data set, but the goal was to illustrate how easy it can be to get 85% success with such a simplistic approach (and a few dozen lines of Perl).

Happy geocoding!

Setting up Your Own Geocoding Server

Do you have too many addresses to geocode, or want the control of running your own server? You can set up your own geocoder with the Geo::Coder::US code.

You need to install the Perl module Geo::Coder::US from CPAN, download the relevant TIGER data for the areas you wish to cover, and then load the database as described in the Geo::Coder::US documentation.

Schuyler Erle, Rich Gibson, and Jo Walsh had nothing to do with the Tunguska explosion of 1908. No, really.

