O'Reilly Book Excerpts: Mapping Hacks

Editor's note: Schuyler Erle, one of Mapping Hacks' coauthors, will be participating in a panel discussion on sustainable businesses for data at O'Reilly's Where 2.0 Conference. If you're a developer of location-based services and apps, don't miss what is sure to be a lively debate among executives of organizations like Navteq, Microsoft, the Census Bureau, and others, as they discuss business models for service and data companies.

Geocode a U.S. Street Address

Hack #79 (beginner)

You know the address, but where is that in GPS terms?

You know your friend's address, but that won't help you program your GPS or aim your ICBM. For that, you need her latitude and longitude; you want to "geocode" her address! Geocoding is the process of adding geographic coordinates, such as latitude/longitude, to other information. You can geocode street addresses, or any other information that has a geographic component.

One Saturday we were sitting around thinking that we really ought to go see the Power Tool Drag Races. We knew that they were put on by Qbox (http://www.qbox.org/), and we even knew their address, but where exactly is that? Sure, we could use a commercial mapping service and have it tell us to turn left here, and in circles there, but what I wanted was to program my GPS and have it just sort of point the way. At one level, this is much harder to follow than turn-by-turn directions, except that directions only work as long as you follow them. Since I have little confidence in my ability to follow directions in San Francisco, I am very happy to have the safety net of the GPS pointer.

To cut to the chase, just enter this URL (Figure 7-1 shows what it should return):

http://geocoder.us/demo.cgi?address=950+Hudson+Street%2C+san+francisco%2C+ca

Figure 7-1. The Power Tool Drag Races at Qbox.com

We plugged (37.734085, -122.377589) into our GPS unit, and off we went for a day of power-tool debauchery.

There are commercial services that provide geocoding for U.S. addresses and for other parts of the world. To find them, just do a Google search for "Geocode Addresses."

A geocoder is also at the heart of all the online map services. When you enter a street address into MapQuest, it is geocoded and the map you get is generated from the returned coordinates. In the good old days of the Web, pretty much all of the online map services returned the lat/long for addresses as a "freebie." Then they decided that geocoding had added value, and one by one they pulled the plug.

There is a strong movement of people who believe in open data and open data formats. Mapping sites' removal of free geocoding led directly to the creation of the free geocoder.us site. As William Gibson famously noted, "The street finds its own uses for things," and that use can transcend and exceed the original vision of the tool.

The Birth of geocoder.us

Strangely enough, the removal of useful features from online map services seemed to occur right before a surge of interest in free sources of geodata among the free and open source software community.

Collecting this data and keeping it up to date is quite expensive: it takes "ground truth squads" who go around and verify that streets are where they are supposed to be and that houses haven't up and run off.

An alternative to the full expense of this data lies in the U.S. Census Bureau. They have compiled TIGER (Topologically Integrated Geographic Encoding and Referencing system) data. TIGER data is used as part of the normal fulfillment of their duties to do an actual enumeration of the people every 10 years. This data is imperfect, but the regular tasks of census workers are similar to our own needs. They wish to identify the location of a residence based on a street address, just as we do when we geocode.

Again, it is important to stress that TIGER data is imperfect; however, "imperfect but free" has its own charm. TIGER data is also used as the basis for the free TIGER Map Server offered by the Census Bureau at http://tiger.census.gov/cgi-bin/mapsurfer.

There is a lot of interesting information about geography and the challenges of capturing complex and inconsistent information to be found in the TIGER documentation. But for simple geocoding, all you really need to know is that the TIGER data endeavors to include information on every street segment in the U.S. For each block, the TIGER data includes the street name, the latitude and longitude at each end of the block, and the range of address numbers for the left and the right side of the street.

Here is the entry that includes 1005 Gravenstein Hwy North, Sebastopol, CA 95472 (O'Reilly Media's headquarters):

11003  67518936 A  Gravenstein                   Hwy   A31       1001       1019
1000  101801009547295472              06060970979298092980  
707707077015340315340320124009-122816102+38390313-122815686+38389814

This street segment goes from (38.390313, -122.816102) to (38.389814, -122.815686); one side of the street includes addresses from 1001 through 1019, and the other covers addresses from 1000 to 1018. We can interpolate that "1005" is about a fifth of the way from 1001 to 1019 and, assuming the street is straight, that it will be about a fifth of the way between the ends of the block.
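The interpolation can be sketched in a few lines of Perl. This is only an illustration, not the geocoder.us implementation: the interpolate helper is a hypothetical name, the regular expression assumes the record ends with the from and to coordinates stored as signed integers with six implied decimal places (as in the entry above), and the street is assumed to be straight.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tail of the TIGER record shown above: from-long, from-lat, to-long, to-lat,
# stored as signed integers with six implied decimal places.
my $tail = '-122816102+38390313-122815686+38389814';

my @coords = $tail =~ /([+-]\d{9})([+-]\d{8})([+-]\d{9})([+-]\d{8})\s*$/
    or die "Could not find coordinates in record\n";
my ($from_long, $from_lat, $to_long, $to_lat) = map { $_ / 1e6 } @coords;

# Hypothetical helper: place $number along the segment in proportion to its
# position within the address range, assuming the street is straight.
sub interpolate {
    my ($number, $low, $high, $from, $to) = @_;
    my $span     = $high - $low;
    my $fraction = $span ? ($number - $low) / $span : 0.5;
    return ($from->[0] + $fraction * ($to->[0] - $from->[0]),
            $from->[1] + $fraction * ($to->[1] - $from->[1]));
}

# 1005 lies 4/18 of the way along the 1001-1019 side of the street.
my ($lat, $long) = interpolate(1005, 1001, 1019,
    [$from_lat, $from_long], [$to_lat, $to_long]);
printf "%.6f, %.6f\n", $lat, $long;
```

With the values above, this prints 38.390202, -122.816010, a point a little more than a fifth of the way along the block.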

There is a lot of other information in this line, and in the other files that make up the data set for a county. TIGER/Line comprises some 24 gigabytes of data for the whole country, including information on curves in the road that are not the ends of street segments. But in the interests of compressing that 24 GB into something searchable, we will simplify that extra information.

Fortunately for us, Schuyler Erle has stripped away all of that complexity at http://geocoder.us/, a free geocoding web site and web service for U.S. addresses based on the U.S. Census TIGER/Line data.

You may use the web site to geocode individual addresses or use one of three web service interfaces to geocode via code, as illustrated in [Hack #80]. You can even download the source code from CPAN (the Perl code repository) at http://cpan.org, and the TIGER/Line data from the census to create your own geocoding service.

The site provides a text box for entry of an address or an intersection. So entering "1005 Gravenstein Highway North, Sebastopol, CA" will return the location of O'Reilly Media. You can also enter an intersection, like "Hollywood and Vine, Hollywood, CA" or "Florence Ave and Wilton, Sebastopol, CA 95472."

If your address is one of the majority of those that geocoder.us successfully geocodes, it will return the latitude and longitude. As a bonus, it will display a map, created dynamically by the TIGER/Line Map Server, with your address marked and centered.

The results with lat/long appear quickly, but it can take longer for the map to be fetched from the TIGER/Line Map Server. The map will be blank and the little circle on the right will be red until the map is loaded.

In Seattle, Washington, you can indirectly use the geocoder at Caffeinated and Unstrung to find the nearest location that offers coffee and free wireless access, as illustrated in Figure 7-2.

Figure 7-2. Caffeinated and Unstrung: building on Geocoder.us

See Also

The U.S. Census Bureau's Geography page provides lots of great information (http://www.census.gov/geo/www/index.html).

Automatically Geocode U.S. Addresses

Hack #80 (expert)

Use the geocoder.us web services to geocode batches of addresses.

In [Hack #79], you saw how easy it was to geocode an individual address. But what about a whole database of addresses? What about geocoding addresses as people enter them into a web form? You don't need to webscrape geocoder.us! There are three different web service interfaces. geocoder.us supports XML-RPC and a lightweight REST-ful interface. There is also an embryonic SOAP interface. (For more information and a code sample, consult the geocoder.us web site.)

A web service is a way for a program to communicate with another program over the Web. In this case, it is as though you had a magical assistant entering addresses into the geocoder.us site and returning the resulting coordinates in your program.

Except this assistant is itself a program, and it is optimized to get just the information that you need and return that information to your program. An example is "Caffeinated and Unstrung: A Guide to Seattle's Free Wireless Coffee Shops" (http://seattle.wifimug.org), created by Kellan Elliot-McCrea. You can go to the site, select the "Search Nearby" option, enter your address, and find a spot that provides both coffee and a wireless connection. At first glance, this may seem like overkill, but given the coffee habit of the Seattle wireless community, a place to get connected is a fine thing!

geocoder.us can be queried via the XML-RPC and REST-ful interfaces, which are available to any reasonable programming language. The basic steps are:

  1. Get an address from a web form, database, or file.

  2. Format that address and create a web service request.

  3. Call the geocoder.

  4. Do something interesting with the result.

Geocoding with XML-RPC

XML-RPC is a way of making a request to a remote system (a Remote Procedure Call, or RPC) and receiving the results in XML. The XML response from geocoder.us is easy to script in Perl by using the XMLRPC::Lite module. Most modern languages have a library that will parse XML-RPC and return results in an easy-to-manage form:

#!/usr/bin/perl
   
use XMLRPC::Lite;
use Data::Dumper;
use strict;
use warnings;
   
my $where = shift @ARGV
    or die "Usage: $0 \"111 Main St., Anytown, KS\"\n";
   
my $result = XMLRPC::Lite
  -> proxy( 'http://rpc.geocoder.us/service/xmlrpc' )
  -> geocode( $where )
  -> result;
   
print Dumper $result;

Before running the code, you need to install the XMLRPC::Lite Perl module. This can be done via CPAN from the shell by typing sudo perl -MCPAN -e 'install XMLRPC::Lite'.

Running the Hack

Write the previous script to a file called simplest_xmlrpc.pl, make it executable (chmod +x simplest_xmlrpc.pl), and run it like this:

./simplest_xmlrpc.pl "1005 Gravenstein Hwy North, Sebastopol, CA  95472"

It should show you the following data structure:

$VAR1 = [
          {
            'lat' => '38.411908',
            'state' => 'CA',
            'zip' => '95472',
            'prefix' => '',
            'long' => '-122.842232',
            'suffix' => 'N',
            'number' => '1005',
            'type' => 'Hwy',
            'city' => 'Sebastopol',
            'street' => 'Gravenstein'
          }
        ];

In this example, the geocoder found one and only one possible match, but in the case of ambiguous addresses, you can come up with multiple possible matches. For example, try geocoding this address:

./simplest_xmlrpc.pl "800 Oxford Ave., Los Angeles, CA"

The geocoder finds three possible addresses in different parts of the city: an Oxford Avenue North, South, and unknown. This often occurs when you try to identify a location from incomplete information, and it is also a potential trouble spot if you are geocoding a full database where you don't have additional context. Fortunately, each of the Oxford Avenues is in a different ZIP Code, and they can be further disambiguated by supplying the full address, including the directional. The important point is to remember that you can get multiple results, so plan accordingly. In the sample batch geocoding script in "Geocoding a List of Addresses" (later in this hack), addresses with multiple matches will be marked and specifically not geocoded, following the theory that bad data is worse than no data.
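One way to plan for multiple results is to accept a match only when it is unambiguous. The sketch below works on data shaped like the geocoder's return value (a reference to an array of hashes). The unique_match helper is a hypothetical name, and the three Oxford Avenue rows are abbreviated stand-ins that carry only the directional, not real geocoder output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the single match, or nothing when the geocoder
# found no match or more than one (bad data is worse than no data).
sub unique_match {
    my ($result) = @_;
    return unless ref $result eq 'ARRAY' && @$result == 1;
    return $result->[0];
}

# The single match from the Sebastopol example above.
my $one = [ { lat => '38.411908', long => '-122.842232',
              city => 'Sebastopol', state => 'CA' } ];

# Abbreviated stand-ins for the three Oxford Avenue candidates
# (directional only; not real geocoder output).
my $many = [ { street => 'Oxford', suffix => 'N' },
             { street => 'Oxford', suffix => 'S' },
             { street => 'Oxford', suffix => ''  } ];

for my $result ($one, $many) {
    if (my $match = unique_match($result)) {
        print "$match->{lat},$match->{long}\n";
    } else {
        print "ambiguous or not found: ", scalar(@$result), " candidates\n";
    }
}
```

The first result prints its coordinates; the second is reported as ambiguous and left for a human to resolve.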

The geocode call returns a reference to an array of hash refs, one array element for each matching address. Processing the returned value is trivial in Perl. The last line of the sample is:

print Dumper $result;

This uses the built-in Perl Data::Dumper module to print complex data structures. Replacing that line with the following code will walk through all the returned addresses and print out the city, state, ZIP, latitude, and longitude:

foreach my $row (@$result) {
        print $row->{city} . ',' . $row->{state} . ',';
        print $row->{zip} . ',' . $row->{lat} . ',' . $row->{long} . "\n";
}

The geocoder also returns the address that you passed to it in a cleaned-up form, split back into fields. So you can use the XML-RPC interface as a poor man's address parser.

Casey West is working on a Perl module to extract the address-splitting functionality of the geocoder and put it into its own module. As always, keep an eye on CPAN!
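As a rough illustration of the address-parser idea, the returned fields can be reassembled into a normalized address line. The sketch below uses the data structure printed earlier; the format_address helper and its field order are assumptions made for illustration, not part of the geocoder.us interface.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join the non-empty street fields, then add
# city, state, and ZIP.
sub format_address {
    my ($row) = @_;
    my $street = join ' ',
        grep { defined($_) && length($_) }
        @{$row}{qw(number prefix street type suffix)};
    return "$street, $row->{city}, $row->{state} $row->{zip}";
}

# The fields returned for O'Reilly's address in the example above.
my $row = {
    number => '1005', prefix => '', street => 'Gravenstein',
    type   => 'Hwy',  suffix => 'N',
    city   => 'Sebastopol', state => 'CA', zip => '95472',
    lat    => '38.411908',  long  => '-122.842232',
};

print format_address($row), "\n";
```

This prints 1005 Gravenstein Hwy N, Sebastopol, CA 95472, with "Highway North" reduced to the abbreviations the geocoder returned.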

Geocoding with the RDF/REST Interface

REST stands for "Representational State Transfer" and is a way to treat web services requests as parameters to standard GET and POST requests. This means that you enter a normal human-readable URL. To make a RESTful request to the geocoder, you need to create a URI-safe version of the address. The address needs to be converted to a form that can appear on the address line of your browser (which means replacing spaces with + signs and using special escape sequences). Here is an example of a RESTful call. The advantage over the XML-RPC version is that you can paste this directly into your browser, so there is no need for XML parsing libraries:

http://rpc.geocoder.us/service/rest?address=1005+Gravenstein+Hwy+N+sebastopol+ca

This returns an RDF/XML document that includes the results of your request, which will be displayed in different ways depending on your browser. Apple's Safari browser displays the full RDF/XML document, as shown in Figure 7-3.

Figure 7-3. The results of a REST-ful RDF request shown in Safari

Older or non-RDF-aware browsers will ignore the tags that they don't recognize (such as <geo:Point>), leaving just the coordinates. Opera reveals the bare coordinates:

-122.842232 38.411908
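To see what making an address URI-safe involves, here is a minimal sketch that uses only core Perl. The form_encode helper is a hypothetical name, not a geocoder.us or CPAN function; it percent-encodes everything except letters, digits, and a few safe punctuation characters, then turns spaces into + signs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (assumes ASCII input): percent-encode every character
# that is not a letter, digit, space, or one of - _ . ~ and then replace
# each space with a + sign.
sub form_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9 \-_.~])/sprintf '%%%02X', ord $1/ge;
    $text =~ tr/ /+/;
    return $text;
}

my $address = '950 Hudson Street, san francisco, ca';
print 'http://geocoder.us/demo.cgi?address=', form_encode($address), "\n";
```

This prints the same demo URL used in [Hack #79]. The uri_escape function from URI::Escape takes a slightly different route and encodes each space as %20 rather than +.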

Here's an example of a simple program to script the REST interface with Perl:

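#!/usr/bin/perl
   
use strict;
use warnings;
use LWP::Simple;
use URI::Escape;
   
my $where = shift @ARGV
    or die "Usage: $0 \"111 Main St, Anytown, KS\"\n";
   
# make the address safe to embed in a URL
my $addr = uri_escape($where);
my $rdf = get "http://rpc.geocoder.us/service/rest?address=$addr";
die "No response from the geocoder\n" unless defined $rdf;
print $rdf;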

Call the program by putting an address on the command line:

./simplest_rest.pl "1005 Gravenstein Hwy North, Sebastopol, CA"

You can also substitute + for spaces and skip the quotes:

./simplest_rest.pl 1005+Gravenstein+Hwy+North+Sebastopol+CA

The full RDF document as shown in Figure 7-3 is returned. This can be parsed with the Perl module RDF::Simple::Parser.
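If all you want is the coordinates, a quick pattern match may be enough. The sketch below is not a substitute for a real RDF parser, and the sample document is an assumed shape: only the <geo:Point> element is confirmed by the text above, while the geo:lat and geo:long elements are taken from the W3C geo vocabulary, so the real response may differ. The extract_points helper is a hypothetical name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMED sample response: only <geo:Point> is confirmed by the text above;
# the geo:long and geo:lat elements come from the W3C geo vocabulary.
my $rdf = <<'END_RDF';
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <geo:Point>
    <geo:long>-122.842232</geo:long>
    <geo:lat>38.411908</geo:lat>
  </geo:Point>
</rdf:RDF>
END_RDF

# Hypothetical helper: collect a { lat, long } hash for each geo:Point.
sub extract_points {
    my ($xml) = @_;
    my @points;
    while ($xml =~ m{<geo:Point\b[^>]*>(.*?)</geo:Point>}gs) {
        my $body = $1;
        my ($lat)  = $body =~ m{<geo:lat>\s*([-+\d.]+)\s*</geo:lat>};
        my ($long) = $body =~ m{<geo:long>\s*([-+\d.]+)\s*</geo:long>};
        push @points, { lat => $lat, long => $long }
            if defined $lat && defined $long;
    }
    return @points;
}

for my $point (extract_points($rdf)) {
    print "$point->{lat},$point->{long}\n";
}
```

Returning a list of points rather than a single pair leaves room for ambiguous addresses that produce more than one match.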

Geocoding a List of Addresses

The Monterey Express is a dive boat in Monterey, California. A list of dive-related resources for the boat is maintained at http://www.montereyexpress.com/DiveLinks.htm. A real-world application would be to geocode these addresses in order to create a "find your closest dive resource" application. This sample Perl code fetches the list, does a simplistic (and demonstrably wrong in some cases) parse to get the addresses, geocodes the addresses, and returns the results:

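#!/usr/bin/perl
   
# divecode.pl - Geocode the Monterey Express dive resources list
   
use strict;
use warnings;
use LWP::Simple;
use XMLRPC::Lite;
   
# fetch the list from the Web...
my $lines = get "http://www.montereyexpress.com/DiveLinks.htm";
   
# ...or read a saved copy from STDIN instead
#my $lines = do { local $/; <> };
die "Could not read the dive resources list\n" unless defined $lines;
   
my $geocoder = XMLRPC::Lite
  -> proxy( 'http://rpc.geocoder.us/service/xmlrpc' );
   
my $shop_name = '';
# consume the page one <br>-terminated line at a time
while ($lines =~ s/(.+)<br>//m) {
        my $st = $1;
        chomp $st;
        # is this the address?
        my ($shop_address) = ($st =~ /^\s*Address:\s*(.+)/);
        if ($shop_address) {
                $shop_address =~ s/<br>//;
                my $result = $geocoder->geocode( $shop_address )->result;
                # geocode only when there is exactly one match
                if (ref $result eq 'ARRAY' and @$result == 1
                    and defined $result->[0]->{lat}) {
                        print join(',', $shop_name, $shop_address,
                            $result->[0]->{lat}, $result->[0]->{long}), "\n";
                } else {
                        # none or several: bad data is worse than no data
                        print "$shop_name,$shop_address,NOT GEOCODED\n";
                }
        }
        # just assume that the shop name is the line before the address
        $shop_name = $st;
}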

To run this hack, just execute the script, which produces the following results:

Rich-Gibson-iBook:~/wa/geohacks/geocode_web_service rich$ ./divecode.pl 
Aquatic Dreams Scuba Center,1212 Kansas Avenue, Modesto, CA 95351,37.647585,-121.028297
Bamboo Reef (Monterey),614 Lighthouse Avenue, Monterey, CA 93940,36.613716,-121.901494
Bamboo Reef,584 4th Street, San Francisco, CA 94107,37.778529,-122.396631
...

Out of 26 "legitimate" addresses, all but four were successfully geocoded. The remaining four didn't work because the addresses were broken across extra lines in the original document, and I didn't write a very good parser. Better results could be obtained by using the HTML::Parser module and spending a bit more time studying this particular data set, but the goal was to illustrate how easy it can be to get 85% success with such a simplistic approach (and a few dozen lines of Perl).

Happy geocoding!

Setting up Your Own Geocoding Server

Do you have too many addresses to geocode, or want the control of running your own server? You can set up your own geocoder with the Geocoder code.

You need to install the Perl module Geo::Coder::US from CPAN, follow its instructions to download the relevant TIGER data for the areas you wish to cover, and then refer to the Geo::Coder::US documentation on how to load the database.

