Web Client Programming in Python
URLLIB provides a sophisticated interface to the functionality found in HTTPLIB. It is best used for getting at the data itself, rather than analyzing the web site. Here's the same interaction as above, using URLLIB (NOTE: we've had to break the last line into two for the sake of display, but don't use a line break in your script):
>>> import urllib
>>> u = urllib.urlopen('http://www.oreillynet.com/meerkat/?_fl=minimal')
That's all there is to it! With one line you've accessed Meerkat, obtained the data, and placed it in a temporary cache. To access the header information:
>>> print u.headers
And to view the entire file:
>>> print u.read()
But that's not all. In addition to HTTP, URLLIB can also access FTP, Gopher, and even local files in the same manner. The module also contains many utility functions, including those for parsing URLs, encoding strings into a URL-safe format, and providing progress indication during lengthy data transfers.
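For readers on Python 3, where these functions live in urllib.request and urllib.parse rather than in a single urllib module, here is a minimal sketch of the utilities just mentioned. The Meerkat URL and its _fl=minimal parameter come from the example above; the sample file name and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote, urlencode, urlsplit
from urllib.request import urlopen

# Build a query string from a dict; '_fl=minimal' is the flavor used above.
base = 'http://www.oreillynet.com/meerkat/'
url = base + '?' + urlencode({'_fl': 'minimal'})

# Split a URL into its components.
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Encode a string into a URL-safe form (spaces and '&' are escaped).
safe = quote('linux news & views')
print(safe)

# urlopen also reads local files through a file:// URL.
# The file below is a throwaway sample created just for this sketch.
sample = Path(tempfile.mkdtemp()) / 'sample.txt'
sample.write_text('hello from a local file\n')
with urlopen(sample.as_uri()) as u:
    local_headers = u.headers          # same header interface as an HTTP response
    local_body = u.read().decode('ascii')
print(local_headers['Content-Length'])
print(local_body)
```

Nothing in this sketch touches the network, so it can be tried without a connection. Passing the assembled url to urlopen would perform the same kind of request as the interactive session above.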
An example using Meerkat
Imagine that you have a group of clients who expect you to keep them informed by mail of the latest happenings regarding Linux. We can write a short script using URLLIB to get this information from Meerkat, build a listing of links, and store those links in a file for later transmission. The author of Meerkat, Rael Dornfest, has already done most of the work for us through the Meerkat API. All that is left is to construct the request, parse out the links, and store the results for later transmission.
Why do this rather than just have the users head to Meerkat? Providing this "passive" service allows the user to view the information at leisure, and provides them with the ability to selectively store the information in a familiar (e.g., e-mail) format. With the news waiting in their mailbox on Monday morning, they won't miss information that "scrolls by" over the weekend.
Since Meerkat's minimal flavor is limited to 15 stories, we will run the script every hour (e.g., as a Unix cron job or using NT's AT command) to lessen the chances of missing any data. Here is the URL we will use (NOTE: we've had to break this into two lines for the sake of our display. You can see the results of using this URL here).
This will pull in all Linux stories (profile=5) from the last hour, presenting the data in minimal flavor, with no descriptions, no category info, no channel info, and no date info. We will also use the regular expression module to help us extract the link information and redirect our output to a file object opened in append mode.
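The extraction and storage steps described above can be sketched roughly as follows. This is not the article's script: it is written for Python 3, and it assumes each story arrives as an HTML anchor of the form <a href="URL">title</a>. The sample markup, the example.com addresses, and the output file name are all hypothetical stand-ins for real Meerkat output.

```python
import re
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Assumption: each story appears as an HTML anchor, <a href="URL">title</a>.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)

def fetch(url):
    """Download a page and return its text (not called below; needs a network)."""
    with urlopen(url) as u:
        return u.read().decode('utf-8', errors='replace')

def extract_links(html):
    """Return (url, title) pairs for every anchor found in the page."""
    return [(href, title.strip()) for href, title in LINK_RE.findall(html)]

def append_links(path, links):
    """Append one 'title <url>' line per story to the given file."""
    with open(path, 'a', encoding='utf-8') as out:
        for href, title in links:
            out.write('%s <%s>\n' % (title, href))

# Hypothetical sample standing in for one hour of Meerkat output.
sample_html = '''
<a href="http://example.com/story1">Kernel release notes</a><br>
<a href="http://example.com/story2">New distribution announced</a><br>
'''
links = extract_links(sample_html)
outfile = Path(tempfile.mkdtemp()) / 'linux_links.txt'
append_links(outfile, links)   # first hourly run
append_links(outfile, links)   # a second run appends rather than overwrites
print(outfile.read_text(encoding='utf-8'))
```

Opening the file in append mode is what lets an hourly job accumulate stories across runs. A real script would call fetch with the Meerkat URL and would probably want to skip links it has already stored.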
View the complete script here.
We've only scratched the surface of these modules, and there are many other network programming modules available for Python that can be used for web client tasks. Web client programming is especially useful when processing large amounts of tabular data. Using web client programming in a recent Electronic Data Interchange project, we bypassed an unwieldy, proprietary software package. We took the updated price information we needed directly from the web and put it into our database. It saved us a lot of time and frustration.
Web client programming can also be useful for testing the structure and consistency of web sites. The most common procedure is checking for dead links. The standard Python distribution comes with a complete example of this, based upon URLLIB. Webchecker, along with a Tk-based front end, can be found under the tools subdirectory of the distribution. Another Python tool, Linbot, improves on this. It provides everything you need for web-site troubleshooting. As web sites become more and more complex, other web client applications will become necessary to ensure your web site's quality.
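To give a feel for what a dead-link check involves, here is a minimal Python 3 sketch. It is not how Webchecker or Linbot work internally; it simply tries to open each URL and records the ones that fail. The function name find_dead_links and the sample file names are our own, and local file:// URLs are used so the demonstration runs offline.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def find_dead_links(urls, timeout=10):
    """Return (url, reason) pairs for every URL that could not be opened."""
    dead = []
    for url in urls:
        try:
            with urlopen(url, timeout=timeout):
                pass                          # opened fine; nothing else to check
        except (OSError, ValueError) as err:
            # URLError and HTTPError are OSError subclasses;
            # ValueError covers malformed URLs.
            dead.append((url, str(err)))
    return dead

# Local file:// URLs keep the demonstration offline; http:// URLs work the same way.
folder = Path(tempfile.mkdtemp())
good = folder / 'index.html'
good.write_text('<html></html>')
missing = folder / 'gone.html'                # never created, so it is a dead link

dead = find_dead_links([good.as_uri(), missing.as_uri()])
for url, reason in dead:
    print('DEAD', url, reason)
```

A fuller checker would also extract the links from each page it opens and follow them, which is the part that tools like Webchecker handle for you.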
There is a pitfall to web client programming. Your programs are often susceptible to small changes in the way a page is formatted. How a site displays its data today may not be how it displays it tomorrow. When the format changes, so must your programs. This is one reason XML is so exciting: With data on the web tagged for meaning, format is less important. As XML standards evolve and become universally accepted, processing XML data will be even easier and more robust.
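The difference can be illustrated with a small Python 3 sketch. Everything in it is hypothetical: the two HTML renderings, the XML element names (stories, story, title, link), and the example.com address are invented to show the idea, not taken from Meerkat.

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical renderings of the same story: the site changed its HTML
# layout between them (tag case, an extra attribute, quoting, bold text).
html_today = '<a href="http://example.com/story1">Kernel release notes</a>'
html_tomorrow = ("<A class='story' HREF='http://example.com/story1'>"
                 "<b>Kernel release notes</b></A>")

# A pattern written against today's layout silently stops matching tomorrow.
pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
print(pattern.findall(html_today))
print(pattern.findall(html_tomorrow))

# Hypothetical XML for the same story, tagged for meaning instead of display.
xml_feed = '''<stories>
  <story>
    <title>Kernel release notes</title>
    <link>http://example.com/story1</link>
  </story>
</stories>'''
root = ET.fromstring(xml_feed)
stories = [(s.findtext('link'), s.findtext('title')) for s in root.iter('story')]
print(stories)
```

The XML version asks for the link and title by name, so it keeps working as long as those elements exist, however the site chooses to display them.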
There are also some limits to the tools we covered here. Although they are excellent for client-based tasks, the HTTPLIB and URLLIB modules shouldn't be used to build a production HTTP server, since they can only handle one request at a time. To provide asynchronous processing, Sam Rushing has built an impressive set of tools, including asyncore.py, which comes with the standard Python distribution. The most powerful example of this approach is ZOPE, an application server that includes a fast HTTP server built using Sam Rushing's Medusa engine.
In a future article I will show you how you can combine XML and web-client programming with the XMLRPCLIB module. You can use XML to squeeze even more functionality out of the Meerkat API.
Dave Warner is a senior programmer and DBA at Federal Data Corporation. He specializes in accessing relational databases with languages that begin with the letter P: Python, Perl, and PowerBuilder.