Web Client Programming in Python
by Dave Warner
Web client programming is a powerful technique for querying the Web. A web client is any program that retrieves data from a web server using the Hypertext Transfer Protocol (the http in your URLs). A web browser is a client; so are web crawlers, programs that traverse the Web automatically to gather information. You can also use web clients to take advantage of services offered by others on the Web and to add dynamic features to your own web site.
Web client programming belongs in any developer's toolbox. Perl aficionados have employed it for years. In Python, the process reaches even higher levels of convenience and flexibility. Three modules provide most of the functionality you will need: HTTPLIB, URLLIB, and a newer addition, XMLRPCLIB. In true Pythonesque fashion, each module builds upon its predecessor, providing a solid, well-designed base for your applications. We will cover the first two modules in this article, saving XMLRPCLIB for a later time.
For our examples, we will use Meerkat. If you are like me, you invest time tracking trends and developments in the open source community to give you a competitive edge. Meerkat is a tool that makes that task much easier. It is an open wire service that collects and collates an enormous amount of information on open source computing. Although its browser interface is flexible and customizable, using web client programming we can scan, extract, and even store this information off-line for later use. We will first access Meerkat using HTTPLIB interactively, and then move on to accessing Meerkat's Open API via URLLIB to create a customizable information-collecting tool.
HTTPLIB is a lightweight wrapper around the socket module. Of the three libraries I have mentioned, HTTPLIB provides the most control when accessing a web site. That control, however, comes at the cost of requiring more work to accomplish your task. HTTP is "stateless," so the server doesn't remember anything about your previous requests. You must construct a new HTTPLIB object to connect to the web site for each request. The requests form a conversation with the web server, mimicking a web browser. Let's connect to Meerkat using Rael Dornfest's Open API interactively and see what results we get. The conversation begins by building up a series of statements that first state what action you want to take, and then identify you to the web server:
>>> import httplib
>>> host = 'www.oreillynet.com'
>>> h = httplib.HTTP(host)
>>> h.putrequest('GET', '/meerkat/?_fl=minimal')
>>> h.putheader('Host', host)
>>> h.putheader('User-agent', 'python-httplib')
>>> h.endheaders()
>>>
The GET request tells the server which page you want to receive. The Host header tells it the domain name you are querying. Modern servers using HTTP 1.1 can host several domains at the same address. If you don't tell the server which domain name you want, you may get a '302' redirection response as your return code. The User-agent header tells the server what kind of client you are so it knows what it can and cannot send you. This is all the information the web server needs to process your request. Next you ask for the reply:
>>> returncode, returnmsg, headers = h.getreply()
>>> if returncode == 200: #OK
...     f = h.getfile()
...     print f.read()
...
This will print out the current Meerkat page in the minimal flavor. The response headers and content are returned separately, which aids in both troubleshooting and parsing any returned data. If you want to see the response headers, use print headers.
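On a current Python 3 installation, httplib lives on as http.client, and the old HTTP class with its getreply() and getfile() methods is gone. The sketch below replays the same conversation with HTTPConnection and getresponse() instead. To stay runnable without touching the network, it talks to a small local stand-in server rather than Meerkat; DemoHandler, fetch, and the User-Agent string are names invented for this illustration and are not part of Meerkat or of the session above.

```python
# Python 3 sketch: http.client is the modern name for httplib.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the Meerkat server; echoes what it was asked."""

    def do_GET(self):
        body = ("path=%s\nhost=%s\nagent=%s\n" % (
            self.path,
            self.headers.get("Host"),
            self.headers.get("User-Agent"),
        )).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(host, port, path):
    """One request/response conversation, built up step by step."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        # skip_host=True stops http.client from adding its own Host header,
        # so the explicit putheader() call below is the only one sent.
        conn.putrequest("GET", path, skip_host=True)
        conn.putheader("Host", "%s:%d" % (host, port))
        conn.putheader("User-Agent", "python-http-client-demo")
        conn.endheaders()
        response = conn.getresponse()
        return (response.status, response.reason,
                response.getheaders(), response.read())
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    status, reason, headers, body = fetch(
        "127.0.0.1", server.server_address[1], "/meerkat/?_fl=minimal")
finally:
    server.shutdown()
    server.server_close()

print(status, reason)
for name, value in headers:
    print("%s: %s" % (name, value))
print(body.decode("ascii"))
```

The status, the headers, and the body still arrive as separate pieces, which is the same split that getreply() and getfile() give you in the older interface.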
HTTPLIB hides the mechanics of socket programming, and its use of a file object for buffering lets you use a familiar approach to manipulating the data. It is, however, best suited as a building block for more powerful web client applications, or for interactive conversations with a troubled web site. To aid in both areas, HTTPLIB has a useful debug capability. You access it by calling the method h.set_debuglevel(1) at any point after object initialization (the line h = httplib.HTTP(host) in our example). With the debug level set to 1, the module will echo requests and the results of any calls to getreply() to the screen.
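The same switch survives in Python 3's http.client as HTTPConnection.set_debuglevel(). Here is a minimal sketch of it at work, assuming Python 3 and, once again, a throwaway local server in place of a real web site; QuietHandler and the User-Agent string are invented for the example. Because the module writes its trace with print(), the sketch redirects standard output to capture the echoed request and reply.

```python
import contextlib
import http.client
import io
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class QuietHandler(BaseHTTPRequestHandler):
    """Hypothetical local server so the sketch runs without network access."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the server's own logging


server = HTTPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

trace = io.StringIO()
try:
    conn = http.client.HTTPConnection(
        "127.0.0.1", server.server_address[1], timeout=5)
    conn.set_debuglevel(1)  # can be called any time after construction
    # Capture what the module prints while the conversation takes place.
    with contextlib.redirect_stdout(trace):
        conn.putrequest("GET", "/")
        conn.putheader("User-Agent", "python-http-client-demo")
        conn.endheaders()
        response = conn.getresponse()
        body = response.read()
    conn.close()
finally:
    server.shutdown()
    server.server_close()

print(trace.getvalue())
```

In an interactive session you would skip the redirection and simply watch the "send:", "reply:", and "header:" lines scroll past as you type.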
The interactive nature of Python makes analyzing web sites using HTTPLIB a joy. Familiarize yourself with this module and you will have a powerful, flexible tool for diagnosing web site problems. Take time to look at the source for HTTPLIB as well. With fewer than 200 lines of code, HTTPLIB is a quick and easy introduction to socket programming using Python.
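To get a feel for what the wrapper saves you, here is a rough sketch of a GET request made directly on a socket, with the reply read through a buffered file object. It is a simplification written for Python 3, not an excerpt from the HTTPLIB source, and it again assumes a small local server so that it runs without network access; raw_get and HelloHandler are hypothetical names.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical local server used only to give the client something to query."""

    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the server's own logging


def raw_get(host, port, path):
    """Do by hand roughly what an HTTP client library does for you."""
    request = (
        "GET %s HTTP/1.0\r\n"
        "Host: %s:%d\r\n"
        "User-Agent: raw-socket-demo\r\n"
        "\r\n" % (path, host, port)
    )
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request.encode("ascii"))
        # Wrap the socket in a buffered file object to read the reply by line.
        with sock.makefile("rb") as reply:
            status_line = reply.readline().decode("iso-8859-1").rstrip("\r\n")
            headers = {}
            for line in iter(reply.readline, b""):
                if line in (b"\r\n", b"\n"):
                    break  # a blank line separates headers from content
                name, _, value = line.decode("iso-8859-1").partition(":")
                headers[name.strip().lower()] = value.strip()
            # With HTTP/1.0 the server closes the connection when it is done,
            # so reading to end-of-file yields the whole body.
            body = reply.read()
    return status_line, headers, body


server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    status_line, headers, body = raw_get(
        "127.0.0.1", server.server_address[1], "/")
finally:
    server.shutdown()
    server.server_close()

print(status_line)
print(headers)
print(body)
```

Everything inside raw_get, from formatting the request line to splitting headers from content, is bookkeeping that the library normally handles on your behalf.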