 Published on The O'Reilly Network (http://www.oreillynet.com/)
 http://www.oreillynet.com/pub/wlg/3210

A handy Python script for parsing the O'Reilly book catalogue.

by Jacek Artymiak
May 20, 2003

I'm working on an online bookstore project, which I'd like to automate as much as possible. Things like book insertion and deletion ought to happen automatically, with only an occasional need to mess with code or HTML. My choice of language for this project is Python, with a touch of urllib and re. Since one of my sources of book titles and ISBNs is the O'Reilly book catalogue, I thought I'd share this little script with other ORA fans. It fetches the catalogue index page and prints each book as title:ISBN:price, with the hyphens stripped from the ISBN.

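#!/usr/bin/env python3

import re
import sys
import urllib.request

CATALOG_URL = 'http://www.oreilly.com/catalog/prdindex.html'
LINK_PREFIX = 'http://www.oreilly.com/catalog/'
HEADER_END = '<b>Examples</b></td>'


def next_field(text, end_tag):
    """Return (field, rest): the text between the next '">' and end_tag.

    Returns None if either delimiter is missing.
    """
    start = text.find('">')
    if start == -1:
        return None
    start += 2
    end = text.find(end_tag, start)
    if end == -1:
        return None
    return text[start:end], text[end:]


def parse_books(page):
    """Yield (title, isbn, price) for every book row found in the page."""
    # Flatten the markup so the string searches below see one long line.
    page = re.sub(r'[\r\n]', '', page)
    page = page.replace('> ', '>')
    page = page.replace('  ', ' ')

    # Skip everything up to the end of the table's header row.
    header = page.find(HEADER_END)
    if header != -1:
        page = page[header + len(HEADER_END):]

    while True:
        # Find the next table row and the catalogue link inside it.
        row = page.find('<tr ')
        if row == -1:
            return
        link = page.find(LINK_PREFIX, row)
        if link == -1:
            return
        page = page[link:]

        # The link text is the title; the next two cells hold ISBN and price.
        fields = []
        for end_tag in ('</a>', '</td>', '</td>'):
            found = next_field(page, end_tag)
            if found is None:
                return
            value, page = found
            fields.append(value)

        title, isbn, price = fields
        yield title, isbn.replace('-', ''), price


try:
    with urllib.request.urlopen(CATALOG_URL) as response:
        # Assumes a single-byte encoding; latin-1 never fails to decode.
        page = response.read().decode('latin-1')
except OSError as err:
    sys.exit('I/O error: %s' % err)

for title, isbn, price in parse_books(page):
    print(title + ':' + isbn + ':' + price)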
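
To see what the slicing in the script is doing without fetching the live page, the same steps can be run against a single table row. This is only a sketch: the row below is made up rather than copied from the real catalogue page, and parse_row is a hypothetical helper name. It assumes nothing beyond the layout the script itself relies on, namely a catalogue link whose text is the title, followed by two cells whose opening tags end in `">` and which hold the ISBN and the price.

```python
# Hypothetical sample row; the real page markup may differ in its details.
SAMPLE_ROW = (
    '<tr bgcolor="#ffffff">'
    '<td valign="top"><a href="http://www.oreilly.com/catalog/samplebook/">'
    'Sample Book: A Made-Up Title</a></td>'
    '<td valign="top">0-000-00000-0</td>'
    '<td valign="top">$29.95</td>'
    '</tr>'
)


def parse_row(row):
    """Apply the script's slicing steps to one row; return (title, isbn, price)."""
    # Jump to the catalogue link, then past the '">' that closes the <a> tag.
    row = row[row.find("http://www.oreilly.com/catalog/"):]
    row = row[row.find('">') + 2:]
    title = row[:row.find("</a>")]

    # The next '">' closes the opening tag of the ISBN cell.
    row = row[row.find('">') + 2:]
    isbn = row[:row.find("</td>")].replace("-", "")

    # The one after that closes the opening tag of the price cell.
    row = row[row.find('">') + 2:]
    price = row[:row.find("</td>")]

    return title, isbn, price


title, isbn, price = parse_row(SAMPLE_ROW)
record = title + ":" + isbn + ":" + price
print(record)

# A title may itself contain ":", so split the record from the right.
fields = record.rsplit(":", 2)
print(fields)
```

Splitting from the right matters when the output is read back in, for example by whatever does the book insertion: the ISBN and price never contain a colon, but a title can.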

Jacek Artymiak started his adventure with computers in 1986 with a Sinclair ZX Spectrum. He's been using various commercial and Open Source Unix systems since 1991. Today, Jacek runs devGuide.net, writes and teaches about Open Source software and security, and tries to make things happen.
