Querying Web Applications

Mining the Web with REBOL

07/21/2000

This is the second column in a series exploring methods of retrieving data from back-end applications. In this column, Piroz Mohseni introduces REBOL and tells why he likes its simple power to work with the Web interface.

In my first column, I described the Web as an enormous library of information and proposed the need for methodologies that allow applications to mine and query the Web. The Web should be considered another interface for applications. Just as applications have a COM interface, a JDBC interface, or an e-mail interface, they can have an HTTP-based interface. These interfaces allow a program to receive and send information. In a world of disparate applications, such interfaces provide a common ground for inter-application communication.

If you accept the above as the essence of an application interface, then the Web qualifies as an interface too. Each application is uniquely identified by a URL. The application can accept input via the POST or GET methods (part of the HTTP spec) and send its output in any format (although HTML seems to be the most common format these days).
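As a rough sketch of the two input methods, the REBOL lines below send the same input once with GET and once with POST. The address and the field name q are hypothetical placeholders, not a real service, and the /custom refinement with a post block is how I understand REBOL/Core 2.x to send a POST body; check it against the version you have installed.

```rebol
REBOL []

; Hypothetical endpoint and field name, used only for illustration.
base: http://www.example.com/cgi-bin/search

; GET: the input travels in the query string of the URL.
result: read join base "?q=rebol"

; POST: the input travels in the body of the request.
result: read/custom base [post "q=rebol"]

; Either way, the application's output comes back as a string.
print result
```

In both cases the application is addressed by its URL and the reply is plain text, which is what makes the Web usable as a program-to-program interface.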

Most programming languages support HTTP and, as a result, support the Web interface. Programming the Web interface is easier in some languages than in others. A relatively new language called REBOL offers some unique capabilities when it comes to using the Web interface.

Built-in network support

REBOL is a scripting language similar to Perl and Tcl. It has a very small footprint (less than 300 Kbytes), which makes it speedier than other popular scripting languages. REBOL is also a lot simpler than its sister languages. While that does translate to some loss of functionality, overall, I think REBOL competes very well.

More on REBOL

REBOL's home page at REBOL Technologies of Ukiah, California.

REBOL is Cool. Read Piroz's introduction to REBOL on CMPnet's WebTools site.

Rebol Might Be the Language for the Rest of Us. An introduction to REBOL from Web Review.

One aspect of REBOL that quickly caught my attention was its support for network protocols like HTTP, FTP, SMTP, and POP3 within the language. No additional library, extensions, or modules are necessary.

Furthermore, the simplicity of the language means that many operations that are traditionally different in other languages are grouped together. For example, the "read" function can take a filename, a URL, or a network socket connection as its argument.
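A minimal sketch of that uniformity follows. The file name and both addresses are placeholders I made up; the two network reads need a connection and a real server substituted for the example hosts.

```rebol
REBOL []

; Create a small local file so the first read has something to load.
; %notes.txt is a placeholder name.
write %notes.txt "first line^/second line^/"

text: read %notes.txt           ; file! argument: the whole file as one string
lines: read/lines %notes.txt    ; /lines refinement: a block, one string per line
print length? lines

; The same function with url! arguments (placeholder addresses).
page: read http://www.example.com/          ; fetched over HTTP
listing: read ftp://ftp.example.com/pub/    ; FTP directory listing
```

The argument's datatype (file! or url!) selects the transport, so the calling code looks the same whether the data is local or remote.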

To demonstrate the simplicity, consider the following script:

; A block holding three URLs
sites: [
    http://www.oreilly.com
    http://www.cnet.com
    http://www.amazon.com
]

; Print each URL whose page contains the word "Network"
foreach url sites [
    if find read url "Network" [print url]
]

Here we define a block containing three URLs. (Blocks are a common REBOL data structure for holding data.) Any data can be grouped into a list or treated as a single-item list, so a text file can be represented as a block in which each line of the file is one item. In our example, the block holds three URL items. We go through each item with a foreach loop: for each one, we read the URL (read url) and search the returned text for the word Network. If the word is found, we print the URL. In most languages that I know, fetching a page and searching it takes more than one line of code, but not in REBOL.
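One practical caveat with the loop above is that a single unreachable site stops the whole script. The variant below is a sketch of one way to guard against that, wrapping the read in try and testing the result with error?; the variable name page and the "skipped" message are my own additions.

```rebol
REBOL []

sites: [
    http://www.oreilly.com
    http://www.cnet.com
    http://www.amazon.com
]

print length? sites     ; number of items in the block
print first sites       ; the first item, a url! value

foreach url sites [
    ; try returns an error value if the download fails,
    ; so a bad site is reported and skipped instead of halting the loop.
    either error? try [page: read url] [
        print ["skipped" url]
    ] [
        if find page "Network" [print url]
    ]
]
```

The block itself is ordinary data, so length? and first work on it just as they would on any other series.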

Because of its simplicity and inherent awareness of network protocols, REBOL is a good tool for mining the Web. If I have to create a Web page that contains the latest headline from cnn.com, the latest stock prices for the Dow's top ten most active stocks, and the image showing the latest weather condition in New York, REBOL can be a valuable friend. All of the content mentioned above resides on non-Web systems, but Web interfaces for those systems have been created. Rather than trying to access and consolidate the non-Web systems, we could utilize their Web interfaces to extract the data.

Parsing engine

Accessing the Web interfaces of applications is one main task; the other is parsing the data after it is retrieved. REBOL offers a sophisticated parsing engine for that purpose. Consider the following three lines:

page: read http://www.foo.com
parse page [thru <title> copy mytitle to </title>]
print mytitle

The first line reads a URL and stores its content in a variable called page. The next line parses the content of page using a simple rule specified by a block. The rule tells REBOL to search for the string <title> and go past it (thru), then copy the text that follows into a variable called mytitle, continuing the copy until it reaches the string </title>. The effect is that the TITLE of the page is extracted.

The parse function accepts grammar rules that are written in a dialect of REBOL. Within this dialect, the grammar and vocabulary of REBOL are altered to make it similar in structure to the well-known BNF (Backus-Naur Form), which is commonly used to specify language grammars, network protocols, header formats, and so on.
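To give a feel for the BNF flavor of the dialect, here is a small sketch that names a rule and applies it repeatedly to collect every link on a page. The sample HTML is a literal string I wrote so that the script needs no network access, and the names link, target, and links are my own.

```rebol
REBOL []

; A literal string stands in for a downloaded page.
page: {<html><head><title>Example</title></head>
<body><a href="http://www.rebol.com">REBOL</a>
<a href="http://www.oreilly.com">O'Reilly</a></body></html>}

links: copy []

; A named rule, in the spirit of a BNF production:
;   link ::= ... 'href="' target '"'
; The code in parentheses runs each time the rule matches.
link: [thru {href="} copy target to {"} (append links target)]

; Apply the rule as often as it matches, then skip to the end of the input.
parse page [any link to end]

foreach item links [print item]
```

Because rules are just blocks bound to words, larger grammars can be built by referring to one rule from inside another, much as BNF productions refer to each other.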

This article has focused on two features of REBOL that are very useful for accessing Web interfaces of applications. These features are the relative ease of accessing network resources since the common protocols are built into the language, and the ability to parse the retrieved data (HTML, XML, text, etc.) with an extensible parsing engine.

You can learn more about REBOL by visiting www.rebol.com and downloading the engine. The site also contains example code and documentation. It should come as no surprise that most of the site is created via REBOL scripts.

Piroz Mohseni is president of Bita Technologies; his areas of interest include enterprise Java, XML, and e-commerce applications.
