Writing Google Desktop Search Plugins

by Jeremy Jones
06/02/2005

The Google Desktop Search (GDS) engine is a tool created by Google that indexes the files on your (Microsoft Windows-based) computer and then lets you search them. It indexes files written to disk (text files, web pages, media files, etc.), email, instant messages, and web pages and media files visited on the Web. GDS creates a deskbar in the taskbar that enables quick searching on criteria you specify. It returns search results by directing your default browser to a web server running on your machine. The browser-based search interface has an obviously Googlish look and feel. Interestingly, if you have GDS installed, you will have a Desktop search option (in addition to Web, Images, Groups, News, Froogle, and Local) when you visit Google. When you perform a search on the main Google page, GDS matches for that search may also show up as the first search result, in the form of "<Some number of> results stored on your computer".

As cool as this is, an even cooler aspect of GDS is that it is an extensible framework. Google has released an SDK so developers can write plugins for GDS. One such plugin is Kongulo, a web spider. Kongulo provides a command-line interface to crawl, starting at a specified URL, and index the resources it finds there within GDS. Command-line options include depth, URL match, loop, sleep time between loops, and passwords. Kongulo can be a useful tool for indexing intranets or private wikis ... or to see an example of a good plugin written for GDS.

How does a plugin tie into GDS? Through COM; as I mentioned above, GDS is an application for MS Windows systems. On Friday, May 27, 2005, Google released the source code for Kongulo. Here is the meat of how Kongulo pushes spidered web pages to GDS. (The pieces of the code that pertain specifically to spidering are interesting, but this article won't detail that aspect of Kongulo.)

First, Kongulo creates an event factory object attached to the Crawler object, like this:

self.event_factory = win32com.client.Dispatch('GoogleDesktopSearch.EventFactory')

Note that Kongulo uses the win32com library, so if you plan on running the source code, install the Win32 extensions for Python or use the ActiveState Python distribution.

Next, every time Kongulo wants GDS to index a page, it has to create an event from the event factory like this:

event = self.event_factory.CreateEvent(_GUID, 'Google.Desktop.WebPage')

The first argument the crawler passes into CreateEvent is the GUID that Kongulo registers for itself the first time it runs. The second argument is a string containing the fully qualified name of the event type. Kongulo only uses Google.Desktop.WebPage, but other options include Google.Desktop.Indexable (the parent of the indexable types that follow), Google.Desktop.Email, Google.Desktop.IM, Google.Desktop.File, and Google.Desktop.MediaFile.

The next steps entail adding properties. The event object has an AddProperty method that takes two arguments: a property name and a property value. The crawler adds the following four properties to all pages it finds:

event.AddProperty('format', doctype)
event.AddProperty('content', content)
event.AddProperty('uri', url)
event.AddProperty('last_modified_time', pywintypes.Time(time.time() + time.timezone))

doctype is the document type, pulled from the HTTP headers. Kongulo will only index documents of type text/html or text/plain. content is the body of the web page. uri is the web location of the resource. last_modified_time is actually the current time rather than the page's modification time, but there is a note in the source code to use the Last-Modified HTTP header instead.
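The property step can be sketched as a small helper. This is a minimal illustration, not Kongulo's actual code: the names page_properties and INDEXABLE_TYPES are hypothetical, stripping a "; charset=..." suffix is an assumption about how the header value might arrive, and the timestamp is returned as a plain float so the sketch runs without the Win32 extensions.

```python
import time

# Document types the article says Kongulo indexes.
INDEXABLE_TYPES = ('text/html', 'text/plain')


def page_properties(content_type, content, url, now=None):
    """Return the four (name, value) pairs for a page, or None to skip it.

    Hypothetical helper for illustration; not part of Kongulo or the GDS SDK.
    """
    # Assumption: the header may carry parameters, e.g. "text/html; charset=utf-8".
    doctype = content_type.split(';')[0].strip().lower()
    if doctype not in INDEXABLE_TYPES:
        return None
    if now is None:
        now = time.time()
    return [
        ('format', doctype),
        ('content', content),
        ('uri', url),
        # Same arithmetic as the article's snippet; wrap the result in
        # pywintypes.Time(...) before handing it to a real GDS event.
        ('last_modified_time', now + time.timezone),
    ]
```

A caller would loop over the returned pairs and pass each one to event.AddProperty(name, value), skipping the page entirely when the helper returns None.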

The crawler adds the following property for HTML pages that contain a title:

event.AddProperty('title', title)

Interestingly, Kongulo uses regular expressions to find titles, frames, and links, as opposed to using an HTML parser. The Kongulo team felt this would make its processing of web pages more forgiving.
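To give a feel for the regular-expression approach, here is a minimal sketch of pulling a title out of a page. The pattern and the find_title name are assumptions made for illustration; Kongulo's own expressions may well differ.

```python
import re

# Illustrative pattern only, not the one Kongulo uses. It tolerates
# attributes on the tag, mixed case, and titles that span several lines.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def find_title(html):
    """Return the page title with whitespace collapsed, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    title = ' '.join(match.group(1).split())
    return title or None
```

A pattern like this will still find a title in a page with unclosed tags elsewhere, which is the kind of leniency a strict parser might not offer.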

The final step is to send the page to GDS, like this:

event.Send(0x01)

Send expects a bitwise OR of the following values:

EventFlagIndexable   = 0x00000001
EventFlagHistorical  = 0x00000010

where EventFlagIndexable marks an event that GDS should index, and EventFlagHistorical marks a historical event (as opposed to an event that is happening in real time). The 0x01 that Kongulo passes to Send is therefore EventFlagIndexable on its own. The Kongulo source code indicates that if the crawler passes in the historical flag, GDS will not process the event until the user's system becomes idle.
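Combining the flags is ordinary bitwise arithmetic. The sketch below uses the two values listed above; the constant spellings and the send_flags helper are hypothetical names chosen for this example.

```python
# Values as listed in the article; the Python names are illustrative.
EVENT_FLAG_INDEXABLE = 0x00000001
EVENT_FLAG_HISTORICAL = 0x00000010


def send_flags(historical=False):
    """Build the integer to pass to event.Send()."""
    flags = EVENT_FLAG_INDEXABLE
    if historical:
        # OR in the historical bit without disturbing the indexable bit.
        flags |= EVENT_FLAG_HISTORICAL
    return flags
```

With these values, send_flags() gives the 0x01 that Kongulo passes, and send_flags(historical=True) gives 0x11.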

At this point, GDS has the web page and it is available for searching. That's all there is to it.
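Putting the steps together, the whole exchange might look like the sketch below. It is an illustration under stated assumptions, not Kongulo's code: index_page, the Recording classes, and PLUGIN_GUID are hypothetical, and the recording stand-ins exist only so the flow can be exercised on a machine without GDS or the Win32 extensions.

```python
import time

WEB_PAGE_SCHEMA = 'Google.Desktop.WebPage'
EVENT_FLAG_INDEXABLE = 0x00000001

# Hypothetical placeholder; a real plugin uses the GUID it registered with GDS.
PLUGIN_GUID = '{00000000-0000-0000-0000-000000000000}'


class RecordingEvent:
    """Stand-in for a GDS event that records calls instead of indexing."""

    def __init__(self, guid, schema):
        self.guid = guid
        self.schema = schema
        self.properties = {}
        self.sent_flags = None

    def AddProperty(self, name, value):
        self.properties[name] = value

    def Send(self, flags):
        self.sent_flags = flags


class RecordingEventFactory:
    """Stand-in for the GoogleDesktopSearch.EventFactory COM object."""

    def CreateEvent(self, guid, schema):
        return RecordingEvent(guid, schema)


def index_page(event_factory, guid, url, doctype, content, title=None,
               wrap_time=lambda seconds: seconds):
    """Push one crawled page to GDS, following the steps in the article."""
    event = event_factory.CreateEvent(guid, WEB_PAGE_SCHEMA)
    event.AddProperty('format', doctype)
    event.AddProperty('content', content)
    event.AddProperty('uri', url)
    # On Windows, pass wrap_time=pywintypes.Time to match the article's call.
    event.AddProperty('last_modified_time',
                      wrap_time(time.time() + time.timezone))
    if title:
        event.AddProperty('title', title)
    event.Send(EVENT_FLAG_INDEXABLE)
    return event


# On a Windows machine with GDS installed, the factory would instead be
# win32com.client.Dispatch('GoogleDesktopSearch.EventFactory').
event = index_page(RecordingEventFactory(), PLUGIN_GUID,
                   'http://example.com/', 'text/html',
                   '<html>...</html>', title='Example')
```

Passing the factory in as an argument keeps the COM dependency at the edge of the program, so the same function can be driven by the real event factory or by a stand-in.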

The GDS team has done an excellent job of providing a great tool that is easy to extend. The more I play with GDS, the more it impresses me. Of course, I would play with it more if it ran on Linux (hint, hint). Likewise, the Kongulo team has done an excellent job of providing a useful plugin to GDS, but more importantly, of providing clean, readable source code (being written in Python doesn't hurt its readability) to serve as an example of how to write a plugin for GDS. While there are plenty of plugins already available for GDS, this ease of creating a plugin makes me expect many more in the future.

Jeremy Jones is a software engineer who works for Predictix. His weapon of choice is Python.
